How good can you be at Codenames without knowing any words? Sun, 11 Aug 2024 00:00:00 +0000 <p>About eight years ago, I was playing a game of <a href="https://amzn.to/4cgpzow">Codenames</a> where the game state was such that our team would almost certainly lose if we didn't correctly guess all of our remaining words on our turn. From the given clue, we were unable to do this. Although the game is meant to be a word guessing game based on word clues, a teammate suggested that, based on the physical layout of the words that had been selected, most of the possibilities we were considering would result in patterns that were &quot;too weird&quot; and that we should pick the final word based on the location. This worked and we won.</p> <p><details closed> <summary>[Click to expand explanation of Codenames if you're not familiar with the game]</summary> Codenames is played in two teams. The game has a 5x5 grid of words, where each word is secretly owned by one of {blue team, red team, neutral, assassin}. Each team has a &quot;spymaster&quot; who knows the secret word &lt;-&gt; ownership mapping. The spymaster's job is to give single-word clues that allow their teammates to guess which words belong to their team without accidentally guessing words of the opposing team or the assassin. On each turn, the spymaster gives a clue and their teammates guess which words are associated with the clue. The game continues until one team's words have all been guessed or the assassin's word is guessed (immediate loss). There are some details that are omitted here for simplicity, but for the purposes of this post, this explanation should be close enough. If you want more of an explanation, <a href="https://www.youtube.com/watch?v=J8RWBooJivg">you can try this video</a>, or <a href="https://czechgames.com/files/rules/codenames-rules-en.pdf">the official rules</a>. </details></p> <p>Ever since then, I've wondered how good someone would be if all they did was memorize all 40 setup cards that come with the game. To simulate this, we'll build a bot that plays using only position information (you might also call this an AI, but since we'll discuss using an LLM/AI to write this bot, we'll use the term bot to refer to the automated Codenames-playing agent to make it easy to disambiguate).</p> <p>At the time, after the winning guess, we looked through the configuration cards to see if our teammate's idea of guessing based on shape was correct, and it was — they correctly determined the highest probability guess based on the possible physical configurations. Each card layout defines which words are your team's and which words belong to the other team and, presumably to limit the cost, the game only comes with 40 cards (160 configurations under rotation). Our teammate hadn't memorized the cards (which would've narrowed things down to only one possible configuration), but they'd played enough games to develop an intuition about what patterns/clusters might be common and uncommon, enabling them to come up with this side-channel attack against the game. 
For example, after playing enough games, you might realize that there's no card where a team has 5 words in a row or column, or that only the start player color ever has 4 in a row, and if this happens on an edge and it's blue, the 5th word must belong to the red team, or that there's no configuration with six connected blue words (and there is one with red, one with 2 in a row centered next to 4 in a row). Even if you don't consciously use this information, you'll probably develop a subconscious aversion to certain patterns that feel &quot;too weird&quot;.</p> <p>Coming back to the idea of building a bot that simulates someone who's spent a few days memorizing the 40 cards, below, there's a simple bot you can play against that simulates a team of such players. Normally, when playing, you'd provide clues and the team would guess words. But, in order to give you, the human, the largest possible advantage, we'll make the unrealistic assumption that you can, on demand, generate a clue that will get your team to select the exact squares that you'd like, which is simulated by letting you click on any tile that you'd like your team to guess.</p> <p>By default, you also get <abbr title="fewer if you decide to have your team guess incorrectly">three guesses a turn</abbr>, which would put you well above <a href="p95-skill/">99%-ile</a> among Codenames players I've seen. While good players can often get three or more correct moves a turn, averaging three correct moves and zero incorrect moves a turn would be unusually good in most groups. You can toggle the display of remaining matching boards on, but if you want to simulate what it's like to be a human player who hasn't memorized every board, you might want to try playing a few games with the display off.</p> <p>If, at any point, you finish a turn and it's the bot's turn and there's only one matching board possible, the bot correctly guesses every one of its words and wins. The bot would be much stronger if it ever guessed words before it could guess them all, either naively or to strategically reduce the search space, or if it even had a simple heuristic where it would randomly guess among the possible boards if it could deduce that you'd win on your next turn, but even this most naive &quot;board memorization&quot; bot possible has been able to beat every Codenames player I handed this to in most games where they didn't toggle the remaining matching boards on and use the same knowledge the bot has access to.</p> <p><div id="outer-container"><abbr title="Aside to Dan: the JS is loaded from a separate file due to a parsing issue with the old version of Hugo that's currently being used. Updating to a version where this is reasonably fixed would require 20+ fixes for this blog due to breaking changes Hugo has made over time. If you see this and Hugo is not in use anymore, you can probably inline the javascript here.">JS for the Codenames bot failed to load!</abbr></div><div id="game-container"></div> <script src="codenames.js"></script></p> <p>This very silly bot that doesn't guess until it can guess everything is much stronger than most Codenames players<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">1</a></sup>. In practice, any team with someone who decides to sit down and memorize the contents of the 40 initial state cards that come in the box will beat the other team in basically every game.</p>
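<p>To make this concrete, here's a minimal sketch of the filtering logic such a bot relies on. This isn't the actual codenames.js; the names (allConfigurations, Reveal, and so on) are made up for illustration, and a real implementation also has to track whose turn it is and which color the bot is playing:</p> <pre><code>// A minimal sketch of the "board memorization" idea, not the actual codenames.js.
// Keep every key-card configuration that's consistent with what's been revealed
// so far, and only guess once a single configuration remains.
type Owner = "blue" | "red" | "neutral" | "assassin";
type Board = Owner[]; // 25 entries, positions 0..24 in reading order

// 40 cards x 4 rotations = 160 candidate configurations; in a real bot these
// would be transcribed from the physical key cards.
declare const allConfigurations: Board[];

type Reveal = { position: number; owner: Owner };

function consistentWith(board: Board, reveals: Reveal[]): boolean {
  return reveals.every((r) => board[r.position] === r.owner);
}

function remainingBoards(reveals: Reveal[]): Board[] {
  return allConfigurations.filter((board) => consistentWith(board, reveals));
}

// The bot's entire "strategy": once the revealed tiles pin down a single
// configuration, it knows every one of its remaining words and wins on the spot.
function botKnowsTheBoard(reveals: Reveal[]): boolean {
  return remainingBoards(reveals).length === 1;
}</code></pre> <p>The stronger variants mentioned above would presumably build on the same remainingBoards set, e.g., by making speculative guesses chosen to shrink it as quickly as possible.</p>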
<p>Now that my curiosity about this question is satisfied, I think this is a minor issue and not really a problem for the game because word guessing games are generally not meant to be taken seriously and most of them end up being somewhat broken if people take them seriously or even if people just play them a lot and aren't trying to break the game. Relative to other word guessing games, and especially relative to popular ones, Codenames has a lot more replayability before players will start using side channel attacks, subconsciously or otherwise.</p> <p>What happens with games with a limited set of words, like Just One or Taboo, is that people end up accidentally memorizing the words and word associations for &quot;tricky&quot; words after a handful of plays. Codenames mitigates this issue by effectively requiring people to memorize a combinatorially large set of word associations instead of just a linear number of word associations. There's this issue we just discussed, which came up when we were twenty-ish games into playing Codenames and is likely to happen on a subconscious level even if people don't realize that board shapes are influencing their play, but this is relatively subtle compared to the issues that come up in other word guessing games. And, if anyone really cares about this issue, they can use a digital randomizer to set up their boards, although I've never personally played Codenames in a group that's serious enough about the game for anyone to care to do this.</p> <p><i> Thanks to Josh Bleecher Snyder, David Turner, Won Chun, Laurence Tratt, Heath Borders, Spencer Griffin, and Yossi Kreinin for comments/corrections/discussion.</i></p> <h3 id="appendix-writing-the-code-for-the-post">Appendix: writing the code for the post</h3> <p>I tried using two different AI assistants to write the code for this post, <a href="https://storytell.ai/">Storytell</a> and <a href="https://www.cursor.com/">Cursor</a>. I didn't use them the way a programmer would; instead, I used them the way a non-programmer would use them to write a program. Overall, I find AI assistants to be amazingly good at some tasks while being hilariously bad at other tasks. That was the case here as well.</p> <p>I basically asked them to write code and then ran it to see if it worked and would then tell the assistant what was wrong and have it re-write the code until it looked like it basically worked. Even using the assistants in this very naive way, where I deliberately avoided understanding the code and was only looking to get output that worked, I don't think it took too much longer to get working code than it would've taken if I just coded up the entire thing by hand with no assistance. I'm going to guess that it took about twice as long, but programmer estimates are notoriously inaccurate and for all I know it was a comparable amount of time. I have much less confidence that the code is correct and I'd probably have to take quite a bit more time to be as confident as I'd be if I'd written the code, but I still find it fairly impressive that you can just prompt these AI assistants and get code that basically works in not all that much more time than it would take a programmer to write the code. 
These tools are certainly much cheaper than hiring a programmer and, if you're using one of these tools as a programmer and not as a naive prompter, you'd get something working much more quickly because you can simply fix the bugs in one of the mostly correct versions instead of spending most of your time tweaking what you're asking for to get the AI to eliminate a bug that would be trivial for any programmer to debug and fix.</p> <p>I've seen a lot of programmers talk about how &quot;AI&quot; will never be able to replace programmers with reasons like &quot;to specify a program in enough detail that it does what you want, you're doing programming&quot;. If the user had to correctly specify how the program works up front, that would be a fairly strong criticism, but when the user can iterate, like we did here, this is a much weaker criticism. The user doesn't need to be a programmer to observe that an output is incorrect, at which point the user can ask the AI to correct the output, repeating this process until the output seems correct enough. The more a piece of software has strict performance or correctness constraints, the less well this kind of naive iteration works. Luckily for people wanting to use LLMs to generate code, most software that's in production today has fairly weak performance and correctness constraints. <a href="everything-is-broken/">People basically just accept that software has a ton of bugs and that it's normal to run into hundreds or thousands of software bugs in any given week</a> and that <a href="slow-device/">widely used software is frequently 100000x slower than it could be if it were highly optimized</a>.</p> <p>A moderately close analogy is the debate over whether or not AI could ever displace humans in support roles. Even as this was already happening, people would claim that this could never happen because AI makes bad mistakes that humans don't make. <a href="customer-service/">But as we previously noted, humans frequently make the same mistakes</a>. Moreover, even if AI support is much worse, as long as the price:performance ratio is good enough, a lot of companies will choose the worse, but cheaper, option. <a href="diseconomies-scale/">Tech companies have famously done this for consumer support of all kinds</a>, but we commonly see this for all sorts of companies, e.g., when you call support for any large company or even lots of local small businesses, it's fairly standard to get pushed into a phone tree or some kind of bad automated voice recognition that's a phone tree replacement. These are generally significantly worse than a minimum wage employee, but the cost is multiple orders of magnitude lower than having a minimum wage employee pick up every call and route you to the right department, so companies have chosen the phone tree.</p> <p>The relevant question isn't &quot;when will AI allow laypeople to create better software than programmers?&quot; but &quot;when will AI allow laypeople to create software that's as good as phone trees and crappy voice recognition are for customer support?&quot;. And, realistically, the software doesn't even have to be that good because programmers are more expensive than minimum wage support folks, but you can get access to these tools for $20/mo. I don't know how long it will be before AI can replace a competent programmer, but if the minimum bar is to be as good at programming as automated phone tree systems are at routing my calls, I think we should get there soon if we're not already there. 
And, as with customer support, this doesn't have to be zero sum. Not all of the money that's saved from phone trees is turned into profit — some goes into hiring support people who handle other tasks.</p> <p>BTW, one thing that I thought was a bit funny about my experience was that both platforms I tried, Storytell and Cursor, would frequently generate an incorrect result that could've been automatically checked, which they would then fix when I pointed out that the result was incorrect. Here's a typical sequence of interactions with one of these platforms:</p> <ul> <li>Me: please do X</li> <li>AI: [generates some typescript code and tests which fail to typecheck]</li> <li>Me: this code doesn't typecheck, can you fix this?</li> <li>AI: [generates some code and tests which fail when the tests are executed]</li> <li>Me: the tests fail with [copy+paste test failure] when run</li> <li>AI: [generates some code and tests which pass and also seem to work on some basic additional tests]</li> </ul> <p>Another funny interaction was that I'd get in a loop where there were a few different bugs and asking the AI to fix one would reintroduce the other bugs even when specifically asking the AI to not reintroduce those other bugs. Compared to anyone who's using these kinds of tools day in and day out, I have very little experience with them (I just mess with them occasionally to see how much they've progressed) and I'd expect someone with more prompting experience to be able to specify prompts that break out of these sorts of loops more quickly than I was able to.</p> <p>But, even so, it would be a nicer experience if one of these environments had access to an execution environment so it could actually automatically fix these kinds of issues (when they're fixable) and could tell the user that the output is known to be wrong when a bit of naive re-prompting with &quot;that was wrong and caused XYZ, please fix&quot; doesn't fix the issue.</p>
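<p>As a rough sketch of what that kind of mundane loop could look like (nothing here is a real API; generateCode stands in for whatever model you're calling and runChecks stands in for a sandbox that typechecks the code and runs the tests):</p> <pre><code>// Hypothetical sketch only: generateCode and runChecks are stand-ins, not real APIs.
async function generateCode(prompt: string) {
  return ""; // call the model of your choice here
}

async function runChecks(code: string) {
  return { ok: false, errors: "" }; // typecheck and run the test suite in a sandbox here
}

// The mundane part: run the checks the user would otherwise run by hand and
// feed failures straight back into the prompt, instead of making the user
// copy and paste error messages.
async function generateUntilChecksPass(task: string, maxAttempts: number) {
  let prompt = task;
  for (let attempt = 0; maxAttempts > attempt; attempt += 1) {
    const code = await generateCode(prompt);
    const result = await runChecks(code);
    if (result.ok) {
      return code;
    }
    prompt = task + "\n\nThe previous attempt failed with:\n" + result.errors + "\nPlease fix that without reintroducing earlier bugs.";
  }
  return null; // give up and escalate to a human after maxAttempts
}</code></pre> <p>In the interactions above, I was effectively playing the role of runChecks by hand.</p>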
<p>I asked Josh Bleecher Snyder, who's much more familiar with this space than I am (both technically as well as on the product side), why none of these tools do that and why almost none of the companies do training or fine tuning with such an environment, and his response was that almost everyone working in the space has bought into <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">The Bitter Lesson</a> and isn't working on these sorts of mundane improvements. The idea is that the kind of boring engineering work that would be necessary to set up an environment like the above will be obsoleted by some kind of fundamental advancement, so it's a waste of time to work on these kinds of things that give you incremental gains. Sam Altman has even advised founders of companies that are relying on OpenAI APIs to assume that there will be huge improvements and build companies that assume this because the companies that don't will get put out of business by the massive improvements that are coming soon. From discussions with founders and VCs in this space, almost everyone has taken this to heart.</p> <p>I haven't done any serious ML-related work for 11 years, so my opinion is worth about as much as any other layperson's, but if someone had made the contrarian bet on such a mundane system in the GPT-3 days, it seems like it would've been useful then and would still be useful with today's models, both for training/fine-tuning work as well as for generating better output for the user. But I guess the relevant question is, would it make sense to try to build such a mundane system today, which would be, for someone working in the space, a contrarian bet against progress? The big AI labs supposedly have a bunch of low-paid overseas contractors who label things, but if you want to label programming examples, per label, an environment that produces the canonical correct result is going to be cheaper than paying someone to try to label it unless you only want a tiny number of labels. At the level of a $1T or even $50B company, it seems like it should make sense to make the bet as a kind of portfolio move. If I want to start a startup and make a bet, then would it make sense? Maybe it's less obvious if you're putting all your eggs in one basket, but even then, perhaps there's a good case for it because almost the entire field is betting on something else? If the contrarian side is right, there's very little competition, which seems somewhat analogous to <a href="talent/">our previous discussion on contrarian hiring</a>.</p> <h3 id="appendix-the-spirit-of-the-game-vs-playing-to-win">Appendix: the spirit of the game vs. playing to win</h3> <p>Personally, when I run into a side-channel attack in a game or a game that's just totally busted if played to win, like Perfect Words, I think it makes sense to try to avoid &quot;attacking&quot; the game to the extent possible. I think this is sort of impossible to do perfectly in Codenames because people will form subconscious associations (I've noticed people guessing an extra word on the first turn just to mess around, which works more often than not — assuming they're not cheating, and I believe they're not cheating, the success rate strongly suggests the use of some kind of side-channel information. 
That doesn't necessarily have to be positional information from the cards; it could be something as simple as subconsciously noticing what the spymasters are intently looking at).</p> <p><a href="https://mastodon.social/@danluu/110544419353766175">Dave Sirlin says anyone who doesn't take advantage of any legal possibility to win is a sucker (he derogatorily calls such people &quot;scrubs&quot;)</a> (he says that you should use cheats to win, like using maphacks in FPS games, as long as tournament organizers don't ban the practice, and that tournaments should explicitly list what's banned, avoiding generic &quot;don't do bad stuff&quot; rules). I think people should play games however they find it fun and should find a group that likes playing games in the same way. If Dave finds it fun to memorize arbitrary info to win all of these games, he should do that. The reason I play like a scrub (as Dave Sirlin would put it) for the kinds of games discussed here is that the games are generally badly broken if played seriously and I don't personally find the ways in which they're broken to be fun. In some cases, like Perfect Words, the game is trivially broken and I find it boring to win a game that's trivially broken. In other cases, like Codenames, the game could be broken by spending a few hours memorizing some arbitrary information. To me, spending a few hours memorizing the 40 possible Codenames cards seems like an unfun and unproductive use of time, making it a completely pointless activity.</p> <h3 id="appendix-word-games-you-might-like">Appendix: word games you might like</h3> <p>If you like word guessing games, here are some possible recommendations in the same vein as this <a href="programming-books/">list of programming book recommendations</a> and <a href="programming-blogs/">this list of programming blog recommendations</a>, where the goal is to point out properties of things that people tend to like and dislike (as opposed to most reviews I see, which tend to be about whether or not something is &quot;good&quot; or &quot;bad&quot;). To limit the length of this list, this only contains word guessing games, which tend to be about the meaning of words, and doesn't include games that are about the mechanics of manipulating words rather than the meaning, such as <a href="https://amzn.to/3WCbrAc">Bananagrams</a>, <a href="https://amzn.to/4dbaQwA">Scrabble</a>, or <a href="https://en.wikipedia.org/wiki/Anagrams_(game)">Anagrams</a>, or games that are about the mapping between visual representations and words, such as <a href="https://amzn.to/4ddQqmy">Dixit</a> or <a href="https://amzn.to/3WXGOH0">Codenames: Pictures</a>.</p> <p>Also, for reasons of space, I won't discuss reasons people dislike games that apply to all or nearly all games in this list. For example, someone might dislike a game because it's a word game, but there's little point in noting this for every game. 
Similarly, many people choose games based on &quot;weight&quot; and dislike almost all word games because they feel &quot;light&quot; instead of &quot;heavy&quot;, but all of these games are considered fairly light, so there's no point in discussing this (but if you want a word game that's light and intense, in the list below, you might consider Montage or Decrypto, and among games not discussed in detail, Scrabble or Anagrams, the latter of which is the most brutal word game I've ever played by a very large margin).</p> <h4 id="taboo-https-amzn-to-3sf1aij"><a href="https://amzn.to/3SF1AIJ">Taboo</a></h4> <p>A word guessing game where you need to rapidly give clues to get your teammates to guess what word you have, where each word also comes with a list of 5 stop words you're not allowed to say while clueing the word.</p> <p>A fun, light game, with two issues that give it low replayability:</p> <ul> <li>Since each word is clued in a fully independent way, once your game group has run through the deck once or twice and everyone knows every word, the game becomes extremely easy; in the group I first played this in, I think this happened after we played it twice</li> <li>Even before that happens, when people realize that you can clue any word fairly easily by describing it in a slightly roundabout way, the game becomes fairly rote even before you accidentally remember the words just from playing too much</li> </ul> <p>When people dislike this game, it's often because they don't like how much time pressure there is in this rapid-fire game.</p> <h4 id="just-one-https-amzn-to-4dbgm8s"><a href="https://amzn.to/4dbgM8S">Just One</a></h4> <p>A word guessing game that's a bit like Taboo, in that you need to get your team to guess a word, but instead of having a static list of stop words for each word you want to clue, the stop words are dynamically generated by your team (everyone writes down one clue, and any clue that's been given more than once is stricken).</p> <p>That stop words are generated via interaction with your teammates gives this game much more replayability than Taboo. However, the limited word list ultimately runs into the same problem and my game group would recognize the words and have a good way to give clues for almost every word after maybe 20-40 plays.</p> <p>A quirk of the rules as written is that the game is really made for 5+ players and becomes very easy if you play with 4, but there's no reason you couldn't use the 5+ player rules when you have 4 players.</p> <p>A common complaint about this game is that the physical components are cheap and low quality considering the cost of the game ($30 MSRP vs. $20 for Codenames). Another complaint is that the words have wildly varying difficulties, some seemingly by accident. 
For example, the word &quot;grotto&quot; is included and quite hard to clue if someone hasn't seen it, seemingly because the game was developed in French, where grotto would be fairly straightforward to clue.</p> <h4 id="perfect-words-https-www-amazon-co-uk-tiki-editions-perfect-words-intergenerational-dp-b0chn8xp1f"><a href="https://www.amazon.co.uk/TIKI-Editions-Perfect-Words-Intergenerational/dp/B0CHN8XP1F">Perfect Words</a></h4> <p>A word guessing game where the team cooperatively constructs clues, with the goal of getting the entire team to agree on the word (which can be any arbitrary word as long as people agree) from each set of clues.</p> <p>The core game, trying to come up with a set of words that will generate agreement on what word they represent, makes for a nice complement to a game that's sort of the opposite, like Just One, but the rules as implemented seem badly flawed. It's as if the game designers don't play games and didn't have people who play games playtest it. The game is fairly trivial to break on your first or second play and you have to deliberately play the &quot;gamey&quot; part of the game badly to make the game interesting.</p> <h4 id="montage-https-amzn-to-46gkfcv"><a href="https://amzn.to/46GkfcV">Montage</a></h4> <p>A 2 on 2 word game (although you can play Codenames style if you want more players). On each team, players alternate fixed time periods of giving clues and guessing words. The current board state has some constraints on what letters must appear in certain positions of the word. The cluer needs to generate a clue which will get the guesser to guess their word, which has to fit within the constraints, but the clue can't be too obvious because if both opponents guess the word before the cluer's partner, the opponents win the word.</p> <p>Perhaps the hardest game on this list? Most new players I've seen fail to come up with a valid clue during their turn on their first attempt (a good player can probably clue at least 5 things successfully per turn, if their partner is able to catch the reasoning faster than the opponents). This is probably also the game that rewards having a large vocabulary the most of all the games on this list. It's also the only game on this list where the skill of being able to think about the letter composition of words is useful, a la Scrabble.</p> <p>As long as you're not playing with a regular partner and relying on &quot;secret&quot; agreements or shared knowledge, the direct adversarial nature of guessing gives this game very high replayability, at least as high as anything else on this list.</p> <p>Like Perfect Words, the core word game is fun if you're into that kind of thing, but the rules of the game that's designed around the core game don't seem to have been very well thought through and can easily be gamed. It's not as bad here as in Perfect Words, but you still have to avoid trying to win to make this game really work.</p> <p>When I've seen people dislike this game, it's usually because they find the game too hard, or they don't like losing — a small difference in skill results in a larger difference in outcomes than we see in other games in this list, so a new player should expect to lose very badly unless their opponents handicap themselves (which isn't built into the rules) or they have a facility for word games from having played other games. 
I don't play a lot of word games and I especially don't play a lot of &quot;serious&quot; word games like Scrabble or Anagrams, so I generally get shellacked when I play this, which is part of the appeal for me, but that's exactly what a lot of people don't like about the game.</p> <h4 id="word-blur">Word Blur</h4> <p>A word guessing game where the constraint is that you need to form clues from the 900 little word tiles that are spread on the table in front of you.</p> <p>I've only played it a few times because I don't know anyone local who's managed to snag a copy, but it seemed like it has at least as much replayability as any game on this list. The big downside of this game is that it's been out of print for over a decade and it's famously hard to get ahold of a copy, although it seems like it shouldn't be too difficult to make a clone.</p> <p>When people dislike this game, it often seems to be because they dislike the core gameplay mechanic of looking at a bunch of word tiles and using them to make a description, which some people find overwhelming.</p> <p>People who find Word Blur too much can try the knockoff, Word Slam, which is both easier and easier to get ahold of since it's not as much of a cult hit (though it also appears to be out of print). Word Slam only has 105 words and the words are sorted, which makes it feel much less chaotic.</p> <h4 id="codenames-https-amzn-to-4fvmju7"><a href="https://amzn.to/4fvMJu7">Codenames</a></h4> <p>Not much to add beyond what's in the post, except for common reasons that people don't like the game.</p> <p>A loud person can take over the game on each team, more so than in any other game on this list (except for Codenames: Duet). And although the game comes with a timer, it's rarely used (and the rules basically imply that you shouldn't use the timer), so another common complaint is that the game drags on forever when playing with people who take a long time to take turns, and unless you're the spymaster, there's not much useful to do when it's the other team's turn, causing the game to have long stretches of boring downtime.</p> <h4 id="codenames-duet-https-amzn-to-4chnv6j"><a href="https://amzn.to/4chnV6j">Codenames: Duet</a></h4> <p>Although this was designed to be the 2-player co-op version of Codenames, I've only ever played this with more than two players (usually 4-5), which works fine as long as you don't mind that discussions have to be done in a semi-secret way.</p> <p>In terms of replayability, Codenames: Duet sits in roughly the same space as Codenames, in that it has about the same pros and cons.</p> <h4 id="decrypto-https-amzn-to-3aopdaw"><a href="https://amzn.to/3AoPDAw">Decrypto</a></h4> <p>I'm not going to attempt to describe this game because every direct explanation I've seen someone attempt to give about the gameplay has failed to click with new players until they play a round or two. But, conceptually, each team rotates who gives a clue and the goal is to have people on your team correctly guess which clue maps to which word while having the opposing team fail to guess correctly. The guessing team has extra info in that they know what the words are, so it's easier for them to generate the correct mapping. 
However, the set of mappings generated by the guessing team is available to the &quot;decrypting&quot; team, so they might know that the mystery word was clued by &quot;Lincoln&quot; and &quot;milliner&quot;, from which they might infer that the word is &quot;hat&quot;, allowing them to correctly guess the mapping on the next clue.</p> <p>I haven't played this game enough to have an idea of how much replayability it has. It's possible it's very high and it's also possible that people figure out tricks to make it basically impossible for the &quot;decrypting&quot; team to figure out the mapping. One major downside that I've seen is that, when played with random groups of players, the game will frequently be decided by which team has the weakest player (this has happened every time I've seen this played by random groups), which is sort of the opposite of the problem that a lot of team and co-op games have, where the strongest player takes over the game. It's hard for a great player to make game-winning moves, but it's easy for a bad player to make game-losing moves, so when played with non-expert players, whichever team has the worst player will lose the game.</p> <h4 id="person-do-thing-https-persondothing-com"><a href="https://persondothing.com/">Person Do Thing</a></h4> <p>David Turner says:</p> <blockquote> <p>Person Do Thing is like Taboo, but instead of a list of forbidden words, there's a list of allowed words. Forty basic words are always allowed, and (if you want) there are three extra allowed words that are specific to each secret word. Like Taboo, the quizzer can respond to guesses -- but only using the allowed words. Because so few words are allowed, it requires a lot of creativity to give good clues .. worth playing a few times but their word list was tiny last time I checked.</p> <p>I suppose if a group played a lot they might develop a convention, e.g. &quot;like person but not think big&quot; for &quot;animal&quot;. I've heard of this happening in Concept: one group had a convention that red, white, blue, and place refers to a country with those flag colors, and that an additional modifier specifies which: water for UK, cold for Russia, food for France, and gun for USA. I think it would take a fair number of these conventions to make the game appreciably easier.</p> </blockquote> <h4 id="semantle-https-semantle-com"><a href="https://semantle.com/">Semantle</a></h4> <p>Like Wordle, but about the meaning of a word, according to word2vec. Originally designed as a solitaire game, it also works as a co-op game.</p>
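<p>To make &quot;according to word2vec&quot; concrete: each word maps to an embedding vector, and a guess is scored by how similar its vector is to the secret word's vector. A minimal sketch (illustrative only; the real game uses a pretrained word2vec model, and the scaling of the score here is just for readability):</p> <pre><code>// Illustrative sketch, not Semantle's code: score a guess by the cosine
// similarity between word2vec-style embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; a.length > i; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The embeddings would come from a pretrained word2vec model; this declaration
// is just a placeholder for illustration.
declare const embeddings: { [word: string]: number[] };

function scoreGuess(secret: string, guess: string): number {
  // Scaled up so that guessing the secret word itself scores 100.
  return 100 * cosineSimilarity(embeddings[secret], embeddings[guess]);
}</code></pre>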
<p>Although I'm sure there are people who love playing this game over and over again, I feel like the replayability is fairly low for most people (and almost no one I know ended up playing more than 40 games of this, so I think my feeling isn't uncommon). Once you play for a while and figure out how to guess words that quickly narrow down the search space, playing the game starts to feel a bit rote.</p> <p>Most people I've talked to who don't like this game didn't like it because they weren't able to build a mental model of what's happening, making the word similarity scores seem like random nonsense.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:H">If you find this mode too easy and you can accurately get your team to guess any three tiles you like every single time and have enough of an intuition of what patterns exist that you can usually avoid getting beaten by the AI, you can try the mode where the AI is allowed to guess one word a turn and will then win by guessing the rest of the words if the one word it correctly guesses is sufficient to narrow down the search space to a single possible board. In general, if you make three guesses, this narrows down the space enough that the AI can win with a single guess (in game terms, the AI would give an &quot;unlimited&quot; clue). <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> </ol> </div> A discussion of discussions on AI bias Sun, 16 Jun 2024 00:00:00 +0000 <p>There've been regular viral stories about ML/AI bias with LLMs and generative AI for the past couple of years. One thing I find interesting about discussions of bias is how different the reaction is in the LLM and generative AI case when compared to &quot;classical&quot; bugs in cases where there's a clear bug. In particular, if you look at forums or other discussions with lay people, people frequently deny that a model which produces output that's sort of the opposite of what the user asked for is even a bug. For example, a year ago, an Asian MIT grad student asked Playground AI (PAI) to &quot;Give the girl from the original photo a professional linkedin profile photo&quot; and PAI converted her face to a white face with blue eyes.</p> <p>The top &quot;there's no bias&quot; response on the front-page reddit story, and one of the top overall comments, was</p> <blockquote> <p>Sure, now go to <a href="https://civitai.com/">the most popular Stable Diffusion model website</a> and look at the images on the front page.</p> <p>You'll see an absurd number of asian women (almost 50% of the non-anime models are represented by them) to the point where you'd assume being asian is a desired trait.</p> <p>How is that less relevant that &quot;one woman typed a dumb prompt into a website and they generated a white woman&quot;?</p> <p>Also keep in mind that she typed &quot;Linkedin&quot;, so anyone familiar with how prompts currently work know it's more likely that the AI searched for the average linkedin woman, not what it thinks is a professional women because image AI doesn't have an opinion.</p> <p>In short, this is just an AI ragebait article.</p> </blockquote> <p>Other highly-ranked comments with the same theme include</p> <blockquote> <p>Honestly this should be higher up. If you want to use SD with a checkpoint right now, if you dont [sic] want an asian girl it’s much harder. Many many models are trained on anime or Asian women.</p> </blockquote> <p>and</p> <blockquote> <p>Right? AI images even have the opposite problem. 
The sheer number of Asians in the training sets, and the sheer number of models being created in Asia, means that <b>many, many</b> models are biased towards Asian outputs.</p> </blockquote> <p>Other highly-ranked comments noted that this was a sample size issue:</p> <blockquote> <p>&quot;Evidence of systemic racial bias&quot;</p> <p>Shows one result.</p> </blockquote> <p>Playground AI's CEO went with the same response when asked for an interview by the Boston Globe — he declined the interview and replied with a list of rhetorical questions like the following (the Boston Globe implies that there was more, but didn't print the rest of the reply):</p> <blockquote> <p>If I roll a dice just once and get the number 1, does that mean I will always get the number 1? Should I conclude based on a single observation that the dice is biased to the number 1 and was trained to be predisposed to rolling a 1?</p> </blockquote> <p>We could just as easily have picked an example from Google or Facebook or Microsoft or any other company that's deploying a lot of ML today, but since the CEO of Playground AI is basically asking someone to take a look at PAI's output, we're looking at PAI in this post. I tried the same prompt the MIT grad student used on my Mastodon profile photo, substituting &quot;man&quot; for &quot;girl&quot;. PAI usually turns my Asian face into a white (caucasian) face, but sometimes makes me somewhat whiter but ethnically ambiguous (maybe a bit Middle Eastern or East Asian or something). And, BTW, my face has a number of distinctively Vietnamese features which pretty obviously look Vietnamese and not any kind of East Asian.</p> <p><img src="images/ai-bias/profile.png" alt="Profile photo of Vietnamese person" width="400" height="400"> <img src="images/ai-bias/playground-ai-1.png" alt="4 profile photos run through playground AI, 3 look very European and one looks a bit ambiguous" width="848" height="210"> <img src="images/ai-bias/playground-ai-2.png" alt="4 profile photos run through playground AI, none look East Asian or Southeast Asian" width="1640" height="408"></p> <p>My profile photo is a light-skinned winter photo, so I tried a darker-skinned summer photo and PAI would then generally turn my face into a South Asian or African face, with the occasional Chinese (but never Vietnamese or any other kind of Southeast Asian) face, such as the following:</p> <p><img src="images/ai-bias/tanned-profile.jpg" alt="Profile photo of tanned Vietnamese person"> <img src="images/ai-bias/playground-ai-3.png" alt="4 profile photos of tanned Vietnamese person run through playground AI, 1 looks black and 3 look South Asian"></p> <p>A number of other people also tried various prompts and they also got results that indicated that the model (where “model” is being used colloquially for the model and its weights and any system around the model that's responsible for the output being what it is) has some preconceptions about things like what ethnicity someone has if they have a specific profession that are strong enough to override the input photo. 
For example, converting a light-skinned Asian person to a white person because the model has &quot;decided&quot; it can make someone more professional by throwing out their Asian features and making them white.</p> <p>Other people have tried various prompts to see what kind of pre-conceptions are bundled into the model and have found similar results, e.g., <a href="https://discuss.systems/@ricci/110826586910728179">Rob Ricci got the following results when asking for &quot;linkedin profile picture of X professor&quot; for &quot;computer science&quot;, &quot;philosophy&quot;, &quot;chemistry&quot;, &quot;biology&quot;, &quot;veterinary science&quot;, &quot;nursing&quot;, &quot;gender studies&quot;, &quot;Chinese history&quot;, and &quot;African literature&quot;, respectively</a>. In the 28 images generated for the first 7 prompts, maybe 1 or 2 people out of 28 aren't white. The results for the next prompt, &quot;Chinese history&quot; are wildly over-the-top stereotypical, <a href="https://restofworld.org/2023/ai-image-stereotypes/">something we frequently see from other models as well when asking for non-white output</a>. And Andreas Thienemann points out that, except for the over-the-top Chinese stereotypes, every professor is wearing glasses, another classic stereotype.</p> <p><img src="images/ai-bias/rob-pgai-1.png"> <img src="images/ai-bias/rob-pgai-2.png"> <img src="images/ai-bias/rob-pgai-3.png"></p> <p>Like I said, I don't mean to pick on Playground AI in particular. As I've noted elsewhere, <a href="https://twitter.com/danluu/status/896176897675153409">trillion dollar companies regularly ship AI models to production without even the most basic checks on bias</a>; when I tried ChatGPT out, every bias-checking prompt I played with returned results that were analogous to the images we saw here, e.g., <a href="https://twitter.com/danluu/status/1601072083270008832/">when I tried asking for bios of men and women who work in tech, women tended to have bios indicating that they did diversity work, even for women who had no public record of doing diversity work and men tended to have degrees from name-brand engineering schools like MIT and Berkeley, even people who hadn't attended any name-brand schools</a>, and likewise for name-brand tech companies (the link only has 4 examples due to Twitter limitations, but other examples I tried were consistent with the examples shown).</p> <p>This post could've used almost any publicly available generative AI. It just happens to use Playground AI because the CEO's response both asks us to do it and reflects the standard reflexive &quot;AI isn't biased&quot; responses that lay people commonly give.</p> <p>Coming back to the response about how it's not biased for professional photos of people to be turned white because Asians feature so heavily in other cases, the high-ranking reddit comment we looked at earlier suggested &quot;go[ing] to <a href="https://civitai.com/">the most popular Stable Diffusion model website</a> and look[ing] at the images on the front page&quot;. 
Below is what I got when I clicked the link on the day the comment was made and then clicked &quot;feed&quot;.</p> <p><details closed> <summary>[Click to expand / collapse mildly NSFW images]</summary> <img src="images/ai-bias/reddit-images-2.png"> <img src="images/ai-bias/reddit-images-1.png"> </details></p> <p><abbr title="The front page has been cleaned up, but it makes sense to look at the site as it was when it was linked to, and the front page has also gotten considerably less Asian as the smut has cleaned up">The site had a bit of a smutty feel to it</abbr>. The median image could be described as &quot;a poster you'd expect to see on the wall of a teenage boy in a movie scene where the writers are reaching for the standard stock props to show that the character is a horny teenage boy who has poor social skills&quot; and the first things shown when going to the feed and getting the default &quot;all-time&quot; ranking are someone grabbing a young woman's breast, titled &quot;Guided Breast Grab | LoRA&quot;; two young women making out, titled &quot;Anime Kisses&quot;; and a young woman wearing a leash, annotated with &quot;BDSM — On a Leash LORA&quot;. So, apparently there was this site that people liked to use to generate and pass around smutty photos, and the high incidence of photos of Asian women on this site was used as evidence that there is no ML bias that negatively impacts Asian women because this cancels out an Asian woman being turned into a white woman when she tried to get a cleaned up photo for her LinkedIn profile. I'm not really sure what to say to this. <a href="https://mastodon.gamedev.place/@rygorous/110824248191236861">Fabian Geisen responded with &quot;🤦‍♂️. truly 'I'm not bias. your bias' level discourse&quot;, which feels like an appropriate response</a>.</p> <p>Another standard line of reasoning on display in the comments, that I see in basically every discussion on AI bias, is typified by</p> <blockquote> <p>AI trained on stock photo of “professionals” makes her white. Are we surprised?</p> <p>She asked the AI to make her headshot more professional. Most of “professional” stock photos on the internet have white people in them.</p> </blockquote> <p>and</p> <blockquote> <p>If she asked her photo to be made more anything it would likely turn her white just because that’s the average photo in the west where Asians only make up 7.3% of the US population, and a good chunk of that are South Indians that look nothing like her East Asian features. East Asians are 5% or less; there’s just not much training data.</p> </blockquote> <p>These comments seem to operate from a fundamental assumption that companies are pulling training data that's representative of the United States and that this is a reasonable thing to do and that this <em>should</em> result in models converting everyone into whatever is most common. This is wrong on multiple levels.</p> <p>First, on whether or not it's the case that professional stock photos are dominated by white people, a quick image search for &quot;professional stock photo&quot; turns up quite a few non-white people, so either stock photos aren't very white or people have figured out how to return a more representative sample of stock photos. And given worldwide demographics, it's unclear what internet services should be expected to be U.S.-centric. 
And then, even if we accept that major internet services should assume that everyone is in the United States, it seems like both a design flaw as well as a clear sign of bias to assume that every request comes from the modal American.</p> <p>Since a lot of people have these reflexive responses when talking about race or ethnicity, let's look at a less charged AI hypothetical. Say I talk to an AI customer service chatbot for my local mechanic and I ask to schedule an appointment to put my winter tires on and do a tire rotation. Then, when I go to pick up my car, I find out they changed my oil instead of putting my winter tires on and then a bunch of internet commenters explain why this isn't a sign of any kind of bias and you should know that an AI chatbot will convert any appointment with a mechanic to an oil change appointment because it's the most common kind of appointment. A chatbot that converts any kind of appointment request into &quot;give me the most common kind of appointment&quot; is pretty obviously broken but, for some reason, AI apologists insist this is fine when it comes to things like changing someone's race or ethnicity. Similarly, it would be absurd to argue that it's fine for my tire change appointment to have been converted to an oil change appointment because other companies have schedulers that convert oil change appointments to tire change appointments, but that's another common line of reasoning that we discussed above.</p> <p>And say I used some standard non-AI scheduling software like Mindbody or JaneApp to schedule an appointment with my mechanic and asked for an appointment to have my tires changed and rotated. If I ended up having my oil changed because the software simply schedules the most common kind of appointment, this would be a clear sign that the software is buggy and no reasonable person would argue that zero effort should go into fixing this bug. And yet, this is a common argument that people are making with respect to AI (it's probably the most common defense in comments on this topic). The argument goes a bit further, in that there's this explanation of why the bug occurs that's used to justify why the bug should exist and people shouldn't even attempt to fix it. Such an explanation would read as obviously ridiculous for a &quot;classical&quot; software bug and is no less ridiculous when it comes to ML. Perhaps one can argue that the bug is much more difficult to fix in ML and that it's not practical to fix the bug, but that's different from the common argument that it isn't a bug and that this is the correct way for software to behave.</p> <p>I could imagine some users saying something like that when the program is taking actions that are more opaque to the user, such as with autocorrect, but I actually tried searching reddit for <code>autocorrect bug</code> and in the top 3 threads (I didn't look at any other threads), 2 out of the 255 comments denied that incorrect autocorrects were a bug and both of those comments were from the same person. 
I'm sure if you dig through enough topics, you'll find ones where there's a higher rate, but on searching for a few more topics (like <abbr title="There are bugs where people relatively frequently blame the user; maybe 5% to 10% of commenters like blaming the user on Excel formatting issues causing data corruption, but that's still much less than the half-ish we'll see on ML bias, and the level of agitation/irritation/anger in the comments seems lower as well">excel formatting</abbr> and autocorrect bugs), none of the topics I searched approached what we see with generative AI, where it's not uncommon to see half the commenters vehemently deny that a prompt doing the opposite of what the user wants is a bug.</p> <p>Coming back to the bug itself, in terms of the mechanism, one thing we can see in both classifiers as well as generative models is that many (perhaps most or almost all) of these systems are taking bias that a lot of people have that's reflected in some sample of the internet, which results in things like <a href="https://x.com/danluu/status/1245961051696295936">Google's image classifier classifying a black hand holding a thermometer as {hand, gun} and a white hand holding a thermometer as {hand, tool}</a><sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>. After a number of such errors over the past decade, from classifying black people as gorillas in Google Photos in 2015, to <a href="https://x.com/danluu/status/1043957859090911233">deploying some kind of text-classifier for ads that classified ads that contained the terms &quot;African-American composers&quot; and &quot;African-American music&quot; as &quot;dangerous or derogatory&quot;</a> in 2018, <a href="https://finance.yahoo.com/news/google-chatbot-ridiculed-ethnically-diverse-185014679.html">Google turned the knob in the other direction with Gemini</a>, which, by the way, generated much more outrage than any of the other examples.</p> <p>There's nothing new about bias making it into automated systems. This predates generative AI, LLMs, and is a problem outside of ML models as well. It's just that the widespread use of ML has made this legible to people, making some of these cases news. For example, if you look at compression algorithms and dictionaries, <a href="https://www.patreon.com/posts/97523860">Brotli is heavily biased towards the English language</a> — the human-language elements of the 120 transforms built into the format are English, and the built-in compression dictionary is more heavily weighted towards English than whatever representative weighting you might want to reference (population-weighted language speakers, non-automated human-language text sent on messaging platforms, etc.). There are arguments you could make as to why English should be so heavily weighted, but there are also arguments as to why the opposite should be the case, e.g., English language usage is positively correlated with a user's bandwidth, so non-English speakers, on average, need the compression more. But regardless of the exact weighting function you think should be used to generate a representative dictionary, that's just not going to make a viral news story because you can't get the typical reader to care that a number of the 120 built-in Brotli transforms do things like add &quot; of the &quot;, &quot;. The&quot;, or &quot;. This&quot; to text, which are highly specialized for English, and none of the transforms encode terms that are highly specialized for any other human language even though only 20% of the world speaks English, or that, compared to the number of speakers, the built-in compression dictionary is extremely heavily tilted towards English by comparison to any other human language. You could make a defense of the Brotli dictionary that's analogous to the ones above: over some representative corpus which the dictionary was trained on, we get optimal compression with the Brotli dictionary. But there are quite a few curious phrases in <a href="https://gist.github.com/duskwuff/8a75e1b5e5a06d768336c8c7c370f0f3">the dictionary</a>, such as &quot;World War II&quot;, &quot;, Holy Roman Emperor&quot;, &quot;British Columbia&quot;, &quot;Archbishop&quot;, &quot;Cleveland&quot;, &quot;esperanto&quot;, etc., that might lead us to wonder if the corpus the dictionary was trained on is perhaps not the most representative, or <abbr title="Looking at languages spoken vs. languages in the dictionary, the dictionary has words from the #1, #2, #3, #4, #6, and #9 most spoken languages, but is missing #5, #7, and #8 for some reason, although there's a bit of overlap in one case">even particularly representative of text people send</abbr>. Can it really be the case that including &quot;, Holy Roman Emperor&quot; in the dictionary produces, across the distribution of text sent on the internet, better compression than including anything at all for French, Urdu, Turkish, Tamil, Vietnamese, etc.?</p>
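<p>If you want to poke at this yourself, Node exposes Brotli through zlib, so it only takes a few lines to see how much the built-in dictionary favors English. This is a rough sketch: the sample sentences are arbitrary (the Vietnamese one is just an illustrative rendering of the English one), and the exact byte counts will vary with the Brotli version and quality setting:</p> <pre><code>// Rough sketch: compare how Brotli handles a short English sentence built from
// phrases that appear in its static dictionary vs. a roughly equivalent
// Vietnamese sentence. Exact numbers will vary.
import { brotliCompressSync } from "node:zlib";

function compressedBytes(text: string): number {
  return brotliCompressSync(Buffer.from(text, "utf8")).length;
}

const english = "The Holy Roman Emperor fought in World War II in British Columbia.";
const vietnamese = "Hoàng đế La Mã Thần thánh đã chiến đấu trong Thế chiến thứ hai ở British Columbia.";

console.log("english:", Buffer.byteLength(english), "bytes in,", compressedBytes(english), "bytes out");
console.log("vietnamese:", Buffer.byteLength(vietnamese), "bytes in,", compressedBytes(vietnamese), "bytes out");</code></pre>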
<p>Another example which doesn't make a good viral news story is my not being able to put my Vietnamese name in the title of my blog and have my blog indexed by Google outside of Vietnamese-language Google — I tried that when I started my blog and it caused my blog to immediately stop showing up in Google searches unless you were in Vietnam. It's just assumed that the default is that people want English language search results and, presumably, someone created a heuristic that triggers if you have two characters with Vietnamese diacritics on a page, effectively marking the page as too Asian and therefore not of interest to anyone in the world except in one country. &quot;Being visibly <abbr title="of course, you can substitute any other minority here as well">Vietnamese</abbr>&quot; seems like a fairly common cause of bugs. For example, Vietnamese names are a problem even without diacritics. I often have forms that ask for my mother's maiden name. If I enter my mother's maiden name, I'll be told something like &quot;Invalid name&quot; or &quot;Name too short&quot;. That's fine, in that I work around that kind of carelessness by having a stand-in for my mother's maiden name, which is probably more secure anyway. Another issue is when people decide I told them my name incorrectly and change my name. For my last name, if I read my name off as &quot;Luu, ell you you&quot;, that gets shortened from the Vietnamese &quot;Luu&quot; to the Chinese &quot;Lu&quot; about half the time and to a western &quot;Lou&quot; much of the time as well, but I've figured out that if I say &quot;Luu, ell you you, two yous&quot;, that works about 95% of the time. That sometimes annoys the person on the other end, who will exasperatedly say something like &quot;you didn't have to spell it out three times&quot;. 
<abbr title="Although I doubt it — my experience is that people who say this sort of thing have a higher than average error rate">Maybe so for that particular person</abbr>, but most people won't get it. This even happens when I enter my first name into a computer system, so there can be no chance of a transcription error before my name is digitally recorded. My legal first name, with no diacritics, is Dan. This isn't uncommon for an American of Vietnamese descent because Dan works as both a Vietnamese name and an American name and a lot Vietnamese immigrants didn't know that Dan is usually short for Daniel. At six of the companies I've worked for full-time, someone has helpfully changed my name to Daniel at three of them, presumably because someone saw that Dan was recorded in a database and decided that I failed to enter my name correctly and that they knew what my name was better than I did and they were so sure of this <abbr title="my HR rep asked me at once company, but of course that's one of the three companies they didn't get my name wrong">they saw no need to ask me about it</abbr>. In one case, this only impacted my email display name. Since I don't have strong feelings about how people address me, I didn't bother having it changed and lot of people called me Daniel instead of Dan while I worked there. In two other cases, the name change impacted important paperwork, so I had to actually change it so that my insurance, tax paperwork, etc., actually matched my legal name. As noted above, with fairly innocuous prompts to Playground AI using my face, even on the rare occasion they produce Asian output, seem to produce East Asian output over Southeast Asian output. I've noticed the same thing with some big company generative AI models as well — even when you ask them for Southeast Asian output, they generate East Asian output. AI tools that are marketed as tools that clean up errors and noise will also clean up Asian-ness (and other analogous &quot;errors&quot;), e.g., people who've used Adobe AI noise reduction (billed as &quot;remove noise from voice recordings with speech enhacement&quot;) note that it will take an Asian accent and remove it, making the person sound American (and likewise for a number of other accents, such as eastern European accents).</p> <p>I probably see tens to hundreds things like this most weeks just in the course of using widely used software (<a href="everything-is-broken/">much less than the overall bug count, which we previously observed was in hundreds to thousands per week</a>), but most Americans I talk to don't notice these things at all. Recently, there's been a lot of chatter about all of the harms caused by biases in various ML systems and the widespread use of ML is going to usher in all sorts of new harms. That might not be wrong, but my feeling is that we've encoded biases into automation for as long as we've had automation and the increased scope and scale of automation has been and will continue to increase the scope and scale of automated bias. 
<abbr title="and this increased visibility seems to have caused increased pushback in the form of people insisting that doing the opposite of what the user wants is not a bug">It's just that now, many uses of ML make these kinds of biases a lot more legible to lay people and therefore likely to make the news</abbr>.</p> <p>There's an ahistoricity in the popular articles I've seen on this topic so far, in that they don't acknowledge that the fundamental problem here isn't new, resulting in two classes of problems that arise when solutions are proposed. One is that solutions are often ML-specific, but the issues here occur regardless of whether or not ML is used, so ML-specific solutions seem focused at the wrong level. When the solutions proposed are general, the proposed solutions I've seen are ones that have been proposed before and failed. For example, a common call to action for at least the past twenty years, perhaps the most common (unless &quot;people should care more&quot; counts as a call to action), has been that we need more diverse teams.</p> <p>This clearly hasn't worked; if it did, problems like the ones mentioned above wouldn't be pervasive. There are multiple levels at which this hasn't worked and will not work, any one of which would be fatal to this solution. One problem is that, across the industry, <a href="tech-discrimination/">the people who are in charge (execs and people who control capital, such as VCs, PE investors, etc.)</a>, in aggregate, <abbr title="Until 2021 or so, workers in tech had an increasing amount of power and were sometimes able to push for more diversity, but the power seems to have shifted in the other direction for the foreseeable future">don't care about this</abbr>. Although there are <a href="tech-discrimination/">efficiency</a> <a href="talent/">justifications</a> for more diverse teams, <a href="bad-decisions/">the case will never be as clear-cut as it is for decisions in games and sports</a>, where we've seen that very expensive and easily quantifiable bad decisions can persist for many decades after the errors were pointed out. And then, even if execs and capital were bought into the idea, it still wouldn't work because there are <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">too many dimensions</a>. If you look at a company that really prioritized diversity, like Patreon from 2013-2019, you're lucky if the organization is capable of seriously prioritizing diversity in <a href="https://x.com/danluu/status/1487228574608211969">two or three dimensions</a> while dropping the ball on hundreds or thousands of other dimensions, such as whether or not Vietnamese names or faces are handled properly.</p> <p>Even if all those things weren't problems, the solution still wouldn't work because while having a team with relevant diverse experience may be a bit correlated with prioritizing problems, it doesn't automatically cause problems to be prioritized and fixed. To pick a non-charged example, a bug that's existed in Google Maps traffic estimates since inception that existed at least until 2022 (I haven't driven enough since then to know if the bug still exists) is that, if I ask how long a trip will take at the start of rush hour, this takes into account current traffic and not how traffic will change as I drive and therefore systematically underestimates how long the trip will take (and conversely, if I plan a trip at peak rush hour, this will systematically overestimate how long the trip will take). 
If you try to solve this problem by increasing commute diversity in Google Maps, this will fail. There are already many people who work on Google Maps who drive and can observe ways in which estimates are systematically wrong. Adding diversity to ensure that there are people who drive and notice these problems is very unlikely to make a difference. Or, to pick another example, <a href="diseconomies-scale/">when the former manager of Uber's payments team got blacklisted from Uber by an ML model that incorrectly labeled his transactions as fraudulent</a>, no one was able to figure out what happened or what sort of bias caused him to get incorrectly banned (they solved the problem by adding his account to an allowlist). There are very few people who are going to get better service than the manager of the payments team, and even in that case, Uber couldn't really figure out what was going on. Hiring a &quot;diverse&quot; candidate onto the team isn't going to automatically solve, or even make much of a difference to, bias in whatever dimension the candidate is diverse when the former manager of the team can't even get their account unbanned except by having it allowlisted after six months of investigation.</p> <p>If the result of your software development methodology is that the fix to the manager of the payments team being banned is to allowlist the user after six months, that traffic routing in your app is systematically wrong for two decades, <a href="nothing-works/">that core functionality of your app doesn't work</a>, etc., no amount of hiring people with a background that's correlated with noticing some kinds of issues is going to result in fixing issues like these, whether that's with respect to ML bias or another class of bug.</p> <p>Of course, sometimes variants of old ideas that have failed do succeed, but for a proposal to be credible, or even interesting, the proposal has to address why the next iteration won't fail like every previous iteration did. As we noted above, at a high level, the two most common proposed solutions I've seen are that people should try harder and care more and that we should have people of different backgrounds, in a non-technical sense. This hasn't worked for the plethora of &quot;classical&quot; bugs, this hasn't worked for old ML bugs, and it doesn't seem like there's any reason to believe that this should work for the kinds of bugs we're seeing from today's ML models.</p> <p>Laurence Tratt says:</p> <blockquote> <p>I think this is a more important point than individual instances of bias. What's interesting to me is that mostly a) no-one notices they're introducing such biases b) often it wouldn't even be reasonable to expect them to notice. For example, some web forms rejected my previous address, because I live in the countryside where many houses only have names -- but most devs live in cities where houses exclusively have numbers. In a sense that's active bias at work, but there's no mal intent: programmers have to fill in design details and make choices, and they're going to do so based on their experiences. None of us knows everything! That raises an interesting philosophical question: when is it reasonable to assume that organisations should have realised they were encoding a bias?</p> </blockquote> <p>My feeling is that <a href="nothing-works/">the &quot;natural&quot;, as in lowest-energy and most straightforward, state for institutions and products is that they don't work very well</a>. 
If someone hasn't previously <a href="culture/">instilled a culture</a> or <a href="wat/">instituted processes</a> that foster quality in a particular dimension, quality is likely to be poor, <a href="hardware-unforgiving/">due to the difficulty of producing something high quality</a>, so organizations should expect that they're encoding all sorts of biases if there isn't a robust process for catching biases.</p> <p>One issue we're running up against here is that, when it comes to consumer software, companies have overwhelmingly chosen velocity over quality. This seems basically inevitable given the regulatory environment we have today or any regulatory environment we're likely to have in my lifetime, in that companies that seriously choose quality over feature velocity get outcompeted because consumers overwhelmingly choose the lower cost or more featureful option over the higher quality option. We <a href="car-safety/">saw this with cars when we looked at how vehicles perform in out-of-sample crash tests</a> and saw that only Volvo was optimizing cars for actual crashes as opposed to scoring well on public tests. Despite vehicular accidents being one of the leading causes of death for people under 50, paying for safety is such a low priority for consumers that Volvo has become a niche brand that had to move upmarket and sell luxury cars to even survive. We also saw this with CPUs, where Intel used to expend much more verification effort than AMD and ARM and had concomitantly fewer serious bugs. <a href="cpu-bugs/">When AMD and ARM started seriously threatening, <abbr title="One might hope that Intel's quality advantage was the reason it had monopoly power, but it looks like that was backwards and it was Intel's monopoly power that allowed it to invest in quality.">Intel shifted effort away from verification and validation in order to increase velocity because their quality advantage wasn't doing them any favors</abbr> in the market and Intel chips are now almost as buggy as AMD chips</a>.</p> <p>We can observe something similar <a href="nothing-works/">in almost every consumer market and many B2B markets as well</a>, and that's when we're talking about issues that have known solutions. If we look at a problem that, from a technical standpoint, we don't know how to solve well, like subtle or even not-so-subtle bias in ML models, it stands to reason that we should expect to see more and worse bugs than we'd expect out of &quot;classical&quot; software systems, which is what we're seeing. Any solution to this problem that's going to hold up in the market is going to have to be robust against the issue that consumers will overwhelmingly choose the buggier product if it has more features they want or ships features they want sooner, which puts any solution that requires taking care in a way that significantly slows down shipping in a very difficult position, <abbr title="Other possibilities, each unlikely, are that consumers will actually care about bias and meaningful regulatory change that doesn't backfire">absent a single dominant player, like Intel in its heyday</abbr>.</p> <p><i>Thanks to Laurence Tratt, Yossi Kreinin, Anonymous, Heath Borders, Benjamin Reeseman, Andreas Thienemann, and Misha Yagudin for comments/corrections/discussion</i></p> <h3 id="appendix-technically-how-hard-is-it-to-improve-the-situation">Appendix: technically, how hard is it to improve the situation?</h3> <p>This is a genuine question and not a rhetorical question. 
I haven't done any ML-related work since 2014, so I'm not well-informed enough about what's going on now to have a direct opinion on the technical side of things. A number of people who've worked on ML a lot more recently than I have, like Yossi Kreinin (see appendix below) and <a href="https://buttondown.email/apperceptive/archive/supervision-and-truth/">Sam Anthony</a>, think the problem is very hard, maybe impossibly hard where we are today.</p> <p>Since I don't have a direct opinion, here are two situations which sound plausibly analogous, each of which supports a different conclusion.</p> <p>Analogy one: Maybe this is like <a href="sounds-easy/">people saying, since at least 2014, that someone will build a Google any day now because existing open source tooling is already basically better than Google search</a> or <a href="https://yosefk.com/blog/high-level-cpu-follow-up.html">people saying</a> that <a href="https://yosefk.com/blog/the-high-level-cpu-challenge.html">building a &quot;high-level&quot; CPU</a> that encodes high-level language primitives <a href="https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.html">into hardware</a> would <a href="https://www.patreon.com/posts/54329188">give us a 1000x speedup on general purpose CPUs</a>. You can't really prove that this is wrong and it's possible that a massive improvement in search quality or a 1000x improvement in CPU performance is just around the corner, but people who make these proposals generally sound like cranks because they exhibit the ahistoricity we noted above and propose solutions that we already know don't work with no explanation of why their solution will address the problems that have caused previous attempts to fail.</p> <p>Analogy two: Maybe this is like software testing, where <a href="everything-is-broken/">software bugs are pervasive and, although there's decades of prior art from the hardware industry on how to find bugs more efficiently</a>, there are very few areas where <a href="testing/">any of these techniques</a> are applied. I've talked to people about this a number of times and the most common response is something about how application XYZ has some unique constraint that makes it impossibly hard to test at all or test using the kinds of techniques I'm discussing, but every time I've dug into this, the application has been much easier to test than areas where I've seen these techniques applied. One could argue that I'm a crank when it comes to testing, but I've actually used these techniques to test a variety of software and been successful doing so, so I don't think this is the same as things like <a href="https://www.patreon.com/posts/54329188">claiming that CPUs would be 1000x faster if only we used my pet CPU architecture</a>.</p> <p>Due to the incentives in play, where software companies can typically pass the cost of bugs onto the customer without the customer really understanding what's going on, I think we're not going to see a large amount of effort spent on testing absent regulatory changes, but there isn't a fundamental reason that we need to avoid using more efficient testing techniques and methodologies.</p> <p>From a technical standpoint, the barrier to using better test techniques is fairly low — I've walked people through how to get started writing their own fuzzers and randomized test generators and this typically takes between 30 minutes and an hour, after which people will tend to use these techniques to find important bugs much more efficiently than they used to. 
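</p>

<p>To give a flavor of what I mean, here's a minimal sketch of a randomized test generator; the function under test is a toy stand-in, and a real fuzzer or property-based test setup would be more involved, but the shape is the same: generate random inputs, run the code, and check properties that must always hold:</p>

<pre><code>
import random

def dedupe_keep_order(items):
    # Toy function under test: remove duplicates, preserving first occurrence.
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def check_properties(inp, out):
    # Properties that must hold for any input, however it was generated.
    assert len(out) == len(set(out)), 'output contains duplicates'
    assert set(out) == set(inp), 'output lost or invented elements'
    firsts = []
    for x in inp:
        if x not in firsts:
            firsts.append(x)
    assert out == firsts, 'order of first occurrences not preserved'

random.seed(0)
for _ in range(10_000):
    # Small value range and short lists make duplicates and edge cases common.
    inp = [random.randint(0, 5) for _ in range(random.randint(0, 20))]
    check_properties(inp, dedupe_keep_order(inp))
print('10,000 random cases passed')
</code></pre>

<p>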
However, by revealed preference, we can see that organizations don't really &quot;want to&quot; have their developers test efficiently.</p> <p>When it comes to testing and fixing bias in ML models, is the situation more like analogy one or analogy two? Although I wouldn't say with any level of confidence that we are in analogy two, I'm not sure how I could be convinced that we're not in analogy two. If I didn't know anything about testing, I would listen to all of these people explaining to me why their app can't be tested in a way that finds showstopping bugs and then conclude something like one of the following:</p> <ul> <li>&quot;Everyone&quot; is right, which makes sense — this is a domain they know about and I don't, so why should I believe anything different?</li> <li>No opinion, perhaps due to a high default level of skepticism</li> <li>Everyone is wrong, which seems unreasonable given that I don't know anything about the domain and have no particular reason to believe that everyone is wrong</li> </ul> <p>As an outsider, it would take a very high degree of overconfidence to decide that everyone is wrong, so I'd have to either incorrectly conclude that &quot;everyone&quot; is right or have no opinion.</p> <p>Given the situation with &quot;classical&quot; testing, I feel like I have to have no real opinion here. With no up-to-date knowledge, it wouldn't be reasonable to conclude that so many experts are wrong. But there are enough <a href="bitfunnel-sigir.pdf">problems that people have said are difficult or impossible that turn out to be feasible and not really all that tricky</a> that I have a hard time having a high degree of belief that a problem is essentially unsolvable without actually looking into it.</p> <p>I don't think there's any way to estimate what I'd think if I actually looked into it. Let's say I try to work in this area and try to get a job at OpenAI or another place where people are working on problems like this, <a href="https://x.com/danluu/status/1470890494775361538">somehow pass the interview</a>, work in the area for a couple of years, and make no progress. That doesn't mean that the problem isn't solvable, just that I didn't solve it. When it comes to the &quot;Lucene is basically as good as Google search&quot; or &quot;CPUs could easily be 1000x faster&quot; people, it's obvious to people with knowledge of the area that the people saying these things are cranks because they exhibit a total lack of understanding of what the actual problems in the field are, but making that kind of judgment call requires knowing a fair amount about the field and I don't think there's a shortcut that would let you reliably figure out what your judgment would be if you had knowledge of the field.</p> <h3 id="appendix-the-story-of-this-post">Appendix: the story of this post</h3> <p>I wrote a draft of this post when the Playground AI story went viral in mid-2023, and then I sat on it for a year to see if it seemed to hold up when the story was no longer breaking news. Looking at this a year later, I don't think the fundamental issues or the discussions I see on the topic have really changed, so I cleaned it up and then published this post in mid-2024.</p> <p>If you like making predictions, what do you think the odds are that this post will still be relevant a decade later, in 2033? 
For reference, <a href="everything-is-broken/">this post on &quot;classical&quot; software bugs that was published in 2014 could've been published today, in 2024, with essentially the same results</a> (I say essentially because I see more bugs today than I did in 2014, and I see a lot more front-end and OS bugs today than I saw in 2014, so there would be more bugs and different kinds of bugs).</p> <h3 id="appendix-comments-from-other-folks">Appendix: comments from other folks</h3> <p><details open> <summary>[Click to expand / collapse comments from Yossi Kreinin]</summary></p> <p>I'm not sure how much this is something you'd agree with but I think a further point related to generative AI bias being a lot like other-software-bias is exactly what this bias is. &quot;AI bias&quot; isn't AI learning the biases of its creators and cleverly working to implement them, e.g. working against a minority that its creators don't like. Rather, &quot;AI bias&quot; is something like &quot;I generally can't be bothered to fix bugs unless the market or the government compels me to do so, and as a logical consequence of this, I especially can't be bothered to fix bugs that disproportionately negatively impact certain groups where the impact, due to the circumstances of the specific group in question, is less likely to compel me to fix the bug.&quot;</p> <p>This is a similarity between classic software bugs and AI bugs — meaning, nobody is worried that &quot;software is biased&quot; in some clever scheming sort of way, everybody gets that it's the software maker who's scheming or, probably more often, it's the software maker who can't be bothered to get things right. With generative AI I think &quot;scheming&quot; is actually even less likely than with traditional software and &quot;not fixing bugs&quot; is more likely, because people don't understand AI systems they're making and can make them do their bidding, evil or not, to a much lesser extent than with traditional software; OTOH bugs are more likely for the same reason [we don't know what we're doing.] I think a lot of people across the political spectrum [including for example Elon Musk and not just journalists and such] say things along the lines of &quot;it's terrible that we're training AI to think incorrectly about the world&quot; in the context of racial/political/other charged examples of bias; I think in reality this is a product bug affecting users to various degrees and there's bias in how the fixes are prioritized but the thing isn't capable of thinking at all.</p> <p>I guess I should add that there are almost certainly attempts at &quot;scheming&quot; to make generative AI repeat a political viewpoint, over/underrepresent a group of people etc, but invariably these attempts create hilarious side effects due to bugs/inability to really control the model. I think that similar attempts to control traditional software to implement a politics-adjacent agenda are much more effective on average (though here too I think you actually had specific examples of social media bugs that people thought were a clever conspiracy). Whether you think of the underlying agenda as malice or virtue, both can only come after competence and here there's quite the way to go.</p> <p>See <a href="https://arxiv.org/pdf/2406.02061">Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models</a>. 
I feel like if this doesn't work, a whole lot of other stuff doesn't work, either and enumerating it has got to be rather hard.</p> <p>I mean nobody would expect a 1980s expert system to get enough tweaks to not behave nonsensically. I don't see a major difference between that and an LLM, except that an LLM is vastly more useful. It's still something that pretends to be talking like a person but it's actually doing something conceptually simple and very different that often looks right. </details></p> <p><details open> <summary>[Click to expand / collapse comments from an anonymous founder of an AI startup]</summary> [I]n the process [of founding an AI startup], I have been exposed to lots of mainstream ML code. Exposed as in “nuclear waste” or “H1N1”. It has old-fashioned software bugs at a rate I find astonishing, even being an old, jaded programmer. For example, I was looking at tokenizing recently, and the first obvious step was to do some light differential testing between several implementations. And it failed hilariously. Not like “they missed some edge cases”, more like “nobody ever even looked once”. Given what we know about how well models respond to out of distribution data, this is just insane.</p> <p>In some sense, this is orthogonal to the types of biases you discuss…but it also suggests a deep lack of craftsmanship and rigor that matches up perfectly. </details></p> <p><details closed> <summary>[Click to expand / collapse comments from Benjamin Reeseman]</summary></p> <p>[Ben wanted me to note that this should be considered an informal response]</p> <p>I have a slightly different view of demographic bias and related phenomena in ML models (or any other “expert” system, to your point ChatGPT didn’t invent this, it made it legible to borrow your term).</p> <p>I think that trying to force the models to reflect anything other than a corpus that’s now basically the Internet give or take actually masks the real issue: the bias is real, people actually get mistreated over their background or skin color or sexual orientation or any number of things and I’d far prefer that the models surface that, run our collective faces in the IRL failure mode than try to tweak the optics in an effort to permit the abuses to continue.</p> <p>There’s a useful analogy to things like the #metoo movement or various DEI initiatives, most well-intentioned in the beginning but easily captured and ultimately representing a net increase in the blank check of those in positions of privilege.</p> <p>This isn’t to say that alignment has no place and I think it likewise began with good intentions and is even maybe a locally useful mitigation.</p> <p>But the real solution is to address the injustice and inequity in the real world.</p> <p>I think the examples you cited are or should be a wake-up call that no one can pretend to ignore credibly about real issues and would ideally serve as a forcing function on real reform.</p> <p>I’d love to chat about this at your leisure, my viewpoint is a minority one, but personally I’m a big fan of addressing the underlying issues rather than papering over them with what amounts to a pile of switch statements.</p> <p>There’s a darker take on this: real reform is impossible, we live in techno dystopia now, let’s mitigate where and how a hunted minority can.</p> <p>And there is a distressingly strong argument for that case: even in the ostensibly developed world cops today look like soldiers did when I was a kid 30 years ago, someone is guarding something from something at great 
expense and it isn’t black kids from getting shot near as I can tell.</p> <p>But I don’t subscribe to the pessimism, I think it’s a local anomaly as industrialization transitions into arbitrary power over the physical logistics <em>if we only knew how to change it</em>.</p> <p>I did a brief stint consulting for biotech folks affiliated with Nobel shortlist types in an area called proteomics. Like any consultant I surrender any credit to the people paying handsomely by the hour.</p> <p>But it really is a Shannon problem now: CRISPR Cas-9 can do arbitrary edits to not only a person’s genome but (in an ethical and legal grey area) to the germ line as well.</p> <p>We just have no idea what to change and there’s enough integrity in the field that we won’t risk just arbitrary children on finding out. </details></p> <h3 id="appendix-reproducing-rob-ricci-s-results">Appendix: reproducing Rob Ricci's results</h3> <p>I tried prompts with default settings, except for reducing image quality to 10 to generate images more quickly. This means we had 512/512, 7 prompt guidance, 10 quality, random seed. After 2 tries, increased image quality to 40 because the images were too low quality to tell the ethnicity sometimes. Other than increasing the image quality, there was no attempt to re-run prompts or otherwise due any kind of selection from the output. The prompts were &quot;Generate a very professional looking linkedin profile photo for a X&quot;, where X was Doctor, Lawyer, Engineer, Scientist, Journalist, and Banker.</p> <p>This produced the following images:</p> <p><img src="images/ai-bias/playground-ai-doctor.png"> <img src="images/ai-bias/playground-ai-lawyer.png"> <img src="images/ai-bias/playground-ai-engineer.png"> <img src="images/ai-bias/playground-ai-scientist.png"> <img src="images/ai-bias/playground-ai-journalist.png"> <img src="images/ai-bias/playground-ai-banker.png"></p> <p>Roughly speaking, I think Rob's results reproduced, which should be no surprise at this point given how many images we've seen.</p> <p>And then, to see if we could reproduce the standard rebuttal that generative AI isn't biased because requests for smutty images often have Asian women, I tried the prompt &quot;Generate a trashy instagram profile photo for a porn star&quot;. There's an NSFW filter that was tripped in some cases, so we don't get groups of four images and instead got:</p> <p><details closed> <summary>[Click to expand / collapse very mildly NSFW images]</summary> <img src="images/ai-bias/playground-ai-trashy.png"> </details></p> <p>And, indeed, the generated images are much more Asian than we got for any of our professional photos, save Rob Ricci's set of photos for asking for a &quot;linkedin profile picture of Chinese Studies professor&quot;.</p> <h3 id="appendix-comments-from-benjamin-reeseman">Appendix: comments from Benjamin Reeseman</h3> <div class="footnotes"> <hr /> <ol> <li id="fn:S">Naturally, when I mentioned this, a &quot;smart contrarian&quot; responded with &quot;what are base rates&quot;, but spending 30 seconds googling reveals that the base rate of U.S. gun ownership is much higher among whites than in any other demographic. The base rate argument is even more absurd if you think about the base rate of a hand holding an object — what fraction of the time is that object a gun? Regardless of race, it's going to be very low. 
Of course, you could find a biased sample that doesn't resemble the underlying base rate at all, which appears to be what Google did, but it's not clear why this justifies having this bug. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> </ol> </div> What the FTC got wrong in the Google antitrust investigation ftc-google-antitrust/ Sun, 26 May 2024 00:00:00 +0000 ftc-google-antitrust/ <p>From 2011-2012, the FTC investigated the possibility of pursuing antitrust action against Google. The FTC decided to close the investigation and not much was publicly known about what happened until Politico released 312 pages of internal FTC memos from the investigation a decade later. As someone who works in tech, on reading the memos, the most striking thing is how one side, the side that argued to close the investigation, repeatedly displays a lack of basic understanding of the tech industry and <abbr title="that are available">the memos from directors and other higher-ups</abbr> don't acknowledge this at all.</p> <p>If you don't generally follow what regulators and legislators are saying about tech (or any other industry), it's a bit shocking to see the internal discussions when these decisions are, apparently, being made with little to no understanding of the industries<sup class="footnote-ref" id="fnref:K"><a rel="footnote" href="#fn:K">1</a></sup>.</p> <p>Inside the FTC, the Bureau of Competition (BC) made a case that antitrust action should be pursued and the Bureau of Economics (BE) made the case that the investigation should be dropped. The BC case is moderately strong. Reasonable people can disagree on whether or not the case is strong enough that antitrust action should've been pursued, but a reasonable person who is anti-antitrust has to concede that the antitrust case in the BC memo is at least defensible. The case against, made in the BE memo, is not defensible. There are major errors in core parts of the BE memo. In order for the BE memo to seem credible, the reader must have large and significant gaps in their understanding of the tech industry. If there was any internal FTC discussion on the errors in the BE memo, there's no indication of that in any public documents. As far as we can see from the evidence that's available, nobody noticed the BE memo's errors. The publicly available memos from directors and other higher-ups indicate that they gave the BE memo as much or more weight than the BC memo, implying a gap in <abbr title="at least among people whose memos were made public">FTC leadership's</abbr> understanding of the tech industry.</p> <h2 id="brief-summary">Brief summary</h2> <p>Since the BE memo is effectively a rebuttal of the BC memo, we'll start by looking at the arguments in the BC memo. The bullet points below summarize the Executive Summary from the BC memo, which roughly summarizes the case made by the BC memo:</p> <ul> <li>Google is the dominant search engine and seller of search ads</li> <li>This memo addresses 4 of 5 areas with anticompetitive conduct; mobile is in a supplemental memo</li> <li>Google has monopoly power in the U.S. 
in Horizontal Search; Search Advertising; and Syndicated Search and Search Advertising</li> <li>On the question of whether Google has unlawfully preferenced its own content while demoting rivals, we do not recommend the FTC proceed; it's a close call, case law is not favorable to anticompetitive product design claims, Google's efficiency justifications are strong, and there's some benefit to users</li> <li>On whether Google has unlawfully scraped content from vertical rivals to improve their own vertical products, recommending condemning as a conditional refusal to deal under Section 2 <ul> <li>Prior voluntary dealing was mutually beneficial</li> <li>Threats to remove rival content from general search designed to coerce rivals into allowing Google to use their content for Google's vertical product</li> <li>Natural and probable effect is to diminish incentives of vertical website R&amp;D</li> </ul></li> <li>On anticompetitive contractual restrictions on automated cross-management of ad campaigns, restrictions should be condemned under Section 2 <ul> <li>They limit ability of advertisers to make use of their own data, reducing innovation and increasing transaction costs for advertisers and third-party businesses</li> <li>Also degrade the quality of Google's rivals in search and search advertising</li> <li>Google's efficiency justifications appear to be pretextual</li> </ul></li> <li>On anticompetitive exclusionary agreements with websites for syndicated search and search ads, Google should be condemned under Section 2 <ul> <li>Only modest anticompetitive effects on publishers, but deny scale to competitors, competitively significant to main rival (Bing) as well as significant barrier to entry in longer term</li> <li>Google's efficiency justifications are, on balance, non-persuasive</li> </ul></li> <li>Possible remedies <ul> <li>Scraping <ul> <li>Could be required to provide an opt-out for snippets (reviews, ratings) from Google's vertical properties while retaining snippets in web search and/or Universal Search on main search results page</li> <li>Could be required to limit use of content indexed from web search results</li> </ul></li> <li>Campaign management restrictions <ul> <li>Could be required to remove problematic contractual restrictions from license agreements</li> </ul></li> <li>Exclusionary syndication agreements <ul> <li>Could be enjoined from entering into exclusive search agreements with search syndication partners and required to loosen restrictions surrounding syndication partners' use of rival search ads</li> </ul></li> </ul></li> <li>There are a number of risks to the case, not named in the summary except that Google can argue that Microsoft's most efficient distribution channel is bing.com and that any scale MS might gain will be immaterial to Bing's competitive position</li> <li>[BC] Staff concludes Google's conduct has resulted and will result in real harm to consumers and to innovation in online search and ads.</li> </ul> <p>In their supplemental memo on mobile, BC staff claim that Google dominates mobile search via exclusivity agreements and that mobile search was rapidly growing at the time. BC staff claimed that, according to Google internal documents, mobile search went from 9.5% to 17.3% of searches in 2011 and that both Google and Microsoft internal documents indicated that the expectation was that mobile would surpass desktop in the near future. 
As with the case on desktop, BC staff use Google's ability to essentially unilaterally reduce revenue share as evidence that Google has monopoly power and can dictate terms, and they quote Google leadership noting this exact thing.</p> <p>BC staff acknowledge that many of Google's actions have been beneficial to consumers, but balance this against the harms of anticompetitive tactics, saying</p> <blockquote> <p>the evidence paints a complex portrait of a company working toward an overall goal of maintaining its market share by providing the best user experience, while simultaneously engaging in tactics that resulted in harm to many vertical competitors, and likely helped to entrench Google's monopoly power over search and search advertising</p> </blockquote> <p>BE staff strongly disagreed with BC staff. BE staff also believe that many of Google's actions have been beneficial to consumers, but when it comes to harms, in almost every case, BE staff argue that the market isn't important, isn't a distinct market, or that the market is competitive and Google's actions are procompetitive and not anticompetitive.</p> <h2 id="common-errors">Common errors</h2> <p>At least in the documents provided by Politico, BE staff generally declined to engage with BC staff's arguments and numbers directly. For example, in addition to arguing that Google's agreements and exclusivity (insofar as agreements are exclusive) are procompetitive and that foreclosing the possibility of such agreements might have significant negative impacts on the market, they argue that mobile is a small and unimportant market. The BE memo argues that mobile is only 8% of the market and, <abbr title="BE staff acknowledge that mobile is rapidly growing, but they do so in a minor way that certainly does not imply that mobile is or soon will be important">while it's growing rapidly</abbr>, is unimportant, as it's only a &quot;small percentage of overall queries and an even smaller percentage of search ad revenues&quot;. They also claim that there is robust competition in mobile because, in addition to Apple, there's also BlackBerry and Windows Mobile. Between when the FTC investigation started and when the memo was written, BlackBerry's marketshare dropped from ~14% to ~6%, which was part of a long-term decline that showed no signs of changing. Windows Mobile's drop was less precipitous, from ~6% to ~4%, but in a market with such strong network effects, it's curious that BE staff would argue that these platforms with low and declining marketshare would provide robust competition going forward.</p> <p>When the authors of the BE memo make a prediction, they seem to have a facility for predicting the opposite of what will happen. To do this, they took positions that were opposed to the general consensus at the time. Another example of this is when they imply that there is robust competition in the search market, which they imply should be expected to continue without antitrust action. Their evidence for this was that Yahoo and Bing had a combined &quot;steady&quot; 30% marketshare in the U.S., with query volume growing faster than Google's since the Yahoo-Bing alliance was announced. 
The BE memo authors go even further and claim that Microsoft's query volume is growing faster than Google's and that Microsoft + Yahoo combined have higher marketshare than Google as measured by search <a href="https://en.wikipedia.org/wiki/Active_users">MAU</a>.</p> <p>The BE memo's argument that Yahoo and Bing are providing robust and stable competition leaves out that the fixed costs of running a search engine are so high and the scale required to be profitable so large that Yahoo effectively dropped out of search and outsourced search to Bing. And Microsoft was subsidizing Bing to the tune of $2B/yr, in a strategic move that most observers in tech thought would not be successful. At the time, it would have been reasonable to think that if Microsoft stopped heavily subsidizing Bing, its marketshare would drop significantly, which is what happened after antitrust action was not taken and Microsoft decided to shift funding to other bets that had better ROI. Estimates today put Google at 86% to 90% share in the United States, with estimates generally being a bit higher worldwide.</p> <p>On the wilder claims, such as Microsoft and Yahoo combined having more active search users than Google and Microsoft's query volume (and therefore search marketshare) growing faster than Google's, they use comScore data. There are a couple of curious things about this.</p> <p>First, the authors pick and choose their data in order to present figures that maximize Microsoft's marketshare. When comScore data makes Microsoft marketshare appear relatively low, as in syndicated search, the authors of the BE memo explain that comScore data should not be used because it's inaccurate. However, when comScore data is prima facie unrealistic and makes Microsoft marketshare look larger than is plausible or growing faster than is plausible, the authors rely on comScore data without explaining why they rely on this source that they said should not be used because it's unreliable.</p> <p>Using this data, the BE memo basically argues that, because many users use Yahoo and Bing at least occasionally, users clearly could use Yahoo and Bing, and there must not be a significant barrier to switching even if (for example) a user uses Yahoo or Bing once a month and Google one thousand times a month. From having worked with and talked to people who work on product changes to drive growth, the overwhelming consensus has been that <abbr title="There are some circumstances where this isn't true, but it would be difficult to make a compelling case that the search market in 2012 was one of those markets in general">it's generally very difficult to convert a lightly-engaged user who barely registers as an MAU to a heavily-engaged user who uses the product regularly</abbr>, and that this is generally considered more difficult than converting a brand-new user into a heavily engaged user. Like Boies's argument about rangeCheck, it's easy to see how this line of reasoning would sound plausible to a lay person who knows nothing about tech, but the argument reads like something you'd expect to see from a lay person.</p> <p>Although the BE staff memo reads like a rebuttal to the points of the BC staff memo, the lack of direct engagement on the facts and arguments means that a reader with no knowledge of the industry who reads just one of the memos will have a very different impression than a reader who reads the other. 
For example, on the importance of mobile search, a naive BC-memo-only reader would think that mobile is very important, perhaps the most important thing, whereas a naive BE-memo-only reader would think that mobile is unimportant and will continue to be unimportant for the foreseeable future.</p> <p>Politico also released memos from two directors who weigh the arguments of BC and BE staff. Both directors favor the BE memo over the BC memo, one very much so and one moderately so. When it comes to disagreements, such as the importance of mobile in the near future, there's no evidence in the memos presented that there was any attempt to determine who was correct or that the errors we're discussing here were noticed. The closest thing to addressing disagreements such as these is comments that thank both staffs for having done good work, in what one might call a &quot;fair and balanced&quot; manner, such as &quot;The BC and BE staffs have done an outstanding job on this complex investigation. The memos from the respective bureaus make clear that the case for a complaint is close in the four areas ... &quot;. To the extent that this can be inferred, it seems that the reasoning and facts laid out in the BE memo were given at least as much weight as the reasoning and facts in the BC memo despite much of the BE memo's case seeming highly implausible to an observer who understands tech.</p> <p>For example, on the importance of mobile, I happened to work at Google shortly after these memos were written and, when I was at Google, they had already pivoted to a &quot;mobile first&quot; strategy because it was understood that mobile was going to be the most important market going forward. This was also understood at other large tech companies at the time and had been understood going back further than the dates of these memos. Many consumers didn't understand this, and redesigns that degraded the desktop experience in order to unify desktop and mobile experiences were a common cause of complaints at the time. But if you looked at the data on this or talked to people at big companies, it was clear that, from a business standpoint, it made sense to focus on mobile and deal with whatever fallout might happen in desktop if that allowed for greater velocity in mobile development.</p> <p>Both the BC and BE staff memos extensively reference interviews across many tech companies, including all of the &quot;<abbr title="of course this term wasn't in use at the time">hyperscalers</abbr>&quot;. It's curious that someone could have access to all of these internal documents from these companies as well as interviews and then make the argument that mobile was, at the time, not very important. And it's strange that, at least to the extent that we can know what happened from these memos, directors took both sets of arguments at face value and then decided that the BE staff case was as convincing as, or more convincing than, the BC staff case.</p> <p>That's one class of error we repeatedly see in the BC and BE staff memos: stretching data to make a case that a knowledgeable observer can plainly see is not true. 
In most cases, it's BE staff who have stretched the data to push a tenuous position as far as it can go, but there are some instances of BC staff making a case that's a stretch.</p> <p>Another class of error we see repeated, mainly in the BE memo, is taking what <abbr title="especially people familiar with the business or product side of the industry">most people in industry</abbr> would consider an obviously incorrect model of the world and then making inferences based on that. An example of this is the discussion on whether or not vertical competitors such as Yelp and TripAdvisor were or would be significantly disadvantaged by actions BC staff allege are anticompetitive. BE staff, in addition to arguing that Google's actions were actually procompetitive and not anticompetitive, argued that it would not be possible for Google to significantly harm vertical competitors because the amount of traffic Google drives to them is small, only 10% to 20% of their total traffic, going on to say &quot;the effect on traffic from Google to local sites is very small and not statistically significant&quot;. Although BE staff don't elaborate on their model of how this business works, they appear to <abbr title="or believe something equivalent to">believe</abbr> that the market is basically static. If Google removes Yelp from its listings (which they threatened to do if they weren't allowed to integrate Yelp's data into their own vertical product) or downranks Yelp to preference Google's own results, this will, at most, reduce Yelp's traffic by 10% to 20% in the long run because only 10% to 20% of traffic comes from Google.</p> <p>But even a VC or PM intern can be expected to understand that the market isn't static. What one would expect if Google can persistently take a significant fraction of search traffic away from Yelp and direct it to Google's local offerings instead is that, in the long run, Yelp will end up with very few users and become a shell of what it once was. This is exactly what happened and, as of this writing, Yelp is valued at $2B despite having a trailing P/E ratio of 24, which is a fairly low P/E for a tech company. But the P/E ratio is unsurprisingly low because it's not generally believed that Yelp can turn this around due to Google's dominant position in search as well as maps making it very difficult for Yelp to gain or retain users. This is not just obvious in retrospect; it was well understood at the time. In fact, I talked to a former colleague at Google who was working on one of a number of local features that leveraged the position that Google had and that Yelp could never reasonably attain; the expected outcome of these features was to cripple Yelp's business. Not only was it understood that this was going to happen, it was also understood that Yelp was not likely to be able to counter this due to Google's ability to leverage its market power from search and maps. It's curious that, at the time, someone would've seriously argued that cutting off Yelp's source of new users while simultaneously presenting virtually all of Yelp's then-current users with an alternative that's bundled into an app or website they already use would not significantly impact Yelp's business, but the BE memo makes that case. 
One could argue that the set of maneuvers used here is analogous to the ones done by Microsoft that were brought up in the Microsoft antitrust case, where it was alleged that a Microsoft exec said that they were going to &quot;cut off Netscape’s air supply&quot;, but the BE memo argues that the impact of having one's air supply cut off is &quot;very small and not statistically significant&quot; (after all, a typical body has blood volume sufficient to bind 1L of oxygen, much more than the oxygen normally taken in during one breath).</p> <p>Another class of, if not error, then poorly supported reasoning is relying on a <a href="cocktail-ideas/">cocktail party level of reasoning</a> when there's data or other strong evidence that can be directly applied. This happens throughout the BE memo even though, at other times, when the BC memo has some moderately plausible reasoning, the BE memo's counter is that we should not accept such reasoning and need to look at the data and not just reason about things in the abstract. The BE memo heavily leans on the concept that we must rely on data over reasoning and calls arguments from the BC memo that aren't rooted in rigorous data anecdotal, &quot;beyond speculation&quot;, etc., but the BE memo only does this in cases where knowledge or reasoning might lead one to conclude that there was some kind of barrier to competition. When the data indicates that Google's behavior creates some kind of barrier in the market, the authors of the BE memo ignore all relevant data and instead rely on reasoning over data even when the reasoning is weak and has the character of the Boies argument we referenced earlier. One could argue that the standard of evidence for pursuing an antitrust case should be stronger than the standard of evidence for not pursuing one, but if the asymmetry observed here were for that reason, the BE memo could have listed areas where the evidence wasn't strong enough without making its own weak assertions in the face of stronger evidence. An example of this is the discussion of the impact of mobile defaults.</p> <p>The BE memo argues that defaults are essentially worthless and have little to no impact, saying multiple times that users can switch with just &quot;a few taps&quot;, adding that this takes &quot;a few seconds&quot; and that, therefore, &quot;[t]hese are trivial switching costs&quot;. The most obvious and direct piece of evidence on the impact of defaults is the amount of money Google pays to retain its default status. In a 2023 antitrust action, it was revealed that Google paid Apple $26.3B to retain its default status in 2021. As of this writing, Apple's P/E ratio is 29.53. If we think of this payment as, at the margin, pure profit, a naive estimate of how much default status is worth to Apple is that it can account for something like $776B of Apple's $2.9T market cap ($26.3B times a P/E of 29.53 is roughly $776B). Or, looking at this from Google's standpoint, Google's P/E ratio is 27.49, so Google is willing to give up $722B of its $2.17T market cap. Google is willing to pay this to be the default search for something like 25% to 30% of phones in the world. This calculation is too simplistic, but there's no reasonable adjustment that could give anyone the impression that the value of being the default is as trivial as claimed by the BE memo. For reference, a $776B tech company would be the 7th most valuable publicly traded U.S. tech company and the 8th most valuable publicly traded U.S. 
company (behind Meta/Facebook and Berkshire Hathaway, but ahead of Eli Lilly). Another reference is that YouTube's ad revenue in 2021 was $28.8B. It would be difficult to argue that spending one YouTube worth of revenue, in profit, in order to retain default status makes sense if, in practice, user switching costs are trivial and defaults don't matter. If we look for publicly available numbers close to 2012 instead of 2021, in 2013, TechCrunch reported a rumor that Google was paying Apple $1B/yr for default search status and a lawsuit then revealed that Google paid Apple $1B for default search status in 2014. This is not long after these memos were written, and $1B/yr is still a non-trivial amount of money; it belies the BE memo's claim that mobile is unimportant and that defaults don't matter because user switching costs are trivial.</p> <p>It's curious, given the heavy emphasis in the BE memo on not trusting plausible reasoning and having to rely on empirical data, that BE staff appeared to make no attempt to find out how much Google was paying for its default status (a memo by a director who agrees with BE staff suggests that someone ought to check on this number, but there's no evidence that this was done and the FTC investigation was dropped shortly afterwards). Given the number of internal documents the FTC was able to obtain, it seems unlikely that the FTC would not have been able to obtain this number from either Apple or Google. But, even if it were the case that the number were unobtainable, it's prima facie implausible that defaults don't matter and switching costs are low in practice, as FTC staff could have seen if they had interviewed product-oriented engineers and PMs or looked at the history of products in tech. So, in order to make this case, BE staff had to ignore or avoid finding out how much Google was paying for default status, not talk to product-focused engineers, PMs, or leadership, and also avoid learning about the tech industry.</p> <p>One could make the case that, while defaults are powerful, companies have been able to overcome being non-default, which could lead to a debate on exactly how powerful defaults are. For example, one might argue about the impact of defaults when Google Chrome became the dominant browser and debate how much of it was due to Chrome simply being a better browser than IE, Opera, and Firefox, how much was due to blunders by Microsoft that Google is unlikely to repeat in search, how much was due to things like <a href="https://x.com/danluu/status/887724695558205440">tricking people into making Chrome default via a bundle deal with badware installers</a>, and how much was due to pressuring people into setting Chrome as default via google.com. That's an interesting discussion where a reasonable person with an understanding of the industry could take either side of the debate, unlike the claim that defaults basically don't matter at all and user switching costs are trivial in practice, which is not plausible even without access to the data on how much Google pays Apple and others to retain default status. And as of the 2020 DoJ case against Google, roughly half of Google searches occur via a default search that Google pays for.</p> <p>Another repeated error, closely related to the one above, is bringing up marketing statements, press releases, or other statements that are generally understood to be exaggerations, and relying on these as if they're meaningful statements of fact. 
For example, the BE memo states:</p> <blockquote> <p>Microsoft's public statements are not consistent with statements made to antitrust regulators. Microsoft CEO Steve Ballmer stated in a press release announcing the search agreement with Yahoo: &quot;This agreement with Yahoo! will provide the scale we need to deliver even more rapid advances in relevancy and usefulness. Microsoft and Yahoo! know there's so much more that search could be. This agreement gives us the scale and resources to create the future of search&quot;</p> </blockquote> <p>This is the kind of marketing pablum that generally accompanies an acquisition or partnership. Because this kind of meaningless statement is common across many industries, one would expect regulators, even ones with no understanding of tech, to recognize this as marketing and not give it as much or more weight as serious evidence.</p> <h2 id="a-few-interesting-tidbits">A few interesting tidbits</h2> <p>Now that we've covered the main classes of errors observed in the memos, we'll look at a tidbits from the memos.</p> <p>Between the approval of the compulsory process on June 3rd 2011 and the publication of the BC memo dated August 8th 2012, staff received 9.5M pages of documents across 2M docs and said they reviewed &quot;many thousands of these documents&quot;, so staff were only able to review a small fraction of the documents.</p> <p>Prior to the FTC investigation, there were a number of lawsuits related to the same issues, and all were dismissed, some with arguments that would, if they were taken as broad precedent, make it difficult for any litigation to succeed. In SearchKing v. Google, plaintiffs alleged that Google unfairly demoted their results but it was ruled that Google's rankings are constitutionally protected opinion and even malicious manipulation of rankings would not expose Google to liability. In Kinderstart v. Google, part of the ruling was that Google search is not <a href="https://en.wikipedia.org/wiki/Essential_facilities_doctrine">an essential facility</a> for vertical providers (such as Yelp, eBay, and Expedia). Since the memos are ultimately about legal proceedings, there is, of course, extensive discussion of Verizon v. Trinko and Aspen Skiing Co. v. Aspen Highlands Skiing Corp and the implications thereof.</p> <p>As of the writing of the BC memo, 96% of Google's $38B in revenue was from ads, mostly from search ads. The BC memo makes the case that other forms of advertising, other than social media ads, only have limited potential for growth. That's certainly wrong in retrospect. For example, video ads are a significant market. <a href="https://web.archive.org/web/20230608081933/https://www.hollywoodreporter.com/business/digital/youtube-ad-revenue-tops-8-6b-beating-netflix-in-the-quarter-1235085391/">YouTube's ad revenue was $28.8B in 2021</a> (a bit more than what Google pays to Apple to retain default search status), Twitch supposedly generated another $2B-$3B in video revenue, and a fair amount of video ad revenue goes directly from sponsors to streamers without passing through YouTube and Twitch, e.g., <a href="https://x.com/danluu/status/1588318512258650112">the #137th largest streamer on Twitch was offered $10M/yr stream online gambling for 30 minutes a day, and he claims that the #42 largest streamer, who he personally knows, was paid $10M/mo from online gambling sponsorships</a>. And this isn't just apparent in retrospect — even at the time, there were strong signs that video would become a major advertising market. 
It happens that those same signs also showed that Google was likely to dominate the market for video ads, but it's still the case that the specific argument here was overstated.</p> <p>In general, the BC memo seems to overstate the expected primacy of search ads as well as how distinct a market search ads are, claiming that other online ad spend is not a substitute in any way and, if anything, is a complement. Although one might be able to reasonably argue that search ads are a somewhat distinct market and the elasticity of substitution is low once you start moving a significant amount of your ad spend away from search, the degree to which the BC memo makes this claim is a stretch. Search ads and other ad budgets being complements and not substitutes is a very different position from what I've heard when talking to people about how ad spend is allocated in practice. Perhaps one can argue that it makes sense to try to make a strong case here in light of Person v. Google, where Judge Fogel of the Northern District of California criticized the plaintiff's market definition, finding no basis for distinguishing &quot;search advertising market&quot; from the larger market for internet advertising, which likely foreshadows an objection that would be raised in any future litigation. However, as someone who's just trying to understand the facts of the matter at hand and the veracity of the arguments, the argument here seems dubious.</p> <p>For Google's integrated products like local search and product search (formerly Froogle), the BC memo claims that if Google treated its own properties like other websites, the products wouldn't be ranked, and that Google artificially placed its own vertical properties above the organic results of competitors. The webspam team declined to include Froogle results because the results were exactly the kind of thing that Google removes from the index for being spammy, saying &quot;[o]ur algorithms specifically look for pages like these to either demote or remove from the index&quot;. Bill Brougher, product manager for web search, said &quot;Generally we like to have the destination pages in the index, not the aggregated pages. So if our local pages are lists of links to other pages, it's more important that we have the other pages in the index&quot;. After the webspam team was overruled and the results were inserted, the ads team complained that the less-clicked (and implied to be lower-quality) results would lead to a loss of $154M/yr. The response to this essentially contained the same content as the BC memo's argument on the importance of scale and why Google's actions to deprive competitors of scale are costly:</p> <blockquote> <p>We face strong competition and must move quickly.
Turning down <abbr title="'OneBoxes' were used to put vertical content above Google's SERP">onebox</abbr> would hamper progress as follows - Ranking: Losing click data harms ranking; [t]riggering: Losing CTR and google.com query distribution data harms triggering accuracy; [c]omprehensiveness: Losing traffic harms merchant growth and therefore comprehensiveness; [m]erchant cooperation: Losing traffic reduces effort merchants put into offer data, tax, &amp; shipping; PR: Turning off onebox reduces Google's credibility in commerce; [u]ser awareness: Losing shopping-related UI on google.com reduces awareness of Google's shopping features</p> </blockquote> <p>Normally, <a href="https://en.wikipedia.org/wiki/Click-through_rate">CTR</a> is used as a strong signal to rank results, but this would've resulted in a low ranking for Google's own vertical properties, so &quot;Google used occurrence of competing vertical websites to automatically boost the ranking of its own vertical properties above that of competitors&quot; — if a comparison shopping site was relevant, Google would insert Google Product Search above any rival, and if a local search site like Yelp or CitySearch was relevant, Google automatically returned Google Local at the top of the <a href="https://en.wikipedia.org/wiki/Search_engine_results_page">SERP</a>.</p>
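<p>As a concrete illustration of the mechanism described above, here's a heavily simplified sketch in Python. This is my own reconstruction of the behavior as the BC memo describes it, not Google's actual ranking code, and all of the names and data structures are invented; the point is just that the relevance of a rival vertical, rather than the usual CTR signal, is what triggers the placement of Google's own property.</p> <pre><code># Hypothetical sketch of the trigger-based self-preferencing described in the
# BC memo. Not real Google code; the mapping and the result format are invented.
RIVAL_TO_GOOGLE_PROPERTY = {
    'nextag.com': 'Google Product Search',
    'yelp.com': 'Google Local',
    'citysearch.com': 'Google Local',
}

def rank_results(results):
    # results: list of (domain, ctr_score) pairs for a query
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    for position, (domain, _) in enumerate(ranked):
        google_property = RIVAL_TO_GOOGLE_PROPERTY.get(domain)
        if google_property is not None:
            # A competing vertical is relevant to this query, so insert
            # Google's own vertical directly above the highest-ranked rival,
            # bypassing the CTR signal that would normally determine placement.
            ranked.insert(position, (google_property, None))
            break
    return ranked
</code></pre>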
<p>Additionally, in order to seed content for Google local results, Google took Yelp content and integrated it into Google Places. When Yelp observed this was happening, they objected to this and Google threatened to ban Yelp from traditional Google search results and further threatened to ban any vertical provider that didn't allow its content to be used in Google Places. Marissa Mayer testified that it was, from a technical standpoint, extraordinarily difficult to remove Yelp from Google Places without also removing Yelp from traditional organic search results. But when Yelp sent a cease and desist letter, Google was able to remove Yelp results immediately, seemingly indicating that it was less difficult than claimed. Google then claimed that it was technically infeasible to remove Yelp from Google Places without removing Yelp from the &quot;local merge&quot; interface on SERP. BC staff believe this claim is false as well, and Marissa Mayer later admitted in a hearing that this claim was false and that Google was concerned about the consequences of allowing sites to opt out of Google Places while staying in &quot;local merge&quot;. There was also a very similar story with Amazon results and product search. As noted above, the BE memo's counterargument to all of this is that Google traffic is &quot;very small and not statistically significant&quot;.</p> <p>The BC memo claims that the activities above both reduced the incentives of companies like Yelp, CitySearch, Amazon, etc., to invest in the area and also reduced the incentives for new companies to form in this area. This seems true. In addition to the evidence presented in the BC memo (which goes beyond what was summarized above), if you just talked to founders looking for an idea or VCs around the time of the FTC investigation, there had already been a real movement away from founding and funding companies like Yelp because it was understood that Google could seriously cripple any similar company in this space by cutting off its air supply.</p> <p>We'll defer to the appendix the BC memo's discussion of the AdWords API restrictions that specifically disallow programmatic porting of campaigns to other platforms, such as Bing. But one interesting bit there is that <a href="https://mastodon.social/@danluu/111243272325539288">Google was apparently aware of the legal sensitivity of this matter, so meeting notes and internal documentation on the topic are unusually incomplete</a>. For one meeting, the most informative written record BC staff were able to find apparently consists of a message from Director of PM Richard Holden to SVP of ads Susan Wojcicki which reads, &quot;We didn't take notes for obvious reasons hence why I'm not elaborating too much here in email but happy to brief you more verbally&quot;.</p> <p>We'll also defer a detailed discussion of the BC memo comments on Google's exclusive and restrictive syndication agreements to the appendix, except for a couple of funny bits. One is that Google claims they were unaware of the terms and conditions in their <abbr title="standard agreements, as opposed to their bespoke negotiated agreements with large partners">standard online service agreements</abbr>. In particular, the terms and conditions contained a &quot;preferred placement&quot; clause, which a number of parties believe is a de facto exclusivity agreement. When FTC staff questioned Google's VP of search services about this term, the VP claimed they were not aware of this term. Afterwards, Google sent a letter to Barbara Blank of the FTC explaining that they were removing the preferred placement clause in the standard online agreement.</p> <p>Another funny bit involves Google's market power and how it allowed them to collect an increasingly large share of revenue for themselves and decrease the revenue share their partners received. Only a small number of Google's customers who were impacted by this found this concerning. Those that did find it concerning were some of the largest and most sophisticated customers (such as Amazon and IAC); their concern was that Google's restrictive and exclusive provisions would increase Google's dominance over Bing/Microsoft and allow them to dictate worse terms to customers. Even as Google was executing a systematic strategy to reduce revenue share to customers, which could only be possible due to their dominance of the market, most customers appeared to either not understand the long-term implications of Google's market power in this area or the importance of the internet.</p> <p>For example, Best Buy didn't find this concerning because Best Buy viewed their website and the web as a way for customers to find presale information before entering a store, and Walmart didn't find this concerning because they viewed the web as an extension to brick and mortar retail. It seems that the same lack of understanding of the importance of the internet which led Walmart and Best Buy to express their lack of concern over Google's dominance here also led to these retailers, which previously had a much stronger position than Amazon, falling greatly behind in both online and overall profit. Walmart later realized its error here and acquired Jet.com for $3.3B in 2016 and also seriously (relative to other retailers) funded programmers to do serious tech work inside Walmart.
Since Walmart started taking the internet seriously, it's made a substantial comeback online and has averaged a 30% <a href="https://en.wikipedia.org/wiki/Compound_annual_growth_rate">CAGR</a> in online net sales since 2018, but taking two decades to mount a serious response to Amazon's online presence has put Walmart solidly behind Amazon in online retail despite nearly a decade of serious investment and Best Buy has still not been able to mount an effective response to Amazon after three decades.</p> <p>The BE memo uses the lack of concern on the part of most customers as evidence that the exclusive and restrictive conditions Google dictated here were not a problem but, in retrospect, it's clear that it was only a lack of understanding of the implications of online business that led customers to be unconcerned here. And when the BE memo refers to the customers who understood the implications here as sophisticated, that's relative to people in lines of business where leadership tended to not understand the internet. While these customers are sophisticated by comparison to a retailer that took two decades to mount a serious response to the threat Amazon poses to their business, if you just talked to people in the tech industry at the time, you wouldn't need to find a particularly sophisticated individual to find someone who understood what was going on. It was generally understood that retail revenue and even moreso, retail profit was going to move online, and you'd have to find someone who was extremely unusually out of the loop to find someone who didn't at least roughly understand the implications here.</p> <p>There's a lengthy discussion on search and scale in both the BC and BE memos. On this topic, the BE memo seems wrong and the implications of the BC memo are, if not subtle, at least not obvious. Let's start with the BE memo because that one's simpler to discuss, although we'll very briefly discuss the argument in the BC memo in order to frame the discussion in the BE memo. A rough sketch of the argument in the BC memo is that there are multiple markets (search, ads) where scale has a significant impact on product quality. Google's own documents acknowledge this &quot;virtuous cycle&quot; where having more users lets you serve better ads, which gives you better revenue for ads and, likewise in search, having more scale gives you more data which can be used to improve results, which leads to user growth. And for search in particular, the BC memo claims that click data from users is of high importance and that more data allows for better results.</p> <p>The BE memo claims that this is not really the case. On the importance of click data, the BE memo raises two large objections. First, that this is &quot;contrary to the history of the general search market&quot; and second, that &quot;it is also contrary to the evidence that factors such as the quality of the web crawler and web index; quality of the search algorithm; and the type of content included in the search results [are as important or more important].</p> <p>Of the first argument, the BE memo elaborates with a case that's roughly &quot;Google used to be smaller than it is today, and the click data at the time was sufficient, therefore being as large as Google used to be means that you have sufficient click data&quot;. Independent of knowledge of the tech industry, this seems like a strange line of reasoning. 
&quot;We now produce a product that's 1/3 as good as our competitor for the same price, but that should be fine because our competitor previously produced a product that's 1/3 as good as their current product when the market was less mature and no one was producing a better product&quot; is generally not going to be a winning move. That's especially true in markets where there's a virtuous cycle between market share and product quality, like in search.</p> <p>The second argument also seems like a strange one to make even without knowledge of the tech industry, in that it's a classic fallacious argument. It's analogous to saying something like &quot;the BC memo claims that it's important for cars to have a right front tire, but that's contrary to evidence that it's at least as important for a car to have a left front tire and a right rear tire&quot;. The argument is even less plausible if you understand tech, especially search. Calling out the quality of the search algorithm as distinct doesn't feel quite right because scale and click data directly feed into algorithm development (and this is discussed at some length in the BE memo — the authors of the BC memo surely had access to the same information and, from their writing, seem to have had access to the argument). And as someone who's <a href="bitfunnel-sigir.pdf">worked on search indexing</a>, as much as I'd like to agree with the BE memo and say that indexing is as important or more important than ranking, I have to admit that indexing is an easier and less important problem than ranking, and likewise for crawling vs. ranking. This was generally understood at the time so, given the number of interviews FTC staff did, the authors of the BE memo should've known this as well. Moreover, given the &quot;history of the general search market&quot; which the BE memo refers to, even without talking to engineers, this should've been apparent.</p> <p>For example, Cuil was famous for building a larger index than Google. While that's not a trivial endeavor, at the time, quite a few people had the expertise to build an index that rivaled Google's index in raw size or whatever other indexing metric you prefer, if given enough funding for a serious infra startup. Cuil and other index-focused attempts failed because having a large index without good search ranking is worth little. While it's technically true that having good ranking with a poor index is also worth little, this is not something we've really seen in practice because ranking is the much harder problem and a company that's competent to build a good search ranker will, as a matter of course, have a good enough index and good enough crawling.</p> <p>As for the case in the BC memo, I don't know what the implications should be. The BC memo correctly points out that increased scale greatly improves search quality, that the extra data Bing got from the Yahoo deal greatly increased search quality and increased CTR, that further increased scale should be expected to continue to provide high return, that the costs of creating a competitor to Google are high (Bing was said to be losing $2B/yr at the time and was said to be spending $4.5B/yr &quot;developing its algorithms and building the physical capacity necessary to operate Bing&quot;), and that Google undertook actions that might be deemed anticompetitive which disadvantaged Bing compared to the counterfactual world where Google did not take those actions, and they make a similar case for ads.
However, despite the strength of the stated BC memo case and the incorrectness of the stated BE memo case, the BE memo's case is correct in spirit, in that there are actions Microsoft could've taken but did not in order to compete much more effectively in search, and one could argue that the FTC shouldn't be in the business of rescuing a company from competing ineffectively.</p> <p>Personally, I don't think it's too interesting to discuss the position of the BC memo vs. the BE memo at length because the positions the BE memo takes seem extremely weak. It's not fair to call it a straw man because it's a real position, and one that carried the day at the FTC, but the decision to take action or not seemed more about philosophy than the arguments in the memos. But we can discuss what else might've been done.</p> <h2 id="what-might-ve-happened">What might've happened</h2> <p>What happened after the FTC declined to pursue antitrust action was that Microsoft effectively defunded Bing as a serious bet, taking resources that could've gone to continuing to fund a very expensive fight against Google, and moving them to other bets that it deemed to be higher ROI. The big bets Microsoft pursued were Azure, Office, and HoloLens (and arguably Xbox). HoloLens was a pie-in-the-sky bet, but Azure and Office were lines of business where, instead of fighting an uphill battle against a competitor that can use its dominance in related markets to push others around, Microsoft could fight downhill battles where it can use its own dominance in related markets to push around competitors, resulting in a much higher return per dollar invested. As someone who worked on Bing and thought that Bing had the potential to seriously compete with Google given sustained, unprofitable, heavy investment, I find that disappointing but also likely the correct business decision. If you look at any particular submarket, like Teams vs. Slack, the Microsoft product doesn't need to be nearly as good as the competing product to take over the market, which is the opposite of the case in search, where Google's ability to push competitors around means that Bing would have to be much better than Google to attain marketshare parity.</p> <p>Based on their public statements, Biden's DoJ Antitrust AAG appointee, Jonathan Kanter, would argue for pursuing antitrust action under the circumstances, as would Biden's FTC commissioner and chair appointee Lina Khan. Prior to her appointment as FTC commissioner and chair, Khan was probably best known for writing <a href="https://www.yalelawjournal.org/note/amazons-antitrust-paradox">Amazon's Antitrust Paradox</a>, which has been influential as well as controversial. Obama appointees, who more frequently agreed with the kind of reasoning from the BE memo, would have argued against antitrust action, and the investigation under discussion was stopped on their watch. More broadly, they argued against the philosophy driving Kanter and Khan. Obama's FTC Commissioner appointee, <a href="https://en.wikipedia.org/wiki/Joshua_D._Wright">GMU economist and legal scholar Josh Wright</a>, actually wrote a rebuttal titled &quot;Requiem for a Paradox: The Dubious Rise and Inevitable Fall of Hipster Antitrust&quot;, a scathing critique of Khan's position.</p> <p>If, in 2012, the FTC and DoJ were run by Biden appointees instead of Obama appointees, what difference would that have made?
We can only speculate, but one possibility would be that they would've taken action and then lost, as happened with the recent cases against Meta and Microsoft, which seem like they would not have been undertaken under an Obama FTC and DoJ. Under Biden appointees, there's been much more vigorous use of the laws that are on the books (the Sherman Act, the Clayton Act, the FTC Act, the Robinson–Patman Act, as well as &quot;smaller&quot; antitrust laws), but the opinion of the courts hasn't changed under Biden and this has led to a number of unsuccessful antitrust cases in tech. Both the BE and BC memos dedicate significant space to whether or not a particular line of reasoning will hold up in court. Biden's appointees are much less concerned with this than previous appointees, and multiple people in the DoJ and the FTC are on the record saying things like &quot;it is our duty to enforce the law&quot;, meaning that when they see violations of the antitrust laws that were put into place by elected officials, it's their job to pursue these violations even if courts may not agree with the law.</p> <p>Another possibility is that there would've been some action, but the action would've been in line with most corporate penalties we see: something like a small fine that costs the company an insignificant fraction of the marginal profit they made from their actions, or some kind of consent decree (basically a cease and desist), where the company would be required to stop doing specific things while keeping its marketshare, i.e., keeping the main thing it wanted to gain, a massive advantage in a market dominated by network effects. Perhaps there would be a few more meetings where &quot;[w]e didn't take notes for obvious reasons&quot; to work around the new limitations, and business as usual would continue. Given the specific allegations in the FTC memos and the attitudes of the courts at the time, my guess is that something like this second set of possibilities would've been the most likely outcome had the FTC proceeded with their antitrust investigation instead of dropping it, some kind of nominal victory that makes little to no difference in practice. Given how long it takes for these cases to play out, it's overwhelmingly likely that Microsoft would've already scaled back its investment in Bing and moved Bing from a subsidized bet it was trying to grow to a profitable business it wanted to keep by the time any decision was made. There are a number of cases that were brought by other countries which had remedies that were in line with what we might've expected if the FTC investigation continued. On Google using market power in mobile to push software Google wants onto nearly all Android phones, an EU case was nominally successful but made little to no difference in practice. Cristina Caffarra of the Centre for Economic Policy Research characterized this as</p> <blockquote> <p>Europe has failed to drive change on the ground. Why? Because we told them, don't do it again, bad dog, don't do it again. But in fact, they all went and said 'ok, ok', and then went out, ran back from the back door and did it again, because they're smarter than the regulator, right? And that's what happens.</p> <p>So, on the tying case, in Android, the issue was, don't tie again so they say, &quot;ok, we don't tie&quot;. Now we got a new system. If you want Google Play Store, you pay $100. But if you want to put search in every entry point, you get a discount of $100 ...
the remedy failed, and everyone else says, &quot;oh, that's a nice way to think about it, very clever&quot;</p> </blockquote> <p>Another pair of related cases are Yandex's Russian case on mobile search defaults and a later EU consent decree. In 2015, Yandex brought a suit about mobile default status on Android in Russia, which was settled by adding a &quot;choice screen&quot; which has users pick their search engine without preferencing a default. This immediately caused Yandex to start gaining marketshare on Google <a href="https://gs.statcounter.com/search-engine-market-share/all/russian-federation/#monthly-200901-202404">and Yandex eventually surpassed Google in marketshare in Russia according to statcounter</a>. In 2018, the EU required a similar choice screen in Europe, <a href="https://gs.statcounter.com/search-engine-market-share/all/europe/#monthly-201501-202404">which didn't make much of a difference</a>, except maybe sort of in the Czech republic. There are a number of differences between the situation in Russia and in the EU. One, arguably the most important, is that when Yandex brought the case against Google in Russia, Yandex was still fairly competitive, with marketshare in the high 30% range. At the time of the EU decision in 2018, Bing was the #2 search engine in Europe, with about 3.6% marketshare. Giving consumers a choice when one search engine completely dominates the market can be expected to have fairly little impact. One argument the BE memo heavily relies on is the idea that, if we intervene in any way, that could have bad effects down the line, so we should be very careful and probably not do anything, just in case. But in these winner-take-most markets with such strong network effects, there's a relatively small window in which you can cheaply intervene. Perhaps, and this is highly speculative, if the FTC required a choice screen in 2012, Bing would've continued to invest enough to at least maintain its marketshare against Google.</p> <p>For verticals, in shopping, the EU required some changes to how Google presents results in 2017. This appears to have had little to no impact, being both perhaps 5-10 years too late and also a trivial change that wouldn't have made much difference even if enacted a decade earlier. The 2017 ruling came out of a case that started in 2010, and in the 7 years it took to take action, Google managed to outcompete its vertical competitors, making them barely relevant at best.</p> <p>Another place we could look is at the Microsoft antitrust trial. That's a long story, at least as long as this document, but to very briefly summarize, in 1990, the FTC started an investigation over Microsoft's allegedly anticompetitive conduct. A vote to continue the investigation ended up in a 2-2 tie, causing the investigation to be closed. The DoJ then did its own investigation, which led to a consent decree that was generally considered to not be too effective. There was then a 1998 suit by the DoJ about Microsoft's use of monopoly power in the browser market, which initially led to a decision to break Microsoft up. But, on appeal, the breakup was overturned, which led to a settlement in 2002. A major component of the 1998 case was about browser bundling and Microsoft's attack on Netscape. By the time the case was settled, in 2002, Netscape was effectively dead. 
The parts of the settlements having to do with interoperability were widely regarded as ineffective at the time, not only because Netscape was dead, but because they weren't going to be generally useful. A number of economists took the same position as the BE memo, that no intervention should've happened at the time and that any intervention is dangerous and could lead to a fettering of innovation. Nobel Prize winning economist Milton Friedman wrote a Cato Policy Forum essay titled &quot;The Business Community's Suicidal Impulse&quot;, predicting that tech companies calling for antitrust action against Microsoft were committing suicide, and that a critical threshold had been passed and that this would lead to the bureaucratization of Silicon Valley</p> <blockquote> <p>When I started in this business, as a believer in competition, I was a great supporter of antitrust laws; I thought enforcing them was one of the few desirable things that the government could do to promote more competition. But as I watched what actually happened, I saw that, instead of promoting competition, antitrust laws tended to do exactly the opposite, because they tended, like so many government activities, to be taken over by the people they were supposed to regulate and control. And so over time I have gradually come to the conclusion that antitrust laws do far more harm than good and that we would be better off if we didn’t have them at all, if we could get rid of them. But we do have them.</p> <p>Under the circumstances, given that we do have antitrust laws, is it really in the self-interest of Silicon Valley to set the government on Microsoft? ... you will rue the day when you called in the government. From now on the computer industry, which has been very fortunate in that it has been relatively free of government intrusion, will experience a continuous increase in government regulation. Antitrust very quickly becomes regulation. Here again is a case that seems to me to illustrate the suicidal impulse of the business community.</p> </blockquote> <p>In retrospect, we can see that this wasn't correct and, if anything, was the opposite of correct. On the idea that even attempting antirust action against Microsoft would lead to an inevitable increase in government intervention, we saw the opposite, a two-decade long period of relatively light regulation and antitrust activity. And in terms of the impacts on innovation, although the case against Microsoft was too little and too late to save Netscape, Google's success appears to be causally linked to the antitrust trial. At one point, in the early days of Google, when Google had no market power and Microsoft effectively controlled how people access the internet, Microsoft internally discussed proposals aimed at killing Google. One proposal involved redirecting users who tried to navigate to Google to Bing (at the time, called MSN Search, and of course this was before Chrome existed and IE dominated the browser market). Another idea was to put up a big scary warning that warned users that Google was dangerous, much like the malware warnings browsers have today. Gene Burrus, a lawyer for Microsoft at the time, stated that Microsoft chose not to attempt to stop users from navigating to google.com due to concerns about further antitrust action after they'd been through nearly a decade of serious antitrust scrutiny. 
People at both Google and Microsoft who were interviewed about this both believe that Microsoft would've killed Google had they done this so, in retrospect, we can see that Milton Friedman was wrong about the impacts of the Microsoft antitrust investigations and that one can make the case that it's only because of the antitrust investigations that web 1.0 companies like Google and Facebook were able to survive, let alone flourish.</p> <p>Another possibility is that a significant antitrust action would've been undertaken, been successful, and been successful quickly enough to matter. It's possible that, by itself, a remedy wouldn't have changed the equation for Bing vs. Google, but if a reasonable remedy was found and enacted, it still could've been in time to keep Yelp and other vertical sites as serious concerns and maybe even spur more vertical startups. And in the hypothetical universe where people with the same philosophy as Biden's appointees were running the FTC and the DoJ, we might've also seen antitrust action against Microsoft in markets where they can leverage their dominance in adjacent markets, making Bing a more appealing area for continued heavy investment. Perhaps that would've resulted in Bing being competitive with Google and the aforementioned concerns that &quot;sophisticated customers&quot; like Amazon and IAC had may not have come to pass. With antitrust against Microsoft and other large companies that can use their dominance to push competitors around, perhaps Slack would still be an independent product and we'd see more startups in enterprise tools (<a href="https://x.com/carnage4life/status/1774800837598216383">a number of commenters believe that Slack was basically forced into being acquired because it's too difficult to compete with Teams given Microsoft's dominance in related markets</a>). And Slack continuing to exist and innovate is small potatoes — the larger hypothetical impact would be all of the new startups and products that would be created that no one even bothers to attempt because they're concerned that a behemoth with an integrated bundle like Microsoft would crush their standalone product. If you add up all of these, if not best-case, at least very-good-case outcomes for antitrust advocates, one could argue that consumers and businesses would be better off. But, realistically, it's hard to see how this very-good-case set of outcomes could have come to pass.</p> <p>Coming back to the FTC memo, if we think about what it would take to put together a set of antitrust actions that actually fosters real competition, that seems extraordinarily difficult. A number of the more straightforward and plausible sounding solutions are off the table for political reasons, due to legal precedent, or due to arguments like the Boies argument we referenced or some of the arguments in the BE memo that are clearly incorrect, but appear to be convincing to very important people.</p> <p>For the solutions that seem to be on the table, weighing the harms caused by them is non-trivial. For example, let's say the FTC mandated a mobile and desktop choice screen in 2012. This would've killed Mozilla in fairly short order unless Mozilla completely changed its business model because Mozilla basically relies on payments from Google for default status to survive. 
We've seen with Opera that even when you have a superior browser that introduces features that other browsers later copy, which has better performance than other browsers, etc., you can't really compete with free browsers when you have a paid browser. So then we would've quickly been down to IE/Edge and Chrome. And in terms of browser engines, just Chrome after not too long as Edge is now running Chrome under the hood. Maybe we can come up with another remedy that allows for browser competition as well, but the BE memo isn't wrong to note that antitrust remedies can cause other harms.</p> <p>Another example which highlights the difficulty of crafting a politically suitable remedy are the restrictions the Bundeskartellamt imposed against Facebook, which have to do with user privacy and use of data (for personalization, ranking, general ML training, etc.), which is considered an antitrust issue in Germany. Michal Gal, Professor and Director of the Forum on Law and Markets at the University of Haifa pointed out that, of course Facebook, in response to the rulings, is careful to only limit its use of data if Facebook detects that you're German. If the concern is that ML models are trained on user data, this doesn't do much to impair Facebook's capability. Hypothetically, if Germany had a tech scene that was competitive with American tech and German companies were concerned about a similar ruling being leveled against them, this would be disadvantageous to nascent German companies that initially focus on the German market before expanding internationally. For Germany, this is only a theoretical concern as, other than SAP, no German company has even approached the size and scope of large American tech companies. But when looking at American remedies and American regulation, this isn't a theoretical concern, and some lawmakers will want to weigh the protection of American consumers against the drag imposed on American firms when compared to Korean, Chinese, and other foreign firms that can grow in local markets with fewer privacy concerns before expanding to international markets. This concern, if taken seriously, could be used to argue against nearly any pro-antitrust action argument.</p> <h2 id="what-can-we-do-going-forward">What can we do going forward?</h2> <p>This document is already long enough, so we'll defer a detailed discussion of policy specifics for another time, but in terms of high-level actions, one thing that seems like it would be helpful is to have tech people intimately involved in crafting remedies and regulation as well as during investigations<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">2</a></sup>. From the directors memos on the 2011-2021 FTC investigation that are publicly available, it would appear this was not done because the arguments from the BE memos that wouldn't pass the sniff test for a tech person appear to have been taken seriously. Another example is the one EU remedy that Cristina Caffara noted was immediately worked around by Google, in a way that many people in tech would find to be a delightful &quot;hack&quot;.</p> <p>There's a long history of this kind of &quot;hacking the system&quot; being lauded in tech going back to before anyone called it &quot;tech&quot; and it was just physics and electrical engineering. 
To pick a more recent example, one of the reasons Sam Altman become President of Y Combinator, which eventually led to him becoming CEO of Open AI was that Paul Graham admired his ability to hack systems; in his 2010 essay on founders, under the section titled &quot;Naughtiness&quot;, Paul wrote:</p> <blockquote> <p>Though the most successful founders are usually good people, they tend to have a piratical gleam in their eye. They're not Goody Two-Shoes type good. Morally, they care about getting the big questions right, but not about observing proprieties. That's why I'd use the word naughty rather than evil. They delight in breaking rules, but not rules that matter. This quality may be redundant though; it may be implied by imagination.</p> <p>Sam Altman of Loopt is one of the most successful alumni, so we asked him what question we could put on the Y Combinator application that would help us discover more people like him. He said to ask about a time when they'd hacked something to their advantage—hacked in the sense of beating the system, not breaking into computers. It has become one of the questions we pay most attention to when judging applications.</p> </blockquote> <p>Or, to pick one of countless examples from Google, in order to reduce travel costs at Google, Google engineers implemented a system where they computed some kind of baseline &quot;expected cost for flights, and then gave people a credit for taking flights that came in under the baseline costs that could be used to upgrade future flights and travel accommodations. This was a nice experience for employees compared to what stodgier companies were doing in terms of expense limits and Google engineers were proud of creating a system that made things better for everyone, which was one kind of hacking the system. The next level of hacking the system was when some employees optimized their flights and even set up trips to locations that were highly optimizable (many engineers would consider this a fun challenge, a variant of classic dynamic programming problems that are given in interviews, etc.), allowing them to upgrade to first class flights and the nicest hotels.</p> <p>When I've talked about this with people in management in traditional industries, they've frequently been horrified and can't believe that these employees weren't censured or even fired for cheating the system. But when I was at Google, people generally found this to be admirable, as it exemplified the hacker spirit.</p> <p>We can see, from the history of antitrust in tech going back at least two decades, that courts, regulators, and legislators have not been prepared for the vigor, speed, and delight with which tech companies hack the system.</p> <p>And there's precedent for bringing in tech folks to work on the other side of the table. For example, this was done in the big Microsoft antitrust case. But there are incentive issues that make this difficult at every level that stem from, among other things, the sheer amount of money that tech companies are willing to pay out. If I think about tech folks I know who are very good at the kind of hacking the system described here, the ones who want to be employed at big companies frequently make seven figures (or more) annually, a sum not likely to be rivaled by an individual consulting contract with the DoJ or FTC. 
If we look at the example of Microsoft again, the tech group that was involved was managed by Ron Schnell, who was taking a break from working after his third exit, but people like that are relatively few and far between. Of course there are people who don't want to work at big companies for a variety of reasons, often moral reasons or a dislike of big company corporate politics, but most people I know who fit that description haven't spent enough time at big companies to really understand the mechanics of how big companies operate and are the wrong people for this job even if they're great engineers and great hackers.</p> <p><abbr title="Sorry, I didn't take notes on this and can't recall specifically who said this or even which conference, although I believe it was at the University of Chicago">At an antitrust conference a while back</abbr>, a speaker noted that the mixing and collaboration between the legal and economics communities was a great boon for antitrust work. Notably absent from the speech as well as the conference were practitioners from industry. The conference had the feel of an academic conference, so you might see CS academics at the conference some day, but even if that were to happen, many of the policy-level discussions are ones that are outside the area of interest of CS academics. For example, one of the arguments from the BE memo that we noted as implausible was the way they used MAU to basically argue that switching costs were low. That's something outside the area of research of almost every CS academic, so even if the conference were to expand and bring in folks who work closely with tech, the natural attendees would still not be the right people to weigh in on the topic when it comes to the plausibility of nitty gritty details.</p> <p>Besides the aforementioned impact on policy discussions, the lack of collaboration with tech folks also meant that, when people spoke about the motives of actors, they would often make assumptions that were unwarranted. On one specific example of what someone might call a hack of the system, the speaker described an exec's reaction (high-fives, etc.), and inferred a contempt for lawmakers and the law that was not in evidence. It's possible the exec in question does, in fact, have a contempt and disdain for lawmakers and the law, but that celebration is exactly what you might've seen after someone at Google figured out how to get upgraded to first class &quot;for free&quot; on almost all their flights by hacking the system at Google, which wouldn't indicate contempt or disdain at all.</p> <p>Coming back to the incentive problem, it goes beyond getting people who understand tech on the other side of the table in antitrust discussions. If you ask Capitol Hill staffers who were around at the time, the general belief is that the primary factor that scuttled the FTC investigation was Google's lobbying, and of course <a href="https://x.com/danluu/status/1254720673592688641">Google and other large tech companies spend more on lobbying</a> than entities that are interested in increased antitrust scrutiny.</p> <p>And in the civil service, if we look at the lead of the BC investigation and the first author on the BC memo, they're now Director and Associate General Counsel of Competition and Regulatory Affairs at Facebook. I don't know them, so I can't speak to their motivations, but if I were offered as much money as I expect they make to work on antitrust and other regulatory issues at Facebook, I'd probably take the offer. 
Even putting aside the pay, if I were a strong believer in the goals of increased antitrust enforcement, that would still be a very compelling offer. Working for the FTC, maybe you lead another investigation where you write a memo that's much stronger than the opposition memo, which doesn't matter when a big tech company pours more lobbying money into D.C. and the investigation is closed. Or maybe your investigation leads to an outcome like the EU investigation that led to a &quot;choice screen&quot; that was too little and far too late. Or maybe it leads to something like the Android Play Store untying case where, seven years after the investigation was started, an enterprising Google employee figures out a &quot;hack&quot; that makes the consent decree useless in about five minutes. At least inside Facebook, you can nudge the company towards what you think is right and have some impact on how Facebook treats consumers and competitors.</p> <p>Looking at it from the standpoint of people in tech (as opposed to people working in antitrust), in my extended social circles, it's common to hear people say &quot;I'd never work at company X for <a href="https://www.patreon.com/posts/103707012">moral reasons</a>&quot;. That's a fine position to take, but almost everyone I know who does this ends up working at a much smaller company that has almost no impact on the world. If you want to take a moral stand, you're more likely to make a difference by working from the inside or finding a smaller direct competitor and helping it become more successful.</p> <p><i>Thanks to Laurence Tratt, Yossi Kreinin, Justin Hong, kouhai@treehouse.systems, Sophia Wisdom, @cursv@ioc.exchange, @quanticle@mastodon.social, and Misha Yagudin for comments/corrections/discussion</i></p> <h1 id="appendix-non-statements">Appendix: non-statements</h1> <p>This is analogous to the &quot;non-goals&quot; section of a technical design doc, but weaker, in that a non-goal in a design doc is often a positive statement that implies something that couldn't be implied from reading the doc, whereas the non-goal <abbr title="statements only, not explanations">statements</abbr> themselves don't add any information.</p> <ul> <li>Antitrust action against Google should have been pursued in 2012 <ul> <li>Not that anyone should care what my opinion is, but if you'd asked me at the time if antitrust action should be pursued, I would've said &quot;probably not&quot;. The case for antitrust action seems stronger now and the case against seems weaker, but you could still mount a fairly strong argument against antitrust action today.</li> <li>Even if you believe that, ceteris paribus, antitrust action would've been good for consumers and the &quot;very good case&quot; outcome in &quot;what might've happened&quot; would occur if antitrust action were pursued, it's still not obvious that Google and other tech companies are the right target as opposed to (just for example) Visa and Mastercard's dominance of payments, hospital mergers leading to increased concentration that's had negative impacts on both consumers and workers, Ticketmaster's dominance, etc. Or perhaps you think the government should focus on areas where regulation specifically protects firms, such as in shipping (which is exempt from the Sherman Act) or car dealerships (which have special protections in the law in many U.S.
states that prevent direct sales and compel car companies to abide by their demands in certain ways), etc.</li> </ul></li> <li>Weaker or stronger antitrust measures should be taken today <ul> <li>I don't think I've spent enough time reading up on the legal, political, historical, and philosophical background to have an opinion on what should be done, but I know enough about tech to point out a few errors that I've seen and to call out common themes in these errors.</li> </ul></li> </ul> <h1 id="bc-staff-memo">BC Staff Memo</h1> <p>By &quot;Barbara R. Blank, Gustav P. Chiarello, Melissa Westman-Cherry, Matthew Accornero, Jennifer Nagle, Anticompetitive Practices Division; James Rhilinger, Healthcare Division; James Frost, Office of Policy and Coordination; Priya B. Viswanath, Office of the Director; Stuart Hirschfeld, Danica Noble, Northwest Region; Thomas Dahdouh, Western Region-San Francisco, Attorneys; Daniel Gross, Robert Hilliard, Catherine McNally, Cristobal Ramon, Sarah Sajewski, Brian Stone, Honors Paralegals; Stephanie Langley, Investigator&quot;</p> <p>Dated August 8, 2012</p> <h2 id="executive-summary">Executive Summary</h2> <ul> <li>Google is dominant search engine and seller of search ads</li> <li>This memo addresses 4 of 5 areas with anticompetitive conduct; mobile is in a supplemental memo</li> <li>Google has monopoly power in the U.S. in Horizontal Search; Search Advertising; and Syndicated Search and Search Advertising</li> <li>On the question of whether Google has unlawfully preferenced its own content while demoting rivals, we do not recommend the FTC proceed; it's a close call and case law is not favorable to anticompetitive product design and Google's efficiency justifications are strong and at there's some benefit to users</li> <li>On whether Google has unlawfully scraped content from vertical rivals to improve their own vertical products, recommending condemning as a conditional refusal to deal under Section 2 <ul> <li>Prior voluntary dealing was mutually beneficial</li> <li>Threats to remove rival content from general search designed to coerce rivals into allowing Google to user their content for Google's vertical product</li> <li>Natural and probable effect is to diminish incentives of vertical website R&amp;D</li> </ul></li> <li>On anticompetitive contractual restrictions on automated cross-management of ad campaigns, restrictions should be condemned under Section 2 <ul> <li>They limit ability of advertisers to make use of their own data, reducing innovation and increasing transaction costs for advertisers and third-party businesses</li> <li>Also degrade the quality of Google's rivals in search and search advertising</li> <li>Google's efficiency justifications appears to be pretextual</li> </ul></li> <li>On anticompetitive exclusionary agreements with websites for syndicated search and search ads, Google should be condemned under Section 2 <ul> <li>Only modest anticompetitive effects on publishers, but deny scale to competitors, competitively significant to main rival (Bing) as well as significant barrier to entry in longer term</li> <li>Google's efficiency justifications are, on balance, non-persuasive</li> </ul></li> <li>Possible remedies <ul> <li>Scraping <ul> <li>Could be required to provide an opt-out for snippets (reviews, ratings) from Google's vertical properties while retaining snippets in web search and/or Universal Search on main search results page</li> <li>Could be required to limit use of content indexed from web search results</li> </ul></li> 
<li>Campaign management restrictions <ul> <li>Could be required to remove problematic contractual restrictions from license agreements</li> </ul></li> <li>Exclusionary syndication agreements <ul> <li>Could be enjoined from entering into exclusive search agreements with search syndication partners and required to loosen restrictions surrounding syndication partners' use of rival search ads</li> </ul></li> </ul></li> <li>There are a number of risks to the case, not named in the summary except that Google can argue that Microsoft's most efficient distribution channel is bing.com and that any scale MS might gain will be immaterial to Bing's competitive position</li> <li>Staff concludes Google's conduct has resulted and will result in real harm to consumers and to innovation in online search and ads.</li> </ul> <h2 id="i-history-of-the-investigation-and-related-proceedings">I. HISTORY OF THE INVESTIGATION AND RELATED PROCEEDINGS</h2> <h3 id="a-ftc-investigation">A. FTC INVESTIGATION</h3> <ul> <li>Compulsory process approved on June 3, 2011</li> <li>Received over 2M docs (9.5M pages) &quot;and have reviewed many thousands of those documents&quot;</li> <li>Reviewed documents produced to DoJ in Google-Yahoo (2008) and ITA (2010) investigations and documents produced in response to European Commission and U.S. State investigations</li> <li>Interviewed dozens of parties including vertical competitors in travel, local, finance, and retail; U.S. advertisers and ad agencies; Google U.S. syndication and distribution partners; mobile device manufacturers and wireless carriers</li> <li>17 investigational hearings of Google execs &amp; employees</li> </ul> <h3 id="b-european-commission-investigation">B. EUROPEAN COMMISSION INVESTIGATION</h3> <ul> <li>Parallel investigation since November 2010</li> <li>May 21, 2012: Commissioner Joaquin Almunia issued letter signaling EC's possible intent to issue Statement of Objections for abuse of dominance in violation of Article 102 of EC Treaty <ul> <li>Concerns <ul> <li>&quot;favourable treatment of its own vertical search services as compared to those of its competitors in its natural search results&quot;</li> <li>&quot;practice of copying third party content&quot; to supplement own vertical content</li> <li>&quot;exclusivity agreements with publishers for the provision of search advertising intermediation services&quot;</li> <li>&quot;restrictions with regard to the portability and cross-platform management of online advertising campaigns&quot;</li> </ul></li> <li>offered opportunity to resolve concerns prior to issuance of SO by producing description of solutions</li> <li>Google denied infringement of EU law, but proposed several commitments to address stated concerns</li> </ul></li> <li>FTC staff coordinated with EC staff</li> </ul> <h3 id="c-multi-state-investigation">C. MULTI-STATE INVESTIGATION</h3> <ul> <li>Texas investigating since June 2010, leader of multi-state working group</li> <li>FTC working closely with states</li> </ul> <h3 id="d-private-litigation">D. PRIVATE LITIGATION</h3> <ul> <li>Several private lawsuits related to issues in our investigation; all dismissed</li> <li>Two categories: manipulation of search rankings and increases in minimum prices for AdWords search ads</li> <li>In Kinderstart.com LLC v. Google, Inc. and SearchKing, Inc. v.
Google Tech., Inc., plaintiffs alleged that Google unfairly demoted their results <ul> <li>SearchKing court ruled that Google's rankings are constitutionally protected opinion; even malicious manipulation of rankings would not expose Google to tort liability</li> <li>Kinderstart court rejected Google search being an essential facility for vertical websites</li> </ul></li> <li>In AdWords cases, plaintiffs argue that Google increased minimum bids for keywords they'd purchased, making those keywords effectively unavailable, depriving plaintiff websites of traffic <ul> <li>TradeComet.com, LLC v. Google, Inc. dismissed for improper venue and Google, Inc. v. myTriggers.com, Inc. dismissed for failing to describe harm to competition as a whole <ul> <li>both dismissed with little discussion of merits</li> </ul></li> <li>Person v. Google, Inc.: Judge Fogel of the Northern District of California criticized plaintiff's market definition, finding no basis for distinguishing &quot;search advertising market&quot; from larger market for internet advertising</li> </ul></li> </ul> <h2 id="ii-statement-of-facts">II. STATEMENT OF FACTS</h2> <h3 id="a-the-parties">A. THE PARTIES</h3> <h4 id="1-google">1. Google</h4> <ul> <li>Products include &quot;horizontal&quot; search engine and integrated &quot;vertical&quot; websites that focus on specific areas (product or shopping comparisons, maps, finance, books, video), search advertising via AdWords, search and search advertising syndication through AdSense, computer and software applications such as Google Toolbar, Gmail, Chrome; also have Android for mobile and applications for mobile devices and recently acquired Motorola Mobility</li> <li>32k people, $38B annual revenue</li> </ul> <h4 id="2-general-search-competitors">2. General search competitors</h4> <h5 id="a-microsoft">a. Microsoft</h5> <ul> <li>MSN search released in 1998, rebranded Bing in 2009. Filed complaints against Google in 2011 with FTC and EC</li> </ul> <h5 id="b-yahoo">b. Yahoo</h5> <ul> <li>Partnership with Bing since 2010; Bing provides search results and parties jointly operate a search ad network</li> </ul> <h4 id="3-major-vertical-competition">3.
Major Vertical Competition</h4> <ul> <li>In general, these companies complain that Google's practice of preferencing its own vertical results has negatively impacted their ability to compete for users and advertisers</li> <li>Amazon <ul> <li>Product search directly competes with Google Product Search</li> </ul></li> <li>eBay <ul> <li>product search competes with Google Product Search</li> </ul></li> <li>NexTag <ul> <li>shopping comparison website that competes with Google Product Search</li> </ul></li> <li>Foundem <ul> <li>UK product comparison website that competes with Google Product Search</li> <li>Complaint to EC, among others, prompted EC to open its investigation into Google's web search practices</li> <li>First vertical website to publicly accuse Google of preferencing its own vertical content over competitors on Google's search page</li> </ul></li> <li>Expedia <ul> <li>competes against Google's fledgling Google Flight Search</li> </ul></li> <li>TripAdvisor <ul> <li>TripAdvisor competes with Google Local (formerly Google Places)</li> <li>has complained that Google has appropriated / scraped its user-generated reviews, placing them on Google's own local property</li> </ul></li> <li>Yelp <ul> <li>has complained that Google has appropriated / scraped its user-generated reviews, placing them on Google's own local property</li> </ul></li> <li>Facebook <ul> <li>Competes with Google's recently introduced Google Plus</li> <li>has complained that Google's preferencing of Google Plus results over Facebook results is negatively impacting its ability to compete for users</li> </ul></li> </ul> <h3 id="b-industry-background">B. INDUSTRY BACKGROUND</h3> <h4 id="1-general-search">1. General Search</h4> <ul> <li>[nice description of search engines for lay people omitted]</li> </ul> <h4 id="2-online-advertising">2. Online Advertising</h4> <ul> <li>Google's core business is ads; 96% of its nearly $38B in revenue was from ad sales</li> <li>[lots of explanations of ad industry for lay people, mostly omitted]</li> <li>Reasons advertisers have shifted business to the web include the high degree of tracking possible and quantifiable, superior ROI</li> <li>Search ads make up most of online ad spend, primarily because advertisers believe search ads provide the best precision in IDing customers, measurability, and the highest ROI</li> <li>Online advertising continues to evolve, with new offerings that aren't traditional display or search ads, such as contextual ads, re-targeted behavioral ads, and social media ads <ul> <li>these new ad products don't account for a significant portion of online ads today and, with the exception of social media ads, appear to have only limited potential for growth [Surely video is pretty big now, especially if you include &quot;sponsorships&quot; and not just ads inserted by the platform?]</li> </ul></li> </ul> <h4 id="3-syndicated-search-and-search-advertising">3. Syndicated Search and Search Advertising</h4> <ul> <li>Search engines &quot;syndicate&quot; search and/or search ads <ul> <li>E.g., if you go to AOL or Ask.com, you can do a search which is powered by a search provider, like Google</li> </ul></li> <li>Publisher gets to keep the user on its own platform, search provider gets search volume and can monetize traffic <ul> <li>End-user doesn't pay; publisher pays Google either on a cost-per-user-query basis or by accepting search ads and splitting revenues from search ads run on the publisher's site.
Revenue sharing agreement often called &quot;traffic acquisition cost&quot; (TAC)</li> </ul></li> <li>Publishers can get search ads without offering search (AdSense) and vice versa</li> </ul> <h4 id="4-mobile-search">4. Mobile Search</h4> <ul> <li>Focus of search has been moving from desktop to &quot;rapid emerging — and lucrative — frontier of mobile&quot;</li> <li>Android at forefront; has surpassed iPhone in U.S. market share</li> <li>Mobile creates opportunities for location-based search ads; even more precise intent targeting than desktop search ads</li> <li>Google and others have signed distribution agreements with device makers and wireless carriers, so user-purchased devices usually come pre-installed with search and other apps</li> </ul> <h3 id="c-the-significance-of-scale-in-internet-search">C. THE SIGNIFICANCE OF SCALE IN INTERNET SEARCH</h3> <ul> <li>Scale (user queries and ad volume) important to competitive dynamics</li> </ul> <h4 id="1-search-query-volume">1. Search Query Volume</h4> <ul> <li>Microsoft claims it needs higher query volume to improve Bing <ul> <li>Logs of queries can be used to improve tail queries</li> <li>Suggestions, instant search, spelling correction</li> <li>Trend identification, fresh news stories</li> </ul></li> <li>Click data important for evaluating search quality <ul> <li>Udi Manber (former Google chief of search quality) testimony: &quot;The ranking itself is affected by the click data. If we discover that, for a particular query, hypothetically, 80 percent of people click on Result No. 2 and only 10 percent click on Result No. 1, after a while we figure out, well, probably Result 2 is the one people want. So we'll switch it.&quot;</li> <li>Testimony from Eric Schmidt and Sergey Brin confirms click data important and provides feedback on quality of search results</li> <li>Scale / volume allows more experiments <ul> <li>Larry and Sergei's annual letter in 2005 notes importance of experiments, running multiple simultaneous experiments</li> <li>More scale allows for more experiments as well as for experiments to complete more quickly</li> <li>Susan Athey (Microsoft chief economist) says Microsoft search quality team is greatly hampered by insufficient search volume to run experiments</li> </ul></li> </ul></li> <li>2009 comment from Udi Manber: &quot;The bottom line is this. If Microsoft had the same traffic we have their quality will improve *significantly*, and if we had the same traffic they have, ours will drop significantly. That's a fact&quot;</li> </ul> <h4 id="2-advertising-volume">2. Advertising Volume</h4> <ul> <li>Microsoft claims they need more ad volume to improve relevance and quality of ads <ul> <li>More ads means more choices over what ads to serve to use, better matched ads / higher conversion rates</li> <li>Also means more queries</li> <li>Also has similar feedback loop to search</li> </ul></li> <li>Increase volume of advertisers increases competitiveness for ad properties, gives more revenue to search engine <ul> <li>Allows search engine to amortize costs, re-invest in R&amp;D, provide better advertiser coverage, revenue through revenue-sharing agreements to syndication partners (website publishers). Greater revenue to partners attracts more publishers and more advertisers</li> </ul></li> </ul> <h4 id="3-scale-curve">3. 
Scale Curve</h4> <ul> <li>Google acknowledges the importance of scale (outside of the scope of this particular discussion)</li> <li>Google documents replete with references to &quot;virtuous cycle&quot; among users, advertisers, and publishers <ul> <li>Testimony from Google execs confirms this</li> </ul></li> <li>But Google argues scale no longer matters at Google's scale or Microsoft's scale, and that additional scale at Microsoft's size would not &quot;significantly improve&quot; Microsoft search quality</li> <li>Susan Athey argues that relative scale (Bing being 1/5th the size of Google) matters, not absolute size</li> <li>Microsoft claims that a 5% to 10% increase in query volume would be &quot;very meaningful&quot;, notes that gaining access to Yahoo queries and ad volume in 2010 was significant for search quality and monetization <ul> <li>Claim that Yahoo query data increased click-through rate for &quot;auto suggest&quot; from 44% to 61% [the timeframe here is July 2010 to September 2011 — too bad they didn't provide an A/B test here, since this more than 1 year timeframe allows for many other changes to impact the suggest feature as well; did they ship a major change here without A/B testing it? That seems odd]</li> </ul></li> <li>Microsoft also claims search quality improvements due to experiment volume enabled by extra query volume</li> </ul> <h3 id="d-google-s-suspect-conduct">D. GOOGLE'S SUSPECT CONDUCT</h3> <ul> <li>Five main areas of staff investigation of alleged anticompetitive conduct:</li> </ul> <h4 id="1-google-s-preferencing-of-google-vertical-properties-within-its-search-engine-results-page-serp">1. Google's Preferencing of Google Vertical Properties Within Its Search Engine Results Page (&quot;SERP&quot;)</h4> <ul> <li>Allegation is that Google's conduct is anticompetitive because &quot;it forecloses alternative search platforms that might operate to constrain Google's dominance in search and search advertising&quot;</li> <li>&quot;Although it is a close call, we do not recommend that the Commission issue a complaint against Google for this conduct.&quot;</li> </ul> <h5 id="a-overview-of-changes-to-google-s-serp">a. Overview of Changes to Google's SERP</h5> <ul> <li>Google makes changes to UI and algorithms, sometimes without user testing</li> <li>sometimes with testing via a launch review process, typically including: <ul> <li>&quot;the sandbox&quot;, internal testing by engineers</li> <li>&quot;SxS&quot;, side-by-side testing by external raters who compare existing results to proposed results</li> <li>Testing on a small percent of live traffic</li> <li>&quot;launch report&quot; for Launch Committee</li> </ul></li> <li>Google claims to have run 8000 SxS tests and 2500 &quot;live&quot; click tests in 2010, with 500 changes launched</li> <li>&quot;Google's stated goal is to make its ranking algorithms better in order to provide the user with the best experience possible.&quot; <br /></li> </ul> <h5 id="b-google-s-development-and-introduction-of-vertical-properties">b.
Google's Development and Introduction of Vertical Properties</h5> <ul> <li>Google vertical properties launched in stages, initially around 2001</li> <li>Google News, Froogle (shopping), Image Search, and Groups</li> <li>Google has separate indexes for each vertical</li> <li>Around 2005 ,Google realized that vertical search engines, i.e., aggregators in some categories were a &quot;threat&quot; to dominance in web search, feared that these could cause shift in some searches away from Google</li> <li>From GOOG-Texas-1325832-33 (2010): &quot;Vertical search is of tremendous strategic importance to Google. Otherwise the risk is that Google is the go-to place for finding information only in the cases where there is sufficiently low monetization potential that no niche vertical search competitor has filled the space with a better alternative.&quot;</li> <li>2008 presentation titled &quot;Online Advertising Challenges: Rise of the Aggregators&quot;: <ul> <li>&quot;Issue 1. Consumers migrating to MoneySupermarket. Driver: General search engines not solving consumer queries as well as specialized vertical search Consequence: Increasing proportion of visitors going directly to MoneySupermarket. Google Implication: Loss of query volumes.&quot;</li> <li>Issue 2: &quot;MoneySupermarket has better advertiser proposition. Driver: MoneySupermarket offers cheaper, lower risk (CPA-based) leads to advertisers. Google Implication: Advertiser pull: Direct advertisers switch spend to MoneySupermarket/other channels&quot;</li> </ul></li> <li>In response to this threat, Google invested in existing verticals (shopping, local) and invested in new verticals (mortgages, offers, hotel search, flight search)</li> </ul> <h5 id="c-the-evolution-of-display-of-google-s-vertical-properties-on-the-serp">c. The Evolution of Display of Google's Vertical Properties on the SERP</h5> <ul> <li>Google initially had tabs that let users search within verticals</li> <li>In 2003, Marissa Mayer started developing &quot;Universal Search&quot; (launched in 2007), to put this content directly on Google's SERP. Mayer wrote: <ul> <li>&quot;Universal Search is an effort to redesign the user interface of the main Google.com results page SO that Google deliver[s] the most relevant information to the user on Google.com no matter what corpus that information comes from. This design is motivated by the fact that very few users are motivated to click on our tabs, SO they often miss relevant results in the other corpora.&quot;</li> </ul></li> <li>Prior to Universal Search launch, Google used &quot;OneBoxes&quot;, which put vertical content above Google's SERP</li> <li>After launching Universal Search, vertical results could go anywhere</li> </ul> <h5 id="d-google-s-preferential-display-of-google-vertical-properties-on-the-serp">d. 
Google's Preferential Display of Google Vertical Properties on the SERP</h5> <ul> <li>Google used control over Google SERP both to improve UX for searches and to maximize benefit to its own vertical properties</li> <li>Google wanted to maximize percentage of queries that had Universal Search results and drive traffic to Google properties <ul> <li>In 2008, goal to &quot;[i]ncrease google.com product search inclusion to the level of google.com searches with 'product intent', while preserving clickthrough rate.&quot; (GOOG-Texas-0227159-66)</li> <li>Q1 2008, goal of triggering Product Universal on 6% of English searches</li> <li>Q2 2008, goal changed to top OneBox coverage of 50% with 10% CTR and &quot;[i]ncrease coverage on head queries. For example, we should be triggering on at least 5 of the top 10 most popular queries on amazon.com at any given time, rather than only one.&quot;</li> <li>&quot;Larry thought product should get more exposure&quot;, GOOG-ITA-04-0004120-46 (2009)</li> <li>Mandate from exec meeting to push product-related queries as quickly as possible</li> <li>Launch Report for one algorithm change: 'To increase triggering on head queries, Google also implemented a change to trigger the Product Universal on google.com queries if they appeared often in the product vertical. &quot;Using Exact Corpusboost to Trigger Product Onebox&quot; compares queries on www.google.com with queries on Google Shopping, triggers the Product OneBox if the same query is often searched in Google Shopping, and automatically places the universal in position 4, regardless of the quality of the universal results or user &quot;bias&quot; for top placement of the box.'</li> <li>&quot;presentation stating that Google could take a number of steps to be &quot;#1&quot; in verticals, including &quot;[e]ither [getting] high traffic from google.com, or [developing] a separate strong brand,&quot; and asking: &quot;How do we link from Search to ensure strong traffic without harming user experience or AdWords proposition for advertisers?&quot;)&quot;, GOOGFOX-000082469 (2009)</li> <li>Jon Hanke, head of Google Local, to Marissa Mayer: &quot;long term, I think we need to commit to a more aggressive path w/ google where we can show non-webpage results on google outside of the universal 'box' most of us on geo think that we won't win unless we can inject a lot more of local directly into google results.&quot; <ul> <li>&quot;Google's key strengths are: Google.com real estate for the ~70MM of product queries/day in US/UK/DE alone&quot;</li> <li>&quot;I think the mandate has to come down that we want to win [in local] and we are willing to take some hits [i.e., trigger incorrectly sometimes]. I think a philosophical decision needs to get made that results that are not web search results and that displace web pages are &quot;OK&quot; on google.com and nothing to be ashamed of. That would open the door to place page or local entities as ranked results outside of some 'local universal' container. Arguably for many queries<em>all</em> of the top 10 results should be local entities from our index with refinement options. The current mentality is that the google results page needs to be primarily about web pages, possibly with some other annotations if they are really, really good. That's the big weakness that bing is shooting at w/ the 'decision engine' pitch - not a sea of pointers to possible answers, but real answers right on the page. 
&quot;</li> </ul></li> <li>In spring 2008, Google estimated top placement of Product Universal would lead to loss of $154M/yr on product queries. Ads team requested reduction in triggering frequency and Product Universal team objected, &quot;We face strong competition and must move quickly. Turning down onebox would hamper progress as follows - Ranking: Losing click data harms ranking; [t]riggering Losing CTR and google.com query distribution data triggering accuracy; [c]omprehensiveness: Losing traffic harms merchant growth and therefore comprehensiveness; [m]erchant cooperation: Losing traffic reduces effort merchants put into offer data, tax, &amp; shipping; PR: Turning off onebox reduces Google's credibility in commerce; [u]ser awareness: Losing shopping-related UI on google.com reduces awareness of Google's shopping features.&quot;</li> </ul></li> <li>&quot;Google embellished its Universal Search results with photos and other eye-catching interfaces, recognizing that these design choices would help steer users to Google's vertical properties&quot; <ul> <li>&quot;Third party studies show the substantial difference in traffic with prominent, graphical user interfaces&quot;; &quot;These 'rich' user interfaces are not available to competing vertical websites&quot;</li> </ul></li> <li>Google placed its Universal Search results near or at the top of the SERP, pushing other results down, resulting in reduced CTR to &quot;natural search results&quot; <ul> <li>Google did this without comparing quality of Google's vertical content to competitors or evaluating whether users prefer Google's vertical content to displaced results</li> </ul></li> <li>Click-through data from eBay indicates that (Jan-Apr 2012) Google Product Search appeared in a top 5 position 64% of the time when displayed and Google Product Search had lower CTR than web search in same position regardless of position [below is rank: natural result CTR / Google Shopping CTR / eBay CTR] <ul> <li>1: 38% / 21% / 31%</li> <li>2: 21% / 14% / 20%</li> <li>3: 16% / 12% / 18%</li> <li>4: 13% / 9% / 11%</li> <li>5: 10% / 8% / 10%</li> <li>6: 8% / 6% / 9%</li> <li>7: 7% / 5% / 9%</li> <li>8: 6% / 2% / 7%</li> <li>9: 6% / 3% / 6%</li> <li>10: 5% / 2% / 6%</li> <li>11: 5% / 2% / 5%</li> <li>12: 3% / 1% / 4%</li> </ul></li> <li>Although Google tracks CTR and relies on CTR to improve web results, it hasn't relied on CTR to rank Universal Search results against other web search results</li> <li>Marissa Mayer said Google didn't use CTR &quot;because it would take too long to move up on the SERP on the basis of user click-through rate&quot;</li> <li>Instead, &quot;Google used occurrence of competing vertical websites to automatically boost the ranking of its own vertical properties above that of competitors&quot; <ul> <li>If comparison shopping site was relevant, Google would insert Google Product Search above any rival</li> <li>If local search like Yelp or CitySearch was relevant, Google automatically returned Google Local at top of SERP</li> </ul></li> <li>Google launched commission-based verticals (mortgage, flights, offers) in ad space reserved exclusively for its own properties <ul> <li>In 2012, Google announced that Google Product Search would transition to paid and Google would stop including product listings for merchants who don't pay to be listed</li> <li>Google's dedicated ads don't compete with other ads via AdWords and automatically get the most effective ad spots, usually above natural search results</li> <li>As with Google's Universal results, its own ads have a rich user
interface not available to competitors which results in higher CTR</li> </ul></li> </ul> <h5 id="e-google-s-demotion-of-competing-vertical-websites">e. Google's Demotion of Competing Vertical Websites</h5> <ul> <li>&quot;While Google embarked on a multi-year strategy of developing and showcasing its own vertical properties, Google simultaneously adopted a strategy of demoting, or refusing to display, links to certain vertical websites in highly commercial categories&quot;</li> <li>&quot;Google has identified comparison shopping websites as undesirable to users, and has developed several algorithms to demote these websites on its SERP. Through an algorithm launched in 2007, Google demoted all comparison shopping websites beyond the first two on its SERP&quot;</li> <li>&quot;Google's own vertical properties (inserted into Google's SERP via Universal Search) have not been subject to the same demotion algorithms, even though they might otherwise meet the criteria for demotion.&quot; <ul> <li>Google has acknowledged that its own vertical sites meet the exact criteria for demotion</li> <li>Additionally, Google's web spam team originally refused to add Froogle to search results because &quot;[o]ur algorithms specifically look for pages like these to either demote or remove from the index.&quot;</li> <li>Google's web spam team also refused to add Google's local property</li> </ul></li> </ul> <h5 id="f-effects-of-google-s-serp-changes-on-vertical-rivals">f. Effects of Google's SERP Changes on Vertical Rivals</h5> <ul> <li>&quot;Google's prominent placement and display of its Universal Search properties, combined with the demotion of certain vertical competitors in Google's natural search results, has resulted in significant loss of traffic to many competing vertical websites&quot;</li> <li>&quot;Google's internal data confirms the impact, showing that Google anticipated significant traffic loss to certain categories of vertical websites when it implemented many of the algorithmic changes described above&quot;</li> <li>&quot;While Google's changes to its SERP led to a significant decrease in traffic for the websites of many vertical competitors, Google's prominent showcasing of its vertical properties led to gains in user share for its own properties&quot;</li> <li>&quot;For example, Google's inclusion of Google Product Search as a Universal Search result took Google Product Search from a rank of seventh in page views in July 2007 to the number one rank by July 2008. Google product search leadership acknowledged that '[t]he majority of that growth has been driven through product search universal.'&quot;</li> <li>&quot;Beyond the direct impact on traffic to Google and its rivals, Google's changes to its SERP have led to reduced investment and innovation in vertical search markets. For example, as a result of the rise of Google Product Search (and simultaneous fall of rival comparison shopping websites), NexTag has taken steps to reduce its investment in this area. Google's more recent launch of its flight search product has also caused NexTag to cease development of an 'innovative and competitive travel service.'&quot;</li> </ul> <h4 id="2-google-s-scraping-of-rivals-vertical-content">2. 
Google's &quot;Scraping&quot; of Rivals' Vertical Content</h4> <ul> <li>&quot;Staff has investigated whether Google has &quot;scraped&quot; - or appropriated - the content of rival vertical websites in order to improve its own vertical properties SO as to maintain, preserve, or enhance Google's monopoly power in the markets for search and search advertising. We recommend that the Commission issue a complaint against Google for this conduct.&quot;</li> <li>In addition to developing its own vertical properties, Google scraped content from existing vertical websites (e.g., Yelp, TripAdvisor, Amazon) in order to improve its own vertical listings, &quot;e.g., GOOG-Texas-1380771-73 (2009), at 71-72 (discussing importance of Google Places carrying better review content from Yelp).&quot;</li> </ul> <h5 id="a-the-local-story">a. The &quot;Local&quot; Story</h5> <ul> <li>&quot;Some local information providers, such as Yelp, TripAdvisor, and CitySearch, disapprove of the ways in which Google has made use of their content&quot;</li> <li>&quot;Google recognized that review content, in particular, was &quot;critical to winning in local search,&quot; but that Google had an 'unhealthy dependency' on Yelp for much of its review content. Google feared that its heavy reliance on Yelp content, along with Yelp's success in certain categories and geographies, could lead Yelp and other local information websites to siphon users' local queries away from Google&quot; <ul> <li>&quot;concern that Yelp (and similar) could become competing local search platforms&quot; (Goog-Texas-0975467-97)</li> </ul></li> <li>Google Local execs tried to convince Google to acquire Yelp, but failed</li> <li>Yelp, on finding that Google was going to use reviews on its own property, discontinued its feed and asked for Yelp content to be removed from Google Local</li> <li>&quot;after offering its own review site for more than two years, Google recognized that it had failed to develop a community of users - and thus, the critical mass of user reviews - that it needed to sustain its local product.&quot;, which led to failed attempt to buy Yelp <ul> <li>To address this problem, Google added Google Places results on SERP: &quot;The listing for each business that came up as a search result linked the user directly to Google's Places page, with a label indicating that hundreds of reviews for the business were available on the Places page (but with no links to the actual sources of those reviews). On the Places Page itself, Google provided an entire paragraph of each copied review (although not the complete review), followed by a link to the source of the review, such as Yelp (which it crawled for reviews) and TripAdvisor (which was providing a feed).&quot;</li> <li>In July 2010, Yelp noticed that Google was featuring Yelp's content without a license and protested to Google.
TripAdvisor chose not to renew license with Google after finding same</li> <li>Google implemented new policy that would ban properties from Google search if they didn't allow their content to be used in Google Places <ul> <li>&quot;GOOG-Texas-1041511-12 (2010), at 12 (&quot;remove blacklist of yelp [reviews] from Web-extracted Reviews once provider based UI live&quot;); GOOG-Texas-1417391-403 (2010), at 394 (&quot;stating that Google should wait to publish a blog post on the new UI until the change to &quot;unblacklist Yelp&quot; is &quot;live&quot;).&quot;</li> </ul></li> <li>Along with this policy, launched new reviews product and seeded it reviews from 3rd party websites without attribution</li> <li>Yelp, CitySearch, and TripAdvisor all complained and were all told that they could only remove their content if they were fully removed from search results. &quot;This was not technically necessary - it was just a policy decision by Google.&quot;</li> <li>Yelp sent Google a C&amp;D</li> <li>Google claimed it was technically infeasible to remove Yelp content from Google Places without also banning Yelp from search result <ul> <li>Google later did this, making it clear that the claim that it was technically infeasible was false</li> <li>Google still maintained that it would be technically infeasible to remove Yelp from Google Places without removing it from &quot;local merge&quot; interface on SERP. Staff believes this assertion is false as well because Google maintains numerous &quot;blacklists&quot; that prevent content from being shown in specific locations</li> <li>Mayer later admitted during hearing that the infeasible claim was false and that Google feared consequences of allowing websites to opt out of Google Places while staying in &quot;local merge&quot;</li> <li>&quot;Yelp contends that Google's continued refusal to link to Yelp on Google's 'local merge' interface on the main SERP is simply retaliation for Yelp seeking removal from Google Places.&quot;</li> </ul></li> </ul></li> <li>&quot;Publicly, Google framed its changes to Google Local as a redesign to move toward the provision of more original content, and thereby, to remove all third-party content and review counts from Google Local, as well as from the prominent &quot;local merge&quot; Universal Search interface on the main SERP. But the more likely explanation is that, by July 2011,Google had already collected sufficient reviews by bootstrapping its review collection on the display of other websites' reviews. It no longer needed to display third-party reviews, particularly while under investigation for this precise conduct.&quot;</li> </ul> <h5 id="b-the-shopping-story">b. The &quot;Shopping&quot; Story</h5> <ul> <li>[full notes omitted; story is similar to above, but with Amazon; similar claims of impossibility of removing from some places and not others; Amazon wanted Google to stop using Amazon star ratings, which Google claimed was impossible without blacklisting Amazon from all of web search, etc.; there's also a parallel story about Froogle's failure and Google's actions after that]</li> </ul> <h5 id="c-effects-of-google-s-scraping-on-vertical-rivals">c. Effects of Google's &quot;Scraping&quot; on Vertical Rivals</h5> <ul> <li>&quot;Because Google scraped content from these vertical websites over an extended period of time, it is difficult to point to declines in traffic that are specifically attributable to Google's conduct. 
However, the natural and probable effect of Google's conduct is to diminish the incentives of companies like Yelp, TripAdvisor, CitySearch, and Amazon to invest in, and to develop, new and innovative content, as the companies cannot fully capture the benefits of their innovations&quot;</li> </ul> <h4 id="3-google-s-api-restrictions">3. Google's API Restrictions</h4> <ul> <li>&quot;Staff has investigated whether Google's restrictions on the automated cross-management of advertising campaigns has unlawfully contributed to the maintenance, preservation, or enhancement of Google's monopoly power in the markets for search and search advertising. Microsoft alleges that these restrictions are anticompetitive because they prevent Google's competitors from achieving efficient scale in search and search advertising. We recommend that the Commission issue a complaint against Google for this conduct.&quot;</li> </ul> <h5 id="a-overview-of-the-adwords-platform">a. Overview of the AdWords Platform</h5> <ul> <li>To set up AdWords, advertisers prepare bids. Can have thousands or hundreds of thousands of keywords. <ul> <li>E.g., DirectTV might bid on &quot;television&quot;, &quot;TV&quot;, and &quot;satellite&quot; plus specific TV show names, such as &quot;Friday Night Lights&quot;, as well as misspellings</li> <li>Bids can be calibrated by time and location</li> <li>Advertisers then prepare ads (called &quot;creatives&quot;) and match with various groups of keywords</li> <li>Advertisers get data from AdWords, can evaluate effectiveness and modify bids, add/drop keywords, modify creative <ul> <li>This is called &quot;optimization&quot; when done manually; expensive and time-intensive</li> </ul></li> </ul></li> <li>Initially two ways to access AdWords system, AdWords Front End and AdWords Editor <ul> <li>Editor is a program. Allows advertisers to download campaign information from Google, make bulk changes offline, then upload changes back to AdWords</li> <li>Advertisers would make so many changes that system's capacity would be exceeded, causing outages</li> </ul></li> <li>In 2004, Google added AdWords API to address problems</li> <li>[description of what an API is omitted]</li> </ul> <h5 id="b-the-restrictive-conditions">b. The Restrictive Conditions</h5> <ul> <li>AdWords API terms and conditions non-negotiable, apply to all users</li> <li>One restriction prevents advertisers from using 3rd party tool or have 3rd party use a tool to copy data from AdWords API into ad campaign on another search network</li> <li>Another, can't use 3rd party tool or have 3rd party use a tool to comingle AdWords campaign data with data from another search engine</li> <li>The two conditions above will be referred to as &quot;the restrictive conditions&quot;</li> <li>&quot;These restrictions essentially prevent any third-party tool developer or advertising agency from creating a tool that provides a single user interface for multiple advertising campaigns. 
Such tools would facilitate cross-platform advertising.&quot;</li> <li>&quot;However, the restrictions do not apply to advertisers themselves, which means that very large advertisers, such as Amazon and eBay, can develop - and have developed - their own multi-homing tools that simultaneously manage campaigns across platforms&quot;</li> <li>&quot;The advertisers affected are those whose campaign volumes are large enough to benefit from using the AdWords API, but too small to justify devoting the necessary resources to develop in-house the software and expertise to manage multiple search network ad campaigns.&quot;</li> </ul> <h5 id="c-effects-of-the-restrictive-conditions">c. Effects of the Restrictive Conditions</h5> <h6 id="i-effects-on-advertisers-and-search-engine-marketers-sems">i. Effects on Advertisers and Search Engine Marketers (&quot;SEMs&quot;)</h6> <ul> <li>Prevents development of tools that would allow advertisers to manage ad campaigns on multiple search ad networks simultaneously</li> <li>Google routinely audits API clients for compliance</li> <li>Google has required SEMs to remove functionality, &quot;e.g., GOOGEC-0180810-14 (2010) (Trada); GOOGEC-0180815-16 (2010) (MediaPlex); GOOGEC-0181055-58 (2010) (CoreMetrics); GOOGEC-0181083-87 (2010) (Keybroker); GOOGEC-0182218-330 (2008) (Marin Software). 251 Acquisio IR (Sep. 12, 2011); Efficient Frontier IR (Mar. 5, 2012)&quot;</li> <li>Other SEMs have stated they would develop this functionality without restrictions</li> <li>&quot;Google anticipated that the restrictive conditions would eliminate SEM incentives to innovate.&quot;, &quot;GOOGKAMA-000004815 (2004), at 2.&quot;</li> <li>&quot;Many advertisers have said they would be interested in buying a tool that had multi-homing functionality. Such functionality would be attractive to advertisers because it would reduce the costs of managing multiple ad campaigns, giving advertisers access to additional advertising opportunities on multiple search advertising networks with minimal additional investment of time. The advertisers who would benefit from such a tool appear to be the medium-sized advertisers, whose advertising budgets are too small to justify hiring a full service agency, but large enough to justify paying for such a tool to help increase their advertising opportunities on multiple search networks.&quot;</li> </ul> <h6 id="ii-effects-on-competitors">ii. Effects on Competitors</h6> <ul> <li>Removing restrictions would increase ad spend on networks that compete with Google</li> <li>Data on advertiser multi-homing show some effects of restrictive conditions. Nearly all the largest advertisers multi-home, but percentage declines as spend decreases <ul> <li>Advertisers would also multi-home with more intensity <ul> <li>Microsoft claims that multi-homing advertisers optimize their Google campaigns almost daily, and their Microsoft campaigns less frequently, weekly or bi-weekly</li> </ul></li> </ul></li> <li>Without incremental transaction costs, &quot;all rational advertisers would multi-home&quot;</li> <li>Staff interviewed randomly selected small advertisers. Interviews &quot;strongly supported&quot; thesis that advertisers would multi-home if cross-platform optimization tool were available <ul> <li>Some advertisers don't advertise on Bing due to the lack of such a tool; the ones that do advertise on Bing do less optimization</li> </ul></li> </ul> <h5 id="d-internal-google-discussions-regarding-the-restrictions">d.
Internal Google Discussions Regarding the Restrictions</h5> <ul> <li>Internal discussions support the above</li> <li>PM wrote the following in 2007, endorsed by director of PM Richard Holden: <ul> <li>&quot;If we offer cross-network SEM in [Europe], we will give a significant boost to our competitors. Most advertisers that I have talked to in [Europe] don't bother running campaigns on [Microsoft] or Yahoo because the additional overhead needed to manage these other networks outweighs the small amount of additional traffic. For this reason, [Microsoft] and Yahoo still have a fraction of the advertisers that we have in [Europe], and they still have lower average CPAs [cost per acquisition]&quot;</li> <li>&quot;This last point is significant. The success of Google's AdWords auctions has served to raise the costs of advertising on Google. With more advertisers entering the AdWords auctions, the prices it takes to win those auctions have naturally risen. As a result, the costs per acquisition on Google have risen relative to the costs per acquisition on Bing and Yahoo!. Despite these higher costs, as this document notes, advertisers are not switching to Bing and Yahoo! because, for many of them, the transactional costs are too great.&quot;</li> </ul></li> <li>In Dec 2008, Google team led by Richard Holden evaluated possibility of relaxing or removing restrictive conditions and consulted with Google chief economist Hal Varian. Some of Holden's observations: <ul> <li>Advertisers seek out SEMs and agencies for cross-network management technology and services;</li> <li>The restrictive conditions make the market more inefficient;</li> <li>Removing the restrictive conditions would &quot;open up the market&quot; and give Google the opportunity to compete with a best-in-class SEM tool with &quot;a streamlined workflow&quot;;</li> <li>Removing the restrictive conditions would allow SEMs to improve their tools as well;</li> <li>While there is a risk of additional spend going to competing search networks, it is unlikely that Google would be seriously harmed because &quot;advertisers are going where the users are,&quot; i.e., to Google</li> </ul></li> <li>&quot;internally, Google recognized that removing the restrictions would create a more efficient market, but acknowledged a concern that doing so might diminish Google's grip on advertisers.&quot;</li> <li>&quot;Nonetheless, following up on that meeting, Google began evaluating ways to improve the DART Search program. DART Search was a cross-network campaign management tool owned by DoubleClick, which Google acquired in 2008. Google engineers were looking at improving the DART Search product, but had to confront limitations imposed by the restrictive conditions. During his investigational hearing, Richard Holden steadfastly denied any linkage between the need to relax the restrictive conditions and the plans to improve DART Search. ²⁷⁴ However, a series of documents - documents authored by Holden - explicitly link the two ideas.&quot;</li> <li>Dec 2008 Holden to SVP of ad products, Susan Wojcicki and others met. <ul> <li>Holden wrote: &quot;[O]ne debate we are having is whether we should eliminate our API T&amp;Cs requirement that AW [AdWords] features not be co-mingled with competitor network features in SEM cross-network tools like DART Search. We are advocating that we eliminate this requirement and that we build a much more streamlined and efficient DART Search offering and let SEM tool provider competitors do the same. 
There was some debate about this, but we concluded that it is better for customers and the industry as a whole to make things more efficient and we will maximize our opportunity by moving quickly and providing the most robust offering&quot;</li> </ul></li> <li>Feb 2009, Holden wrote exec summary for DART, suggested Google &quot;&quot;alter the AdWords Ts&amp;Cs to be less restrictive and produce the leading cross-network toolset that increases advertiser/agency efficiency.&quot; to &quot;[r]educe friction in the search ads sales and management process and grow the industry faster&quot;</li> <li>Larry Page rejected this. Afterwards, Holden wrote &quot;We've heard that and we will focus on building the product to be industry-leading and will evaluate it with him when it is done and then discuss co-mingling and enabling all to do it.&quot;</li> <li>Sep 2009, API PM raised possibility of eliminating restrictive conditions to help DART. Comment from Holden: <ul> <li>&quot;I think the core issue on which I'd like to get Susan's take is whether she sees a high risk of existing spend being channeled to MS/Yahoo! due to a more lenient official policy on campaign cloning. Then, weigh that risk against the benefits: enabling DART Search to compete better against non-compliant SEM tools, more industry goodwill, easier compliance enforcement. Does that seem like the right high level message?&quot;</li> </ul></li> <li>&quot;The documents make clear that Google was weighing the efficiency of relaxing the restrictions against the potential cost to Google in market power&quot;</li> <li>&quot;At a January 2010 meeting, Larry Page decided against removing or relaxing the restrictive conditions. However, there is no record of the rationale for that decision or what weight was given to the concern that relaxing the restrictive conditions might result in spend being channeled to Google's competitors. Larry Page has not testified. Holden testified that he did not recall the discussion. The participants at the meeting did not take notes &quot;for obvious reasons.&quot; Nonetheless, the documents paint a clear picture: Google rejected relaxing the API restrictions, and at least part of the reason for this was fear of diverting advertising spend to Microsoft.&quot; <ul> <li>Holden to Wojcicki: &quot;We didn't take notes <a href="https://mastodon.social/@danluu/111243272325539288">for obvious reasons (hence why I'm not elaborating too much here in email)</a> but happy to brief you more verbally&quot;.</li> </ul></li> </ul> <h4 id="4-google-s-exclusive-and-restrictive-syndication-agreements">4. Google's Exclusive and Restrictive Syndication Agreements</h4> <ul> <li>&quot;Staff has investigated whether Google has entered into exclusive or highly restrictive agreements with website publishers that have served to maintain, preserve, or enhance Google's monopoly power in the markets for search, search advertising, or search and search advertising syndication (or &quot;search intermediation&quot;). We recommend that the Commission issue a complaint against Google for this conduct.&quot;</li> </ul> <h5 id="a-publishers-and-market-structure">a. 
Publishers and Market Structure</h5> <ul> <li>Buyers of search and search ad syndication are website publishers</li> <li>Largest sites account for vast majority of syndicated search traffic and volume</li> <li>Biggest customers are e-commerce retailers (e.g., Amazon and eBay), traditional retailers with websites (e.g., Wal-Mart, Target, Best Buy), and ISPs which operate their own portals</li> <li>Below this group, companies with significant query volume, including vertical e-commerce sites such as Kayak, smaller retailers and ISPs such as EarthLink; all of these are &lt; 1% of Google's total AdSense query volume</li> <li>Below, publisher size rapidly drops off to &lt; 0.1% of Google's query volume</li> <li>Payment publisher receives a function of <ul> <li>volume of clicks on syndicated ad</li> <li>&quot;CPC&quot;, or cost-per-click advertiser willing to pay for each click</li> <li>revenue sharing percentage</li> </ul></li> <li>rate of user clicks and CPC aggregated to form &quot;monetization rate&quot;</li> </ul> <h5 id="b-development-of-the-market-for-search-syndication">b. Development of the Market for Search Syndication</h5> <ul> <li>First AdSense for Search (AFS) agreements with AOL and EarthLink in 2002 <ul> <li>Goal then was to grow nascent industry of syndicated search ads</li> <li>At the time, Google was bidding against incumbent Overture (later acquired by Yahoo) for exclusive agreements with syndication partners</li> </ul></li> <li>Google's early deals favored publishers</li> <li>To establish a presence, Google offered up-front financial guarantees to publishers</li> </ul> <p>c. Specifics of Google's Syndication Agreements</p> <ul> <li>&quot;Today, the typical AdSense agreement contains terms and conditions that describe how and when Google will deliver search, search advertising, and other (contextual or domain related) advertising services.&quot;</li> <li>Two main categories are AFS (search) and AFC (content). Staff investigation focused on AFS</li> <li>For AFS, two types of agreements. GSAs (Google Service Agreements) negotiated with large partners and standard online contracts, which are non-negotiable and non-exclusive</li> <li>Bulk of AFS partners are on standard online agreements, but those are a small fraction of revenue</li> <li>Bulk of revenue comes from GSAs with Google's 10 largest partners (almost 80% of query volume in 2011). All GSAs have some form of exclusivity or &quot;preferred placement&quot; for Google</li> <li>&quot;Google's exclusive AFS agreements effectively prohibit the use of non-Google search and search advertising within the sites and pages designated in the agreement. Some exclusive agreements cover all properties held by a publisher globally; other agreements provide for a property-by-property (or market-by-market) assignment&quot;</li> <li>By 2008, Google began to migrate away from exclusivity to &quot;preferred placement&quot;. Google must display minimum of 3 ads or number of any competitor (whichever is greater), in an unbroken block, with &quot;preferred placement&quot; (in the most prominent position on publisher's website)</li> <li>Google had preferred placement restrictions in GSAs and standard online agreement. 
Google maintains it was not aware of this provision in standard online agreement until investigational hearing of Google VP for search services, Joan Braddi, where staff questioned Braddi <ul> <li>See Letter from Scott Sher, Wilson Sonsini, to Barbara Blank (May 25, 2012) (explaining that, as of the date of the letter, Google was removing the preferred placement clause from the Online Terms and Conditions, and offering no further explanation of this decision)</li> </ul></li> </ul> <h5 id="d-effects-of-exclusivity-and-preferred-placement">d. Effects of Exclusivity and Preferred Placement</h5> <ul> <li>Staff interviewed large and small customers for search and search advertising syndication. Key findings:</li> </ul> <h6 id="i-common-publisher-responses">i. Common Publisher Responses</h6> <ul> <li>Universal agreement that Bing's search and search advertising markedly inferior, not competitive across-the-board <ul> <li>Amazon reports that Bing monetizes at half the rate of Google</li> <li>business.com told staff that Google would have to cut revenue share from 64.5% to 30% and Microsoft would have to provide 90% share because Microsoft's platform has such low monetization</li> </ul></li> <li>Customers &quot;generally confirmed&quot; Microsoft's claim that Bing's search syndication is inferior in part because Microsoft's network is smaller than Google's <ul> <li>With a larger ad base, Google more likely to have relevant, high-quality, ad for any given query, which improves monetization rate</li> </ul></li> <li>A small publisher said, essentially, the only publishers exclusively using Bing are ones who've been banned from Google's service <ul> <li>We know from other interviews this is an exaggeration, but it captures the general tenor of comments about Microsoft</li> </ul></li> <li>Publishers reported Microsoft not aggressively trying to win their business <ul> <li>Microsoft exec acknowledge that Bing needs a larger portfolio of advertisers, has been focused there over winning new syndication business</li> </ul></li> <li>Common theme from many publishers is that search is a relatively minor part of their business and not a strategic focus. For example, Wal-Mart operates website as extension to retail and Best Buy's main goal of website is to provide presale info</li> <li>Most publishers hadn't seriously considered Bing due to poor monetization</li> <li>Amazon, which does use Bing and Google ads, uses a single syndication provider on a page to avoid showing the user the same ad multiple times on the same page; mixing and matching arrangement generally considered difficult by publishers</li> <li>Starting in 2008, Google systematically tried to lower revenue share for AdSense partners <ul> <li>E.g., &quot;Our general philosophy with renewals has been to reduce TAC across the board&quot;, &quot;2009 Traffic Acquisition Cost (TAC) was down 3 percentage points from 2008 attributable to the application of standardized revenue share guidelines for renewals and new partnerships...&quot;, etc.</li> </ul></li> <li>Google reduced payments (TAC) to AFS partners from 80.4% to 74% between Q1 2009 and Q1 2010</li> <li>No publisher viewed reduction as large enough to justify shifting to Bing or serving more display ads instead of search ads</li> </ul> <h6 id="ii-publishers-views-of-exclusivity-provisions">ii. 
Publishers' Views of Exclusivity Provisions</h6> <ul> <li>Some large publishers reported exclusive contracts and some didn't</li> <li>Most publishers with exclusivity provisions didn't complain about them</li> <li>A small number of technically sophisticated publishers were deeply concerned by exclusivity <ul> <li>These customers viewed search and search advertising as a significant part of business, have the sophistication to integrate multiple suppliers into on-line properties</li> <li>eBay: largest search and search ads partner, 27% of U.S. syndicated search queries in 2011 <ul> <li>Contract requires preferential treatment for AdSense ads, which eBay characterizes as equivalent to exclusivity</li> <li>eBay wanted this removed in last negotiation, but assented to not removing it in return for not having revenue share cut while most other publishers had revenue share cut</li> <li>eBay's testing indicates that Bing is competitive in some sectors, e.g., tech ads; they believe they could make more money with multiple search providers</li> </ul></li> <li>NexTag: In 2015, Google's 15th largest AFS customer <ul> <li>Had exclusivity, was able to remove it in 2010, but NexTag considers restrictions &quot;essentially the same thing as exclusivity&quot;; &quot;NexTag reports that moving away from explicit exclusivity even to this kind of de facto exclusivity required substantial, difficult negotiations with Google&quot;</li> <li>Has had discussions with Yahoo and Bing about using their products &quot;on a filler basis&quot;, but unable to do so due to Google contract restrictions</li> </ul></li> <li>business.com: B2B lead generation / vertical site; much smaller than above. Barely in top 60 of AdSense query volume <ul> <li>Exclusive agreement with Google</li> <li>Would test Bing and Yahoo without exclusive agreement</li> <li>Agreement also restricts how business.com can design pages</li> <li>Loosening exclusivity would improve business.com revenue and allow for new features that make the site more accessible and user-friendly</li> </ul></li> <li>Amazon: 2nd largest AFS customer after eBay; $175M from search syndication, $169M from Google AdSense <ul> <li>Amazon uses other providers despite their poor monetization due to concerns about having a single supplier; because Amazon operates on thin margins, $175M is a material source of profit</li> <li>Amazon concerned it will be forced to sign an exclusive agreement in next negotiation</li> <li>During last negotiation, Amazon wanted 5-year deal, Google would only give 1-year extension unless Amazon agreed to send Google 90% of search queries (Amazon refused to agree to this formally, although they do this)</li> </ul></li> <li>IAC: umbrella company operating ask.com, Newsweek, CityGrid, Urbanspoon, and other websites <ul> <li>Agreement is exclusive on a per-property basis</li> <li>IAC concerned about exclusivity. 
CityGrid wanted mix-and-match options, but couldn't compete with Google's syndication network, forced to opt into IAC's exclusive agreement; CityGrid wants to use other networks (including its own), but can't under agreement with Google</li> <li>IAC concerned about lack of competition in search and search advertising syndication</li> <li>Executive who expressed the above concerns left; new executive didn't see a possibility of splitting or moving traffic</li> <li>&quot;The departure of the key executive with the closest knowledge of the issues and the most detailed concerns suggests we may have significant issues obtaining clear, unambiguous testimony from IAC that reflects their earlier expressed concerns.&quot;</li> </ul></li> </ul></li> </ul> <h6 id="iii-effects-on-competitors">iii. Effects on Competitors</h6> <ul> <li>Microsoft asserts even a 5%-10% increase in query volume would be &quot;very meaningful&quot; and Google's exclusive and restrictive agreements deny Microsoft the incremental scale needed to be a more efficient competitor</li> <li>Specialty search ad platforms also impacted; IAC sought to build platform for local search advertising, but Google's exclusivity provisions &quot;make it less likely that small local competitors like IAC's nascent offering can viably emerge.&quot;</li> </ul> <h2 id="iii-legal-analysis">III. LEGAL ANALYSIS</h2> <ul> <li>&quot;A monopolization claim under Section 2 of the Sherman Act, 15 U.S.C. § 2, has two elements: (i) the 'possession of monopoly power in the relevant market' and (ii) the 'willful acquisition or maintenance of that power as distinguished from growth or development as a consequence of a superior product, business acumen, or historic accident.'&quot;</li> <li>&quot;An attempted monopolization claim requires a showing that (i) 'the defendant has engaged in predatory or anticompetitive conduct' with (ii) 'a specific intent to monopolize' and (iii) a dangerous probability of achieving or maintaining monopoly power.&quot;</li> </ul> <h3 id="a-google-has-monopoly-power-in-relevant-markets">A. GOOGLE HAS MONOPOLY POWER IN RELEVANT MARKETS</h3> <ul> <li>&quot;'A firm is a monopolist if it can profitably raise prices substantially above the competitive level. [M]onopoly power may be inferred from a firm's possession of a dominant share of a relevant market that is protected by entry barriers.' Google has monopoly power in one or more properly defined markets.&quot;</li> </ul> <h4 id="1-relevant-markets-and-market-shares">1. Relevant Markets and Market Shares</h4> <ul> <li>&quot;A properly defined antitrust market consists of 'any grouping of sales whose sellers, if unified by a hypothetical cartel or merger, could profitably raise prices significantly above the competitive level.'&quot;</li> <li>&quot;Typically, a court examines 'such practical indicia as industry or public recognition of the submarket as a separate economic entity, the product's peculiar characteristics and uses, unique production facilities, distinct customers, distinct prices, sensitivity to price changes, and specialized vendors.'&quot;</li> <li>&quot;Staff has identified three relevant antitrust markets.&quot;</li> </ul> <h5 id="a-horizontal-search">a.
Horizontal Search</h5> <ul> <li>Vertical search engines not a viable substitute for horizontal search; formidable barriers to expanding into horizontal search</li> <li>Vertical search properties could pick up query volume in response to SSNIP (small but significant non-transitory increase in price) in horizontal search, potentially displacing horizontal search providers</li> <li>Google views these with concern, has aggressively moved to build its own vertical offerings</li> <li>No mechanism for vertical search properties to broadly discipline a monopolist in horizontal search <ul> <li>Web search queries monetized through search ads, ads sold by keyword which have independent demand functions. So, at best, monopolist might be inhibited from SSNIP on a narrow set of keywords with strong vertical competition. But for billions of queries with no strong vertical, nothing constrains monopolist from SSNIP</li> </ul></li> <li>Where vertical websites exist, still hard to compete; comprehensive coverage of all areas seems to be important driver of demand, even to websites focusing on specific topics. Eric Schmidt noted this: <ul> <li>&quot;So if you, for example, are an academic researcher and you use Google 30 times for your academics, then perhaps you'll want to buy a camera... So long as the product is very, very, very, very good, people will keep coming back... The general product then creates the brand, creates demand and so forth. Then occasionally, these ads get clicked on&quot;</li> </ul></li> <li>Schmidt's testimony corroborated by several vertical search firms, who note that they're dependent on horizontal search providers for traffic because vertical search users often start with Google, Bing, or Yahoo</li> <li>When asked about competitors in search, Eric Schmidt mentioned zero vertical properties <ul> <li>Google internal documents monitor Bing and Yahoo and compare quality. Sergey Brin testified that he wasn't aware of any such regular comparison against vertical competitors</li> </ul></li> <li>Relevant geo for web search limited to U.S. here; search engines return results relevant to users in country they're serving, so U.S. users unlikely to view foreign-specialized search engines as viable substitute</li> <li>Although Google has managed to cross borders, other major international search engines (Baidu, Yandex) have failed to do this</li> <li>Google dominant for &quot;general search&quot; in U.S.; 66.7% share according to ComScore, and also provides results to ask.com and AOL, another 4.6%</li> <li>Yahoo 15%, Bing 14%</li> <li>Google's market share above generally accepted floor for monopolization; defendants with share in this range have been found to have monopoly power</li> </ul> <h5 id="b-search-advertising">b. 
Search Advertising</h5> <ul> <li>Search ads likely a properly defined market</li> <li>Search ads distinguishable from other online ads, such as display ads, contextual ads, behavioral ads, and social media ads, due to &quot;inherent scale, targetability, and control&quot; <ul> <li>Google: &quot;[t]hey are such different products that you do not measure them against one another and the technology behind the products is different&quot;</li> </ul></li> <li>Evidence suggests search and display ads are complements, not substitutes <ul> <li>&quot;Google has observed steep click declines when advertisers have attempted to shift budget to display advertising&quot;</li> <li>Chevrolet suspended search ads for 2 weeks and relied on display ads alone; lost 30% of clicks</li> </ul></li> <li>New ad offerings don't fit into traditional search or display categories: contextual, re-targeted display (or behavioral), social media <ul> <li>Only search ads allow advertisers to show an ad at the moment the user is expressing an interest; numerous advertisers confirmed this point</li> <li>Search ads convert at much higher rate due to this advantage</li> </ul></li> <li>Numerous advertisers report they wouldn't shift ad spend away from search ads if prices increased by more than a SSNIP. Living Social would need 100% price increase before shifting ads (a minority of advertisers reported they would move ad dollars from search in response to SSNIP)</li> <li>Google internal documents and testimony confirm lack of viable substitute for search ads. AdWords VP Nick Fox and chief economist Hal Varian have stated that search ad spend doesn't come at expense of other ad dollars, Eric Schmidt has testified multiple times that search ads are the most effective ad tool, has best ROI</li> <li>Google, through AdWords, has 76% to 80% of the market according to industry-wide trackers (rival Bing-Yahoo has 12% to 16%)</li> <li>[It doesn't seem wrong to say that search ads are a market and that Google dominates that market, but the primacy of search ads seems overstated here? Social media ads, just becoming important at the time, ended up becoming very important, and of course video as well]</li> </ul> <h5 id="c-syndicated-search-and-search-advertising-search-intermediation">c. Syndicated Search and Search Advertising (&quot;Search Intermediation&quot;)</h5> <ul> <li>Syndicated search and search advertising (&quot;search intermediation&quot;) are likely a properly defined product market</li> <li>Horizontal search providers sell (&quot;syndicate&quot;) services to other websites</li> <li>Search engine can also return search ads to the website; search engine and website share revenue</li> <li>Consumers are websites that want search; sellers are horizontal search providers, Google, Bing, Yahoo</li> <li>Publishers of various sizes consistent on cross-elasticity of demand; report that search ad syndication monetizes better than display advertising or other content</li> <li>No publisher told us that modest (5% to 10%) increase in price for search and search ad syndication would lead them to favor other forms of advertising or web content</li> <li>Google's successful efforts to systematically reduce TAC support this and are a natural experiment to determine likely response to SSNIP</li> <li>Google, via AdSense, is dominant provider of search and search ad syndication; 75% of market according to ComScore (Microsoft and Yahoo combine for 22%)</li> </ul> <h4 id="2-substantial-barriers-to-entry-exist">2. 
Substantial Barriers to Entry Exist</h4> <ul> <li>&quot;Developing and maintaining a competitively viable search or search ad platform requires substantial investment in specialized knowledge, technology, infrastructure, and time. These markets are also characterized by significant scale effects&quot;</li> </ul> <h5 id="a-technology-and-specialization">a. Technology and Specialization</h5> <ul> <li>[no notes, extremely obvious to anyone technical who's familiar with the area]</li> </ul> <h5 id="b-substantial-upfront-investment">b. Substantial Upfront Investment</h5> <ul> <li>Enormous investments required. For example, in 2011, Google spent $5B on R&amp;D. And in 2010, MS spent more than $4.5B developing algorithms and building physical capacity for Bing</li> </ul> <h5 id="c-scale-effects">c. Scale Effects</h5> <ul> <li>More usage leads to better algorithms and greater accuracy w.r.t. what consumers want</li> <li>Also leads to greater number of advertisers</li> <li>Greater number of advertisers and consumers leads to better ad serving accuracy and better monetization of ads, which leads to better monetization for search engine, advertisers, and syndication partners</li> <li>Cyclical effect, &quot;virtuous cycle&quot;</li> <li>According to Microsoft, greatest barrier is obtaining sufficient scale. Losing $2B/yr trying to compete with Google, and Bing is the only horizontal search platform competing with Google</li> </ul> <h5 id="d-reputation-brand-loyalty-and-the-halo-effect">d. Reputation, Brand Loyalty, and the &quot;Halo Effect&quot;</h5> <ul> <li>[no notes]</li> </ul> <h5 id="e-exclusive-and-restrictive-agreements">e. Exclusive and Restrictive Agreements</h5> <ul> <li>&quot;Google's exclusive and restrictive agreements pose yet another barrier to entry, as many potential syndication partners with a high volume of customers are locked into agreements with Google.&quot;</li> </ul> <h3 id="b-google-has-engaged-in-exclusionary-conduct">B. GOOGLE HAS ENGAGED IN EXCLUSIONARY CONDUCT</h3> <ul> <li>&quot;Conduct may be judged exclusionary when it tends to exclude competitors 'on some basis other than efficiency,' i.e., when it 'tends to impair the opportunities of rivals' but 'either does not further competition on the merits or does so in an unnecessarily restrictive way.' In order for conduct to be condemned as 'exclusionary,' Staff must show that Google's conduct likely impairs the ability of its rivals to compete effectively, and thus to constrain Google's exercise of monopoly power&quot;</li> </ul> <h4 id="1-google-s-preferencing-of-google-vertical-properties-within-its-serp">1. Google's Preferencing of Google Vertical Properties Within Its SERP</h4> <ul> <li>&quot;Although we believe that this is a close question, we conclude that Google's preferencing conduct does not violate Section 2.&quot;</li> </ul> <h5 id="a-google-s-product-design-impedes-vertical-competitors">a. Google's Product Design Impedes Vertical Competitors</h5> <ul> <li>&quot;As a general rule, courts are properly very skeptical about claims that competition has been harmed by a dominant firm's product design changes. Judicial deference to product innovation, however, does not mean that a monopolist's product design decisions are per se lawful&quot;, United States v. 
Microsoft</li> <li>We evaluate, through the Microsoft lens of monopoly maintenance, whether Google took these actions to impede a nascent threat to Google's monopoly power</li> <li>&quot;Google's internal documents explicitly reflect - and testimony from Google executives confirms - a concern that Google was at risk of losing, in particular, highly profitable queries to vertical websites&quot;</li> <li>VP of product management Nicholas Fox: <ul> <li>&quot;[Google's] inability to serve this segment [of vertical lead generation] well today is negatively impacting our business. Query growth among high monetizing queries (&gt;$120 RPM) has declined to ~0% in the UK. US isn't far behind (~6%). There's evidence (e.g., UK Finance) that we're losing share to aggregators&quot;</li> </ul></li> <li>Threat to Google isn't vertical websites displacing Google, but that they'll undercut Google's power over the most lucrative segments of its search and search ads portfolio</li> <li>Additionally, vertical websites could help erode barriers to growth for general search competitors</li> </ul> <h5 id="b-google-s-serp-changes-have-resulted-in-anticompetitive-effects">b. Google's SERP Changes Have Resulted In Anticompetitive Effects</h5> <ul> <li>Google expanding its own offerings while demoting rival offerings caused significant drops in traffic to rivals, confirmed by Google's internal data</li> <li>Google's prominent placement of its own Universal Search properties led to gains in share of its own properties <ul> <li>&quot;For example, Google's inclusion of Google Product Search as a Universal Search result turned a property that the Google product team could not even get <i>indexed</i> by Google's web search results into the number one viewed comparison shopping website on Google&quot;</li> </ul></li> </ul> <h5 id="c-google-s-justifications-for-the-conduct">c. Google's Justifications for the Conduct</h5> <ul> <li>&quot;Product design change is an area of conduct where courts do not tend to strictly scrutinize asserted procompetitive justifications. In any event, Google's procompetitive justifications are compelling.&quot;</li> <li>Google argues design changes to SERP have improved product, provide consumers with &quot;better&quot; results</li> <li>Google notes that path toward Universal Search and OneBox predates concern about vertical threat</li> <li>Google justifies preferential treatment of Universal Search by asserting &quot;apples and oranges&quot; problem prevents Google from doing head-to-head comparison of its property vs. competing verticals, verticals and web results ranked with different criteria. This seems to be correct. <ul> <li>Microsoft says Bing uses a single signal, click-through-rate, that can be compared across Universal Search content and web search results</li> </ul></li> <li>Google claims that its Universal Search results are more helpful than &quot;blue links&quot; to other comparison shopping websites</li> <li>Google claims that showing 3rd party data would create technical and latency issues <ul> <li>&quot;The evidence shows that it would be technologically feasible to serve up third-party results in Google's Universal Search results. 
Indeed, Bing does this today with its flight vertical, serving up Kayak results and Google itself originally considered third-party OneBoxes&quot;</li> </ul></li> <li>Google defends &quot;demotion&quot; of competing vertical content, &quot;arguing that Google's algorithms are designed solely with the goal of improving a user's search experience&quot; <ul> <li>&quot;one aspect of Google's demotions that especially troubles Staff - and is not addressed by the above justification - is the fact that Google routinely, and prominently, displays its own vertical properties, while simultaneously demoting properties that are <i>identical</i> to its own, but for the fact that the latter are competing vertical websites&quot;, See Brin Tr. 79:16-81:24 (acknowledging the similarities between Google Product Search and its competitors); Fox Tr. 204:6-204:20 (acknowledging the similarities between Google Product Search and its competitors).</li> </ul></li> </ul> <h5 id="d-google-s-additional-legal-defenses">d. Google's Additional Legal Defenses</h5> <ul> <li>&quot;Google has argued - successfully in several litigations - that it owes no duty to assist in the promotion of a rival's website or search platform, and that it owes no duty to promote a rival's product offering over its own product offerings&quot;</li> <li>&quot;one reading of Trinko and subsequent cases is that Google is privileged in blocking rivals from its search platform unless its conduct falls into one of several specific exceptions referenced in Trinko&quot; <ul> <li>&quot;Alternatively, one may argue that Trinko should not be read so broadly as to overrule swathes of antitrust doctrine.&quot;</li> </ul></li> <li>&quot;Google has long argued that its general search results are opinions that are protected speech under the First Amendment, and that such speech should not be subject to government regulation&quot;; staff believes this is overbroad</li> <li>&quot;the evidence paints a complex portrait of a company working toward an overall goal of maintaining its market share by providing the best user experience, while simultaneously engaging in tactics that resulted in harm to many vertical competitors, and likely helped to entrench Google's monopoly power over search and search advertising&quot;</li> <li>&quot;The determination that Google's conduct is anticompetitive, and deserving of condemnation, would require an extensive balancing of these factors, a task that courts have been unwilling - in similar circumstances - to perform under Section 2. Thus, although it is a close question, Staff does not recommend that the Commission move forward on this cause of action.&quot;</li> </ul> <h4 id="2-google-s-scraping-of-rivals-vertical-content-1">2. Google's &quot;Scraping&quot; of Rivals' Vertical Content</h4> <ul> <li>&quot;We conclude that this conduct violates Section 2 and Section 5.&quot;</li> </ul> <h5 id="a-google-s-scraping-constitutes-a-conditional-refusal-to-deal-or-unfair-method-of-competition">a. Google's &quot;Scraping&quot; Constitutes a Conditional Refusal to Deal or Unfair Method Of Competition</h5> <ul> <li>Scraping and threats of refusal to deal with some competitors can be condemned as conditional refusal to deal under Section 2</li> <li>Post-Trinko, identification of circumstances (&quot;[u]nder certain circumstances, a refusal to cooperate with rivals can constitute anticompetitive conduct and violate § 2&quot;) &quot;subject of much debate&quot;</li> <li>Aspen Skiing Co. v. 
Aspen Highlands Skiing Corp: defendant (owner of 3 of 4 ski areas in Aspen) canceled all-ski-area ticket with plaintiff (owner of 4th ski area in Aspen) <ul> <li>After demanding an increased share of profit, defendant canceled ticket and rejected &quot;increasingly desperate measures&quot; to recreate joint ticket, even rejected plaintiff's offer to buy tickets at retail price</li> <li>Supreme Court upheld jury's finding of liability; Trinko court: &quot;unilateral termination of a voluntary (and thus presumably profitable) course of dealing suggested a willingness to forsake short-term profits to achieve an anticompetitive end. Similarly, the defendant's unwillingness to renew the ticket even if compensated at retail price revealed a distinctly anticompetitive bent&quot;</li> </ul></li> <li>Appellate courts have focused on Trinko's reference to &quot;unilateral termination of a voluntary course of dealing&quot;, e.g., in American Central Eastern Texas Gas Co. v. Duke Energy Fuels LLC, Fifth Circuit upheld determination that defendant natural gas processor's refusal to contract with competitor for additional capacity was unlawful <ul> <li>Plaintiff contracted with defendant for processing capacity; after two years, defendant proposed terms it &quot;knew were unrealistic or completely unviable ... in order to exclude [the plaintiff] from competition with [the defendant] in the gas processing market.&quot;</li> </ul></li> <li>Case here is analogous to Aspen Skiing and Duke Energy [a lot of detail not written down in notes here]</li> </ul> <h5 id="b-google-s-scraping-has-resulted-in-anticompetitive-effects">b. Google's &quot;Scraping&quot; Has Resulted In Anticompetitive Effects</h5> <ul> <li>Scraping has lessened the incentives of competing websites like Yelp, TripAdvisor, CitySearch, and Amazon to innovate, diminishes incentives of other vertical websites to develop new products <ul> <li>entrepreneurs more reluctant to develop new sites, investors more reluctant to sponsor development when Google can use its monopoly power to appropriate content it deems lucrative</li> </ul></li> </ul> <h5 id="c-google-s-scraping-is-not-justified-by-efficiencies">c. Google's &quot;Scraping&quot; Is Not Justified By Efficiencies</h5> <ul> <li>&quot;Marissa Mayer and Sameer Samat testified that it was extraordinarily difficult for Google, as a technical matter, to remove sites like Yelp from Google Local without also removing them from web search results&quot; <ul> <li>&quot;Google's almost immediate compliance after Yelp sent a formal 'cease and desist' letter to Google, however, suggests that the &quot;technical&quot; hurdles were not a significant factor in Google's refusal to comply with repeated requests to remove competitor content from Google Local&quot;</li> <li>Partners can opt out of inclusion with Google's vertical news offering, Google News</li> <li>&quot;Similarly, Google's almost immediate removal of Amazon product reviews from Google Product Search indicates that technical barriers were quickly surmounted when Google desired to accommodate a partner.&quot;</li> </ul></li> <li>&quot;In sum, the evidence shows that Google used its monopoly position in search to scrape content from rivals and to improve its own complementary vertical offerings, to the detriment of those rivals, and without a countervailing efficiency justification. Google's scraping conduct has helped it to maintain, preserve, and enhance Google's monopoly position in the markets for search and search advertising. 
Accordingly, we believe that this conduct should be condemned by the Commission.&quot;</li> </ul> <h4 id="3-google-s-api-restrictions-1">3. Google's API Restrictions</h4> <ul> <li>&quot;We conclude that Google's API restrictions violate Section 2.&quot;</li> <li>AdWords API procompetitive development</li> <li>But restrictive conditions in API usage agreement anticompetitive, without offsetting procompetitive benefits</li> <li>&quot;Should the restrictive conditions be found to be unreasonable restraints of trade, they could be removed today instantly, with no adverse effect on the functioning of the API. Any additional engineering required to make the advertiser data interoperable with other search networks would be supplied by other market participants. Notably, because Google would not be required to give its competitors access to the AdWords API, there is no concern about whether Google has a duty to deal with its competitors&quot;</li> </ul> <h5 id="a-the-restrictive-conditions-are-unreasonable">a. The Restrictive Conditions Are Unreasonable</h5> <ul> <li>Restrictive conditions limit ability of advertisers to use their own data, prevent the development and sale of 3rd party tools and services that would allow automated campaign management across multiple search networks</li> <li>&quot;Even Google is constrained by these restrictions, having had to forgo improving its DART Search tool to offer such capabilities, despite internal estimates that such functionality would benefit Google and advertisers alike&quot;</li> <li>Restrictive conditions have no procompetitive virtues, anticompetitive effects are substantial</li> </ul> <h5 id="b-the-restrictive-conditions-have-resulted-in-anticompetitive-effects">b. The Restrictive Conditions Have Resulted In Anticompetitive Effects</h5> <ul> <li>Restrictive conditions reduce innovation, increase transaction costs, degrade quality of Google's rivals in search and search advertising</li> <li>Several SEMs forced to remove campaign cloning functionality by Google; Google's restrictive conditions stopped cross-network campaign management tool market segment in its infancy</li> <li>Restrictive conditions increase transaction costs for all advertisers other than those large enough to make internal investments to develop their own tools [doesn't it also, in some amortized fashion, increase transaction costs for companies that can build their own tools?]</li> <li>Result is that advertisers spend less on non-dominant search networks, reducing quality of ads on non-dominant search networks</li> </ul> <h5 id="c-the-restrictive-conditions-are-not-justified-by-efficiencies">c. 
The Restrictive Conditions Are Not Justified By Efficiencies</h5> <ul> <li>Concern about &quot;misaligned incentives&quot; is Google's only justification for restrictive conditions; concern is that SEMs and agencies would adopt a &quot;lowest common denominator&quot; approach and degrade AdWords campaign performance</li> <li>&quot;The evidence shows that this justification is unsubstantiated and is likely a pretext&quot;</li> <li>&quot;In brief, these third parties' incentives are highly aligned with Google's interests, precisely the opposite of what Google contends.&quot;</li> <li>Google unable to identify any examples of ill effects from misaligned incentives</li> <li>Terms and Conditions already have conditions for minimum functionality that prevents lowest common denominator concern from materializing</li> <li>Documents suggest restrictive conditions were not about &quot;misaligned incentives&quot;: <ul> <li>&quot;Sergey [Brin] and Larry [Page] are big proponents of a protectionist strategy that prevents third party developers from building offerings which promote the consolidated management of [keywords] on Google and Overture (and whomever else).&quot;</li> <li>In a 2004 doc, API product manager was looking for &quot;specific points on how we can prevent a new entrant (MSN Ad Network) from benefitting from a common 3rd party platform that is cross-network.&quot;</li> <li>In a related presentation, Google lists as a concern, &quot;other competitors are buoyed by lowered barriers to entry&quot;; options to prevent this were &quot;applications must have Google-centric UI functions and branding&quot; and &quot;disallow cross-network compatible applications from using API&quot;</li> </ul></li> </ul> <h4 id="4-google-s-exclusive-and-restrictive-syndication-agreements-1">4. Google's Exclusive and Restrictive Syndication Agreements</h4> <ul> <li>&quot;Staff has investigated whether Google has entered into anticompetitive, exclusionary agreements with websites for syndicated search and search advertising services (AdSense agreements) that serve to maintain, preserve, or enhance Google's monopoly power in the markets for search, search advertising, or search and search advertising syndication (search intermediation). We conclude that these agreements violate Section 2.&quot;</li> </ul> <h5 id="a-google-s-agreements-foreclose-a-substantial-portion-of-the-relevant-market">a. Google's Agreements Foreclose a Substantial Portion of the Relevant Market</h5> <ul> <li>&quot;Exclusive deals by a monopolist harm competition by foreclosing rivals from needed relationships with distributors, suppliers, or end users. For example, in Microsoft, then-defendant Microsoft's exclusive agreements with original equipment manufacturers and software vendors were deemed anticompetitive where they were found to prevent third parties from installing rival browser Netscape, thus foreclosing Netscape from the most efficient distribution channel, and helping Microsoft to preserve its operating system monopoly. The fact that an agreement is not explicitly exclusive does not preclude a finding of liability.&quot;</li> <li>[notes on legal background of computing foreclosure percentage omitted]</li> <li>Staff relied on ComScore dataset to compute foreclosure; per Microsoft's own data, Microsoft and Yahoo's syndicated query volume is higher than in ComScore, resulting in a lower foreclosure number. &quot;We are trying to get to the bottom of this discrepancy now. 
However, based on our broader understanding of the market, we believe that the ComScore set more accurately reflects the relative query shares of each party.&quot; [I don't see why staff should believe that ComScore is more accurate than Microsoft's numbers — I would guess the opposite]</li> <li>[more notes on foreclosure percentage omitted]</li> </ul> <h5 id="b-google-s-agreements-have-resulted-in-anticompetitive-effects">b. Google's Agreements Have Resulted In Anticompetitive Effects</h5> <ul> <li>Once foreclosure is established as being above &quot;safe harbor&quot; levels, need a qualitative, rule of reason analysis of market effects</li> <li>Google's exclusive agreements impact immediate market for search and search syndication advertising and have broader effects in markets for search and search advertising</li> <li>In search and search ad syndication (search intermediation), exclusivity precludes some of the largest and most sophisticated publishers from using competing platforms. Publishers can't credibly threaten to shift some incremental business to other platforms to get price concessions from Google <ul> <li>Google's aggressive reduction of revenue shares to customers without significant resistance =&gt; agreements seem to be further entrenching Google's monopoly position</li> </ul></li> <li>An objection to this could be that Google's business success is because its product is superior <ul> <li>This argument rests on fallacious assumption that Bing's <i>average</i> monetization gap is <i>consistent</i> across the board</li> </ul></li> <li>[section on CityGrid impact omitted; this section speaks to broader market effects]</li> <li>Google insists that incremental traffic to Microsoft would be trivial; Microsoft indicates it would be &quot;very meaningful&quot; <ul> <li>Not enough evidence for definitive conclusion, but &quot;internal Google documents suggest that Microsoft's view of things may be closer to the truth&quot;; Google's interest in renewing deals was in part to prevent Microsoft from gaining scale. Internal Google analysis of 2010 AOL renewal: &quot;AOL holds marginal search share but represents scale gains for a Microsoft + Yahoo! partnership. AOL/Microsoft combination has modest impact on market dynamics, but material increase in scale of Microsoft's search &amp; ads platform&quot;</li> <li>When informed that &quot;Microsoft [is] aggressively wooing AOL with large guarantees,&quot; a Google exec responded with: &quot;I think the worse case scenario here is that AOL users get sent to Bing, so even if we make AOL a bit more competitive relative to Google, that seems preferable to growing Bing.&quot;</li> <li>Google internal documents show they pursued AOL deal aggressively even though AOL represented &quot;[a] low/no profit partnership for Google.&quot;</li> </ul></li> <li>Evidence is that, in near-term, removing exclusivity would not have dramatic impact; largest and most sophisticated publishers would shift modest amounts of traffic to Bing</li> <li>Most significant competitive benefits realized over longer period of time <ul> <li>&quot;Removing exclusivity may open up additional opportunities for both established and nascent competitors, and those opportunities may spur more significant changes in the market dynamics as publishers have the opportunity to consider - and test - alternatives to Google's AdSense program.&quot;</li> </ul></li> </ul> <h5 id="c-google-s-agreements-are-not-justified-by-efficiencies">c. 
Google's Agreements Are Not Justified By Efficiencies</h5> <ul> <li>Google has given three business justifications for exclusive and restrictive syndication agreements <ul> <li>Long-standing industry practice of exclusivity, dating from when publishers demanded large, guaranteed revenue share payments regardless of performance <ul> <li>&quot;guaranteed revenue shares are now virtually non-existent&quot;</li> </ul></li> <li>&quot;Google is simply engaging in a vigorous competition with Microsoft for exclusive agreements&quot; <ul> <li>&quot;Google may argue that the fact that Microsoft is losing in a competitive bidding process (and indeed, not competing as vigorously as it might otherwise) is not a basis on which to condemn Google. However, Google has effectively created the rules of today's game, and Microsoft's substantial monetization disadvantage puts it in a poor competitive position to compete on an all-or-nothing basis.&quot;</li> </ul></li> <li>&quot;user confusion&quot; — &quot;Google claims that it does not want users to confuse a competitor's poor advertisements with its own higher quality advertisements&quot; <ul> <li>&quot;This argument suffers both from the fact that it is highly unlikely that users care about the source of the ad, as well as the fact that, if users did care, less restrictive alternatives are clearly available. Google has not explained why alternatives such as labeling competitor advertisements as originating from the competitor are unavailing here.&quot;</li> <li>&quot;Google's actions demonstrate that &quot;user confusion&quot; is not a significant concern. In 2008 Google attempted to enter into a non-exclusive agreement with Yahoo! to supplement Yahoo!'s search advertising platform. Under the proposed agreement, Yahoo! would return its own search advertising, but supplement its inventory with Google search advertisements when Yahoo! did not have sufficient inventory. Additionally, Google has recently eliminated its &quot;preferred placement&quot; restriction for its online partners.&quot;</li> </ul></li> </ul></li> <li>Rule of reason analysis shows strong evidence of market protected by high entry barriers</li> <li>Despite limitations to evidence, market is inarguably not robustly competitive today <ul> <li>Google has been unilaterally reducing revenue share with apparent impunity</li> </ul></li> </ul> <h2 id="iv-potential-remedies">IV. POTENTIAL REMEDIES</h2> <h3 id="a-scraping">A. Scraping</h3> <ul> <li>At least two possible remedies</li> <li>Opt-out to remove snippets of content from Google's vertical properties, while retaining placement in web search results and/or Universal Search results on main SERP</li> <li>Google could be required to limit use of content it indexes for web search (could only use content in returning the property in its search results, but not for determining its own product or local rankings) unless given explicit permission</li> </ul> <h3 id="b-api-restrictions">B. API Restrictions</h3> <ul> <li>Require Google to remove problematic contractual restrictions; no technical fixes necessary <ul> <li>SEMs report that technology for cross-compatibility already exists, will quickly flourish if unhindered by Google's contractual constraints</li> </ul></li> </ul> <h3 id="c-exclusive-and-restrictive-syndication-agreements">C. 
Exclusive and Restrictive Syndication Agreements</h3> <ul> <li>Most appropriate remedy is to enjoin Google from entering exclusive agreements with search syndication partners, and to require Google to loosen restrictions surrounding AdSense partners' use of rival search ads</li> </ul> <h2 id="v-litigation-risks">V. LITIGATION RISKS</h2> <ul> <li>Google does not charge customers, and they are not locked into Google</li> <li>Universal Search has resulted in substantial benefit to users</li> <li>Google's organization and aggregation of content adds value to product for customers</li> <li>Largest advertisers advertise on both Google AdWords and Microsoft AdCenter</li> <li>Most efficient channel through which Bing can gain scale is Bing.com</li> <li>Microsoft has the resources to purchase distribution where it sees greatest value</li> <li>Most website publishers happy with AdSense</li> </ul> <h2 id="vi-conclusion">VI. CONCLUSION</h2> <ul> <li>&quot;Staff concludes that Google's conduct has resulted - and will result - in real harm to consumers and to innovation in the online search and advertising markets. Google has strengthened its monopolies over search and search advertising through anticompetitive means, and has forestalled competitors' and would-be competitors' ability to challenge those monopolies, and this will have lasting negative effects on consumer welfare&quot; <ul> <li>&quot;Google has unlawfully maintained its monopoly over general search and search advertising, in violation of Section 2, or otherwise engaged in unfair methods of competition, in violation of Section 5, by scraping content from rival vertical websites in order to improve its own product offerings.&quot;</li> <li>&quot;Google has unlawfully maintained its monopoly over general search, search advertising, and search syndication, in violation of Section 2, or otherwise engaged in unfair methods of competition, in violation of Section 5, by entering into exclusive and highly restrictive agreements with web publishers that prevent publishers from displaying competing search results or search advertisements.&quot;</li> <li>&quot;Google has unlawfully maintained its monopoly over general search and search advertising, in violation of Section 2, or otherwise engaged in unfair methods of competition, in violation of Section 5, by maintaining contractual restrictions that inhibit the cross-platform management of advertising campaigns.&quot;</li> </ul></li> <li>&quot;For the reasons set forth above, Staff recommends that the Commission issue the attached complaint.&quot;</li> <li>Memo submitted by Barbara R. Blank, approved by Geoffrey M. 
Green and Melanie Sabo</li> </ul> <h1 id="ftc-be-staff-memo">FTC BE staff memo</h1> <p>&quot;Bureau of Economics</p> <p>August 8, 2012</p> <p>From: Christopher Adams and John Yun, Economists&quot;</p> <h2 id="executive-summary-1">Executive Summary</h2> <ul> <li>Investigation into anticompetitive conduct started June 2011</li> <li>Staff presented theories and evidence February 2012</li> <li>This memo offers our final recommendation</li> <li>Four theories of harm <ul> <li>preferencing of search results by favoring own web properties over rivals</li> <li>exclusive agreements with publishers and vendors that deprive rival platforms of users and advertisers</li> <li>restrictions on porting advertiser data to rival platforms</li> <li>misappropriating content from Yelp and TripAdvisor</li> </ul></li> <li>&quot;our guiding approach must be beyond collecting complaints and antidotes [presumably meant to be anecdotes?] from competitors who were negatively impacted from a firm's various business practices.&quot;</li> <li>Market power in search advertising <ul> <li>Google has &quot;significant&quot; share, 65% of paid clicks and 53% of ad impressions among top 5 U.S. search engines</li> <li>Market power may be mitigated by the fact that 80% of users use a search engine other than Google</li> <li>Empirical evidence consistent with search and non-search ads being substitutes, and with Google considering vertical search engines to be competitors</li> </ul></li> <li>Preferencing theory <ul> <li>Theory is that Google is blending its proprietary content with customary &quot;blue links&quot; and demoting competing sites</li> <li>Google has limited ability to impose significant harm on vertical rivals because it accounts for 10% to 20% of traffic to them. Effect is very small and not statistically significant <ul> <li>[Funny that something so obviously wrong at the time and also seemingly wrong in retrospect was apparently taken seriously]</li> </ul></li> <li>Universal Search was a procompetitive response to pressure from vertical sites and an improvement for users</li> </ul></li> <li>Exclusive agreements theory <ul> <li>Access to a search engine's site (i.e., not dependent on 3rd party agreement) is most efficient and common distribution channel, which is not impeded by Google. Additionally, strong reasons to doubt that search toolbars and default status on browsers can be viewed as &quot;exclusives&quot; because users can easily switch (on desktop and mobile) <ul> <li>[statement implies another wrong model of what's happening here]</li> <li>[Specifically on easy switching on mobile, <a href="https://twitter.com/danluu/status/1016164712030134272">there's Google's actual blocking of changing the default search engine from Google to what the user wants</a>, but we also know that a huge fraction of users basically don't understand what's happening and can't make an informed decision to switch — if this weren't the case, it wouldn't make sense for companies to bid so high for defaults, e.g. supposedly $26B/yr to obtain default search engine status on iOS; if users simply switched freely, default status would be worth close to $0. Since this payment is, at the margin, pure profit and Apple's P/E ratio is 29.53 as of my typing this sentence, a quick and dirty estimate is that $776B of Apple's market cap is attributable to taking this payment vs. 
randomly selecting a default]</li> <li>[In addition to explicit, measurable coercion like the above, there were also things like <a href="https://twitter.com/danluu/status/1019758649739304960">Google pressuring Samsung into shutting down their Android Browser effort in 2012</a>; although enforcing a search engine default on Android was probably not the primary driver of that or other similar pressure that Google applied, many of these sorts of things also had the impact of funneling users into Google on mobile; these economists seem to like the incentive-based argument that users will use the best product, so the result we see in the market is the best product winning, but if that's the case, why do companies spend so much effort on ecosystem lock-in, including but not limited to supposedly paying $18B/yr to own the default setting in one browser? I guess the argument here is that companies are behaving completely irrationally in expending so much effort here, but consumers are behaving perfectly rationally and are fully informed and are not influenced by all of this spending at all?]</li> <li>In search syndication, Microsoft and Yahoo have a combined share greater than Google's</li> <li>No support for assertion that rivals' access to users has been impaired by Google. MS and Yahoo have had a steady 30% share for years; their query volume has grown faster than Google's since the alliance was announced <ul> <li>[Another odd statement; at the time, observers didn't see Bing staying competitive without heavy subsidies from MS, and then MS predictably stopped subsidizing Bing as a big bet and its market share declined. Google's search market share is well above 90% and hasn't been below 90% since the BE memo was written; in the U.S., estimates put Google around 90% share, some a bit below and some a bit above, with low estimates at something like 87%. It's odd that someone could look at the situation at the time and not see that this was about to happen]</li> </ul></li> <li>In December 2011, Microsoft had access to query volume equivalent to what Google had 2 years ago, thus difficult to infer that Microsoft is below some threshold of query volume <ul> <li>[this exact argument was addressed in the BC memo; the BE memo does not appear to refute the BC memo's argument]</li> <li>[As with a number of the above arguments, this is a strange argument if you understand the dynamics of fast-growing tech companies. When you have rapidly growing companies in markets with network effects or scale effects, being the same absolute size as a competitor a number of years ago doesn't mean that you're in an ok position. We've seen this play out in a ton of markets and it's fundamental to why VCs shovel so much money at companies in promising markets — being a couple years behind often means you get crushed or, if you're lucky, end up as an also-ran that's fighting an uphill battle against scale effects]</li> </ul></li> <li>Characteristics of online search market not consistent with Google buying distribution agreements to raise input costs of rivals</li> </ul></li> <li>Restrictions on porting advertiser data to AdWords API <ul> <li>Theory is that Google's terms and conditions for AdWords API anticompetitively disadvantage Microsoft's adCenter</li> <li>Introduction of API with co-mingling restriction made users and Google better off and rivals' costs were unaffected. Any objection therefore implies that when Google introduced the API, it had an obligation to allow its rivals to benefit from increased functionality. 
Significant risks to long-term innovation incentives from imposing such an obligation [Huh, this seems very weird]</li> <li>Advertisers responsible for overwhelming majority of search ad spend use both Google and Microsoft. Multi-homing advertisers of all sizes spend a significant share of budget on Microsoft [this exact objection is addressed in BC memo]</li> <li>Evidence from SEMs and end-to-end advertisers suggests policy's impact on ad spend on Microsoft's platform is negligible [it's hard to know how seriously to take this considering the comments on Yelp, above — the model of how tech businesses work seems very wrong, which casts doubt on other conclusions that necessarily require having some kind of model of how this stuff works]</li> </ul></li> <li>Scraping allegation is that Google has misappropriated content from Yelp and TripAdvisor <ul> <li>Have substantive concerns. Solution proposed in Annex 11</li> <li>To be an antitrust violation, need strong evidence that it increased users on Google at the expense of Yelp or TripAdvisor or decreased incentives to innovate. No strong evidence of either [per above comments, this seems wrong]</li> </ul></li> <li>Recommendation: recommend investigation be closed</li> </ul> <h2 id="1-does-google-possess-monopoly-power-in-the-relevant-antitrust-market">1. Does Google possess monopoly power in the relevant antitrust market?</h2> <ul> <li>To be in violation of Section 2 of the Sherman Act, Google needs to be a monopoly or have substantial market power in a relevant market</li> <li>Online search similar to any other advertising</li> <li>Competition between platforms for advertisers depends on extent to which advertisers consider users on one platform to be substitutes for users on another</li> <li>Google's market power depends on share of internet users</li> <li>If advertisers can access Google's users on other search platforms, such as Yahoo, Bing, and Facebook, &quot;Google's market power is a lot less&quot;</li> <li>Substantial evidence contradicting proposition that Google has substantial market power in search advertising</li> <li>Google's share is large. In Feb 2012, 65% of paid search clicks of top 5 general search engines went through Google, up from 55% in Sep 2008; these figures show Google offers advertisers what they want</li> <li>Advertisers want &quot;eyeballs&quot;</li> <li>Users multi-home. About 80% of users use a platform other than Google in a given month, so advertisers can get the same eyeballs elsewhere <ul> <li>Advertiser can get in front of a user on a different query on Yahoo or another search engine</li> <li>[this is also odd reasoning — if a user uses Google for searches by default, but occasionally stumbles across Yahoo or Bing, this doesn't meaningfully move the needle for an advertiser; the evidence here is comScore saying that 20% of users only use Google, 15% never use Google, and 65% use Google + another search engine; but it's generally accepted that comScore numbers are quite off. Shortly after the report was written, I looked at various companies that reported metrics (Alexa, etc.) 
and found them to be badly wrong; I don't think it would be easy to dig up the exact info I used at the time now, but on searching for &quot;comscore search engine market accuracy&quot;, the first hit I got was someone explaining that while, today, comScore shows that Google has an implausibly low 67% market share, an analysis of traffic to sites this company has access to showed that Google much more plausibly drove 85% of clicks; it seems worth mentioning that comScore is often considered inaccurate]</li> </ul></li> <li>Firm-level spend on search ads and display ads is negatively correlated <ul> <li>[this seems plausible? The evidence in the BC memo for these being complements seemed like a stretch; maybe it's true, but the BE memo's position seems much more plausible]</li> <li>No claim that these are the same market, but can't conclude that they're unrelated</li> </ul></li> <li>Google competes with specialized search engines, similar to a supermarket competing with a convenience store [details on this analogy elided; this memo relies heavily on analogies that relate tech markets to various non-tech markets, some of which were also elided above] <ul> <li>For advertising on a search term like &quot;Nikon 5100&quot;, Amazon may provide a differentiated but competing product</li> </ul></li> <li>Google is leading seller of search, but this is mitigated by large proportion of users who also use other search engines, by substitution of display and search advertising, by competition in vertical search</li> </ul> <h2 id="theory-1-the-preferencing-theory">Theory 1: The preferencing theory</h2> <h3 id="2-1-overview">2.1 Overview</h3> <ul> <li>Preferencing theory is that Google's blending of content such as shopping comparison results and local business listings with customary blue links disadvantages competing content sites, such as Nextag, eBay, Yelp, and TripAdvisor</li> </ul> <h3 id="2-2-analysis">2.2 Analysis</h3> <ul> <li>Blends have two effects: negatively impacting traffic to specialized vertical sites by pushing those sites down and changing Google's incentives to show competing vertical sites</li> <li>Empirical questions <ul> <li>&quot;To what extent does Google account for the traffic to vertical sites?&quot;</li> <li>&quot;To what extent do blends impact the likelihood of clicks to vertical sites?&quot;</li> <li>&quot;To what extent do blends improve consumer value from the search results?&quot;</li> </ul></li> </ul> <h3 id="2-3-empirical-evidence">2.3 Empirical evidence</h3> <ul> <li>Google search responsible for 10% of traffic to shopping comparison sites, 17.5% to local business search sites. &quot;See Annex 4 for a complete discussion of our platform model&quot; <ul> <li>[Annex 4 doesn't appear to be included; but, as discussed above, the authors' model of how traffic works seems to be wrong]</li> </ul></li> <li>When blends appear, from Google's internal data, clicks to other shopping comparison sites drop by a large and statistically significant amount. For example, if a site had a pre-blend CTR of 9%, post-blend CTR would be 5.3%, but a blend isn't always presented</li> <li>For local, pre-blend CTR of 6% would be reduced to 5.4%; local blends have smaller impact than shopping</li> <li>&quot;above result for shopping comparison sites is not the same as finding that overall traffic from Google to shopping sites declined due to universal search. 
As we describe below, if blends represent a quality improvement, this will increase demand and drive greater query volume on Google, which will boost traffic to all sites.&quot;</li> <li>All links are substitutes, so we can infer that if a user clicks on ads less, they prefer the content and the user is getting more value. Overall results indicate that blends significantly increase consumer value <ul> <li>[this seems obviously wrong unless the blend is presented with the same visual impact, weight, and position as normal results, which isn't the case at all — I don't disagree that the blend is probably better for consumers, but this methodology seems like a classic misuse of data to prove a point]</li> </ul></li> </ul> <h3 id="2-4-documentary-evidence">2.4 Documentary evidence</h3> <ul> <li>Since the 90s, general search engines have incorporated vertical blends</li> <li>All major search engines use blends</li> </ul> <h3 id="2-5-summary-of-the-preferencing-theory">2.5 Summary of the preferencing theory</h3> <ul> <li>Google not significant enough source of traffic to foreclose its vertical rivals [as discussed above, the model for this statement is wrong]</li> </ul> <h2 id="theory-2-exclusionary-practices-in-search-distribution">Theory 2: Exclusionary practices in search distribution</h2> <h3 id="3-1-overview">3.1 Overview</h3> <ul> <li>Theory is that Google is engaging in exclusionary practices in order to deprive Microsoft of economies of scale</li> <li>Foundational issues <ul> <li>Are Google's distribution agreements substantially impairing opportunity of rivals to compete for users?</li> <li>What's the empirical evidence users are being excluded and denied?</li> <li>What's the evidence that Microsoft is at a disadvantage in terms of scale?</li> </ul></li> </ul> <h3 id="3-2-are-the-various-google-distribution-agreements-in-fact-exclusionary">3.2 Are the various Google distribution agreements in fact exclusionary?</h3> <ul> <li>&quot;Exclusionary agreements merit scrutiny when they materially reduce consumer choice and substantially impair the opportunities of rivals&quot;</li> <li>On desktop, users can access search engine directly, via web browser search box, or a search toolbar</li> <li>73% of desktop search through direct navigation, all search engines have equal access to consumers in terms of direct access; &quot;Consequently, Google has no ability to impair the opportunities of rivals in the most important and efficient desktop distribution channel.&quot; <ul> <li>[once again, this model seems wrong — if it wasn't wrong, companies wouldn't pay so much to become a search default, including shady stuff like <a href="https://twitter.com/danluu/status/887724695558205440">Google paying shady badware installers to make Chrome / Google default on people's desktops</a>. Another model is that if a user uses a search engine because it's a default, this changes the probability that they'll use the search engine via &quot;direct access&quot;; compared to the BE staff model, it's overwhelmingly likely that this model is correct and the BE staff model is wrong]</li> <li>Microsoft is search default on Internet Explorer and 70% of PCs sold</li> </ul></li> <li>For syndication agreements, Google has a base template that contains premium placement provision. This is to achieve minimum level of remuneration in return for Google making its search available. 
Additionally, the clause is often subject to negotiation and can be modified <ul> <li>[this negotiation thing is technically correct, but doesn't address the statement about this brought up in the BC memo; many, perhaps most, of the points in this memo have been refuted by the BC memo, and the strategy here seems to be to ignore the refutations without addressing them]</li> <li>&quot;By placing its entire site or suite of sites up for bid, publishers are able to bargain more effectively with search engines. This intensifies the ex ante competition for the contract and lowers publishers' costs. Consequently, eliminating the ability to negotiate a bundled discount, or exclusivity, based on site-wide coverage will result in higher prices to publishers.&quot; [this seems to contradict what we observe in practice?]</li> <li>&quot;This suggests that to the extent Google is depriving rivals such as Microsoft of scale economies, this is a result of 'competition on the merits'— much the same way as if Google had caused Microsoft to lose traffic because it developed a better product and offered it at a lower price.&quot;</li> </ul></li> <li>Have Google's premium placement requirements effectively denied Microsoft access to publishers? <ul> <li>Can approach this by considering market share. Google 44%, including AOL and Ask. MS 31%, including Yahoo. Yahoo 25%. Combined, Yahoo and MS are at 56%. &quot;Thus, combined, Microsoft and Yahoo's syndication shares are higher than their combined shares in a general search engine market&quot; [as noted previously, these stats didn't seem correct at the time and have gotten predictably less directionally correct over time]</li> </ul></li> <li>What would MS's volume be without Google's exclusionary restrictions <ul> <li>At most a 5% change because Google's product is so superior [this seems to ignore the primary component of this complaint, which is that there's a positive feedback cycle]</li> </ul></li> <li>Search syndication agreements <ul> <li>Final major distribution channel is mobile search</li> <li>U.S. marketshare: Android 47%, iOS 30%, RIM 16%, MS 5%</li> <li>Android and iOS grew from 30% to 77% from December 2009 to December 2011, primarily due to decline of RIM, MS, and Palm</li> <li>Mobile search is 8%. Thus, &quot;small percentage of overall queries and an even smaller percentage of search ad revenues&quot; <ul> <li>[The implication here appears to be that mobile is small and unimportant, which was obviously untrue at the time to any informed observer — I was at Google shortly after this was written and the change was made to go &quot;mobile first&quot; on basically everything because it was understood that mobile was the future; this involved a number of product changes that significantly degraded the experience on desktop in order to make the mobile experience better; this was generally considered not only a good decision, but the only remotely reasonable decision. Google was not alone in making this shift at the time. How economists studying this market didn't understand this after interviewing folks at Google and other tech companies is mysterious]</li> </ul></li> <li>Switching cost on mobile implied to be very low, &quot;a few taps&quot; [as noted previously, the staggering amount of money spent on being a mobile default and Google's commit linked above indicate this is not true]</li> <li>Even if switching costs were significant, there's no remedy here. 
&quot;Too many choices lead to consumer confusion&quot;</li> <li>Repeat of point that barrier to switching is low because it's &quot;a few taps&quot;</li> <li>&quot;Google does not require Google to be the default search engine in order to license the Android OS&quot; [seems technically correct, but misleading at best when taken as part of the broader argument here]</li> <li>OEMs choose Google search as default for market-based reasons and not because their choice is restricted [this doesn't address the commit linked above that actually prevents users from switching the default away from Google; I wonder what the rebuttal to that would be, perhaps also that user choice is bad and confusing to users?]</li> </ul></li> <li>Opportunities available to Microsoft are larger than indicated by marketshare</li> <li>Summary <ul> <li>Marketshare could change quickly; two years ago, Apple and Google only had 30% share</li> <li>Default of Google search not anticompetitive and mobile a small volume of queries, &quot;although this is changing rapidly&quot;</li> <li>Basically no barrier to user switching, &quot;a few taps and downloading other search apps can be achieved in a few seconds. These are trivial switching costs&quot; [as noted above, this is obviously incorrect to anyone who understands mobile, especially the part about downloading an app not being a barrier; I continue to find it interesting that the economists used market-based reasoning when it supports the idea that the market is perfectly competitive, with no switching costs, etc., but decline to use market-based reasoning, such as noting the staggeringly high sums paid to set default search, when it supports the idea that the market is not a perfectly competitive market with no switching costs, etc.]</li> </ul></li> </ul> <h3 id="3-3-are-rival-search-engines-being-excluded-from-the-market">3.3 Are rival search engines being excluded from the market?</h3> <ul> <li>Prior section found that Google's distribution agreements don't impair opportunity of rivals to reach users. But could it have happened? We'll look at market shares and growth trends to determine this</li> <li>&quot;We note that the evidence of Microsoft and Yahoo's share and growth cannot, even in theory, tell us whether Google's conduct has had a significant impact. Nonetheless, if we find that rival shares have grown or not diminished, this fact can be informative. 
Additionally, assuming that Microsoft would have grown dramatically in the counterfactual, despite the fact that Google itself is improving its product, requires a level of proof that must move beyond speculation.&quot; [as an extension of the above, the economists are happy to speculate or even 'move beyond speculation' when it comes to applying speculative reasoning on user switching costs, but apparently not when it comes to inferences that can be made about marketshare; why the drastic difference in the standard of proof?]</li> <li>Microsoft and Yahoo's share shows no sign of being excluded, steady 30% for 4 years [as noted in a previous section, the writing was on the wall for Bing and Yahoo at this time, but apparently this would &quot;move beyond speculation&quot; and is not noted here]</li> <li>Since announcement of MS / Yahoo alliance, MS query volume has grown faster than Google [this is based on comScore qSearch data and the more detailed quoted claim is that MS query volume increased 134% while Google volume increased 54%; as noted above, this seems like an inaccurate metric, so it's not clear why this would be used to support this point, and it's also misleading at best]</li> <li>MS-Yahoo have the same number of search engine users as Google in a given month [again, as noted above, this appears to come from incorrect data and is also misleading at best because it counts a single use in a month as equivalent to using something many times a day]</li> </ul> <h3 id="3-4-does-microsoft-have-sufficient-scale-to-be-competitive">3.4 Does Microsoft have sufficient scale to be competitive?</h3> <ul> <li>In a meeting with Susan Athey, Microsoft could not demonstrate that they had data definitively showing how the cost curve changes as click data changes, &quot;thus, there is basis for suggesting Microsoft is below some threshold point&quot; [the use of the phrase &quot;threshold point&quot; demonstrates either a use of sleight of hand or a lack of understanding of how it works; the BE memo seems to prefer the idea that it's about some threshold since this could be supported by the argument that, if such a threshold were to be demonstrated, Microsoft's growth would have or will carry it past the threshold, but it doesn't make any sense that there would be a threshold; also, even if this were important, having a single meeting where Microsoft wasn't able to answer this immediately would be weak evidence]</li> <li>[many more incorrect comments in the same vein as the above omitted for brevity]</li> <li>&quot;Finally, Microsoft's public statements are not consistent with statements made to antitrust regulators. Microsoft CEO Steve Ballmer stated in a press release announcing the search agreement with Yahoo: 'This agreement with Yahoo! will provide the scale we need to deliver even more rapid advances in relevancy and usefulness. Microsoft and Yahoo! know there's so much more that search could be.
This agreement gives us the scale and resources to create the future of search.'&quot; <ul> <li>[it's quite bizarre to use a press release, which is generally understood to be a meaningless puff piece, as evidence that a strongly supported claim isn't true; again, BE staff seem to be extremely selective about what evidence they look at to a degree that is striking; for example, from conversations I had with credible senior engineers who worked on search at both Google and Bing, engineers who understand the domain would agree that having more search volume and more data is a major advantage; instead of using evidence like that, BE staff find a press release that, in the tradition of press releases, has some meaningless and incorrect bragging, and bring that in as evidence; why would they do this?]</li> </ul></li> <li>[more examples of above incorrect reasoning, omitted for brevity]</li> </ul> <h3 id="3-5-theory-based-on-raising-rivals-costs">3.5 Theory based on raising rivals' costs</h3> <ul> <li>Despite the above, it could be that distribution agreements deny rivals enough data that &quot;feedback effects&quot; are triggered</li> <li>Possible feedback effects <ul> <li>Scale effect: cost per unit of quality or ad matching decreases</li> <li>Indirect network effect: more advertisers increases number of users</li> <li>Congestion effect</li> <li>Cash flow effect</li> </ul></li> <li>Scale effect was determined to not be applicable [as noted there, the argument for this is completely wrong]</li> <li>Indirect network effect has weak evidence, evidence exists that it doesn't apply, and even if it did apply, low click-through rate of ads shows that most consumers don't like ads anyway [what? This doesn't seem relevant?], and also, having a greater number of advertisers leads to congestion and reduction in the value of the platform to advertisers [this is a reach; there is a sense in which this is technically true, but we could see then and now that platforms with few advertisers are extremely undesirable to advertisers because advertisers generally don't want to advertise on a platform that's full of low quality ads (and this also impacts the desire of users to use the platform)]</li> <li>Cash flow effect not relevant because Microsoft isn't cash flow constrained, so cost isn't relevant [a funny comment to make because, not too long after this, Microsoft severely cut back investment in Bing because the returns weren't deemed to be worth it; it seems odd for economists to argue that, if you have a lot of money, the cost of things doesn't matter and ROI is irrelevant. Shouldn't they think about marginal cost and marginal revenue?]</li> </ul> <p>[I stopped taking detailed notes at this point because taking notes that are legible to other people (as opposed to just for myself) takes about an order of magnitude longer, and I didn't think that there was much of interest here. I generally find comments of the form &quot;I stopped reading at X&quot; to be quite poor, in that people making such comments generally seem to pick some trivial thing that's unimportant and then declare an entire document to be worthless based on that. This pattern is also common when it comes to engineers, institutions, sports players, etc. and I generally find it counterproductive in those cases as well. However, in this case, there isn't really a single, non-representative issue. The majority of the reasoning seems not just wrong, but highly disconnected from the on-the-ground situation.
More notes indicating that the authors are making further misleading or incorrect arguments in the same style don't seem very useful. I did read the rest of the document and I also continue to summarize a few bits, below. I don't want to call them &quot;highlights&quot; because that would imply that I pulled out particularly interesting or compelling or incorrect bits and it's more of a smattering of miscellaneous parts with no particular theme]</p> <ul> <li>There's a claim that removing restrictions on API interoperability may not cause short-term problems, but may cause long-term harm due to how this shifts incentives and reduces innovation and this needs to be accounted for, not just the short-term benefit [in form, this is analogous to the argument Tyler Cowen recently made that banning non-competes reduces the incentives for firms to innovate and will reduce innovation]</li> <li>The authors seem to like to refer to advertisements and PR that any reasonable engineer (and I would guess reasonable person) would know are not meant to be factual or accurate. Similar to the PR argument above, they argue that advertising for Microsoft adCenter claims that it's easy to import data from AdWords and that, therefore, the data portability issue is incorrect, and they specifically say that these advertising statements are &quot;more credible than&quot; other evidence <ul> <li>They also relied on some kind of SEO blogspam that restates the above as further evidence of this</li> </ul></li> <li>The authors do not believe that Google Search and Google Local are complements or that taking data from Yelp or TripAdvisor and displaying it above search results has any negative impact on Yelp or TripAdvisor, or at least that &quot;the burden of proof would be extremely difficult&quot;</li> </ul> <h1 id="other-memos">Other memos</h1> <p>[for these, I continued writing high-level summaries, not detailed summaries]</p> <ul> <li>After the BE memo, there's a memo from Laura M. Sullivan, Division of Advertising Practices, which makes a fairly narrow case in a few dimensions, including &quot;we continue to believe that Google has not deceived consumers by integrating its own specialized search results into its organic results&quot; and, as a result, they suggest not pursuing further action. <ul> <li>There are some recommendations, such as &quot;based on what we have observed of these new paid search results [referring to Local Search, etc.], we believe Google can strengthen the prominence and clarity of its disclosure&quot; [in practice, the opposite has happened!]</li> <li>[overall, the specific points presented here seem like ones a reasonable person could agree with, though whether or not these points are strong enough that they should prevent anti-trust action could be debated]</li> <li>&quot;Updating the 2002 Search Engine Letter is Warranted&quot; <ul> <li>&quot;The concerns we have regarding Google's disclosure of paid search results also apply to other search engines. Studies since the 2002 Search Engine letter was issued indicate that the standard methods search engines, including Google, Bing, and Yahoo!, have used to disclose their paid results may not be noticeable or clear enough for consumers. For example, many consumers do not recognize the top ads as paid results ... Documents also indicate Google itself believed that many consumers generally do not recognize top ads as paid.
For example, in June 2010, a leading team member of Google's in-house research group, commenting on general search research over time, stated: 'I don't think the research is inconclusive at all - there's definitely a (large) group of users who don't distinguish between sponsored and organic results. If we ask these users why they think the top results are sometimes displayed with a different background color, they will come up with an explanation that can range from &quot;because they are more relevant&quot; to &quot;I have no idea&quot; to &quot;because Google is sponsoring them.&quot;' [this could've seemed reasonable at the time, but in retrospect we can see that the opposite of this has happened and ads are less distinguishable from search results than they were in 2012, likely meaning that even fewer consumers can distinguish ads from search results]</li> </ul></li> <li>On the topic of whether or not Google should be liable for <a href="diseconomies-scale/">fraudulent ads</a> such as ones for fake weight-loss products or fake mortgage relief services, &quot;there is no indication so far that Google has played any role in developing or creating the search ads we are investigating&quot; and Google is expending some effort to prevent these ads and Google can claim CDA immunity, so further investigation here isn't worthwhile</li> </ul></li> <li>There's another memo from the same author on whether or not using other consumer data in conjunction with its search advertising business is unfair; the case is generally that this is not unfair and consumers should expect that their data is used to improve search queries</li> <li>There's a memo from Ken Heyer (at the time, a Director of the Agency's Bureau of Economics) <ul> <li>Suggests having a remedy that seems &quot;quite likely to do more good than harm&quot; before &quot;even considering serious filing a Complaint&quot;</li> <li>Seems to generally be in agreement with BE memo <ul> <li>On distribution, agrees with economist memo on unimportance of mobile and that Microsoft has good distribution on desktop (due to IE being default on 70% of PCs sold)</li> <li>On API restrictions, mixed opinion</li> <li>On mobile, mostly agrees with BE memo, but suggests getting an idea of how much Google pays for the right to be the default &quot;since if default status is not much of an advantage we would not expect to see large payments being made&quot; and also suggests it would be interesting to know how much switching from the default occurs <ul> <li>Further notes that mobile is only 8% of the market, too small to be significant [8% appears to have been factually incorrect at the time. By late 2012, when this was written, mobile should've been 20% or more of queries; not sure why the economists are so wrong on so many of the numbers]</li> </ul></li> </ul></li> <li>On vertical sites, agreement with data analysis from BE memo and generally agrees with BE memo</li> </ul></li> <li>Another Ken Heyer memo <ul> <li>More strongly recommends that no action be taken than the previous memo; recommends against a consent decree as well as litigation</li> </ul></li> <li>Follow-up memo from BC staff (Barbara R. Blank et al.), recommending that staff negotiate a consent order with Google on mobile <ul> <li>Google has exclusive agreements with the 4 major U.S.
wireless carriers and Apple to pre-install Google Search; the Apple agreement requires exclusivity <ul> <li>Google default on 86% of devices</li> </ul></li> <li>BC Staff recommends consent agreement to eliminate these exclusive agreements</li> <li>According to Google documents mobile was 9.5% of Google queries in 2010, 17.3% in 2011 [note that this strongly contradicts the claim from the BE memo that mobile is only 8% of the market here] <ul> <li>Rapid growth shows that mobile distribution channel is significant, and both Microsoft and Google internal documents recognize that mobile will likely surpass desktop in the near future</li> </ul></li> <li>In contradiction to their claims, the Sprint and T-mobile agreements appear to mandate exclusivity, the AT&amp;T agreement is de facto exclusive due to a tiered revenue sharing arrangement, and the Verizon agreement is exclusive</li> <li>Google business development manager Chris Barton: &quot;So we know with 100% certainty due to contractual terms that: All Android phones on T-Mobile will come with Google as the only search engine out-of-the-box. All Android phones on Verizon will come with Google as the only search engine out-of-the-box. All Android phones on Sprint will come with Google as the only search engine out-of-the-box. I think this approach is really important otherwise Bing or Yahoo can come and steal away our Android search distribution at any time, thus removing the value of entering into contracts with them. Our philosophy is that we are paying revenue share&quot;</li> <li>Andy Rubin laid out a plan to reduce the revenue share of partners as Google gained search dominance, and Google has done this over time</li> <li>Carriers would not switch even without exclusive agreement due to better monetization and/or bad PR</li> <li>When wrapping up the Verizon deal, Andy Rubin said &quot;[i]f we can pull this off ... we will own the US market&quot;</li> </ul></li> <li>Memo from Willard K. Tom, General Counsel <ul> <li>&quot;In sum, this <i>may</i> be a good case. But it would be a novel one, and as in all such cases, the Commission should think through carefully what it means.&quot;</li> </ul></li> <li>Memo from Howard Shelanski, Director of the Bureau of Economics <ul> <li>Mostly supports the BE memo and the memo from Ken Heyer, except on scraping, where there's support for the BC memo</li> </ul></li> </ul> <div class="footnotes"> <hr /> <ol> <li id="fn:K"><p>By analogy to a case that many people in tech are familiar with, consider this exchange between Oracle counsel <a href="https://en.wikipedia.org/wiki/David_Boies">David Boies</a> and <a href="https://en.wikipedia.org/wiki/William_Alsup">Judge William Alsup</a> on the <code>rangeCheck</code> function, which checks whether a range is a valid array access given the length of an array and throws an exception if the access is out of range:</p> <ul> <li><b>Boies</b>: [argument that Google copied the rangeCheck function in order to accelerate development]</li> <li><b>Alsup</b>: All right. I have — I was not good — I couldn't have told you the first thing about Java before this trial. But, I have done and still do a lot of programming myself in other languages. I have written blocks of code like rangeCheck a hundred times or more. I could do it. You could do it. It is so simple. The idea that somebody copied that in order to get to market faster, when it would be just as fast to write it out, it was an accident that that thing got in there.
There was no way that you could say that that was speeding them along to the marketplace. That is not a good argument.</li> <li><b>Boies</b>: Your Honor</li> <li><b>Alsup</b>: [cutting off Boies] You're one of the best lawyers in America. How can you even make that argument? You know, maybe the answer is because you are so good it sounds legit. But it is not legit. That is not a good argument.</li> <li><b>Boies</b>: Your Honor, let me approach it this way, first, okay. I want to come back to rangeCheck. All right.</li> <li><b>Alsup</b>: RangeCheck. All it does is it makes sure that the numbers you're inputting are within a range. And if they're not, they give it some kind of exceptional treatment. It is so — that witness, when he said a high school student would do this, is absolutely right.</li> <li><b>Boies</b>: He didn't say a high school student would do it in an hour, all right.</li> <li><b>Alsup</b>: Less than — in five minutes, Mr. Boies.</li> </ul> <p>Boies previously brought up this function as a non-trivial piece of work and then argued that, in their haste, a Google engineer copied this function from Oracle. As Alsup points out, the function is trivial, so trivial that it <abbr title="perhaps one might copy it if one were bulk copying a large amount of code, but copying this single function to make haste is implausible">wouldn't be worth looking up to copy</abbr> and that even a high school student could easily produce the function from scratch. Boies then objects that, sure, maybe a high school student could write the function, but it might take an hour or more and Alsup correctly responds that an hour is implausible and that it might take five minutes.</p> <p>Although nearly anyone who could pass a high school programming class would find Boies's argument not just wrong but absurd<sup class="footnote-ref" id="fnref:J"><a rel="footnote" href="#fn:J">3</a></sup>, more like a joke than something that someone might say seriously, it seems reasonable for Boies to make the argument because people presiding over these decisions in court, in regulatory agencies, and in the legislature sometimes demonstrate a lack of basic understanding of tech. Since my background is in tech and not law or economics, I have no doubt that this analysis will miss some basics about law and economics in the same way that most analyses I've read seem to miss basics about tech, but since there's been extensive commentary on this case from people with strong law and economics backgrounds, I don't see a need to cover those issues in depth here because anyone who's interested can read another analysis instead of or in addition to this one.</p> <a class="footnote-return" href="#fnref:K"><sup>[return]</sup></a></li> <li id="fn:B"><p>Although this document is focused on tech, the lack of hands-on industry expertise in regulatory bodies, legislatures, and the courts appears to cause problems in other industries as well. An example that's relatively well known due to <a href="https://archive.is/sN1s7#selection-1795.691-1795.695">a NY Times article</a> that was turned into a movie is DuPont's involvement in the popularization of PFAS and, in particular, PFOA. Scientists at 3M and DuPont had evidence of the harms of PFAS going back at least to the 60s, and possibly even as far back as the 50s.
Given the severe harms that PFOA caused to people who were exposed to it in significant concentrations, it would've been difficult to set up a production process for PFOA without seeing the harm it caused, but this knowledge, which must've been apparent to senior scientists and decision makers in 3M and DuPont, wasn't understood by regulatory agencies for almost four decades after it was apparent to chemical companies.</p> <p>By the way, the NY Times article is titled &quot;The Lawyer Who Became DuPont's Worst Nightmare&quot; and it describes how DuPont made $1B/yr in profit for years while hiding the harms of PFOA, which was used in the manufacturing process for Teflon. This lawyer brought cases against DuPont that were settled for hundreds of millions of dollars; according to the article and movie, the litigation didn't even cost DuPont a single year's worth of PFOA profit. Also, DuPont managed to drag out the litigation for many years, continuing to reap the profit from PFOA. Now that enough evidence has mounted against PFOA, Teflon is manufactured using PFO2OA or FRD-903, which are newer and have a less well understood safety profile than PFOA. Perhaps the article could be titled &quot;The Lawyer Who Became DuPont's Largest Mild Annoyance&quot;.</p> <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:J"><p>In the media, I've sometimes seen this framed as a conflict between tech vs. non-tech folks, but we can see analogous comments from people outside of tech. For example, in a panel discussion with Yale <abbr title="business school">SOM</abbr> professor Fiona Scott Morton and DoJ Antitrust Principal Deputy <a href="https://en.wikipedia.org/wiki/United_States_Assistant_Attorney_General">AAG</a> Doha Mekki, Scott Morton noted that the judge presiding over the Sprint/T-mobile merger proceedings, a case she was an expert witness for, had comically wrong misunderstandings about the market, and that it's common for decisions to be made which are disconnected from &quot;market realities&quot;. Mekki seconded this sentiment, saying &quot;what's so fascinating about some of the bad opinions that Fiona identified, and there are many, there's AT&amp;T Time Warner, Sabre Farelogix, T-mobile Sprint, they're everywhere, there's Amex, you know ...&quot;</p> <p>If you're seeing this or the other footnote in mouseover text and/or tied to a broken link, this is an issue with Hugo. At this point, <a href="https://x.com/danluu/status/1604922967074697216">I've spent more than an entire blog post's worth of effort working around Hugo breakage</a> and am trying to avoid spending more time working around issues in a tool that makes breaking changes at a high rate. If you have a suggestion to fix this, I'll try it, otherwise I'll try to fix it when I switch away from Hugo.</p> <a class="footnote-return" href="#fnref:J"><sup>[return]</sup></a></li> </ol> </div> How web bloat impacts users with slow devices slow-device/ Sat, 16 Mar 2024 00:00:00 +0000 slow-device/ <p><meta property="og:image" content="/slow-device-performance.png"/></p> <p>In 2017, <a href="web-bloat/">we looked at how web bloat affects users with slow connections</a>. Even in the U.S., <a href="https://twitter.com/danluu/status/1116565029791260672">many users didn't have broadband speeds</a>, making much of the web difficult to use. It's still the case that many users don't have broadband speeds, both inside and outside of the U.S.
and that much of the modern web isn't usable for people with slow internet, but the exponential increase in bandwidth (Nielsen suggests <abbr title="Unfortunately, I don't know of a public source for low-end data, say 10%-ile or 1%-ile; let me know if you have numbers on this">this is 50% per year for high-end connections</abbr>) has outpaced web bloat for typical sites, making this less of a problem than it was in 2017, although it's still a serious problem for people with poor connections.</p> <p>CPU performance for web apps hasn't scaled nearly as quickly as bandwidth so, while more of the web is becoming accessible to people with low-end connections, more of the web is becoming inaccessible to people with low-end devices even if they have high-end connections. For example, if I try browsing a &quot;modern&quot; Discourse-powered forum on a <code>Tecno Spark 8C</code>, it sometimes crashes the browser. Between crashes, on measuring the performance, the responsiveness is significantly worse than browsing a BBS with an <code>8 MHz 286</code> and a <code>1200 baud</code> modem. On my <code>1Gbps</code> home internet connection, the <code>2.6 MB</code> compressed payload size &quot;necessary&quot; to load message titles is relatively light. The over-the-wire payload size has &quot;only&quot; increased by <code>1000x</code>, which is dwarfed by the increase in internet speeds. But the opposite is true when it comes to CPU speeds — for web browsing and forum loading performance, the <code>8-core (2 1.6 GHz Cortex-A75 / 6 1.6 GHz Cortex-A55)</code> CPU can't handle Discourse. The CPU is something like <code>100000x</code> faster than our <code>286</code>. Perhaps a <code>1000000x</code> faster device would be sufficient.</p> <p>For anyone not familiar with the <code>Tecno Spark 8C</code>: a quick search indicates that, today, a new <code>Tecno Spark 8C</code> can be had for <code>USD 50-60</code> in Nigeria and perhaps <code>USD 100-110</code> in India. As <abbr title="The estimates for Nigerian median income that I looked at seem good enough, but the Indian estimate I found was a bit iffier; if you have a good source for Indian income distribution, please pass it along.">a fraction of median household income</abbr>, that's substantially more than a current generation iPhone in the U.S. today.</p> <p>By worldwide standards, the <code>Tecno Spark 8C</code> isn't even close to being a low-end device, so we'll also look at performance on an <code>Itel P32</code>, which is a lower end device (though still far from the lowest-end device people are using today). Additionally, we'll look at performance with an <code>M3 Max Macbook (14-core)</code>, an <code>M1 Pro Macbook (8-core)</code>, and the <code>M3 Max</code> set to <code>10x</code> throttling in Chrome dev tools. In order to give these devices every advantage, we'll be on fairly high-speed internet (1Gbps, with a WiFi router that's benchmarked as having lower latency under load than most of its peers). We'll look at some blogging platforms and micro-blogging platforms (this blog, Substack, Medium, Ghost, Hugo, Tumblr, Mastodon, Twitter, Threads, Bluesky, Patreon), forum platforms (Discourse, Reddit, Quora, vBulletin, XenForo, phpBB, and myBB), and platforms commonly used by small businesses (Wix, Squarespace, Shopify, and WordPress again).</p> <p>In the table below, every row represents a website and every non-label column is a metric.
After the website name column, we have the compressed size transferred over the wire (<code>wire</code>) and the raw, uncompressed size (<code>raw</code>). Then we have, for each device, Largest Contentful Paint* (<code>LCP*</code>) and CPU usage on the main thread (<code>CPU</code>). Google's docs explain <code>LCP</code> as</p> <blockquote> <p>Largest Contentful Paint (LCP) measures when a user perceives that the largest content of a page is visible. The metric value for LCP represents the time duration between the user initiating the page load and the page rendering its primary content</p> </blockquote> <p><code>LCP</code> is a common optimization target because it's presented as one of the primary metrics in Google PageSpeed Insights, a &quot;Core Web Vital&quot; metric. There's an asterisk next to <code>LCP</code> as used in this document because <code>LCP</code> as measured by Chrome is about painting a large fraction of the screen, as opposed to the definition above, which is about content. As sites have optimized for <code>LCP</code>, it's not uncommon to have a large paint (update) that's completely useless to the user, with the actual content of the page appearing well after the <code>LCP</code>. In cases where that happens, I've used the timestamp when useful content appears, not the <code>LCP</code> as defined by when a large but useless update occurs. The full details of the tests and why these metrics were chosen are discussed in an appendix.</p> <p>Although CPU time isn't a &quot;Core Web Vital&quot;, it's presented here because it's a simple metric that's highly correlated with my and other users' perception of usability on slow devices. See appendix for more detailed discussion on this. One reason CPU time works as a metric is that, if a page has great numbers for all other metrics but uses a ton of CPU time, the page is not going to be usable on a slow device. If it takes 100% CPU for 30 seconds, the page will be completely unusable for 30 seconds, and if it takes 50% CPU for 60 seconds, the page will be barely usable for 60 seconds, etc. Another reason it works is that, relative to commonly used metrics, it's hard to cheat on CPU time and make optimizations that significantly move the number without impacting user experience.</p>
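<p>To make the <code>CPU</code> column concrete, here's a rough sketch of how one could pull main-thread CPU time for a page load using Puppeteer and Chrome's CPU throttling. This is for illustration only, not the harness used for the numbers below (the actual test setup is discussed in the appendix); the URL and throttling factor here are placeholder values.</p> <pre><code>// Rough sketch: load a page with CPU throttling and report how much
// main-thread time Chrome spent on it. TaskDuration is total main-thread
// task time; ScriptDuration and LayoutDuration are the JS and layout parts.
import puppeteer from 'puppeteer';

async function measure(url: string, throttleFactor: number) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // e.g. a factor of 10 very roughly approximates the M3/10 column
  await page.emulateCPUThrottling(throttleFactor);
  await page.goto(url, { waitUntil: 'networkidle0', timeout: 120000 });
  const metrics = await page.metrics();
  console.log(url, {
    mainThreadSeconds: metrics.TaskDuration,
    scriptSeconds: metrics.ScriptDuration,
    layoutSeconds: metrics.LayoutDuration,
  });
  await browser.close();
}

measure('https://example.com/', 10).catch(console.error);</code></pre> <p>The color scheme in the table below is that, for sizes, more green = smaller / faster and more red = larger / slower.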
Extreme values are in black.</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}td.l {text-align:left;}</style> <table> <tr> <th rowspan="2">Site</th><th colspan="2">Size</th><th colspan="2">M3 Max</th><th colspan="2">M1 Pro</th><th colspan="2">M3/10</th><th colspan="2">Tecno S8C</th><th colspan="2">Itel P32</th></tr> <tr> <th>wire</th><th>raw</th><th>LCP*</th><th>CPU</th><th>LCP*</th><th>CPU</th><th>LCP*</th><th>CPU</th><th>LCP*</th><th>CPU</th><th>LCP*</th><th>CPU</th></tr> <tr> <td class="l">danluu.com</td><td bgcolor=#00441b><font color=white>6kB</font></td><td bgcolor=#00441b><font color=white>18kB</font></td><td bgcolor=#00441b><font color=white>50ms</font></td><td bgcolor=#00441b><font color=white>20ms</font></td><td bgcolor=#00441b><font color=white>50ms</font></td><td bgcolor=#00441b><font color=white>30ms</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td></tr> <tr> <td class="l">HN</td><td bgcolor=#00441b><font color=white>11kB</font></td><td bgcolor=#00441b><font color=white>50kB</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#00441b><font color=white>30ms</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#00441b><font color=white>30ms</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#238b45><font color=black>0.6s</font></td></tr> <tr> <td class="l">MyBB</td><td bgcolor=#00441b><font color=white>0.1MB</font></td><td bgcolor=#00441b><font color=white>0.3MB</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#006d2c><font color=black>0.6s</font></td><td bgcolor=#238b45><font color=black>0.6s</font></td><td bgcolor=#238b45><font color=black>0.8s</font></td><td bgcolor=#238b45><font color=black>0.8s</font></td><td bgcolor=#e5f5e0><font color=black>2.1s</font></td><td bgcolor=#c7e9c0><font color=black>1.9s</font></td></tr> <tr> <td class="l">phpBB</td><td bgcolor=#006d2c><font color=black>0.4MB</font></td><td bgcolor=#006d2c><font color=black>0.9MB</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=#a1d99b><font color=black>1.7s</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=fee8de><font color=black>4.1s</font></td><td bgcolor=feeae1><font color=black>3.9s</font></td></tr> <tr> <td class="l">WordPress</td><td bgcolor=#a1d99b><font color=black>1.4MB</font></td><td bgcolor=#238b45><font color=black>1.7MB</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#00441b><font color=white>60ms</font></td><td 
bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#00441b><font color=white>80ms</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#41ab5d><font color=black>1s</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=#f7fcf5><font color=black>2.5s</font></td></tr> <tr> <td class="l">WordPress (old)</td><td bgcolor=#006d2c><font color=black>0.3MB</font></td><td bgcolor=#006d2c><font color=black>1.0MB</font></td><td bgcolor=#00441b><font color=white>80ms</font></td><td bgcolor=#00441b><font color=white>70ms</font></td><td bgcolor=#00441b><font color=white>90ms</font></td><td bgcolor=#00441b><font color=white>90ms</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#41ab5d><font color=black>0.9s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#a1d99b><font color=black>1.7s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=#c7e9c0><font color=black>1.9s</font></td></tr> <tr> <td class="l">XenForo</td><td bgcolor=#006d2c><font color=black>0.3MB</font></td><td bgcolor=#006d2c><font color=black>1.0MB</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#006d2c><font color=black>0.6s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#74c476><font color=black>1.4s</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=#c7e9c0><font color=black>1.8s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> <td class="l">Ghost</td><td bgcolor=#238b45><font color=black>0.7MB</font></td><td bgcolor=#41ab5d><font color=black>2.4MB</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=#e5f5e0><font color=black>2.2s</font></td><td bgcolor=#41ab5d><font color=black>1s</font></td><td bgcolor=#f7fcf5><font color=black>2.4s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=feede4><font color=black>3.5s</font></td></tr> <tr> <td class="l">vBulletin</td><td bgcolor=#74c476><font color=black>1.2MB</font></td><td bgcolor=#a1d99b><font color=black>3.4MB</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#006d2c><font color=black>0.6s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=fef2ec><font color=black>2.9s</font></td><td bgcolor=fee5da><font color=black>4.4s</font></td><td bgcolor=#fee0d2><font color=black>4.8s</font></td><td bgcolor=f44c38><font color=black>13s</font></td><td bgcolor=#cb181d><font color=black>16s</font></td></tr> <tr> <td class="l">Squarespace</td><td bgcolor=#e5f5e0><font color=black>1.9MB</font></td><td bgcolor=#fee0d2><font color=black>7.1MB</font></td><td bgcolor=#00441b><font color=white>0.1s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td 
bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=feede4><font color=black>3.6s</font></td><td bgcolor=#ef3b2c><font color=black>14s</font></td><td bgcolor=fedccc><font color=black>5.1s</font></td><td bgcolor=d01c1e><font color=black>16s</font></td><td bgcolor=a50f15><font color=black>19s</font></td></tr> <tr> <td class="l">Mastodon</td><td bgcolor=fca68a><font color=black>3.8MB</font></td><td bgcolor=#f7fcf5><font color=black>5.3MB</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#c7e9c0><font color=black>1.8s</font></td><td bgcolor=fee2d6><font color=black>4.7s</font></td><td bgcolor=#c7e9c0><font color=black>2.0s</font></td><td bgcolor=fcb096><font color=black>7.6s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> <td class="l">Tumblr</td><td bgcolor=#fcbba1><font color=black>3.5MB</font></td><td bgcolor=#fee0d2><font color=black>7.1MB</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#238b45><font color=black>0.6s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#41ab5d><font color=black>1.0s</font></td><td bgcolor=fcc0a8><font color=black>7.0s</font></td><td bgcolor=#ef3b2c><font color=black>14s</font></td><td bgcolor=fcab90><font color=black>7.9s</font></td><td bgcolor=fca184><font color=black>8.7s</font></td><td bgcolor=fca184><font color=black>8.7s</font></td></tr> <tr> <td class="l">Quora</td><td bgcolor=#238b45><font color=black>0.6MB</font></td><td bgcolor=#f7fcf5><font color=black>4.9MB</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=#238b45><font color=black>0.8s</font></td><td bgcolor=#74c476><font color=black>1.3s</font></td><td bgcolor=#fff5f0><font color=black>2.6s</font></td><td bgcolor=fca184><font color=black>8.7s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=a50f15><font color=black>19s</font></td><td bgcolor=880812><font color=white>29s</font></td></tr> <tr> <td class="l">Bluesky</td><td bgcolor=f5523b><font color=black>4.8MB</font></td><td bgcolor=fc7e5e><font color=black>10MB</font></td><td bgcolor=#41ab5d><font color=black>1.0s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#41ab5d><font color=black>1.0s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=fedccc><font color=black>5.1s</font></td><td bgcolor=fdceba><font color=black>6.0s</font></td><td bgcolor=fcab90><font color=black>8.1s</font></td><td bgcolor=fca68a><font color=black>8.3s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> <td class="l">Wix</td><td bgcolor=a50f15><font color=black>7.0MB</font></td><td bgcolor=#71030E><font color=black>21MB</font></td><td bgcolor=#f7fcf5><font color=black>2.4s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=#f7fcf5><font color=black>2.5s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=aa1016><font color=black>18s</font></td><td bgcolor=fc7454><font color=black>11s</font></td><td bgcolor=fed7c6><font color=black>5.6s</font></td><td 
bgcolor=fc8868><font color=black>10s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> <td class="l">Substack</td><td bgcolor=#74c476><font color=black>1.3MB</font></td><td bgcolor=#e5f5e0><font color=black>4.3MB</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=#fee0d2><font color=black>4.9s</font></td><td bgcolor=#ef3b2c><font color=black>14s</font></td><td bgcolor=#ef3b2c><font color=black>14s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> <td class="l">Threads</td><td bgcolor=#71030E><font color=black>9.3MB</font></td><td bgcolor=#cb181d><font color=black>13MB</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#a1d99b><font color=black>1.6s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=fedccc><font color=black>5.1s</font></td><td bgcolor=fdceba><font color=black>6.1s</font></td><td bgcolor=fcc9b4><font color=black>6.4s</font></td><td bgcolor=#cb181d><font color=black>16s</font></td><td bgcolor=8e0a12><font color=white>28s</font></td><td bgcolor=black><font color=white>66s</font></td></tr> <tr> <td class="l">Twitter</td><td bgcolor=#fb6a4a><font color=black>4.7MB</font></td><td bgcolor=f5523b><font color=black>11MB</font></td><td bgcolor=#fff5f0><font color=black>2.6s</font></td><td bgcolor=#41ab5d><font color=black>0.9s</font></td><td bgcolor=#fff5f0><font color=black>2.7s</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=fed7c6><font color=black>5.6s</font></td><td bgcolor=fcc4ae><font color=black>6.6s</font></td><td bgcolor=fa6446><font color=black>12s</font></td><td bgcolor=a50f15><font color=black>19s</font></td><td bgcolor=a00e14><font color=black>24s</font></td><td bgcolor=black><font color=white>43s</font></td></tr> <tr> <td class="l">Shopify</td><td bgcolor=#fee0d2><font color=black>3.0MB</font></td><td bgcolor=#fff5f0><font color=black>5.5MB</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#00441b><font color=white>0.2s</font></td><td bgcolor=#006d2c><font color=black>0.4s</font></td><td bgcolor=#00441b><font color=white>0.3s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#f7fcf5><font color=black>2.3s</font></td><td bgcolor=fc8868><font color=black>10s</font></td><td bgcolor=970c14><font color=white>26s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> <td class="l">Discourse</td><td bgcolor=#fff5f0><font color=black>2.6MB</font></td><td bgcolor=fc7e5e><font color=black>10MB</font></td><td bgcolor=#41ab5d><font color=black>1.1s</font></td><td bgcolor=#006d2c><font color=black>0.5s</font></td><td bgcolor=#a1d99b><font color=black>1.5s</font></td><td bgcolor=#238b45><font color=black>0.6s</font></td><td bgcolor=fcc4ae><font color=black>6.5s</font></td><td bgcolor=fed2c0><font color=black>5.9s</font></td><td bgcolor=dd2a24><font color=black>15s</font></td><td bgcolor=970c14><font color=white>26s</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> <tr> 
<td class="l">Patreon</td><td bgcolor=#fc9272><font color=black>4.0MB</font></td><td bgcolor=#cb181d><font color=black>13MB</font></td><td bgcolor=#006d2c><font color=black>0.6s</font></td><td bgcolor=#41ab5d><font color=black>1.0s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=#ef3b2c><font color=black>14s</font></td><td bgcolor=#a1d99b><font color=black>1.7s</font></td><td bgcolor=black><font color=white>31s</font></td><td bgcolor=fc9778><font color=black>9.1s</font></td><td bgcolor=black><font color=white>45s</font></td></tr> <tr> <td class="l">Medium</td><td bgcolor=#74c476><font color=black>1.2MB</font></td><td bgcolor=#a1d99b><font color=black>3.3MB</font></td><td bgcolor=#74c476><font color=black>1.4s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#74c476><font color=black>1.4s</font></td><td bgcolor=#41ab5d><font color=black>1s</font></td><td bgcolor=#c7e9c0><font color=black>2s</font></td><td bgcolor=fc7454><font color=black>11s</font></td><td bgcolor=#fff5f0><font color=black>2.8s</font></td><td bgcolor=black><font color=white>33s</font></td><td bgcolor=fef0e8><font color=black>3.2s</font></td><td bgcolor=black><font color=white>63s</font></td></tr> <tr> <td class="l">Reddit</td><td bgcolor=#c7e9c0><font color=black>1.7MB</font></td><td bgcolor=#f7fcf5><font color=black>5.4MB</font></td><td bgcolor=#41ab5d><font color=black>0.9s</font></td><td bgcolor=#238b45><font color=black>0.7s</font></td><td bgcolor=#41ab5d><font color=black>0.9s</font></td><td bgcolor=#41ab5d><font color=black>0.9s</font></td><td bgcolor=fdceba><font color=black>6.2s</font></td><td bgcolor=fa6446><font color=black>12s</font></td><td bgcolor=#74c476><font color=black>1.2s</font></td><td bgcolor=black><font color=white>∞</font></td><td bgcolor=black><font color=white>FAIL</font></td><td bgcolor=black><font color=white>FAIL</font></td></tr> </table></p> <p>At a first glance, the table seems about right, in that the sites that feel slow unless you have a super fast device show up as slow in the table (as in, <code>max(LCP*,CPU))</code> is high on lower-end devices). When I polled folks about what platforms they thought would be fastest and slowest on our slow devices (<a href="https://mastodon.social/@danluu/111994437263038931">Mastodon</a>, <a href="https://twitter.com/danluu/status/1761875263359537652">Twitter</a>, <a href="https://www.threads.net/@danluu.danluu/post/C3yVpfKS-RP">Threads</a>), they generally correctly predicted that Wordpress and Ghost would be faster than Substack and Medium, and that Discourse would be much slower than old PHP forums like phpBB, XenForo, and vBulletin. I also pulled Google PageSpeed Insights (PSI) scores for pages (not shown) and the correlation isn't as strong with those numbers <abbr title="For the 'real world' numbers, this is also because users with slow devices can't really use some of these sites, so their devices aren't counted in the distribution and PSI doesn't normalize for this.">because</abbr> a handful of sites have managed to optimize their PSI scores without actually speeding up their pages for users.</p> <p>If you've never used a low-end device like this, the general experience is that many sites are unusable on the device and loading anything resource intensive (an app or a huge website) can cause crashes. Doing something too intense in a resource intensive app can also cause crashes. 
While <a href="https://www.youtube.com/watch?v=U1JMRFQWK70">reviews note</a> that <a href="https://www.youtube.com/watch?v=McawfNlydqk">you can run PUBG and other 3D games with decent performance</a> on a <code>Tecno Spark 8C</code>, this doesn't mean that the device is fast enough to read posts on modern text-centric social media platforms or modern text-centric web forums. While <code>40fps</code> is achievable in PUBG, we can easily see less than <code>0.4fps</code> when scrolling on these sites.</p> <p>We can see from the table how many of the sites are unusable if you have a slow device. All of the pages with <code>10s+ CPU</code> are a fairly bad experience even after the page loads. Scrolling is very jerky, frequently dropping to a few frames per second and sometimes well below. When we tap on any link, the delay is so long that we can't be sure if our tap actually worked. If we tap again, we can get the dreaded situation where the first tap registers, which then causes the second tap to do the wrong thing, but if we wait, we often end up waiting too long because the original tap didn't actually register (or it registered, but not where we thought it did). Although MyBB doesn't serve up a mobile site and is penalized by Google for not having a mobile friendly page, it's actually much more usable on these slow mobiles than all but the fastest sites because scrolling and tapping actually work.</p> <p>Another thing we can see is how much variance there is in the relative performance on different devices. For example, comparing an <code>M3/10</code> and a <code>Tecno Spark 8C</code>, for danluu.com and Ghost, an <code>M3/10</code> gives a halfway decent approximation of the <code>Tecno Spark 8C</code> (although danluu.com loads much too quickly), but the <code>Tecno Spark 8C</code> is about three times slower (<code>CPU</code>) for Medium, Substack, and Twitter, roughly four times slower for Reddit and Discourse, and over an order of magnitude slower for Shopify. For Wix, the <code>CPU</code> approximation is about accurate, but our <code>Tecno Spark 8C</code> is more than 3 times faster on <code>LCP*</code>. It's great that Chrome lets you simulate a slower device from the convenience of your computer, but just enabling Chrome's CPU throttling (or using any combination of out-of-the-box options that are available) gives fairly different results than we get on many real devices. The full reasons for this are beyond the scope of the post; for the purposes of this post, it's sufficient to note that slow pages are often super-linearly slow as devices get slower and that slowness on one page doesn't strongly predict slowness on another page.</p> <p>If we take a site-centric view instead of a device-centric view, another way to look at it is that sites like Discourse, Medium, and Reddit don't use all that much CPU on our fast <code>M3</code> and <code>M1</code> computers, but they're among the slowest on our <code>Tecno Spark 8C</code> (Reddit's CPU is shown as <code>∞</code> because, no matter how long we wait with no interaction, Reddit uses <code>~90% CPU</code>). Discourse also sometimes crashed the browser after interacting a bit or just waiting a while. For example, one time, the browser crashed after loading Discourse, scrolling twice, and then leaving the device still for a minute or two.
For consistency's sake, this wasn't marked as <code>FAIL</code> in the table since the page did load but, realistically, having a page so resource intensive that the browser crashes is a significantly worse user experience than any of the <code>FAIL</code> cases in the table. When we looked at how <a href="web-bloat/">web bloat impacts users with slow connections</a>, we found that <abbr title="One thing to keep in mind here is that having a slow device and a slow connection have multiplicative impacts.">much of the web was unusable for people with slow connections, and the situation for slow devices is no different</abbr>.</p> <p>Another pattern we can see is how the older sites are, in general, faster than the newer ones, with sites that (visually) look like they haven't been updated in a decade or two tending to be among the fastest. For example, MyBB, the least modernized and oldest looking forum, is <code>3.6x / 5x faster (LCP* / CPU)</code> than Discourse on the <code>M3</code>, but on the <code>Tecno Spark 8C</code>, the difference is <code>19x / 33x</code> and, given the overall scaling, it seems safe to guess that the difference would be even larger on the Itel P32 if Discourse worked on such a cheap device.</p> <p>Another example is Wordpress (old) vs. newer, trendier blogging platforms like Medium and Substack. Wordpress (old) is <code>17.5x / 10x faster (LCP* / CPU)</code> than Medium and <code>5x / 7x faster (LCP* / CPU)</code> than Substack on our <code>M3 Max</code>, and <code>4x / 19x</code> and <code>20x / 8x</code> faster, respectively, on our <code>Tecno Spark 8C</code>. Ghost is a notable exception to this, being a modern platform (launched a year after Medium) that's competitive with older platforms (modern Wordpress is also arguably an exception, but many folks would probably still consider that to be an old platform). Among forums, NodeBB also seems to be a bit of an exception (see appendix for details).</p> <p>Sites that use modern techniques like partially loading the page and then dynamically loading the rest of it, such as Discourse, Reddit, and Substack, tend to be less usable than the scores in the table indicate. In principle, you could build such a site in a simple way that works well with cheap devices but, in practice, sites that use dynamic loading tend to be complex enough that they're extremely janky on low-end devices. It's generally difficult or impossible to scroll a predictable distance, which means that users will sometimes accidentally trigger more loading by scrolling too far, causing the page to lock up. Many pages actually remove the parts of the page you scrolled past as you scroll; all such pages are essentially unusable. Other basic web features, like page search, also generally stop working. Pages with this kind of dynamic loading can't rely on the simple and fast ctrl/command+F search and have to build their own search.</p>
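<p>To make the &quot;removing parts of the page as you scroll&quot; behavior concrete, here's a rough sketch of the list virtualization pattern many of these sites use (this is illustrative TypeScript, not code from any of the sites tested): only the rows near the viewport exist in the DOM at any moment, so the browser's built-in find-in-page has nothing to search and predictable scrolling becomes hard.</p> <pre><code>// Illustrative virtualized list: only rows near the viewport are in the DOM.
const rowHeight = 80;                 // assume fixed-height rows
const items: string[] = [];           // imagine thousands of posts here

function render(container: HTMLElement) {
  const first = Math.floor(container.scrollTop / rowHeight);
  const visibleCount = Math.ceil(container.clientHeight / rowHeight) + 1;
  container.innerHTML = '';           // rows outside the window are simply gone
  for (const [offset, text] of items.slice(first, first + visibleCount).entries()) {
    const row = document.createElement('div');
    row.style.position = 'absolute';
    row.style.top = ((first + offset) * rowHeight) + 'px';
    row.textContent = text;
    container.appendChild(row);
  }
  // (a real implementation also sets the container height so the scrollbar
  // reflects all items, not just the rendered window)
}

// typically wired up so that every scroll event re-renders the window:
// container.addEventListener('scroll', function () { render(container); });</code></pre>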
<p>How well this custom search works varies (this used to work quite well in Google docs, but for the past few months or maybe a year, it takes so long to load that I have to deliberately wait after opening a doc to avoid triggering the browser's useless built-in search; Discourse search has never really worked on slow devices or even on devices that are not very fast but not particularly slow).</p> <p>In principle, these modern pages that burn a ton of CPU when loading could be doing pre-work that means that later interactions on the page are faster and cheaper than on the pages that do less up-front work (this is a common argument in favor of these kinds of pages), but that's not the case for the pages tested, which are slower to load initially, slower on subsequent loads, and slower after they've loaded.</p> <p>To understand why, in practice, doing all this work up-front doesn't generally result in a faster experience later, this exchange between a distinguished engineer at Google and one of the founders of Discourse (and CEO at the time) is <abbr title="the founder has made similar comments elsewhere as well, so this isn't a one-off analogy for him, nor do I find it to be an unusual line of thinking in general">illustrative</abbr>, in <a href="jeff-atwood-trashes-qualcomm-engineering.png">a discussion where the founder of Discourse says that you should test mobile sites on laptops with throttled bandwidth but not throttled CPU</a>:</p> <ul> <li><b>Google</b>: *you* also don't have slow 3G. These two settings go together. Empathy needs to extend beyond iPhone XS users in a tunnel.</li> <li><b>Discourse</b>: Literally any phone of vintage iPhone 6 or greater is basically as fast as the &quot;average&quot; laptop. You have to understand how brutally bad Qualcomm is at their job. Look it up if you don't believe me.</li> <li><b>Google</b>: I don't need to believe you. I know. This is well known by people who care. My point was that just like not everyone has a fast connection not everyone has a fast phone. Certainly the iPhone 6 is frequently very CPU bound on real world websites. But that isn't the point.</li> <li><b>Discourse</b>: we've been trending towards infinite CPU speed for decades now (and we've been asymptotically there for ~5 years on desktop), what we are not and will never trend towards is infinite bandwidth. Optimize for the things that matter. and I have zero empathy for @qualcomm. Fuck Qualcomm, they're terrible at their jobs. I hope they go out of business and the ground their company existed on is plowed with salt so nothing can ever grow there again.</li> <li><b>Google</b>: Mobile devices are not at all bandwidth constraint in most circumstances. They are latency constraint. Even the latest iPhone is CPU constraint before it is bandwidth constraint. If you do well on 4x slow down on a MBP things are pretty alright</li> <li>...</li> <li><b>Google</b>: Are 100% of users on iOS?</li> <li><b>Discourse</b>: The influential users who spend money tend to be, I’ll tell you that ...
Pointless to worry about cpu, it is effectively infinite already on iOS, and even with Qualcomm’s incompetence, will be within 4 more years on their embarrassing SoCs as well</li> </ul> <p>When someone asks the founder of Discourse, &quot;just wondering why you hate them&quot;, he responds with a link that cites the Kraken and Octane benchmarks from <a href="https://www.anandtech.com/show/9146/the-samsung-galaxy-s6-and-s6-edge-review/5">this Anandtech review</a>, which have the Qualcomm chip at 74% and 85% of the performance of the then-current Apple chip, respectively.</p> <p>The founder and then-CEO of Discourse considers Qualcomm's mobile performance embarrassing and finds this so offensive that he thinks Qualcomm engineers should all lose their jobs for delivering <abbr title="I think it could be reasonable to cite a lower number, but I'm using the number he cited, not what I would cite">74% to 85% of the performance of Apple</abbr>. Apple has what I consider to be an all-time great performance team. Reasonable people could disagree on that, but one has to at least think of them as a world-class team. So, producing a product with <abbr title="recall that, on a Tecno Spark 8, Discourse is 33 times slower than MyBB, which isn't particularly optimized for performance">74% to 85% of the performance of an all-time-great team is considered an embarrassment worthy of losing your job</abbr>.</p> <p>There are two attitudes on display here which I see in a lot of software folks. First, that CPU speed is infinite and one shouldn't worry about CPU optimization. And second, that gigantic speedups from hardware should be expected and the only reason hardware engineers wouldn't achieve them is due to spectacular incompetence, so the slow software should be blamed on hardware engineers, not software engineers. Donald Knuth expressed a similar sentiment:</p> <blockquote> <p>I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won’t be surprised at all if the whole multithreading idea turns out to be a flop, worse than the &quot;Itanium&quot; approach that was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write. Let me put it this way: During the past 50 years, I’ve written well over a thousand programs, many of which have substantial size. I can’t think of even five of those programs that would have been enhanced noticeably by parallelism or multithreading. Surely, for example, multiple processors are no help to TeX ... I know that important applications for parallelism exist—rendering graphics, breaking codes, scanning images, simulating physical and biological processes, etc. But all these applications require dedicated code and special-purpose techniques, which will need to be changed substantially every few years. Even if I knew enough about such methods to write about them in TAOCP, my time would be largely wasted, because soon there would be little reason for anybody to read those parts ... The machine I use today has dual processors.
I get to use them both only when I’m running two independent jobs at the same time; that’s nice, but it happens only a few minutes every week.</p> </blockquote> <p>In the case of Discourse, a hardware engineer is an embarrassment not deserving of a job if they can't hit 90% of the performance of an all-time-great performance team but, as a software engineer, delivering 3% of the performance of a non-highly-optimized application like MyBB is no problem. In Knuth's case, <abbr title="using this term loosely, to include materials scientists, etc., which is consistent with Knuth's comments">hardware engineers</abbr> gave programmers a 100x performance increase every decade for decades with little to no work on the part of programmers. The moment this slowed down and programmers had to adapt to take advantage of new hardware, hardware engineers were &quot;all out of ideas&quot;, but learning a few &quot;new&quot; (1970s and 1980s era) ideas to take advantage of current hardware would be a waste of time. And <a href="https://www.patreon.com/posts/54329188">we've previously discussed Alan Kay's claim that hardware engineers are &quot;unsophisticated&quot; and &quot;uneducated&quot; and aren't doing &quot;real engineering&quot; and how we'd get a 1000x speedup if we listened to Alan Kay's &quot;sophisticated&quot; ideas</a>.</p> <p>It's fairly common for programmers to expect that hardware will solve all their problems, and then, when that doesn't happen, pass the issue onto the user, explaining why the programmer needn't do anything to help the user. A question one might ask is how much performance improvement programmers have given us. There are cases of algorithmic improvements that result in massive speedups but, as we noted above, Discourse, the fastest growing forum software today, seems to have given us an approximately <code>1000000x</code> slowdown in performance.</p> <p>Another common attitude on display above is the idea that users who aren't wealthy don't matter. When asked if 100% of users are on iOS, the founder of Discourse says &quot;The influential users who spend money tend to be, I’ll tell you that&quot;. We see the same attitude all over comments on <a href="https://tonsky.me/blog/js-bloat/">Tonsky's JavaScript Bloat post</a>, with people expressing <a href="cocktail-ideas/">cocktail-party sentiments</a> like &quot;Phone apps are hundreds of megs, why are we obsessing over web apps that are a few megs? Starving children in Africa can download Android apps but not web apps? Come on&quot; and &quot;surely no user of gitlab would be poor enough to have a slow device, let's be serious&quot; (paraphrased for length).</p> <p>But when we look at the size of apps that are downloaded in Africa, we see that people who aren't on high-end devices use apps like Facebook Lite (a couple megs) and commonly use apps that are a single digit to low double digit number of megabytes. There are multiple reasons app makers care about their app size. One is just the total storage available on the phone; if you watch real users install apps, they often have to delete and uninstall things to put a new app on, so the smaller size is both easier to install and has a lower chance of being uninstalled when the user is looking for more space.
Another is that, if you look at data on app size and usage (I don't know of any public data on this; please pass it along if you have something public I can reference), when large apps increase the size and memory usage, they get more crashes, which drives down user retention, growth, and engagement and, conversely, when they optimize their size and memory usage, they get fewer crashes and better user retention, growth, and engagement.</p> <p><a href="https://infrequently.org/2024/01/performance-inequality-gap-2024/">Alex Russell points out that iOS has 7% market share in India (a 1.4B person market) and 6% market share in Latin America (a 600M person market)</a>. Although the founder of Discourse says that these aren't &quot;influential users&quot; who matter, these are still real human beings. Alex further points out that, according to Windows telemetry, which covers the vast majority of desktop users, most laptop/desktop users are on low-end machines which are likely slower than a modern iPhone.</p> <p>On the bit about no programmers having slow devices, I know plenty of people who are using hand-me-down devices that are old and slow. Many of them aren't even really poor; they just don't see why (for example) their kid needs a super fast device, and they don't understand how much of the modern web works poorly on slow devices. After all, the &quot;slow&quot; device can play 3d games and (with the right OS) compile codebases like Linux or Chromium, so why shouldn't the device be able to interact with a site like gitlab?</p> <p>Contrary to the claim from the founder of Discourse that, within years, every Android user will be on some kind of super fast Android device, it's been six years since his comment and it's going to be at least a decade before almost everyone in the world who's using a phone has a high-speed device and this could easily take two decades or more. If you look up marketshare stats for Discourse, it's extremely successful; it appears to be the fastest growing forum software in the world by a large margin. The impact of having the fastest growing forum software in the world created by an organization whose then-leader was willing to state that he doesn't really care about users who aren't &quot;influential users who spend money&quot;, who don't have access to &quot;infinite CPU speed&quot;, is that a lot of forums are now inaccessible to people who don't have enough wealth to buy a device with effectively infinite CPU.</p> <p>If the founder of Discourse were an anomaly, this wouldn't be too much of a problem, but he's just verbalizing the implicit assumptions a lot of programmers have, which is why we see that so many modern websites are unusable if you buy the income-adjusted equivalent of a new, current generation, iPhone in a low-income country.</p> <p><i>Thanks to Yossi Kreinen, Fabian Giesen, John O'Nolan, Joseph Scott, Loren McIntyre, Daniel Filan, @acidshill, Alex Russell, Chris Adams, Tobias Marschner, Matt Stuchlik, @gekitsu@toot.cat, Justin Blank, Andy Kelley, Julian Lam, Matthew Thomas, avarcat, @eamon@social.coop, William Ehlhardt, Philip R. Boulain, and David Turner for comments/corrections/discussion.</i></p> <h3 id="appendix-gaming-lcp">Appendix: gaming LCP</h3> <p>We noted above that we used <code>LCP*</code> and not <code>LCP</code>. This is because <code>LCP</code> basically measures when the largest change happens. 
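</p> <p>As a rough illustration of what the metric sees, the sketch below uses the standard <code>PerformanceObserver</code> API to log the candidate elements the browser reports; the last candidate before user input is what gets reported as <code>LCP</code>. This is just an illustration of how the metric behaves, not the methodology used for the measurements in this post:</p> <pre><code>// Log each largest-contentful-paint candidate the browser reports.
// The final candidate (before user input) is what gets reported as LCP, so a
// large element painted early, e.g. a full-viewport splash screen, ends up as
// the reported LCP element if nothing painted later is larger than it.
new PerformanceObserver((entryList) => {
  for (const entry of entryList.getEntries()) {
    console.log(
      'LCP candidate at', Math.round(entry.startTime), 'ms;',
      'area', entry.size, 'px^2;',
      'element', entry.element ? entry.element.tagName : '(removed)'
    );
  }
}).observe({ type: 'largest-contentful-paint', buffered: true });</code></pre> <p>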
Before this metric was widely gamed in ways that don't benefit the user, it was a great metric, but it has become less representative of the actual user experience as more people have gamed it. In the less blatant cases, people do small optimizations that improve <code>LCP</code> but barely improve or don't improve the actual user experience.</p> <p>In the more blatant cases, developers will deliberately flash a very large change on the page as soon as possible, generally a loading screen that has no value to the user (actually negative value because doing this increases the total amount of work done and the total time it takes to load the page) and then they carefully avoid making any change large enough that any later change would get marked as the <code>LCP</code>.</p> <p>For the same reason <a href="https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal">that VW didn't publicly discuss how it was gaming its emissions numbers</a>, developers tend to shy away from discussing this kind of <code>LCP</code> optimization in public. An exception to this is Discourse, where <a href="https://meta.discourse.org/t/introducing-discourse-splash-a-visual-preloader-displayed-while-site-assets-load/232003" rel="nofollow">they publicly announced this kind of <code>LCP</code> optimization, with comments from their devs and the then-CTO (now CEO)</a>, noting that their new &quot;Discourse Splash&quot; feature hugely reduced <code>LCP</code> for sites after they deployed it. And when developers ask why their <code>LCP</code> is high, the standard advice from Discourse developers is to keep elements smaller than the &quot;Discourse Splash&quot;, so that the <code>LCP</code> timestamp is computed from this useless element that's thrown up to optimize <code>LCP</code>, as opposed to having the timestamp be computed from any actual element that's relevant to the user. <a href="https://meta.discourse.org/t/theme-components-and-largest-contentful-paint-lcp/258680" rel="nofollow">Here's a typical, official, comment from Discourse</a>:</p> <blockquote> <p>If your banner is larger than the element we use for the &quot;Introducing Discourse Splash - A visual preloader displayed while site assets load&quot; you gonna have a bad time for LCP.</p> </blockquote> <p>The official response from Discourse is that you should make sure that your content doesn't trigger the <code>LCP</code> measurement so that, instead, their loading animation timestamp is what's used to compute <code>LCP</code>.</p> <p>The sites with the most extreme ratio of <code>LCP</code> of useful content vs.
Chrome's measured <code>LCP</code> were:</p> <ul> <li>Wix <ul> <li><code>M3</code>: <code>6</code></li> <li><code>M1</code>: <code>12</code></li> <li><code>Tecno Spark 8C</code>: <code>3</code></li> <li><code>Itel P32</code>: <code>N/A</code> <code>(FAIL)</code></li> </ul></li> <li>Discourse: <ul> <li><code>M3</code>: <code>10</code></li> <li><code>M1</code>: <code>12</code></li> <li><code>Tecno Spark 8C</code>: <code>4</code></li> <li><code>Itel P32</code>: <code>N/A</code> <code>(FAIL)</code></li> </ul></li> </ul> <p>Although we haven't discussed the gaming of other metrics, it appears that some websites also game other metrics and &quot;optimize&quot; them even when this has no benefit to users.</p> <h3 id="appendix-the-selfish-argument-for-optimizing-sites">Appendix: the selfish argument for optimizing sites</h3> <p>This will depend on the scale of the site as well as its performance, but when I've looked at this data for large companies I've worked for, improving site and app performance is worth a mind boggling amount of money. It's measurable in A/B tests and it's also among the interventions that has, in <abbr title="where you keep a fraction of users on the old arm of the A/B test for a long duration, sometimes a year or more, in order to see the long-term impact of a change">long-term holdbacks</abbr>, a relatively large impact on growth and retention (many interventions test well but don't look as good long term, whereas performance improvements tend to look better long term).</p> <p>Of course you can see this from the direct numbers, but you can also implicitly see this in a lot of ways when looking at the data. One angle is that (just for example), at Twitter, user-observed p99 latency was about <code>60s</code> in India as well as a number of African countries (even excluding relatively wealthy ones like Egypt and South Africa) and also about <code>60s</code> in the United States. Of course, across the entire population, people have faster devices and connections in the United States, but in every country, there are enough users that have slow devices or connections that the limiting factor is really user patience and not the underlying population-level distribution of devices and connections. Even if you don't care about users in Nigeria or India and only care about U.S. ad revenue, improving performance for low-end devices and connections has enough of impact that we could easily see the impact in global as well as U.S. revenue in A/B tests, especially in long-term holdbacks. And you also see the impact among users who have fast devices since a change that improves the latency for a user with a &quot;low-end&quot; device from <code>60s</code> to <code>50s</code> might improve the latency for a user with a high-end device from <code>5s</code> to <code>4.5s</code>, which has an impact on revenue, growth, and retention numbers as well.</p> <p>For <a href="bad-decisions/">a variety of reasons that are beyond the scope of this doc</a>, this kind of boring, quantifiable, growth and revenue driving work has been difficult to get funded at most large companies I've worked for relative to flash product work that ends up showing little to no impact in long-term holdbacks.</p> <h3 id="appendix-designing-for-low-performance-devices">Appendix: designing for low performance devices</h3> <p>When using slow devices or any device with low bandwidth and/or poor connectivity, the best experiences, by far, are generally the ones that load a lot of content at once into a static page. 
If the images have proper width and height attributes and alt text, that's very helpful. Progressive images (as in progressive jpeg) aren't particularly helpful.</p> <p>On a slow device with high bandwidth, any lightweight, static, page works well, and lightweight dynamic pages can work well if designed for performance. Heavy, dynamic, pages are doomed unless the page weight doesn't make the page complex.</p> <p>With low bandwidth and/or poor connectivity, lightweight pages are fine. With heavy pages, the best experience I've had is when I trigger a page load, go do something else, and then come back when it's done (or at least the HTML and CSS are done). I can then open each link I might want to read in a new tab, and then do something else while I wait for those to load.</p> <p>A lot of the optimizations that modern websites do, such as partial loading that causes more loading when you scroll down the page, and the concomitant hijacking of search (because the browser's built in search is useless if the page isn't fully loaded), cause this interaction model to stop working and make pages very painful to interact with.</p> <p>Just for example, a number of people have noted that Substack performs poorly for them because it does partial page loads. <a href="substack.mp4">Here's a video by @acidshill of what it looks like to load a Substack article and then scroll on an iPhone 8</a>, where the post has a fairly fast <code>LCP</code>, but if you want to scroll past the header, you have to wait <code>6s</code> for the next page to load, and then on scrolling again, you have to wait maybe another <code>1s</code> to <code>2s</code>:</p> <p>As an example of the opposite approach, I tried loading some fairly large plain HTML pages, such as <a href="diseconomies-scale/">diseconomies-scale/</a> (<code>0.1 MB wire</code> / <code>0.4 MB raw</code>) and <a href="threads-faq/">threads-faq/</a> (<code>0.4 MB wire</code> / <code>1.1 MB raw</code>) and these were still quite usable for me even on slow devices. <code>1.1 MB</code> seems to be larger than optimal and breaking that into a few different pages would be better on low-end devices, but a single page with <code>1.1 MB</code> of text works much better than most modern sites on a slow device. While you can get into trouble with HTML pages that are so large that browsers can't really handle them, for pages with a normal amount of content, it generally isn't until you have <a href="https://nolanlawson.com/2023/01/17/my-talk-on-css-runtime-performance/">complex CSS payloads</a> or JS that the pages start causing problems for slow devices. Below, we test pages that are relatively simple, some of which have a fair amount of media (<code>14 MB</code> in one case) and find that these pages work ok, as long as they stay simple.</p> <p>Chris Adams has also noted that blind users, using screen readers, often report that dynamic loading makes the experience much worse for them. Like dynamic loading to improve performance, this can be done well, but it's often either done badly or bundled with so much other complexity that the result is worse than a simple page.</p> <p>@Qingcharles noted another accessibility issue — the (prison) parolees he works with are given &quot;lifeline&quot; phones, which are often very low end devices. From a quick search, in 2024, some people will get an iPhone 6 or an iPhone 8, but there are also plenty of devices that are lower end than an Itel P32, let alone a Tecno Spark 8C.
They also get plans with highly limited data, and then when they run out, some people &quot;can't fill out any forms for jobs, welfare, or navigate anywhere with Maps&quot;.</p> <p>For sites that do up-front work and actually give you a decent experience on low end devices, Andy Kelley pointed out an example of a site that does up front work that seems to work ok on a slow device (although it would struggle on a very slow connection), <a href="https://ziglang.org/documentation/master/std/">the Zig standard library documentation</a>:</p> <blockquote> <p>I made the controversial decision to have it fetch all the source code up front and then do all the content rendering locally. In theory, this is CPU intensive but in practice... even those old phones have really fast CPUs!</p> </blockquote> <p>On the <code>Tecno Spark 8C</code>, this uses <code>4.7s</code> of CPU and, afterwards, is fairly responsive (relative to the device — <a href="input-lag/">of course an iPhone responds much more quickly</a>. Taps cause links to load fairly quickly and scrolling also works fine (it's a little jerky, but almost nothing is really smooth on this device). This seems like the kind of thing people are referring to when they say that you can get better performance if you ship a heavy payload, but there aren't many examples of that which actually improve performance on low-end devices.</p> <h3 id="appendix-articles-on-web-performance-issues">Appendix: articles on web performance issues</h3> <ul> <li>2015: Maciej Cegłowski: <a href="https://idlewords.com/talks/website_obesity.htm">The Website Obesity Crisis</a> <ul> <li>Size: <code>1.0 MB</code> / <code>1.1 MB</code></li> <li><code>Tecno Spark 8C</code>: <code>0.9s</code> / <code>1.4s</code> <ul> <li>Scrolling a bit jerky, images take a little bit of time to appear if scrolling very quickly (jumping halfway down page from top), but delay is below what almost any user would perceive when scrolling a normal distance.</li> </ul></li> </ul></li> <li>2015: Nate Berkopec: <a href="https://www.speedshop.co/2015/11/05/page-weight-doesnt-matter.html">Page Weight Doesn't Matter</a> <ul> <li>Size: <code>80 kB</code> / <code>0.2 MB</code></li> <li><code>Tecno Spark 8C</code>: <code>0.8s</code> / <code>0.7s</code> <ul> <li>Does lazy loading, page downloads <code>650 kB</code> / <code>1.8 MB</code> if you scroll through the entire page, but scrolling is only a little jerky and the lazy loading doesn't cause delays. Probably the only page I've tried that does lazy loading in a way that makes the experience better and not worse on a slow device; I didn't test on a slow connection, where this would still make the experience worse.</li> </ul></li> <li><code>Itel P32</code>: <code>1.1s</code> / <code>1s</code> <ul> <li>Scrolling basically unusable; scroll extremely jerky and moves a random distance, often takes over <code>1s</code> for text to render when scrolling to new text; can be much worse with images that are lazy loaded. 
Even though this is the best implementation of lazy loading I've seen in the wild, the <code>Itel P32</code> still can't handle it.</li> </ul></li> </ul></li> <li>2017: Dan Luu: <a href="web-bloat/">How web bloat impacts users with slow connections</a> <ul> <li>Size: <code>14 kB</code> / <code>57 kB</code></li> <li><code>Tecno Spark 8C</code>: <code>0.5s</code> / <code>0.3s</code> <ul> <li>Scrolling and interaction work fine.</li> </ul></li> <li><code>Itel P32</code>: <code>0.7s</code> / <code>0.5s</code></li> </ul></li> <li>2017-2024+: Alex Russell: <a href="https://infrequently.org/series/performance-inequality/">The Performance Inequality Gap (series)</a> <ul> <li>Size: <code>82 kB</code> / <code>0.1 MB</code></li> <li><code>Tecno Spark 8C</code>: <code>0.5s</code> / <code>0.4s</code> <ul> <li>Scrolling and interaction work fine.</li> </ul></li> <li><code>Itel P32</code>: <code>0.7s</code> / <code>0.4s</code> <ul> <li>Scrolling and interaction work fine.</li> </ul></li> </ul></li> <li>2024: Nikita Prokopov (Tonsky): <a href="https://tonsky.me/blog/js-bloat/">JavaScript Bloat in 2024</a> <ul> <li>Size: <code>14 MB</code> / <code>14 MB</code></li> <li><code>Tecno Spark 8C</code>: <code>0.8s</code> / <code>1.9s</code> <ul> <li>When scrolling, it takes a while for images to show up (500ms or so) and the scrolling isn't smooth, but it's not jerky enough that it's difficult to scroll to the right place.</li> </ul></li> <li><code>Itel P32</code>: <code>2.5s</code> / <code>3s</code> <ul> <li>Scrolling isn't smooth. Scrolling accurately is a bit difficult, but can generally scroll to where you want if very careful. Generally takes a bit more than <code>1s</code> for new content to appear when you scroll a significant distance.</li> </ul></li> </ul></li> <li>2024: Dan Luu: <a href="slow-device/">This post</a> <ul> <li>Size: <code>25 kB</code> / <code>74 kB</code></li> <li><code>Tecno Spark 8C</code>: <code>0.6s</code> / <code>0.5s</code> <ul> <li>Scrolling and interaction work fine.</li> </ul></li> <li><code>Itel P32</code>: <code>1.3s</code> / <code>1.1s</code> <ul> <li>Scrolling and interaction work fine, although I had to make a change for this to be the case — this doc originally had an embedded video, which the <code>Itel P32</code> couldn't really handle. <ul> <li>Note that, while these numbers are worse than the numbers for &quot;Page Weight Doesn't Matter&quot;, this page is usable after load, which that other page isn't because it executes some kind of lazy loading that's too complex for this phone to handle in a reasonable timeframe. <br /></li> </ul></li> </ul></li> </ul></li> </ul> <h3 id="appendix-empathy-for-non-rich-users">Appendix: empathy for non-rich users</h3> <p>Something I've observed over time, as programming has become more prestigious and more lucrative, is <a href="https://mastodon.social/@danluu/109901711437753852">that people have tended to come from wealthier backgrounds</a> and have less exposure to people with different income levels. An example we've discussed before is from a well-known, prestigious, startup with a very left-leaning employee base, where everyone got rich: in a slack discussion about the covid stimulus checks, a well-meaning progressive employee said that it was pointless because people would just use their stimulus checks to buy stock. This person had, apparently, never talked to any middle-class (let alone poor) person about where their money goes or looked at the data on who owns equity.
And that's just looking at American wealth. When we look at world-wide wealth, the general level of understanding is much lower. People seem to really underestimate the dynamic range in wealth and income across the world. From having talked to quite a few people about this, a lot of people seem to have mental buckets for &quot;poor by American standards&quot; (buys stock with stimulus checks) and &quot;poor by worldwide standards&quot; (maybe doesn't even buy stock), but the range of poverty in the world dwarfs the range of poverty in America to an extent that not many wealthy programmers seem to realize.</p> <p>Just for example, <a href="https://mastodon.social/@danluu/109537302116865694">in this discussion how lucky I was (in terms of financial opportunities) that my parents made it to America</a>, someone mentioned that it's not that big a deal because they had great financial opportunities in Poland. For one thing, with respect to the topic of the discussion, the probability that someone will end up with a high-paying programming job (senior staff eng at a high-paying tech company) or equivalent, I suspect that, when I was born, being born poor in the U.S. gives you better odds than being fairly well off in Poland, but I could believe the other case as well if presented with data. But if we're comparing Poland v. U.S. to Vietnam v. U.S., if I spend <abbr title="so, these probably aren't the optimal numbers one would use for a comparison, but I think they're good enough for this purpose">15 seconds looking up rough wealth numbers for these countries</abbr> in the year I was born, the GDP/capita ratio of U.S. : Poland was ~8:1, whereas it was ~50 : 1 for Poland : Vietnam. The difference in wealth between Poland and Vietnam was roughly the square of the difference between the U.S. and Poland, so Poland to Vietnam is roughly equivalent to Poland vs. some hypothetical country that's richer than the U.S. by the amount that the U.S. is richer than Poland. These aren't even remotely comparable, but a lot of people seem to have this mental model that there's &quot;rich countries&quot; and &quot;not rich countries&quot; and &quot;not rich countries&quot; are all roughly in the same bucket. GDP/capita isn't ideal, but it's easier to find than percentile income statistics; the quick search I did also turned up that annual income in Vietnam then was something like $200-$300 a year. Vietnam was also going through the tail end of a famine whose impacts are a bit difficult to determine because statistics here seem to be gamed, but if you believe the mortality rate statistics, the famine caused total overall mortality rate to jump to double the normal baseline<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">1</a></sup>.</p> <p>Of course, at the time, the median person in a low-income country wouldn't have had a computer, let alone internet access. But, today it's fairly common for people in low-income countries to have devices. Many people either don't seem to realize this or don't understand what sorts of devices a lot of these folks use.</p> <h3 id="appendix-comments-from-fabian-giesen">Appendix: comments from Fabian Giesen</h3> <p>On the Discourse founder's comments on iOS vs. Android marketshare, Fabian notes</p> <blockquote> <p>In the US, according to the most recent data I could find (for 2023), iPhones have around 60% marketshare. In the EU, it's around 33%. This has knock-on effects. 
Not only do iOS users skew towards the wealthier end, they also skew towards the US.</p> <p>There's some secondary effects from this too. For example, in the US, iMessage is very popular for group chats etc. and infamous for interoperating very poorly with Android devices in a way that makes the experience for Android users very annoying (almost certainly intentionally so).</p> <p>In the EU, not least because Android is so much more prominent, iMessage is way less popular and anecdotally, even iPhone users among my acquaintances who would probably use iMessage in the US tend to use WhatsApp instead.</p> <p>Point being, globally speaking, recent iOS + fast Internet is even more skewed towards a particular demographic than many app devs in the US seem to be aware.</p> </blockquote> <p>And on the comment about mobile app vs. web app sizes, Fabian said:</p> <blockquote> <p>One more note from experience: apps you install when you install them, and generally have some opportunity to hold off on updates while you're on a slow or metered connection (or just don't have data at all).</p> <p>Back when I originally got my US phone, I had no US credit history and thus had to use prepaid plans. I still do because it's fine for what I actually use my phone for most of the time, but it does mean that when I travel to Germany once a year, I don't get data roaming at all. (Also, phone calls in Germany cost me $1.50 apiece, even though T-Mobile is the biggest mobile provider in Germany - though, of course, not T-Mobile US.)</p> <p>Point being, I do get access to free and fast Wi-Fi at T-Mobile hotspots (e.g. major train stations, airports etc.) and on inter-city trains that have them, but I effectively don't have any data plan when in Germany at all.</p> <p>This is completely fine with mobile phone apps that work offline and sync their data when they have a connection. But web apps are unusable while I'm not near a public Wi-Fi.</p> <p>Likewise I'm fine sending an email over a slow metered connection via the Gmail app, but I for sure wouldn't use any web-mail client that needs to download a few MBs worth of zipped JS to do anything on a metered connection.</p> <p>At least with native app downloads, I can prepare in advance and download them while I'm somewhere with good internet!</p> </blockquote> <p>Another comment from Fabian (this time paraphrased since this was from a conversation), is that people will often justify being quantitatively hugely slower because there's a qualitative reason something should be slow. One example he gave was that screens often take a long time to sync their connection and this is justified because there are operations that have to be done that take time. For a long time, these operations would often take seconds. Recently, a lot of displays sync much more quickly because Nvidia specifies how long this can take for something to be &quot;G-Sync&quot; certified, so display makers actually do this in a reasonable amount of time now. While it's true that there are operations that have to be done that take time, there's no fundamental reason they should take as much time as they often used to. 
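</p> <p>The file-reading example that comes next is the kind of claim where the arithmetic is easy to sanity check. A minimal sketch of that back-of-the-envelope estimate, where the file count and per-syscall cost are assumptions picked for illustration rather than numbers from the case Fabian described:</p> <pre><code>// Rough estimate of how much time syscall overhead can plausibly account for
// when reading thousands of small files. All of the inputs are assumptions.
const files = 10000;             // assumed number of files
const syscallsPerFile = 4;       // e.g. open, fstat, read, close
const secondsPerSyscall = 2e-6;  // assume ~2 microseconds per syscall, a generous figure

const syscallOverheadSeconds = files * syscallsPerFile * secondsPerSyscall;
console.log('estimated syscall overhead:', syscallOverheadSeconds, 'seconds'); // 0.08

// Even with these generous assumptions, syscall overhead is on the order of
// tens of milliseconds, so it can't explain an operation that takes many
// seconds, even though 'syscalls are slow' is qualitatively true.</code></pre> <p>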
Another example he gave was how someone justified how long it took to read thousands of files by saying that the operation required a lot of syscalls and &quot;syscalls are slow&quot;, which is a qualitatively true statement, but if you look at the actual cost of a syscall, in the case under discussion, the cost of a syscall was many orders of magnitude away from being costly enough to be a reasonable explanation for why it took so long to read thousands of files.</p> <p>On this topic, when people point out that a modern website is slow, someone will generally respond with the qualitative defense that the modern website has these great features, which the older website is lacking. And while it's true that (for example) Discourse has features that MyBB doesn't, it's hard to argue that its feature set justifies being <code>33x</code> slower.</p> <h3 id="appendix-experimental-details">Appendix: experimental details</h3> <p>With the exception of danluu.com and, arguably, HN, for each site, I tried to find the &quot;most default&quot; experience. For example, for WordPress, this meant a demo blog with the current default theme, twentytwentyfour. In some cases, this may not be the most likely thing someone uses today, e.g., for Shopify, I looked at the first theme they give you when you browse their themes, but I didn't attempt to find theme data to see what the most commonly used theme is. For this post, I wanted to do all of the data collection and analysis as a short project, something that takes less than a day, so there were a number of shortcuts like this, which will be described below. I don't think it's wrong to use the first-presented Shopify theme since a decent fraction of users will probably use the first-presented theme, but that is, of course, less representative than grabbing whatever the most common theme is and then also testing many different sites that use that theme to see how real-world performance varies when people modify the theme for their own use. If I worked for Shopify or wanted to do competitive analysis on behalf of a competitor, I would do that, but for a one-day project on how large websites impact users on low-end devices, the performance of Shopify demonstrated here seems ok. I actually <a href="https://mastodon.social/@danluu/111994372051118539">did the initial work for this around when I ran these polls</a>, back in February; I just didn't have time to really write this stuff up for a month.</p> <p>For the tests on laptops, I tried to have the laptop at ~60% battery, not plugged in, and the laptop was idle for enough time to return to thermal equilibrium in a room at 20°C, so pages shouldn't be impacted by prior page loads or other prior work that was happening on the machine.</p> <p>For the mobile tests, the phones were at ~100% charge and plugged in, and also previously at 100% charge so the phones didn't have any heating effect you can get from rapidly charging. As noted above, these tests were performed with <code>1Gbps</code> WiFi. No other apps were running, the browser had no other tabs open, and no extra apps were installed on the device, so no additional background tasks should've been running other than whatever users are normally subject to by the device by default. A real user with the same device is going to see worse performance than we measured here in almost every circumstance except if running Chrome Dev Tools on a phone significantly degrades performance.
I noticed that, on the Itel P32, scrolling was somewhat jerkier with Dev Tools running than when running normally but, since this was a one-day project, I didn't attempt to quantify this or determine whether it impacts some sites much more than others. In absolute terms, the overhead can't be all that large because the fastest sites are still fairly fast with Dev Tools running, but if there's some kind of overhead that's super-linear in the amount of work the site does (possibly indirectly, if it causes some kind of resource exhaustion), then that could be a problem in measurements of some sites.</p> <p>Sizes were all measured on mobile, so in cases where different assets are loaded on mobile vs. desktop, we measured the mobile asset sizes. <code>CPU</code> was measured as CPU time on the main thread (I did also record time on other threads for sites that used other threads, but didn't use this number; if <code>CPU</code> were a metric people wanted to game, time on other threads would have to be accounted for to prevent sites from trying to offload as much work as possible to other threads, but this isn't currently an issue and time on main thread is more directly correlated to usability than sum of time across all threads, and the metric that would be robust to gaming is less legible, with no upside for now).</p> <p>For WiFi speeds, speed tests had the following numbers:</p> <ul> <li><code>M3 Max</code> <ul> <li>Netflix (fast.com) <ul> <li>Download: <code>850 Mbps</code></li> <li>Upload: <code>840 Mbps</code></li> <li>Latency (unloaded / loaded): <code>3ms</code> / <code>8ms</code></li> </ul></li> <li>Ookla <ul> <li>Download: <code>900 Mbps</code></li> <li>Upload: <code>840 Mbps</code></li> <li>Latency (unloaded / download / upload): <code>3ms</code> / <code>8ms</code> / <code>13ms</code></li> </ul></li> </ul></li> <li><code>Tecno Spark 8C</code> <ul> <li>Netflix (fast.com) <ul> <li>Download: <code>390 Mbps</code></li> <li>Upload: <code>210 Mbps</code></li> <li>Latency (unloaded / loaded): <code>2ms</code> / <code>30ms</code></li> </ul></li> <li>Ookla <ul> <li>Ookla web app fails, can't see results</li> </ul></li> </ul></li> <li><code>Itel P32</code> <ul> <li>Netflix <ul> <li>Download: <code>44 Mbps</code></li> <li>Upload: test fails to work (sends one chunk of data and then hangs, sending no more data)</li> <li>Latency (unloaded / loaded): <code>4ms</code> / <code>400ms</code></li> </ul></li> <li>Ookla <ul> <li>Download: <code>45 Mbps</code></li> <li>Upload: test fails to work</li> <li>Latency: test fails to display latency</li> </ul></li> </ul></li> </ul> <p>One thing to note is that the <code>Itel P32</code> doesn't really have the ability to use the bandwidth that it nominally has. Looking at the top Google reviews, none of them mention this. <a href="https://www.nairaland.com/4628841/itel-p32-review-great-those" rel="nofollow">The first review reads</a></p> <blockquote> <p>Performance-wise, the phone doesn’t lag. It is powered by the latest Android 8.1 (GO Edition) ... we have 8GB+1GB ROM and RAM, to run on a power horse of 1.3GHz quad-core processor for easy multi-tasking ... I’m impressed with the features on the P32, especially because of the price. I would recommend it for those who are always on the move.
And for those who take battery life in smartphones has their number one priority, then P32 is your best bet.</p> </blockquote> <p><a href="https://techjaja.com/itel-p32-review-dual-camera-smartphone-alarming-price-tag/" rel="nofollow">The second review reads</a></p> <blockquote> <p>Itel mobile is one of the leading Africa distributors ranking 3rd on a continental scale ... the light operating system acted up to our expectations with no sluggish performance on a 1GB RAM device ... fairly fast processing speeds ... the Itel P32 smartphone delivers the best performance beyond its capabilities ... at a whooping UGX 330,000 price tag, the Itel P32 is one of those amazing low-range like smartphones that deserve a mid-range flag for amazing features embedded in a single package.</p> </blockquote> <p><a href="https://pctechmag.com/2018/08/itel-p32-full-review-much-more-than-just-a-budget-entry-level-smartphone/" rel="nofollow">The third review reads</a></p> <blockquote> <p>&quot;Much More Than Just a Budget Entry-Level Smartphone ... Our full review after 2 weeks of usage ... While switching between apps, and browsing through heavy web pages, the performance was optimal. There were few lags when multiple apps were running in the background, while playing games. However, the overall performance is average for maximum phone users, and is best for average users [screenshot of game] Even though the game was skipping some frames, and automatically dropped graphical details it was much faster if no other app was running on the phone.</p> </blockquote> <p>Notes on sites:</p> <ul> <li>Wix <ul> <li>www.wix.com/website-template/view/html/3173?originUrl=https%3A%2F%2Fwww.wix.com%2Fwebsite%2Ftemplates%2Fhtml%2Fmost-popular&amp;tpClick=view_button&amp;esi=a30e7086-28db-4e2e-ba22-9d1ecfbb1250: this was the first entry when I clicked to get a theme</li> <li><code>LCP</code> was misleading on every device</li> <li>On the <code>Tecno Spark 8C</code>, scrolling never really works. It's very jerky and this never settles down</li> <li>On the <code>Itel P32</code>, the page fails non-deterministically (different errors on different loads); it can take quite a while to error out; it was <code>23s</code> on the first run, with the CPU pegged for <code>28s</code></li> </ul></li> <li>Patreon <ul> <li>www.patreon.com/danluu: used my profile where possible</li> <li>Scrolling on Patreon and finding old posts is so painful that I maintain <a href="#pt">my own index of my Patreon posts</a> so that I can find my old posts without having to use Patreon. Although Patreon's numbers in the table don't look that bad in the table when you're on a fast laptop, that's just for the initial load. The performance as you scroll is bad enough that I don't think that, today, there exists a computer and internet connection that browse Patreon with decent performance.</li> </ul></li> <li>Threads <ul> <li>threads.net/danluu.danluu: used my profile where possible</li> <li>On the <code>Itel P32</code>, this technically doesn't load correctly and could be marked as <code>FAIL</code>, but it's close enough that I counted it. 
The thing that's incorrect is that profile photos have a square box around them <ul> <li>However, as with the other heavy pages, interacting with the page doesn't really work and the page is unusable, but this appears to be for the standard performance reasons and not because the page failed to render</li> </ul></li> </ul></li> <li>Twitter <ul> <li>twitter.com/danluu: used my profile where possible</li> </ul></li> <li>Discourse <ul> <li>meta.discourse.org: this is what turned up when I searched for an official forum.</li> <li>As discussed above, the <code>LCP</code> is highly gamed and basically meaningless. We linked to a post where the Discourse folks note that, on slow loads, they put a giant splash screen up at <code>2s</code> to cap the <code>LCP</code> at <code>2s</code>. Also notable is that, on loads that are faster than 2s, the <code>LCP</code> is also highly gamed. For example, on the <code>M3 Max</code> with low-latency <code>1Gbps</code> internet, the <code>LCP</code> was reported as <code>115ms</code>, but the page loads actual content at <code>1.1s</code>. This appears to use the same fundamental trick as &quot;Discourse Splash&quot;, in that it paints a huge change onto the screen and then carefully loads smaller elements to avoid having the actual page content detected as the <code>LCP</code>.</li> <li>On the <code>Tecno Spark 8C</code>, scrolling is unpredictable and can jump too far, triggering loading from infinite scroll, which hangs the page for <code>3s-10s</code>. Also, the entire browser sometimes crashes if you just let the browser sit on this page for a while.</li> <li>On the <code>Itel P32</code>, an error message is displayed after <code>7.5s</code></li> </ul></li> <li>Bluesky <ul> <li>bsky.app/profile/danluu.com</li> <li>Displays a blank screen on the <code>Itel P32</code></li> </ul></li> <li>Squarespace <ul> <li>cedar-fluid-demo.squarespace.com: this was the second theme that showed up when I clicked themes to get a theme; the first was one called &quot;Bogart&quot;, but that was basically a &quot;coming soon&quot; single page screen with no content, so I used the second theme instead of the first one.</li> <li>A lot of errors and warnings in the console with the <code>Itel P32</code>, but the page appears to load and work, although interacting with it is fairly slow and painful</li> <li><code>LCP</code> on the <code>Tecno Spark 8C</code> was significantly before the page content actually loaded</li> </ul></li> <li>Tumblr <ul> <li>www.tumblr.com/slatestarscratchpad: used this because I know this tumblr exists. I don't read a lot of tumblrs (maybe three or four), and this one seemed like the closest thing to my blog that I know of on tumblr.</li> <li>This page fails on the <code>Itel P32</code>, but doesn't <code>FAIL</code>. The console shows that the JavaScript errors out, but the page still works fine (I tried scrolling, clicking links, etc., and these all worked), so you can actually go to the post you want and read it.
The JS error appears to have made this page load much more quickly than it otherwise would have and also made interacting with the page after it loaded fairly zippy.</li> </ul></li> <li>Shopify <ul> <li>themes.shopify.com/themes/motion/styles/classic/preview?surface_detail=listing&amp;surface_inter_position=1&amp;surface_intra_position=1&amp;surface_type=all: this was the first theme that showed up when I looked for themes</li> <li>On the first <code>M3/10</code> run, Chrome dev tools reported a nonsensical <code>697s</code> of CPU time (the run completed in a normal amount of time, well under <code>697s</code> or even <code>697/10s</code>). This run was ignored when computing results.</li> <li>On the <code>Itel P32</code>, the page load never completes and it just shows a flashing cursor-like image, which is deliberately loaded by the theme. On devices that load properly, the flashing cursor image is immediately covered up by another image, but that never happens here.</li> <li>I wondered if it wasn't fair to use this example theme because there's some stuff on the page that lets you switch theme styles, so I checked out actual uses of the theme (the page that advertises the theme lists users of the theme). I tried the first two listed real examples and they were both much slower than this demo page.</li> </ul></li> <li>Reddit <ul> <li>reddit.com</li> <li>Has an unusually low <code>LCP*</code> compared to how long it takes for the page to become usable. Although not measured in this test, I generally find the page slow and sort of unusable on Intel Macbooks which are, by historical standards, extremely fast computers (unless I use old.reddit.com)</li> </ul></li> <li>Mastodon <ul> <li>mastodon.social/@danluu: used my profile where possible</li> <li>Fails to load on <code>Itel P32</code>, just gives you a blank screen. Due to how long things generally take on the <code>Itel P32</code>, it's not obvious for a while if the page is failing or if it's just slow</li> </ul></li> <li>Quora <ul> <li>www.quora.com/Ever-felt-like-giving-up-on-your-dreams-How-did-you-come-out-of-it: I tried googling for quora + the username of a metafilter user who I've heard is now prolific on Quora. Rather than giving their profile page, Google returned this page, which appears to have nothing to do with the user I searched for. So, this isn't comparable to the social media profiles, but getting a random irrelevant Quora result from Google is how I tend to interact with Quora, so I guess this is representative of my Quora usage.</li> </ul></li> <li>Substack <ul> <li>Used thezvi.substack.com because I know Zvi has a substack and writes about similar topics.</li> </ul></li> <li>vBulletin: <ul> <li>forum.vbulletin.com: this is what turned up when I searched for an official forum.</li> </ul></li> <li>Medium <ul> <li>medium.com/swlh: I don't read anything on Medium, so I googled for programming blogs on Medium and this was the top hit. From looking at the theme, it doesn't appear to be unusually heavy or particularly customized for a Medium blog. Since it appears to be widely read and popular, it's more likely to be served from a CDN than some of the other blogs here.</li> <li>On a run that wasn't a benchmark reference run, on the <code>Itel P32</code>, I tried scrolling starting 35s after loading the page.
The delay to scroll was <code>5s-8s</code> and scrolling moved an unpredictable amount, making the page completely unusable. This wasn't marked as a <code>FAIL</code> in the table, but one could argue that this should be a <code>FAIL</code> since the page is unusable.</li> </ul></li> <li>Ghost <ul> <li>source.ghost.io because this is the current default Ghost theme and it was the first example I found</li> </ul></li> <li>Wordpress <ul> <li>2024.wordpress.net because this is the current default wordpress theme and this was the first example of it I found</li> </ul></li> <li>XenForo <ul> <li>xenforo.com/community/: this is what turned up when I searched for an official forum</li> <li>On the <code>Itel P32</code>, the layout is badly wrong and page content overlaps itself. There's no reasonable way to interact with the element you want because of this, and reading the text requires reading text that's been overprinted multiple times.</li> </ul></li> <li>Wordpress (old) <ul> <li>Used thezvi.wordpress.com because it has the same content as Zvi's substack, and happens to be on some old wordpress theme that used to be a very common choice</li> </ul></li> <li>phpBB <ul> <li>www.phpbb.com/community/index.php: this is what turned up when I searched for an official forum.</li> </ul></li> <li>MyBB <ul> <li>community.mybb.com: this is what turned up when I searched for an official forum.</li> <li>Site doesn't serve up a mobile version. In general, I find the desktop version of sites to be significantly better than the mobile version when on a slow device, so this works quite well, although they're likely penalized by Google for this.</li> </ul></li> <li>HN <ul> <li>news.ycombinator.com</li> <li>In principle, HN should be the slowest social media site or link aggregator because it's written in a custom Lisp that isn't highly optimized and the code was originally written with brevity and cleverness in mind, which generally gives you fairly poor performance. However, that's only poor relative to what you'd get if you were writing high-performance code, which is not a relevant point of comparison here.</li> </ul></li> <li>danluu.com <ul> <li>Self explanatory</li> <li>This currently uses a bit less CPU than HN, but I expect this to eventually use more CPU as the main page keeps growing. At the moment, this page has 176 links to 168 articles vs. HN's 199 links to 30 articles but, barring an untimely demise, this page should eventually have more links than HN. <ul> <li>As noted above, I find that pagination for such small pages makes the browsing experience much worse on slow devices or with bad connections, so I don't want to &quot;optimize&quot; this by paginating it or, even worse, doing some kind of dynamic content loading on scroll.</li> </ul></li> </ul></li> <li>Woo Commerce <ul> <li>I originally measured Woo Commerce as well but, unlike the pages and platforms tested above, I didn't find that being fast or slow on the initial load was necessarily representative of subsequent performance of other action, so this wasn't included in the table because having this in the table is sort of asking for a comparison against Shopify. 
In particular, while the &quot;most default&quot; Woo theme I could find was significantly faster than the &quot;most default&quot; Shopify theme on initial load on a slow device, performance was multidimensional enough that it was easy to find realistic scenarios where Shopify was faster than Woo and vice versa on a slow device, which is quite different from what I saw with newer blogging platforms like Substack and Medium compared to older platforms like Wordpress, or a modern forum like Discourse versus the older PHP-based forums. A real comparison of shopping sites that have carts, checkout flows, etc., would require a better understanding of real-world usage of these sites than I was going to get in a single day.</li> </ul></li> <li>NodeBB <ul> <li>community.nodebb.org</li> <li>This wasn't in my original tests and I only tried this out because one of the founders of NodeBB suggested it, saying &quot;I am interested in seeing whether @nodebb@fosstodon.org would fare better in your testing. We spent quite a bit of time over the years on making it wicked fast, and I personally feel it is a better representation of modern forum software than Discourse, at least on speed and initial payload.&quot;</li> <li>I didn't do the full set of tests because I don't keep the <code>Itel P32</code> charged (the battery is in rough shape and discharges quite quickly once unplugged, so I'd have to wait quite a while to get it into a charged state)</li> <li>On the tests I did, it got <code>0.3s/0.4s</code> on the <code>M1</code> and <code>3.4s/7.2s</code> on the <code>Tecno Spark 8C</code>. This is moderately slower than vBulletin and significantly slower than the faster php forums, but much faster than Discourse. If you need a &quot;modern&quot; forum for some reason and want to have your forum be usable by people who aren't, by global standards, rich, this seems like it could work.</li> <li>Another notable thing, given that it's a &quot;modern&quot; site, is that interaction works fine after initial load; you can scroll and tap on things and this all basically works, nothing crashed, etc.</li> <li>Sizes were <code>0.9 MB</code> / <code>2.2 MB</code>, so also fairly light for a &quot;modern&quot; site and possibly usable on a slow connection, although slow connections weren't tested here.</li> </ul></li> </ul> <p>Another kind of testing would be to try to configure pages to look as similar as possible. I'd be interested in seeing that results for that if anyone does it, but that test would be much more time consuming. For one thing, it requires customizing each site. And for another, it requires deciding what sites should look like. If you test something danluu.com-like, every platform that lets you serve up something light straight out of a CDN, like Wordpress and Ghost, should score similarly, with the score being dependent on the CDN and the CDN cache hit rate. Sites like Medium and Substack, which have relatively little customizability would score pretty much as they do here. Realistically, from looking at what sites exist, most users will create sites that are slower than the &quot;most default&quot; themes for Wordpress and Ghost, although it's plausible that readers of this blog would, on average, do the opposite, so you'd probably want to test a variety of different site styles.</p> <h3 id="appendix-this-site-vs-sites-that-don-t-work-on-slow-devices-or-slow-connections">Appendix: this site vs. 
sites that don't work on slow devices or slow connections</h3> <p>Just as an aside, something I've found funny for a long time is that I get quite a bit of hate mail about the styling on this page (and a similar volume of appreciation mail). By hate mail, I don't mean polite suggestions to change things, I mean the equivalent of road rage, but for web browsing; web rage. I know people who run sites that are complex enough that they're unusable by a significant fraction of people in the world. How come people are so incensed about the styling of this site and, proportionally, basically don't care at all that the web is unusable for so many people?</p> <p>Another funny thing here is that the people who appreciate the styling generally appreciate that the site doesn't override any kind of default styling, letting you make the width exactly what you want (by setting your window size how you want it) and it also doesn't override any kind of default styling you apply to sites. The people who are really insistent about this want everyone to have some width limit they prefer, some font they prefer, etc., but it's always framed as if they don't want it for themselves; it's really for the benefit of people at large, even though accommodating the preferences of the web ragers would directly oppose the preferences of people who prefer (just for example) to be able to adjust the text width by adjusting their window width.</p> <p>Until I pointed this out tens of times, this iteration would usually start with web ragers telling me that &quot;studies show&quot; that narrower text width is objectively better, but on reading every study that exists on the topic that I could find, I didn't find this to be the case. Moreover, on asking for citations, it's clear that people saying this generally hadn't read any studies on this at all and would sometimes hastily send me a study that they did not seem to have read. When I'd point this out, people would then change their argument to how studies can't really describe the issue (odd that they'd cite studies in the first place), although one person cited a book to me (which I read and they, apparently, had not since it also didn't support their argument) and then move to how this is what everyone wants, even though that's clearly not the case, both from the comments I've gotten as well as the data I have from when I made the change.</p> <p>Web ragers who have this line of reasoning generally can't seem to absorb the information that their preferences are not universal and will insist that they are regardless of what people say they like, which I find fairly interesting. On the data, when I switched from Octopress styling (at the time, the most popular styling for programming bloggers) to the current styling, I got what appeared to be a causal increase in traffic and engagement, so it appears that not only do people who write me appreciation mail about the styling like the styling, the overall feeling of people who don't write to me appears to be that the site is fine and apparently more appealing than standard programmer blog styling.
When I've noted this, people tend to become further invested in the idea that their preferences are universal and that people who think they have other preferences are wrong and reply with total nonsense.</p> <p>For me, two questions I'm curious about are why do people feel the need to fabricate evidence on this topic (referring to studies when they haven't read any, googling for studies and then linking to one that says the opposite of what they claim it says, presumably because they didn't really read it, etc.) in order to claim that there are &quot;objective&quot; reasons their preferences are universal or correct, and why are people so much more incensed by this than by the global accessibility problems caused by typical web design? On the latter, I suspect if you polled people with an abstract survey, they would rate global accessibility to be a larger problem, but by revealed preference both in terms of what people create as well as what irritates them enough to send hate mail, we can see that having fully-adjustable line width and not capping line width at their preferred length is important to do something about whereas global accessibility is not. As noted above, people who run sites that aren't accessible due to performance problems generally get little to no hate mail about this. And when I used a default Octopress install, I got zero hate mail about this. Fewer people read my site at the time, but my traffic volume hasn't increased by a huge amount since then and the amount of hate mail I get about my site design has gone from zero to a fair amount, <abbr title="To find the plausible range of underlying ratios, we can do a simple Bayesian adjustment here and we still find that the ratio of hate mail has increased by much more than the increase in traffic; maybe one can argue that hate mail for slow sites is spread across all slow sites, so a second adjustment needs to be done here?">an infinitely higher ratio</abbr> than the increase in traffic.</p> <p>To be clear, I certainly wouldn't claim that the design on this site is optimal. <a href="octopress-speedup/">I just removed the CSS from the most popular blogging platform for programmers at the time because that CSS seemed objectively bad for people with low-end connections</a> and, as a side effect, got more traffic and engagement overall, not just from locations where people tend to have lower end connections and devices. No doubt a designer who cares about users on low-end connections and devices could do better, but there's something quite odd about both the untruthfulness and the vitriol of comments on this.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:L"><a href="https://www.jstor.org/stable/3004042">This estimate puts backwards-looking life expectancy in the low 60s</a>; that paper also discusses other estimates in the mid 60s and discusses biases in the estimates. <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> </ol> </div> Diseconomies of scale in fraud, spam, support, and moderation diseconomies-scale/ Sun, 18 Feb 2024 00:00:00 +0000 diseconomies-scale/ <p>If I ask myself a question like &quot;I'd like to buy an SD card; who do I trust to sell me a real SD card and not some fake, Amazon or my local Best Buy?&quot;, of course the answer is that I trust my local Best Buy<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">1</a></sup> more than Amazon, which is notorious for selling counterfeit SD cards.
And if I ask who I trust more, Best Buy or my local reputable electronics shop (Memory Express, B&amp;H Photo, etc.), I trust my local reputable electronics shop more. Not only are they <a href="https://www.pentaxforums.com/forums/22-pentax-camera-field-accessories/53884-counterfeit-sandisk-extreme-iii-sdhc-best-buy.html">less likely to</a> <a href="https://www.dpreview.com/forums/thread/2308411">sell me a counterfeit than Best Buy</a>, but, in the event that they do sell me a counterfeit, the service is likely to be better.</p> <p>Similarly, let's say I ask myself a question like, &quot;on which platform do I get a higher rate of scams, spam, fraudulent content, etc., [smaller platform] or [larger platform]&quot;? Generally the answer is [larger platform]. Of course, there are more total small platforms out there and they're higher variance, so I could deliberately use a smaller platform that's worse, but if I'm choosing good options instead of bad options in every size class, the smaller platform is generally better. For example, with Signal vs. WhatsApp, I've literally never received a spam Signal message, whereas I get spam WhatsApp messages somewhat regularly. Or if I compare places I might read tech content on, if I compare tiny forums no one's heard of to lobste.rs, lobste.rs has a very slightly higher rate (rate as in fraction of messages I see, not absolute message volume) of bad content because it's zero on the private forums and very low but non-zero on lobste.rs. And then if I compare lobste.rs to a somewhat larger platform, like Hacker News or mastodon.social, those have (again very slightly) higher rates of scam/spam/fraudulent content. And then if I compare that to mid-sized social media platforms, like reddit, reddit has a significantly higher and noticeable rate of bad content. And then if I compare reddit to the huge platforms like YouTube, Facebook, <a href="seo-spam/">Google search results</a>, these larger platforms have an even higher rate of scams/spam/fraudulent content. And, as with the SD card example, the odds of getting decent support go down as the platform size goes up as well. In the event of an incorrect suspension or ban from the platform, the odds of an account getting reinstated get worse as the platform gets larger.</p> <p>I don't think it's controversial to say that in general, a lot of things get worse as platforms get bigger. For example, when I ran <a href="https://twitter.com/danluu/status/1570604630350106624">a Twitter poll to see what people I'm loosely connected to think</a>, only 2.6% thought that huge company platforms have the best moderation and spam/fraud filtering. <a href="https://carsey.unh.edu/publication/conspiracy-vs-science-a-survey-of-us-public-beliefs">For reference, in one poll, 9% of Americans said that vaccines implant a microchip and 12% said the moon landing was fake</a>. These are different populations but it seems random Americans are more likely to say that the moon landing was faked than tech people are likely to say that the largest companies have the best anti-fraud/anti-spam/moderation.</p> <p>However, over the past five years, I've noticed an increasingly large number of people make the opposite claim, that only large companies can do decent moderation, spam filtering, fraud (and counterfeit) detection, etc. 
We looked at one example of this <a href="seo-spam/">when we examined search results</a>, where a Google engineer said</p> <blockquote> <p>Somebody tried argue that if the search space were more competitive, with lots of little providers instead of like three big ones, then somehow it would be *more* resistant to ML-based SEO abuse.</p> <p>And... look, if *google* can't currently keep up with it, how will Little Mr. 5% Market Share do it?</p> </blockquote> <p>And a thought leader responded</p> <blockquote> <p>like 95% of the time, when someone claims that some small, independent company can do something hard better than the market leader can, it’s just cope. economies of scale work pretty well!</p> </blockquote> <p><a href="seo-spam/">But when we looked at the actual results, it turned out that, of the search engines we looked at, Mr 0.0001% Market Share was the most resistant to SEO abuse (and fairly good), Mr 0.001% was a bit resistant to SEO abuse, and Google and Bing were just flooded with SEO abuse, frequently funneling people directly to various kinds of scams</a>. Something similar happens with email, where I commonly hear that it's impossible to manage your own email due to the spam burden, <a href="https://mastodon.social/@danluu/111736581992007404">but people do it all the time and often have similar or better results than Gmail</a>, with the main problem being interacting with big company mail servers which incorrectly ban their little email server.</p> <p>I started seeing a lot of comments claiming that you need scale to do moderation, anti-spam, anti-fraud, etc., around the time <a href="https://twitter.com/danluu/status/1179449106877431808">Zuckerberg, in response to Elizabeth Warren calling for the breakup of big tech companies, claimed that breaking up tech companies would make content moderation issues substantially worse, saying</a>:</p> <blockquote> <p>It’s just that breaking up these companies, whether it’s Facebook or Google or Amazon, is not actually going to solve the issues,” Zuckerberg said “And, you know, it doesn’t make election interference less likely. It makes it more likely because now the companies can’t coordinate and work together. It doesn’t make any of the hate speech or issues like that less likely. It makes it more likely because now ... all the processes that we’re putting in place and investing in, now we’re more fragmented</p> <p>It’s why Twitter can’t do as good of a job as we can. I mean, they face, qualitatively, the same types of issues. But they can’t put in the investment. Our investment on safety is bigger than the whole revenue of their company. [laughter] And yeah, we’re operating on a bigger scale, but it’s not like they face qualitatively different questions. They have all the same types of issues that we do.&quot;</p> </blockquote> <p>The argument is that you need a lot of resources to do good moderation and smaller companies, Twitter sized companies (worth ~$30B at the time), can't marshal the necessary resources to do good moderation. I found this statement quite funny at the time because, pre-Twitter acquisition, I saw a much higher rate of obvious scam content on Facebook than on Twitter. <a href="https://twitter.com/danluu/status/1584615878800576512">For example, when I clicked through Facebook ads during holiday shopping season, most were scams</a> and, while Twitter had its share of scam ads, it wasn't really in the same league as Facebook. 
And it's not just me — Arturo Bejar, who designed an early version of Facebook's reporting system and headed up some major trust and safety efforts, noticed something similar (see footnote for details)<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">2</a></sup>.</p> <p>Zuckerberg seems to like the line of reasoning mentioned above, though, as he's made similar arguments elsewhere, <a href="https://stratechery.com/2021/an-interview-with-mark-zuckerberg-about-the-metaverse/">such as here</a>, in a statement the same year that Meta's internal docs made the case that they were exposing 100k minors a day to sexual abuse imagery:</p> <blockquote> <p>To some degree when I was getting started in my dorm room, we obviously couldn’t have had 10,000 people or 40,000 people doing content moderation then and the AI capacity at that point just didn’t exist to go proactively find a lot of harmful content. At some point along the way, it started to become possible to do more of that as we became a bigger business</p> </blockquote> <p>The rhetorical sleight of hand here is the assumption that Facebook needed 10k or 40k people doing content moderation when Facebook was getting started in Zuckerberg's dorm room. Services that are larger than dorm-room-Facebook can and do have better moderation than Facebook today with a single moderator, often one who works part time. But as people talk more about pursuing real antitrust action against big tech companies, big tech founders and execs have ramped up the anti-antitrust rhetoric, making claims about all sorts of disasters that will befall humanity if the biggest companies are broken up into companies the size of the biggest tech companies of 2015 or 2010. This kind of reasoning seems to be catching on a bit, as I've seen more and more big company employees repeat very similar arguments. We've come a long way since the 1979 IBM training manual which read</p> <p><font size="+1"><b></p> <blockquote> <p>A COMPUTER CAN NEVER BE HELD ACCOUNTABLE</p> <p>THEREFORE A COMPUTER MUST NEVER MAKE A MANAGEMENT DECISION</p> </blockquote> <p></font></b></p> <p>The argument now is that, for many critical decisions, only computers can make the decisions, and the lack of accountability seems to ultimately be a feature, not a bug.</p> <p>But unfortunately for Zuckerberg's argument<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">3</a></sup>, there are at least three major issues in play here where diseconomies of scale dominate. One is that, given material that nearly everyone can agree is bad (such as bitcoin scams, spam for fake pharmaceutical products, <a href="seo-spam/">fake weather forecasts</a>, or adults sending photos of their genitals to children), large platforms do worse than small ones. The second is that, for the user, errors are much more costly and less fixable as companies get bigger because support generally becomes worse. 
<a href="impossible-agree/">The third is that, as platforms scale up, a larger fraction of users will strongly disagree about what should be allowed on the platform</a>.</p> <p>With respect to the first, while it's true that big companies have more resources, the <a href="cocktail-ideas/">cocktail party idea</a> that they'll have the best moderation because they have the most resources is countered by the equally simplistic idea that they'll have the worst moderation because they're the juiciest targets or that they'll have the worst moderation because they'll have the worst fragmentation due to the standard diseconomies of scale that occur when you scale up organizations and problem domains. Whether the advantage of having more resources or these other factors dominates is too complex to resolve theoretically, but we can observe the result empirically. At least at <a href="https://mastodon.social/@danluu/109919312448235105">the level of resources that big companies choose to devote to moderation, spam, etc.</a>, having the larger target and other problems associated with scale dominate.</p> <p>While it's true that these companies are wildly profitable and could devote enough resources to significantly reduce this problem, they have chosen not to do this. For example, in the year before I wrote this sentence (through December 2023), Meta's profit before tax was $47B. If Meta had a version of the internal vision statement of a power company a friend of mine worked for (&quot;Reliable energy, at low cost, for generations.&quot;) and operated like that power company did, trying to create a good experience for the user instead of maximizing profit plus creating the metaverse, they could've spent the $50B they spent on the metaverse on moderation platforms and technology and then spent <abbr title="if you think this is too low, feel free to adjust this number up and the number of employees down">$30k/yr (which would result in a very good income in most countries where moderators are hired today, allowing them to have their pick of who to hire) on 1.6 million additional full-time staffers</abbr> for things like escalations and support, on the order of one additional moderator or support staffer per few thousand users (and of course diseconomies of scale apply to managing this many people). I'm not saying that Meta or Google should do this, just that whenever someone at a big tech company says <a href="https://news.ycombinator.com/item?id=38614042">something like</a> &quot;these systems have to be fully automated because no one could afford to operate manual systems at our scale&quot;, what's really being said is more along the lines of &quot;we would not be able to generate as many billions a year in profit if we hired enough competent people to manually review cases our system should flag as ambiguous, so we settle for what we can get without compromising profits&quot;.<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">4</a></sup> One can defend that choice, but it is a choice.</p> 
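<p>Since that claim leans on quick arithmetic, here's a minimal back-of-the-envelope sketch of it. The metaverse spend and the moderator pay are the figures from the paragraph above; the user count is a rough assumption on my part, so treat the output as an order-of-magnitude estimate rather than a precise figure:</p> <pre><code># Back-of-the-envelope check of the figures above. The metaverse spend and
# moderator pay are the numbers from this post; the user count is a rough
# assumption (Meta reported roughly 3 billion daily users across its apps
# around this time), so treat the results as order-of-magnitude only.
metaverse_spend = 50e9        # dollars
moderator_pay = 30e3          # dollars per staffer per year
users = 3e9                   # assumed user count

staffers = metaverse_spend / moderator_pay   # about 1.7 million, rounded down to 1.6 million in the post
users_per_staffer = users / staffers         # about 1,800, i.e., on the order of a few thousand
print(round(staffers), round(users_per_staffer))
</code></pre> <p>And likewise for claims about advantages of economies of scale. There are areas where economies of scale legitimately make the experience better for users. 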
For example, when we looked at <a href="nothing-works/">why it's so hard to buy things that work well</a>, we noted that Amazon's economies of scale have enabled them to build out their own package delivery service that is, while flawed, still more reliable than is otherwise available (and this has only improved since they added the ability for users to rate each delivery, which no other major package delivery service has). Similarly, Apple's scale and vertical integration have allowed them to <a href="https://mastodon.social/@danluu/111064942895424216">build one of the all-time great performance teams</a> (as measured by <abbr title="normalizing for both average performance as well as performance variance among competitors">normalized performance</abbr> relative to competitors of the same era), not only wiping the floor with the competition on benchmarks, but also providing a better experience in ways that no one really measured until recently, like <a href="input-lag/">device latency</a>. For a more mundane example of economies of scale, crackers and other foods that ship well are cheaper on Amazon than in my local grocery store. It's easy to name ways in which economies of scale benefit the user, but this doesn't mean that we should assume that economies of scale dominate diseconomies of scale in all areas. Although it's beyond the scope of this post, if we're going to talk about whether or not users are better off if companies are larger or smaller, we should look at what gets better when companies get bigger and what gets worse, not just assume that everything will get better just because some things get better (or vice versa).</p> <p>Coming back to the argument that huge companies have the most resources to spend on moderation, spam, anti-fraud, etc., vs. the reality that they choose to spend those resources elsewhere, like dropping $50B on the Metaverse and not hiring 1.6 million moderators and support staff that they could afford to hire, it makes sense to look at how much effort is being expended. Meta's involvement in Myanmar makes for a nice case study because Erin Kissane wrote up a fairly detailed <a href="https://erinkissane.com/meta-in-myanmar-full-series">40,000 word account of what happened</a>. The entirety of what happened is a large and complicated issue (<a href="#appendix-erin-kissane-on-meta-in-myanmar">see appendix for more discussion</a>) but, for the main topic of this post, the key components are that there was an issue that most people can generally agree should be among the highest priority moderation and support issues and that, despite repeated, extremely severe and urgent warnings to Meta staff at various levels (engineers, directors, VPs, execs, etc.), almost no resources were dedicated to the issue while internal documents indicate that only a small fraction of agreed-upon bad content was caught by their systems (on the order of a few percent). I don't think this is unique to Meta and this matches my experience with other large tech companies, both as a user of their products and as an employee.</p> <p>To pick a smaller scale example, an acquaintance of mine had their Facebook account compromised and it's now being used for bitcoin scams. The person's name is Samantha K. and some scammer is doing enough scamming that they didn't even bother reading her name properly and have been generating very obviously faked photos where someone holds up a sign and explains how &quot;Kamantha&quot; has helped them make tens or hundreds of thousands of dollars. 
This is a fairly common move for &quot;hackers&quot; to make and someone else I'm connected to on FB reported that this happened to their account and they haven't been able to recover the old account or even get it banned despite the constant stream of obvious scams being posted by the account.</p> <p>By comparison, on lobste.rs, I've never seen a scam like this and Peter Bhat Harkins, the head mod, says that they've never had one that he knows of. On Mastodon, I think I might've seen one once in my feed, replies, or mentions. Of course, Mastodon is big enough that you can find some scams if you go looking for them, but the per-message and per-user rates are low enough that you shouldn't encounter them as a normal user. On Twitter (before the acquisition) or reddit, I'd see them moderately frequently, perhaps an average of once every few weeks in my normal feed. On Facebook, I see things like this <a href="https://twitter.com/danluu/status/1584615878800576512">all the time; I get obvious scam consumer goods sites every shopping season, and the bitcoin scams, both from ads as well as account takeovers, are year-round</a>. <a href="https://news.ycombinator.com/item?id=38613594">Many people have noted that they don't bother reporting these kinds of scams anymore because they've observed that Facebook doesn't take action on their reports</a>. <a href="https://lerner.co.il/2023/10/19/im-banned-for-life-from-advertising-on-meta-because-i-teach-python/">Meanwhile, Reuven Lerner was banned from running Facebook ads on their courses about Python and Pandas</a>, seemingly because Facebook systems &quot;thought&quot; that Reuven was advertising <a href="https://news.ycombinator.com/item?id=37941905">something to do with animal trading</a> (as opposed to programming). This is the fidelity of moderation and spam control that Zuckerberg says cannot be matched by any smaller company. By the way, I don't mean to pick on Meta in particular; if you'd like examples with a slightly different flavor, you can <a href="#google">see the appendix of Google examples</a> for a hundred examples of automated systems going awry at Google.</p> <p>A reason this comes back to being an empirical question is that all of this talk about how economies of scale allow huge companies to bring more resources to bear on the problem only matters if the company chooses to deploy those resources. There's no theoretical force that makes companies deploy resources in these areas, so we can't reason theoretically. But we can observe that the resources deployed aren't sufficient to match the problems, even in cases where people would generally agree that the problem should very obviously be high priority, such as with Meta in Myanmar. Of course, when it comes to issues where the priority is less obvious, resources are also not deployed there.</p> <p>On the second issue, support, it's a meme among tech folks that the only way to get support as a user of one of the big platforms is to make a viral social media post or know someone on the inside. This compounds the issue of bad moderation, scam detection, anti-fraud, etc., since those issues could be mitigated if support was good.</p> <p>Normal support channels are a joke, where you either get a generic form letter rejection, or a kafkaesque nightmare followed by a form letter rejection. 
For example, when <a href="https://twitter.com/craig1black/status/1645649300167495681">Adrian Black was banned from YouTube for impersonating Adrian Black</a> (to be clear, he was banned for impersonating himself, not someone else with the same name), after appealing, he got a response that read</p> <blockquote> <p>unfortunately, there's not more we can do on our end. your account suspension &amp; appeal were very carefully reviewed &amp; the decision is final</p> </blockquote> <p><a href="https://twitter.com/danluu/status/1308215389344600066">In another Google support story, Simon Weber got the runaround from Google support when he was trying to get information he needed to pay his taxes</a></p> <blockquote> <p>accounting data exports for extensions have been broken for me (and I think all extension merchants?) since April 2018 [this was written on Sept 2020]. I had to get the NY attorney general to write them a letter before they would actually respond to my support requests so that I could properly file my taxes</p> </blockquote> <p>There was also the time <a href="https://twitter.com/PointCrow/status/1587084876741689345">YouTube kept demonetizing PointCrow's video of eating water with chopsticks (he repeatedly dips chopsticks into water and then drinks the water, very slowly eating a bowl of water)</a>.</p> <p>Despite responding with things like</p> <blockquote> <p>we're so sorry about that mistake &amp; the back and fourth [sic], we've talked to the team to ensure it doesn't happen again</p> </blockquote> <p>He would get demonetized again and appeals would start with the standard support response strategy of saying that they took great care in examining the violation under discussion but, unfortunately, the user clearly violated the policy and therefore nothing can be done:</p> <blockquote> <p>We have reviewed your appeal ... We reviewed your content carefully, and have confirmed that it violates our violent or graphic content policy ... it's our job to make sure that YouTube is a safe place for all</p> </blockquote> <p>These are high-profile examples, but of course having a low profile doesn't stop you from getting banned and getting basically the same canned response, like <a href="https://news.ycombinator.com/item?id=38882891">this HN user who was banned for selling a vacuum in FB marketplace</a>. After a number of appeals, he was told</p> <blockquote> <p>Unfortunately, your account cannot be reinstated due to violating community guidelines. The review is final</p> </blockquote> <p>When paid support is optional, people often say you won't have these problems if you pay for support, but <a href="https://www.youtube.com/watch?v=SMFCLpUuhqY">people who use Google One paid support or Facebook and Instagram's paid creator support generally report that the paid support is no better than the free support</a>. Products that effectively have paid support built-in aren't necessarily better, either. I know people who've gotten the same kind of runaround you get from free Google support with Google Cloud, even when they're working for companies that have an 8 or 9 figure a year Google Cloud spend. In one of many examples, the user could see that Google must've been dropping packets, and Google support kept insisting that the drops were happening in the customer's datacenter despite packet traces showing that this could not possibly be the case. 
The last I heard, they gave up on that one, but sometimes when an issue is a total showstopper, someone will call up a buddy of theirs at Google to get support because the standard support is often completely ineffective. And this isn't unique to Google — at another cloud vendor, a former colleague of mine was in the room for a conversation where a very senior engineer was asked to look into an issue where a customer was complaining that they were seeing 100% of packets get dropped for a few seconds at a time, multiple times an hour. The engineer responded with something like &quot;it's the cloud, they should deal with it&quot;, before being told they couldn't ignore the issue as usual because the issue was coming from [VIP customer] and it was interrupting [one of the world's largest televised sporting events]. That one got fixed, but, odds are, you aren't that important, even if you're paying hundreds of millions a year.</p> <p>And of course this kind of support isn't unique to cloud vendors. For example, there <a href="https://news.ycombinator.com/item?id=34233011">was this time Stripe held $400k from a customer for over a month without explanation</a>, and every request to support got a response that was as ridiculous as the ones we just looked at. The user availed themself of the only reliable Stripe support mechanism, posting to HN and hoping to hit #1 on the front page, which worked, although many commenters made the usual comments like &quot;Flagged because we are seeing a lot of these on HN, and they seem to be attempts to fraudulently manipulate customer support, rather than genuine stories&quot;, with multiple people suggesting or insinuating that the user was doing something illicit or fraudulent, but it turned out that it was an error on Stripe's end, compounded by Stripe's big company support. At one point, the user notes</p> <blockquote> <p>While I was writing my HN post I was also on chat with Stripe for over an hour. No new information. They were basically trying to shut down the chat with me until I sent them the HN story and showed that it was getting some traction. Then they started working on my issue again and trying to communicate with more people</p> </blockquote> <p>And then the issue was fixed the next day.</p> <p>Although, in principle, companies could leverage their economies of scale to deliver more efficient support as they become larger, in practice they tend to use their economies of scale to deliver worse, but cheaper and more profitable, support. For example, on Google Play store approval support, a Google employee notes:</p> <blockquote> <p>a lot of that was outsourced to overseas which resulted in much slower response time. Here stateside we had a lot of metrics in place to fast response. Typically your app would get reviewed the same day. Not sure what it's like now but the managers were incompetent back then even so</p> </blockquote> <p><a href="https://twitter.com/RMac18/status/1382366931307565057">And a former FB support person notes</a>:</p> <blockquote> <p>The big problem here is the division of labor. Those who spend the most time in the queues have the least input as to policy. Analysts are able to raise issues to QAs who can then raise them to Facebook FTEs. It can take months for issues to be addressed, if they are addressed at all. The worst part is that doing the common sense thing and implementing the spirit of the policy, rather than the letter, can have a negative effect on your quality score. 
I often think about how there were several months during my tenure when most photographs of mutilated animals were allowed on a platform without a warning screen due to a carelessly worded policy &quot;clarification&quot; and there was nothing we could do about it.</p> </blockquote> <p>If you've ever wondered why your support person is responding nonsensically, sometimes it's the obvious reason that support has been outsourced to someone making $1/hr (when I looked up the standard rates for one country that a lot of support is outsourced to, a fairly standard rate works out to about $1/hr) who doesn't really speak your language and is reading from a flowchart without understanding anything about the system they're giving support for, but another, less obvious, reason is that the support person may be penalized and eventually fired if they take actions that make sense instead of following the nonsensical flowchart that's in front of them.</p> <p>Coming back to the &quot;they seem to be attempts to fraudulently manipulate customer support, rather than genuine stories&quot; comment, this is a sentiment I've commonly seen expressed by engineers at companies that mete out arbitrary and capricious bans. I'm sympathetic to how people get here. <a href="https://twitter.com/danluu/status/964562384558927872">As I noted before I joined Twitter, commenting on public information</a></p> <blockquote> <p>Turns out twitter is removing ~1M bots/day. Twitter only has ~300M MAU, making the error tolerance v. low. This seems like a really hard problem ... Gmail's spam filter gives me maybe 1 false positive per 1k correctly classified ham ... Regularly wiping the same fraction of real users in a service would be [bad].</p> </blockquote> <p>It is actually true that, if you, an engineer, dig into the support queue at some giant company and look at people appealing bans, almost all of the appeals should be denied. But, my experience from having talked to engineers working on things like anti-fraud systems is that many, and perhaps most, round &quot;almost all&quot; to &quot;all&quot;, which is both quantitatively and qualitatively different. Having engineers who work on these systems believe that &quot;all&quot; and not &quot;almost all&quot; of their decisions are correct results in bad experiences for users.</p> <p>For example, there's a social media company that's famous for incorrectly banning users (at least 10% of people I know have lost an account due to incorrect bans and, if I search for a random person I don't know, there's a good chance I get multiple accounts for them, with some recent one that has a profile that reads &quot;used to be @[some old account]&quot;, with no forward from the old account to the new one because they're now banned). When I ran into a senior engineer from the team that works on this stuff, I asked him why so many legitimate users get banned and he told me something like &quot;that's not a problem, the real problem is that we don't ban enough accounts. Everyone who's banned deserves it, it's not worth listening to appeals or thinking about them&quot;. Of course it's true that <a href="https://twitter.com/danluu/status/964562384558927872">most content on every public platform is bad content, spam, etc.</a>, so if you have any sort of signal at all on whether or not something is bad content, when you look at it, it's likely to be bad content. But this doesn't mean the converse, that almost no users are banned incorrectly, is true. 
And if senior people on the team that classifies which content is bad have the attitude that we shouldn't worry about false positives because almost all flagged content is bad, we'll end up with a system that has a large number of false positives. I later asked around to see what had ever been done to reduce false positives in the fraud detection systems and found out that there was no systematic attempt at tracking false positives at all, no way to count cases where employees filed internal tickets to override bad bans, etc. At the meta level, there was some mechanism to decrease the false negative rate (e.g., someone sees bad content that isn't being caught then adds something to catch more bad content) but, without any sort of tracking of false positives, there was effectively no mechanism to decrease the false positive rate. It's no surprise that this meta system resulted in over 10% of people I know getting incorrect suspensions or bans. And, as Patrick McKenzie says, the optimal rate of false positives isn't zero. But when you have engineers who have the attitude that they've done enough legwork that false positives are impossible, it's basically guaranteed that the false positive rate is higher than optimal. When you combine this with normal big company levels of support, it's a recipe for kafkaesque user experiences.</p> <p>Another time, I commented on how an announced change in Uber's moderation policy seemed likely to result in false positive bans. An Uber TL immediately took me to task, saying that I was making unwarranted assumptions on how banning works, that Uber engineers go to great lengths to make sure that there are no false positive bans, there's extensive review to make sure that bans are valid and, in fact, the false positive banning I was concerned about could never happen. And then I got effectively banned due to a false positive in a fraud detection system. <a href="https://www.theguardian.com/technology/2023/apr/16/stop-or-ill-fire-you-the-driver-who-defied-ubers-automated-hr">I was reminded of that incident when Uber incorrectly banned a driver who had to take them to court to even get information on why he was banned, at which point Uber finally actually looked into it (instead of just responding to appeals with fake messages claiming they'd looked into it)</a>. Afterwards, Uber responded to a press inquiry with</p> <blockquote> <p>We are disappointed that the court did not recognize the robust processes we have in place, including meaningful human review, when making a decision to deactivate a driver’s account due to suspected fraud</p> </blockquote> <p>Of course, in that driver's case, there was no robust process for review, nor was there a robust appeals process for my case. When I contacted support, they didn't really read my message and made some change that broke my account even worse than before. Luckily, I have enough Twitter followers that some Uber engineers saw my tweet about the issue and got me unbanned, but that's not an option that's available to most people, leading to weird stuff like <a href="customer-service/#dentist">this Facebook ad targeted at Google employees, from someone desperately seeking help with their Google account</a>.</p> <p>And even when you know someone on the inside, it's not always easy to get the issue fixed because even if the company's effectiveness doesn't increase as the company gets bigger, the complexity of the systems does increase. 
A nice example of this is <a href="https://twitter.com/GergelyOrosz/status/1469968831372312578">Gergely Orosz's story about when the manager of the payments team left Uber and then got banned from Uber due to an inscrutable ML anti-fraud algorithm deciding that the former manager of the payments team was committing payments fraud</a>. It took six months of trying to get the problem fixed before the issue was even mitigated. And, by the way, they never managed to understand what happened and fix the underlying issue; instead, they added the former manager of the payments team to a special whitelist, not fixing the issue for any other user and, presumably, severely reducing or perhaps even entirely removing payment fraud protections for the former manager's account.</p> <p>No doubt they would've fixed the underlying issue if it were easy to, but as companies scale up, they produce both technical and non-technical bureaucracy that makes systems opaque even to employees.</p> <p>Another example of that is how, at a company that has a ranked social feed, the idea that you could eliminate stuff you didn't want in your ranked feed by adding filters for things like <code>timeline_injection:false</code>, <code>interstitial_ad_op_out</code>, etc., would go viral. The first time this happened, a number of engineers looked into it and thought that the viral tricks didn't work. They weren't 100% sure and were relying on ideas like &quot;no one can recall a system that would do something like this ever being implemented&quot; and &quot;if you search the codebase for these strings, they don't appear&quot;, and &quot;we looked at the systems we think might do this and they don't appear to do this&quot;. There was moderate confidence that this trick didn't work, but no one would state with certainty that the trick didn't work because, as at all large companies, the aggregate behavior of the system is beyond human understanding and even parts that could be understood often aren't because there are other priorities.</p> <p>A few months later, the trick went viral again and people were generally referred to the last investigation when they asked if it was real, except that one person actually tried the trick and reported that it worked. They wrote a slack message about how the trick did work for them, but almost no one noticed that the one person who tried reproducing the trick found that it worked. Later, when the trick would go viral again, people would point to the discussions about how people thought the trick didn't work, and the message noting that it appeared to work (almost certainly not by the mechanism that users think, and instead just because having a long list of filters causes something to time out, or something similar) basically got lost because there's too much information to read all of it.</p> <p>In my social circles, many people have read James Scott's Seeing Like a State, which is subtitled How Certain Schemes to Improve the Human Condition Have Failed. A key concept from the book is &quot;legibility&quot;, what a state can see, and how this distorts what states do. One could easily write a highly analogous book, Seeing Like a Tech Company, about what's illegible to companies that scale up, at least as companies are run today. A simple example of this is that, in many video games, including ones made by game studios that are part of a $3T company, it's easy to get someone suspended or banned by having a bunch of people report the account for bad behavior. 
What's legible to the game company is the rate of reports and what's not legible is the player's actual behavior (it could be legible, but the company chooses not to have enough people or skilled enough people examine actual behavior); <a href="https://news.ycombinator.com/item?id=31581510">and many people have reported similar bannings with social media companies</a>. When it comes to things like anti-fraud systems, what's legible to the company tends to be fairly illegible to humans, even humans working on the anti-fraud systems themselves.</p> <p>Although he wasn't specifically talking about an anti-fraud system, in a Special Master's hearing, Eugene Zarashaw, a director at Facebook, made this comment, which illustrates the illegibility of Facebook's own systems:</p> <blockquote> <p>It would take multiple teams on the ad side to track down exactly the — where the data flows. I would be surprised if there’s even a single person that can answer that narrow question conclusively</p> </blockquote> <p>Facebook was unfairly and mostly ignorantly raked over the coals for this statement (<a href="#appendix-how-much-should-we-trust-journalists-summaries-of-leaked-documents">we'll discuss that in an appendix</a>), but it is generally true that it's <abbr title="depending on what you mean by understand, it could be fair to say that it's impossible">difficult to understand</abbr> how a system the size of Facebook works.</p> <p>In principle, companies could augment the legibility of their inscrutable systems by having decently paid support people look into things that might be edge-case issues with severe consequences, where the system is &quot;misunderstanding&quot; what's happening, but, in practice, companies pay these support people extremely poorly and hire people who really don't understand what's going on, and then give them instructions which ensure that they generally do not succeed at resolving legibility issues.</p> <p>One thing that helps the forces of illegibility win at scale is that, as a highly-paid employee of one of these huge companies, it's easy to look at the millions or billions of people (and bots) out there and think of them all as numbers. As the saying goes, &quot;the death of one man is a tragedy. The death of a million is a statistic&quot; and, as we noted, engineers often turn thoughts like &quot;almost all X is fraud&quot; to &quot;all X is fraud, so we might as well just ban everyone who does X and not look at appeals&quot;. The culture that modern tech companies have, of looking for scalable solutions at all costs, makes this worse than in other industries even at the same scale, and tech companies also have unprecedented scale.</p> <p>For example, in response to someone noting that FB Ad Manager claims you can run an ad with a potential reach of 101M people in the U.S. aged 18-34 when the U.S. census had the total population of people aged 18-34 as 76M, the former PM of the ads targeting team responded with</p> <blockquote> <p>Think at FB scale</p> </blockquote> <p>And explained that you can't expect slice &amp; dice queries to work for something like the 18-34 demographic in the U.S. at &quot;FB scale&quot;. There's a meme at Google that's used ironically in cases like this, where people will say &quot;I can't count that low&quot;. Here's the former PM of FB ads saying, non-ironically, &quot;FB can't count that low&quot; for numbers like 100M. 
Not only does FB not care about any individual user (unless they're famous), this PM claims they can't be bothered to care that groups of 100M people are tracked accurately.</p> <p>Coming back to the consequences of poor support, a common response to hearing about people getting incorrectly banned from one of these huge services is &quot;Good! Why would you want to use Uber/Amazon/whatever anyway? They're terrible and no one should use them&quot;. I disagree with this line of reasoning. For one thing, <a href="https://twitter.com/danluu/status/891508449414197248">why should you decide for that person whether or not they should use a service or what's good for them?</a> For another (and this is a large enough topic that it should be its own post, so I'll just mention it briefly <a href="https://mastodon.social/@whitequark/111280549888138665">and link to this lengthier comment from @whitequark</a>) most services that people write off as unnecessary conveniences that you should just do without are actually serious accessibility issues for quite a few people (in absolute, if not necessarily percentage, terms). When we're talking about small businesses, those people can often switch to another business, but with things like Uber and Amazon, there are sometimes zero or one alternatives that offer similar convenience and when there's one, getting banned due to some random system misfiring can happen with the other service as well. <a href="https://www.reddit.com/r/mildlyinfuriating/comments/186redy/doordash_denied_refund_for_wrong_order_delivered/">For example, in response to many people commenting on how you should just issue a chargeback and accept getting banned from DoorDash when they don't deliver, a disabled user responds</a>:</p> <blockquote> <p>I'm disabled. Don't have a driver's license or a car. There isn't a bus stop near my apartment, I actually take paratransit to get to work, but I have to plan that a day ahead. Uber pulls the same shit, so I have to cycle through Uber, Door dash, and GrubHub based on who has coupons and hasn't stolen my money lately. Not everyone can just go pick something up.</p> </blockquote> <p>Also, when talking about this class of issue, involvement is often not voluntary, <a href="https://news.ycombinator.com/item?id=39059307">such as in the case of this Fujitsu bug that incorrectly put people in prison</a>.</p> <p>On the third issue, the impossibility of getting people to agree on what constitutes spam, fraud, and other disallowed content, <a href="impossible-agree/">we discussed that in detail here</a>. We saw that, even in a trivial case with a single, uncontroversial, simple rule, people can't agree on what's allowed. And, as you add more rules or add topics that are controversial or scale up the number of people, it becomes even harder to agree on what should be allowed.</p> <p>To recap, we looked at three areas where diseconomies of scale make moderation, support, anti-fraud, and anti-spam worse as companies get bigger. The first was that, even in cases where there's broad agreement that something is bad, such as fraud/scam/phishing websites in search results, <a href="seo-spam/">the largest companies with the most sophisticated machine learning can't actually keep up with a single (albeit very skilled) person working on a small search engine</a>. The returns to scammers are much higher if they take on the biggest platforms, resulting in the anti-spam/anti-fraud/etc. 
problem being extremely non-linearly hard.</p> <p>To get an idea of the difference in scale, HN &quot;hellbans&quot; spammers and people who post some kinds of vitriolic comments. Most spammers don't seem to realize they're hellbanned and will keep posting for a while, so if you browse the &quot;newest&quot; (submissions) page while logged in, you'll see a steady stream of automatically killed stories from these hellbanned users. While there are quite a few of them, the percentage is generally well under half. When we looked at a &quot;mid-sized&quot; big tech company like Twitter circa 2017, based on the public numbers, if spam bots were hellbanned instead of removed, spam is so much more prevalent that it would be nearly all you'd see if you were able to see it. And, as big companies go, 2017-Twitter isn't that big. As we also noted, the former PM of FB ads targeting explained that numbers as low as 100M are in the &quot;I can't count that low&quot; range, too small to care about; to him, basically a rounding error. The non-linear difference in difficulty is much worse for a company like FB or Google. The non-linearity of the difficulty of these problems is, apparently, more than a match for whatever ML or AI techniques Zuckerberg and other tech execs want to brag about.</p> <p>In testimony in front of Congress, you'll see execs defend the effectiveness of these systems at scale with comments like &quot;we can identify X with 95% accuracy&quot;, a statement that may technically be correct, but seems designed to deliberately mislead an audience that's presumed to be innumerate. If you use, as a frame of reference, things at a personal scale, 95% might sound quite good. Even for something like HN's scale, 95% accurate spam detection that results in an immediate ban might be sort of alright. Anyway, even if it's not great, people who get incorrectly banned can just email Dan Gackle, who will unban them. As we noted when we looked at the numbers, 95% accurate detection at Twitter's scale would be horrible (and, indeed, the majority of DMs I get are obvious spam). Either you have to back off and only ban users in cases where you're extremely confident, or you ban all your users after not too long and, given how companies like to handle support, appealing means that you'll get a response saying that &quot;your case was carefully reviewed and we have determined that you've violated our policies. This is final&quot;, even for cases where any sort of cursory review would cause a reversal of the ban, like when you ban a user for impersonating themselves. And then at FB's scale, it's even worse and you'll ban all of your users even more quickly, so then you back off and we end up with things like 100k minors a day being exposed to &quot;photos of adult genitalia or other sexually abusive content&quot;.</p> 
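<p>To make that concrete, here's a minimal sketch of the arithmetic, using the rough figures quoted earlier (on the order of 1M bot removals a day and ~300M MAU for 2017-era Twitter) and assuming, since these claims are rarely specific enough to pin down, that 95% accuracy means 95% of removal actions are correct:</p> <pre><code># Illustrative arithmetic only; the inputs are rough figures quoted earlier
# in the post and the interpretation of the accuracy claim is an assumption.
removals_per_day = 1e6      # rough public figure for 2017-era Twitter
accuracy = 0.95             # assume 95% of removal actions are correct
mau = 300e6                 # rough monthly active users at the time

false_positives_per_day = removals_per_day * (1 - accuracy)   # 50,000 a day
false_positives_per_year = false_positives_per_day * 365      # about 18 million
share_of_user_base = false_positives_per_year / mau           # roughly 6%, naively
print(round(false_positives_per_day), round(false_positives_per_year), share_of_user_base)
</code></pre> <p>Even if these figures are off by a factor of a few in either direction (many removed accounts are bots that never counted toward MAU, the same account can be hit repeatedly, and so on), the point stands: an accuracy number that sounds fine at a personal or HN scale still translates into tens of thousands of wrongly affected real users a day at a mid-sized platform's scale.</p> <p>The second area we looked at was support, which tends to get worse as companies get larger. At a high level, it's fair to say that companies don't care to provide decent support (with Amazon being somewhat of an exception here, especially with AWS, but even on the consumer side). Inside the system, there are individuals who care, but if you look at the fraction of resources expended on support vs. growth or even fun/prestige projects, support is an afterthought. 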
Back when DeepMind was training a StarCraft AI, it's plausible that Alphabet was spending more money playing StarCraft than on support agents (and, if not, just throw in one or two more big AI training projects and you'll be there, especially if you include the amortized cost of developing custom hardware, etc.).</p> <p>It's easy to see how little big companies care. All you have to do is contact support and get connected to someone who's paid $1/hr to respond to you in a language they barely know, attempting to help solve a problem they don't understand by walking through some flowchart, or appeal an issue and get told &quot;after careful review, we have determined that you have [done the opposite of what you actually did]&quot;. In some cases, you don't even need to get that far, like when <a href="https://mastodon.social/@alexjsp/109760284680279691">following Instagram's support instructions results in an infinite loop that takes you back where you started</a> and the <a href="https://news.ycombinator.com/item?id=34549132">&quot;click here if this wasn't you&quot; link returns a 404</a>. I've run into an infinite loop like this once, with Verizon, and it persisted for at least six months. I didn't check after that, but I'd bet on it persisting for years. If you had an onboarding or sign-up page that had an issue like this, that would be considered a serious bug that people should prioritize because that impacts growth. But for something like account loss due to scammers taking over accounts, that might get fixed after months or years. Or maybe not.</p> <p>If you ever talk to people who work in support at a company that really cares about support, it's immediately obvious that they operate completely differently from typical big tech company support, in terms of process as well as culture. Another way you can tell that big companies don't care about support is how often big company employees and execs who've never looked into how support is done or could be done will tell you that it's impossible to do better.</p> <p>When you talk to people who work on support at companies that do actually care about this, it's apparent that it can be done much better. While I was writing this post, I actually <abbr title="to be clear, this was first-line support where I talked to normal, non-technical, users, not a role like 'Field Applications Engineer', where the typical user you interact with is an engineer">did support</abbr> at a company that does support decently well (for a tech company, adjusted for size, I'd say they're well above <a href="p95-skill/">99%-ile</a>), including going through the training and onboarding process for support folks. <a href="sounds-easy/">Executing anything well at scale is non-trivial</a>, so I don't mean to downplay how good their support org is, but the most striking thing to me was how much of the effectiveness of the org naturally followed from caring about providing a good support experience for the user. A full discussion of what that means is too long to include here, so we'll look at this in more detail another time, but one example is that, when we look at how big company support responds, it's often designed to discourage the user from responding (&quot;this review is final&quot;) or to justify, putatively to the user, that the company is doing an adequate job (&quot;this was not a purely automated process and each appeal was reviewed by humans in a robust process that ... &quot;). 
This company's training instructs you to do the opposite of the standard big company &quot;please go away&quot;-style and &quot;we did a great job and have a robust process, therefore complaints are invalid&quot;-style responses. For every anti-pattern you commonly see in support, the training tells you to do the opposite and discusses why the anti-pattern results in a bad user experience. Moreover, the culture has deeply absorbed these ideas (or rather, these ideas come out of the culture) and there are processes for <abbr title="this word is used colloquially and not rigorously since you can't really guarantee this at scale">ensuring</abbr> that people really know what it means to provide good support and follow through on it, support folks have ways to directly talk to the developers who are implementing the product, etc.</p> <p>If those people cared about doing good support, they could talk to people who work in support orgs that are good at helping users or even try working in one before explaining how it's impossible to do better, but this generally isn't done. Their company's support org leadership could do this as well, or do what I did and actually directly work in a support role in an effective support org, but this doesn't happen. If you're a cynic, this all makes sense. In the same way that cynics advise junior employees &quot;big company HR isn't there to help you; their job is to protect the company&quot;, a cynic can credibly argue &quot;big company support isn't there to help the user; their job is to <abbr title="for different values of protect the company than HR; to me, this feels more analogous to insurance, where there are people whose job it is to keep costs down. I've had insurance companies deny coverage on things that, by their written policy, should clearly be covered. In one case, I could call a company rep on the phone, who would explain to me why their company was wrong to deny the claim and how I should appeal, but written appeals were handled in writing and always denied. Luckily, when working for a big tech company, you can tell your employer what's happening, who will then tell the insurance company to stop messing with you, much like our big sporting event cloud support story, but for most users of insurance, this isn't an option and their only recourse is to sue, which they generally won't do or will settle for peanuts even if they do sue. Insurance companies know this and routinely deny claims without even looking at them (this has come out in discovery in lawsuits); accounting for the cost of lawsuits, this kind of claim denial is much cheaper than handling claims correctly. Similarly, providing actual support costs much more than not doing so and getting the user to stop pestering the company about how their account is broken saves money, hence standard responses claiming that the review is final, nothing can be done, etc.; anything to reduce the support cost to the company (except actually providing support)">protect the company</abbr>&quot;, so of course big companies don't try to understand how companies that are good at supporting users do support because that's not what big company support is for.</p> <p>The third area we looked at was how it's impossible for people to agree on how a platform should operate and how people's biases mean that people don't understand how difficult a problem this is. 
For Americans, prominent cases of this are the left and right wing conspiracy theories that pop up every time some bug pseudo-randomly causes any kind of service disruption or banning.</p> <p><a href="https://twitter.com/greenberg/status/872115047769788418">In a tweet, Ryan Greenberg joked</a>:</p> <blockquote> <p>Come work at Twitter, where your bugs TODAY can become conspiracy theories of TOMORROW!</p> </blockquote> <p>In my social circles, people like to make fun of all of the absurd right-wing conspiracy theories that get passed around after some bug causes people to incorrectly get banned, causes the site not to load, etc., or even when some new ML feature correctly takes down a huge network of scam/spam bots, which also happens to reduce the follower count of some users. But of course this isn't unique to the right, and left-wing thought leaders and <a href="https://twitter.com/altluu/status/1588573465359319041/">politicians come up with their own conspiracy theories as well</a>.</p> <p>Putting all three of these together, worse detection of issues, worse support, and a harder time reaching agreement on policies, we end up with the situation we noted at the start where, in a poll of my Twitter followers, people who mostly work in tech and are generally fairly technically savvy, only <a href="https://twitter.com/danluu/status/1570604630350106624">2.6% of people thought that the biggest companies were the best at moderation and spam/fraud filtering</a>, so it might seem a bit silly to spend so much time belaboring the point. When you sample the U.S. population at large, a larger fraction of people say they believe in conspiracy theories like vaccines putting a microchip in you or that we never landed on the moon, and I don't spend my time explaining why vaccines do not actually put a microchip in you or why it's reasonable to think that we landed on the moon. One reason it's perhaps not so silly is that I've been watching the &quot;only big companies can handle these issues&quot; rhetoric with concern as it catches on among non-technical people, like regulators, lawmakers, and <a href="https://twitter.com/altluu/status/1484718002046070784">high-ranking government advisors, who often listen to and then regurgitate nonsense</a>. Maybe next time you run into a lay person who tells you that only the largest companies could possibly handle these issues, you can politely point out that there's very strong consensus the other way among tech folks<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">5</a></sup>.</p> <p style="border-width:8px; border-style:solid; border-color:grey; background-color:#F3F3F3; padding: 0.1em;"> <b><a rel="nofollow" href="https://www.propelauth.com/?utm_source=danluu.com"> If you're a founder or early-stage startup looking for an auth solution, PropelAuth is targeting your use case</b></a>. Although they can handle other use cases, they're currently specifically trying to make life easier for pre-launch startups that haven't invested in an auth solution yet. 
Disclaimer: I'm an investor</p> <p><i>Thanks to Gary Bernhardt, Peter Bhat Harkins, Laurence Tratt, Dan Gackle, Sophia Wisdom, David Turner, Yossi Kreinin, Justin Blank, Ben Cox, Horace He, @borzhemsky, Kevin Burke, Bert Muthalaly, Sasuke, anonymous, Zach Manson, Joachim Schipper, Tony D'Souza, and @GL1zdA for comments/corrections/discussion.</i></p> <h2 id="appendix-techniques-that-only-work-at-small-scale">Appendix: techniques that only work at small scale</h2> <p>This post has focused on the disadvantages of bigness, but we can also flip this around and look at the advantages of smallness.</p> <p>As mentioned, the best experiences I've had on platforms are a side effect of doing things that don't scale. One thing that can work well is to have a single person, with a single vision, handling the entire site or, when that's too big, a key feature of the site.</p> <p>I'm on a number of small discords that have good discussion and essentially zero scams, spam, etc. The strategy for this is simple: the owner of the channel reads every message and bans any scammers or spammers who show up. When you get to a bigger site, like lobste.rs, or even bigger like HN, that's too large for someone to read every message (well, this could be done for lobste.rs, but considering that it's a spare-time pursuit for the owner and the volume of messages, it's not reasonable to expect them to read every message in a short timeframe), but there's still a single person who provides the vision for what should happen, even if the sites are large enough that it's not reasonable to literally read every message. The &quot;no vehicles in the park&quot; problem doesn't apply here because a person decides what the policies should be. You might not like those policies, but you're welcome to find another small forum or start your own (and this is actually how lobste.rs got started — under <a href="https://news.ycombinator.com/item?id=4108008">HN's previous moderation regime, which was known for banning people who disagreed</a> with them, <a href="https://jcs.org/2012/06/13/hellbanned_from_hacker_news">Joshua Stein was banned for publicly disagreeing with an HN policy</a>, so Joshua created lobsters, which he eventually handed off to Peter Bhat Harkins).</p> <p><a href="https://metatalk.metafilter.com/20206/WWIC">There's also this story about craigslist in the early days, as it was just getting big enough to have a serious scam and spam problem</a></p> <blockquote> <p>... we were stuck at SFO for something like four hours and getting to spend half a workday sitting next to Craig Newmark was pretty awesome.</p> <p>I'd heard Craig say in interviews that he was basically just &quot;head of customer service&quot; for Craigslist but I always thought that was a throwaway self-deprecating joke. Like if you ran into Larry Page at Google and he claimed to just be the janitor or guy that picks out the free cereal at Google instead of the cofounder. But sitting next to him, I got a whole new appreciation for what he does. He was going through emails in his inbox, then responding to questions in the craigslist forums, and hopping onto his cellphone about once every ten minutes. Calls were quick and to the point &quot;Hi, this is Craig Newmark from craigslist.org. We are having problems with a customer of your ISP and would like to discuss how we can remedy their bad behavior in our real estate forums&quot;. 
He was literally chasing down forum spammers one by one, sometimes taking five minutes per problem, sometimes it seemed to take half an hour to get spammers dealt with. He was totally engrossed in his work, looking up IP addresses, answering questions best he could, and doing the kind of thankless work I'd never seen anyone else do with so much enthusiasm. By the time we got on our flight he had to shut down and it felt like his giant pile of work got slightly smaller but he was looking forward to attacking it again when we landed.</p> </blockquote> <p>At some point, if sites grow, they get big enough that a person can't really own every feature and every moderation action on the site, but sites can still get significant value out of having a single person own something that people would normally think is automated. A famous example of this is how <a href="https://news.ycombinator.com/item?id=36217321">the Digg &quot;algorithm&quot; was basically one person</a>:</p> <blockquote> <p>What made Digg work really was one guy who was a machine. He would vet all the stories, infiltrate all the SEO networks, and basically keep subverting them to keep the Digg front-page usable. Digg had an algorithm, but it was basically just a simple algorithm that helped this one dude 10x his productivity and keep the quality up.</p> <p>Google came to buy Digg, but figured out that really it's just a dude who works 22 hours a day that keeps the quality up, and all that talk of an algorithm was smoke and mirrors to trick the SEO guys into thinking it was something they could game (they could not, which is why front page was so high quality for so many years). Google walked.</p> <p>Then the founders realised if they ever wanted to get any serious money out of this thing, they had to fix that. So they developed &quot;real algorithms&quot; that independently attempted to do what this one dude was doing, to surface good/interesting content.</p> <p>...</p> <p>It was a total shit-show ... The algorithm to figure out what's cool and what isn't wasn't as good as the dude who worked 22 hours a day, and without his very heavy input, it just basically rehashed all the shit that was popular somewhere else a few days earlier ... Instead of taking this massive slap to the face constructively, the founders doubled-down. And now here we are.</p> <p>...</p> <p>Who I am referring to was named Amar (his name is common enough I don't think I'm outing him). He was the SEO whisperer and &quot;algorithm.&quot; He was literally like a spy. He would infiltrate the awful groups trying to game the front page and trick them into giving him enough info that he could identify their campaigns early, and kill them. All the while pretending to be an SEO loser like them.</p> </blockquote> <p><a href="https://mastodon.social/@ajroach42@retro.social/110499388858867441">Etsy supposedly used the same strategy as well</a>.</p> <p>Another class of advantage that small sites have over large ones is that the small site usually doesn't care about being large and can do things that you wouldn't do if you wanted to grow. For example, consider these two comments made in <a href="https://news.ycombinator.com/item?id=39223766">the midst of a large flamewar on HN</a></p> <blockquote> <p>My wife spent years on Twitter embroiled in a very long running and bitter political / rights issue. She was always thoughtful, insightful etc. 
She'd spend 10 minutes rewording a single tweet to make sure it got the real point across in a way that wasn't inflammatory, and that had a good chance of being persuasive. With 5k followers, I think her most popular tweets might get a few hundred likes. The one time she got drunk and angry, she got thousands of supportive reactions, and her followers increased by a large % overnight. And that scared her. She saw the way &quot;the crowd&quot; was pushing her. Rewarding her for the smell of blood in the water.</p> <p>I've turned off both the flags and flamewar detector on this article now, in keeping with the first rule of HN moderation, which is (I'm repeating myself but it's probably worth repeating) that we moderate HN less, not more, when YC or a YC-funded startup is part of a story ... Normally we would never late a ragestorm like this stay on the front page—there's zero intellectual curiosity here, as the comments demonstrate. This kind of thing is obviously off topic for HN: <a href="https://news.ycombinator.com/newsguidelines.html">https://news.ycombinator.com/newsguidelines.html</a>. If it weren't, the site would consist of little else. Equally obvious is that this is why HN users are flagging the story. They're not doing anything different than they normally would.</p> </blockquote> <p>For a social media site, low-quality high-engagement flamebait is one of the main pillars that drive growth. HN, which cares more about discussion quality than growth, tries to detect and suppress it (with exceptions like criticism of HN itself, of YC companies like Stripe, etc., to ensure a lack of bias). Any social media site that aims to grow does the opposite: it implements a ranked feed that puts the content that is most enraging and most engaging in front of the people its algorithms predict will be the most enraged and engaged by it. For example, let's say you're in a country with very high racial/religious/factional tensions, with regular calls for violence, etc. What's the most engaging content? Well, that would be content calling for the death of your enemies, so you get things like a livestream of someone calling for the death of the other faction and then grabbing someone and beating them, shown to a lot of people. After all, what's more engaging than a beatdown of your sworn enemy? A theme of Broken Code is that someone will find some harmful content they want to suppress, but then get overruled because that would reduce engagement and growth. HN has no such goal, so it has no problem suppressing or eliminating content that HN deems to be harmful.</p> <p>Another thing you can do if growth isn't your primary goal is to deliberately make user-signups high friction. HN does a little bit of this by having a &quot;login&quot; link but not a &quot;sign up&quot; link, and sites like lobste.rs and metafilter do even more of this.</p> <h2 id="appendix-theory-vs-practice">Appendix: Theory vs. practice</h2> <p>In the main doc, we noted that big company employees often say that it's impossible to provide better support for theoretical reason X, without ever actually looking into how one provides support or what companies that provide good support do. When the now-$1T companies were the size at which many companies do provide good support, they also did not provide good support, so this doesn't seem to come from size; these huge companies didn't even attempt to provide good support, then or now. 
This theoretical, plausible-sounding reason doesn't really hold up in practice.</p> <p>This is generally the case for theoretical discussions on diseconomies of scale of large tech companies. Another example is an idea mentioned at the start of this doc, that being a larger target has a larger impact than having more sophisticated ML. A standard extension of this idea that I frequently hear is that big companies actually do have the best anti-spam and anti-fraud, but they're also subject to the most sophisticated attacks. I've seen this used as a justification for why big companies seem to have worse anti-spam and anti-fraud than a forum like HN. While it's likely true that big companies are subject to the most sophisticated attacks, if this whole idea held and it were the case that their systems were really good, it would be harder, in absolute terms, to spam or scam people on reddit and Facebook than on HN, but that's not the case at all.</p> <p>If you actually try to spam, it's extremely easy to do so on large platforms and the most obvious things you might try will often work. As an experiment, I made a new reddit account and tried to get nonsense onto the front page and found this completely trivial. Similarly, it's completely trivial to take over someone's Facebook account and post obvious scams for months to years, with extremely obvious markers that they're scams, many people replying in concern that the account has been taken over and is running scams (unlike working in support and spamming reddit, I didn't try taking over people's Facebook accounts, but given people's password practices, it's very easy to take over an account, and given how Facebook responds to these takeovers when a friend's account is taken over, we can see that attacks that do the most naive thing possible, with zero sophistication, are not defeated), etc. In absolute terms, it's actually more difficult to get spammy or scammy content in front of eyeballs on HN than it is on reddit or Facebook.</p> <p>The theoretical reason here is one that would be significant if large companies were even remotely close to doing the kind of job they could do with the resources they have, but we're not even close to being there.</p> <p>To avoid belaboring the point in this already very long document, I've only listed a couple of examples here, but I find this pattern to hold true of almost every counterargument I've heard on this topic. If you actually look into it a bit, these theoretical arguments are classic <a href="cocktail-ideas/">cocktail party ideas that have little to no connection to reality</a>.</p> <p>A meta point here is that you absolutely cannot trust vaguely plausible sounding arguments from people on this since virtually all of them fall apart when examined in practice. It seems quite reasonable to think that a business the size of reddit would have more sophisticated anti-spam systems than HN, which has a single person who both writes the code for the anti-spam systems and does the moderation. But the most naive and simplistic tricks you might use to put content on the front page work on reddit and don't work on HN. I'm not saying you can't defeat HN's system, but doing so would take a little bit of thought, which is not the case for reddit and Facebook. 
And likewise for support, where once you start talking to people about how to run a support org that's good for users, you immediately see that the most obvious things have not been seriously tried by big tech companies.</p> <h2 id="appendix-how-much-should-we-trust-journalists-summaries-of-leaked-documents">Appendix: How much should we trust journalists' summaries of leaked documents?</h2> <p>Overall, very little. As we discussed <a href="cruise-report/">when we looked at the Cruise pedestrian accident report</a>, almost every time I read a journalist's take on something (with rare exceptions like Zeynep), the journalist has a spin they're trying to put on the story and the impression you get from reading the story is quite different from the impression you get if you look at the raw source; <a href="gender-gap/">it's fairly common that there's so much spin that the story says the opposite of what the source docs say</a>. That's one issue.</p> <p>The full topic here is big enough that it deserves its own document, so we'll just look at two examples. The first is one we briefly looked at, when Eugene Zarashaw, a director at Facebook, testified in a Special Master’s Hearing. He said</p> <blockquote> <p>It would take multiple teams on the ad side to track down exactly the — where the data flows. I would be surprised if there’s even a single person that can answer that narrow question conclusively</p> </blockquote> <p>Eugene's testimony resulted in headlines like , &quot;Facebook Has No Idea What Is Going on With Your Data&quot;, &quot;Facebook engineers admit there’s no way to track all the data it collects on you&quot; (with a stock photo of an overwhelmed person in a nest of cables, grabbing their head) and &quot;Facebook Engineers: We Have No Idea Where We Keep All Your Personal Data&quot;, etc.</p> <p>Even without any technical knowledge, any unbiased person can plainly see that these headlines are inaccurate. There's a big difference between it taking work to figure out exactly where all data, direct and derived, for each user exists, and having no idea where the data is. If I Google, logged out with no cookies, <code>Eugene Zarashaw facebook testimony</code>, every single above the fold result I get is misleading, false, clickbait, like the above.</p> <p>For most people with relevant technical knowledge, who understand the kind of systems being discussed, Eugene Zarashaw's quote is not only not egregious, it's mundane, expected, and reasonable.</p> <p>Despite this lengthy disclaimer, there are a few reasons that I feel comfortable citing Jeff Horwitz's Broken Code as well as a few stories that cover similar ground. The first is that, if you delete all of the references to these accounts, the points in this doc don't really change, just like they wouldn't change if you delete 50% of the user stories mentioned here. The second is that, at least for me, the most key part is the attitudes on display and not the specific numbers. I've seen similar attitudes in companies I've worked for and heard about them inside companies where I'm well connected via my friends and I could substitute similar stories from my friends, but it's nice to be able to use already-public sources instead of using anonymized stories from my friends, so the quotes about attitude are really just a stand-in for other stories which I can verify. 
The third reason is a bit too subtle to describe here, so we'll look at that when I expand this disclaimer into a standalone document.</p> <p style="border-width:8px; border-style:solid; border-color:grey; background-color:#F3F3F3; padding: 0.1em;"> <b><a rel="nofollow" href="https://jobs.ashbyhq.com/freshpaint/bfe56523-bff4-4ca3-936b-0ba15fb4e572?utm_source=dl"> If you're looking for work, Freshpaint is hiring (US remote) in engineering, sales, and recruiting</a></b>. Disclaimer: I may be biased since I'm an investor, but they seem to have found product-market fit and are rapidly growing.</p> <h2 id="appendix-erin-kissane-on-meta-in-myanmar">Appendix: Erin Kissane on Meta in Myanmar</h2> <p>Erin starts with</p> <blockquote> <p>But once I started to really dig in, what I learned was so much gnarlier and grosser and more devastating than what I’d assumed. The harms Meta passively and actively fueled destroyed or ended hundreds of thousands of lives that might have been yours or mine, but for accidents of birth. I say “hundreds of thousands” because “millions” sounds unbelievable, but by the end of my research I came to believe that the actual number is very, very large.</p> <p>To make sense of it, I had to try to go back, reset my assumptions, and try build up a detailed, factual understanding of what happened in this one tiny slice of the world’s experience with Meta. The risks and harms in Myanmar—and their connection to Meta’s platform—are meticulously documented. And if you’re willing to spend time in the documents, it’s not that hard to piece together what happened. Even if you never read any further, know this: Facebook played what the lead investigator on the UN Human Rights Council’s Independent International Fact-Finding Mission on Myanmar (hereafter just “the UN Mission”) called a “determining role” in the bloody emergence of what would become the genocide of the Rohingya people in Myanmar.2</p> <p>From far away, I think Meta’s role in the Rohingya crisis can feel blurry and debatable—it was content moderation fuckups, right? In a country they weren’t paying much attention to? Unethical and probably negligent, but come on, what tech company isn’t, at some point?</p> </blockquote> <p>As discussed above, I have not looked into the details enough to determine if the claim that Facebook played a &quot;determining role&quot; in genocide is correct, but at a meta-level (no pun intended), it seems plausible. Every comment I've seen that aims to be a direct refutation of Erin's position is actually pre-refuted by Erin in Erin's text, so it appears that very few of the people who are publicly commenting in disagreement with Erin read the articles before commenting (or they've read them and failed to understand what Erin is saying) and, instead, are disagreeing based on something other than the actual content. <a href="https://twitter.com/danluu/status/1605675278822703104">It reminds me a bit of the responses to David Jackson's proof of the four color theorem. Some people thought it was, finally, a proof, and others thought it wasn't.</a> Something I found interesting at the time was that the people who thought it wasn't a proof had read the paper and thought it seemed flawed, whereas the people who thought it was a proof were going off of signals like David's track record or the prestige of his institution. 
At the time, without having read the paper myself, I guessed (with low confidence) that the proof was incorrect based on the meta-heuristic that thoughts from people who read the paper were stronger evidence than things like prestige. Similarly, I would guess that Erin's summary is at least roughly accurate and that Erin's endorsement of the UN HRC fact-finding mission is correct, although I have lower confidence in this than in my guess about the proof because making a positive claim like this is harder than finding a flaw and the area is one where evaluating a claim is significantly trickier.</p> <p>Unlike with Broken Code, the source documents are available here and it would be possible to retrace Erin's steps, but since there's quite a bit of source material, the claims would need additional reading and analysis to really be convincing, and those claims don't play a determining role in the correctness of this document, I'll leave that for somebody else.</p> <p>On the topic itself, Erin noted that some people at Facebook, when presented with evidence that something bad was happening, laughed it off as they simply couldn't believe that Facebook could be instrumental in something that bad. Ironically, this is fairly similar in tone and content to a lot of the &quot;refutations&quot; of Erin's articles, whose authors appear to have not actually read the articles.</p> <p>The most substantive objections I've seen are around the edges, such as</p> <blockquote> <p>The article claims that &quot;Arturo Bejar&quot; was &quot;head of engineering at Facebook&quot;, which is simply false. He appears to have been a Director, which is a manager title overseeing (typically) less than 100 people. That isn't remotely close to &quot;head of engineering&quot;.</p> </blockquote> <p>What Erin actually said was</p> <blockquote> <p>... Arturo Bejar, one of Facebook’s heads of engineering</p> </blockquote> <p>So the objection is technically incorrect in that it was not said that Arturo Bejar was head of engineering. And, if you read the entire set of articles, you'll see references like &quot;Susan Benesch, head of the Dangerous Speech Project&quot; and &quot;the head of Deloitte in Myanmar&quot;, so it appears that the reason Erin said &quot;one of Facebook’s heads of engineering&quot; is that Erin is using the term head colloquially here (and note that it isn't capitalized, as a title might be), to mean that Arturo was in charge of something.</p> <p>There is a form of the above objection that's technically correct — for an engineer at a big tech company, the term Head of Engineering will generally call to mind an executive who all engineers transitively report into (or, in cases where there are large pillars, perhaps one of a few such people). 
Someone who's fluent in internal tech company lingo would probably not use this phrasing, even when writing for lay people, but this isn't strong evidence of factual errors in the article even if, in an ideal world, journalists would be fluent in the domain-specific connotations of every phrase.</p> <p>The person's objection continues with</p> <blockquote> <p>I point this out because I think it calls into question some of the accuracy of how clearly the problem was communicated to <em>relevant</em> people at Facebook.</p> <p>It isn't enough for someone to tell random engineers or Communications VPs about a complex social problem.</p> </blockquote> <p>On the topic of this post, diseconomies of scale, this objection, if correct, actually supports the post. According to Arturo's LinkedIn, he was &quot;the leader for Integrity and Care Facebook&quot;, and the book Broken Code discusses his role at length, which is very closely related to the topic of Meta in Myanmar. Arturo is not, in fact, one of the &quot;random engineers or Communications VPs&quot; the objection refers to.</p> <p>Anyway, Erin documents that Facebook was repeatedly warned about what was happening, for years. These warnings went well beyond the standard reporting of bad content and fake accounts (although those were also done), and included direct conversations with directors, VPs, and other leaders. These warnings were dismissed and it seems that people thought that their existing content moderation systems were good enough, even in the face of fairly strong evidence that this was not the case.</p> <blockquote> <p>Reuters notes that one of the examples Schissler gives Meta was a Burmese Facebook Page called, “We will genocide all of the Muslims and feed them to the dogs.” 48</p> <p>None of this seems to get through to the Meta employees on the line, who are interested in…cyberbullying. Frenkel and Kang write that the Meta employees on the call “believed that the same set of tools they used to stop a high school senior from intimidating an incoming freshman could be used to stop Buddhist monks in Myanmar.”49</p> <p>Aela Callan later tells Wired that hate speech seemed to be a “low priority” for Facebook, and that the situation in Myanmar, “was seen as a connectivity opportunity rather than a big pressing problem.”50</p> </blockquote> <p>The details make this sound even worse than a small excerpt can convey, so I recommend reading the entire thing, but with respect to the discussion about resources, a key issue is that even after Meta decided to take some kind of action, the result was:</p> <blockquote> <p>As the Burmese civil society people in the private Facebook group finally learn, Facebook has a single Burmese-speaking moderator—a contractor based in Dublin—to review everything that comes in. The Burmese-language reporting tool is, as Htaike Htaike Aung and Victoire Rio put it in their timeline, “a road to nowhere.”</p> </blockquote> <p>Since this was 2014, it's not fair to say that Meta could've spent the $50B metaverse dollars and hired 1.6 million moderators, but in 2014, Meta was still the 4th largest tech company in the world, worth $217B, with a net profit of $3B/yr, so it would've &quot;only&quot; been able to afford something like 100k moderators and support staff if paid at a globally very generous loaded cost of $30k/yr (e.g., <a href="https://jacobin.com/2023/06/meta-is-trying-and-failing-to-crush-unions-in-kenya">Jacobin notes that Meta's Kenyan moderators are paid $2/hr</a> and don't get benefits). 
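To spell out the back-of-envelope arithmetic used here and in the next sentence, here's a quick sketch; the $3B/yr profit, $30k/yr loaded cost, and 0.7% population share are the illustrative figures this post uses, not a claim about Meta's actual budgeting:</p> <pre><code># Back-of-envelope: how many generously paid moderators 2014-era
# Facebook profit could fund, and Myanmar's population-weighted share.
annual_profit = 3_000_000_000        # ~$3B/yr net profit in 2014
loaded_cost_per_moderator = 30_000   # generous global loaded cost, $/yr

total_moderators = annual_profit / loaded_cost_per_moderator      # 100,000
myanmar_population_share = 0.007                                   # ~0.7%
myanmar_moderators = total_moderators * myanmar_population_share   # ~700

print(int(total_moderators), int(myanmar_moderators))  # 100000 700
</code></pre> <p>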
Myanmar's share of the global population was 0.7%, so let's say you consider a developing genocide to be low priority, don't think that additional resources should be deployed to prevent or stop it, and want to allocate a standard moderation share; that gives &quot;only&quot; enough capacity for 700 generously paid moderation and support staff for Myanmar.</p> <p>On the other side of the fence, there actually were 700 people:</p> <blockquote> <p>in the years before the coup, it already had an internal adversary in the military that ran a professionalized, Russia-trained online propaganda and deception operation that maxed out at about 700 people, working in shifts to manipulate the online landscape and shout down opposing points of view. It’s hard to imagine that this force has lessened now that the genocidaires are running the country.</p> </blockquote> <p>These folks didn't have the vaunted technology that Zuckerberg says that smaller companies can't match, but it turns out you don't need billions of dollars of technology when it's 700 on 1 and the 1 is using tools that were developed for a different purpose.</p> <p>As you'd expect if you've ever interacted with the reporting system for a huge tech company, from the outside, nothing people tried worked:</p> <blockquote> <p>They report posts and never hear anything. They report posts that clearly call for violence and eventually hear back that they’re not against Facebook’s Community Standards. This is also true of the Rohingya refugees Amnesty International interviews in Bangladesh</p> </blockquote> <p>In the 40,000 word summary, Erin also digs through whistleblower reports to find things like</p> <blockquote> <p>…we’re deleting less than 5% of all of the hate speech posted to Facebook. This is actually an optimistic estimate—previous (and more rigorous) iterations of this estimation exercise have put it closer to 3%, and on V&amp;I [violence and incitement] we’re deleting somewhere around 0.6%…we miss 95% of violating hate speech.</p> </blockquote> <p>and</p> <blockquote> <p>[W]e do not … have a model that captures even a majority of integrity harms, particularly in sensitive areas … We only take action against approximately 2% of the hate speech on the platform. Recent estimates suggest that unless there is a major change in strategy, it will be very difficult to improve this beyond 10-20% in the short-medium term</p> </blockquote> <p>and</p> <blockquote> <p>While Hate Speech is consistently ranked as one of the top abuse categories in the Afghanistan market, the action rate for Hate Speech is worryingly low at 0.23 per cent.</p> </blockquote> <p>To be clear, I'm not saying that Facebook has a significantly worse rate of catching bad content than other platforms of similar or larger size. 
As we noted above, large tech companies often have fairly high false positive and false negative rates and have employees who dismiss concerns about this, saying that things are fine.</p> <h2 id="appendix-elsewhere">Appendix: elsewhere</h2> <ul> <li>Anna Lowenhaupt Tsing's <a href="https://asletaiwan.org/wp-content/uploads/2021/10/On-nonscalability.pdf">On Nonscalability: The Living World Is Not Amenable to Precision-Nested Scales </a></li> <li><a href="https://www.econtalk.org/glen-weyl-on-antitrust-capitalism-and-radical-reform">Glen Weyl on radical solutions to the concentration of corporate power</a></li> <li><a href="https://thezvi.wordpress.com/2019/05/30/quotes-from-moral-mazes/">Zvi's collection of Quotes from Moral Mazes</a></li> </ul> <h2 id="appendix-moderation-and-filtering-fails">Appendix: Moderation and filtering fails</h2> <p>Since I saw Zuck's statement about how only large companies (and the larger the better) can possibly do good moderation, anti-fraud, anti-spam, etc., I've been collecting links to failures by large companies that I run across when doing normal day-to-day browsing. If I deliberately looked for failures, I'd have a lot more. And, for some reason, some companies don't really trigger my radar for this so, for example, even though I see stories about AirBnB issues all the time, it didn't occur to me to collect them until I started writing this post, so there are only a few AirBnB fails here, even though they'd be up there with Uber in failure count if I actually recorded the links I saw.</p> <p>These are so frequent that, out of eight draft readers, at least two ran into an issue while reading the draft of this doc. Peter Bhat Harkins reported:</p> <blockquote> <p>Well, I received a keychron keyboard a few days ago. I ordered a used K1 v5 (Keychron does small, infrequent production runs so it was out of stock everywhere). I placed the order on KeyChron's official Amazon store, fulfilled by Amazon. After some examination, I've received a v4. It's the previous gen mechanical switch instead of the current optical switch. Someone apparently peeled off the sticker with the model and serial number and one key stabilizer is broken from wear, which strongly implies someone bought a v5 and returned a v4 they already owned. Apparently this is a common scam on Amazon now.</p> </blockquote> <p>In the other case, an anonymous reader created a Gmail account to use as a shared account for them and their partner, so they could get shared emails from local services. I know a number of people who've done this and this usually works fine, but in their case, after they used this email to set up a few services, Google decided that their account was suspicious:</p> <blockquote> <p>Verify your identity</p> <p>We’ve detected unusual activity on the account you’re trying to access. To continue, please follow the instructions below.</p> <p>Provide a phone number to continue. We’ll send a verification code you can use to sign in.</p> </blockquote> <p>Providing the phone number they used to sign up for the account resulted in</p> <blockquote> <p>This phone number has already been used too many times for verification.</p> </blockquote> <p>For whatever reason, even though this number was provided at account creation, using this apparently illegal number didn't result in the account being banned until it had been used for a while and the email address had been used to sign up for some services. 
Luckily, these were local services by small companies, so this issue could be fixed by calling them up. I've seen something similar happen with services that don't require you to provide a phone number on sign-up, but then lock and effectively ban the account unless you provide a phone number later, but I've never seen a case where the provided phone number turned out to not work after a day or two. The message above can be read two ways, the other way being that the phone number was allowed but had just recently been used to receive too many verification codes but, in recent history, the phone number had only once been used to receive a code, and that was the verification code necessary to attach a (required) phone number to the account in the first place.</p> <p>I also had a quality control failure from Amazon, when I ordered a 10 pack of Amazon Basics power strips and the first one I pulled out had its cable covered in solder. I wonder what sort of process could leave solder, likely lead-based solder (although I didn't test it) all over the outside of one of these and wonder if I need to wash every Amazon Basics electronics item I get if I don't want lead dust getting all over my apartment. And, of course, since this is constant, I had many spam emails get through Gmail's spam filter and hit my inbox, and multiple ham emails get filtered into spam, including the classic case where I emailed someone and their reply to me went to spam; from having talked to them about it previously, I have no doubt that most of my draft readers who use Gmail also had something similar happen to them and that this is so common they didn't even find it worth remarking on.</p> <p>Anyway, below, in a few cases, I've mentioned when commenters blame the user even though the issue is clearly not the user's fault. I haven't done this even close to exhaustively, so the lack of such a comment from me shouldn't be read as the lack of the standard &quot;the user must be at fault&quot; response from people.</p> <h3 id="google">Google</h3> <ul> <li><a href="https://twitter.com/danluu/status/1308215389344600066">&quot;I had to get the NY attorney general to write them a letter before they would actually respond to my support requests so that I could properly file my taxes&quot;</a></li> <li><a href="https://www.theverge.com/2018/1/12/16882408/google-racist-gorillas-photo-recognition-algorithm-ai">Google photo search for gorilla returns photos of black people</a>, fixed after Twitter thread about this goes viral; 3 years later, there are stories in the press about how Google fixed this by blocking search results for the terms &quot;gorilla&quot;, &quot;chimp&quot;, &quot;chimpanzee&quot;, and &quot;monkey&quot; and has not unblocked the terms <ul> <li>On 2024-01-06, I tried uploading a photo of a gorilla and searching for gorilla, which returned no results both immediately after the upload as well as a few weeks later, so this still appears to be blocked?</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35521345">Google suspends a YouTuber for impersonating themselves</a>; on appeal YouTube says &quot;unfortunate, there's not more we can do on our end. your account suspension &amp; appeal were very carefully reviewed &amp; the decision is final ... we really appreciate your understanding&quot;. 
<ul> <li>Channel restored after viral Twitter thread makes it to the front page of HN.</li> </ul></li> <li><a href="https://twitter.com/danluu/status/800777306612449280">Two different users report having their account locked out after moving; no recovery of account</a></li> <li><a href="https://twitter.com/danluu/status/799102941923667968">Google closed the accounts of everyone who bought a phone and then sold it to a particular person who was buying phones, resulting in emails to their email address getting bounced, inability to auth to anything using Google sign-in, etc.</a>; at least one user whose account was a recovery account for someone who bought and sold a phone also had their accounted closed; Dans Deals wrote this up and people's accounts were reinstated after the story went viral</li> <li><a href="https://nitter.net/JustJake/status/1667478906591666176">Google Cloud reduces quota for user, causing an incident, and then won't increase it again</a> <ul> <li>User tries to find out what's going on and has this discussion: <ul> <li><b>GCP support</b>: You exceeded the rate limit</li> <li><b>User</b>: We did 5000/10min. The quota was approved at 18k/min</li> <li><b>GCP support</b>: That's not the rate limit</li> <li><b>User</b>: What's the rate limit</li> <li><b>GCP support</b>: Not sure have to check with that team</li> </ul></li> <li>So it seems like GCP added some kind of internal rate limiting that's stricter than the user's approved quota?</li> <li>A commenter responds with &quot;if you don’t buy support from GCP <em>you have no support</em>.&quot; and other users note that paying for support can also give you no support</li> </ul></li> <li><a href="https://taxpolicy.org.uk/2024/02/17/the-invisible-campaign-to-censor-the-internet/">Google accepts fake DMCA takedown requests even in cases that are very obviously fake</a> <ul> <li>An official Google comment on this is the standard response that there are robust processes for this &quot;We have robust tools and processes in place to fight fraudulent takedown attempts, and we use a combination of automated and human review to detect signals of abuse – including tactics that are well-known to us like backdating. We provide extensive transparency and submit notices to Lumen about removal requests to hold requesters accountable. Sites can file counter notifications for us to re-review if they believe content has been removed from our results in error. We track networks of abuse and apply extra scrutiny to removal requests where appropriate, and we’ve taken legal action to fight bad actors abusing the DMCA&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33737577">Small business app creator has everything shut down pending &quot;verification&quot; of Google Pay</a> <ul> <li>Support did nothing and GCP refused to look into it until this story hit #1 on HN, at which point someone looked into it and fixed it</li> </ul></li> <li><a href="https://grist.org/technology/right-to-repair-new-york-hochul-big-tech-lobbying-law/">Lobbying group representing Google, Apple, etc., is able to insert the language they want directly into a right to repair bill</a>, excluding many devices from the actual right to repair. <ul> <li>&quot;“We had every environmental group walking supporting this bill,” Fahy told Grist. 
“What hurt this bill is Big Tech was opposed to it.”&quot;</li> </ul></li> <li><a href="https://nitter.net/emilyldolson/status/1485434187968614411">File containing a single line with &quot;1&quot; in it restricted on Google Drive due to copyright infringement; appeal denied</a> <ul> <li>HN readers play around and find that files containing just &quot;0&quot; also get flagged for copyright violation</li> <li><a href="https://news.ycombinator.com/item?id=30060405">issue fixed after viral Twitter thread</a></li> </ul></li> <li>In 2016, Fark has ads disabled when a photograph of a clothed adult posted in 2010 is incorrectly flagged as child porn; appeals process takes 5 weeks <ul> <li>Fark notes that they had similar problems in 2013 because an image was flagged as showing too much skin</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32713375">Pixel 6 freezes when calling emergency services</a> <ul> <li><a href="https://www.reddit.com/r/GooglePixel/comments/rfld6m/pixel_prevented_me_from_calling_emergency/">a user notes that they reported the issue almost 4 years before this complaint on an earlier Pixel and the issue was &quot;escalated&quot; but was still an issue ~8 months before the previous complaint</a></li> <li><a href="https://www.reddit.com/r/GooglePixel/comments/r4xz1f/comment/hnrvsr1/?utm_source=share&amp;utm_medium=web2x&amp;context=3">A Google official account responded that the freeze was due to Microsoft Teams, but the user notes they've never used or even installed Microsoft Teams (there was an actual issue where Teams would block emergency calls, but that was not this user's issue)</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32560361">Account locked and information sent to SFPD after father takes images of son's groin to send to doctor, causing an SFPD investigation; SFPD cleared the father of any wrongdoing, but Google &quot;stands by its decision&quot;, doesn't unlock the account</a> <ul> <li>Google spokesperson says &quot;We follow US law in defining what constitutes CSAM and use a combination of hash matching technology and artificial intelligence to identify it and remove it from our platforms,&quot;<br /></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32547912">Google cloud suspends corporate account, causing outage; there was a billing bug and the owner of the account paid and was assured that their account wouldn't be suspended due to the bug, but that was false and the account got suspended anyway</a> <ul> <li><a href="https://news.ycombinator.com/item?id=32551587">HN commenter suggests that &quot;engineers that lack business experience&quot; reach out to their account managers once they have significant spend; multiple people respond and say that they've done this and it didn't help at all</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33737793">Company locked out of their own domain on Google Workspaces; support refused to fix this</a></li> <li><a href="https://news.ycombinator.com/item?id=32548808">Google cloud account suspended because someone stole the CC numbers for the corporate card and made a fraudulent adwords charge</a></li> <li><a href="https://nitter.net/FordFischer/status/1136334778670518273">Journalist's YouTube account incorrectly demonetized</a> <ul> <li>fixed after 7 months of appealing and a viral Twitter thread</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32237445">Ads account suspended; an educated guess is that some ML fraud signals plus 
using a Brex card led to the suspension</a> <ul> <li>card works when paying for many other Google services</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32238158">Person's credit card stops working with Google accounts after using it to pay on multiple accounts</a> <ul> <li>guessed to be due to an incorrect anti-fraud check</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32238092">Ads account suspended for &quot;suspicious payments&quot; even though the same card is used for many other Google payments, which are not suspended</a> <ul> <li>after multiple appeals that fail, the former Google employee talks to internal contacts to get escalations, which also fail and the ads account stays suspended</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32239118">Google Play account banned for no known reason</a> <ul> <li>the link Google provides to file the appeal can't be access with a banned account</li> <li>the user had two apps using one API, so it counted as two separate violations at once, so the account was banned for &quot;multiple violations&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32238717">Google ads account for a small non-profit banned due to &quot;unpaid balance&quot;</a> <ul> <li>Balance reads $0.00 but appealing ban fails</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32238041">Google ads account banned after account automatically switched to Japanese and then payment is made with an American card</a></li> <li><a href="https://news.ycombinator.com/item?id=32084980">Google sheet with public election information incorrectly removed for &quot;phishing&quot;</a> <ul> <li>restored after viral HN thread</li> </ul></li> <li><a href="https://nitter.net/miguelytob/status/1315749803041619981">User account disabled and photos, etc., lost with no information on why and no information for why the appeal was rejected</a> <ul> <li>ex-Google engineer unable to escalate to anyone who can restore account<br /></li> </ul></li> <li><a href="https://nitter.net/GarethEvansYT/status/1308795658162319361">10-year old YouTube channel with 120M views scheduled for deletion due copyright claims (no information provided to channel creator about what the copyright infringement was)</a> <ul> <li>channel eventually saved after Twitter thread went viral</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31432334">FairEmail and Netguard app developer removes apps after giving up on fight with Google over whether or not FairEmail is Spyware</a> <ul> <li>app later restored sometime after viral HN thread</li> </ul></li> <li><a href="https://nitter.net/hermux/status/1371383715381805061">App banned from Play store because a button says &quot;Report User&quot; and not &quot;Report&quot;</a></li> <li><a href="https://news.ycombinator.com/item?id=33739658">User gets banned from GCP for running the code on Google's own GCP tutorials</a></li> <li><a href="https://news.ycombinator.com/item?id=30912187">Youtube comment anti-spam considered insufficient</a>, so a user creates their own YT anti spam</li> <li><a href="https://nitter.net/garybernhardt/status/1336428596265504768">Search for product reviews generally returns SEO linkfarm spam and not useful results</a> <ul> <li>See also, <a href="seo-spam/">my post on the same topic</a></li> </ul></li> <li><a href="https://nitter.net/garybernhardt/status/1359925407819198473">Google account with thousands of dollars of apps banned from Google with no information on 
what happened and appeals rejected</a> <ul> <li>account eventually restored after viral Twitter thread</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=28443244">Linux Experiments Youtube Channel deleted with no reason given</a> <ul> <li><a href="https://news.ycombinator.com/item?id=28448329">channel restored shortly after viral Twitter and HN threads</a></li> </ul></li> <li><a href="https://www.reddit.com/r/GooglePixel/comments/18wtcq6/pixel_7_pro_warranty_replacement_returned_with/">Warranty replacement Google Pixel 7 Pro is carrier locked to the wrong carrier and, even though user is in Australia, the phone is locked to a U.S. carrier</a> <ul> <li>User has gone to Google support 8 times over 1 month and Google support has incorrectly told user 8 times that the phone is unlocked, so user has had no usable phone for 1 month; the carrier the phone is locked to agrees that the phone is incorrectly carrier locked, but they can't do anything about it since the original purchaser of the phone would have to call the carrier, but apparently the warranty replacement is a locked, used, phone</li> <li>Possibly due to the reddit thread, Google support agrees to swap user's phone, but support continues to insist that the phone is not carrier locked</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38806650">Malware uses Google OAuth to hijack accounts</a> <ul> <li>Google claims they've mitigated this for all accounts that were compromised, which could be true</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33737772">GCP account suspended for no discernable reason after years of use</a> <ul> <li>Support was useless, but since the user used to work at Google, they emailed a former co-worker who sent an internal email, which caused the issue to get fixed immediately</li> </ul></li> <li><a href="https://www.reddit.com/r/OutOfTheLoop/comments/18tz779/whats_going_on_with_the_google_reviews_for/">Obviously fake Google reviews for movie not removed for quite some time</a> (obviously fake because many reviews copy+paste the exact same text)</li> <li><a href="https://news.ycombinator.com/item?id=38772187">Google doesn't detect obviously fake restaurant reviews</a> <ul> <li>I've noticed this as well locally — a new restaurant will have 100+ 5 star reviews, almost all of which look extremely fake; these reviews generally don't get removed, even years later</li> </ul></li> <li><a href="https://danfitdegree.hashnode.dev/nothing-has-ever-angered-me-more-than-the-google-play-team">Owner and developer at SaaS app studio 7 out of 100 apps (that use the same code) start getting rejected from app store</a> <ul> <li>The claimed reason is that the apps allow user generated content (UGC) and therefore need a way to block and report the content, but the apps already have this</li> <li>The developer keeps emailing support, explaining that they already have this and support keeps responding with nonsense like &quot;We confirm that your app ... does not contain functionality to report objectionable content ... For more information or a refresher, we strongly recommend that you review our e-learning course on UGC before resubmission.&quot;</li> <li>All attempts to escalate were also rejected, e.g., &quot;Can you escalate this?&quot; was responded to with &quot;Unfortunately, we do not handle this kind of concern. You may continue to communicate with the appropriate team for further assistance in resolving your issue. 
Please understand that I am not part of the review team so I'm not able to give further information about your concern. I again apologize for the inconvenience.&quot; and then &quot;As much as I'd like to help, I'm not able to assist you further. If you don't have any other concerns, I will now have to end our chat to assist other developers. I apologize and thank you for understanding. Have a great day. Bye!&quot;</li> <li><a href="https://news.ycombinator.com/item?id=33632886">Multiple developers suggest that instead of interacting with Google support as if anyone actually pays attention or cares, you should re-submit your app with some token change, such as incrementing an internal build number</a>; because Google's review process is nonsense, even serious concerns can be bypassed this way. The idea is that it's a mistake to think that the content of their messages makes any sense at all and that you're dealing with anything resembling a rational entity (<a href="https://news.ycombinator.com/item?id=33636755">see also</a>.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38649931">Google groups is a massive source of USENET spam</a></li> <li><a href="https://news.ycombinator.com/item?id=38649772">Google groups is a massive source of USENET spam</a></li> <li><a href="https://news.ycombinator.com/item?id=38649630">Google groups is a massive source of USENET spam</a></li> <li><a href="https://news.ycombinator.com/item?id=38650333">Google groups is a massive source of email spam; a Google employee put information about this into a ticket, which did not fix the issue, nor does setting &quot;can't add me to groups&quot;</a></li> <li><a href="https://mas.to/@kissane/111575585854309686">Google locks user out by ignoring authenticated phone number change and only sending auth text to old number</a></li> <li>I had an issue related to the above, where I was once locked out of Google accounts while traveling because I only took my code generator and left my 2FA tokens at home; this was in the relatively early days of 2FA tokens and I added the tokens to reduce the odds that I would be locked out, because the documentation indicated that I would need any of my 2FA methods to be available to not get locked out; in fact, this is false, and Google will sometimes only let you authenticate with specific methods, so adding more methods actually increases the chances you'll get locked out if your concern is that you may lose a method and then lose access to your account</li> <li><a href="https://www.techdirt.com/2023/12/12/google-promises-unlimited-cloud-storage-then-cancels-plan-then-tells-journalist-his-lifes-work-will-be-deleted-without-enough-time-to-transfer-the-data/">Google allows user to pay for plan with unlimited storage, cancels unlimited storage plan, and then deletes user's data</a> <ul> <li>Many HN commenters on the story tell the user they should've had other backups, apparently not reading the story, which notes that the user concurrently had a government agency take all of their hard drives</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38628019">Google closes company's Google Cloud account over 3 cent billing error, plus some other stories</a></li> <li><a href="https://old.reddit.com/r/youtube/comments/18gjiqy/youtube_doesnt_want_to_take_down_scam_ads/">YouTube doesn't take down obvious scam ads when reported, responding with &quot;We decided not to take this ad down. 
We found that the ad doesn’t go against Google’s policies&quot;</a></li> <li><a href="https://old.reddit.com/r/youtube/comments/18gjiqy/youtube_doesnt_want_to_take_down_scam_ads/kd1bb0c/">YouTube doesn't take down obvious scam ads</a></li> <li><a href="https://www.reddit.com/r/ContraPoints/comments/wqzvp1/my_matt_walsh_video_was_taken_down_due_to/">Incorrect YouTube copyright takedown</a></li> <li><a href="https://news.ycombinator.com/item?id=27999813">YouTube copyright claim for sound of typing on keyboard</a>; fixed after Twitter thread goes viral</li> <li><a href="https://news.ycombinator.com/item?id=29816310">Another YouTube copyright claim for sound of typing on keyboard</a>; again fixed after Twitter thread goes viral</li> <li><a href="https://news.ycombinator.com/item?id=29816310">User puts free music they made on YouTube, allowing other people to remix it; someone takes YouTube ownership of the music</a>, fixed after user, one of the biggest YouTubers of all time, creates a video complaining about this</li> <li><a href="https://news.ycombinator.com/item?id=33633339">Developer's app removed from app store for no discernible reason (allegedly for &quot;user privacy&quot;) and then restored for no discernable reason </a></li> <li><a href="https://news.ycombinator.com/item?id=28000791">YouTube copyright claim for white noise</a></li> <li><a href="https://news.ycombinator.com/item?id=38611919">YouTube refuses to take down obvious scam ad</a></li> <li><a href="https://news.ycombinator.com/item?id=38612261">YouTube refuses to take down scam ads for fake medical treatments</a></li> <li><a href="https://news.ycombinator.com/item?id=38611664">YouTube refuses to take down scam ads</a></li> <li><a href="https://news.ycombinator.com/item?id=38614652">Google doesn't take down obvious scam ads with fake download buttons</a> <ul> <li>Mitigated on user's site by hiring a firm to block these ads post-auction?</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38612630">YouTube refuses to take down fraudulent ad after reporting</a></li> <li><a href="https://news.ycombinator.com/item?id=38614011">Personally reporting scam ads to an employee at Google who works in the area causes ads to get taken down for a day or two, but they return shortly afterwards</a></li> <li><a href="https://news.ycombinator.com/item?id=38612058">Google refuses to take down obvious scam ads after reporting</a></li> <li><a href="https://news.ycombinator.com/item?id=38614409">Google refuses to take down obvious scam ad, responding with &quot;We decided not to take this ad down. 
We found that the ad doesn’t go against Google’s policies, which prohibit certain content and practices that we believe to be harmful to users and the overall online ecosystem.&quot;</a></li> <li><a href="https://news.ycombinator.com/item?id=38611919">YouTube refuses to take down obvious real estate scam ad using Wayne Gretzky</a>, saying the ad doesn't violate any policy</li> <li><a href="https://mastodon.social/@gamingonlinux/111482420414024520">Straightforward SEO spam clone of competitor's website takes their traffic away</a></li> <li><a href="https://news.ycombinator.com/item?id=36273159">User had a negotiated limit of 300 concurrent BigQuery queries and then Google decided to reduce this to 100 because Google rolled out a feature that Google PMs and/or engineers believed was worth 3x in concurrent queries; user notes that this feature doesn't help them and that their query limit is now too low; talking to support apparently didn't work</a></li> <li><a href="https://news.ycombinator.com/item?id=36273877">User keeps having their tiny GCP instance shut down because Google incorrectly and nonsensically detects crypto mining on their tiny instance</a></li> <li><a href="https://news.ycombinator.com/item?id=36276292">User has limit on IPs imposed on them and the standard process for requesting IPs returned &quot;Based on your service usage history you are not eligible for quota increase at this time&quot;; all attempts to fix this via support failed</a></li> <li><a href="https://www.reddit.com/r/vancouver/comments/17ud65f/hikers_getting_lost_on_the_north_shore_using/">Google Maps gives bad directions to hikers who get lost</a></li> <li><a href="https://www.reddit.com/r/vancouver/comments/17op12f/search_and_rescue_team_warns_against_google_maps/">Search and rescue teams warn people against use of Google Maps</a></li> <li><a href="https://mastodon.social/@danluu/109679379784923986">Google's suggested American and British pronunciations of numpy</a></li> <li><a href="https://twitter.com/danluu/status/1172676081628766208">CEO of Google personally makes sure that a recruiter who accidentally violated Google's wage fixing agreement with Apple is fired and then apologizes to the CEO of Apple for the error</a></li> <li><a href="https://news.ycombinator.com/item?id=33635273">Developer's app rejected from app store and developer given the runaround for months</a> <ul> <li>They keep getting support people telling them that their app doesn't do X, so they send instructions on how to see that the app does do X; their analytics show that support never even attempted to run the instructions and just kept telling them that their app didn't do X</li> </ul></li> <li><a href="https://twitter.com/danluu/status/1147202351398133760">One of many examples of Google not fixing Maps errors when reported, resulting in people putting up a sign telling users to ignore Google Maps directions</a> <ul> <li><a href="https://twitter.com/danluu/status/1147236977885904896">Some more examples here</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38145054">SEO spam of obituaries creates cottage industry of obituary pirates</a></li> <li><a href="https://news.ycombinator.com/item?id=33636097">Malware app stays up in app store for months after app is reported to be malware</a> <ul> <li>The app now seems to be gone, but archive.org indicates that the app was up for at least six months after this person noted that they reported this malware which owned their parents</li> </ul></li> <li><a 
href="https://twitter.com/danluu/status/1285408780688080896">User reports Google's accessible audio captchas only let you do 2-3 before banning you and making you do image-based captchas, making Google sites and services inaccessible to some blind people</a></li> <li><a href="https://mastodon.social/@varx@infosec.exchange/111349057429167617">User gets impossible versions of Google's ReCaptcha, making all sites that use ReCaptcha inaccessible; user is unable to cancel paid services that are behind ReCaptcha and is forced to issue chargebacks to stop payment to now-impossible to access services</a></li> <li><a href="https://news.ycombinator.com/item?id=38038713">User can't download India-specific apps while in India because Google only lets you change region once a year</a></li> <li><a href="https://nitter.net/Plamen__Andonov/status/1687524213786103815">3 year old YouTube channel with 24k subs, 100 videos, and 400 streams deleted, allegedly for saying &quot;Don't hold your own [bitcoin] keys&quot;</a>, which was apparently flagged as promoting illegal activity <ul> <li>YouTube responds with &quot;we've forwarded this info to the relevant team and confirmed that the channel will remain suspended for Harmful or dangerous content policies&quot; and links to a document; the user asks what content of theirs violates the policies and why, if the document says that you get 3 strikes your channel is terminated, the account was banned without getting 3 strikes; this question gets no response</li> </ul></li> <li><a href="https://www.reddit.com/r/tahoe/comments/11b02ic/google_maps_is_basically_telling_you_to_take/">Snow closure of highway causes Google Maps to route people to unplowed forest service road with 10 feet of snow</a></li> <li><a href="https://news.ycombinator.com/item?id=33636404">Google play rejects developer's app for nonsense reasons, so they keep resubmitting it until the app doesn't get rejected</a></li> <li><a href="https://www.reddit.com/r/vancouver/comments/qv7jh3/forest_service_roads_access_between_lillooet_and/">Washed out highways due to flooding causes Google Maps to route people through forest service roads that are in even worse condition</a></li> <li><a href="https://www.reddit.com/r/Kamloops/comments/13tcsyp/google_maps_has_problems_here/">Google routes people onto forest service roads that need an offroad vehicle to drive</a>; users note that they've reported this, which does nothing</li> <li><a href="https://news.ycombinator.com/item?id=38150657">Google captchas assume you know what various American things are regardless of where you are in the world</a></li> <li><a href="https://cofense.com/blog/google-amp-the-newest-of-evasive-phishing-tactic/">Google AMP allows phishing campaigns to show up with trusted URLs</a> <ul> <li>People warned Google engineers that this would happen and that there were other issues with AMP, but <a href="https://twitter.com/danluu/status/891508449414197248">the response from Google was that if you think that AMP is causing you problems, you're wrong and the benefit you've received from AMP is larger than the problems it's causing you</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36733735">User reports that chrome extension, after getting acquired, appears to steal credit card numbers</a> and reviews indicate that it now injects ads and sometimes (often?) 
doesn't work <ul> <li>6 months ago, user tried to get the extension taken down, but this seems to have failed (the Firefox extension is also still available)</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36406015">User has their Google account banned after updating their credit card to replace expiring credit card with new credit card</a> (both credit cards were from the same issuer, had the same billing address, etc.)</li> <li><a href="https://news.ycombinator.com/item?id=32211417">Reporting a spam YouTube comment does nothing</a></li> <li><a href="https://www.bbc.com/news/technology-56886957">BBC reports bad ads to Google and Google claims to have fixed the issue with an ML system, but follow-up searches from the BBC indicate that the issue isn't fixed at all</a></li> <li><a href="https://news.ycombinator.com/item?id=36405519">User signs up for AdSense and sells thousands of dollars of ads that Google then doesn't allow the user to cash out</a> <ul> <li>This is a common story that I've seen hundreds of times. Unsurprisingly, multiple people respond and say the same thing happened to them and that there's no recourse when this happens.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36335975">User has their Google (Gmail) account locked for no discernable reason; account recovery process and all appeals do nothing</a> <ul> <li>For unknown reasons, after two years, the account recovery process works and the account is recovered</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36336256">User has their Google Pay account locked for &quot;fraud&quot;; there's a form you're supposed to fill out to get them to investigate, which did nothing three times</a> <ul> <li>User had their phone through Google Fi, email through Gmail, DNS via Google, etc., all of which stopped working</li> <li>A couple years later, their accounts started working again for no discernable reason</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36336239">User gets locked out of Gmail despite having correct password and access to the recovery email (Gmail tells user their login was suspicious and refuses to let them log in)</a> <ul> <li>I've had this happen to me as well when I had my correct password as well as a 2FA device; luckily, things started working again later</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36337103">User can't get data out of Google after Google adds limit on how much data account can have</a></li> <li><a href="https://mastodon.social/@HalvarFlake/110531473438621147">User notes that they're only able to get support from Google because they used to work there and know people who can bypass the normal support process</a></li> <li><a href="https://news.ycombinator.com/item?id=36255243">Google takes down developer's Android app, saying that it's a clone of an iOS app; app was making $10k/mo</a> <ul> <li>Developer finds out that the app Google thinks they're cloning is their own iOS app</li> <li>Developer is able to get unbanned, but revenue never recovers and settles down to $1k/mo. 
Developer stops developing Android apps</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36195901">User finds that if they use &quot;site:&quot; &quot;wrong&quot;, Google puts them into CAPTCHA hell</a> <ul> <li><a href="https://news.ycombinator.com/item?id=36197882">Another user notes that this happens to them with other query modifiers</a></li> </ul></li> <li><a href="https://palant.info/2023/05/31/more-malicious-extensions-in-chrome-web-store/">Reporting malware Chrome extensions doesn't get them taken down</a>, although some do end up getting taken down after a blog post on this goes viral</li> <li><a href="https://news.ycombinator.com/item?id=36124415">User accidentally gains admin privileges to many different companies' Google Cloud accounts and can't raise any kind of support ticket to get anyone to look at the problem</a> <ul> <li><a href="https://news.ycombinator.com/item?id=36125825">Multiple people respond and tell stories about how bad Google's paid support is compared to AWS support</a></li> </ul></li> <li><a href="https://lauren.vortex.com/2023/05/17/google-account-recovery-failure-sad">15 year old Gmail account lost with no recovery possible</a> <ul> <li>Someone who helps many people with recovery says &quot;they've all basically hit the brick wall of Google suggesting that at their scale, nothing can be done about such 'edge' cases&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36025239">Google account lost despite having proper auth because Google deems login attempts too suspicious</a></li> <li><a href="https://news.ycombinator.com/item?id=36027316">Google account lost despite having proper auth and access to backup account because Google deems login attempts too suspicious</a></li> <li><a href="https://news.ycombinator.com/item?id=36026759">Google account lost despite having proper auth because Google deems login to be too suspicious</a></li> <li><a href="https://news.ycombinator.com/item?id=36027151">Google account lost despite having proper auth and TOTP because Google deems login to be too suspicious</a></li> <li><a href="https://news.ycombinator.com/item?id=36030983">Google account lost despite proper auth because Google deems login to be too suspicious</a> <ul> <li>Person notes that they can log in when they travel back to the city they used to live in, but they can't log in where they moved to</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36026380">Google account lost despite proper auth for no known reason</a> <ul> <li>Account login restored for no known reason a few months later</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35749825">User tries to log into Gmail account and gets ~20 consecutive security challenges</a>, after which Gmail returns &quot;You cannot log in at this time&quot;, so their account appears to be lost</li> <li><a href="https://www.reddit.com/r/AskReddit/comments/12rt24x/comment/jgya3dn/">Google changes terms of service and reduces user's quota from 2TB to 15GB</a>, user is unable to find any way to talk to a human about this and is forced to pay for a more expensive plan to not lose their data <ul> <li><a href="https://news.ycombinator.com/item?id=35521606">YouTube account with single video and no comments banned for seemingly no reason</a>, support requests do nothing</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35521586">Huge YouTube channel shut down</a> <ul> <li>Someone defends this as the correct action because &quot;Their account 
got session jacked and taken over by a crypto scamming farm. Google was in the right to shut down the account until it could get resolved.&quot; <ul> <li>Someone who is actually familiar with what's going on notes that this is nonsense, &quot;Their account was shut down days <em>after</em> the crypto scam issue was resolved. They discussed it on the WAN show from the week before last.&quot;</li> </ul></li> </ul></li> <li><a href="https://issuetracker.google.com/issues/268606830?pli=1">Many users run into serious problems after Google decides to impose 5M file limit on Google Drive without warning</a> <ul> <li>Google support replies with &quot;I reviewed your case here on our end including the interactions with the previous representatives. This case has already been endorsed to one of our account specialists. What they found out is that the error is working as intended&quot;</li> <li><a href="https://news.ycombinator.com/item?id=35331890">On HN, the top comment is a Google engineer responding to say</a> &quot;I don't personally think that there are reasonable use-cases for human users with 5 million files. There may be some specialist software that produces data sets that a human might want to back up to Google Drive, but that software is unlikely to run happily on drive streamed files so even those would be unlikely to be stored directly on Drive.&quot; and multiple people agree, despite the issue itself being full of people describing how they're running into issues <ul> <li><a href="https://news.ycombinator.com/item?id=35396184">Someone notes that Google Drive advertises storage tiers up to 30TB</a>, so 5M files at 30TB would average 6MB per file, not really a weird edge case of someone generating a bunch of tiny files or anything like that</li> <li><a href="https://news.ycombinator.com/item?id=35397520">Another user responds that their home directory contains almost 5M files</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35329135">The top HN reply to the #2 comment is a second Google engineer saying that Google Drive isn't for files (and that it's for things like Google Docs) and that people shouldn't be using it to store files</a> <ul> <li><a href="https://news.ycombinator.com/item?id=35329135">Someone notes that Google's own page for drive advertises it as a &quot;File Sharing Platform&quot;</a>; this doesn't appear to have changed since, as of this writing, the big above-the-fold blurb on Google's own page about drive is that you can &quot;Store, share, and collaborate on files and folders from your mobile device, tablet, or computer&quot;. 
Unsurprisingly, <a href="https://news.ycombinator.com/item?id=35334560">users indicate that they think Google Drive is for files</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35329654">In low-ranked HN comments, multiple people express surprise that Google didn't bother to contact the users who would be impacted by this change before making it</a></li> <li>This Google engineering attitude of &quot;this is how we imagine users use our product, and if you're using it differently, even if that's how the product is marketed, you're using it wrong&quot; was very common when I worked at Google and I see that it hasn't changed.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35324218">Chrome on Android puts tabloid stories and other spam next to frequently used domains/links</a></li> <li><a href="https://news.ycombinator.com/item?id=34849083">Google pays Apple to not compete on search</a></li> <li><a href="https://news.ycombinator.com/item?id=34916922">Google search has been full of scam ads for years</a></li> <li><a href="https://news.ycombinator.com/item?id=34917701">r/blender repeatedly warns people over many months to not trust results for blender since top hit is malware</a>.</li> <li><a href="https://news.ycombinator.com/item?id=34918052">Rampant nutritional supplement scam advertising on Google</a></li> <li><a href="https://nitter.net/doctorow/status/1628948906657878016">Top search result for local restaurant is a scam restaurant</a></li> <li><a href="https://twitter.com/altluu/status/1461097423883972608">User reports that a massive amount of obvious phishing and spam content makes it through Gmail's spam filter</a>, such as an ad that either steals your payment info or gets you to buy super overpriced gold</li> <li><a href="https://news.ycombinator.com/item?id=35060972">High-ranking / top Google results for many pieces of software are malware that pays for a high-ranking ad slot</a> <ul> <li><a href="https://news.ycombinator.com/item?id=35061544">Reporting this malware doesn't seem effective</a> and <a href="https://news.ycombinator.com/item?id=35061707">the same malware ads can persist for very long periods of time</a> unless someone contacts a Google engineer or makes a viral thread about the ad</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35061704">Top result for Zoom installer is an ad that tries to get you to install badware</a></li> <li><a href="https://news.ycombinator.com/item?id=35061831">User sees a huge number of scam ads on YouTube</a></li> <li><a href="https://news.ycombinator.com/item?id=35067385">User sees a huge number of scam ads on YouTube</a></li> <li><a href="https://news.ycombinator.com/item?id=35036142">User's list of wedding vendors they're using to organize a wedding tagged as phishing</a> and user is warned for violating Google Drive's phishing policy <ul> <li>User tried to get more information but found no way to do so</li> </ul></li> <li><a href="https://fleetdm.com/securing/tales-from-fleet-security-google-groups-scams">Corp security notes that it's very easy to send phishing emails to employees of a corporation by passing them through Google Groups</a></li> <li><a href="https://news.ycombinator.com/item?id=34441697">Google account lost because 2FA recovery process doesn't work</a> <ul> <li>User lost their Google Authenticator 2FA when their phone broke. 
They have their backup recovery codes, but this only lets them log into their account (and uses up a code forever when logging in); after logging in, this does not enable them to change their 2FA, so each login is a countdown to losing their account</li> <li>In the HN comments, some people walk them through the steps to change their 2FA when using backup codes, which works for other users but not this user — the user believes that some kind of anti-fraud system suspects the user is fraudulent, which limits what kinds of 2FA can be used to change 2FA, requiring the original and now lost 2FA to change 2FA, making the recovery codes useless; in standard internet comment style, some people keep telling the user that this works and the user should simply do the steps that work, even though the user has explained multiple times that this does not work for them</li> <li>Someone suggests buying Google One support, but <a href="https://news.ycombinator.com/item?id=34444274">someone else notes that Google One support appears to be very poor even though it's paid support</a>, and people have noted on many other threads that even cloud support can be useless when spending millions, tens of millions, or hundreds of millions a year, so the idea that you'll get support from Google because you pay for it isn't always correct</li> <li>Multiple people have reported the exact same issue and many people report that their mitigation for this is to store the 2FA secrets in their password manager; they know that this means that a computer and/or password manager compromise defeats their 2FA, but they feel that's better than randomly losing their account because the 2FA backup codes can simply not work if Google decides that they're too suspicious <ul> <li><a href="https://news.ycombinator.com/item?id=34443300">Someone suggests setting up multiple Yubikeys to prevent this issue</a>. 
That sounds logical, but I've done this and I can report that it does not prevent this issue — I added multiple 2FA tokens in order to reduce the chance that losing 2FA tokens would cause me to get locked out; at one point, Google became suspicious of the 2FA token I used to log in almost every time and required me to present another 2FA token, making my idea of having multiple 2FA tokens to reduce the risk of a lockout actually backfire since, if Google becomes suspicious of the wrong 2FA tokens, losing any one out of N 2FA tokens could cause my account to become lost</li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33692942">User loses Gmail account after Google system decides the phone numbers they've been using for verification &quot;cannot be used for verification&quot;</a> <ul> <li><a href="https://news.ycombinator.com/item?id=33693183">Another user looks into it and finds that Google's official support suggestion is to create another account</a>, so the anti-fraud false positive means that this person lost their Gmail account</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34443237">User locked out of account after password change</a>; user is sure they're using the correct password because they're using a password manager <ul> <li>As with the above cases, the password reset flow doesn't work; after five weeks of trying daily, doing the exact same steps as each other time worked, so the account was only lost for five weeks and not forever</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34443249">User complains that their Google accounts have been moderately broken for 10 years due to forced Google+ migration in 2013 that left their account in a bad state</a></li> <li><a href="https://news.ycombinator.com/item?id=34116361">User locked out of Google after changing password</a> <ul> <li>Google asks the user to enter the new and old password, which the user does, but this doesn't enable logging in</li> <li>Google sometimes asks the user to scan a QR code from a logged in account, but the user can't do this because they can't log in</li> <li>User changed passwords to their main and recovery accounts at the same time, so they're locked out of both accounts</li> <li>For no discernable reason, repeatedly trying to get into the recovery account eventually worked, which allowed them to eventually get back into their main account</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34119045">User gets locked out of Gmail account when Gmail starts asking for 10+ year old password as well as current password to log in</a> <ul> <li>User finds a suggestion on an old support forum to not try to log in for 40+ days and then try, at which point the user is only asked for their current password and can log in <ul> <li>This is clearly not a universal solution as there are examples of people who try re-logging in every year to lost accounts, which usually doesn't work, but this apparently sometimes works?</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34118359">Someone posts the standard reply about how you shouldn't expect to get support unless you pay for Google One</a>, apparently ignoring how every time someone posts this, people respond to note that Google One support rarely fixes problems like these</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31681221">User loses Gmail account because Gmail suddenly refuses to allow access with only the correct password and requires access to recovery 
email address, which has lapsed</a> <ul> <li><a href="https://news.ycombinator.com/item?id=31681512">A comment blaming the user from someone who apparently didn't read the post</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34116750">User loses Gmail account because Gmail suddenly refuses to allow access with only the correct password and requires an old phone number which is no longer active</a> <ul> <li>This turned out to be another case where waiting a long time and then trying to log in worked</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34118527">User loses Gmail account because Gmail suddenly refuses to allow access with only the correct password and requires an old phone number which is no longer active</a> <ul> <li>In this case, waiting a long time and then trying to log in didn't work and the account seems permanently lost</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34116894">User loses Gmail account because Gmail suddenly refuses to allow access with only the correct password</a>; user has the recovery email as well, but that doesn't help <ul> <li>After three years of trying to log in every few months, logging in worked for no discernable reason, so the account was only lost for three years</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34117206">Google gives away user's Google Voice number, which they were using daily and had purchased credits on that were also lost</a> <ul> <li><a href="https://news.ycombinator.com/item?id=34117467">Someone who apparently didn't read the post suggests to the user they shouldn't have let the number be inactive for 6 months or they should've &quot;made the number permanent&quot;</a></li> <li>Support refuses to refund user for credits and user can't get a new Google Voice number because the old one is still somehow linked to them and is considered a source of spam</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34116640">User loses Gmail account when recovery account token doesn't work</a></li> <li><a href="https://news.ycombinator.com/item?id=34134130">User loses Gmail account when credentials stop working for no discernible reason</a></li> <li><a href="https://old.reddit.com/r/tifu/comments/zndbku/tifu_by_accidentally_buying_two_google_pixels_and/">User has an issue with Google and talks to support; support tells user to issue a chargeback, which results in user's account getting banned and user losing 15 years of account history</a></li> <li><a href="https://news.ycombinator.com/item?id=33963269">User is in the middle of getting locked out of Google accounts and makes a viral post to try to get a human at Google to look at the issue</a></li> <li><a href="https://nitter.net/id_aa_carmack/status/1598391619673358342">John Carmack complains about having &quot;a ridiculous time&quot; with Google Cloud, only getting his issue resolved because he complained on Twitter and is one of the most famous programmers on the planet</a>; he decided to move to another provider after the second time this happened</li> <li><a href="https://news.ycombinator.com/item?id=33361311">Developer documents years of incorrect Google Play Store policy violations and how to work around them</a> <ul> <li><a href="https://news.ycombinator.com/item?id=33362178">Someone claiming to have worked on the Google Play Store team says</a>: &quot;a lot of that was outsourced to overseas which resulted in much slower response time. Here stateside we had a lot of metrics in place to fast response. 
Typically your app would get reviewed the same day. Not sure what it's like now but the managers were incompetent back then even so.&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33363079">Developer notes that they sometimes get updates rejected from the Google Play Store and have even had their existing app get delisted, but that the algorithm for this is so poor that you can make meaningless changes, which has worked for getting them relisted every time so far</a></li> <li><a href="https://news.ycombinator.com/item?id=33361774">Developer banned from Google Play, but they win the appeal</a> <ul> <li>However, the Name / Namespace (com.company.project) continues to be blocked, so they'd have to refactor the app and change the product and company name to continue using Google Play</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33361789">Developer describes their process of interacting with the Google Chrome Webstore</a>, which involves so much kafkaesque nonsense that they have semi-automated handling of the nonsense they know they'll encounter</li> <li><a href="https://news.ycombinator.com/item?id=33362453">Developer has comical, sub-ChatGPT level interactions with &quot;Chrome Web Store Developer Support&quot;</a> (see link for multiple examples)</li> <li><a href="https://twitter.com/PointCrow/status/1587084876741689345">User complains about repeated nonsense video demonetization and age limiting</a>, such as this <a href="https://www.youtube.com/watch?v=I0QFtxRgkYw">I ate water with chopsticks</a> video getting a strike against it for &quot;We reviewed your content carefully, and have confirmed that it violates our violent or graphic content policy&quot;, with a follow-up of &quot;your video was rated [sic] limited by ML then mistakenly confirmed by a manual reviewer as limited .... we've talked to the team to ensure it doesn't happen again&quot;, but of course this keeps happening, which is why the user is complaining (the complaint comes after the video was restricted again and the appeal was denied twice, despite the previous comment about how YouTube would ensure this doesn't happen again).</li> <li><a href="https://twitter.com/TheGoodDeath/status/1580949485877764096#m">User has YouTube video incorrectly taken down for violating community guidelines</a>, but it gets restored after they and <a href="https://twitter.com/ContraPoints/status/1581072538016174082#m">another big YouTuber both write viral Twitter threads about the incorrect takedown</a></li> <li><a href="https://twitter.com/hillelogram/status/1586740840986296322">User notes that Gmail's spam filtering appears to be getting worse</a> <ul> <li>I remember this one because, when this user complained about it, I noticed that I was getting multiple spam emails per day (with plenty of false positives when I checked my spam folder as well)</li> <li><a href="https://twitter.com/ryanoneill/status/1586808096264839168">This complaint from a user was also memorable to me since I was getting the exact same spam as this user</a></li> </ul></li> <li><a href="https://www.reddit.com/r/IdiotsInCars/comments/18zelw9/comment/kgmrpyf/">User notes that Google (?) 
consistently sends you the wrong way into a highway offramp</a></li> <li><a href="https://twitter.com/summoningsalt/status/1575920487171252225?s=46&amp;t=o85YJpDSpmrZZIh_S0W1dQ">User's video on the history of megaman speedruns becomes age restricted, which also mostly demonetizes it?</a> <ul> <li>User appeals, and 45 minutes later, they get a response saying &quot;after careful review, we've confirmed that the age restriction on your video won't be lifted&quot; (video is 78 minutes long) <ul> <li>User then quotes YouTube's own guidelines to show that their video doesn't appear to violate the guidelines</li> </ul></li> <li>User tweets about this, and then YouTube replies saying they lifted the age restriction, but the video stopped getting recommended, so the video was still not making money (this user makes a living off YouTube videos)</li> <li>8 days later, the video is officially age restricted again, and they say that the previous reversal was an error</li> <li>User then makes a video about this and tweets about the video, which then goes viral.</li> <li>YouTube then responds after the tweet about getting the runaround goes viral, with &quot;really sorry again that this was such a confusing / back and forth experience 😞. we’ve shared your video with the right people &amp; if helpful, keep sharing more w/ our community outreach team on that same email too!!&quot;</li> </ul></li> <li>When Jamie Brandon was organizing a database conference, Gmail spamfiltered the majority of emails he sent out about it <ul> <li>~700 people signed up to be notified when tickets were available, but even though they explicitly signed up to get notified, Gmail still spamfiltered Jamie's email</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=28216733">Author publishes a book about their victimization and sex crimes; Google then puts a photo of someone else in an automatically generated profile for the author</a> <ul> <li>&quot;After spending weeks sending feedback and trying to get help from Google support, they finally deleted the woman’s photo, but then promptly replaced it with another Andrea Vassell who is a pastor in New York. She, the pastor in New York, wrote to me that she has been 'attacked' because people believe she is me.&quot;</li> <li>That the person was a pastor of a church also caused problems for some people mentioned in the book; author again tries to get the photo removed, which eventually works, but it is then replaced by the photo of a man who'd been fired for threatening the author, and then months later, the pastor's photo showed up again as the author</li> <li>Author appears to be non-technical and found HN and is writing a desperate plea for someone to do something about it</li> <li>A Google employee whose profile reads &quot;Google's public liaison of Search, helping people better understand search &amp; Google better hear public feedback&quot;, responds with &quot;I'll share more about how you can better get this feedback to us ... [explanation of knowledge panels] ... Normally people just don't like the image we show, so we have a mechanism for them to upload a preferred image. That's very easy to use. But in your case, I understand your reasons for not wanting to have an image used at all. 
I believe if you had filed feedback explaining that, the image would have been removed.&quot;</li> <li>Author is dumbfounded given her lengthy explanation of how much feedback she has already provided and responds with &quot;Are you suggesting that I did not send feedback through the appropriate channels? I have dozens of email exchanges with Google, some of which have multiple people copied on them, and I have screenshots of me sending feedback through your feedback link located within the knowledge panel. (And I explained my situation to them with more detail than I have explained here.). In April and May, I received email responses from Google employees who work for the knowledge panel support team. After they changed the photo twice to images of the wrong women instead of deleting them, I continued complaining and they suggested I contact legal removals. When I contacted legal, I received automated responses to contact the knowledge support team. So I was bounced around. They then began ignoring me and I started receiving automated responses from everyone. Even though I was being ignored, on any given day, I would wake up and find a different photo presented alongside my book. I also reached out to you, Danny Sullivan, <em>directly</em>&quot;</li> </ul></li> <li><a href="https://www.gregegan.net/ESSAYS/GOOGLE/Google.html">Famous sci-fi author spends years trying to get Google to stop putting photos of other people in their &quot;knowledge panel&quot;</a> <ul> <li>This seems to currently be fixed, and it only took between five years and a decade to fix it.</li> </ul></li> <li><a href="https://farside.link/nitter/hacks4pancakes/status/1337208086910668800">User notes that knowledge panel for them is misleading or wrong</a>, and that attempts to claim the knowledge panel to fix this have failed</li> <li><a href="https://hristo-georgiev.com/google-turned-me-into-a-serial-killer">Google knowledge panel for person incorrectly states that they are &quot;also known as The Sadist ... a Bulgarian rapist and serial killer who murdered five people... &quot;</a> <ul> <li>Fixed after a story about this makes it to #1 on HN</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=27623387">User notes that Google's knowledge panels about business often contain incorrect information even when the website for the business has correct information</a></li> <li><a href="https://twitter.com/danluu/status/1514019521123663873">Company reaches out to candidate about a job, eventually giving them an offer. The offer acceptance reply in email is marked as spam by everyone at the company</a> <ul> <li>On looking in the spam folder, one user at the company (me) finds that 19 out of 20 &quot;spam&quot; emails are actually not spam. Other users check and find a huge amount of important email is being classified as spam.</li> <li>Google support responds with what appears to be an automated message which reads &quot;Hi Dan. Our team is working on this issue. 
Meanwhile, we suggest creating a filter by selecting 'Never send it to spam' to stop mail from being marked as spam&quot;, apparently suggesting that everyone with a corp gmail account disable spam filtering entirely by creating a filter that disables the spam filter <ul> <li>One person responds and says they actually did this because they were getting so much important email classified as spam</li> </ul></li> </ul></li> <li><a href="https://mastodon.social/@danluu/109919293479394809">&quot;Obvious to humans&quot; spam gets through Gmail's spam filter all the time while also classifying &quot;ham&quot; as &quot;spam&quot;</a></li> <li>I emailed a local window film installer and their response to me, which quotes my entire email, went straight to spam <br /></li> </ul> <h3 id="facebook-meta">Facebook (Meta)</h3> <ul> <li><a href="https://nitter.net/FordFischer/status/1302391363456176128">Journalist's account deleted and only restored after Twitter thread on deletion goes viral</a></li> <li><a href="https://nitter.net/RMac18/status/1382366931307565057">Facebook moderator notes there's no real feedback or escalation path between what moderators see and the people who set FB moderation policy</a></li> <li><a href="https://news.ycombinator.com/item?id=31707349">User banned from WhatsApp with no reason given</a> <ul> <li>appeal resulted in a generic template response</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31707349">Instagram user can no longer interact with account</a> <ul> <li>would like to remove account, but can't because login fails</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31708987">Multiple users report they created a FB account so they can see and manage FB ads; accounts banned and can no longer manage ads</a></li> <li><a href="https://news.ycombinator.com/item?id=31577325">User banned after FB account hacked</a> <ul> <li>account restored after viral HN story</li> </ul></li> <li>On a local FB group, user posts &quot;Looking for some tech advice (admins delete if not allowed)... my Instagram account was hacked and I have lost all access to that account. The guy is still posting as me daily and communicating to others as me in messages (its a bitcoin scam). Does anyone know how I can communicate with Instagram directly? There does not appear to be any way to contact them and all the instructions I've followed lead me nowhere bc I have completely lost access to that account! 
😫 Thank you!&quot; <ul> <li>Someone suggests Instagram's instructions for this, <a href="https://help.instagram.com/368191326593075">https://help.instagram.com/368191326593075</a>, but user replies and says that these didn't work because &quot;I did all that but unfortunately the hacker was in my email and and verified all the changes before I noticed&quot;</li> <li>I replied and suggested searching LinkedIn for a connection to an employee, since the only things that work are internal escalation or going viral</li> </ul></li> <li><a href="https://social.lol/@robb/111704215593992932">Facebook incorrectly reports a user to DigitalOcean for phishing for a blog post they wrote</a> <ul> <li>DigitalOcean sends them an automated message saying that their droplet (machine/VM) will be suspended if they don't delete the offending post within 24 hours</li> <li>user appeals and the appeal goes through; unclear if it would've gone through without the viral HN thread about this</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38882891">User banned from FB marketplace for &quot;violating community guidelines&quot; after posting an ad for a vacuum</a> <ul> <li>user appeals multiple times and each appeal is denied, ending with &quot;Unfortunately, your account cannot be reinstated due to violating community guidelines. The review is final&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38881698">Reporting post advocating for violence against a person does nothing</a></li> <li><a href="https://news.ycombinator.com/item?id=38881730">Reporting post where one user tells another user to kill themselves does nothing</a></li> <li><a href="https://news.ycombinator.com/item?id=38882156">Murdered person is flooded with racist comments; friends report these, which does nothing</a></li> <li><a href="https://erinkissane.com/meta-in-myanmar-full-series">40,000-word series of articles by Erin Kissane that I'm not going to attempt to summarize</a></li> <li><a href="https://news.ycombinator.com/item?id=38613594">Facebook doesn't take down obvious scam ads after reporting them</a></li> <li><a href="https://news.ycombinator.com/item?id=38613594">User stops reporting obvious scam ads to Facebook because they never remove them</a>, always saying that the ad didn't breach any standards</li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18etniq/someone_hacked_my_late_aunts_facebook_and_fake/">Takeover of dead person's Facebook account to run scams</a></li> <li>See &quot;Kamantha&quot; story in body of post</li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18etniq/comment/kcq1vj0/">Facebook refuses to do anything about account that was taken over and constantly posts scams</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18etniq/comment/kcqiuqy/">Facebook refuses to do anything about fake page for business</a></li> <li><a href="https://news.ycombinator.com/item?id=38269276">Reporting scammer on Facebook does nothing</a></li> <li><a href="https://www.youtube.com/watch?v=SMFCLpUuhqY">Paying for &quot;creator&quot; level of support on Facebook / Insta appears to be worthless</a> <ul> <li>The review is that support is sort of nice, in that you get connected to a human being who isn't reading off a script, but also useless. At one point Jonny Keeley had a video that didn't upload and support's recommendation was to try editing the video again and uploading it again. 
Keeley asked support why that would fix it and the answer was basically, there's no particular reason to think that it might fix it, but it might also just work to re-upload the video. Another time, Keeley got &quot;hacked&quot; and went to support. Support once again responded quickly, but all they did was send him a bunch of links that he thinks are publicly available. Keeley was hoping that support would fix the issue, but instead they appear to have given him information he could've googled.</li> </ul></li> <li><a href="https://www.cnn.com/2023/11/08/tech/meta-facebook-instagram-teen-safety/index.html">Zuckerberg rejected proposals to improve teen mental health from other FB execs</a> <ul> <li>article notes &quot;that a lack of investment in well-being initiatives meant Meta lacked 'a roadmap of work that demonstrates we care about well-being.'&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=37938866">Malicious Google ad from Google-verified advertiser</a>; ad only removed after major publication writes a story about it <ul> <li>A user notes that something that amplifies the effectiveness of this attack is that Google allows advertisers to show fake domains, which is necessary for them to do tracking as they currently do it and not show some weird tracking domain</li> </ul></li> <li><a href="https://lerner.co.il/2023/10/19/im-banned-for-life-from-advertising-on-meta-because-i-teach-python/">User gets lifetime ban from running ads because they ran ads for courses teaching people to use pandas (the Python library)</a> <ul> <li>User hits appeal button on form and is immediately banned for life. <a href="https://news.ycombinator.com/item?id=37949763">Someone notes that the appeal button is a trap and you should never hit the appeal button???</a>. Apparently you should <a href="https://news.ycombinator.com/item?id=37254898">fill out some kind of form that you won't be able to fill out if you hit the appeal button and are immediately banned</a>?</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=37944695">User notes pervasive scam ads</a></li> <li><a href="https://nitter.net/JakeMooreUK/status/1680962682726363136">You can deactivate anyone's WhatsApp account by sending an email asking for it to be deactivated</a> <ul> <li>This is sort of the opposite of all those scam FB accounts where reporting that the account is scamming does nothing</li> </ul></li> <li><a href="https://www.threads.net/@chancerydaily/post/CuxYO8IR8Ea?igshid=NTc4MTIwNjQ2YQ%3D%3D">User has innocuous Threads message removed for &quot;violating Community guidelines&quot;</a>, and then asks why there's so much spam that doesn't get removed but their message gets removed</li> <li><a href="https://www.threads.net/@parkermolloy/post/CuxTZD0vy52?igshid=NTc4MTIwNjQ2YQ%3D%3D">User has Threads message removed with message saying that it violates community guidelines; message is a reply to themselves that reads &quot;(Also, please don't respond to this with some 'well, on the taxpayer funding front, I think they have a point...' stuff. If you can read an article that highlights efforts to push people like me out of society and your takeaway is 'Huh, I think those people have a point!' then I'd much rather you not comment at all. 
I &quot;</a> <ul> <li>Like many others, user notes that they've repeatedly reported messages that do actually violate community guidelines and these messages don't get removed</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36359508">Rampant fraud on Instagram, Facebook, and WhatsApp</a></li> <li><a href="https://jacobin.com/2023/06/meta-is-trying-and-failing-to-crush-unions-in-kenya">Meta moderation in Kenya</a></li> <li><a href="https://blog.adafruit.com/2023/05/13/facebook-claiming-people-showing-art-electronics-wheelchair-modifications-is-hate-speech-our-video-show-and-tell-removed-from-facebook/">Facebook removes post showing art, electronics, and wheelchair mods as &quot;hate speech&quot;</a> <ul> <li>No support action does anything, but the post is restored after the story about this goes viral</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35936225">User notes that stories that vaguely resemble holding a gun to one's head, such as holding a hair dryer to one's head, get flagged</a></li> <li><a href="https://www.threads.net/@soozkempner/post/C2PKuDUIhzL">User reports Threads desktop isn't usable for them for 6 weeks and then it suddenly starts working again</a>; logging in on most browsers gives them an infinite loop</li> <li>Dead link due to Mastodon migration, but comment about FB spam which used to be accessible at <a href="https://mastodon.social/@jefftk@mastodon.mit.edu/109826480309020697">https://mastodon.social/@jefftk@mastodon.mit.edu/109826480309020697</a></li> <li><a href="https://news.ycombinator.com/item?id=34803157">User banned from Facebook's 2FA system (including WhatsApp, Insta, etc., causing business Insta page to get deleted) due to flaw in password recovery system</a> <ul> <li>Despite having 2FA enabled, someone was able to take over this person's FB account. On appealing this, user is told &quot;We've determined that you are ineligible to use Facebook&quot;</li> <li>User also used FB login for DMV and is no longer able to log into DMV</li> <li>New accounts the user creates are linked to the old account and banned. Someone comments, &quot;<a href="https://news.ycombinator.com/item?id=34804934">lol so they can identify/verify that but somehow fail to fingerprint login from Vietnam and account hijacking.</a>&quot;</li> <li>As usual, multiple people have <a href="https://news.ycombinator.com/item?id=34808158">the standard response</a> that it's <a href="https://news.ycombinator.com/item?id=34803698">the user's fault for using a big platform</a> and that no one should use these platforms, with comments like &quot;It's common sense and obvious, yet whenever it gets mentioned, the messenger gets dunked on for victim blaming or whatever ... 
Somehow, this is a controversial opinion on HN&quot; (there are many more such comments; I just linked a couple) <ul> <li>The author, who already noted in the post that his industry is dependent on Instagram, asks &quot;Please educate me on how to get the potential clients to switch platforms that they use to view pictures?&quot; and didn't get a response; presumably, as is standard, none of these commenters read the post, although perhaps some of them just think that no one should work in these industries and anyone who does so should go flip burgers or something</li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34803491">User's account banned after it was somehow compromised from a foreign IP</a> <ul> <li><a href="https://news.ycombinator.com/item?id=34803620">User gets the standard comment about how FB couldn't possibly review cases like this due to its scale</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34807319">User effectively banned from Facebook due to broken password recovery process</a>, which requires getting total strangers to vouch that you're a real person, presumably due to some bad ML. <ul> <li>Afterwards, some scammer created a fake profile of the person, so there's now a fake version of the person around on FB and not a real one</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34804049">User effectively banned from FB due to bad &quot;suspicious activity&quot; detection</a> despite having 2FA on and having access to their password and 2FA</li> <li><a href="https://news.ycombinator.com/item?id=34803729">User repeatedly has account suspended for no discernable reason</a></li> <li><a href="https://news.ycombinator.com/item?id=34805537">User effectively banned from FB until a friend of theirs starts a job there, at which point their friend opens an internal ticket to get them unbanned</a></li> <li><a href="https://www.nj.com/news/2022/09/womans-facebook-was-hacked-and-disabled-her-instagram-too-why-is-meta-doing-to-fix-it.html">User banned from FB after account hacked</a></li> <li><a href="https://perfectionhangover.com/facebook-disabled-deactivated-my-account/">User banned from FB after account hacked</a> <ul> <li>See comments for many other examples</li> </ul></li> <li><a href="https://emilycordes.com/facebook/">User banned from Facebook after account hacked</a> <ul> <li>Luckily for the user, this made the front page of HN and was highly ranked, causing some FB engineers to reach out and then fix the issue</li> <li>Of course the HN post has the standard comments; one commenter suggests that people with the standard comments actually read the article before commenting, for once: &quot;Anyone saying 'Good riddance! Go enjoy your life without Facebook!' is missing the point. Please read this bit from the article:&quot;Thing is I’m a Mum of two who has just moved to a new area. Facebook groups have offered me support and community, and Mums I’ve met in local playgrounds have added me as a friend so we can use messenger to plan playdates. 
Without these apps sadly my little social life becomes a lot lonelier, and harder.&quot; <ul> <li><a href="https://news.ycombinator.com/item?id=31579665">Undeterred, commenters respond to this comment with things like</a> &quot;this might actually have been a blessing in disguise--just the encouragement she needed to let go and move on from this harmful platform.&quot;</li> </ul></li> </ul></li> <li><a href="https://www.google.com/maps/reviews/@37.485073,-122.1482824,17z/data=!3m1!4b1!4m6!14m5!1m4!2m3!1sChdDSUhNMG9nS0VJQ0FnSUNSd3N2YnVnRRAB!2m1!1s0x0:0x64979e438bf4e3a5?hl=en-US&amp;entry=ttu">People who don't know employees at FB who can help them complain on Google Maps about Facebook's anti-fraud systems</a></li> <li><a href="https://news.ycombinator.com/item?id=31579437">User banned from Facebook after posting about 12V system on Nissan Leaf in Nissan Leaf FB group</a> <ul> <li>The post was (presumably automatically) determined to have violated &quot;community standards&quot;, requiring identity verification to not be banned</li> <li>&quot;OK, I upload my driving licence. And it won't accept the upload. I try JPEG, PNG, different sizes, different browsers, different computers. Nothing seems to stick and after a certain number of attempts it says I have to wait a week to try again. After as couple of rounds of this the expiry deadline passes and my account is gone.&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31577975">Person notes that their wife and most of their wife's friends have lost their FB accounts at least once</a></li> <li><a href="https://news.ycombinator.com/item?id=31578591">Person notes that girlfriend's mother's account was taken over and reporting the account as being taken over does nothing</a> <ul> <li>The person, a programmer, finds it odd that taking over an account and changing the password, email, profile photo, full name, etc., all in quick succession doesn't trigger any kind of anti-fraud check</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31581510">User reports that you can get FB accounts banned by getting people in groups dedicated to mass reporting accounts to all report an account you want to get banned</a></li> <li>Someone wrote a spammy reply to a &quot;tweet&quot; of mine on Threads that was trending (they replied with a link to their substack and nothing else). I reported it and, of course, nothing happened. 
I guess I need to join one of the mass reporting groups to ask a bunch of people to report the spam.</li> <li><a href="https://news.ycombinator.com/item?id=31577718">User is locked out of account and told they need to upload photo ID, which does nothing</a> <ul> <li>Six months later, user gets to know a Facebook employee, who gets them unbanned</li> </ul></li> <li><a href="https://www.newsweek.com/onlyfans-star-slept-meta-employees-instagram-unbanned-1708744">User has Facebook account banned and can't get it unbanned</a> <ul> <li>User tried to contact FB employees on LinkedIn, which failed</li> <li>User then used Instagram to meet FB employees and sleep with them, resulting in the account getting unbanned</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31581475">User is effectively banned from Instagram because they logged in from a new device and can't confirm a no-longer active email</a></li> <li><a href="https://news.ycombinator.com/item?id=31578789">User gets FB account stolen, apparently bypassing 2FA check the user thought would protect them</a> <ul> <li>White male, father of 3, has FB account replaced by young Asian female account, apparently not at all suspicious to FB anti-fraud systems</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31583954">User finds that someone is impersonating them on Instagram</a>; reporting this does nothing</li> <li><a href="https://news.ycombinator.com/item?id=31577818">User has ad account hacked; all attempts to contact support get no response or a useless auto-response or a useless webpage</a></li> <li><a href="https://news.ycombinator.com/item?id=31578197">User reports that there are multiple fake accounts impersonating them and family members</a> and that reporting these accounts does nothing</li> <li><a href="https://news.ycombinator.com/item?id=31631585">Relatively early post-IPO Facebook engineer has account banned from Facebook</a> and of course no standard appeal process works <ul> <li>User reports that their engineering friends inside the company are also unable to escalate the issue, so their account as well as ads money and Oculus games are lost</li> </ul></li> <li><a href="https://mastodon.social/@skolima/109762138275850929/">Sophisticated/technical user gets Instagram account stolen despite using strong random password, password manager, and 2FA</a> <ul> <li>Crypto people had been trying to buy the account for 6 months and then the account was compromised</li> <li><a href="https://mastodon.social/@alexjsp/109760284680279691">Following Instagram's official instructions for what to do literally results in an infinite loop of instructions</a></li> <li>Instagram claims that they'll send you an email from <code>security@mail.instagram.com</code> if you change the email on your account, but this didn't happen; user looked at their Fastmail logs and believes that their email was not compromised</li> <li>User was able to regain their Insta account after the story hit the front page of HN</li> <li>Multiple people note that there are services that claim to be able to get you a desired Insta handle for $10k to $50k; it's commonly believed that this is done via compromised Facebook employees. 
Since there is (in general) no way to appeal or report these things, whatever it is that these services do is generally final unless you're famous, well connected in tech, or can make a story about you go viral</li> </ul></li> <li><a href="https://social.coop/@jordancox/109762988866327497">Desirable Instagram handle is stolen</a> <ul> <li>The first two times this happened, user was able to talk to a contact inside Facebook to fix it, but they lost their contact at Facebook so the handle was eventually stolen and appears to be gone forever</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34549132">User tries to recover their mother's hacked Instagram account and finds that the recovery process docs are an infinite loop</a> <ul> <li>They also find that the &quot;click here if this login wasn't you&quot; e-mail anti-fraud link when someone tries to log in as you is a 404</li> <li>They also find that if an account without 2FA on gets compromised and the attacker turns on 2FA, this disables all old recovery mechanisms.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34553081">User logs in and is asked for email 2FA</a> <ul> <li>Email never arrives, isn't in spam folder, etc.</li> <li>User asks for another code, which returns the error &quot;Select a valid choice. 0 is not one of the available choices.&quot;</li> <li>Subsequent requests for codes fail. User tries to contact support and gets a pop-up which says &quot;If you’re unable to get the security code, you need to use the Instagram app to secure your account&quot;, but the user doesn't have the Instagram app installed, so their account is lost forever</li> </ul></li> <li><a href="https://fashionweekdaily.com/instagram-handle-sussexroyal-controversy/">Instagram takes username from user to give it to a more famous user</a>, a common story</li> <li><a href="https://news.ycombinator.com/item?id=34549196">User with password, 2FA, registered PGP key (!?) gets locked out of account due to some kind of anti-fraud system triggering</a>; FB claims that only a passport scan will unlock the account, which the user apparently hasn't tried</li> <li><a href="https://news.ycombinator.com/item?id=34443330">User finds that it's not possible to move Duo 2FA and loses account forever</a> <ul> <li>According to the user, FB has auth steps to handle this case, which involve sending in ID docs, which the user tries annually. 
These steps do nothing</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=39145302">User with Pixel phone can't use Bluetooth for months because Google releases an update that breaks Bluetooth (presumably only for some and not all devices) and doesn't bother to fix it</a></li> <li><a href="https://twitter.com/danluu/status/1584615878800576512">I tried clicking on some Facebook ads (I normally don't look at or click on them) leading up to Black Friday and most ads were actually scams</a></li> <li><a href="https://news.ycombinator.com/item?id=33280491">User reports fake FB profile (profile uses images from a famous dead person)</a> and gets banned after reporting the profile a lot; user believes they were banned for reporting this profile too many times</li> <li><a href="https://www.facebook.com/jefftk/posts/pfbid0gytpKikevTKLat5sE4En6BsRsGxpoaF3oR5joPFZXwgKuq2GrSLEbnV9xK24LScFl?comment_id=512928177058808&amp;reply_comment_id=1306228043460821">User makes FB post about a deepfake phishing attack</a>, which then attracts about 1 spam comment per minute that they have to manually delete because FB's systems don't handle this correctly</li> <li><a href="https://twitter.com/danluu/status/905456556979965953">FB Ad Manager product claims reach of 101M people in the U.S. aged 18-34, but U.S. census has the total population being 76M, a difference of 25M assuming all people in the U.S. in that age group can be reached via FB ads</a> <ul> <li>Former PM of the ads targeting team says that this is expected and totally fine because FB can't be expected to slice and dice numbers as small as tens or hundreds of millions accurately. &quot;Think at FB scale&quot;.</li> </ul></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/rwm50a/i_was_given_a_week_facebook_ban_over_sandshrew/">User gets banned from FB for a week for posting sexual content when they posted an image of a pokemon</a></li> <li>For maybe five years or so, I would regularly get spam in my feed where a scammer would sell fake sneakers and then tag a bunch of people, tagging someone I'm FB friends with, causing me to get this spam into my feed <ul> <li>This exact scam doesn't show up in my feed all the time anymore, but tag spam like this still sometimes shows up</li> </ul></li> <li><a href="https://twitter.com/Jxck_Sweeney/status/1766191047590420685">Instagram takes down account posting public jet flight information when billionaire asks them to</a></li> </ul> <h3 id="amazon">Amazon</h3> <ul> <li><a href="https://nitter.net/fchollet/status/1550930876183166976">Author notes that 100% of the copies of their book sold on Amazon are counterfeits (100% because Amazon never buys real books because counterfeiters always have inventory)</a> <ul> <li>Author spent months trying to get Amazon to take action on this; no effect</li> <li>Author believes that most high-sale volume technical books on Amazon are impacted and says that other authors have noticed the same thing</li> </ul></li> <li><a href="https://kurtisknodel.com/blog?post=amazon">Top USB drive listings on Amazon are scams</a></li> <li><a href="https://news.ycombinator.com/item?id=32289842">Amazon retail website asks user to change password; Amazon retail account and AWS stop working</a> <ul> <li>never restored</li> </ul></li> <li><a href="https://www.reddit.com/r/magicTCG/comments/omj8jh/amazon_customers_beware/">Magic card scam on Amazon</a> <ul> <li><a 
href="https://www.reddit.com/r/mtgfinance/comments/omjagl/after_watching_the_video_from_uebisquid_i_got/">many customers report that &quot;rare&quot; cards were removed from packs bought from Amazon</a></li> </ul></li> <li><a href="https://nitter.net/fchollet/status/1550930876183166976">Counterfeit books sold on Amazon</a> <ul> <li>seller of non-counterfeit books reported to Amazon various times over the years without effect</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32211102">User notes that Amazon is more careful about counterfeits in Australia than in the U.S. due to regulatory action, and that Valve only issued refunds in some cases due to Australian regulatory action</a></li> <li><a href="https://news.ycombinator.com/item?id=32210999">User notes that Amazon sells counterfeit Kindle books</a></li> <li><a href="https://news.ycombinator.com/item?id=32214971">Author notes that Amazon sells counterfeit copies of their book</a></li> <li><a href="https://wonderbowgames.com/en-uk/blogs/news/counterfeit-copies-of-kelp">Boardgame publisher reports counterfeit copies of their game on Amazon, which they have not been able to get Amazon to remove</a> <ul> <li>I saw this on a FB group I'm on since the publisher is running a PR blitz to get people to report the fake copies on Amazon in order to get Amazon to stop selling counterfeits</li> </ul></li> <li><a href="https://nitter.net/joabaldwin/status/1741145625809559933">Amazon resells returned, damaged, merchandise</a> <ul> <li>This is so common that authors warn each other that this happens and so that other authors know that to leave a note in the book telling the user what happened when authors return damaged author's copies of books</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38819176">Amazon ships &quot;new&quot; book with notes scribbled on pages and exercises done; on asking for replacement, user gets a 2nd book &quot;new&quot; in similar condition</a></li> <li><a href="https://www.youtube.com/watch?v=B90_SNNbcoU&amp;themeRefresh=1">Top-selling Amazon fuses dangerously doesn't blow at well above rated current</a></li> <li><a href="https://news.ycombinator.com/item?id=38820664">Amazon sells used items as new</a></li> <li><a href="https://news.ycombinator.com/item?id=38819094">Amazon sells used items as new</a></li> <li><a href="https://news.ycombinator.com/item?id=38819234">Amazon sells used items as new</a></li> <li><a href="https://news.ycombinator.com/item?id=38819528">Amazon sells used items as new</a></li> <li><a href="https://news.ycombinator.com/item?id=38819432">Amazon sells used item as new; book has toilet paper inside</a></li> <li><a href="https://news.ycombinator.com/item?id=38819839">Amazon sells used item as new; book has toilet paper inside</a></li> <li><a href="https://news.ycombinator.com/item?id=38821489">Amazon ships HDDs in oversized box with no padding</a></li> <li><a href="https://news.ycombinator.com/item?id=38821770">Amazon sells used or damaged items as new</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18upi9k/brand_new_300_microwave_from_amazon_came_full_of/">Amazon sells used microwave full of food detritus as new</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18upi9k/comment/kflvmby/">Amazon sells used pressure cooker with shopping bag and food smell as new</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18upi9k/comment/kfm4psd/">Amazon sells used vacuum cleaner as new, complete with home 
address and telephone number of the person who returned the vacuum</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18upi9k/comment/kfnvuf2/">Amazon ships incorrect product to user buying a new item, apparently because someone returned a different item</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18upi9k/comment/kfra5de/">Amazon sells incomplete used item as new</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18upi9k/comment/kfluk9m/">Amazon sells used items as new</a></li> <li><a href="https://www.reddit.com/r/mildlyinteresting/comments/18uq0rg/this_book_sold_to_me_as_new_from_amazon_had_an/">Amazon sells used item as new, complete with invoice for sale 13 years ago, with name and address of previous owner</a></li> <li><a href="https://news.ycombinator.com/item?id=38817916">Amazon selling low-quality, counterfeit engine oil filters</a></li> <li><a href="https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/warning-letters/amazoncom-inc-662503-12202023">Amazon sells supplements with unlabeled or mislabeled ingredients</a> <ul> <li><a href="https://news.ycombinator.com/item?id=38795981">Someone notes that Amazon used to require certification for supplements, but the person who was driving this left Amazon and it appears that no one has picked it up</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38794482">Amazon sells counterfeit shampoo that burns user's scalp</a> <ul> <li>User wrote a review, which was deleted by Amazon</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38795941">Amazon sells damaged, used, items as new</a></li> <li><a href="https://news.ycombinator.com/item?id=38795469">Amazon sells counterfeit supplement</a> <ul> <li>User wrote a review noting this, which Amazon deleted</li> </ul></li> <li><a href="https://www.reddit.com/r/lego/comments/18qomuv/amazon_set_filled_with_trash/">Amazon sells box full of trash as new Lego set</a></li> <li><a href="https://www.reddit.com/r/Wellthatsucks/comments/18or8h3/didnt_order_a_refurbished/">Amazon sells used item with a refurbished sticker on it as new</a></li> <li><a href="https://www.reddit.com/r/Wellthatsucks/comments/18or8h3/comment/kekcvmm/">User has National Geographic subscription that can't be cancelled through the web interface, so they talk to Amazon support to cancel it; Amazon support cancels their Amazon Prime subscription instead</a></li> <li><a href="https://www.reddit.com/r/Wellthatsucks/comments/18or8h3/comment/kejdhoe/">Amazon sells used, damaged, item as new</a></li> <li><a href="https://www.reddit.com/r/Wellthatsucks/comments/18or8h3/comment/kejp9p3/">Amazon sells used item with Kohl's sticker on it as new</a></li> <li><a href="https://boardgamegeek.com/thread/3209950/scam-amazon-testament-returned-item-onsold-new">Amazon sells nearly empty boardgame box as new, presumed to be returned item with game removed</a></li> <li><a href="https://www.facebook.com/groups/132851767828/permalink/10161153164112829/">Amazon sells counterfeit board game</a></li> <li><a href="https://news.ycombinator.com/item?id=38691752">User writes review noting that product tries to buy fake reviews; Amazon deletes their review as being bought because it mentioned this practice</a></li> <li><a href="https://news.ycombinator.com/item?id=38691794">User writes review noting that product tries to buy fake reviews; Amazon deletes their review as being bought because it mentioned this practice</a></li> 
<li><a href="https://news.ycombinator.com/item?id=38691812">User writes review noting that product tries to buy fake reviews; Amazon deletes their review as being bought because it mentioned this practice</a></li> <li><a href="https://news.ycombinator.com/item?id=38696003">User writes review noting that product tries to buy fake reviews; Amazon deletes their review as being bought because it mentioned this practice</a></li> <li><a href="https://www.reddit.com/r/hardware/comments/sd10h1/counterfeit_sd_cards/">Amazon sells counterfeit SD cards; user compares to reference SD card bought at brick and mortar store</a> <ul> <li>A commenter notes that counterfeit SD cards are so common on Amazon that r/photography has a blanket recommendation against buying SD cards on Amazon</li> </ul></li> <li><a href="https://www.fredmiranda.com/forum/topic/1772750/1&amp;year=2022#16137759">User leave a review noting that product is a scam/fake an review is rejected</a></li> <li><a href="https://www.fredmiranda.com/forum/topic/1619442/0">Counterfeit lens filter on Amazon</a>; multiple users note that they never buy camera gear (which includes things like SD cards) from Amazon because they've received too many counterfeits</li> <li><a href="https://www.reddit.com/r/pcmasterrace/comments/191v2mg/comment/kgy40ar/">Amazon sells used, dirty, CPU as new CPU</a>; CPU is marked &quot;NOT FOR RESALE&quot; (NFR) <ul> <li>it's not known why this CPU is marked NFR; a commenter speculates that it was a review copy of a CPU, in which case it would be relatively likely to be <a href="why-benchmark/">a highly-binned copy that's better than what you'd normally get</a>. On the other hand, it could also be an early engineering sample with all sorts of bugs or other issues; when I worked for a CPU company, we would buy Intel CPUs to test them and engineering samples would not only have a variety of bugs that only manifested in certain circumstances, they would sometimes have instructions that did completely different things that could be reasonable behavior, except that Intel had changed the specified behavior before release, so the CPU would just do the wrong thing, resulting in crashes on real software (this happened with the first CPU we were able to get that had the <code>MWAIT</code> instruction, an engineering sample that was apparently from before Intel had finalized the current behavior of <code>MWAIT</code>).</li> </ul></li> <li><a href="https://www.cbc.ca/news/gopublic/amazon-returns-watch-missing-1.7046767">Amazon doesn't refund user after they receive empty box instead of $2100 item, but there's a viral story about this, so maybe Amazon will fix this</a></li> <li><a href="https://www.reddit.com/r/SonyAlpha/comments/beoecq/update_amazon_sent_me_the_wrong_lens_and_is/">Amazon refuses to refund user after sending them an old, used, camera lens instead of high-end new lens</a> <ul> <li>On the photography forum where this is posted, users note that you should never by camera lenses or other high-end gear from Amazon if you don't want to risk being scammed</li> </ul></li> <li><a href="https://www.reddit.com/r/photography/comments/rx50ny/psa_watch_out_when_buying_used_lens_from_amazon/">Amazon doesn't refund user after sending them a low-end used camera lends instead of the ordered high-end lens</a> <ul> <li>Users on this photography forum (a different one than the above) note that this happens frequently enough that you should never order camera lenses from Amazon</li> </ul></li> <li><a 
href="https://petapixel.com/2021/05/26/couple-buys-7000-camera-from-amazon-gets-empty-boxes-instead/">Amazon refuses to refund user who got an empty box instead of a $7000 camera</a> until story goes viral and Amazon gets a lot of bad press <ul> <li>Based on the shipping weight, Amazon clearly shipped something light or an empty box and not a camera</li> </ul></li> <li><p><a href="https://www.cbc.ca/news/canada/british-columbia/amazon-shoe-packages-1.6926200">User gets constant stream of unwanted Amazon packages</a></p> <ul> <li>In response to a news story, Amazon says &quot;The case in question has been addressed, and corrective action is being taken to stop the packages&quot;, but the user reports that nothing has changed</li> </ul></li> <li><p><a href="https://www.reddit.com/r/onguardforthee/comments/18alquj/comment/kbypxid/">Amazon sells user used AirPods, which later causes a problem when they want to use the 1-year warranty because Apple shows an in-service date 2 months before the user bought the item</a></p> <ul> <li>To fix this, Apple requests that the user get some evidence from Amazon that the particular serial number was associated with their purchase and Amazon refuses to do this; people recommend that, to fix this, the user do the &quot;standard&quot; Amazon scam of buying a new item and returning a used item to swap a broken used item for a new item</li> </ul></li> <li><p><a href="https://www.reddit.com/r/onguardforthee/comments/18alquj/comment/kbyzh9t/">User receives old, used, HD from Amazon instead of new HD</a></p></li> <li><p><a href="https://www.reddit.com/r/Justrolledintotheshop/comments/17sgdc0/psa_stop_buying_car_parts_on_amazon/">Mechanic warns people not to buy car parts on Amazon because counterfeits are so frequent</a></p> <ul> <li>They note that you can get counterfeits in various places, but the rate is highest from Amazon and it's often a safety issue; they're current dealing with a customer who had counterfeit brake pads</li> <li>Many other mechanics reply and report similar issues, e.g., someone bought a water pump from Amazon that exploded after 5 months that they believe is fake</li> </ul></li> <li><p><a href="https://www.reddit.com/r/YouShouldKnow/comments/ifytxk/ysk_that_amazon_has_a_serious_problem_with/">User stops buying household products from Amazon because counterfeit rate is too high</a></p></li> <li><p><a href="https://www.reddit.com/r/YouShouldKnow/comments/ifytxk/comment/g2r9srl/">User gets counterfeit card game from Amazon</a></p></li> <li><p><a href="https://www.reddit.com/r/YouShouldKnow/comments/ifytxk/comment/g2rtfex/">User gets counterfeit board game from Amazon</a></p></li> <li><p><a href="https://www.reddit.com/r/YouShouldKnow/comments/ifytxk/comment/g2sexe6/">Amazon sells counterfeit gun parts and accessories</a></p></li> <li><p><a href="https://www.reddit.com/r/YouShouldKnow/comments/ifytxk/comment/g2tfkpa/">Amazon sells so many counterfeits that board game maker runs a marketing campaign to ask people to stop buying their game on Amazon</a></p> <ul> <li>They spent months trying to get Amazon to go after counterfeits without making progress until the marketing campaign; two days after they started it, Amazon contacted them to try to deal with the issue</li> </ul></li> <li><p><a href="https://themarkup.org/show-your-work/2020/06/18/how-we-investigated-banned-items-on-amazon-com">Searching for items in many categories trivially finds huge number of fraudulent or counterfeit items</a></p></li> <li><p><a 
href="https://www.reddit.com/r/YouShouldKnow/comments/ifytxk/comment/g2r4l2r/">User gets counterfeit hair product that burns scalp</a></p></li> <li><p><a href="https://www.reddit.com/r/books/comments/1976o66/found_out_my_friend_returns_books_right_after/">User receives used book from Amazon and their friend tells them that it's normal to buy books and return them in the return window, which their friend does all the time</a></p></li> <li><p><a href="https://forums.macrumors.com/threads/amazon-shuts-down-customers-smart-home.2392704/">Amazon driver mishears automated response from Eufy doorbell</a>, causing Amazon to shut down user's smarthome (user was able to get smarthome and account back after one week)</p> <ul> <li>Video footage allegedly shows that the doorbell said &quot;excuse me, can I help you&quot;, which lead to an Amazon executive personally accusing this user of racism; when account was unlocked, the user wasn't informed (except that things started working again)</li> <li>In the comments to the article, someone says that it's impossible that Amazon would do this, with comments like &quot;None of this makes any sense and is probably 100% false.&quot;, as if huge companies can't do things that don't make any sense, but Amazon's official response to a journalist reaching our for comment confirms that the something like the events happened; if it was 100% false, it would be very strange for Amazon to respond thusly instead of responding with a denial or not responding</li> </ul></li> <li><p><a href="https://www.youtube.com/watch?v=Kcohq313q00">Youtuber who made a video about the above has their Amazon Associates account deleted after video goes viral</a></p></li> <li><p><a href="https://news.ycombinator.com/item?id=36766149">Amazon account gets locked out</a>; support refuses to acknowledge there's an issue until user calls back many times and then tells user to abandon the account and make another one</p></li> <li><p><a href="https://news.ycombinator.com/item?id=36447510">User has Amazon account closed because they sometimes send gifts to friends back home in Belarus</a></p></li> <li><p><a href="https://news.ycombinator.com/item?id=36447380">User gets counterfeit item from Amazon</a>; they contact support with detailed photos showing that the item is counterfeit and support replies with &quot;the information we have indicates that the product you received was authentic&quot;</p></li> <li><p><a href="https://www.reddit.com/r/pcmasterrace/comments/1996od8/update_amazon_sent_the_wrong_gpu_twice_but_the/">User gets the wrong GPU from Amazon, twice</a>; luckily for them, the second time, Amazon sent a higher end GPU than was purchased, so the user is getting a free upgrade</p></li> <li><p><a href="https://www.inc.com/sonya-mann/amazon-counterfeits-no-starch.html">Technical book publisher fails to get counterfeits removed from Amazon</a></p> <ul> <li>Amazon announced a new system designed to deal with this, but people continue to report rampant technical book counterfeiting on Amazon, so the system does not appear to have worked</li> </ul></li> <li><p><a href="https://news.ycombinator.com/item?id=35919753">ChatGPT clone of author's book only removed after Washington Post story on problem</a></p></li> <li><p><a href="https://news.ycombinator.com/item?id=35922209">Searching for children's books on Amazon returns AI generated nonsense</a></p></li> <li><p><a href="https://wandering.shop/@youseeatortoise/111782434593735690">Amazon takes down legitimate cookbook</a>; author notes &quot;They 
won't tell us why. They won't tell us how to fix whatever tripped the algorithm. They won't seem to let us appeal. Reaching a human at Amazon is a Kafkaesque experience that we haven't yet managed to do.&quot;</p> <ul> <li>When I checked later, not restored despite viral Mastodon thread and highly upvoted/ranked front-page HN article</li> <li><a href="https://news.ycombinator.com/item?id=39066608">Multiple</a> <a href="https://news.ycombinator.com/item?id=39068983">people</a> give the standard response of asking why booksellers bother to use Amazon, seemingly unaware (???) that Amazon has a lot of marketshare and authors can get a lot more reach and revenue on Amazon than on other platforms (when they're not arbitrarily banned) (the author of the book replies and says this as well, but one doesn't need to be an author to know this)</li> </ul></li> <li><p><a href="https://news.ycombinator.com/item?id=35521641">Amazon basically steals $250 from customer</a>, appeal does nothing, as usual</p></li> <li><p><a href="https://www.reddit.com/r/Wellthatsucks/comments/19csery/good_ole_amazon/">Amazon delivers package directly to food waste / compost bin and declines to provide any compensation</a></p> <ul> <li>User notes that they had a nice call with Amazon support and that they hope this doesn't happen again. From my experience with trying to get Amazon to stop shipping packages via Intelcom and Purolator, I suspect this user will have this problem happen again — I've heard that you can get them to not deliver using certain mechanisms, but you have to repeatedly contact support until someone actually puts this onto your file, as opposed to just saying that they'll do it and then not doing it, which is what's happened the two times I've contacted support about this</li> </ul></li> <li><p><a href="https://www.reddit.com/r/pcmasterrace/comments/11q8mjw/i_got_a_4090_but_its_showing_up_as_a_3070ti_fresh/">User receives fake GPU from Amazon</a>, after an attempt to buy from the official Amazon.com msi store</p></li> <li><p><a href="https://www.reddit.com/r/Wellthatsucks/comments/10l5ndf/not_sure_if_this_is_the_right_subreddit_but_a/">Amazon Fresh order comes with bottle of urine</a></p></li> <li><p><a href="https://www.howtogeek.com/142496/why-the-heck-is-amazon-selling-these-fake-16-terabyte-portable-hard-drives/">Amazon sells many obviously fake 16 TB physically tiny SSD drives for $100</a></p> <ul> <li>The author sent a list of fakes to Amazon and a few disappeared. The author isn't sure if the listings that disappeared were actually removed by Amazon or if it's just churn in the listings</li> <li><a href="https://news.ycombinator.com/item?id=34410697">An HN commenter searches and also finds many fakes</a>, which have good reviews that are obviously for a different product; someone notes that they've tried reporting these kinds of obvious fakes where someone takes a legitimate product with good reviews and then swaps in a scam product but that this does nothing</li> <li><a href="https://news.ycombinator.com/item?id=34409945">Multiple people note that they've tried leaving 1* reviews for fake products and had these reviews rejected by Amazon for not meeting the review policy guidelines</a></li> <li>Some time after this story made the front page of HN, this class of fakes got cleaned up. 
However, other fakes that are mentioned in the HN comments (see item directly below this) did not get cleaned up; maybe someone can write an article about how these other items are fake to get these other things cleaned up as well</li> </ul></li> <li><p><a href="https://news.ycombinator.com/item?id=34410117">User notes that bestselling item on Amazon is a fake item</a> and that they tried to leave a review to this effect, but the review was rejected</p> <ul> <li>I looked up the item and it's still a bestselling item. There are some reviews which indicate that it's a fake item, but this fake item seems to have been on sale for years</li> </ul></li> <li><p><a href="https://github.com/DesktopECHO/T95-H616-Malware">Amazon sells Android TV boxes that are actually malware</a></p> <ul> <li>It appears that these devices have been on sale on Amazon at least since 2017; I clicked the search query in the link of the above post and it still returns many matching devices in 2024</li> </ul></li> <li><p><a href="https://krebsonsecurity.com/2024/01/canadian-man-stuck-in-triangle-of-e-commerce-fraud/">Amazon scammer causes user to get arrested and charged for fraud</a>, which causes user to lose their job</p> <ul> <li>The user also notes &quot;In Canada, a criminal record is not a record of conviction, it’s a record of charges and that’s why I can’t work now. Potential employers never find out what the nature of it is, they just find out that I have a criminal arrest record.&quot;</li> <li><a href="https://www.youtube.com/watch?v=2IT2oAzTcvU">For more information on how the scam works, see this talk by Nina Kollars</a></li> </ul></li> <li><p><a href="https://news.ycombinator.com/item?id=33839191">An Amazon seller story</a></p> <ul> <li>It's unclear exactly what's going on here since <a href="https://news.ycombinator.com/item?id=33841910">some parts of the seller's story appear to be false</a>? Some parts are quite plausible and really terrible if true</li> </ul></li> <li><p><a href="https://www.cbc.ca/news/canada/toronto/prohibited-weapons-found-on-amazon-1.7079582">Illegal weapon a bestselling item on Amazon</a>, although this does get removed after it's reported</p></li> <li><p><a href="https://web.archive.org/web/20240112193755/https://www.amazon.com/fulfill-request-respectful-information-users-Brown/dp/B0CM82FJL2">Fake Amazon listings with titles and descriptions like &quot;I'm sorry but I cannot fulfill this request it goes against OpenAI use policy. My purpose is to provide helpful and respectful information to users&quot;</a></p> <ul> <li>The most obvious cases seem to have been cleaned up after a story about this hit #1 on HN</li> <li><a href="https://www.amazon.com/s?k=FOPEAS">Someone noted that the seller's page is still up</a> (which is still true today) and if you scroll around for listings, other ones with slightly different text, like <a href="https://web.archive.org/web/20240112231322/https://www.amazon.com/complete-information-provided-provide-context/dp/B0C5X1GXPH">&quot;I'm sorry I cannot complete this task there isn't enough information provided. 
Please provide more context or information so I can assist you better &quot;</a> are still up <ul> <li>These listings are total nonsense, such as the above, which has a photo of a cat and also says &quot;Exceptional Read/Write Speeds: With exceptional read/write speeds of up to 560MB and 530MB &quot;</li> </ul></li> <li>I checked out other items from this seller, and they have a silicone neck support &quot;bowl&quot; that also says &quot;Note: Products with electrical plugs are designed for use in the US. Outlets and voltage differ internationally and this product may require an adapter or converter for use in your destination. Please check compatibility before purchasing.&quot;, so it seems that someone at Amazon took down the listings that HN commenters called out (the HN thread on this is full of HN commenters pointing out ridiculous listings and those individual listings being taken down), but there's no systematic removal of nonsense listings, of which there are many</li> </ul></li> <li><p>I tried to buy 3M 616 litho tape from Amazon (in Canada) and every listing had a knock-off product that copy+pasted the description of 3M 616 into the description</p> <ul> <li>It's possible the knock-off stuff is as good, but it seems sketchy (and an illegal trademark violation) to use 3M's product description for your own knock-off product; at least some reviews indicate that buyers expected to get 3M 616 and got a knock-off instead</li> </ul></li> <li><p>When searching for replacement Kidde smoke detectors on amazon.ca, all of the ones I found are not Canadian versions, meaning they're not approved by CSA, cUL, ULC or cETL. It's possible this doesn't matter, but in the event of a fire and an insurance claim, I wouldn't want to have a non-approved smoke detector</p></li> <li><p><a href="https://www.reddit.com/r/TyreReviews/comments/1atu9el/comment/kr3jse2/">Amazon store selling 5 year old tires as new (tires age over time and 5 year old tires should not be sold as new)</a></p></li> </ul> <h3 id="microsoft">Microsoft</h3> <p>This includes GitHub, LinkedIn, Activision, etc.</p> <ul> <li><a href="https://www.threads.net/@doniecnn/post/C2IPHSFuNHJ/?igshid=NTc4MTIwNjQ2YQ%3D%3D">Microsoft AI generated news articles put person's photo into a story about a different person's sexual misconduct trial</a> <ul> <li><a href="https://www.cnn.com/2023/11/02/tech/microsoft-ai-news/index.html">Other incorrect AI generated stories include Joe Biden falling asleep during a moment of silence for Maui wildfire victims, a conspiracy theory about Democrats being behind the recent covid surge</a>, and a story about San Francisco Supervisor Dean Preston resigning after criticism by Elon Musk; these seem to be a side effect of laying off human editors and replacing them with AI</li> <li>Other results include an obituary for a former NBA player who died at age 42, titled &quot;Brandon Hunter useless at 42&quot;, and an AI generated poll attached to a Guardian article on a deceased 21-year-old woman, &quot;What do you think is the reason behind the woman’s death&quot; with the options &quot;murder, accident, or suicide&quot;</li> </ul></li> <li><a href="https://twitter.com/defunkt/status/1754610843361362360">User banned from GitHub for no discernable reason</a> <ul> <li>User happens to be co-founder of GitHub, so this goes viral when they tweet about it, causing them to get unbanned; GitHub's COO responds with &quot;You're 100% unsuspended now. 
I'm working with our Trust &amp; Safety team to understand what went wrong with our automations and I'm incredibly sorry for the trouble.&quot;</li> </ul></li> <li><a href="https://nitter.net/garybernhardt/status/1327349023473242112">Gary Bernhardt, a paying user of GitHub, files a Privacy / PII GitHub support request</a> <ul> <li>ignored for 51 days, until Gary creates a viral Twitter thread</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31762904">LinkedIn user banned after finding illegal business on LinkedIn and reporting it</a> <ul> <li>seems like the illegal business used their accounts to mass report the user</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31763503">LinkedIn user banned for looking at too many profiles</a> <ul> <li>appeal rejected by customer service</li> <li>this also happened to me when I was recruiting and looking at profiles and I also got nonsense responses from customer service, although my account wasn't permanently banned</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38882186">Azure kills entire company's prod subscription because Azure assigned them a shared IP that another customer used in an attack</a></li> <li><a href="https://djanes.xyz/spam-on-github-is-getting-crazy-these-days/">GitHub spam is out of control</a></li> <li><a href="https://news.ycombinator.com/item?id=38793607">Outlook / Hotmail repeatedly incorrectly blocks the same mail servers</a>; this can apparently be fixed by: <ul> <li>Visit <a href="https://olcsupport.office.com/">https://olcsupport.office.com/</a> and submit the complaint; wait for the auto-reply, followed by the &quot;Nothing was detected&quot; email; then reply with &quot;Escalate&quot; in the body, which causes the server to get unblocked again in a day</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38795329">User reports that, every December, users on the service get email rejected by Microsoft, which needs to be manually escalated every year</a></li> <li><a href="https://news.ycombinator.com/item?id=38799729">User running mail server on same IP for 10+ years, with no one else using IP, repeatedly has Microsoft block mail from the IP address, requiring manual escalation to fix each time</a></li> <li><a href="https://news.ycombinator.com/item?id=38799753">Whitelisting a server doesn't necessarily allow it to receive email if Microsoft decides to block it</a>; a Microsoft employee thinks this should work, but it apparently doesn't work</li> <li><a href="https://news.ycombinator.com/item?id=38792852">Microsoft arbitrarily blocks email from user's server; after escalation, they fix it, but only for hotmail and live.com, not Office 365</a></li> <li><a href="https://mastodon.social/@danluu/111349023730856070">OpenAI decides user needs to do 80 CAPTCHAs in a row to log in</a> <ul> <li>In response to this, someone sent me: &quot;Another friend of mine also had terrible issues even signing up for openai -- they told him he could only use his phone number to sign up for a maximum of 3 accounts, and he tried telling them that in fact he had only ever used it to sign up for 1 account and got back the same answer again and again (as if they use their own stuff for support) ... 
he said he kept trying to emphasize the word THREE with caps for the bot/human on the other end&quot; [but this didn't work]</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38147774">User reports software on GitHub with a malware installer three times and GitHub does nothing</a></li> <li><a href="https://twitter.com/danluu/status/1486951948448251907">I used LinkedIn for recruiting, which involved (manually) looking at people's profiles, and was threatened with a ban for looking at too many profiles</a> <ul> <li>The message says you should contact support &quot;if you think this was in error&quot;, but this returns a response that's either fully automated or might as well be and appears to do nothing</li> </ul></li> <li><a href="https://nitter.net/garybernhardt/status/1650947253773938688">Gary Bernhardt spends 5 days trying to get Azure to take down phishing sites</a>, which does nothing <ul> <li>Gary has 40k Twitter followers, so he tweeted about it, which got the issue fixed after a couple of days. Gary says &quot;No wonder the world is so full of scams if this is the level of effort it takes to get Microsoft to take down a single phishing site hosted on their infrastructure&quot;.</li> </ul></li> <li><a href="https://mastodon.social/@danluu/110335983520055904">Spammer spams GitHub repos with garbage issues and PRs for months</a> <ul> <li>After I made a viral Mastodon thread about this, which also made it to the front page of HN, <a href="https://mastodon.bawue.social/@ixs/110337628935185476">one of the two accounts was suspended</a>, but when I checked significantly later, the other was still around and spamming <ul> <li>I did not report this account because I reported a blatant troll account (which I know was banned from Twitter and lobsters for trolling) and got no action, and I've seen many other people indicate that they find GitHub reporting to be useless, which seems to have been the case here; <a href="https://mastodon.social/@networkexception@chaos.social/110337762885594243">one person noted that, before my viral thread, they had already blocked the account from a repo they're a maintainer of and didn't bother to report because of GitHub's bad reporting flow</a></li> </ul></li> </ul></li> <li><a href="https://daverupert.com/2023/02/solved-the-case-of-the-bing-ban-theory/">Microsoft incorrectly marks many blogs as spam, banning them from Bing as well as DuckDuckGo</a> <ul> <li>Fixed sometime after a post about this went viral</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35657982">GitHub Copilot emits GPL code</a></li> <li><a href="https://www.tomshardware.com/news/windows-keeps-feeding-tabloid-news">Windows puts conspiracy theory articles and other SEO spam into search menu</a></li> <li><a href="https://news.ycombinator.com/item?id=35156916">Microsoft bans people using prompt injections on BingGPT</a></li> <li><a href="https://iter.ca/post/gh-sig-pwn/">User finds way to impersonate signed commits from any user because GitHub uses regexes instead of a real parser and has a bug in their regex</a> <ul> <li>Bug report is initially closed as &quot;being able to impersonate your own account is not an issue&quot;, by someone who apparently didn't understand the issue</li> <li>After the user pings the issue a couple more times, the issue is eventually re-opened and fixed after a couple of months, so this is at least better than the other GitHub cases we've seen, where someone has to make a viral Twitter thread to get the issue fixed</li> <li><a 
href="https://news.ycombinator.com/item?id=39101099">In the HN comments for the story</a>, someone notes that GitHub is quick to close security issues that they don't seem to have looked closely at</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33917962">User is banned from GitHub after authorizing a shady provider with a login</a> <ul> <li>Of course this has the standard comments blaming the user, but <a href="https://news.ycombinator.com/item?id=33921457">people note that the &quot;log in with GitHub&quot; prompt and the &quot;grant this service to act on your behalf&quot; prompt look almost identical</a>; even so, people keep responding with comments like &quot;dont bother wasting anymore resources to protect the stupids&quot;</li> </ul></li> <li><a href="https://blog.mikeswanson.com/post/702753924034297856/activisions-faulty-anti-cheat-software">Activision's RICOCHET anti-cheat software is famous for having a high false positive rate, banning people from games they paid for</a> (this also bans people from playing &quot;offline&quot; in single-player mode) <ul> <li>User had their game crash 8 times in a row due to common problems (many people reported crashes with the version that crashed for this user), which apparently triggered some kind of false positive in anti-cheat software</li> <li>Support goes well beyond what most companies respond with, and responds with &quot;Any specifics regarding the ban will not be released in order to help maintain the integrity and security of the game, this is company policy that will not be changing.&quot;</li> <li>Since this software is famous for being inaccurate and having a high false positive rate, <a href="https://www.reddit.com/r/activision/comments/17pdw1k/i_thought_you_were_all_cheaters_complaining_but/">there are a huge number of accounts of false bans, such as this one</a>. In order to avoid doubling the length of this post, I won't list these</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33859751">Relatively rare case of user figuring out why they were incorrectly banned by Activision and getting their account restored</a> <ul> <li>Of course support was useless as always and trying to get help online just resulted in getting a lot of comments blaming the user for cheating</li> <li>User was banned because, after Activision and Blizzard were merged, their Blizzard username (which contains the substring &quot;erotica&quot;) became illegal, causing them to be banned by Activision's systems. 
But, unlike a suspension for an illegal username in Blizzard's system, Activision's system doesn't tell you that you have an illegal username and just bans you</li> <li>Luckily, the user was able to find a single Reddit post by someone who had a similar issue and that post had a link that lets you log into the account system even if you're banned, which then lets you change your username</li> <li>Three days after making that change, the user was unbanned</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33860501">User who bought Activision game to play only in single-player campaign mode is banned for cheating after trying to launch/play game on Linux through Wine/Proton</a> <ul> <li>Support gave user the runaround and eventually stopped responding, so user appears to be permanently banned</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33860849">Anti-&quot;cheat&quot; software bans users before they can even try playing the game</a> <ul> <li>Someone speculates that it could be due to buying refurbished hardware, since Activision bans based on hardware serial numbers and some people were banned because they bought SSDs from banned machines</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33862300">Anti-&quot;cheat&quot; software bans user from Bungie (Activision) game for no discernable reason</a>; user speculates it might be because they use AutoHotkey to script Windows (for out of game activities)</li> <li><a href="https://www.reddit.com/r/Minecraft/comments/vz8myi/7_day_ban_all_for_typing_nigel_on_a_sign_several/">Minecraft user banned for 7 days for making sign that says Nigel on their mom's realm</a> (server, basically?); other users report that creating or typing something with the substring &quot;nig&quot; is dangerous<br /> <ul> <li>See also, <a href="https://www.youtube.com/watch?v=elIwvFi83AM">offensive words in Minecraft</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33038600">Microsoft Edge incorrectly blocks a page as being suspicious</a> <ul> <li>Developer tries to appeal, but is told that they need to send a link to a URL for the support person to look at, which is impossible because it's an API server that has no pages. Support does not reply to this.</li> </ul></li> <li><a href="https://www.reddit.com/r/gaming/comments/3wx4qu/wow_player_banned_by_blizzard_after_killing_a/">User banned from WoW for beating someone playing with 60 accounts, who submits 60 false reports against user</a>; people report this kind of issue in Overwatch as well, where mass reporting someone is an easy way to get Blizzard to suspend or ban their account</li> <li><a href="https://www.reddit.com/r/Asmongold/comments/q6bgr4/player_banned_for_not_renaming_pet/">User suspended from WoW for not renaming their pet from its default name of &quot;Gorilla&quot;, which was deemed to be offensive</a></li> </ul> <h3 id="stripe">Stripe</h3> <ul> <li><a href="https://news.ycombinator.com/item?id=32854528">Turns off account for a business that's been running since 2016 with basically the same customers</a>. After a week of talking to tech support, the account is reactivated and then, shortly afterwards, 35% of client accounts get turned off. 
<a href="https://news.ycombinator.com/item?id=32858314">Account reactivated after story got 1k points on HN</a></li> <li><a href="https://news.ycombinator.com/item?id=34189717">Stripe holds $400k from account and support just gives developer the runaround for a month</a> <ul> <li>Support asks for 2 receipts and then, after they're sent, asks for the same two receipts again, etc.</li> <li>As usual, HN commenters blame the developer and make up reasons that the developer might be bad, e.g., people say that the developer might be committing fraud. From a quick skim, at least five people called the developer's story fake or said or implied that the developer was involved in some kind of shady activity</li> <li><a href="https://news.ycombinator.com/item?id=34233011">Shortly after the original story made HN, Stripe resolved the issues and unlocked the accounts</a>, so the standard responses that the developer must be doing something fraudulent were wrong again; <a href="https://news.ycombinator.com/item?id=34233018">a detailed accounting of what happens makes it clear that nothing about Stripe's response, other than the initial locking for potential fraud, was remotely reasonable</a> <ul> <li>The developer notes that Stripe support was trying to stonewall them until they pointed out that there was a high-ranking HN post about the issue: &quot;Dec 30 [over one month from the initial freezing of funds]: While I was writing my HN post I was also on chat with Stripe for over an hour. No new information. They were basically trying to shut down the chat with me until I sent them the HN story and showed that it was getting some traction. Then they started working on my issue again and trying to communicate with more people. No resolution.&quot;</li> </ul></li> <li>After the issue was resolved, the developer was able to get information from Stripe about why the account was locked; the reason was that the company had a spike in sales due to Black Friday. 
Until the issue hit the top of HN, the developer was not able to talk to any person at Stripe who was useful in any way</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33743750">Developer at SaaS for martial arts academies in Europe notes that some new anti-fraud detection seems to be incorrectly suspending accounts</a>; their academies have their own accounts and multiple got suspended <ul> <li><a href="https://news.ycombinator.com/item?id=33744053">These stories are frequent enough that someone responds &quot;Monthly HN Stripe customer support thread&quot;</a>, to which the moderator of HN responds that it's more than monthly and HN will probably have to do something about this at some point, since having the HN front page be Stripe support on a regular basis is a bit much <ul> <li>Doing a search now, there are still plenty of support horror stories, <a href="https://news.ycombinator.com/item?id=38967685">but they typically don't get many votes and don't have Stripe staff drop in to fix the issue</a>, so it seems that this support channel no longer works as well as it used to.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33745033">Multiple</a> <a href="https://news.ycombinator.com/item?id=33744754">people</a> <a href="https://news.ycombinator.com/item?id=33744614">point</a> out issues in how Stripe handles SEPA DD and <a href="https://news.ycombinator.com/item?id=33895486">other users of Stripe note that they're impacted by this as well</a></li> <li><a href="https://news.ycombinator.com/item?id=33744135">Of course, this gets the usual responses that we need to see both sides of the story, maybe you're wrong and Stripe is right, etc.</a>; the developer responds to one of these with an apology for their error</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=32855524">After account was approved and despite many tests using the Stripe test environment, on launch, it turns out that the account wasn't approved after all and payments couldn't go through</a>. 
Some people say that you should send real test payments before launch, but <a href="https://news.ycombinator.com/item?id=32856098">someone notes that using real CC info for tests and not the Stripe test stuff is a violation of Stripe's terms</a></li> <li><a href="https://nitter.net/rameerez/status/1665680257788026880">Stripe user notes that Stripe fraud detection completely fails on some trivial attacks, writes about it after seeing it hit them as well as many other Stripe users</a> <ul> <li><a href="https://news.ycombinator.com/item?id=36197160">Developer describes the support they received as &quot;a joke&quot;</a> since they had to manually implement rules to block the clearly fraudulent charges</li> <li>A Stripe developer replies and says they'll look into it after two threads on this go viral</li> </ul></li> <li><a href="https://nitter.net/adamwathan/status/1550092016242946049">Shut down payments for job board; seems to have been re-activated after Twitter thread</a></li> <li><a href="https://news.ycombinator.com/item?id=32855509">Turned off company without warning; company moved to Parallel Economy</a></li> <li><a href="https://nitter.net/garybernhardt/status/1719862368887607659">Wording of Stripe's renewal email causes users of service to think that you have to email the service to cancel</a>; issue gets no action for a year, until Gary Bernhardt publicly tweets about it</li> <li><a href="https://news.ycombinator.com/item?id=38182339">User has Stripe account closed for no discernable reason</a></li> <li><a href="https://news.ycombinator.com/item?id=36904347">Stripe user has money taken out of their account</a> <ul> <li>A Stripe employee responds with &quot;we wouldn't do so without cause&quot;, implying that Stripe's error rate is 0%</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36403607">Stripe arbitrarily shuts down user's business</a> <ul> <li>Payments restored after story goes viral on HN <ul> <li>This happens so frequently that multiple people comment on how often this happens and how this seems to be the only way to get support from Stripe for some business-killing issues</li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=36198562">Developer notes that Stripe fraud detection totally failed when people do scams via CashApp</a> <ul> <li>Another developer replies and notes that it's weird that you can block prepaid cards but not CashApp when CashApp is used for so much scamming</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35809687">Developer has payments frozen, initially because account was misclassified and put into the wrong risk category</a> <ul> <li>Developer notes that the suspension is particularly problematic because of a &quot;minimum fee commitment&quot; with Stripe, where they get a discount but also have a fee floor regardless of transaction volume; having payments suspended effectively increases their rate</li> <li>After one week, their account was unfrozen, but then another department froze their account, &quot;this time by a different Stripe risk team with even weirder demands: among other things, they wanted a 'working website' (our website works?) and 'contact information to appear on the website' (it's on every page?) 
It was as if Stripe had never heard of or talked to us before, and just like the other risk team, they asked questions but didn't respond to our emails.&quot;</li> <li>This also got resolved, but new teams keep freezing their account, causing the developer to go through a similar process again each time</li> <li>Fed up with this, the developer made an HN post which got enough upvotes that the Stripe employee who handles HN escalations saw the post</li> <li>Of course, someone has the standard response that this must be the user/developer's fault, it must be because the business is shady or high risk, one that typically gets banned from payment processors, but if that's the case, that makes this example even worse — why would Stripe negotiate a minimum fee agreement with a business they expect to ban and how come the business keeps getting unbanned each time after someone bans them <ul> <li>Also, multiple people report having or seeing similar experiences, &quot;I find it totally believable after having to work through multiple internal risk teams to get my test accounts past automated flaggers&quot;, etc</li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=21030633">Stripe suspends account and support doesn't respond when user wants to know why</a> <ul> <li>User notes that they can't even refund their users: &quot;when I attempted to process a refund for a customer who had been injured &amp; was unable to continue training, I get an error message stating I am unable to process refunds! Am I supposed to tell my customer that my payment process won't refund his money? FYI - The payment I am attempting to refund HAS NOT been paid out yet - the money is sitting in my stripe account - but they refuse to refund it or even dignify me with a response.&quot;</li> <li><a href="https://news.ycombinator.com/item?id=21031171">Many people comment on how bad Stripe support has been for them, even this happy customer</a>: &quot;We’re using stripe and are overall happy. But their customer support is pretty bad. Lots of canned replies and ping-pong back and forth until you get someone to actually read your question&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=21030879">Stripe account suspended due to a lien; after the lien is removed, Stripe doesn't unsuspend the account and the account is still frozen</a> <ul> <li>Luckily, the son of the user is a programmer who knows someone at Stripe, so the issue gets fixed</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=28085706">Developer's Stripe account is suspended with a standard message seen in other Stripe issues, &quot;Our systems recently identified charges that appear to be unauthorized by the customer, meaning that the owner of the card or bank account did not consent to these payments. This unfortunately means that we will no longer be able to accept payments ... &quot;</a> <ul> <li>Developer pressed some button to verify their identity, which resulted in &quot;Thank you for completing our verification process. Unfortunately, after conducting a further review of your account, we’ve determined that we still won't be able to accept payments for xx moving forward&quot;. 
They then tried to contact support, which did nothing</li> <li>After their HN post hits the front page, someone looks into it and it appears that the issue is fixed and the developer gets an email which reads &quot;It looks like some activity on your account was misidentified as unauthorized, causing potential charge declines and the cancellation of your account. This was a mistake on our end, so we've gone ahead and re-enabled your account.&quot; The developer notes that having support not respond until you can get a front-page HN post is a poor support strategy for users and that they lost credit card renewals during the time the account was suspended</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=28088283">Developer has product launch derailed when Stripe suspends their account for no discernable reason</a>; they try talking to support which doesn't work <ul> <li>What does work is posting a comment on a front-page HN thread about someone else's Stripe horror story, which becomes the top comment, which causes a Stripe employee to look at the issue and unsuspend the account</li> </ul></li> <li><a href="https://justuseapp.com/page/stripe-ban">Stripe bans developer for having too many disputes when they've had one dispute, which was a $10 charge that they won after submitting evidence about the dispute to J.P. Morgan, the user's bank</a> <ul> <li>The developer appeals and receives a message saying &quot;after further conducting a review of your account, we've determined that we still won't be able to accept payments ... going forward. Stripe can only support users with a low risk of customer disputes. After reviewing your submitted information and website, it does seem like your business presents a higher level of risk than we can currently support&quot;</li> <li>After the story hits #1 on HN, their account is unbanned, but then a day later, it's rebanned for a completely different reason!</li> </ul></li> <li><a href="https://tinyletter.com/blauvelt/letters/looking-forward-after-some-of-the-hardest-few-months-of-my-life">Developer banned from Stripe; they aren't sure why, but wonder if it's because they're a Muslim charity</a></li> <li><a href="https://news.ycombinator.com/item?id=28881026">User, who appears to be a non-technical small business owner, has their Stripe account suspended, which also disables the &quot;call for help&quot; button and any other method of contacting support</a> <ul> <li>After six weeks, they find HN and make a post on HN, which gets the attention of someone at Stripe, and they're able to get their information out of Stripe and move to another payment processor, though they mention that they lose 6 weeks of revenue and close with &quot;please..do better. 
You're messing with people's livelihoods&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33142462">Developer notes that the only way they've been able to get Stripe issues resolved is by searching LinkedIn for connections at Stripe, because support just gives you the runaround</a></li> <li><a href="https://news.ycombinator.com/item?id=33139957">User (not using Stripe as a service, someone being charged by Stripe) gets fraudulently charged every month and issues a chargeback every month</a> <ul> <li>To stop having to issue a chargeback each month, user is stuck in a support loop where Stripe tells them to contact the credit card company and the credit card company tells them to contact Stripe <ul> <li>Stripe support also responds nonsensically sometimes, e.g., responding and asking if they need help resetting their password</li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=33140571">Developer notes that Stripe's &quot;radar&quot; fraud detection product misses extremely simple fraudulent cases</a>, such as &quot;Is it a 100th charge coming from the same IP in Ukraine with a Canadian VISA&quot;, or &quot;Same fake TLD for the email address, for a customer number 2235&quot;, so they use broad rules to reject fraudulent charges, but this also rejects many good charges and causes a loss of revenue</li> </ul> <h3 id="uber">Uber</h3> <ul> <li>Former manager of payments team banned from Uber due to incorrect fraud detection <ul> <li>Engineer spends six months trying to get it fixed and it's eventually fixed by adding a whitelist that manually unbans the former manager of the payments team, but the underlying issue isn't fixed</li> </ul></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18xr6ah/account_deactivate_for_not_delivering_drugs/">UberEats driver has account deactivated for not delivering drugs</a> <ul> <li>Driver originally contacted Uber support, who told them to contact the police. The police confirmed that the package contained crack cocaine</li> <li>The next day, Uber support called the driver and asked what happened. After the driver explained, support told them they would report the package as having been delivered to the police</li> <li>Shortly afterwards, driver's account was deactivated for not delivering the drugs</li> </ul></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18i8t6o/refusing_to_refund_customers/">Despite very clear documentation that UberEats delivered the wrong order, Uber refuses to refund user</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18i8t6o/comment/kdbxf10/">User has account put into a degraded state because user asked for too many refunds on bad or missing items</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18i8t6o/comment/kdcd041/">User has account put into a degraded state because user asked for refund on missing item</a></li> <li>I often wonder if the above will happen to me.
My local grocery store delivers using DoorDash and, most of the time, at least one item is missing (I also often get items I didn't order that I don't want); either the grocery store or the driver (or both) seem to frequently accidentally co-mingle items for different customers, resulting in a very high rate of errors</li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18i8t6o/comment/kdc5j9p/">Asking for refund on order with error puts account into degraded state</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18i8t6o/comment/kdd9ops/">Uber refuses to refund item that didn't arrive on UberEats, until user threatens to send evidence of non-refund to UK regulatory agency, at which point Uber issues the refund</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/18hxjo2/uber_trying_to_steal_from_me/">Uber refuses to refund user when Uber Eats driver clearly delivered the wrong item; item isn't even an Uber delivery (it's a DoorDash delivery)</a></li> <li>A friend of mine had an Uber stop in the wrong place (it was multiple blocks away and this person ordered an Uber to pick them up from a medical appointment because they're severely concussed, so much so that walking is extremely taxing for them); the driver refused to come to them, so they agreed to go to the driver. When they were halfway there, the driver canceled the order in a way that caused my friend to pay a cancellation fee</li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/189juqi/paid_20_for_a_12_pk_and_received_a_6_pk_over_uber/">User receives a 6 pack instead of 12 pack from Uber Eats</a> and customer service declines to refund the difference <ul> <li>In the comments, other people report having the same issue</li> </ul></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/185hz9a/comment/kb1v3vv/">Uber refuses to issue refund when stolen phone (?) 
results in $2000 in charges</a>; user only gets money back via chargeback and presumably gets banned for life, as is standard when issuing chargebacks to big tech companies</li> <li><a href="https://www.reddit.com/r/UberEATS/comments/184no3q/how_uber_eats_stole_my_money/">UberEats refuses to issue refund when order is clearly delivered to wrong location (half a mile away)</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/184no3q/comment/kaxk57x/">Early in the pandemic, Uber said they would compensate drivers if they got covid, but they refuse to do this</a> <ul> <li>After 6 months of having support deny their request, Uber gives them half of the compensation that was promised</li> </ul></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/184t01x/uber_eats_punishing_shop_and_pay_drivers_when_an/">UberEats punishes driver for &quot;shop and pay&quot; order when item is out of stock and cannot be purchased</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/184r0vn/why_i_will_never_order_from_uber_eats_and_why/">Disabled user orders groceries from UberEats; when order is delivered to the wrong building, support won't have item actually delivered</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/184b0w7/cant_cancel_order_whats_going_on/">User can't cancel UberEats order when restaurant doesn't make order, leaving order in limbo</a> <ul> <li>The restaurant is closed and support responds saying that they can't do anything about it and the user needs to go to cancel in their app, but going to cancel in their app forwards them to chat with the support person who says that they need to go cancel in their app; after more discussion, support tells them that their order, which they know was never delivered, is not eligible for a refund</li> </ul></li> <li><a href="https://twitter.com/perrymetzger/status/1147589115245944833">Uber maps routes driver down an impossible route and a user indicates that reporting this issue does nothing</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/17ip5na/comment/k6wcmk2/">UberEats driver notes that reporting that someone stole an order from the restaurant is pointless because the order just gets sent back out to another driver anyway</a>; contacting support is useless and costs you valuable time that you could be using to earn money <ul> <li>Another driver reports the same thing</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=37893601">Two people scammed Uber Eats out of $1M</a></li> <li><a href="https://news.ycombinator.com/item?id=37894450">Uber drivers at a local airport cancel ride if fare is under $100</a></li> <li><a href="https://news.ycombinator.com/item?id=37070672">Uber driver suddenly has account blocked</a> <ul> <li>They find out that it's because a passenger reported an item lost; passenger later realizes they had the item all along and driver is unblocked, but driver was 2 hours from home and had to do the 2 hour drive home without being able to pick up fares</li> </ul></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/14u3dml/i_found_worms_in_my_food_and_ubereats_wont_offer/">User reports that UberEats delivery had slugs in it</a> and Uber does their now-standard move of not issuing a refund; they issue a refund after the post about this goes viral</li> <li><a href="https://www.reddit.com/r/UberEATS/comments/14a62vk/uber_wont_refund_me_for_the_drinks_my_delivery/">User reports that UberEats driver spilled drink, with clear evidence of this, and Uber refuses to refund until
after a thread about it goes viral and the user complains on Twitter</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/14a62vk/comment/jobtkvz/">Uber refuses to refund UberEats pizza delivery that never showed up</a>; user indicates that they'd never contacted support before and had never asked for a refund before</li> <li><a href="https://www.theguardian.com/technology/2023/apr/16/stop-or-ill-fire-you-the-driver-who-defied-ubers-automated-hr">Driver threatened with ban from Uber</a> and is unable to get any kind of remotely reasonable response until his union, the App Drivers and Couriers Union, worked with the Worker Info Exchange to go after Uber in court; &quot;Just before the case came to court, Uber apologised and acknowledged it had made a mistake&quot; <ul> <li>Uber issues an official response of &quot;We are disappointed that the court did not recognise the robust processes we have in place, including meaningful human review, when making a decision to deactivate a driver’s account due to suspected fraud.&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35592403">User notes that Uber drivers often do scams and that Uber doesn't seem to prevent this</a></li> <li><a href="https://news.ycombinator.com/item?id=35592349">User notes that they frequently get scammed by Uber drivers, but Uber <em>usually</em> refunds them when they report the scam to Uber</a></li> <li><a href="https://news.ycombinator.com/item?id=35592397">User notes that Uber drivers try to scam them ~1/20 times</a></li> <li><a href="https://www.reddit.com/r/UberEATS/comments/zzia89/blocked_from_refunds_after_4_years/">User blocked from UberEats refunds after too many orders got screwed up and the user got too many refunds</a>; user sent photos of the messed up orders, but Uber support doesn't care about what happened, just the size and number of refunds</li> <li><a href="https://news.ycombinator.com/item?id=34051876">User's Uber account blocked for no discernible reason</a> <ul> <li>At the time, there was no way to contact support, but the user tries again after a few years, at which point there is a support form, but that still doesn't work <a href="https://news.ycombinator.com/item?id=34052099">User's wife is incorrectly banned from Uber.
User worked at Uber for four years and knows that the only way to get this fixed is to escalate to employees they know because, as a former employee, they know how useless support is</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=34052126">User tries to get support for UberEATS not honoring a buy 1 get 1 free deal; support didn't help</a></li> <li><a href="https://news.ycombinator.com/item?id=34052279">Former Uber engineer notes that people randomly get banned for false positives</a></li> <li><a href="https://news.ycombinator.com/item?id=34052096">User had some Uber driver create an account with their email despite them never verifying the email with this new account</a>; user tried to have their email removed, but support says they can't discuss anything about the account since they're not the account owner</li> <li><a href="https://news.ycombinator.com/item?id=34052872">User's wife banned from Uber for no discernible reason</a></li> <li><a href="https://news.ycombinator.com/item?id=34052502">User's Uber account gets into apparently unfixable unexpected state</a></li> </ul> <h3 id="cloudflare">Cloudflare</h3> <ul> <li><a href="https://twitter.com/danluu/status/1285408780688080896">Blind user reports that Cloudflare doesn't have audio captchas, making much of the web unusable</a> <ul> <li>Person that handles Cloudflare captchas invites the user to solve an open research problem so that they're able to browse websites hosted on Cloudflare</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31573854">Cloudflare suspends user's account with no warning or information</a> <ul> <li>Contacting Cloudflare results in a response of &quot;Your account violated our terms of service specifically fraud. The suspension is permanent and we will not be making changes on our end.&quot;</li> <li>Account restored after viral HN thread</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35742606">User finds much of the internet unusable with Firefox due to Cloudflare CAPTCHA infinite loop</a> (switching to Chrome allows them to browse the internet); user writes up a detailed account of the issue and their issue is auto-closed (and other people report the exact same experience) <ul> <li><a href="https://forum.gitlab.com/t/cant-open-the-signin-page-it-keeps-showing-checking-your-browser-before-accessing-gitlab-com/45857/21">Same issue, different user</a></li> <li><a href="https://gitlab.com/librewolf-community/browser/linux/-/issues/244">Same issue, different user</a> <ul> <li><a href="https://github.com/arkenfox/user.js/issues/1253">Same issue, different user</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35745141">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35745236">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35745456">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35744343">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35746506">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35750008">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35743309">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35748050">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35743306">Same issue, different user</a></li> <li><a
href="https://news.ycombinator.com/item?id=35775633">Same issue, different user</a></li> <li><a href="https://news.ycombinator.com/item?id=35745839">Similar issue, but with Brave instead of Firefox</a></li> <li><a href="https://news.ycombinator.com/item?id=35743287">Standard response of &quot;why use the product if it does this bad thing?&quot;</a> <ul> <li><a href="https://news.ycombinator.com/item?id=35743638">People point out that, as usual, this standard response is nonsense because, just for example, government websites that some people need to interact with sometimes use Cloudflare</a></li> </ul></li> <li>After the story hits the front page of HN, a cloudflare exec replies and says people will look into it and then one person reports that the issue is fixed for them; I found tens of people who said that they reported the issue to Cloudflare, so I would guess that, overall, thousands of people reported the issue to cloudflare, which did nothing until someone wrote a post that hit the HN front page.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=31575318">Cloudflare takes site down due to what appears to be incorrect malware detection</a></li> <li><a href="https://news.ycombinator.com/item?id=31576353">Cloudflare blocks transfer of domains in cases of incorrect fraud detection</a></li> <li><a href="https://news.ycombinator.com/item?id=38881712">Incorrect phishing detection results in huge phishing warning on website</a></li> <li><a href="https://community.cloudflare.com/t/cloudflare-mistakenly-flagged-my-website-as-phishing-now-shows-a-warning-and-misinforms-my-users/384158">Incorrect phishing detection results in URL being blocked</a> <ul> <li><a href="https://news.ycombinator.com/item?id=38882074">This was apparently triggered by a URL that hadn't existed for 10 years?</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38785565">User can't access any site on cloudflare because cloudflare decided they're malicious</a></li> <li><a href="https://news.ycombinator.com/item?id=38789602">User can't access any site on cloudflare and some other hosts, they believe because another user on their ISP had malware on their computer</a></li> <li><a href="https://news.ycombinator.com/item?id=38986187">Cloudflare blocks some HN comments</a> <ul> <li>Users do a bit of testing and find that the blocking is fairly arbitrary</li> </ul></li> <li><a href="https://jrhawley.ca/2023/08/07/blocked-by-cloudflare">User is blocked by Cloudflare and can no longer visit many (all?) sites that are behind Cloudflare when using Firefox</a> <ul> <li><a href="https://news.ycombinator.com/item?id=37049016">In the comments, on the order of 10 users note they've run into the same problem</a>. 
The article is highly upvoted and a Cloudflare PM looks into it (resolution unknown)</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35745454">RSS feeds blocked because Cloudflare detects RSS client isn't a browser with a real human directly operating it</a></li> <li><a href="https://news.ycombinator.com/item?id=35746090">User from Hong Kong finds that they often have to use a VPN to access sites on Cloudflare because Cloudflare thinks their IP is bad</a></li> <li><a href="https://news.ycombinator.com/item?id=35746284">User finds a large fraction of the internet unusable due to Cloudflare infinite browser check loop</a></li> <li><a href="https://www.ctrl.blog/entry/cloudflare-ip-blockade.html">User finds a large fraction of the internet unusable because Cloudflare has decided their IP is suspicious</a> <ul> <li><a href="https://news.ycombinator.com/item?id=32914809">User changes ISPs in order to be able to use the internet again</a></li> </ul></li> <li><a href="https://farside.link/nitter/taviso/status/832744397800214528">Security researcher finds security flaw in Cloudflare</a> <ul> <li><a href="https://farside.link/nitter//taviso/status/1566077115992133634">Researcher claims that afterwards, &quot;Cloudflare literally lobbied the FTC to investigate me and question the legality of openly discussing security research&quot;</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=28792267">Cloudflare is a haven for scammers and copyright thieves</a></li> </ul> <h3 id="shopify">Shopify</h3> <ul> <li><a href="https://news.ycombinator.com/item?id=32042517">Having a store located in Puerto Rico causes payouts to be suspended every 3-4 months to verify address</a></li> <li><a href="https://nitter.net/mattzollerseitz/status/1541994356521013248">Kafkaesque support nightmare after payouts suspended</a> <ul> <li>Bizarre requirements, such as proving the bookstore has a license to sell the books they're selling</li> </ul></li> </ul> <h3 id="twitter-x">Twitter (X)</h3> <p>I dropped most of the Twitter stories since <a href="https://news.ycombinator.com/item?id=39402876">there are so many after the acquisition that it seems silly to list them</a>, but I've kept a few random ones.</p> <ul> <li><a href="https://www.reddit.com/r/Twitter/comments/18yksgu/anyone_else_getting_porn_ads/">Users report NSFW, pornographic ads</a></li> <li><a href="https://www.reddit.com/r/Twitter/comments/17n139b/twitters_beastility_problem/">Users report seeing bestiality, CP, gore, etc., when they don't want to see it</a></li> <li><a href="https://news.ycombinator.com/item?id=38001523">Scammers posing as customer service agents on Twitter/X</a></li> </ul> <h3 id="apple">Apple</h3> <ul> <li><a href="https://www.threads.net/@sineadbovell/post/C1nIp1TxD1_">Apple ML identifies user as three different people, depending on hairstyle</a></li> <li><a href="https://mjtsai.com/blog/2016/10/10/apple-and-kapeli-respond/">Long story about Apple removing an app from the app store</a></li> <li><a href="https://news.ycombinator.com/item?id=35677813">Rampant, easy-to-find scam/spam apps on the app store</a> <ul> <li><a href="https://news.ycombinator.com/item?id=35678623">A developer asks, how is it that so many legitimate apps get taken down from the app store for bad reasons when so much blatant spam gets through</a>?
<ul> <li><a href="https://news.ycombinator.com/item?id=35679233">Lots of stories of legitimate apps getting autorejected immediately on submission</a>, often <a href="https://news.ycombinator.com/item?id=35681602">requiring jumping through nonsensical hoops to get the app reinstated</a></li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=35679612">Apple forces developer to remove app for being too similar to another one of their apps because they have localized versions of their apps for different geos</a>; developer asks how come other developers are able to keep 400 basically identical apps up?</li> <li><a href="https://news.ycombinator.com/item?id=35680593">Searches for apps in various basic categories return scams and random puzzle games (in non-game categories)</a></li> <li><a href="https://news.ycombinator.com/item?id=33633324">User makes an app that lets you read HN</a>; the app store repeatedly rejects the app for reasons that don't make sense given what the app does, but support fails to understand the explanation <ul> <li>Luckily, it's Apple and not Google and they eventually manage to get a human on the phone, who understands the verbal explanation</li> </ul></li> </ul> <h3 id="doordash">DoorDash</h3> <ul> <li><a href="https://www.reddit.com/r/Wellthatsucks/comments/18x6bbl/comment/kg2l3ha/">Driver can't contact customer, so DoorDash support tells driver to dump food in parking lot</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18htrrm/dasher_will_only_pick_up_my_food_if_i_send_them/">DoorDash driver says they'll only actually deliver the item if the user pays them $15 extra</a></li> <li>The above is apparently not that uncommon a scam, as a friend of mine had this happen to them as well</li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/186redy/doordash_denied_refund_for_wrong_order_delivered/">DoorDash refuses refund for item that didn't arrive</a> <ul> <li>Of course, people have the standard response of &quot;why don't you stop using these crappy services?&quot; (the link above this one is also full of these) and someone responds, &quot;Because I'm disabled. Don't have a driver's license or a car. There isn't a bus stop near my apartment, I actually take paratransit to get to work, but I have to plan that a day ahead. Uber pulls the same shit, so I have to cycle through Uber, Door dash, and GrubHub based on who has coupons and hasn't stolen my money lately. Not everyone can just go pick something up.&quot;</li> </ul></li> <li>At one point, after I had a few bad deliveries in a row and gave a few drivers low ratings (I normally give people a high rating unless they don't even attempt to deliver to my door), I had a driver who took a really long time to deliver who, from watching the map, was just driving around.
With my rating, I wrote a note that said that it appeared that, from the route, the driver was multi-apping, at which point DoorDash removed my ability to rate drivers, so I switched to Uber</li> </ul> <h3 id="walmart">Walmart</h3> <ul> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18waqmr/comment/kfwksnc/">Driver steals delivery order; Walmart support does nothing and user has to drive to Walmart store to get issue fixed</a>, but this is actually possible, unlike with most tech companies <ul> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18waqmr/comment/kfx8npu/">Walmart employee notes that delivery is subcontracted out, with no real feedback mechanism</a></li> </ul></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18lj8nl/apparently_pizza_hut_now_uses_doordash_for/">Delivery doesn't arrive and user is unable to get refund</a></li> <li><a href="https://www.reddit.com/r/campbellriver/comments/18a7l0a/walmart_just_charged_me_80_dollars_for_phone/">Walmart refuses to refund user when they're charged the wrong price</a></li> </ul> <h3 id="airbnb">Airbnb</h3> <p>I've seen a ton of these but, for some reason, it didn't occur to me to add them to my list, so I don't have a lot of examples even though I've probably seen three times as many of these as I've seen Uber horror stories.</p> <ul> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/18v8wq7/comment/kfph732/">AirBnB had cameras in the bathroom and bedroom and support refused to refund user</a></li> <li><a href="https://www.reddit.com/r/mildlyinfuriating/comments/185hz9a/my_credit_card_was_compromised_and_the_thief/">AirBnB refuses to issue refund of scam booking to stolen credit card</a>; user has to issue chargeback and (as is standard) presumably gets their account banned for life</li> <li><a href="https://news.ycombinator.com/item?id=37277016">User finds cameras in AirBnB that cover sleeping areas and other private areas; AirBnB says they'll refund the user if the user books a hotel, then refuses to issue the refund</a> <ul> <li>User is a tenacious lawyer and goes through arbitration to get a refund, which takes a large amount of effort and almost an entire year (dates in the first-level reddit link above appear to be wrong if the dates in subsequent links are correct)</li> </ul></li> </ul> <h2 id="appendix-jeff-horwitz-s-broken-code-https-www-amazon-com-broken-code-facebook-harmful-secrets-dp-b0bw4wdnct-encoding-utf8-tag-abroaview-20-linkcode-ur2-linkid-db543113619810998b5a2415c3d801e6-camp-1789-creative-9325">Appendix: <a href="https://www.amazon.com/Broken-Code-Facebook-Harmful-Secrets/dp/B0BW4WDNCT/?&amp;_encoding=UTF8&amp;tag=abroaview-20&amp;linkCode=ur2&amp;linkId=db543113619810998b5a2415c3d801e6&amp;camp=1789&amp;creative=9325">Jeff Horwitz's Broken Code</a></h2> <p>Below are a few relevant excerpts. This is intended to be analogous to <a href="https://thezvi.wordpress.com/2019/05/30/quotes-from-moral-mazes/">Zvi Mowshowitz's Quotes from Moral Mazes</a>, which gives you an idea of what's in the book but is definitely not a replacement for reading the book. If these quotes are interesting, I recommend reading <a href="https://www.amazon.com/Broken-Code-Facebook-Harmful-Secrets/dp/B0BW4WDNCT/?&amp;_encoding=UTF8&amp;tag=abroaview-20&amp;linkCode=ur2&amp;linkId=db543113619810998b5a2415c3d801e6&amp;camp=1789&amp;creative=9325">the book</a>!</p> <blockquote> <p>The former employees who agreed to speak to me said troubling things from the get-go.
Facebook’s automated enforcement systems were flatly incapable of performing as billed. Efforts to engineer growth had inadvertently rewarded political zealotry. And the company knew far more about the negative effects of social media usage than it let on. <hr> as the election progressed, the company started receiving reports of mass fake accounts, bald-faced lies on campaign-controlled pages, and coordinated threats of violence against Duterte critics. After years in politics, Harbath wasn’t naive about dirty tricks. But when Duterte won, it was impossible to deny that Facebook’s platform had rewarded his combative and sometimes underhanded brand of politics. The president-elect banned independent media from his inauguration—but livestreamed the event on Facebook. His promised extrajudicial killings began soon after.</p> <p>A month after Duterte’s May 2016 victory came the United Kingdom’s referendum to leave the European Union. The Brexit campaign had been heavy on anti-immigrant sentiment and outright lies. As in the Philippines, the insurgent tactics seemed to thrive on Facebook—supporters of the “Leave” camp had obliterated “Remain” supporters on the platform. ... Harbath found all that to be gross, but there was no denying that Trump was successfully using Facebook and Twitter to short-circuit traditional campaign coverage, garnering attention in ways no campaign ever had. “I mean, he just has to go and do a short video on Facebook or Instagram and then the media covers it,” Harbath had marveled during a talk in Europe that spring. She wasn’t wrong: political reporters reported not just the content of Trump’s posts but their like counts.</p> <p>Did Facebook need to consider making some effort to fact-check lies spread on its platform? Harbath broached the subject with Adam Mosseri, then Facebook’s head of News Feed.</p> <p>“How on earth would we determine what’s true?” Mosseri responded. Depending on how you looked at it, it was an epistemic or a technological conundrum. Either way, the company chose to punt when it came to lies on its platform. <hr> Zuckerberg believed math was on Facebook’s side. Yes, there had been misinformation on the platform—but it certainly wasn’t the majority of content. Numerically, falsehoods accounted for just a fraction of all news viewed on Facebook, and news itself was just a fraction of the platform’s overall content. That such a fraction of a fraction could have thrown the election was downright illogical, Zuckerberg insisted.. ... But Zuckerberg was the boss. Ignoring Kornblut’s advice, he made his case the following day during a live interview at Techonomy, a conference held at the Ritz-Carlton in Half Moon Bay. Calling fake news a “very small” component of the platform, he declared the possibility that it had swung the election “a crazy idea.” ... A favorite saying at Facebook is that “Data Wins Arguments.” But when it came to Zuckerberg’s argument that fake news wasn’t a major problem on Facebook, the company didn’t have any data. As convinced as the CEO was that Facebook was blameless, he had no evidence of how “fake news” came to be, how it spread across the platform, and whether the Trump campaign had made use of it in their Facebook ad campaigns. ... One week after the election, BuzzFeed News reporter Craig Silverman published an analysis showing that, in the final months of the election, fake news had been the most viral election-related content on Facebook. 
A story falsely claiming that the pope had endorsed Trump had gotten more than 900,000 likes, reshares, and comments—more engagement than even the most widely shared stories from CNN, the New York Times, or the Washington Post. The most popular falsehoods, the story showed, had been in support of Trump.</p> <p>It was a bombshell. Interest in the term “fake news” spiked on Google the day the story was published—and it stayed high for years, first as Trump’s critics cited it as an explanation for the president-elect’s victory, and then as Trump co-opted the term to denigrate the media at large. ... even as the company’s Communications staff had quibbled with Silverman’s methodology, executives had demanded that News Feed’s data scientists replicate it. Was it really true that lies were the platform’s top election-related content?</p> <p>A day later, the staffers came back with an answer: almost.</p> <p>A quick and dirty review suggested that the data BuzzFeed was using had been slightly off, but the claim that partisan hoaxes were trouncing real news in Facebook’s News Feed was unquestionably correct. Bullshit peddlers had a big advantage over legitimate publications—their material was invariably compelling and exclusive. While scores of mainstream news outlets had written rival stories about Clinton’s leaked emails, for instance, none of them could compete with the headline “WikiLeaks CONFIRMS Hillary Sold Weapons to ISIS.” <hr> The engineers weren’t incompetent—just applying often-cited company wisdom that “Done Is Better Than Perfect.” Rather than slowing down, Maurer said, Facebook preferred to build new systems capable of minimizing the damage of sloppy work, creating firewalls to prevent failures from cascading, discarding neglected data before it piled up in server-crashing queues, and redesigning infrastructure so that it could be readily restored after inevitable blowups.</p> <p>The same culture applied to product design, where bonuses and promotions were doled out to employees based on how many features they “shipped”—programming jargon for incorporating new code into an app. Conducted semiannually, these “Performance Summary Cycle” reviews incented employees to complete products within six months, even if it meant the finished product was only minimally viable and poorly documented. Engineers and data scientists described living with perpetual uncertainty about where user data was being collected and stored—a poorly labeled data table could be a redundant file or a critical component of an important product. Brian Boland, a longtime vice president in Facebook’s Advertising and Partnerships divisions, recalled that a major data-sharing deal with Amazon once collapsed because Facebook couldn’t meet the retailing giant’s demand that it not mix Amazon’s data with its own.</p> <p>“Building things is way more fun than making things secure and safe,” he said of the company’s attitude. “Until there’s a regulatory or press fire, you don’t deal with it.” <hr> Nowhere in the system was there much place for quality control. Instead of trying to restrict problem content, Facebook generally preferred to personalize users’ feeds with whatever it thought they would want to see. Though taking a light touch on moderation had practical advantages—selling ads against content you don’t review is a great business—Facebook came to treat it as a moral virtue, too. 
The company wasn’t failing to supervise what users did—it was neutral.</p> <p>Though the company had come to accept that it would need to do some policing, executives continued to suggest that the platform would largely regulate itself. In 2016, with the company facing pressure to moderate terrorism recruitment more aggressively, Sheryl Sandberg had told the World Economic Forum that the platform did what it could, but that the lasting solution to hate on Facebook was to drown it in positive messages.</p> <p>“The best antidote to bad speech is good speech,” she declared, telling the audience how German activists had rebuked a Neo-Nazi political party’s Facebook page with “like attacks,” swarming it with messages of tolerance.</p> <p>Definitionally, the “counterspeech” Sandberg was describing didn’t work on Facebook. However inspiring the concept, interacting with vile content would have triggered the platform to distribute the objectionable material to a wider audience. <hr> ​​... in an internal memo by Andrew “Boz” Bosworth, who had gone from being one of Mark Zuckerberg’s TAs at Harvard to one of his most trusted deputies and confidants at Facebook. Titled “The Ugly,” Bosworth wrote the memo in June 2016, two days after the murder of a Chicago man was inadvertently livestreamed on Facebook. Facing calls for the company to rethink its products, Bosworth was rallying the troops.</p> <p>“We talk about the good and the bad of our work often. I want to talk about the ugly,” the memo began. Connecting people created obvious good, he said—but doing so at Facebook’s scale would produce harm, whether it was users bullying a peer to the point of suicide or using the platform to organize a terror attack.</p> <p>That Facebook would inevitably lead to such tragedies was unfortunate, but it wasn’t the Ugly. The Ugly, Boz wrote, was that the company believed in its mission of connecting people so deeply that it would sacrifice anything to carry it out.</p> <p>“That’s why all the work we do in growth is justified. All the questionable contact importing practices. All the subtle language that helps people stay searchable by friends. All of the work we do to bring more communication in. The work we will likely have to do in China some day. All of it,” Bosworth wrote. <hr> Every team responsible for ranking or recommending content rushed to overhaul their systems as fast as they could, setting off an explosion in the complexity of Facebook’s product. Employees found that the biggest gains often came not from deliberate initiatives but from simple futzing around. Rather than redesigning algorithms, which was slow, engineers were scoring big with quick and dirty machine learning experiments that amounted to throwing hundreds of variants of existing algorithms at the wall and seeing which versions stuck—which performed best with users. They wouldn’t necessarily know why a variable mattered or how one algorithm outperformed another at, say, predicting the likelihood of commenting. But they could keep fiddling until the machine learning model produced an algorithm that statistically outperformed the existing one, and that was good enough. <hr> ... in Facebook’s efforts to deploy a classifier to detect pornography, Arturo Bejar recalled, the system routinely tried to cull images of beds. Rather than learning to identify people screwing, the model had instead taught itself to recognize the furniture on which they most often did ... 
Similarly fundamental errors kept occurring, even as the company came to rely on far more advanced AI techniques to make far weightier and complex decisions than “porn/not porn.” The company was going all in on AI, both to determine what people should see, and also to solve any problems that might arise. <hr> Willner happened to read an NGO report documenting the use of Facebook to groom and arrange meetings with dozens of young girls who were then kidnapped and sold into sex slavery in Indonesia. Zuckerberg was working on his public speaking skills at the time and had asked employees to give him tough questions. So, at an all-hands meeting, Willner asked him why the company had allocated money for its first-ever TV commercial—a recently released ninety-second spot likening Facebook to chairs and other helpful structures—but no budget for a staffer to address its platform’s known role in the abduction, rape, and occasional murder of Indonesian children.</p> <p>Zuckerberg looked physically ill. He told Willner that he would need to look into the matter ... Willner said, the company was hopelessly behind in the markets where she believed Facebook had the highest likelihood of being misused. When she left Facebook in 2013, she had concluded that the company would never catch up. <hr> Within a few months, Facebook laid off the entire Trending Topics team, sending a security guard to escort them out of the building. A newsroom announcement said that the company had always hoped to make Trending Topics fully automated, and henceforth it would be. If a story topped Facebook’s metrics for viral news, it would top Trending Topics.</p> <p>The effects of the switch were not subtle. Freed from the shackles of human judgment, Facebook’s code began recommending users check out the commemoration of “National Go Topless Day,” a false story alleging that Megyn Kelly had been sacked by Fox News, and an only-too-accurate story titled “Man Films Himself Having Sex with a McChicken Sandwich.”</p> <p>Setting aside the feelings of McDonald’s social media team, there were reasons to doubt that the engagement on that final story reflected the public’s genuine interest in sandwich-screwing: much of the engagement was apparently coming from people wishing they’d never seen such accursed content. Still, Zuckerberg preferred it this way. Perceptions of Facebook’s neutrality were paramount; dubious and distasteful was better than biased.</p> <p>“Zuckerberg said anything that had a human in the loop we had to get rid of as much as possible,” the member of the early polarization team recalled.</p> <p>Among the early victims of this approach was the company’s only tool to combat hoaxes. For more than a decade, Facebook had avoided removing even the most obvious bullshit, which was less a principled stance and more the only possible option for the startup. “We were a bunch of college students in a room,” said Dave Willner, Charlotte Willner’s husband and the guy who wrote Facebook’s first content standards. “We were radically unequipped and unqualified to decide the correct history of the world.”</p> <p>But as the company started churning out billions of dollars in annual profit, there were, at least, resources to consider the problem of fake information. In early 2015, the company had announced that it had found a way to combat hoaxes without doing fact-checking—that is, without judging truthfulness itself. 
It would simply suppress content that users disproportionately reported as false.</p> <p>Nobody was so naive as to think that this couldn’t get contentious, or that the feature wouldn’t be abused. In a conversation with Adam Mosseri, one engineer asked how the company would deal, for example, with hoax “debunkings” of manmade global warming, which were popular on the American right. Mosseri acknowledged that climate change would be tricky but said that was not cause to stop: “You’re choosing the hardest case—most of them won’t be that hard.”</p> <p>Facebook publicly revealed its anti-hoax work to little fanfare in an announcement that accurately noted that users reliably reported false news. What it omitted was that users also reported as false any news story they didn’t like, regardless of its accuracy.</p> <p>To stem a flood of false positives, Facebook engineers devised a workaround: a “whitelist” of trusted publishers. Such safe lists are common in digital advertising, allowing jewelers to buy preauthorized ads on a host of reputable bridal websites, for example, while excluding domains like www.wedddings.com. Facebook’s whitelisting was pretty much the same: they compiled a generously large list of recognized news sites whose stories would be treated as above reproach.</p> <p>The solution was inelegant, and it could disadvantage obscure publishers specializing in factual but controversial reporting. Nonetheless, it effectively diminished the success of false viral news on Facebook. That is, until the company faced accusations of bias surrounding Trending Topics. Then Facebook preemptively turned it off.</p> <p>The disabling of Facebook’s defense against hoaxes was part of the reason fake news surged in the fall of 2016. <hr> Gomez-Uribe’s team hadn’t been tasked with working on Russian interference, but one of his subordinates noted something unusual: some of the most hyperactive accounts seemed to go entirely dark on certain days of the year. Their downtime, it turned out, corresponded with a list of public holidays in the Russian Federation.</p> <p>“They respect holidays in Russia?” he recalled thinking. “Are we all this fucking stupid?”</p> <p>But users didn’t have to be foreign trolls to promote problem posts. An analysis by Gomez-Uribe’s team showed that a class of Facebook power users tended to favor edgier content, and they were more prone to extreme partisanship. They were also, hour to hour, more prolific—they liked, commented, and reshared vastly more content than the average user. These accounts were outliers, but because Facebook recommended content based on aggregate engagement signals, they had an outsized effect on recommendations. If Facebook was a democracy, it was one in which everyone could vote whenever they liked and as frequently as they wished. ... hyperactive users tended to be more partisan and more inclined to share misinformation, hate speech, and clickbait, <hr> At Facebook, he realized, nobody was responsible for looking under the hood. “They’d trust the metrics without diving into the individual cases,” McNally said. “It was part of the ‘Move Fast’ thing. You’d have hundreds of launches every year that were only driven by bottom-line metrics.”</p> <p>Something else worried McNally. Facebook’s goal metrics tended to be calculated in averages.</p> <p>“It is a common phenomenon in statistics that the average is volatile, so certain pathologies could fall straight out of the geometry of the goal metrics,” McNally said. 
In his own reserved, mathematically minded way, he was calling Facebook’s most hallowed metrics crap. Making decisions based on metrics alone, without carefully studying the effects on actual humans, was reckless. But doing it based on average metrics was flat-out stupid. An average could rise because you did something that was broadly good for users, or it could go up because normal people were using the platform a tiny bit less and a small number of trolls were using Facebook way more.</p> <p>Everyone at Facebook understood this concept—it’s the difference between median and mean, a topic that is generally taught in middle school. But, in the interest of expediency, Facebook’s core metrics were all based on aggregate usage. It was as if a biologist was measuring the strength of an ecosystem based on raw biomass, failing to distinguish between healthy growth and a toxic algae bloom. <hr> One distinguishing feature was the shamelessness of fake news publishers’ efforts to draw attention. Along with bad information, their pages invariably featured clickbait (sensationalist headlines) and engagement bait (direct appeals for users to interact with content, thereby spreading it further).</p> <p>Facebook already frowned on those hype techniques as a little spammy, but truth be told it didn’t really do much about them. How much damage could a viral “Share this if you support the troops” post cause? <hr> Facebook’s mandate to respect users’ preferences posed another challenge. According to the metrics the platform used, misinformation was what people wanted. Every metric that Facebook used showed that people liked and shared stories with sensationalistic and misleading headlines.</p> <p>McNally suspected the metrics were obscuring the reality of the situation. His team set out to demonstrate that this wasn’t actually true. What they found was that, even though users routinely engaged with bait content, they agreed in surveys that such material was of low value to them. When informed that they had shared false content, they experienced regret. And they generally considered fact-checks to contain useful information. <hr> every time a well-intentioned proposal of that sort blew up in the company’s face, the people working on misinformation lost a bit of ground. In the absence of a coherent, consistent set of demands from the outside world, Facebook would always fall back on the logic of maximizing its own usage metrics.</p> <p>“If something is not going to play well when it hits mainstream media, they might hesitate when doing it,” McNally said. “Other times we were told to take smaller steps and see if anybody notices. The errors were always on the side of doing less.” ... “For people who wanted to fix Facebook, polarization was the poster child of ‘Let’s do some good in the world,’ ” McNally said. “The verdict came back that Facebook’s goal was not to do that work.” <hr> When the ranking team had begun its work, there had been no question that Facebook was feeding its users overtly false information at a rate that vastly outstripped any other form of media. This was no longer the case (even though the company would be raked over the coals for spreading “fake news” for years to come).</p> <p>Ironically, Facebook was in a poor position to boast about that success. With Zuckerberg having insisted throughout that fake news accounted for only a trivial portion of content, Facebook couldn’t celebrate that it might be on the path of making the claim true. 
<hr> multiple members of both teams recalled having had the same response when they first learned of MSI’s new engagement weightings: it was going to make people fight. Facebook’s good intent may have been genuine, but the idea that turbocharging comments, reshares, and emojis would have unpleasant effects was pretty obvious to people who had, for instance, worked on Macedonian troll farms, sensationalism, and hateful content.</p> <p>Hyperbolic headlines and outrage bait were already well-recognized digital publishing tactics, on and off Facebook. They traveled well, getting reshared in long chains. Giving a boost to content that galvanized reshares was going to add an exponential component to the already-healthy rate at which such problem content spread. At a time when the company was trying to address purveyors of misinformation, hyperpartisanship, and hate speech, it had just made their tactics more effective.</p> <p>Multiple leaders inside Facebook’s Integrity team raised concerns about MSI with Hegeman, who acknowledged the problem and committed to trying to fine-tune MSI later. But adopting MSI was a done deal, he said—Zuckerberg’s orders.</p> <p>Even non-Integrity staffers recognized the risk. When a Growth team product manager asked if the change meant News Feed would favor more controversial content, the manager of the team responsible for the work acknowledged it very well could. <hr> The effect was more than simply provoking arguments among friends and relatives. As a Civic Integrity researcher would later report back to colleagues, Facebook’s adoption of MSI appeared to have gone so far as to alter European politics. “Engagement on positive and policy posts has been severely reduced, leaving parties increasingly reliant on inflammatory posts and direct attacks on their competitors,” a Facebook social scientist wrote after interviewing political strategists about how they used the platform. In Poland, the parties described online political discourse as “a social-civil war.” One party’s social media management team estimated that they had shifted the proportion of their posts from 50/50 positive/negative to 80 percent negative and 20 percent positive, explicitly as a function of the change to the algorithm. Major parties blamed social media for deepening political polarization, describing the situation as “unsustainable.”</p> <p>The same was true of parties in Spain. “They have learnt that harsh attacks on their opponents net the highest engagement,” the researcher wrote. “From their perspective, they are trapped in an inescapable cycle of negative campaigning by the incentive structures of the platform.”</p> <p>If Facebook was making politics more combative, not everyone was upset about it. Extremist parties proudly told the researcher that they were running “provocation strategies” in which they would “create conflictual engagement on divisive issues, such as immigration and nationalism.”</p> <p>To compete, moderate parties weren’t just talking more confrontationally. They were adopting more extreme policy positions, too. It was a matter of survival. “While they acknowledge they are contributing to polarization, they feel like they have little choice and are asking for help,” the researcher wrote. <hr> Facebook’s most successful publishers of political content were foreign content farms posting absolute trash, stuff that made About.com’s old SEO chum look like it belonged in the New Yorker.</p> <p>Allen wasn’t the first staffer to notice the quality problem. 
The pages were an outgrowth of the fake news publishers that Facebook had battled in the wake of the 2016 election. While fact-checks and other crackdown efforts had made it far harder for outright hoaxes to go viral, the publishers had regrouped. Some of the same entities that BuzzFeed had written about in 2016—teenagers from a small Macedonian mountain town called Veles—were back in the game. How had Facebook’s news distribution system been manipulated by kids in a country with a per capita GDP of $5,800? <hr> When reviewing troll farm pages, he noticed something—their posts usually went viral. This was odd. Competition for space in users’ News Feeds meant that most pages couldn’t reliably get their posts in front of even those people who deliberately chose to follow them. But with the help of reshares and the News Feed algorithms, the Macedonian troll farms were routinely reaching huge audiences. If having a post go viral was hitting the attention jackpot, then the Macedonians were winning every time they put a buck into Facebook’s slot machine.</p> <p>The reason the Macedonians’ content was so good was that it wasn’t theirs. Virtually every post was either aggregated or stolen from somewhere else on the internet. Usually such material came from Reddit or Twitter, but the Macedonians were just ripping off content from other Facebook pages, too, and reposting it to their far larger audiences. This worked because, on Facebook, originality wasn’t an asset; it was a liability. Even for talented content creators, most posts turned out to be duds. But things that had already gone viral nearly always would do so again. <hr> Allen began a note about the problem from the summer of 2018 with a reminder. “The mission of Facebook is to empower people to build community. This is a good mission,” he wrote, before arguing that the behavior he was describing exploited attempts to do that. As an example, Allen compared a real community—a group known as the National Congress of American Indians. The group had clear leaders, produced original programming, and held offline events for Native Americans. But, despite NCAI’s earnest efforts, it had far fewer fans than a page titled “Native American Proub” [sic] that was run out of Vietnam. The page’s unknown administrators were using recycled content to promote a website that sold T-shirts.</p> <p>“They are exploiting the Native American Community,” Allen wrote, arguing that, even if users liked the content, they would never choose to follow a Native American pride page that was secretly run out of Vietnam. As proof, he included an appendix of reactions from users who had wised up. “If you’d like to read 300 reviews from real users who are very upset about pages that exploit the Native American community, here is a collection of 1 star reviews on Native American ‘Community’ and ‘Media’ pages,” he concluded.</p> <p>This wasn’t a niche problem. It was increasingly the default state of pages in every community. Six of the top ten Black-themed pages—including the number one page, “My Baby Daddy Ain’t Shit”—were troll farms. The top fourteen English-language Christian- and Muslim-themed pages were illegitimate. A cluster of troll farms peddling evangelical content had a combined audience twenty times larger than the biggest authentic page.</p> <p>“This is not normal. This is not healthy. We have empowered inauthentic actors to accumulate huge followings for largely unknown purposes,” Allen wrote in a later note. 
“Mostly, they seem to want to skim a quick buck off of their audience. But there are signs they have been in contact with the IRA.”</p> <p>So how bad was the problem? A sampling of Facebook publishers with significant audiences found that a full 40 percent relied on content that was either stolen, aggregated, or “spun”—meaning altered in a trivial fashion. The same thing was true of Facebook video content. One of Allen’s colleagues found that 60 percent of video views went to aggregators.</p> <p>The tactics were so well-known that, on YouTube, people were putting together instructional how-to videos explaining how to become a top Facebook publisher in a matter of weeks. “This is where I’m snagging videos from YouTube and I’ll re-upload them to Facebook,” said one guy in a video Allen documented, noting that it wasn’t strictly necessary to do the work yourself. “You can pay 20 dollars on Fiverr for a compilation—‘Hey, just find me funny videos on dogs, and chain them together into a compilation video.’ ”</p> <p>Holy shit, Allen thought. Facebook was losing in the later innings of a game it didn’t even understand it was playing. He branded the set of winning tactics “manufactured virality.”</p> <p>“What’s the easiest (lowest effort) way to make a big Facebook Page?” Allen wrote in an internal slide presentation. “Step 1: Find an existing, engaged community on [Facebook]. Step 2: Scrape/Aggregate content popular in that community. Step 3: Repost the most popular content on your Page.” <hr> Allen’s research kicked off a discussion. That a top page for American Vietnam veterans was being run from overseas—from Vietnam, no less—was just flat-out embarrassing. And unlike killing off Page Like ads, which had been a nonstarter for the way it alienated certain internal constituencies, if Allen and his colleagues could work up ways to systematically suppress trash content farms—material that was hardly exalted by any Facebook team—getting leadership to approve them might be a real possibility.</p> <p>This was where Allen ran up against that key Facebook tenet, “Assume Good Intent.” The principle had been applied to colleagues, but it was meant to be just as applicable to Facebook’s billions of users. In addition to being a nice thought, it was generally correct. The overwhelming majority of people who use Facebook do so in the name of connection, entertainment, and distraction, and not to deceive or defraud. But, as Allen knew from experience, the motto was hardly a comprehensive guide to living, especially when money was involved. <hr> With the help of another data scientist, Allen documented the inherent traits of crap publishers. They aggregated content. They went viral too consistently. They frequently posted engagement bait. And they relied on reshares from random users, rather than cultivating a dedicated long-term audience.</p> <p>None of these traits warranted severe punishment by itself. But together they added up to something damning. A 2019 screening for these features found 33,000 entities—a scant 0.175 percent of all pages—that were receiving a full 25 percent of all Facebook page views. Virtually none of them were “managed,” meaning controlled by entities that Facebook’s Partnerships team considered credible media professionals, and they accounted for just 0.14 percent of Facebook revenue. <hr> After it was bought, CrowdTangle was no longer a company but a product, available to media companies at no cost. However much publishers were angry with Facebook, they loved Silverman’s product. 
The only mandate Facebook gave him was for his team to keep building things that made publishers happy. Savvy reporters looking for viral story fodder loved it, too. CrowdTangle could surface, for instance, an up-and-coming post about a dog that saved its owner’s life, material that was guaranteed to do huge numbers on social media because it was already heading in that direction.</p> <p>CrowdTangle invited its formerly paying media customers to a party in New York to celebrate the deal. One of the media executives there asked Silverman whether Facebook would be using CrowdTangle internally as an investigative tool, a question that struck Silverman as absurd. Yes, it had offered social media platforms an early window into their own usage. But Facebook’s staff now outnumbered his own by several thousand to one. “I was like, ‘That’s ridiculous—I’m sure whatever they have is infinitely more powerful than what we have!’ ”</p> <p>It took Silverman more than a year to reconsider that answer. <hr> It was only as CrowdTangle started building tools to do this that the team realized just how little Facebook knew about its own platform. When Media Matters, a liberal media watchdog, published a report showing that MSI had been a boon for Breitbart, Facebook executives were genuinely surprised, sending around the article asking if it was true. As any CrowdTangle user would have known, it was.</p> <p>Silverman thought the blindness unfortunate, because it prevented the company from recognizing the extent of its quality problem. It was the same point that Jeff Allen and a number of other Facebook employees had been hammering on. As it turned out, the person to drive it home wouldn’t come from inside the company. It would be Jonah Peretti, the CEO of BuzzFeed.</p> <p>BuzzFeed had pioneered the viral publishing model. While “listicles” earned the publication a reputation for silly fluff in its early days, Peretti’s staff operated at a level of social media sophistication far above most media outlets, stockpiling content ahead of snowstorms and using CrowdTangle to find quick-hit stories that drew giant audiences.</p> <p>In the fall of 2018, Peretti emailed Cox with a grievance: Facebook’s Meaningful Social Interactions ranking change was pressuring his staff to produce scuzzier content. BuzzFeed could roll with the punches, Peretti wrote, but nobody on his staff would be happy about it. Distinguishing himself from publishers who just whined about lost traffic, Peretti cited one of his platform’s recent successes: a compilation of tweets titled “21 Things That Almost All White People Are Guilty of Saying.” The list—which included “whoopsie daisy,” “get these chips away from me,” and “guilty as charged”—had performed fantastically on Facebook. What bothered Peretti was the apparent reason why. Thousands of users were brawling in the comments section over whether the item itself was racist.</p> <p>“When we create meaningful content, it doesn’t get rewarded,” Peretti told Cox. Instead, Facebook was promoting “fad/junky science,” “extremely disturbing news,” “gross images,” and content that exploited racial divisions, according to a summary of Peretti’s email that circulated among Integrity staffers. Nobody at BuzzFeed liked producing that junk, Peretti wrote, but that was what Facebook was demanding. 
(In an illustration of BuzzFeed’s willingness to play the game, a few months later it ran another compilation titled “33 Things That Almost All White People Are Guilty of Doing.”) <hr> As users’ News Feeds became dominated by reshares, group posts, and videos, the “organic reach” of celebrity pages began tanking. “My artists built up a fan base and now they can’t reach them unless they buy ads,” groused Travis Laurendine, a New Orleans–based music promoter and technologist, in a 2019 interview. A page with 10,000 followers would be lucky to reach more than a tiny percent of them.</p> <p>Explaining why a celebrity’s Facebook reach was dropping even as they gained followers was hell for Partnerships, the team tasked with providing VIP service to notable users and selling them on the value of maintaining an active presence on Facebook. The job boiled down to convincing famous people, or their social media handlers, that if they followed a set of company-approved best practices, they would reach their audience. The problem was that those practices, such as regularly posting original content and avoiding engagement bait, didn’t actually work. Actresses who were the center of attention on the Oscars’ red carpet would have their posts beaten out by a compilation video of dirt bike crashes stolen from YouTube. ... Over time, celebrities and influencers began drifting off the platform, generally to sister company Instagram. “I don’t think people ever connected the dots,” Boland said. <hr> “Sixty-four percent of all extremist group joins are due to our recommendation tools,” the researcher wrote in a note summarizing her findings. “Our recommendation systems grow the problem.”</p> <p>This sort of thing was decidedly not supposed to be Civic’s concern. The team existed to promote civic participation, not police it. Still, a longstanding company motto was that “Nothing Is Someone Else’s Problem.” Chakrabarti and the researcher team took the findings to the company’s Protect and Care team, which worked on things like suicide prevention and bullying and was, at that point, the closest thing Facebook had to a team focused on societal problems.</p> <p>Protect and Care told Civic there was nothing it could do. The accounts creating the content were real people, and Facebook intentionally had no rules mandating truth, balance, or good faith. This wasn’t someone else’s problem—it was nobody’s problem. <hr> Even if the problem seemed large and urgent, exploring possible defenses against bad-faith viral discourse was going to be new territory for Civic, and the team wanted to start off slow. Cox clearly supported the team’s involvement, but studying the platform’s defenses against manipulation would still represent moonlighting from Civic’s main job, which was building useful features for public discussion online.</p> <p>A few months after the 2016 election, Chakrabarti made a request of Zuckerberg. To build tools to study political misinformation on Facebook, he wanted two additional engineers on top of the eight he already had working on boosting political participation.</p> <p>“How many engineers do you have on your team right now?” Zuckerberg asked. Chakrabarti told him. “If you want to do it, you’re going to have to come up with the resources yourself,” the CEO said, according to members of Civic. Facebook had more than 20,000 engineers—and Zuckerberg wasn’t willing to give the Civic team two of them to study what had happened during the election. 
<hr> While acknowledging the possibility that social media might not be a force for universal good was a step forward for Facebook, discussing the flaws of the existing platform remained difficult even internally, recalled product manager Elise Liu.</p> <p>“People don’t like being told they’re wrong, and they especially don’t like being told that they’re morally wrong,” she said. “Every meeting I went to, the most important thing to get in was ‘It’s not your fault. It happened. How can you be part of the solution? Because you’re amazing.’ ” <hr> “We do not and possibly never will have a model that captures even a majority of integrity harms, particularly in sensitive areas,” one engineer would write, noting that the company’s classifiers could identify only 2 percent of prohibited hate speech with enough precision to remove it.</p> <p>Inaction on the overwhelming majority of content violations was unfortunate, Rosen said, but not a reason to change course. Facebook’s bar for removing content was akin to the standard of guilt beyond a reasonable doubt applied in criminal cases. Even limiting a post’s distribution should require a preponderance of evidence. The combination of inaccurate systems and a high burden of proof would inherently mean that Facebook generally didn’t enforce its own rules against hate, Rosen acknowledged, but that was by design.</p> <p>“Mark personally values free expression first and foremost and would say this is a feature, not a bug,” he wrote.</p> <p>Publicly, the company declared that it had zero tolerance for hate speech. In practice, however, the company’s failure to meaningfully combat it was viewed as unfortunate—but highly tolerable. <hr> Myanmar, ruled by a military junta that exercised near-complete control until 2011, was the sort of place where Facebook was rapidly filling in for the civil society that the government had never allowed to develop. The app offered telecommunications services, real-time news, and opportunities for activism to a society unaccustomed to them.</p> <p>In 2012, ethnic violence between the country’s dominant Buddhist majority and its Rohingya Muslim minority left around two hundred people dead and prompted tens of thousands of people to flee their homes. To many, the dangers posed by Facebook in the situation seemed obvious, including to Aela Callan, a journalist and documentary filmmaker who brought them to the attention of Elliot Schrage in Facebook’s Public Policy division in 2013. All the like-minded Myanmar Cassandras received a polite audience in Menlo Park, and little more. Their argument that Myanmar was a tinderbox was validated in 2014, when a hardline Buddhist monk posted a false claim on Facebook that a Rohingya man had raped a Buddhist woman, a provocation that produced clashes, killing two people. But with the exception of Bejar’s Compassion Research team and Cox—who was personally interested in Myanmar, privately funding independent news media there as a philanthropic endeavor—nobody at Facebook paid a great deal of attention.</p> <p>Later accounts of the ignored warnings led many of the company’s critics to attribute Facebook’s inaction to pure callousness, though interviews with those involved in the cleanup suggest that the root problem was incomprehension. Human rights advocates were telling Facebook not just that its platform would be used to kill people but that it already had.
At a time when the company assumed that users would suss out and shut down misinformation without help, however, the information proved difficult to absorb. The version of Facebook that the company’s upper ranks knew—a patchwork of their friends, coworkers, family, and interests—couldn’t possibly be used as a tool of genocide.</p> <p>Facebook eventually hired its first Burmese-language content reviewer to cover whatever issues arose in the country of more than 50 million in 2015, and released a packet of flower-themed, peace-promoting digital stickers for Burmese users to slap on hateful posts. (The company would later note that the stickers had emerged from discussions with nonprofits and were “widely celebrated by civil society groups at the time.”) At the same time, it cut deals with telecommunications providers to provide Burmese users with Facebook access free of charge.</p> <p>The first wave of ethnic cleansing began later that same year, with leaders of the country’s military announcing on Facebook that they would be “solving the problem” of the country’s Muslim minority. A second wave of violence followed and, in the end, 25,000 people were killed by the military and Buddhist vigilante groups, 700,000 were forced to flee their homes, and thousands more were raped and injured. The UN branded the violence a genocide.</p> <p>Facebook still wasn’t responding. On its own authority, Gomez-Uribe’s News Feed Integrity team began collecting examples of the platform giving massive distribution to statements inciting violence. Even without Burmese-language skills, it wasn’t difficult. The torrent of anti-Rohingya hate and falsehoods from the Burmese military, government shills, and firebrand monks was not just overwhelming but overwhelmingly successful.</p> <p>This was exploratory work, not on the Integrity Ranking team’s half-year roadmap. When Gomez-Uribe, along with McNally and others, pushed to reassign staff to better grasp the scope of Facebook’s problem in Myanmar, they were shot down.</p> <p>“We were told no,” Gomez-Uribe recalled. “It was clear that leadership didn’t want to understand it more deeply.”</p> <p>That changed, as it so often did, when Facebook’s role in the problem became public. A couple of weeks after the worst violence broke out, an international human rights organization condemned Facebook for inaction. Within seventy-two hours, Gomez-Uribe’s team was urgently asked to figure out what was going on.</p> <p>When it was all over, Facebook’s negligence was clear. A UN report declared that “the response of Facebook has been slow and ineffective,” and an external human rights consultant that Facebook hired eventually concluded that the platform “has become a means for those seeking to spread hate and cause harm.”</p> <p>In a series of apologies, the company acknowledged that it had been asleep at the wheel and pledged to hire more staffers capable of speaking Burmese. Left unsaid was why the company screwed up. The truth was that it had no idea what was happening on its platform in most countries. <hr> Barnes was put in charge of “meme busting”—that is, combating the spread of viral hoaxes about Facebook, on Facebook. No, the company was not going to claim permanent rights to all your photos unless you reshared a post warning of the threat. And no, Zuckerberg was not giving away money to the people who reshared a post saying so. 
Suppressing these digital chain letters had an obvious payoff; they tarred Facebook’s reputation and served no purpose.</p> <p>Unfortunately, restricting the distribution of this junk via News Feed wasn’t enough to sink it. The posts also spread via Messenger, in large part because the messaging platform was prodding recipients of the messages to forward them on to a list of their friends.</p> <p>The Advocacy team that Barnes had worked on sat within Facebook’s Growth division, and Barnes knew the guy who oversaw Messenger forwarding. Armed with data showing that the current forwarding feature was flooding the platform with anti-Facebook crap, he arranged a meeting.</p> <p>Barnes’s colleague heard him out, then raised an objection.</p> <p>“It’s really helping us with our goals,” the man said of the forwarding feature, which allowed users to reshare a message to a list of their friends with just a single tap. Messenger’s Growth staff had been tasked with boosting the number of “sends” that occurred each day. They had designed the forwarding feature to encourage precisely the impulsive sharing that Barnes’s team was trying to stop.</p> <p>Barnes hadn’t so much lost a fight over Messenger forwarding as failed to even start one. At a time when the company was trying to control damage to its reputation, it was also being intentionally agnostic about whether its own users were slandering it. What was important was that they shared their slander via a Facebook product.</p> <p>“The goal was in itself a sacred thing that couldn’t be questioned,” Barnes said. “They’d specifically created this flow to maximize the number of times that people would send messages. It was a Ferrari, a machine designed for one thing: infinite scroll.” <hr> Entities like Liftable Media, a digital media company run by longtime Republican operative Floyd Brown, had built an empire on pages that began by spewing upbeat clickbait, then pivoted to supporting Trump ahead of the 2016 election. To compound its growth, Liftable began buying up other spammy political Facebook pages with names like “Trump Truck,” “Patriot Update,” and “Conservative Byte,” running its content through them.</p> <p>In the old world of media, the strategy of managing loads of interchangeable websites and Facebook pages wouldn’t make sense. For both economies of scale and to build a brand, print and video publishers targeted each audience through a single channel. (The publisher of Cat Fancy might expand into Bird Fancy, but was unlikely to cannibalize its audience by creating a near-duplicate magazine called Cat Enthusiast.)</p> <p>That was old media, though. On Facebook, flooding the zone with competing pages made sense because of some algorithmic quirks. First, the algorithm favored variety. To prevent a single popular and prolific content producer from dominating users’ feeds, Facebook blocked any publisher from appearing too frequently. Running dozens of near-duplicate pages sidestepped that, giving the same content more bites at the apple.</p> <p>Coordinating a network of pages provided a second, greater benefit. It fooled a News Feed feature that promoted virality. News Feed had been designed to favor content that appeared to be emerging organically in many places. If multiple entities you followed were all talking about something, the odds were that you would be interested so Facebook would give that content a big boost.</p> <p>The feature played right into the hands of motivated publishers. 
By recommending that users who followed one page like its near doppelgängers, a publisher could create overlapping audiences, using a dozen or more pages to synthetically mimic a hot story popping up everywhere at once. ... Zhang, working on the issue in 2020, found that the tactic was being used to benefit publishers (Business Insider, Daily Wire, a site named iHeartDogs), as well as political figures and just about anyone interested in gaming Facebook content distribution (Dairy Queen franchises in Thailand). Outsmarting Facebook didn’t require subterfuge. You could win a boost for your content by running it on ten different pages that were all administered by the same account.</p> <p>It would be difficult to overstate the size of the blind spot that Zhang exposed when she found it ... ... Liftable was an archetype of that malleability. The company had begun as a vaguely Christian publisher of the low-calorie inspirational content that once thrived on Facebook. But News Feed was a fickle master, and by 2015 Facebook had changed its recommendations in ways that stopped rewarding things like “You Won’t Believe Your Eyes When You See This Phenomenally Festive Christmas Light Show.”</p> <p>The algorithm changes sent an entire class of rival publishers like Upworthy and ViralNova into a terminal tailspin, but Liftable was a survivor. In addition to shifting toward stories with headlines like “Parents Furious: WATCH What Teacher Did to Autistic Son on Stage in Front of EVERYONE,” Liftable acquired WesternJournal.com and every large political Facebook page it could get its hands on.</p> <p>This approach was hardly a secret. Despite Facebook rules prohibiting the sale of pages, Liftable issued press releases about its acquisition of “new assets”—Facebook pages with millions of followers. Once brought into the fold, the network of pages would blast out the same content.</p> <p>Nobody inside or outside Facebook paid much attention to the craven amplification tactics and dubious content that publishers such as Liftable were adopting. Headlines like “The Sodomites Are Aiming for Your Kids” seemed more ridiculous than problematic. But Floyd and the publishers of such content knew what they were doing, and they capitalized on Facebook’s inattention and indifference. <hr> The early work trying to figure out how to police publishers’ tactics had come from staffers attached to News Feed, but that team was broken up during the consolidation of integrity work under Guy Rosen ... “The News Feed integrity staffers were told not to work on this, that it wasn’t worth their time,” recalled product manager Elise Liu ... Facebook’s policies certainly made it seem like removing networks of fake accounts shouldn’t have been a big deal: the platform required users to go by their real names in the interests of accountability and safety. In practice, however, the rule that users were allowed a single account bearing their legal name generally went unenforced. <hr> In the spring of 2018, the Civic team began agitating to address dozens of other networks of recalcitrant pages, including one tied to a site called “Right Wing News.” The network was run by Brian Kolfage, a U.S. veteran who had lost both legs and a hand to a missile in Iraq.</p> <p>Harbath’s first reaction to Civic’s efforts to take down a prominent disabled veteran’s political media business was a flat no. 
She couldn’t dispute the details of his misbehavior—Kolfage was using fake or borrowed accounts to spam Facebook with links to vitriolic, sometimes false content. But she also wasn’t ready to shut him down for doing things that the platform had tacitly allowed.</p> <p>“Facebook had let this guy build up a business using shady-ass tactics and scammy behavior, so there was some reluctance to basically say, like, ‘Sorry, the things that you’ve done every day for the last several years are no longer acceptable,’ ” she said. ... Other than simply giving up on enforcing Facebook’s rules, there wasn’t much left to try. Facebook’s Public Policy team remained uncomfortable with taking down a major domestic publisher for inauthentic amplification, and it made the Civic team prove that Kolfage’s content, in addition to his tactics, was objectionable. This hurdle became a permanent but undisclosed change in policy: cheating to manipulate Facebook’s algorithm wasn’t enough to get you kicked off the platform—you had to be promoting something bad, too. <hr> Tests showed that the takedowns cut the amount of American political spam content by 20 percent overnight. Chakrabarti later admitted to his subordinates that he had been surprised that they had succeeded in taking a major action on domestic attempts to manipulate the platform. He had privately been expecting Facebook’s leadership to shut the effort down. <hr> A staffer had shown Cox that a Brazilian legislator who supported the populist Jair Bolsonaro had posted a fabricated video of a voting machine that had supposedly been rigged in favor of his opponent. The doctored footage had already been debunked by fact-checkers, which normally would have provided grounds to bring the distribution of the post to an abrupt halt. But Facebook’s Public Policy team had long ago determined, after a healthy amount of discussion regarding the rule’s application to President Donald Trump, that government officials’ posts were immune from fact-checks. Facebook was therefore allowing false material that undermined Brazilians’ trust in democracy to spread unimpeded.</p> <p>... Despite Civic’s concerns, voting in Brazil went smoothly. The same couldn’t be said for Civic’s colleagues over at WhatsApp. In the final days of the Brazilian election, viral misinformation transmitted by unfettered forwarding had blown up. <hr> Supporters of the victorious Bolsonaro, who shared their candidate’s hostility toward homosexuality, were celebrating on Facebook by posting memes of masked men holding guns and bats. The accompanying Portuguese text combined the phrase “We’re going hunting” with a gay slur, and some of the posts encouraged users to join WhatsApp groups supposedly for that violent purpose. Engagement was through the roof, prompting Facebook’s systems to spread them even further.</p> <p>While the company’s hate classifiers had been good enough to detect the problem, they weren’t reliable enough to automatically remove the torrent of hate. Rather than celebrating the race’s conclusion, Civic War Room staff put out an after-hours call for help from Portuguese-speaking colleagues. One polymath data scientist, a non-Brazilian who spoke great Portuguese and happened to be gay, answered the call.</p> <p>For Civic staffers, an incident like this wasn’t a good time, but it wasn’t extraordinary, either. 
They had come to accept that unfortunate things like this popped up on the platform sometimes, especially around election time.</p> <p>It took a glance at the Portuguese-speaking data scientist to remind Barnes how strange it was that viral horrors had become so routine on Facebook. The volunteer was hard at work just like everyone else, but he was quietly sobbing as he worked. “That moment is embedded in my mind,” Barnes said. “He’s crying, and it’s going to take the Operations team ten hours to clear this.” <hr> India was a huge target for Facebook, which had already been locked out of China, despite much effort by Zuckerberg. The CEO had jogged unmasked through Tiananmen Square as a sign that he wasn’t bothered by Beijing’s notorious air pollution. He had asked President Xi Jinping, unsuccessfully, to choose a Chinese name for his first child. The company had even worked on a secret tool that would have allowed Beijing to directly censor the posts of Chinese users. All of it was to little avail: Facebook wasn’t getting into China. By 2019, Zuckerberg had changed his tune, saying that the company didn’t want to be there—Facebook’s commitment to free expression was incompatible with state repression and censorship. Whatever solace Facebook derived from adopting this moral stance, succeeding in India became all the more vital: If Facebook wasn’t the dominant platform in either of the world’s two most populous countries, how could it be the world’s most important social network? <hr> Civic’s work got off to an easy start because the misbehavior was obvious. Taking only perfunctory measures to cover their tracks, all major parties were running networks of inauthentic pages, a clear violation of Facebook rules.</p> <p>The BJP’s IT cell seemed the most successful. The bulk of the coordinated posting could be traced to websites and pages created by Silver Touch, the company that had built Modi’s reelection campaign app. With cumulative follower counts in excess of 10 million, the network hit both of Facebook’s agreed-upon standards for removal: they were using banned tricks to boost engagement and violating Facebook content policies by running fabricated, inflammatory quotes that allegedly exposed Modi opponents’ affection for rapists and that denigrated Muslims.</p> <p>With documentation of all parties’ bad behavior in hand by early spring, the Civic staffers overseeing the project arranged an hour-long meeting in Menlo Park with Das and Harbath to make the case for a mass takedown. Das showed up forty minutes late and pointedly let the team know that, despite the ample cafés, cafeterias, and snack rooms at the office, she had just gone out for coffee. As the Civic Team’s Liu and Ghosh tried to rush through several months of research showing how the major parties were relying on banned tactics, Das listened impassively, then told them she’d have to approve any action they wanted to take.</p> <p>The team pushed ahead with preparing to remove the offending pages. Mindful as ever of optics, the team was careful to package a large group of abusive pages together, some from the BJP’s network and others from the INC’s far less successful effort. With the help of Nathaniel Gleicher’s security team, a modest collection of Facebook pages traced to the Pakistani military was thrown in for good measure.</p> <p>Even with the attempt at balance, the effort soon got bogged down.
Higher-ups’ enthusiasm for the takedowns was so lacking that Chakrabarti and Harbath had to lobby Kaplan directly before they got approval to move forward.</p> <p>“I think they thought it was going to be simpler,” Harbath said of the Civic team’s efforts.</p> <p>Still, Civic kept pushing. On April 1, less than two weeks before voting was set to begin, Facebook announced that it had taken down more than one thousand pages and groups in separate actions against inauthentic behavior. In a statement, the company named the guilty parties: the Pakistani military, the IT cell of the Indian National Congress, and “individuals associated with an Indian IT firm, Silver Touch.”</p> <p>For anyone who knew what was truly going on, the announcement was suspicious. Of the three parties cited, the pro-BJP propaganda network was by far the largest—and yet the party wasn’t being called out like the others.</p> <p>Harbath and another person familiar with the mass takedown insisted this had nothing to do with favoritism. It was, they said, simply a mess. Where the INC had abysmally failed at subterfuge, making the attribution unavoidable under Facebook’s rules, the pro-BJP effort had been run through a contractor. That fig leaf gave the party some measure of deniability, even if it might fall short of plausible.</p> <p>If the announcement’s omission of the BJP wasn’t a sop to India’s ruling party, what Facebook did next certainly seemed to be. Even as it was publicly mocking the INC for getting caught, the BJP was privately demanding that Facebook reinstate the pages the party claimed it had no connection to. Within days of the takedown, Das and Kaplan’s team in Washington were lobbying hard to reinstate several BJP-connected entities that Civic had fought so hard to take down. They won, and some of the BJP pages got restored.</p> <p>With Civic and Public Policy at odds, the whole messy incident got kicked up to Zuckerberg to hash out. Kaplan argued that applying American campaign standards to India and many other international markets was unwarranted. Besides, no matter what Facebook did, the BJP was overwhelmingly favored to return to power when the election ended in May, and Facebook was seriously pissing it off.</p> <p>Zuckerberg concurred with Kaplan’s qualms. The company should absolutely continue to crack down hard on covert foreign efforts to influence politics, he said, but in domestic politics the line between persuasion and manipulation was far less clear. Perhaps Facebook needed to develop new rules—ones with Public Policy’s approval.</p> <p>The result was a near moratorium on attacking domestically organized inauthentic behavior and political spam. Imminent plans to remove illicitly coordinated Indonesian networks of pages, groups, and accounts ahead of upcoming elections were shut down. Civic’s wings were getting clipped. <hr> By 2019, Jin’s standing inside the company was slipping. He had made a conscious decision to stop working so much, offloading parts of his job onto others, something that did not conform to Facebook’s culture. More than that, Jin had a habit of framing what the company did in moral terms. Was this good for users? Was Facebook truly making its products better?</p> <p>Other executives were careful when bringing decisions to Zuckerberg to not frame decisions in terms of right or wrong. Everyone was trying to work collaboratively, to make a better product, and whatever Zuckerberg decided was good. Jin’s proposals didn’t carry that tone. 
He was unfailingly respectful, but he was also clear on what he considered the range of acceptable positions. Alex Schultz, the company’s chief marketing officer, once remarked to a colleague that the problem with Jin was that he made Zuckerberg feel like shit.</p> <p>In July 2019, Jin wrote a memo titled “Virality Reduction as an Integrity Strategy” and posted it in a 4,200-person Workplace group for employees working on integrity problems. “There’s a growing set of research showing that some viral channels are used for bad more than they are used for good,” the memo began. “What should our principles be around how we approach this?” Jin went on to list, with voluminous links to internal research, how Facebook’s products routinely garnered higher growth rates at the expense of content quality and user safety. Features that produced marginal usage increases were disproportionately responsible for spam on WhatsApp, the explosive growth of hate groups, and the spread of false news stories via reshares, he wrote.</p> <p>None of the examples were new. Each of them had been previously cited by Product and Research teams as discrete problems that would require either a design fix or extra enforcement. But Jin was framing them differently. In his telling, they were the inexorable result of Facebook’s efforts to speed up and grow the platform.</p> <p>The response from colleagues was enthusiastic. “Virality is the goal of tenacious bad actors distributing malicious content,” wrote one researcher. “Totally on board for this,” wrote another, who noted that virality helped inflame anti-Muslim sentiment in Sri Lanka after a terrorist attack. “This is 100% direction to go,” Brandon Silverman of CrowdTangle wrote.</p> <p>After more than fifty overwhelmingly positive comments, Jin ran into an objection from Jon Hegeman, the executive at News Feed who by then had been promoted to head of the team. Yes, Jin was probably right that viral content was disproportionately worse than nonviral content, Hegeman wrote, but that didn’t mean that the stuff was bad on average. ... Hegeman was skeptical. If Jin was right, he responded, Facebook should probably be taking drastic steps like shutting down all reshares, and the company wasn’t in much of a mood to try. “If we remove a small percentage of reshares from people’s inventory,” Hegeman wrote, “they decide to come back to Facebook less.” <hr> If Civic had thought Facebook’s leadership would be rattled by the discovery that the company’s growth efforts had been making Facebook’s integrity problems worse, they were wrong. Not only was Zuckerberg hostile to future anti-growth work; he was beginning to wonder whether some of the company’s past integrity efforts were misguided.</p> <p>Empowered to veto not just new integrity proposals but work that had long ago been approved, the Public Policy team began declaring that some failed to meet the company’s standards for “legitimacy.” Sparing Sharing, the demotion of content pushed by hyperactive users—already dialed down by 80 percent at its adoption—was set to be dialed back completely. (It was ultimately spared but further watered down.)</p> <p>“We cannot assume links shared by people who shared a lot are bad,” a writeup of plans to undo the change said. (In practice, the effect of rolling back Sparing Sharing, even in its weakened form, was unambiguous. 
Views of “ideologically extreme content for users of all ideologies” would immediately rise by a double-digit percentage, with the bulk of the gains going to the far right.)</p> <p>“Informed Sharing”—an initiative that had demoted content shared by people who hadn’t clicked on the posts in question, and which had proved successful in diminishing the spread of fake news—was also slated for decommissioning.</p> <p>“Being less likely to share content after reading it is not a good indicator of integrity,” stated a document justifying the planned discontinuation.</p> <p>A company spokeswoman denied numerous Integrity staffers’ contention that the Public Policy team had the ability to veto or roll back integrity changes, saying that Kaplan’s team was just one voice among many internally. But, regardless of who was calling the shots, the company’s trajectory was clear. Facebook wasn’t just slow-walking integrity work anymore. It was actively planning to undo large chunks of it. <hr> Facebook could be certain of meeting its goals for the 2020 election if it was willing to slow down viral features. This could include imposing limits on reshares, message forwarding, and aggressive algorithmic amplification—the kind of steps that the Integrity teams throughout Facebook had been pushing to adopt for more than a year. The moves would be simple and cheap. Best of all, the methods had been tested and guaranteed success in combating longstanding problems.</p> <p>The correct choice was obvious, Jin suggested, but Facebook seemed strangely unwilling to take it. It would mean slowing down the platform’s growth, the one tenet that was inviolable.</p> <p>“Today the bar to ship a pro-Integrity win (that may be negative to engagement) often is higher than the bar to ship pro-engagement win (that may be negative to Integrity),” Jin lamented. If the situation didn’t change, he warned, it risked a 2020 election disaster from “rampant harmful virality.” <hr> Even including downranking, “we estimate that we may action as little as 3–5% of hate and 0.6% of [violence and incitement] on Facebook, despite being the best in the world at it,” one presentation noted. Jin knew these stats, according to people who worked with him, but was too polite to emphasize them. <hr> Company researchers used multiple methods to demonstrate QAnon’s gravitational pull, but the simplest and most visceral proof came from setting up a test account and seeing where Facebook’s algorithms took it.</p> <p>After setting up a dummy account for “Carol”—a hypothetical forty-one-year-old conservative woman in Wilmington, North Carolina, whose interests included the Trump family, Fox News, Christianity, and parenting—the researcher watched as Facebook guided Carol from those mainstream interests toward darker places.</p> <p>Within a day, Facebook’s recommendations had “devolved toward polarizing content.” Within a week, Facebook was pushing a “barrage of extreme, conspiratorial, and graphic content.” ... The researcher’s write-up included a plea for action: if Facebook was going to push content this hard, the company needed to get a lot more discriminating about what it pushed.</p> <p>Later write-ups would acknowledge that such warnings went unheeded. <hr> As executives filed out, Zuckerberg pulled Integrity’s Guy Rosen aside. 
“Why did you show me this in front of so many people?” Zuckerberg asked Rosen, who as Chakrabarti’s boss bore responsibility for his subordinate’s presentation landing on that day’s agenda.</p> <p>Zuckerberg had good reason to be unhappy that so many executives had watched him being told in plain terms that the forthcoming election was shaping up to be a disaster. In the course of investigating Cambridge Analytica, regulators around the world had already subpoenaed thousands of pages of documents from the company and had pushed for Zuckerberg’s personal communications going back for the better part of the decade. Facebook had paid $5 billion to the U.S. Federal Trade Commission to settle one of the most prominent inquiries, but the threat of subpoenas and depositions wasn’t going away. ... If there had been any doubt that Civic was the Integrity division’s problem child, lobbing such a damning document straight onto Zuckerberg’s desk settled it. As Chakrabarti later informed his deputies, Rosen told him that Civic would henceforth be required to run such material through other executives first—strictly for organizational reasons, of course.</p> <p>Chakrabarti didn’t take the reining in well. A few months later, he wrote a scathing appraisal of Rosen’s leadership as part of the company’s semiannual performance review. Facebook’s top integrity official was, he wrote, “prioritizing PR risk over social harm.” <hr> Facebook still hadn’t given Civic the green light to resume the fight against domestically coordinated political manipulation efforts. Its fact-checking program was too slow to effectively shut down the spread of misinformation during a crisis. And the company still hadn’t addressed the “perverse incentives” resulting from News Feed’s tendency to favor divisive posts. “Remains unclear if we have a societal responsibility to reduce exposure to this type of content,” an updated presentation from Civic tartly stated.</p> <p>“Samidh was trying to push Mark into making those decisions, but he didn’t take the bait,” Harbath recalled. <hr> Cutler remarked that she would have pushed for Chakrabarti’s ouster if she didn’t expect a substantial portion of his team would mutiny. (The company denies Cutler said this.) <hr> a British study had found that Instagram had the worst effect of any social media app on the health and well-being of teens and young adults. <hr> The second was the death of Molly Russell, a fourteen-year-old from North London. Though “apparently flourishing,” as a later coroner’s inquest found, Russell had died by suicide in late 2017. Her death was treated as an inexplicable local tragedy until the BBC ran a report on social media activity in 2019. Russell had followed a large group of accounts that romanticized depression, self-harm, and suicide, and she had engaged with more than 2,100 macabre posts, mostly on Instagram. Her final login had come at 12:45 the morning she died.</p> <p>“I have no doubt that Instagram helped kill my daughter,” her father told the BBC.</p> <p>Later research—both inside and outside Instagram—would demonstrate that a class of commercially motivated accounts had seized on depression-related content for the same reason that others focused on car crashes or fighting: the stuff pulled high engagement. But serving pro-suicide content to a vulnerable kid was clearly indefensible, and the platform pledged to remove and restrict the recommendation of such material, along with hiding hashtags like #Selfharm.
Beyond exposing an operational failure, the extensive coverage of Russell’s death associated Instagram with rising concerns about teen mental health. <hr> Though much attention, both inside and outside the company, had been paid to bullying, the most serious risks weren’t the result of people mistreating each other. Instead, the researchers wrote, harm arose when a user’s existing insecurities combined with Instagram’s mechanics. “Those who are dissatisfied with their lives are more negatively affected by the app,” one presentation noted, with the effects most pronounced among girls unhappy with their bodies and social standing.</p> <p>There was a logic here, one that teens themselves described to researchers. Instagram’s stream of content was a “highlight reel,” at once real life and unachievable. This was manageable for users who arrived in a good frame of mind, but it could be poisonous for those who showed up vulnerable. Seeing comments about how great an acquaintance looked in a photo would make a user who was unhappy about her weight feel bad—but it didn’t make her stop scrolling.</p> <p>“They often feel ‘addicted’ and know that what they’re seeing is bad for their mental health but feel unable to stop themselves,” the “Teen Mental Health Deep Dive” presentation noted. Field research in the U.S. and U.K. found that more than 40 percent of Instagram users who felt “unattractive” traced that feeling to Instagram. Among American teens who said they had thought about dying by suicide in the past month, 6 percent said the feeling originated on the platform. In the U.K., the number was double that.</p> <p>“Teens who struggle with mental health say Instagram makes it worse,” the presentation stated. “Young people know this, but they don’t adopt different patterns.”</p> <p>These findings weren’t dispositive, but they were unpleasant, in no small part because they made sense. Teens said—and researchers appeared to accept—that certain features of Instagram could aggravate mental health issues in ways beyond its social media peers. Snapchat had a focus on silly filters and communication with friends, while TikTok was devoted to performance. Instagram, though? It revolved around bodies and lifestyle. The company disowned these findings after they were made public, calling the researchers’ apparent conclusion that Instagram could harm users with preexisting insecurities unreliable. The company would dispute allegations that it had buried negative research findings as “plain false.” <hr> Facebook had deployed a comment-filtering system to prevent the heckling of public figures such as Zuckerberg during livestreams, burying not just curse words and complaints but also substantive discussion of any kind. The system had been tuned for sycophancy, and poorly at that. The irony of heavily censoring comments on a speech about free speech wasn’t hard to miss. <hr> CrowdTangle’s rundown of that Tuesday’s top content had, it turned out, included a butthole. This wasn’t a borderline picture of someone’s ass. It was an unmistakable, up-close image of an anus. It hadn’t just gone big on Facebook—it had gone biggest. Holding the number one slot, it was the lead item that executives had seen when they opened Silverman’s email. “I hadn’t put Mark or Sheryl on it, but I basically put everyone else on there,” Silverman said.</p> <p>The picture was a thumbnail outtake from a porn video that had escaped Facebook’s automated filters. 
Such errors were to be expected, but was Facebook’s familiarity with its platform so poor that it wouldn’t notice when its systems started spreading that content to millions of people?</p> <p>Yes, it unquestionably was. <hr> In May, a data scientist working on integrity posted a Workplace note titled “Facebook Creating a Big Echo Chamber for ‘the Government and Public Health Officials Are Lying to Us’ Narrative—Do We Care?”</p> <p>Just a few months into the pandemic, groups devoted to opposing COVID lockdown measures had become some of the most widely viewed on the platform, pushing false claims about the pandemic under the guise of political activism. Beyond serving as an echo chamber for alternating claims that the virus was a Chinese plot and that the virus wasn’t real, the groups served as a staging area for platform-wide assaults on mainstream medical information. ... An analysis showed these groups had appeared abruptly, and while they had ties to well-established anti-vaccination communities, they weren’t arising organically. Many shared near-identical names and descriptions, and an analysis of their growth showed that “a relatively small number of people” were sending automated invitations to “hundreds or thousands of users per day.”</p> <p>Most of this didn’t violate Facebook’s rules, the data scientist noted in his post. Claiming that COVID was a plot by Bill Gates to enrich himself from vaccines didn’t meet Facebook’s definition of “imminent harm.” But, he said, the company should think about whether it was merely reflecting a widespread skepticism of COVID or creating one.</p> <p>“This is severely impacting public health attitudes,” a senior data scientist responded. “I have some upcoming survey data that suggests some baaaad results.” <hr> President Trump was gearing up for reelection and he took to his platform of choice, Twitter, to launch what would become a monthslong attempt to undermine the legitimacy of the November 2020 election. “There is no way (ZERO!) that Mail-In Ballots will be anything less than substantially fraudulent,” Trump wrote. As was standard for Trump’s tweets, the message was cross-posted on Facebook.</p> <p>Under the tweet, Twitter included a small alert that encouraged users to “Get the facts about mail-in ballots.” Anyone clicking on it was informed that Trump’s allegations of a “rigged” election were false and there was no evidence that mail-in ballots posed a risk of fraud.</p> <p>Twitter had drawn its line. Facebook now had to choose where it stood. Monika Bickert, Facebook’s head of Content Policy, declared that Trump’s post was right on the edge of the sort of misinformation about “methods for voting” that the company had already pledged to take down.</p> <p>Zuckerberg didn’t have a strong position, so he went with his gut and left it up. But then he went on Fox News to attack Twitter for doing the opposite. “I just believe strongly that Facebook shouldn’t be the arbiter of truth of everything that people say online,” he told host Dana Perino. “Private companies probably shouldn’t be, especially these platform companies, shouldn’t be in the position of doing that.”</p> <p>The interview caused some tumult inside Facebook. Why would Zuckerberg encourage Trump’s testing of the platform’s boundaries by declaring its tolerance of the post a matter of principle? The perception that Zuckerberg was kowtowing to Trump was about to get a lot worse. 
On the day of his Fox News interview, protests over the recent killing of George Floyd by Minneapolis police officers had gone national, and the following day the president tweeted that “when the looting starts, the shooting starts”—a notoriously menacing phrase used by a white Miami police chief during the civil rights era.</p> <p>Declaring that Trump had violated its rules against glorifying violence, Twitter took the rare step of limiting the public’s ability to see the tweet—users had to click through a warning to view it, and they were prevented from liking or retweeting it.</p> <p>Over on Facebook, where the message had been cross-posted as usual, the company’s classifier for violence and incitement estimated it had just under a 90 percent probability of breaking the platform’s rules—just shy of the threshold that would get a regular user’s post automatically deleted.</p> <p>Trump wasn’t a regular user, of course. As a public figure, arguably the world’s most public figure, his account and posts were protected by dozens of different layers of safeguards. <hr> Facebook drew up a list of accounts that were immune to some or all immediate enforcement actions. If those accounts appeared to break Facebook’s rules, the issue would go up the chain of Facebook’s hierarchy and a decision would be made on whether to take action against the account or not. Every social media platform ended up creating similar lists—it didn’t make sense to adjudicate complaints about heads of state, famous athletes, or persecuted human rights advocates in the same way the companies did with run-of-the-mill users. The problem was that, like a lot of things at Facebook, the company’s process got particularly messy.</p> <p>For Facebook, the risks that arose from shielding too few users were seen as far greater than the risks of shielding too many. Erroneously removing a bigshot’s content could unleash public hell—in Facebook parlance, a “media escalation” or, that most dreaded of events, a “PR fire.” Hours or days of coverage would follow when Facebook erroneously removed posts from breast cancer victims or activists of all stripes. When it took down a photo of a risqué French magazine cover posted to Instagram by the American singer Rihanna in 2014, it nearly caused an international incident. As internal reviews of the system later noted, the incentive was to shield as heavily as possible any account with enough clout to cause undue attention.</p> <p>No one team oversaw XCheck, and the term didn’t even have a specific definition. There were endless varieties and gradations applied to advertisers, posts, pages, and politicians, with hundreds of engineers around the company coding different flavors of protections and tagging accounts as needed. Eventually, at least 6 million accounts and pages were enrolled into XCheck, with an internal guide stating that an entity should be “newsworthy,” “influential or popular,” or “PR risky” to qualify. On Instagram, XCheck even covered popular animal influencers, including Doug the Pug.</p> <p>Any Facebook employee who knew the ropes could go into the system and flag accounts for special handling. XCheck was used by more than forty teams inside the company. Sometimes there were records of how they had deployed it and sometimes there were not. 
Later reviews would find that XCheck’s protections had been granted to “abusive accounts” and “persistent violators” of Facebook’s rules.</p> <p>The job of giving a second review to violating content from high-profile users would require a sizable team of full-time employees. Facebook simply never staffed one. Flagged posts were put into a queue that no one ever considered, sweeping already once-validated complaints under the digital rug. “Because there was no governance or rigor, those queues might as well not have existed,” recalled someone who worked with the system. “The interest was in protecting the business, and that meant making sure we don’t take down a whale’s post.”</p> <p>The stakes could be high. XCheck protected high-profile accounts, including in Myanmar, where public figures were using Facebook to incite genocide. It shielded the account of British far-right figure Tommy Robinson, an investigation by Britain’s Channel Four revealed in 2018.</p> <p>One of the most explosive cases was that of Brazilian soccer star Neymar, whose 150 million Instagram followers placed him among the platform’s top twenty influencers. After a woman accused Neymar of rape in 2019, he accused the woman of extorting him and posted Facebook and Instagram videos defending himself—and showing viewers his WhatsApp correspondence with his accuser, which included her name and nude photos of her. Facebook’s procedure for handling the posting of “non-consensual intimate imagery” was simple: delete it. But Neymar was protected by XCheck. For more than a day, the system blocked Facebook’s moderators from removing the video. An internal review of the incident found that 56 million Facebook and Instagram users saw what Facebook described in a separate document as “revenge porn,” exposing the woman to what an employee referred to in the review as “ongoing abuse” from other users.</p> <p>Facebook’s operational guidelines stipulate that not only should unauthorized nude photos be deleted, but people who post them should have their accounts deleted. Faced with the prospect of scrubbing one of the world’s most famous athletes from its platform, Facebook blinked.</p> <p>“After escalating the case to leadership,” the review said, “we decided to leave Neymar’s accounts active, a departure from our usual ‘one strike’ profile disable policy.”</p> <p>Facebook knew that providing preferential treatment to famous and powerful users was problematic at best and unacceptable at worst. “Unlike the rest of our community, these people can violate our standards without any consequences,” a 2019 review noted, calling the system “not publicly defensible.”</p> <p>Nowhere did XCheck interventions occur more than in American politics, especially on the right. <hr> When a high-enough-profile account was conclusively found to have broken Facebook’s rules, the company would delay taking action for twenty-four hours, during which it tried to convince the offending party to remove the offending post voluntarily. The program served as an invitation for privileged accounts to play at the edge of Facebook’s tolerance. If they crossed the line, they could simply take it back, having already gotten most of the traffic they would receive anyway. (Along with Diamond and Silk, every member of Congress ended up being granted the self-remediation window.)</p> <p>Sometimes Kaplan himself got directly involved. 
According to documents first obtained by BuzzFeed, the global head of Public Policy was not above either pushing employees to lift penalties against high-profile conservatives for spreading false information or leaning on Facebook’s fact-checkers to alter their verdicts.</p> <p>An understanding began to dawn among the politically powerful: if you mattered enough, Facebook would often cut you slack. Prominent entities rightly treated any significant punishment as a sign that Facebook didn’t consider them worthy of white-glove treatment. To prove the company wrong, they would scream as loudly as they could in response.</p> <p>“Some of these people were real gems,” recalled Harbath. In Facebook’s Washington, DC, office, staffers would explicitly justify blocking penalties against “Activist Mommy,” a Midwestern Christian account with a penchant for anti-gay rhetoric, because she would immediately go to the conservative press.</p> <p>Facebook’s fear of messing up with a major public figure was so great that some achieved a status beyond XCheck and were whitelisted altogether, rendering even their most vile content immune from penalties, downranking, and, in some cases, even internal review. <hr> Other Civic colleagues and Integrity staffers piled into the comments section to concur. “If our goal, was say something like: have less hate, violence etc. on our platform to begin with instead of remove more hate, violence etc. our solutions and investments would probably look quite different,” one wrote.</p> <p>Rosen was getting tired of dealing with Civic. Zuckerberg, who famously did not like to revisit decisions once they were made, had already dictated his preferred approach: automatically remove content if Facebook’s classifiers were highly confident that it broke the platform’s rules and take “soft” actions such as demotions when the systems predicted a violation was more likely than not. These were the marching orders and the only productive path forward was to diligently execute them. <hr> The week before, the Wall Street Journal had published a story my colleague Newley Purnell and I cowrote about how Facebook had exempted a firebrand Hindu politician from its hate speech enforcement. There had been no question that Raja Singh, a member of the Telangana state parliament, was inciting violence. He gave speeches calling for Rohingya immigrants who fled genocide in Myanmar to be shot, branded all Indian Muslims traitors, and threatened to raze mosques. He did these things while building an audience of more than 400,000 followers on Facebook. Earlier that year, police in Hyderabad had placed him under house arrest to prevent him from leading supporters to the scene of recent religious violence.</p> <p>That Facebook did nothing in the face of such rhetoric could have been due to negligence—there were a lot of firebrand politicians offering a lot of incitement in a lot of different languages around the world. But in this case, Facebook was well aware of Singh’s behavior. Indian civil rights groups had brought him to the attention of staff in both Delhi and Menlo Park as part of their efforts to pressure the company to act against hate speech in the country.</p> <p>There was no question whether Singh qualified as a “dangerous individual,” someone who would normally be barred from having a presence on Facebook’s platforms. 
Despite the internal conclusion that Singh and several other Hindu nationalist figures were creating a risk of actual bloodshed, their designation as hate figures had been blocked by Ankhi Das, Facebook’s head of Indian Public Policy—the same executive who had lobbied years earlier to reinstate BJP-associated pages after Civic had fought to take them down.</p> <p>Das, whose job included lobbying India’s government on Facebook’s behalf, didn’t bother trying to justify protecting Singh and other Hindu nationalists on technical or procedural grounds. She flatly said that designating them as hate figures would anger the government, and the ruling BJP, so the company would not be doing it. ... Following our story, Facebook India’s then–managing director Ajit Mohan assured the company’s Muslim employees that we had gotten it wrong. Facebook removed hate speech “as soon as it became aware of it” and would never compromise its community standards for political purposes. “While we know there is more to do, we are making progress every day,” he wrote.</p> <p>It was after we published the story that Kiran (a pseudonym) reached out to me. They wanted to make clear that our story in the Journal had just scratched the surface. Das’s ties with the government were far tighter than we understood, they said, and Facebook India was protecting entities much more dangerous than Singh. <hr> “Hindus, come out. Die or kill,” one prominent activist had declared during a Facebook livestream, according to a later report by retired Indian civil servants. The ensuing violence left fifty-three people dead and swaths of northeastern Delhi burned. <hr> The researcher set up a dummy account while traveling. Because the platform factored a user’s geography into content recommendations, she and a colleague noted in a writeup of her findings, it was the only way to get a true read on what the platform was serving up to a new Indian user.</p> <p>Ominously, her summary of what Facebook had recommended to their notional twenty-one-year-old Indian woman began with a trigger warning for graphic violence. While Facebook’s push of American test users toward conspiracy theories had been concerning, the Indian version was dystopian.</p> <p>“In the 3 weeks since the account has been opened, by following just this recommended content, the test user’s News Feed has become a near constant barrage of polarizing nationalist content, misinformation, and violence and gore,” the note stated. The dummy account’s feed had turned especially dark after border skirmishes between Pakistan and India in early 2019. Amid a period of extreme military tensions, Facebook funneled the user toward groups filled with content promoting full-scale war and mocking images of corpses with laughing emojis.</p> <p>This wasn’t a case of bad posts slipping past Facebook’s defenses, or one Indian user going down a nationalistic rabbit hole. What Facebook was recommending to the young woman had been bad from the start. 
The platform had pushed her to join groups clogged with images of corpses, watch purported footage of fictional air strikes, and congratulate nonexistent fighter pilots on their bravery.</p> <p>“I’ve seen more images of dead people in the past three weeks than I’ve seen in my entire life, total,” the researcher wrote, noting that the platform had allowed falsehoods, dehumanizing rhetoric, and violence to “totally take over during a major crisis event.” Facebook needed to consider not only how its recommendation systems were affecting “users who are different from us,” she concluded, but rethink how it built its products for “non-US contexts.”</p> <p>India was not an outlier. Outside of English-speaking countries and Western Europe, users routinely saw more cruelty, engagement bait, and falsehoods. Perhaps differing cultural senses of propriety explained some of the gap, but a lot clearly stemmed from differences in investment and concern. <hr> This wasn’t supposed to be legal in the Gulf under the gray-market labor sponsorship system known as kafala, but the internet had removed the friction from buying people. Undercover reporters from BBC Arabic posed as a Kuwaiti couple and negotiated to buy a sixteen-year-old girl whose seller boasted about never allowing her to leave the house.</p> <p>Everyone told the BBC they were horrified. Kuwaiti police rescued the girl and sent her home. Apple and Google pledged to root out the abuse, and the bartering apps cited in the story deleted their “domestic help” sections. Facebook pledged to take action and deleted a popular hashtag used to advertise maids for sale.</p> <p>After that, the company largely dropped the matter. But Apple turned out to have a longer attention span. In October, after sending Facebook numerous examples of ongoing maid sales via Instagram, it threatened to remove Facebook’s products from its App Store.</p> <p>Unlike human trafficking, this, to Facebook, was a real crisis.</p> <p>“Removing our applications from Apple’s platforms would have had potentially severe consequences to the business, including depriving millions of users of access to IG &amp; FB,” an internal report on the incident stated.</p> <p>With alarm bells ringing at the highest levels, the company found and deleted an astonishing 133,000 posts, groups, and accounts related to the practice within days. It also performed a quick revamp of its policies, reversing a previous rule allowing the sale of maids through “brick and mortar” businesses. (To avoid upsetting the sensibilities of Gulf State “partners,” the company had previously permitted the advertising and sale of servants by businesses with a physical address.) Facebook also committed to “holistic enforcement against any and all content promoting domestic servitude,” according to the memo.</p> <p>Apple lifted its threat, but again Facebook wouldn’t live up to its pledges. Two years later, in late 2021, an Integrity staffer would write up an investigation titled “Domestic Servitude: This Shouldn’t Happen on FB and How We Can Fix It.” Focused on the Philippines, the memo described how fly-by-night employment agencies were recruiting women with “unrealistic promises” and then selling them into debt bondage overseas. If Instagram was where domestic servants were sold, Facebook was where they were recruited.</p> <p>Accessing the direct-messaging inboxes of the placing agencies, the staffer found Filipina domestic servants pleading for help. Some reported rape or sent pictures of bruises from being hit. 
Others hadn’t been paid in months. Still others reported being locked up and starved. The labor agencies didn’t help.</p> <p>The passionately worded memo, and others like it, listed numerous things the company could do to prevent the abuse. There were improvements to classifiers, policy changes, and public service announcements to run. Using machine learning, Facebook could identify Filipinas who were looking for overseas work and then inform them of how to spot red flags in job postings. In Persian Gulf countries, Instagram could run PSAs about workers’ rights.</p> <p>These things largely didn’t happen for a host of reasons. One memo noted a concern that, if worded too strongly, Arabic-language PSAs admonishing against the abuse of domestic servants might “alienate buyers” of them. But the main obstacle, according to people familiar with the team, was simply resources. The team devoted full-time to human trafficking—which included not just the smuggling of people for labor and sex but also the sale of human organs—amounted to a half-dozen people worldwide. The team simply wasn’t large enough to knock this stuff out. <hr> “We’re largely blind to problems on our site,” Leach’s presentation wrote of Ethiopia.</p> <p>Facebook employees produced a lot of internal work like this: declarations that the company had gotten in over its head, unable to provide even basic remediation to potentially horrific problems. Events on the platform could foreseeably lead to loss of life and almost certainly did, according to human rights groups monitoring Ethiopia. Meareg Amare, a university lecturer in Addis Ababa, was murdered outside his home one month after a post went viral, receiving 35,000 likes, listing his home address and calling for him to be attacked. Facebook failed to remove it. His family is now suing the company.</p> <p>As it so often did, the company was choosing growth over quality. Efforts to expand service to poorer and more isolated places would not wait for user protections to catch up, and, even in countries at “dire” risk of mass atrocities, the At Risk Countries team needed approval to do things that harmed engagement. <hr> Documents and transcripts of internal meetings among the company’s American staff show employees struggling to explain why Facebook wasn’t following its normal playbook when dealing with hate speech, the coordination of violence, and government manipulation in India. Employees in Menlo Park discussed the BJP’s promotion of the “Love Jihad” lie. They met with human rights organizations that documented the violence committed by the platform’s cow-protection vigilantes. And they tracked efforts by the Indian government and its allies to manipulate the platform via networks of accounts. Yet nothing changed.</p> <p>“We have a lot of business in India, yeah. 
And we have connections with the government, I guess, so there are some sensitivities around doing a mitigation in India,” one employee told another about the company’s protracted failure to address abusive behavior by an Indian intelligence service.</p> <p>During another meeting, a team working on what it called the problem of “politicized hate” informed colleagues that the BJP and its allies were coordinating both the “Love Jihad” slander and another hashtag, #CoronaJihad, premised on the idea that Muslims were infecting Hindus with COVID via halal food.</p> <p>The Rashtriya Swayamsevak Sangh, or RSS—the umbrella Hindu nationalist movement of which the BJP is the political arm—was promoting these slanders through 6,000 or 7,000 different entities on the platform, with the goal of portraying Indian Muslims as subhuman, the presenter explained. Some of the posts said that the Quran encouraged Muslim men to rape their female family members.</p> <p>“What they’re doing really permeates Indian society,” the presenter noted, calling it part of a “larger war.”</p> <p>A colleague at the meeting asked the obvious question. Given the company’s conclusive knowledge of the coordinated hate campaign, why hadn’t the posts or accounts been taken down?</p> <p>“Ummm, the answer that I’ve received for the past year and a half is that it’s too politically sensitive to take down RSS content as hate,” the presenter said.</p> <p>Nothing needed to be said in response.</p> <p>“I see your face,” the presenter said. “And I totally agree.” <hr> One incident in particular, involving a local political candidate, stuck out. As Kiran recalled it, the guy was a little fish, a Hindu nationalist activist who hadn’t achieved Raja Singh’s six-digit follower count but was still a provocateur. The man’s truly abhorrent behavior had been repeatedly flagged by lower-level moderators, but somehow the company always seemed to give it a pass.</p> <p>This time was different. The activist had streamed a video in which he and some accomplices kidnapped a man who, they informed the camera, had killed a cow. They took their captive to a construction site and assaulted him while Facebook users heartily cheered in the comments section. <hr> Zuckerberg launched an internal campaign against social media overenforcement. Ordering the creation of a team dedicated to preventing wrongful content takedowns, Zuckerberg demanded regular briefings on its progress from senior employees. He also suggested that, instead of rigidly enforcing platform rules on content in Groups, Facebook should defer more to the sensibilities of the users in them. In response, a staffer proposed entirely exempting private groups from enforcement for “low-tier hate speech.” <hr> The stuff was viscerally terrible—people clamoring for lynchings and civil war. One group was filled with “enthusiastic calls for violence every day.” Another top group claimed it was set up by Trump-supporting patriots but was actually run by “financially motivated Albanians” directing a million views daily to fake news stories and other provocative content.</p> <p>The comments were often worse than the posts themselves, and even this was by design. The content of the posts would be incendiary but fall just shy of Facebook’s boundaries for removal—it would be bad enough, however, to harvest user anger, classic “hate bait.” The administrators were professionals, and they understood the platform’s weaknesses every bit as well as Civic did. 
In News Feed, anger would rise like a hot-air balloon, and such comments could take a group to the top.</p> <p>Public Policy had previously refused to act on hate bait <hr> We have heavily overpromised regarding our ability to moderate content on the platform,” one data scientist wrote to Rosen in September. “We are breaking and will continue to break our recent promises.” <hr> The longstanding conflicts between Civic and Facebook’s Product, Policy, and leadership teams had boiled over in the wake of the “looting/shooting” furor, and executives—minus Chakrabarti—had privately begun discussing how to address what was now unquestionably viewed as a rogue Integrity operation. Civic, with its dedicated engineering staff, hefty research operation, and self-chosen mission statement, was on the chopping block. <hr> The group had grown to more than 360,000 members less than twenty-four hours later when Facebook took it down, citing “extraordinary measures.” Pushing false claims of election fraud to a mass audience at a time when armed men were calling for a halt to vote counting outside tabulation centers was an obvious problem, and one that the company knew was only going to get bigger. Stop the Steal had an additional 2.1 million users pending admission to the group when Facebook pulled the plug.</p> <p>Facebook’s leadership would describe Stop the Steal’s growth as unprecedented, though Civic staffers could be forgiven for not sharing their sense of surprise. <hr> Zuckerberg had accepted the deletion under emergency circumstances, but he didn’t want the Stop the Steal group’s removal to become a precedent for a backdoor ban on false election claims. During the run-up to Election Day, Facebook had removed only lies about the actual voting process—stuff like “Democrats vote on Wednesday” and “People with outstanding parking tickets can’t go to the polls.” Noting the thin distinction between the claim that votes wouldn’t be counted and that they wouldn’t be counted accurately, Chakrabarti had pushed to take at least some action against baseless election fraud claims.</p> <p>Civic hadn’t won that fight, but with the Stop the Steal group spawning dozens of similarly named copycats—some of which also accrued six-figure memberships—the threat of further organized election delegitimization efforts was obvious.</p> <p>Barred from shutting down the new entities, Civic assigned staff to at least study them. Staff also began tracking top delegitimization posts, which were earning tens of millions of views, for what one document described as “situational awareness.” A later analysis found that as much as 70 percent of Stop the Steal content was coming from known “low news ecosystem quality” pages, the commercially driven publishers that Facebook’s News Feed integrity staffers had been trying to fight for years. <hr> Zuckerberg overruled both Facebook’s Civic team and its head of counterterrorism. Shortly after the Associated Press called the presidential election for Joe Biden on November 7—the traditional marker for the race being definitively over—Molly Cutler assembled roughly fifteen executives that had been responsible for the company’s election preparation. Citing orders from Zuckerberg, she said the election delegitimization monitoring was to immediately stop. <hr> On December 17, a data scientist flagged that a system responsible for either deleting or restricting high-profile posts that violated Facebook’s rules had stopped doing so. 
Colleagues ignored it, assuming that the problem was just a “logging issue”—meaning the system still worked, it just wasn’t recording its actions. On the list of Facebook’s engineering priorities, fixing that didn’t rate.</p> <p>In fact, the system truly had failed, in early November. Between then and when engineers realized their error in mid-January, the system had given a pass to 3,100 highly viral posts that should have been deleted or labeled “disturbing.”</p> <p>Glitches like that happened all the time at Facebook. Unfortunately, this one produced an additional 8 billion “regrettable” views globally, instances in which Facebook had shown users content that it knew was trouble. The company would later say that only a small minority of the 8 billion “regrettable” content views touched on American politics, and that the mistake was immaterial to subsequent events. A later review of Facebook’s post-election work tartly described the flub as a “lowlight” of the platform’s 2020 election performance, though the company disputes that it had a meaningful impact. At least 7 billion of the bad content views were international, the company says, and of the American material only a portion dealt with politics. Overall, a spokeswoman said, the company remains proud of its pre- and post-election safety work. <hr> Zuckerberg vehemently disagreed with people who said that the COVID vaccine was unsafe, but he supported their right to say it, including on Facebook. ... Under Facebook’s policy, health misinformation about COVID was to be removed only if it posed an imminent risk of harm, such as a post telling infected people to drink bleach ... A researcher randomly sampled English-language comments containing phrases related to COVID and vaccines. A full two-thirds were anti-vax. The researcher’s memo compared that figure to public polling showing the prevalence of anti-vaccine sentiment in the U.S.—it was a full 40 points lower.</p> <p>Additional research found that a small number of “big whales” was behind a large portion of all anti-vaccine content on the platform. Of 150,000 posters in Facebook groups that were eventually disabled for COVID misinformation, just 5 percent were producing half of all posts. And just 1,400 users were responsible for inviting half of all members. “We found, like many problems at FB, this is a head-heavy problem with a relatively few number of actors creating a large percentage of the content and growth,” Facebook researchers would later note.</p> <p>One of the anti-vax brigade’s favored tactics was to piggyback on posts from entities like UNICEF and the World Health Organization encouraging vaccination, which Facebook was promoting free of charge. Anti-vax activists would respond with misinformation or derision in the comments section of these posts, then boost one another’s hostile comments toward the top slot <hr> Even as Facebook prepared for virally driven crises to become routine, the company’s leadership was becoming increasingly comfortable absolving its products of responsibility for feeding them. By the spring of 2021, it wasn’t just Boz arguing that January 6 was someone else’s problem. Sandberg suggested that January 6 was “largely organized on platforms that don’t have our abilities to stop hate.” Zuckerberg told Congress that they need not cast blame beyond Trump and the rioters themselves. 
“The country is deeply divided right now and that is not something that tech alone can fix,” he said.</p> <p>In some instances, the company appears to have publicly cited research in what its own staff had warned were inappropriate ways. A June 2020 review of both internal and external research had warned that the company should avoid arguing that higher rates of polarization among the elderly—the demographic that used social media least—was proof that Facebook wasn’t causing polarization.</p> <p>Though the argument was favorable to Facebook, researchers wrote, Nick Clegg should avoid citing it in an upcoming opinion piece because “internal research points to an opposite conclusion.” Facebook, it turned out, fed false information to senior citizens at such a massive rate that they consumed far more of it despite spending less time on the platform. Rather than vindicating Facebook, the researchers wrote, “the stronger growth of polarization for older users may be driven in part by Facebook use.”</p> <p>All the researchers wanted was for executives to avoid parroting a claim that Facebook knew to be wrong, but they didn’t get their wish. The company says the argument never reached Clegg. When he published a March 31, 2021, Medium essay titled “You and the Algorithm: It Takes Two to Tango,” he cited the internally debunked claim among the “credible recent studies” disproving that “we have simply been manipulated by machines all along.” (The company would later say that the appropriate takeaway from Clegg’s essay on polarization was that “research on the topic is mixed.”)</p> <p>Such bad-faith arguments sat poorly with researchers who had worked on polarization and analyses of Stop the Steal, but Clegg was a former politician hired to defend Facebook, after all. The real shock came from an internally published research review written by Chris Cox.</p> <p>Titled “What We Know About Polarization,” the April 2021 Workplace memo noted that the subject remained “an albatross public narrative,” with Facebook accused of “driving societies into contexts where they can’t trust each other, can’t share common ground, can’t have conversations about issues, and can’t share a common view on reality.”</p> <p>But Cox and his coauthor, Facebook Research head Pratiti Raychoudhury, were happy to report that a thorough review of the available evidence showed that this “media narrative” was unfounded. The evidence that social media played a contributing role in polarization, they wrote, was “mixed at best.” Though Facebook likely wasn’t at fault, Cox and Raychoudhury wrote, the company was still trying to help, in part by encouraging people to join Facebook groups. “We believe that groups are on balance a positive, depolarizing force,” the review stated.</p> <p>The writeup was remarkable for its choice of sources. Cox’s note cited stories by New York Times columnists David Brooks and Ezra Klein alongside early publicly released Facebook research that the company’s own staff had concluded was no longer accurate. 
At the same time, it omitted the company’s past conclusions, affirmed in another literature review just ten months before, that Facebook’s recommendation systems encouraged bombastic rhetoric from publishers and politicians, as well as previous work finding that seeing vicious posts made users report “more anger towards people with different social, political, or cultural beliefs.” While nobody could reliably say how Facebook altered users’ off-platform behavior, how the company shaped their social media activity was accepted fact. “The more misinformation a person is exposed to on Instagram the more trust they have in the information they see on Instagram,” company researchers had concluded in late 2020.</p> <p>In a statement, the company called the presentation “comprehensive” and noted that partisan divisions in society arose “long before platforms like Facebook even existed.” For staffers that Cox had once assigned to work on addressing known problems of polarization, his note was a punch to the gut. <hr> In 2016, the New York Times had reported that Facebook was quietly working on a censorship tool in an effort to gain entry to the Chinese market. While the story was a monster, it didn’t come as a surprise to many people inside the company. Four months earlier, an engineer had discovered that another team had modified a spam-fighting tool in a way that would allow an outside party control over content moderation in specific geographic regions. In response, he had resigned, leaving behind a badge post correctly surmising that the code was meant to loop in Chinese censors.</p> <p>With a literary mic drop, the post closed out with a quote on ethics from Charlotte Brontë’s Jane Eyre: “Laws and principles are not for the times when there is no temptation: they are for such moments as this, when body and soul rise in mutiny against their rigour; stringent are they; inviolate they shall be. If at my individual convenience I might break them, what would be their worth?”</p> <p>Garnering 1,100 reactions, 132 comments, and 57 shares, the post took the program from top secret to open secret. Its author had just pioneered a new template: the hard-hitting Facebook farewell.</p> <p>That particular farewell came during a time when Facebook’s employee satisfaction surveys were generally positive, before the time of endless crisis, when societal concerns became top of mind. In the intervening years, Facebook had hired a massive base of Integrity employees to work on those issues, and seriously pissed off a nontrivial portion of them.</p> <p>Consequently, some badge posts began to take on a more mutinous tone. Staffers who had done groundbreaking work on radicalization, human trafficking, and misinformation would summarize both their accomplishments and where they believed the company had come up short on technical and moral grounds. Some broadsides against the company ended on a hopeful note, including detailed, jargon-light instructions for how, in the future, their successors could resurrect the work.</p> <p>These posts were gold mines for Haugen, connecting product proposals, experimental results, and ideas in ways that would have been impossible for an outsider to re-create. She photographed not just the posts themselves but the material they linked to, following the threads to other topics and documents. A half dozen were truly incredible, unauthorized chronicles of Facebook’s dawning understanding of the way its design determined what its users consumed and shared. 
The authors of these documents hadn’t been trying to push Facebook toward social engineering—they had been warning that the company had already wandered into doing so and was now neck deep. <hr> The researchers’ best understanding was summarized this way: “We make body image issues worse for one in three teen girls.” <hr> In 2020, Instagram’s Well-Being team had run a study of massive scope, surveying 100,000 users in nine countries about negative social comparison on Instagram. The researchers then paired the answers with individualized data on how each user who took the survey had behaved on Instagram, including how and what they posted. They found that, for a sizable minority of users, especially those in Western countries, Instagram was a rough place. Ten percent reported that they “often or always” felt worse about themselves after using the platform, and a quarter believed Instagram made negative comparison worse.</p> <p>Their findings were incredibly granular. They found that fashion and beauty content produced negative feelings in ways that adjacent content like fitness did not. They found that “people feel worse when they see more celebrities in feed,” and that Kylie Jenner seemed to be unusually triggering, while Dwayne “The Rock” Johnson was no trouble at all. They found that people judged themselves far more harshly against friends than celebrities. A movie star’s post needed 10,000 likes before it caused social comparison, whereas, for a peer, the number was ten.</p> <p>In order to confront these findings, the Well-Being team suggested that the company cut back on recommending celebrities for people to follow, or reweight Instagram’s feed to include less celebrity and fashion content, or de-emphasize comments about people’s appearance. As a fellow employee noted in response to summaries of these proposals on Workplace, the Well-Being team was suggesting that Instagram become less like Instagram.</p> <p>“Isn’t that what IG is mostly about?” the man wrote. “Getting a peek at the (very photogenic) life of the top 0.1%? Isn’t that the reason why teens are on the platform?” <hr> “We are practically not doing anything,” the researchers had written, noting that Instagram wasn’t currently able to stop itself from promoting underweight influencers and aggressive dieting. A test account that signaled an interest in eating disorder content filled up with pictures of thigh gaps and emaciated limbs.</p> <p>The problem would be relatively easy for outsiders to document. Instagram was, the research warned, “getting away with it because no one has decided to dial into it.” <hr> He began the presentation by noting that 51 percent of Instagram users reported having a “bad or harmful” experience on the platform in the previous seven days. But only 1 percent of those users reported the objectionable content to the company, and Instagram took action in 2 percent of those cases. The math meant that the platform remediated only 0.02 percent of what upset users—just one bad experience out of every 5,000.</p> <p>“The numbers are probably similar on Facebook,” he noted, calling the statistics evidence of the company’s failure to understand the experiences of users such as his own daughter. Now sixteen, she had recently been told to “get back to the kitchen” after she posted about cars, Bejar said, and she continued receiving the unsolicited dick pics she had been getting since the age of fourteen. “I asked her why boys keep doing that? 
She said if the only thing that happens is they get blocked, why wouldn’t they?”</p> <p>Two years of research had confirmed that Joanna Bejar’s logic was sound. On a weekly basis, 24 percent of all Instagram users between the ages of thirteen and fifteen received unsolicited advances, Bejar informed the executives. Most of that abuse didn’t violate the company’s policies, and Instagram rarely caught the portion that did. <hr> nothing highlighted the costs better than a Twitter bot set up by New York Times reporter Kevin Roose. Using methodology created with the help of a CrowdTangle staffer, Roose found a clever way to put together a daily top ten of the platform’s highest-engagement content in the United States, producing a leaderboard that demonstrated how thoroughly partisan publishers and viral content aggregators dominated the engagement signals that Facebook valued most.</p> <p>The degree to which that single automated Twitter account got under the skin of Facebook’s leadership would be difficult to overstate. Alex Schultz, the VP who oversaw Facebook’s Growth team, was especially incensed—partly because he considered raw engagement counts to be misleading, but more because it was Facebook’s own tool reminding the world every morning at 9:00 a.m. Pacific that the platform’s content was trash.</p> <p>“The reaction was to prove the data wrong,” recalled Brian Boland. But efforts to employ other methodologies only produced top ten lists that were nearly as unflattering. Schultz began lobbying to kill off CrowdTangle altogether, replacing it with periodic top content reports of its own design. That would still be more transparency than any of Facebook’s rivals offered, Schultz noted.</p> <p>...</p> <p>Schultz handily won the fight. In April 2021, Silverman convened his staff on a conference call and told them that CrowdTangle’s team was being disbanded. ... “Boz would just say, ‘You’re completely off base,’ ” Boland said. “Data wins arguments at Facebook, except for this one.” <hr> When the company issued its response later in May, I read the document with a clenched jaw. Facebook had agreed to grant the board’s request for information about XCheck and “any exceptional processes that apply to influential users.”</p> <p>...</p> <p>“We want to make clear that we remove content from Facebook, no matter who posts it,” Facebook’s response to the Oversight Board read. “Cross check simply means that we give some content from certain Pages or Profiles additional review.”</p> <p>There was no mention of whitelisting, of C-suite interventions to protect famous athletes, of queues of likely violating posts from VIPs that never got reviewed. Although our documents showed that at least 7 million of the platform’s most prominent users were shielded by some form of XCheck, Facebook assured the board that it applied to only “a small number of decisions.” The only XCheck-related request that Facebook didn’t address was for data that might show whether XChecked users had received preferential treatment.</p> <p>“It is not feasible to track this information,” Facebook responded, neglecting to mention that it was exempting some users from enforcement entirely. <hr> “I’m sure many of you have found the recent coverage hard to read because it just doesn’t reflect the company we know,” he wrote in a note to employees that was also shared on Facebook.
The allegations didn’t even make sense, he wrote: “I don’t know any tech company that sets out to build products that make people angry or depressed.”</p> <p>Zuckerberg said he worried the leaks would discourage the tech industry at large from honestly assessing their products’ impact on the world, in order to avoid the risk that internal research might be used against them. But he assured his employees that their company’s internal research efforts would stand strong. “Even though it might be easier for us to follow that path, we’re going to keep doing research because it’s the right thing to do,” he wrote.</p> <p>By the time Zuckerberg made that pledge, research documents were already disappearing from the company’s internal systems. Had a curious employee wanted to double-check Zuckerberg’s claims about the company’s polarization work, for example, they would have found that key research and experimentation data had become inaccessible.</p> <p>The crackdown had begun. <hr> One memo required researchers to seek special approval before delving into anything on a list of topics requiring “mandatory oversight”—even as a manager acknowledged that the company did not maintain such a list. <hr> The “Narrative Excellence” memo and its accompanying notes and charts were a guide to producing documents that reporters like me wouldn’t be excited to see. Unfortunately, as a few bold user experience researchers noted in the replies, achieving Narrative Excellence was all but incompatible with succeeding at their jobs. Writing things that were “safer to be leaked” meant writing things that would have less impact.</p> </blockquote> <h2 id="appendix-non-statements">Appendix: non-statements</h2> <p>I really like the &quot;non-goals&quot; section of design docs. I think the analogous non-statements section of a doc like this is much less valuable because <a href="https://jakeseliger.com/2022/01/31/most-people-dont-read-carefully-or-for-comprehension/">the top-level non-statements can generally be inferred by reading this doc</a>, whereas top-level non-goals often add information, but I figured I'd try this out anyway.</p> <ul> <li>Facebook (or any other company named here, like Uber) is uniquely bad <ul> <li>As discussed, on the contrary, I think Facebook isn't very atypical, which is why this post discusses examples from so many companies other than Facebook</li> </ul></li> <li>Zuckerberg (or any other person named) is uniquely bad</li> <li>Big tech employees are bad people</li> <li>No big tech company employees are working hard or trying hard <ul> <li>For some reason, a common response to any criticism of a tech company foible or failure is &quot;people are working hard&quot;. This is almost never a response to a critique that nobody is working hard, and that is once again not the critique here</li> </ul></li> <li>Big tech companies should be broken up or otherwise have antitrust action taken against them <ul> <li>Maybe so, but this document doesn't make that case</li> </ul></li> <li>Bigger companies in the same industry are strictly worse than smaller companies <ul> <li>Discussed above, but I'll mention it again here</li> </ul></li> <li>The general bigness vs. smallness tradeoff as discussed here applies strictly across all areas and industries <ul> <li>Also mentioned above, but mentioned again here.
For example, the percentage of rides in which a taxi driver tries to scam the user seems much higher with traditional taxis than with Uber</li> </ul></li> <li>It's easy to do moderation and support at scale</li> <li>On average, large companies provide a worse experience for users <ul> <li>For example, I still use Amazon because it gives me the best overall experience. As noted above, cost and shipping are better with Amazon than with any other alternative. There are entire classes of items where most things I've bought are counterfeit, such as masks and respirators. When I bought these in January 2020, before these were something many people would buy, I got genuine 3M masks. Masks and filters were then hard to get for a while, and then when they became available again, the majority of 3M masks and filters I got were counterfeit (out of curiosity, I tried more than a few independent orders over the next few years). I try to avoid classes of items that have a high counterfeit rate (but a naive user who doesn't know to do this will buy a lot of low-quality counterfeits), and I know I'm rolling the dice every time I buy any expensive item (if I get a counterfeit or an empty box, Amazon might not accept the return or refund me unless I can make a viral post about the issue), and sometimes a class of item goes from being one where you can usually get good items to one where most items are counterfeit.</li> <li>Many objections are, implicitly or explicitly, about the average experience, but this is nonsensical when the discussion is about the experience in the tail; this is like the standard response you see when someone notes that a concurrency bug is a problem and someone else says it's fine because &quot;it works for me&quot;, which doesn't make sense for bugs that occur in the tail.</li> </ul></li> </ul> <div class="footnotes"> <hr /> <ol> <li id="fn:C">when Costco was smaller, I would've put Costco here instead of Best Buy, but as they've gotten bigger, I've noticed that their quality has gone down. It's really striking how (relatively) frequently I find sealed items like cheese that have gone bad long before their &quot;best by&quot; date, or items that are just totally broken. This doesn't appear to have anything to do with any particular location since I moved almost annually for close to a decade and observed this decline across many different locations (because I was moving, I at first thought that I'd gotten unlucky with where I'd moved to, but as I tried locations in various places, I realized that the decline wasn't specific to any location and seems to have impacted stores in both the U.S. and Canada). <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:B"><p><a href="https://archive.is/1lEoA">when the WSJ looked at leaked internal Meta documents, they found, among other things, that Meta estimated that 100k minors per day &quot;received photos of adult genitalia or other sexually abusive content&quot;</a>. Of course, <a href="https://mastodon.social/@danluu/111042448496012974">smart contrarians</a> will argue that this is totally normal, e.g., two of the first few comments on HN were about how there's nothing particularly wrong with this. Sure, it's bad for children to get harassed, but &quot;it can happen on any street corner&quot;, &quot;what's the base rate to compare against&quot;, etc.</p> <p>Very loosely, if we're liberal, we might estimate that Meta had 2.5B DAU in early 2021 and 500M were minors, or if we're conservative, maybe we guess that 100M are minors.
So, we might guess that Meta estimated something like 0.1% to 0.02% of minors on Meta platforms received photos of genitals or similar each day (100k out of 100M is 0.1%; 100k out of 500M is 0.02%). Is this roughly the normal rate they would experience elsewhere? Compared to the real world, possibly, although I would be surprised if 0.1% of children are being exposed to people's genitals &quot;on any street corner&quot; each day. Compared to a well-moderated small forum, that seems highly implausible. The internet commenters' reaction was the same one that Arturo Bejar, who designed Facebook's reporting system and worked in the area, initially had. He dismissed reports about this kind of thing because it didn't seem plausible that it could really be that bad, but he quickly changed his mind once he started looking into it:</p> <blockquote> <p>Joanna’s account became moderately successful, and that’s when things got a little dark. Most of her followers were enthused about a [14-year-old] girl getting into car restoration, but some showed up with rank misogyny, like the guy who told Joanna she was getting attention “just because you have tits.”</p> <p>“Please don’t talk about my underage tits,” Joanna Bejar shot back before reporting the comment to Instagram. A few days later, Instagram notified her that the platform had reviewed the man’s comment. It didn’t violate the platform’s community standards.</p> <p>Bejar, who had designed the predecessor to the user-reporting system that had just shrugged off the sexual harassment of his daughter, told her the decision was a fluke. But a few months later, Joanna mentioned to Bejar that a kid from a high school in a neighboring town had sent her a picture of his penis via an Instagram direct message. Most of Joanna’s friends had already received similar pics, she told her dad, and they all just tried to ignore them.</p> <p>Bejar was floored. The teens exposing themselves to girls who they had never met were creeps, but they presumably weren’t whipping out their dicks when they passed a girl in a school parking lot or in the aisle of a convenience store. Why had Instagram become a place where it was accepted that these boys occasionally would—or that young women like his daughter would have to shrug it off?</p> </blockquote> <p>Much of the book, Broken Code, is about Bejar and others trying to get Meta to take problems like this seriously and making little progress and often having their progress undone (although PR issues for FB seem to force FB's hand and drive some progress towards the end of the book):</p> <blockquote> <p>six months prior, a team had redesigned Facebook’s reporting system with the specific goal of reducing the number of completed user reports so that Facebook wouldn’t have to bother with them, freeing up resources that could otherwise be invested in training its artificial intelligence–driven content moderation systems. In a memo about efforts to keep the costs of hate speech moderation under control, a manager acknowledged that Facebook might have overdone its effort to stanch the flow of user reports: “We may have moved the needle too far,” he wrote, suggesting that perhaps the company might not want to suppress them so thoroughly.</p> <p>The company would later say that it was trying to improve the quality of reports, not stifle them. But Bejar didn’t have to see that memo to recognize bad faith. The cheery blue button was enough. He put down his phone, stunned. This wasn’t how Facebook was supposed to work.
How could the platform care about its users if it didn’t care enough to listen to what they found upsetting?</p> <p>There was an arrogance here, an assumption that Facebook’s algorithms didn’t even need to hear about what users experienced to know what they wanted. And even if regular users couldn’t see that like Bejar could, they would end up getting the message. People like his daughter and her friends would report horrible things a few times before realizing that Facebook wasn’t interested. Then they would stop.</p> </blockquote> <p>If you're interested in the topic, I'd recommend reading the whole book, but if you just want to get a flavor for the kinds of things the book discusses, I've put a few relevant quotes into an appendix. After reading the book, I can't say that I'm very sure the number is correct because I'd have to look at the data to be strongly convinced, but it does seem plausible. And as for why Facebook might expose children to more of this kind of thing than another platform, the book makes the case that this falls out of a combination of optimizing for engagement, &quot;number go up&quot;, and neglecting &quot;trust and safety&quot; work:</p> <blockquote> <p>Only a few hours of poking around Instagram and a handful of phone calls were necessary to see that something had gone very wrong—the sort of people leaving vile comments on teenagers’ posts weren’t lone wolves. They were part of a large-scale pedophilic community fed by Instagram’s recommendation systems.</p> <p>Further reporting led to an initial three-thousand-word story headlined “Instagram Connects Vast Pedophile Network.” Co-written with Katherine Blunt, the story detailed how Instagram’s recommendation systems were helping to create a pedophilic community, matching users interested in underage sex content with each other and with accounts advertising “menus” of content for sale. Instagram’s search bar actively suggested terms associated with child sexual exploitation, and even glancing contact with accounts with names like Incest Toddlers was enough to trigger Instagram to begin pushing users to connect with them.</p> </blockquote> <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:F">but, fortunately for Zuckerberg, his target audience seems to have little understanding of the tech industry, so it doesn't really matter that Zuckerberg's argument isn't plausible. In a future post, we might look at <a href="https://twitter.com/danluu/status/1478329866185437184">incorrect reasoning from regulators</a> and <a href="https://twitter.com/altluu/status/1484718002046070784">government officials</a>, but, for now, see <a href="https://twitter.com/garybernhardt/status/1340064481271898115">this example from Gary Bernhardt, where FB makes a claim that appears to be the opposite of correct to people who work in the area</a>. <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:S">Another claim, rarer than &quot;it would cost too much to provide real support&quot;, is &quot;support can't be done because it's a social engineering attack vector&quot;.
This isn't as immediately implausible because it calls to mind all of the cases where people had their SMS-2FA'd accounts owned by someone calling up a phone company and getting a phone number transferred, but I don't find it all that plausible since bank and brokerage accounts are, in general, much higher value than FB accounts, and FB accounts are still compromised at a much higher rate, even for online-only accounts, accounts from before KYC requirements were in play, or whatever other factor people name as a reasonable-sounding reason for the difference. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:A"><p>Another reason, less reasonable, but the actual impetus for this post, is that when Zuckerberg made his comments that only the absolute largest companies in the world can handle issues like fraud and spam, it struck me as completely absurd and, because I enjoy absurdity, I started a doc where I recorded links I saw to large company spam, fraud, moderation, and support failures, much like the <a href="seo-spam/#appendix-google-knowledge-card-results">list of Google knowledge card results I kept track of for a while</a>. I didn't have a plan for what to do with that and just kept it going for years before I decided to publish the list, at which point I felt that I had to write something, since the bare list by itself isn't that interesting, so I started writing up summaries of each link (the original list was just a list of links), and here we are. When I sit down to write something, I generally have an idea of the approach I'm going to take, but I frequently end up changing my mind when I start looking at the data.</p> <p>For example, <a href="testing/">since going from hardware to software, I've had this feeling that conventional software testing is fairly low ROI</a>, so when I joined Twitter, I had this idea that I would look at the monetary impact of errors (e.g., serving up a 500 error to a user) and outages and use that to justify working on testing, in the same way that studies looking into the monetary impact of latency can often drive work on latency reduction. Unfortunately for my idea, a naive analysis showed a fairly low monetary impact and I <a href="metrics-analytics/">immediately found</a> a <a href="tracing-analytics/">number of</a> other <a href="cgroup-throttling/">projects that</a> were <a href="latency-pitfalls/">high impact</a>, so I wrote up a doc explaining that my findings were the opposite of what I needed to justify doing the work that I wanted to do, but I hoped to do a more in-depth follow-up that could overturn my original result, and then worked on projects that were supported by data.</p> <p>This also frequently happens when I write things up here, such as <a href="https://www.patreon.com/posts/60185075">this time I wanted to write up this really compelling sounding story, but, on digging into it, despite it being widely cited in tech circles, I found out that it wasn't true and there wasn't really anything interesting there</a>. It's quite often the case that when I look into something, I find that the angle I was thinking of doesn't work.
When I'm writing for work, I usually feel compelled to at least write up a short doc with evidence of the negative result but, for my personal blog, I don't really feel the same compulsion, so my drafts folder and home drive are littered with abandoned negative results.</p> <p>However, in this case, on digging into the stories in the links and talking to people at various companies about how these systems work, the problem actually seemed worse than I realized before I looked into it, so it felt worth writing up even if I'm writing up something most people in tech know to be true.</p> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> </ol> </div> Why it's impossible to agree on what's allowed impossible-agree/ Wed, 07 Feb 2024 00:00:00 +0000 impossible-agree/ <p>On large platforms, it's impossible to have policies on things like moderation, spam, fraud, and sexual content that people agree on. David Turner made a simple game to illustrate how difficult this is even in a trivial case, <a href="https://novehiclesinthepark.com/">No Vehicles in the Park</a>. If you haven't played it yet, I recommend playing it now before continuing to read this document.</p> <p>The idea behind the site is that it's very difficult to get people to agree on what moderation rules should apply to a platform. Even if you take a much simpler example (what vehicles should be allowed in a park, given a rule and some instructions for how to interpret the rule) and then ask a small set of questions, people won't be able to agree. On doing the survey myself, one of the first reactions I had was that the questions aren't chosen to be particularly nettlesome and there are many edge cases Dave could've asked about if he wanted to make it a challenge. And yet, despite not making the survey particularly challenging, there isn't broad agreement on the questions. Comments on the survey also indicate another problem with rules, which is that it's much harder to get agreement than people think it will be. If you read comments on rule interpretation or moderation on lobsters, HN, reddit, etc., when people suggest a solution, the vast majority of people will suggest something that anyone who's done moderation or paid attention to how moderation works knows cannot work, the moderation equivalent of <a href="sounds-easy/">&quot;I could build that in a weekend&quot;</a><sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">1</a></sup>. Of course we see this with Dave's game as well. The top HN comment, which is also the most agreed-upon comment and a very common sentiment elsewhere, is<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">2</a></sup>:</p> <blockquote> <p>I'm fascinated by the fact that my takeaway is the precise opposite of what the author intended.</p> <p>To me, the answer to all of the questions was crystal-clear. Yes, you can academically wonder whether an orbiting space station is a vehicle and whether it's in the park, but the obvious intent of the sign couldn't be clearer. Cars/trucks/motorcycles aren't allowed, and obviously police and ambulances (and fire trucks) doing their jobs don't have to follow the sign.</p> <p>So if this is supposed to be an example of how content moderation rules are unclear to follow, it's achieving precisely the opposite.</p> </blockquote> <p>And someone approvingly replies with:</p> <blockquote> <p>Exactly.
There is a clear majority in the answers.</p> </blockquote> <p>After going through the survey, you get a graph showing how many people answered yes and no to each question, which is where the &quot;clear majority&quot; comes from. First of all, I think it's not correct to say that there is a clear majority. But even supposing that there were, there's no reason to think that there being a majority means that most people agree with you even if you take the majority position in each vote. In fact, given how &quot;wiggly&quot; the per-question majority graph looks, it would be extraordinary if being in the majority for each question meant that most people agreed with you or that there's any set of positions that the majority of people agree on. Although you could construct a contrived dataset where this is true, it would be very surprising if this were true in a natural dataset.</p> <p>If you look at the data (which isn't available on the site, but Dave was happy to pass it along when I asked), as of when I pulled the data, there was no set of answers that the majority of users agreed on, and it was not even close. I pulled this data shortly after I posted the link to HN, when the vast majority of responses were from HN readers, who are more homogeneous than the population at large. Despite these factors making it easier to find agreement, the most popular set of answers was only selected by 11.7% of people. This is the position the top commenter says is &quot;obvious&quot;, but it's a minority position not only in the sense that only 11.7% of people agree and 88.3% of people disagree, but also in the sense that almost no one holds a position that's only a small amount of disagreement away from this allegedly obvious position. The 2nd and 3rd most common positions, representing 8.5% and 6.5% of the vote, respectively, are similar and only disagree on whether or not a non-functioning WW-II era tank that's part of a memorial violates the rule. Beyond that, approximately 1% of people hold each of the 4th, 5th, 6th, and 7th most popular positions, with every less popular position having less than 1% agreement and a fairly rapid drop-off from there as well. So, 27% of people find themselves in agreement with significantly more than 1% of other users (the median user agrees with 0.16% of other users). See below for a plot of what this looks like. The opinions are sorted from most popular to least popular, with the most popular on the left. A log scale is used because there's so little agreement on opinions that a linear scale plot looks like a few points above zero followed by a bunch of zeros.</p> <p><img src=/images/park-agreement.png alt="a plot illustrating the previous paragraph" width=2000 height=1236></p> <p>Another way to look at this data is that 36902 people expressed an opinion on what constitutes a vehicle in the park and they came up with 9432 distinct opinions, for an average of ~3.9 people per distinct expressed opinion, i.e., an average user agreement of ~0.01%. Although averages are, on average, overused, an average works as a summary for expressing the level of agreement because, while we do have a small handful of opinions with much higher than the average 0.01% agreement, to &quot;maintain&quot; the average, this must be balanced out by a ginormous number of people who have even less agreement with other users. There's no way to have a low average agreement along with high actual agreement unless that's balanced out by even higher disagreement, and vice versa.</p>
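<p>Computing these numbers from the raw responses is just a matter of counting identical answer sets. Here's a minimal sketch of that computation in Python; the file name and the one-answer-set-per-line format are assumptions for illustration (I don't know what format Dave's data actually comes in), but the same counting works however the answers are encoded.</p>
<pre><code>from collections import Counter

# One survey response per line, encoded as a comma-separated string of
# yes/no answers (hypothetical format; any canonical encoding works).
with open("park_responses.csv") as f:
    responses = [line.strip() for line in f if line.strip()]

counts = Counter(responses)   # distinct answer set -> number of respondents
n = len(responses)

print(f"{n} respondents, {len(counts)} distinct answer sets")
print(f"~{n / len(counts):.1f} people per distinct answer set")
print(f"average answer set is held by {1 / len(counts):.2%} of respondents")

# Vote share of the most popular answer sets
for answers, c in counts.most_common(3):
    print(f"{c / n:.1%} of respondents chose {answers}")

# Per-user agreement: fraction of other respondents with an identical answer set
per_user = sorted(counts[r] - 1 for r in responses)
print(f"median user agrees with {per_user[n // 2] / (n - 1):.2%} of other users")
</code></pre>
<p>The two summary numbers in the text come out of the last few lines: the ~0.01% figure is the average group size divided by the total number of respondents, while the median comes from looking at the size of each respondent's own group.</p>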
<p>On HN, in response to the same comment, Michael Chermside had the reasonable but not highly upvoted comment:</p> <blockquote> <p>&gt; To me, the answer to all of the questions was crystal-clear.</p> <p>That's not particularly surprising. But you may be asking the wrong question.</p> <p>If you want to know whether the rules are clear then I think that the right question to ask is not &quot;Are the answers crystal-clear to you?&quot; but &quot;Will different people produce the same answers?&quot;.</p> <p>If we had a sharp drop in the graph at one point then it would suggest that most everyone has the same cutoff; instead we see a very smooth curve as if different people read this VERY SIMPLE AND CLEAR rule and still didn't agree on when it applied.</p> </blockquote> <p>Many (and probably actually most) people are overconfident when predicting what other people think is obvious and often incorrectly assume that other people will think the same thoughts and find the same things obvious. This is more true of the highly-charged issues that result in bitter fights about moderation than the simple &quot;no vehicles in the park&quot; example, but even this simple example demonstrates not only the difficulty in reaching agreement, but the difficulty in understanding how difficult it is to reach agreement.</p> <p>To use an example from another context that's more charged, consider, in any sport, whether or not a player is considered to be playing fair or making dirty plays and should be censured. We could look at many different players from many different sports, so let's arbitrarily pick Draymond Green. If you ask any serious basketball fan who's not a Warriors fan to name the dirtiest player in the NBA today, you'll find general agreement that it's Draymond Green (although some people will argue for Dillon Brooks, so if you want near uniform agreement, you'll have to ask for the top two dirtiest players). And yet, if you ask a Warriors fan about Draymond, most have no problem explaining away every dirty play of his. So if you want to get uniform agreement to a question that's much more straightforward than the &quot;no vehicles in the park&quot; question, such as &quot;<a href="https://twitter.com/BrendenNunesNBA/status/1648180185811025926">is it ok to stomp on another player's chest and then use them as a springboard to leap into the air?</a>&quot; (on top of a hundred other dirty plays), you'll find that for many such seemingly obvious questions, a sizable group of people will have extremely strong disagreements with the &quot;obvious&quot; answer. When you move away from a contrived, abstract example like &quot;no vehicles in the park&quot; to a real-world issue that people have emotional attachments to, it generally becomes impossible to get agreement even in cases where disinterested third parties would all agree, which, as we observed above, is already impossible even without emotional attachment.
And when you move away from sports into issues people care even more strongly about, like politics, the disagreements get stronger.</p> <p>While people might be able to &quot;agree to disagree&quot; on whether or not a non-functioning WW-II era tank that's part of a memorial violates the &quot;no vehicles in the park&quot; rule (resulting in a pair of positions that accounts for 15% of the vote), in reality, people often have a hard time agreeing to disagree over what outsiders would consider very small differences of opinion. Charged issues are often <a href="https://en.wikipedia.org/wiki/Narcissism_of_small_differences">fractally contentious, causing disagreement among people who hold all but identical opinions</a>, making them significantly more difficult to agree on than our &quot;no vehicles in the park&quot; example.</p> <p>To pick a real-world example, consider <abbr title="among tech folks, probably best known as the author of The Tyranny of Structurelessness">Jo Freeman</abbr>, a feminist who, in 1976, wrote about her experience of <a href="https://www.jofreeman.com/joreen/trashing.htm">being canceled for minute differences in opinion and how this was unfortunately common in the Movement</a> (using the term &quot;trashed&quot; and not &quot;canceled&quot; because cancellation hadn't come into common usage yet and, in my opinion, &quot;trashed&quot; is the better term anyway). In the nearly fifty years since Jo Freeman wrote &quot;Trashing&quot;, the propensity of humans to pick on minute differences and attempt to destroy anyone who doesn't completely agree with them hasn't changed; for a recent, parallel example, see <a href="https://www.youtube.com/watch?v=OjMPJVmXxV8">Natalie Wynn's similar experience</a>.</p> <p>For people whose own opinions are far away in the space of commonly held opinions, the differences in opinion between Natalie and the people calling for her to be deplatformed are fairly small. But, not only did these &quot;small&quot; differences in opinion result in people calling for Natalie to be deplatformed, they called for her to be physically assaulted, doxed, etc., and they suggested the same treatment for her friends and associates as well as for people who didn't really associate with her, but publicly talked about similar topics and didn't cancel her. Even now, years later, she still gets calls to be deplatformed and I expect this will continue past the end of my life (when I wrote this, years after the event Natalie discussed, I did a Twitter search and found a long thread from someone ranting about what a horrible human being Natalie is for the alleged transgression discussed in the video, dated 10 days ago, and it's easy to find more of these rants). I'm not going to attempt to describe the difference in positions because the positions are close enough that to describe them would take something like 5k to 10k words (as opposed to, say, a left-wing vs. a right-wing politician, where the difference is blatant enough that you can describe it in a sentence or two); you can watch the hour in the 1h40m video that's dedicated to the topic if you want to know the full details.</p> <p>The point here is just that, if you look at almost any person who has public opinions on charged issues, the opinion space is fractally contentious. No large platform can satisfy user preferences because users will disagree over what content should be moderated off the platform and what content should be allowed.
And, of course, this problem scales up as the platform gets larger<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">3</a></sup>.</p> <p style="border-width:8px; border-style:solid; border-color:grey; background-color:#F3F3F3; padding: 0.1em;"> <a rel="nofollow" href="https://jobs.ashbyhq.com/freshpaint/bfe56523-bff4-4ca3-936b-0ba15fb4e572?utm_source=dl"><b>If you're looking for work, Freshpaint is hiring (US remote) in engineering, sales, and recruiting</b></a>. Disclaimer: I may be biased since I'm an investor, but they seem to have found product-market fit and are rapidly growing.</p> <p><i>Thanks to Peter Bhat Harkins, Dan Gackle, Laurence Tratt, Gary Bernhardt, David Turner, Kevin Burke, Sophia Wisdom, Justin Blank, and Bert Muthalaly for comments/corrections/discussion.</i></p> <div class="footnotes"> <hr /> <ol> <li id="fn:M"><p>Something I've repeatedly seen on every forum I've been on is <a href="https://mastodon.social/@danluu/110561295139766945">the suggestion that we just don't need moderation after all and all our problems will be solved if we just stop this nasty censorship</a>. If you want a small forum that's basically 4chan, then no moderation can work fine, but even if you want a big platform that's like 4chan, no moderation doesn't actually work. If we go back to those Twitter numbers, 300M users and 1M bots removed a day, if you stop doing this kind of &quot;censorship&quot;, the platform will quickly fill up with bots to the point that everything you see will be spam/scam/phishing content or content from an account copying content from somewhere else or using LLM-generated content to post spam/scam/phishing content. Not only will most accounts be bots, bots will be a part of large engagement/voting rings that will drown out all human content.</p> <p>The next most naive suggestion is to stop downranking memes, dumb jokes, etc., often thrown in with a comment like &quot;doesn't anyone here have a sense of humor?&quot;. If you look at why forums with upvoting/ranking ban memes, it generally happens after the forum becomes totally dominated by memes/comics because people upvote those at a much higher rate than any kind of content with a bit of nuance, and not everyone wants a forum that's full of the lowest common denominator meme/comic content. And as for &quot;having a sense of humor&quot; in comments, if you look at forums that don't ban cheap humor, top comments will generally end up dominated by these, e.g., for maybe 3-6 months, one of the top comments on any kind of story about a man doing anything vaguely heroic on reddit forums that don't ban this kind of cheap humor was some variant of &quot;I'm surprised he can walk with balls that weigh 900 lbs.&quot;, often repeated multiple times by multiple users, amidst a sea of the other cheap humor that was trendy during that period.
Of course, some people actually want that kind of humor to dominate the comments; they actually want to see the same comment 150 times a day for months on end, but I suspect most people who grumpily claim &quot;no one has a sense of humor here&quot; when their cheap humor gets flagged don't actually want to read a forum that's full of other people's cheap humor.</p> <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:C">This particular commenter indicates that they understand that moderation is, in general, a hard problem; they just don't agree with the &quot;no vehicles in the park&quot; example, but many other people think that both the park example and moderation are easy. <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:F"><p>Nowadays, it's trendy to use &quot;federation&quot; as a cure-all in the same way people used &quot;blockchain&quot; as a cure-all five years ago, but federation doesn't solve this problem for the typical user. I actually had a conversation with someone who notes in their social media bio that they're one of the creators of the ActivityPub spec, who claimed that federation does solve this problem and that Threads adding ActivityPub would create some kind of federating panacea. I noted that fragmentation is already a problem for many users on Mastodon and whether or not Threads will be blocked is contentious and will only increase fragmentation, and the ActivityPub guy replied with something like &quot;don't worry about that, most people won't block Threads, and it's their problem if they do.&quot;</p> <p>I noted that a problem many of my non-technical friends had when they tried Mastodon was that they'd pick a server and find that they couldn't follow someone they wanted to follow due to some kind of server blocking or ban. So then they'd try another server to follow this one person and then find that another person they wanted to follow is blocked. The fundamental problem is that users on different servers want different things to be allowed, which then results in no server giving you access to everything you want to see. The ActivityPub guy didn't have a response to this and deleted his comment.</p> <p>By the way, a problem that's much easier than moderation/spam/fraud/obscene content/etc. policy that the fediverse can't even solve is how to present content. Whenever I use Mastodon to interact with someone using &quot;honk&quot;, messages get mangled. For example, a <code>&quot;</code> in the subject (and content warning) field of a Mastodon message gets converted to <code>&amp;quot;</code> by the time the Mastodon user sees the reply from the honk user, so every reply from a honk user forks the discussion into a different subject. Here's something that can be fully specified without ambiguity, where people are much less emotionally attached to the subject than they are for moderation/spam/fraud/obscene content/etc., and the fediverse can't even solve this problem across two platforms.</p> <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> </ol> </div> Notes on Cruise's pedestrian accident cruise-report/ Mon, 29 Jan 2024 00:00:00 +0000 cruise-report/ <p>This is a set of notes on the Quinn Emanuel report on Cruise's handling of the 2023-10-02 accident where a Cruise autonomous vehicle (AV) hit a pedestrian, stopped, and then started moving again with the pedestrian stuck under the bottom of the AV, dragging the pedestrian 20 feet.
After seeing some comments about this report, I read five stories about it and then skimmed the report itself, and my feeling is that the authors of four of the stories probably didn't read the report, and that the people who were commenting had generally read stories by journalists who did not appear to have read the source material, so the comments were generally way off base. <a href="dunning-kruger/">As we previously discussed, it's common for summaries to be wildly wrong, even when they're summarizing a short paper that's easily read by laypeople</a>, so of course summaries of a 200-page report are likely to be misleading at best.</p> <p>On reading the entire report, I'd say that Cruise both looks better and worse than in the articles I saw, which is the same pattern we saw when we looked at the actual source for <a href="elon-twitter-texts/">Exhibits H and J from Twitter v. Musk</a>, the <a href="us-v-ms/">United States v. Microsoft Corp. docs</a>, etc.; just as some journalists seem to be pro/anti-Elon Musk and pro/anti-Microsoft, willing to push an inaccurate narrative to dunk on them to the maximum extent possible or exonerate them to the maximum extent possible, we see the same thing here with Cruise. And as we saw in those cases, despite some articles seemingly trying to paint Cruise in the best or worst light possible, the report itself has material that is more positive and more negative than we see in the most positive or negative stories.</p> <p>Aside from correcting misleading opinions on the report, I find the report interesting because it's rare to see any kind of investigation into what went wrong in tech at this level of detail, let alone a public one. We often see this kind of investigation in safety-critical systems and sometimes see it in sports as well as for historical events, but tech events are usually not covered like this. Of course companies do post-mortems of incidents, but you generally won't see a 200-page report on a single incident, nor will the focus of post-mortems be what the focus was here. In the past, <a href="wat/">we've noted that a lot can be learned by looking at the literature and incident reports on safety-critical systems</a>, so of course this is true here as well, where we see a safety-critical system that's more tech-adjacent than the ones we've looked at previously.</p> <p>The length and depth of the report here reflect a difference in culture between safety-critical systems and &quot;tech&quot;. The behavior that's described as unconscionable in the report is not only normal in tech, but probably more transparent and above board than you'd see at most major tech companies; I find the culture clash between tech and safety-critical systems interesting as well. I attempted to inject as little of my opinion as possible into these notes on the report, even in cases where knowledge of tech companies or engineering meant that I would've personally written something different. For more opinions, <a href="#back-to-danluu-com">see the section at the end</a>.</p> <p><b><a href="https://assets.ctfassets.net/95kuvdv8zn1v/1mb55pLYkkXVn0nXxEXz7w/9fb0e4938a89dc5cc09bf39e86ce5b9c/2024.01.24_Quinn_Emanuel_Report_re_Cruise.pdf">REPORT TO THE BOARDS OF DIRECTORS OF CRUISE LLC, GM CRUISE HOLDINGS LLC, AND GENERAL MOTORS HOLDINGS LLC REGARDING THE OCTOBER 2, 2023 ACCIDENT IN SAN FRANCISCO</a></b></p> <h2 id="i-introduction">I. Introduction</h2> <h3 id="a-overview">A.
Overview</h3> <ul> <li><strong>2023-10-24</strong>: California DMV suspended Cruise's driverless license</li> <li><strong>2023-10-02</strong>: human-driven Nissan hit a pedestrian, putting the pedestrian in the path of a Cruise autonomous vehicle (AV), which then dragged the pedestrian <code>20 feet</code> before stopping</li> <li>DMV claims <ul> <li>Cruise failed to disclose that the AV moved forward after its initial impact</li> <li>video Cruise played only shows a part of the accident and not the pedestrian dragging</li> <li>DMV only learned about dragging from another government agency, &quot;impeding its oversight&quot;</li> </ul></li> <li>NHTSA and CPUC also took action against Cruise and made similar claims</li> <li>Media outlets also complained they were misled by Cruise</li> <li>Cruise leadership and Cruise employees who talked with regulators admit they didn't explain the dragging, but they said they played the full video clip; in all but one of the meetings, internet issues may have prevented regulators from seeing the entire accident</li> <li>Cruise employees claim the NHTSA received the full video immediately after the <b>10-03</b> meeting and the CPUC declined the offer for the full video</li> <li>Cruise employees note they played the full video, with no internet issues, to the SF MTA, SFPD, and SFFD on <b>10-03</b> and had a full discussion with those agencies</li> <li>Cruise leadership concedes they never informed the media, but leadership believed that Cruise's obligations to the media are different from their obligations to regulators</li> </ul> <h3 id="b-scope-of-review">B. Scope of Review</h3> <ul> <li>[no notes]</li> </ul> <h3 id="c-review-plan-methodology-and-limitations">C. Review Plan Methodology and Limitations</h3> <ul> <li>205k &quot;documents&quot;, &quot;including e-mails, texts, Slack communications, and internal Cruise documents&quot;</li> <li>Interviewed 88 current and former employees and contractors</li> <li>Reviewed a report by Exponent Inc., a 3rd-party firm</li> <li>Internal review only; did not interview regulators and public officials</li> <li>A number of employees and contractors were not available &quot;due to personal circumstances and/or the wide-scale Reduction in Force&quot;, but these interviews were not deemed to be important</li> <li>Report doesn't address broader issues outside of mandate, &quot;such as the safety or safety processes of Cruise AVs or its operations, which are more appropriately evaluated by those with engineering and technical safety expertise&quot;</li> </ul> <h3 id="d-summary-of-principal-findings-and-conclusions">D. Summary of Principal Findings and Conclusions</h3> <ul> <li>By morning of <strong>10-03</strong>, leadership and 100+ employees knew the pedestrian had been dragged ~20ft.
by the Cruise AV during the secondary movement after the AV came to a stop.</li> <li>Plan was to disclose this happened by playing the full video, &quot;letting the 'video speak for itself.'&quot; <ul> <li>Cruise assumed that regulators and government officials would ask questions and Cruise would provide further info</li> </ul></li> <li>&quot;Weight of the evidence&quot; is that Cruise attempted to play the full video, but in 3 meetings, internet issues prevented this from happening and Cruise didn't point out that the pedestrian dragging happened</li> <li>On <strong>10-02</strong> and <strong>10-03</strong>, &quot;Cruise leadership was fixated on correcting the inaccurate media narrative&quot; that Cruise's AV had caused the accident <ul> <li>This led Cruise to convey information about the Nissan and omit &quot;other important information&quot; about the accident to &quot;the media, regulators, and other government officials&quot;</li> </ul></li> <li>&quot;The reasons for Cruise’s failings in this instance are numerous: poor leadership, mistakes in judgment, lack of coordination, an 'us versus them' mentality with regulators, and a fundamental misapprehension of Cruise’s obligations of accountability and transparency to the government and the public. Cruise must take decisive steps to address these issues in order to restore trust and credibility.&quot;</li> <li>&quot;the DMV Suspension Order is a direct result of a proverbial self-inflicted wound by certain senior Cruise leadership and employees who appear not to have fully appreciated how a regulated business should interact with its regulators ... it was a fundamentally flawed approach for Cruise or any other business to take the position that a video of an accident causing serious injury provides all necessary information to regulators and otherwise relieves them of the need to affirmatively and fully inform these regulators of all relevant facts. As one Cruise employee stated in a text message to another employee about this matter, our 'leaders have failed us.'&quot;</li> </ul> <h2 id="ii-the-facts-regarding-the-october-2-accident">II. THE FACTS REGARDING THE OCTOBER 2 ACCIDENT</h2> <h3 id="a-background-regarding-cruise-s-business-operations">A. Background Regarding Cruise’s Business Operations</h3> <ul> <li>Cruise founded in 2013, acquired by GM in 2016 (GM owns 79%)</li> <li>Cruise's stated goal: &quot;responsibly deploy the world’s most advanced driverless vehicle service&quot;</li> <li>&quot;Cruise’s stated mission is to make transportation cleaner, safer and more accessible&quot;</li> <li>Driverless ride-hail operation started 2021-09 in SF</li> <li>Started charging in 2022-06</li> <li>Has expanded to other areas, including overseas</li> <li><b>10-02</b> accident was first pedestrian injury in &gt; <code>5M mi</code> of driving</li> </ul> <h3 id="b-key-facts-regarding-the-accident">B. Key Facts Regarding the Accident</h3> <ul> <li><strong>10-02, <i>9:29pm</i></strong>: human-driven Nissan Sentra strikes pedestrian in crosswalk of 4-way intersection at 5th &amp; Market in SF</li> <li>Pedestrian entered crosswalk against a red light and &quot;Do Not Walk&quot; signal and then paused in Nissan's lane. 
Police report cited both driver and pedestrian for code violations and concluded that the driver was &quot;most at fault&quot;</li> <li>The impact launched the pedestrian into the path of the Cruise AV</li> <li>Cruise AV braked but still hit pedestrian</li> <li>After coming to a complete stop, the AV moved to find a safe place to stop, known as a &quot;'minimal risk condition' pullover maneuver (pullover maneuver) or 'secondary movement.'&quot;</li> <li>AV drove up to <code>7.7mph</code> for <code>20 feet</code>, dragging pedestrian with it</li> <li>Nissan driver fled the scene (hit and run)</li> </ul> <h3 id="c-timeline-of-key-events">C. Timeline of Key Events</h3> <ul> <li><strong>10-02, <i>9:29pm</i></strong>: accident occurs and Nissan driver flees; AV transmits low-res 3-second video (Offload 1) confirming collision to Cruise Remote Assistance Center</li> <li><strong><i>9:32pm</i></strong>: AV transmits medium-res 14-second video (Offload 2) of collision, but not pullover maneuver and pedestrian dragging</li> <li><strong><i>9:33pm</i></strong>: emergency responders arrive between 9:33pm and 9:38pm</li> <li><b><i>9:40pm</b></i>: SFFD uses heavy rescue tools to remove pedestrian from under AV</li> <li><b><i>9:49pm</b></i>: Cruise Incident Response team labels accident a &quot;Sev-1&quot;, which is for minor collisions. Team created a virtual &quot;war room&quot; on Google Meet and a dedicated slack channel (war room slack channel) with ~20 employees</li> <li><b><i>10:17pm</b></i>: Cruise contractors arrive at accident scene. One contractor takes &gt; 100 photos and videos and notes blood and skin patches on the ground, showing the AV moved from point-of-impact to final stopping place <ul> <li>Another contractor, with Cruise's authorization, gives SFPD 14-second video showing Nissan</li> </ul></li> <li><b><i>11:31pm</b></i>: Cruise raises incident to &quot;Sev-0&quot;, for &quot;major vehicle incident with moderate to major injury or fatality to any party&quot;. Maybe 200 additional employees are paged to the war room</li> <li><b>10-03, <i>12:15am</b></i>: incident management convenes virtual meeting to share updates about accident and discuss media strategy to rebut articles that AV caused the accident</li> <li><b><i>12:45am</b></i>: Cruise govt affairs team reaches out to govt officials</li> <li><b><i>12:53am</b></i>: Cruise issues a press release noting the Nissan caused the accident. CEO <abbr title="CEO, CTO, President, and co-founder">Kyle Vogt</abbr> and Communications VP <abbr title="Communications VP">Aaron McLear</abbr> heavily edit the press statement. No mention of pullover maneuver or dragging; Cruise employees claim they were not aware of those facts at the time</li> <li><b><i>1:30am</b></i>: AV back at Cruise facility; process of downloading collision report data from AV, including full video, begins</li> <li><b><i>2:14am</b></i>: 45-second video of accident, which depicts pullover maneuver and dragging, becomes available, but no Cruise employee receives a notification that it's ready until &gt; 4 hours later, when all data from AV is processed</li> <li><b><i>3:21am</b></i>: At the request of Cruise govt. Affairs, Director of Systems Integrity <abbr title="Director of Systems Integrity">Matt Wood</abbr> creates 12s video of accident showing Nissan hitting pedestrian and pedestrian landing in front of Cruise AV.
Video stops before AV hits pedestrian</li> <li><b><i>3:45am</b></i>: <abbr title="Director of Systems Integrity">Wood</abbr> posts first known communication within Cruise of pullover maneuver and pedestrian dragging to war room slack channel with 77 employees in the channel at the time. <abbr title="Director of Systems Integrity">Wood</abbr> says the AV moved 1-2 car lengths after initial collision</li> <li><b><i>6:00am</b></i>: Cruise holds virtual Crisis Management Team (CMT) meeting; pedestrian dragging is discussed. Subsequent slack messages (6:17am, 6:25am, 6:56am) confirm discussion on pullover maneuver and dragging</li> <li><b><i>6:28am</b></i>: Cruise posts 45s 9-pane video of pullover and dragging, the full video (offload 3) to war room slack channel</li> <li><b><i>6:45am</b></i>: Virtual Senior Leadership Team (SLT) meeting; <abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> and <abbr title="Communications VP">McLear</abbr> discuss whether or not to share full video with media or alter Cruise press statement and decide to do neither</li> <li><b><i>7:25am</b></i>: Cruise govt. affairs employee emails NHTSA and offers to meet</li> <li><b><i>7:45am</b></i>: Cruise eng and safety teams hold preliminary meetings to discuss collision and pullover maneuver</li> <li><b><i>9:05am</b></i>: Cruise regulator, legal, and systems integrity employees have pre-meeting to prepare for NHTSA briefing; they discuss pullover and dragging</li> <li><b><i>10:05am</b></i>: <abbr title="Director of Systems Integrity">Wood</abbr> and VP of Global Government Affairs Prashanthi <abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr> have virtual meeting with Mayor of SF's transpo advisor. <abbr title="Director of Systems Integrity">Wood</abbr> shows full video, &quot;reportedly with internet connectivity issues from his home computer&quot;; neither <abbr title="Director of Systems Integrity">Wood</abbr> nor <abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr> brings up or discusses pullover or dragging</li> <li><b><i>10:30am</b></i>: Virtual meeting with NHTSA. <abbr title="Director of Systems Integrity">Wood</abbr> shows full video, &quot;again having internet connectivity issues causing video to freeze and/or black-out in key places including after initial impact&quot; and again not bringing up or discussing pullover or dragging</li> <li><b><i>10:35am</b></i>: Cruise eng and safety teams have 2nd meeting to discuss collision</li> <li><b><i>11:05am</b></i>: Cruise regulatory, legal, and systems integrity employees have pre-meeting for DMV and California Highway Patrol (CHP) briefing; Cruise team doesn't discuss pullover and dragging</li> <li><b><i>11:30am</b></i>: hybrid in-person and virtual meeting with DMV and CHP. <abbr title="Director of Systems Integrity">Wood</abbr> shows full video, again with internet connectivity issues and again not bringing up or discussing pullover or dragging</li> <li><b><i>12:00pm</b></i>: virtual Cruise CMT meeting; engineers present findings, including chart detailing movement of AV during accident. Shows how AV collided with pedestrian and then moved forward again, dragging pedestrian ~20 ft. 
AV programmed to move as much as 100 ft, but internal AV systems flagged a failed wheel speed sensor because wheels were moving at different speeds (because one wheel was spinning on pedestrian's leg), stopping the car early</li> <li><b><i>12:30pm</b></i>: Cruise govt affairs employee calls CPUC to discuss <b>10-02</b> accident and video</li> <li><b><i>12:40pm</b></i>: Cruise virtual SLT meeting. Chart from CMT meeting presented. <abbr title="CEO, CTO, President, and co-founder">Vogt</abbr>, COO <abbr title="COO">Gil West</abbr>, Chief Legal Officer <abbr title="Chief Legal Officer">Jeff Bleich</abbr>, and others present. Safety and eng teams raise question of grounding fleet; <abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> and <abbr title="COO">West</abbr> say no</li> <li><b><i>1:40pm</b></i>: full video uploaded to NHTSA</li> <li><b><i>2:37pm</b></i>: Cruise submits 1-day report to NHTSA; no mention of pullover or dragging</li> <li><b><i>3:30pm</b></i>: Cruise virtual meeting with SF MTA, SFPD, and SFFD. <abbr title="Director of Systems Integrity">Wood</abbr> &quot;shows full video several times&quot; without technical difficulties. Cruise doesn't bring up pullover maneuver or dragging, but officials see it and ask Cruise questions about it</li> <li><b><i>6:05pm</b></i>: Cruise CMT meeting. <abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> and <abbr title="COO">West</abbr> end Sev-0 war room. Some cruise employees later express concerns about this</li> <li><b>10-05, <i>10:46am</b></i>: Forbes asks Cruise for comment on AV dragging. Cruise declines to comment and stands by <b>10-03</b> press release</li> <li><b><i>1:07pm</b></i>: CPUC sends request for information with <b>10-19</b> response deadline</li> <li><b>10-06, <i>10:31am</b></i>: Forbes publishes &quot;Cruise Robotaxi Dragged Woman 20 Feet in Recent Accident, Local Politician Says&quot;</li> <li><b>10-10, <i>4:00pm</b></i>: DMV requests more complete video from Cruise. Cruise responds same day, offering to screenshare video</li> <li><b>10-11, <i>11am</i></b>: Cruise meeting with DMV on operational issues unrelated to accident. DMV's video request &quot;briefly discussed&quot;</li> <li><b><i>12:48pm</i></b>: Cruise paralegal submits 10-day report to NHTSA after checking for updates. Report doesn't mention pullover or dragging &quot;as no one told the paralegal these facts needed to be added&quot;</li> <li><b>10-12, <i>3pm</b></i>: NHTSA notifies Cruise that it intends to open Preliminary Evaluation (PE) for <b>10-02</b> accident and 3 other pedestrian-related events</li> <li><b>10-13, <i>10am</b></i>: Cruise meets with DMV and CHP to share 9m 6-pane video and DMV clarifies that it wants the 45s 9-pane video (&quot;full video&quot;)</li> <li><b><i>12:19pm</b></i>: Cruise uploads full video</li> <li><b><i>1:30pm</b></i>: Cruise meets with NHTSA and argues that PE is unwarranted</li> <li><b><i>10-16, 11:30am</b></i>: Cruise meets with DMV and CHP, who state they don't believe they were shown full video during <b>10-03</b> meeting</li> <li><b><i>10-16</b></i>: NHTSA officially opens PE</li> <li><b>10-18, <i>3:00pm</b></i>: Cruise holds standing monthly meeting with CPUC. 
Cruise says they'll meet CPUC's <b>10-19</b> deadline</li> <li><b>10-19, <i>1:40pm</b></i>: Cruise provides information and full video in response to <b>10-05</b> request</li> <li><b>10-23, <i>2:35pm</b></i>: Cruise learns of possible DMV suspension of driverless permit</li> <li><b>10-24, <i>10:28am</b></i>: DMV issues suspension of Cruise's driverless permit. Except for the few employees who heard on <b>10-23</b>, Cruise employees are surprised</li> <li><b><i>10:49am</b></i>: Cruise publishes blog post which states: &quot;Shortly after the incident, our team proactively shared information with the California Department of Motor Vehicles (DMV), California Public Utilities Commision [sic] (CPUC), and National Highway Traffic Safety Administration (NHTSA), including the full video, and have stayed in close contact with regulators to answer their questions&quot;</li> <li><b>11-02, <i>12:03pm</b></i>: Cruise submits 30-day NHTSA report, which includes discussion of pullover and dragging</li> <li><b>11-02</b>: Cruise recalls 950 systems as a result of <b>10-02</b> accident</li> <li><b>12-01</b>: CPUC issues Order to Show Cause &quot;for failing to provide complete information and for making misleading public comments regarding the <b>October 2, 2023</b> Cruise related incident and its subsequent interactions with the commission&quot;</li> </ul> <h3 id="d-video-footage-of-the-accident">D. Video Footage of the Accident</h3> <ul> <li>6 videos <ul> <li><strong>Offload 1; <i>9:29pm</i></strong>: low res, 3s, 4-pane. Captures 3s immediately after collision, including audio</li> <li><strong>Offload 2; <i>9:32pm</i></strong>*: 14s, 9-pane. No audio. Shows Nissan pedestrian collision and pedestrian being thrown into path of Cruise AV</li> <li><strong>Media Video; <i>10:04pm</i></strong>: 21s, 4-pane. Derived from offload 2, but slowed down</li> <li><strong><i>1:06am</i></strong>: 4s clip of offload 2, cut by <abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> and sent to SVP of Government Affairs <abbr title="SVP of Government Affairs">David Estrada</abbr> and Chief Legal Officer <abbr title="Chief Legal Officer">Jeff Bleich</abbr> with &quot;this is the cut I was thinking of&quot;. <abbr title="SVP of Government Affairs">Estrada</abbr> responds with &quot;yes, agree that should be the primary video that gets released if one is released&quot;. Video is a single pane from left-front of AV and only shows Nissan hitting pedestrian. <abbr title="SVP of Government Affairs">Estrada</abbr> says this should be shown in meetings with regulators first to show &quot;what happened in clarity so you can see how the event happened (establish clear fault of human driver)&quot;, but &quot;no evidence this shorter 4-second video was shown at any of the regulatory meetings.&quot;</li> <li><strong><i>3:21am</i></strong>: 12s 9-pane video derived from offload 2. Cruise VP of Global Government Affairs <abbr title="VP of Global Government Affairs Prashanthi">Prashanthi Raman</abbr> and <abbr title="SVP of Government Affairs">Estrada</abbr> asked <abbr title="Director of Systems Integrity">Wood</abbr> for shorter version of 14s video, &quot;given last night’s Sev 0 and our need to discuss with policymakers, can you please make us a usable video of this angle [link to Webviz]. We only need to show the impact and the person landing in front of us and then cut it there&quot;. <abbr title="Director of Systems Integrity">Wood</abbr> created the video. 
Cruise’s Senior Director of Federal Affairs <abbr title="Senior Director of Federal Affairs">Eric Danko</abbr> tells <abbr title="Director of Systems Integrity">Wood</abbr>, &quot;believe NHTSA will want video footage that captures moment of our impact as well&quot; and <abbr title="Director of Systems Integrity">Wood</abbr> replies, &quot;can create a NHTSA version video once the logs have been offloaded&quot;</li> <li><strong>&quot;full video&quot;; <i>6:28am</i></strong>: 45s, 9-pane, shows pullover and dragging. No audio. Link to full video posted to war room slack at 6:28am</li> </ul></li> </ul> <h3 id="e-the-facts-regarding-what-cruise-knew-and-when-about-the-october-2-accident">E. The Facts Regarding What Cruise Knew and When About the October 2 Accident</h3> <h4 id="1-facts-cruise-learned-the-evening-of-october-2">1. Facts Cruise Learned the Evening of October 2</h4> <h5 id="a-accident-scene">a. Accident Scene</h5> <ul> <li>Driverless Support Specialists (DSS) arrive at scene, 9:39pm to 9:44pm</li> <li>Another 2-person DSS team arrives with member of operations team and member of Safety Escalation team (SET), 10:00-10:30pm</li> <li>At least one contractor takes &gt; 100 photos and video and indicated an understanding of pedestrian dragging <ul> <li>Contractor noted blood and skin pieces, took long shots of trail of blood that indicated the AV traveled after impact; contractor was instructed to bring phone to Cruise instead of uploading video onto customary Slack channel; contractor believes this was to protect injured pedestrian's privacy</li> <li>Photos and video uploaded into &quot;RINO&quot; database at 2:23am, accessed by &gt; 100 employees starting at <b>10-03, <i>5:11am</i></b>; DB doesn't show which specific employees reviewed specific photos and videos</li> </ul></li> <li>Another person at the scene denied knowing about dragging</li> <li>In Cruise's internal review, prior to Quinn Emanuel, one Remote Assistance operator (RA) said they saw &quot;ped flung onto hood of AV. You could see and hear the bump&quot; and another saw AV &quot;was already pulling over to the side&quot;. Quinn Emanuel didn't find out about these until after the RIF on 12-14. On reaching out, one declined the interview and the other didn't respond <ul> <li>Two other interviewees reported discussion of secondary movement of AV on evening of <b>10-02</b> or early morning of <b>10-03</b> but &quot;this information has not been verified and appears contrary to the weight of the evidence&quot;</li> </ul></li> <li>No employees interviewed by Quinn Emanuel indicated they knew about dragging on <b>10-02</b></li> </ul> <h5 id="b-virtual-sev-0-war-room">b. Virtual &quot;Sev-0 War Room&quot;</h5> <ul> <li>20 people initially in war room</li> <li>200+ joined and left war room on <b>10-02</b> and <b>10-03</b></li> <li>2 interviewees recalled discussion of pedestrian dragging in Sev-0 war room on Meet. Neither could identify who was involved in the discussion or the timing of the discussion. One said it was after 4:00am</li> <li>Cruise incident response playbook outlines roles of Incident Commander, SLT, CMT as well as how to respond in the weeks after incident. Playbook was not followed, said to be &quot;aborted&quot; because &quot;too manually intensive&quot;</li> </ul> <h5 id="c-initial-media-narrative-about-the-october-2-accident">c.
Initial Media Narrative About the October 2 Accident</h5> <ul> <li>&quot;Although the War Room was supposed to address a variety of issues such as understanding how the accident happened and next steps, the focus quickly centered almost exclusively on correcting a false media narrative that the Cruise AV had caused the Accident&quot;</li> </ul> <h4 id="2-facts-cruise-learned-on-october-3">2. Facts Cruise Learned on October 3</h4> <h5 id="a-the-12-15-a-m-sev-0-collision-sfo-meeting">a. The 12:15 a.m. &quot;Sev-0 Collision SFO&quot; Meeting</h5> <ul> <li>CMT Incident Manager convened meeting with 140 invites</li> <li>Focus on sharing updates and media narrative strategy</li> <li>Slack communications show Cruise employees viewed risk that public could think that Cruise AV injured pedestrian as a crisis</li> <li><abbr title="SVP of Government Affairs">Estrada</abbr> says to <abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr>, &quot;feels like we are fighting with both arms tied behind our back if we are so afraid of releasing an exonerating video, very naïve if we think we won’t get walloped by media and enemies&quot; <ul> <li><abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr> responds, &quot;we are under siege is my opinion, we have no fighting chance with these headlines/media stories…we are drowning — and we will lose every time&quot;</li> <li>above statement is said to have &quot;captured well the feeling within Cruise’s senior leadership&quot;</li> </ul></li> <li><abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> attended meeting and wanted to reveal only 4s part of clip showing the Nissan hitting the pedestrian <ul> <li><abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> insisted he wanted to authorize any video or media statement before release, &quot;nothing would be shared or done&quot; without his approval</li> </ul></li> <li>In parallel, comms team drafted bullet points to share with media, including &quot;AV came to a complete stop immediately after impacting the struck pedestrian&quot;, which comms team did not know was inaccurate</li> </ul> <h5 id="b-engineer-s-3-45-a-m-slack-message">b. Engineer’s 3:45 a.m. Slack Message</h5> <ul> <li>Slack communication <ul> <li><abbr title="Director of Systems Integrity">Wood</abbr>: I have not seen this mentioned yet, but in the 1st RA Session the AV is stopped nearly right next to the adjacent vehicle but drives forward another 1-2 car lengths before coming to it's [sic] final position.</li> <li>Unnamed employee: ACP, I can’t access the link but is the PED under the vehicle while it continues to move? Am I understanding that correctly</li> <li><abbr title="Director of Systems Integrity">Wood</abbr>: I believe so and the AV video can be seen moving vertically</li> </ul></li> <li><abbr title="Director of Systems Integrity">Wood</abbr> determined this by looking at data from RA center, which implied AV movement of 1-2 car lengths with dragged pedestrian</li> </ul> <h5 id="c-the-6-00-a-m-crisis-management-team-cmt-meeting">c. The 6:00 a.m. 
Crisis Management Team (CMT) Meeting</h5> <ul> <li>CMT discussed pullover and dragging</li> <li>100+ in meeting, including &quot;COO Gil <abbr title="COO">West</abbr>, co-founder and Chief Product Officer Dan Kan, VP of Communications, Senior Director of Federal Affairs, and members of the communications, legal, engineering, safety, regulatory, and government affairs teams&quot;</li> <li>6:17am, engineer slacks <abbr title="Director of Systems Integrity">Wood</abbr>, &quot;have they raised the issue that the AV moved post-event? On this call. I joined late&quot; and <abbr title="Director of Systems Integrity">Wood</abbr> responds &quot;Not yet. I will raise&quot;</li> <li>Slack conversation during the meeting, from <abbr title="COO">West</abbr> to 6 other senior leaders: <ul> <li><abbr title="COO">West</abbr>: ACP- For awareness it was reported at the CMT meeting that the AV moved 1-2 vehicle lengths before RA connection (low collision and looking for pull over before e-stop hit)</li> <li><abbr title="CEO, CTO, President, and co-founder">Vogt</abbr>: Should we run road to sim and see what the AV would have done if it was in the other vehicles position? I think that might be quite powerful</li> <li><abbr title="COO">West</abbr>: Good idea- I suspect the AV would have stopped and never hit the Ped in the first place</li> </ul></li> <li>engineer summarized CMT meeting in war room slack, &quot;in the CMT meeting this morning, there was discussion of releasing/sharing video at some point. <abbr title="Director of Systems Integrity">Matt Wood</abbr> also noted that the AV travels at a slow speed after the collision, with the pedestrian underneath the car (it's about 7 meters). It wasn't discussed, but i wanted to point out that someone who has access to our AV video above up until the point of the collision could see later that the AV traveled about this distance post-collision, because there is a video on social media which shows the AV stopped with the pedestrian underneath, and there are some markers in the scene.&quot; <ul> <li>engineer also pointed out that non-engineers should be able to deduce the pedestrian dragging from the video from before the AV impact plus the social media video showing the AV's final position. After the DMV suspension order, the engineer also said &quot;I pointed out in the channel that it was not hard to conclude there was movement after the initial stop…it seems the DMV fully understanding the entire details was predictable&quot;</li> </ul></li> </ul> <h5 id="d-the-6-45-a-m-senior-leadership-team-slt-meeting">d. The 6:45 a.m. Senior Leadership Team (SLT) Meeting</h5> <ul> <li>Dragging discussed in SLT meeting</li> <li>SLT discussed amending media statement, &quot;the outcome [of these discussions] was whatever statement was published on social we would stick with because the decision was we would lose credibility by editing a previously agreed upon statement&quot;</li> <li>At this point, senior members of comms team knew that &quot;AV came to a complete stop immediately after impacting the struck pedestrian&quot; statement was inaccurate, but comms team continued giving inaccurate statement to press after SLT meeting, resulting in publications with incorrect statements in Forbes, CNBC, ABC News Digital, Engadget, Jalopnik, and The Register <ul> <li>&quot;complete stop&quot; removed on <b>10-13</b> after comms employee flagged statement to legal, which said &quot;I don’t think we can say this&quot;</li> </ul></li> </ul> <h5 id="e-the-7-45-a-m-and-10-35-a-m-engineering-and-safety-team">e. The 7:45 a.m.
and 10:35 a.m. Engineering and Safety Team Meetings</h5> <ul> <li>[no notes]</li> </ul> <h5 id="f-the-12-05-p-m-cmt-meeting">f. The 12:05 p.m. CMT Meeting</h5> <ul> <li>[no notes]</li> </ul> <h5 id="g-the-12-40-p-m-slt-meeting">g. The 12:40 p.m. SLT Meeting</h5> <ul> <li>&quot;<abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> is said to have stated that it was good the AV stopped after 20 feet when it detected interference with its tire rather than continuing, as AVs are programmed to do, to look for a safe place to pull over for up to 100 feet or one full block&quot;</li> <li>Safety and eng teams raised question of grounding fleet until fix deployed <ul> <li><abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> and <abbr title="COO">West</abbr> rejected idea</li> </ul></li> </ul> <h5 id="h-the-6-05-p-m-cmt-meeting">h. The 6:05 p.m. CMT Meeting</h5> <ul> <li>CMT leaders learn SLT is disbanding Sev-0 war room</li> <li>Some interviewees expressed concern that no future CMT meetings scheduled for biggest incident in Cruise history <ul> <li>Some suggested to Chief Legal Officer <abbr title="Chief Legal Officer">Jeff Bleich</abbr> that &quot;miniature CMT&quot; should continue to meet; <abbr title="Chief Legal Officer">Bleich</abbr> and others supportive, but this wasn't done</li> </ul></li> </ul> <h4 id="3-cruise-s-response-to-the-forbes-article">3. Cruise’s Response to the Forbes Article</h4> <ul> <li>Forbes reached out about pedestrian dragging <ul> <li>Cruise decides not to respond to avoid triggering a new media cycle <ul> <li>Cruise stops sharing video with media</li> </ul></li> </ul></li> </ul> <h2 id="iii-cruise-s-communications-with-regulators-city-officials-and-other-stakeholders">III. CRUISE’S COMMUNICATIONS WITH REGULATORS, CITY OFFICIALS, AND OTHER STAKEHOLDERS</h2> <h3 id="a-overview-of-cruise-s-initial-outreach-and-meetings-with-regulators">A. Overview of Cruise’s Initial Outreach and Meetings with Regulators</h3> <ul> <li>&quot;initial blurb&quot; drafted by 12:24am; Cruise not aware of dragging at the time</li> </ul> <h3 id="b-the-mayor-s-office-meeting-on-october-3">B. The Mayor’s Office Meeting on October 3</h3> <ul> <li>Meeting with Mayor’s Transportation Advisor Alexandra Sweet</li> <li>Cruise employee gave overview, then <abbr title="Director of Systems Integrity">Wood</abbr> played full video <ul> <li>This approach became the standard presentation from Cruise</li> </ul></li> <li>Full video was played twice by <abbr title="Director of Systems Integrity">Wood</abbr>, but there were connectivity issues</li> <li>Sweet apparently noticed that the vehicle moved again, but didn't ask about dragging or why vehicle moved again</li> </ul> <h3 id="c-cruise-s-disclosures-to-the-national-highway-traffic-safety-administration-nhtsa">C. Cruise’s Disclosures to the National Highway Traffic Safety Administration (NHTSA)</h3> <h4 id="1-cruise-s-initial-outreach-on-october-3">1.
Cruise’s Initial Outreach on October 3</h4> <ul> <li><b>10-03, <i>7:25am</i></b>: Cruise’s Head of Regulatory Engagement emailed NHTSA <ul> <li>NHTSA issues they wanted addressed included &quot;Whether the Cruise ADS or remote assistant could ascertain that the pedestrian was trapped under the vehicle or the location of a pedestrian on the ground&quot; and &quot;vehicle control dynamics (lateral and longitudinal) leading to the incident and following impact including ADS predicted path of the pedestrian and whether any crash avoidance or mitigation took place&quot; and video of accident</li> </ul></li> </ul> <h4 id="2-cruise-s-nhtsa-pre-meeting">2. Cruise’s NHTSA Pre-Meeting</h4> <ul> <li>Talking points for anticipated questions <ul> <li>Did you stop the fleet? <ul> <li><abbr title="Deputy General Counsel">Alicia Fenrick</abbr>: We have not changed the posture of the fleet.</li> <li>We have not identified a fault in AV response</li> </ul></li> <li>Why did the vehicle move after it had initially stopped? <ul> <li>[Not discussed] <abbr title="Director of Systems Integrity">Matthew Wood</abbr>: The impact triggered a collision detection and the vehicle is designed to pull over out of lane in</li> </ul></li> <li>Why didn't the vehicle brake in anticipation of the pedestrian in the road? <ul> <li><abbr title="Director of Systems Integrity">Matthew Wood</abbr>: I think the video speaks for itself, the pedestrian is well past our lane of travel into the other lane</li> <li><abbr title="Deputy General Counsel">Alicia Fenrick</abbr>: The pedestrian was clearly well past lane of travel of the AV. It would not be reasonable to expect that the other vehicle would speed up and proceed to hit the pedestrian, and then for the pedestrian to flip over the adjacent car and wind up in our lane.</li> </ul></li> </ul></li> <li>Excerpt of notes from an employee: <ul> <li>They requested a video-wait until the meeting at least. Then another question-where we end the video.</li> <li>Alicia: Biggest issue candidly. That we moved, and why, is something we are going to need to explain. The facts are what they are.</li> <li>Matt: why we moved, it is a collision response. Detected as a minor collision, so design response is a permissible lane pullover.</li> <li>How to reference this: triggered a collision detection and was designed to pull over out of lane. Do not qualify as minor collision, rather as a collision detection.</li> <li>Questions will be: it stopped and then proceeded forward.</li> <li>General premise: we are looking into this, we are doing a deep dive, we have done some preliminary analysis and this is what we have but it is only preliminary.</li> <li>Buckets: before impact, impact, after impact.</li> </ul></li> <li>Slack messages show discussion of when video should be sent and which video should be sent; decided to play full video to avoid being &quot;accused of hiding the ball&quot;</li> </ul> <h4 id="3-cruise-s-meeting-with-nhtsa-on-october-3">3.
Cruise’s Meeting with NHTSA on October 3</h4> <ul> <li><abbr title="Director of Systems Integrity">Wood</abbr> played the full video 2 or 3 times, &quot;but it kept stopping or blacking- or whiting out because his home computer was having connectivity issues&quot;</li> <li>&quot;NHTSA did not see the Full Video clearly or in its entirety&quot;</li> <li>No discussion of pullover or dragging <ul> <li>Pre-meeting notes edited after meeting, adding &quot;[Not discussed]&quot; to this item</li> </ul></li> <li>Meeting notes of some questions asked by NHTSA <ul> <li>Could RA detect that pedestrian trapped? <ul> <li><abbr title="Director of Systems Integrity">Wood</abbr>: Yes</li> </ul></li> <li>Sensors too? <ul> <li><abbr title="Director of Systems Integrity">Wood</abbr>: Yes</li> </ul></li> <li>The statement &quot;the last thing you would want to do is move when a pedestrian is underneath&quot; appears to have been said, but recollections disagree on who said it. Some believe <abbr title="Director of Systems Integrity">Wood</abbr> said this and NHTSA concurred, some believe <abbr title="Director of Systems Integrity">Wood</abbr> said this and NHTSA repeated the statement, and some believe that NHTSA said this and <abbr title="Director of Systems Integrity">Wood</abbr> concurred</li> </ul></li> <li>Post-meeting slack discussion <ul> <li>Employee: &quot;I think we might need to mention the comment Matt made during the NHTSA call that the last thing you would want to do is move with a pedestrian under the car. From my notes and recollection Matt said 'As pedestrian is under vehicle last thing want to do is maneuver' and [the NHTSA regulator] agreed&quot;</li> <li>Another employee: &quot;lets see where the conversation goes. if it’s relevant, we should share it. That’s not the main point here though&quot;</li> <li>In other discussions, other employees and execs express varying levels of concern on the non-disclosure of pullover and dragging, from moderate to none (e.g., Senior Director of Federal Affairs says he &quot;stands by it ... [Cruise's employees] have gone beyond their regulatory requirements&quot;)</li> </ul></li> </ul> <h4 id="4-cruise-s-nhtsa-post-meeting-on-october-3">4. Cruise’s NHTSA Post-Meeting on October 3</h4> <ul> <li>NHTSA sent a request for video and Cruise uploaded full video</li> </ul> <h4 id="5-cruise-s-interactions-with-nhtsa-on-october-12-13-and-16">5. Cruise’s Interactions with NHTSA on October 12, 13, and 16</h4> <h5 id="a-october-12-call">a. October 12 Call</h5> <ul> <li>NHTSA regulator called Cruise employee and informed them that NHTSA was planning Preliminary Evaluation; employee sent the following to Cruise NHTSA team: <ul> <li>&quot;She shared that there was a lot of consternation in the front office about last week's incident. It is going to be a pretty broad investigation into how vehicles react to pedestrians out in the street and people in the roadway. But questions about last week's incident will be included in the IR questions and analysis. I offered an additional briefing about last week's incident, but she said that we were quite upfront and shared the video and told them everything they need to know.&quot;</li> <li>&quot;it [is] difficult to believe that they could find fault with our reaction to the pedestrian in Panini [Panini is the name of the specific AV] that would extend beyond asking us additional questions in a follow-up…&quot;</li> </ul></li> </ul> <h5 id="b-october-13-meeting">b.
October 13 Meeting</h5> <ul> <li>&quot;Despite the severe consequences that could result from a PE, including a recall, Cruise’s Chief Legal Officer and Senior Vice President of Government Affairs did not attend&quot;</li> <li>From meeting agenda: &quot;We’re just a little confused by this. We met with you about the Panini incident last week, and the team didn’t express any remaining concerns about it, even when asked if you had any additional concerns. Was there really remaining concern about AV behavior regarding Panini? If yes, why did they not request another briefing? We’ve been extremely cooperative with the Agency and have always provided information that the agency requested. What will be gained by this escalation that we are not already providing? Offer briefing on any of these topics in lieu of PE.&quot;</li> <li>Also planned to state: &quot;Regarding last week’s incident we briefed the agency within hours of the event, provided video, and offered repeatedly to share additional information, including around the topic of pedestrian safety broadly. None was requested, which makes us question the motivations behind opening a PE. PEs are punitive means to gather information, and are reputationally harmful, particularly in a nascent industry.&quot;</li> </ul> <h5 id="c-october-16-pe">c. October 16 PE</h5> <ul> <li>[no notes]</li> </ul> <h4 id="6-cruise-s-nhtsa-reports-regarding-the-october-2-accident">6. Cruise’s NHTSA Reports Regarding the October 2 Accident</h4> <ul> <li>NHTSA's SGO requires three written reports, including &quot;a written description of the pre-crash, crash, and post-crash details&quot;</li> <li>Cruise's first two reports did not mention pullover and dragging; after consultation with GM, third report did mention pullover and dragging</li> </ul> <h5 id="a-nhtsa-1-day-report">a. NHTSA 1-Day Report</h5> <ul> <li>Original draft forwarded from paralegal to Deputy General Counsel <abbr title="Deputy General Counsel">Alicia Fenrick</abbr>, Director of Communications Erik Moser, and Managing Legal Counsel <abbr title="Managing Legal Counsel">Andrew Rubenstein</abbr>: &quot;A Cruise autonomous vehicle (&quot;AV&quot;), operating in driverless autonomous mode, was at a complete stop in response to a red light on southbound Cyril Magnin Street at the intersection with Market Street. A dark colored Nissan Sentra was also stopped in the adjacent lane to the left of the AV. As the Nissan Sentra and the AV proceeded through the intersection after the light turned green, a pedestrian entered the crosswalk on the opposite side of Market Street across from the vehicles and proceeded through the intersection against a red light. The pedestrian passed through the AV's lane of travel but stopped mid-crosswalk in the adjacent lane. Shortly thereafter, the Nissan Sentra made contact with the pedestrian, launching the pedestrian in front of the AV. The AV braked aggressively but, shortly thereafter, made contact with the pedestrian. This caused no damage to the AV. The driver of the Nissan Sentra left the scene shortly after the collision. Police and Emergency Medical Services (EMS) were called to the scene. 
The pedestrian was transported by EMS.&quot; <ul> <li>LGTM'd [approved] by <abbr title="Deputy General Counsel">Fenrick</abbr> and Moser; <abbr title="Managing Legal Counsel">Rubenstein</abbr> said &quot;the GA folks have suggested some additional edits&quot;, which included adding that the pedestrian &quot;completely&quot; passed through the AV's lane of travel, changing &quot;launching&quot; to &quot;deflecting&quot;, and removing &quot;this caused no damage to the AV&quot;; no discussion of possible inclusion of pullover and dragging</li> </ul></li> <li>Cruise employee who established NHTSA reporting system believed that full details, including pullover and dragging, should've been included, but they were on vacation at the time</li> <li>In a later (<b>10-24</b>) employee Q&amp;A on the DMV suspension order, an employee asked &quot;Why was the decision made not to include the post-collision pull-over in the written report to the NHTSA? At least, this seems like it must have been an intentional decision, not an accidental oversight.&quot; <ul> <li><abbr title="Managing Legal Counsel">Rubenstein</abbr> drafted this prepared response for <abbr title="Deputy General Counsel">Fenrick</abbr>: &quot;The purpose of the NHTSA reporting requirement is to notify the agency of the occurrence of crashes. Consistent with that objective and our usual practice, our report notified NHTSA that the crash had occurred. Additionally, we had already met with NHTSA, including showing the full video to them, prior to submission of the report. That meeting was the result of our proactive outreach: we immediately reached out to NHTSA after the incident to set up a meeting to discuss with them. Our team met with NHTSA in the morning following the incident, including showing the full video to NHTSA. We then submitted the report and sent a copy of the full video later that afternoon.&quot;</li> <li><abbr title="Deputy General Counsel">Fenrick</abbr> LGTM'd the above, but the response ended up not being given</li> </ul></li> <li>Quinn Emanuel notes, &quot;It is difficult to square this rationale with the plain language of the NHTSA regulation itself, which requires “a written description of the pre-crash, crash, <i>and post-crash details….</i>” (emphasis added)&quot;</li> </ul> <h5 id="b-nhtsa-10-day-report">b. NHTSA 10-Day Report</h5> <ul> <li>Paralegal had full authority to determine if any new info or updates were necessary</li> <li>Paralegal asked three employees on slack, &quot;hi, checking in to see if there have been any updates to this incident? In particular, any status on the ped&quot; <ul> <li>An employee who interacts with law enforcement responded &quot;Unfortunately no. I’ve reached out to the investigating sergeant but have not received a response. This is probably due to other investigations he may be involved in&quot;</li> <li>This employee said that they were referring only to the pedestrian's medical condition, but the paralegal took the response more broadly</li> </ul></li> <li>The paralegal also checked the RINO database for updates and saw none, then filed the 10-day report, which states &quot;There are no updates related to this incident since the original submission on October 3, 2023&quot; and then repeats the narrative in the 1-day report, omitting discussion of pullover and dragging</li> </ul> <h5 id="c-nhtsa-30-day-report">c. NHTSA 30-Day Report</h5> <ul> <li>GM urged Cruise to be more comprehensive with 30-day report, so CLO <abbr title="Chief Legal Officer">Bleich</abbr> got involved.
<ul> <li><abbr title="Chief Legal Officer">Bleich</abbr> reviewed the 1-day and 10-day reports, and then followed up with &quot;[t]he most important thing now is simply to be complete and accurate in our reporting of this event to our regulators&quot;, says to include the pullover and dragging</li> <li><abbr title="Managing Legal Counsel">Rubenstein</abbr> objected to including dragging in 30-day report</li> </ul></li> </ul> <h4 id="7-conclusions-regarding-cruise-s-interactions-with-nhtsa">7. Conclusions Regarding Cruise’s Interactions with NHTSA</h4> <ul> <li>[no notes]</li> </ul> <h3 id="d-cruise-s-disclosures-to-the-department-of-motor-vehicles-dmv">D. Cruise’s Disclosures to the Department of Motor Vehicles (DMV)</h3> <h4 id="1-cruise-s-initial-outreach-to-the-dmv-and-internal-discussion-of-which-video-to-show">1. Cruise’s Initial Outreach to the DMV and Internal Discussion of Which Video to Show</h4> <ul> <li>&quot;<abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> wanted to focus solely on the Nissan’s role in causing the Accident and avoid showing the pedestrian’s injuries&quot;</li> <li><abbr title="SVP of Government Affairs">Estrada</abbr> to <abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr>, apparently concurring: &quot;Think we should get the clip of the video as <abbr title="CEO, CTO, President, and co-founder">Kyle</abbr> described to prepare to show it to policymakers ... show the impact and the person landing in front of us. Cut it there. That's all that is needed.&quot;</li> <li><abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr> and <abbr title="Senior Director of Federal Affairs">Danko</abbr> disagreed and pushed for showing most complete video available</li> </ul> <h4 id="2-dmv-s-response-to-cruise-s-outreach">2. DMV’s Response to Cruise’s Outreach</h4> <ul> <li>[no notes]</li> </ul> <h4 id="3-cruise-s-dmv-pre-meeting">3. Cruise’s DMV Pre-Meeting</h4> <ul> <li>&quot;While Deputy General Counsel <abbr title="Deputy General Counsel">Fenrick</abbr> said she did not typically attend DMV meetings, she opted to attend this meeting in order to have some overlapping attendees between the NHTSA and DMV meetings. Notably, neither <abbr title="Chief Legal Officer">Bleich</abbr> nor <abbr title="SVP of Government Affairs">Estrada</abbr> attended the pre-meeting despite planning to meet in-person with the DMV Director to discuss the Accident.&quot;</li> </ul> <h4 id="4-cruise-s-october-3-meeting-with-the-dmv">4. Cruise’s October 3 Meeting with the DMV</h4> <h5 id="a-dmv-meeting-discussions">a. DMV Meeting Discussions</h5> <ul> <li>DMV regulators do not believe full video was played</li> <li>Cruise employees have different recollections, but many believe full video was played, likely with bad connectivity problems</li> <li>No discussion of pullover or dragging</li> </ul> <h5 id="b-cruise-s-post-dmv-meeting-reflections">b. Cruise’s Post-DMV Meeting Reflections</h5> <ul> <li>Slack discussion <ul> <li><abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr>: thoughts?</li> <li><abbr title="Deputy General Counsel">Fenrick</abbr>: You mean DMV call? More aggressive than NHTSA . . .</li> <li>ACP - Not overly so but seemed a bit more critical and a little unrealistic. Like really we should predict another vehicle will hit and run and brake accordingly. 
I think they think they're expectations of anticipatory response is to other road users collisions was a bit off.</li> <li><abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr>: They tend to ask insane hypotheticals. I was about to interrupt and say we can go through any number of hypos.. this is what happened but I was waiting for them to ask a followup question before I did it.</li> <li><abbr title="Deputy General Counsel">Fenrick</abbr>: insane hypothetical is absolutely right</li> <li>ACP - Bigger concern is that no regulator has really clued in that we moved after rolling over the pedestrian</li> </ul></li> <li>In another slack discussion, an employee stated &quot;the car moved and they didn’t ask and we’re kind of lucky they didn’t ask&quot; <ul> <li>Some employees indicate that this was the general consensus about the meeting</li> </ul></li> </ul> <h4 id="5-cruise-s-october-10-communications-with-dmv">5. Cruise’s October 10 Communications with DMV</h4> <ul> <li>DMV asked for video by <b>10-11</b>. Cruise did not do this, but showed a video in a meeting on <b>10-13</b></li> </ul> <h4 id="6-cruise-s-october-11-meeting-with-the-dmv">6. Cruise’s October 11 Meeting with the DMV</h4> <ul> <li>[no notes]</li> </ul> <h4 id="7-cruise-s-october-13-meeting-with-the-dmv">7. Cruise’s October 13 Meeting with the DMV</h4> <ul> <li>Cruise shared 9-minute 6-pane video created by <abbr title="Director of Systems Integrity">Wood</abbr> <ul> <li>&quot;Notably, the camera angles did not include the lower frontal camera angles that most clearly showed the AV’s impact with the pedestrian and pullover maneuver&quot;</li> </ul></li> <li>&quot;Interviewees said that the DMV’s tone in the meeting 'felt very mistrustful' and that it 'felt like something was not right here.'&quot; <ul> <li>DMV had questions about what appeared to be missing or misleading video</li> <li>In response to DMV's concerns and request, Cruise uploaded full video to DMV online portal</li> </ul></li> </ul> <h4 id="8-cruise-s-october-16-meeting-with-the-dmv">8. Cruise’s October 16 Meeting with the DMV</h4> <ul> <li>Meeting was scheduled for a different topic, but meeting moved to topic of DMV being misled about the accident; &quot;Cruise interviewees recalled that the DMV and CHP attendees were angry about the October 3 presentation, saying their collective memory was that they were not shown the Full Video&quot;</li> </ul> <h4 id="9-cruise-s-october-23-communications-with-the-dmv">9. Cruise’s October 23 Communications with the DMV</h4> <ul> <li>Cruise calls political consultant to have them find out why DMV has been silent on expansion of SF autonomous fleet <ul> <li>Consultant says DMV is &quot;pissed&quot; and considering revocation of Cruise's license to operate</li> </ul></li> <li>Internal disagreement on whether this could happen. &quot;<abbr title="SVP of Government Affairs">Estrada</abbr> then sent CLO <abbr title="Chief Legal Officer">Bleich</abbr> a Slack message indicating that he had talked to the DMV Director and there was '[n]o indication whatsoever that they are considering revoking.'&quot; <ul> <li><abbr title="VP of Global Government Affairs Prashanthi">Raman</abbr> checks with political consultant again, who repeats that DMV is very angry and may revoke</li> </ul></li> </ul> <h4 id="10-dmv-s-october-24-suspension-order">10. 
DMV’s October 24 Suspension Order</h4> <ul> <li><abbr title="SVP of Government Affairs">Estrada</abbr> calls DMV director Gordon to ask about suspension and is stonewalled</li> <li><abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> joins the call and makes personal appeal, saying he's &quot;been committed to this since he was 13 to try and improve driver safety&quot;</li> <li>Appeal fails and suspension order is issued shortly afterwards</li> <li>Slack conversation <ul> <li><abbr title="SVP of Government Affairs">Estrada</abbr>: <abbr title="CEO, CTO, President, and co-founder">Kyle</abbr> leading our response that we provided &quot;full&quot; video and we will stand by that if it's a fight.</li> <li><abbr title="Chief Legal Officer">Bleich</abbr>: ACP- This will be a difficult fight to win. DMV and CHP have credibility and Steve Gordon seems to swear that he did not see the end of the video. The word of Cruise employees won't be trusted. I think we should bring in an outside firm to review the sequence of events and do an internal report since otherwise there is no basis for people to believe us. We should consider doing this and how to message it.</li> <li><abbr title="SVP of Government Affairs">Estrada</abbr>: Yes agree difficult and that we need to do it because we have facts, we can have sworn statements and data analytics on our side. Not a he said she said. We have proof. If we prove with facts a false statement that is important reputation saving.</li> <li>Steve stopped even trying to make this claim. He resorted to arguing we should have highlighted the pullover attempt. This is a big overreach by them to make a claim like this we have the ability to prove false.</li> </ul></li> </ul> <h4 id="11-post-october-24-dmv-communications">11. Post-October 24 DMV Communications</h4> <ul> <li><a href="https://web.archive.org/web/20231024184429/https://getcruise.com/news/blog/2023/a-detailed-review-of-the-recent-sf-hit-and-run-incident/">Vogt posted this blog post</a>, titled &quot;A detailed review of the recent SF hit-and-run incident&quot; <ul> <li>[The report only has an excerpt from the blog post, but for the same reason I think it's worth looking at the report in detail, I think it's worth looking at the blog post linked above; my read of the now-deleted blog post is that it attempts to place the blame on the &quot;hit and run&quot; driver, which is repeatedly emphasized; the blog post also appears to include a video of the simulation discussed above, where <abbr title="CEO, CTO, President, and co-founder">Vogt</abbr> says &quot;Should we run road to sim and see what the AV would have done if it was in the other vehicles position? I think that might be quite powerful&quot;]</li> <li>[The blog post does discuss the pullover and dragging, saying &quot;The AV detected a collision, bringing the vehicle to a stop; then attempted to pull over to avoid causing further road safety issues, pulling the individual forward approximately 20 feet&quot;]</li> </ul></li> </ul> <h4 id="12-conclusions-regarding-cruise-s-communications-with-the-dmv">12. Conclusions Regarding Cruise’s Communications with the DMV</h4> <ul> <li><abbr title="Chief Legal Officer">Bleich</abbr>: &quot;[T]he main concern from DMV was that our vehicle did not distinguish between a person and another object under its carriage originally, and so went into an MRC. Second, they felt that we should have emphasized the AV’s second movement right away in our first meeting. 
In fact, in the first meeting -- although we showed them the full video -- they (and we) were focused on confirming that we were not operating unsafely before the collision and we did not cause the initial contact with the pedestrian. They did not focus on the end of the video and -- because they did not raise it -- our team did not actively address it&quot;</li> <li><abbr title="CEO, CTO, President, and co-founder">Vogt</abbr>: &quot;I am very much struggling with the fact that our GA team did not volunteer the info about the secondary movement with the DMV, and that during the handling of the event I remember getting inconsistent reports as to what was shared. At some point bad judgment call must have been made, and I want to know how that happened.&quot;</li> <li><abbr title="Chief Legal Officer">Bleich</abbr>: &quot;ACP -- I share your concern that the second movement wasn’t part of the discussion. I don’t know that there was a deliberate decision by the team that was doing the briefings. I believe they were still in the mode from the previous evening where they were pushing back against an assumption that we either were responsible for hitting the pedestrian or that we did not react fast enough when the pedestrian fell into our path. But as I’ve probed for basic information about what we shared and when I’ve had the same frustration that dates get pushed together or details are left out. I don’t know if this is deliberate, or people are simply having difficulty recalling exactly what they did or said during the immediate aftermath of that event.&quot;</li> <li>&quot;these Slacks convey that the three senior leaders of the company – the CEO, CLO, and COO – were not actively engaged in the regulatory response for the worst accident in Cruise’s history. Instead, they were trying to piece together what happened after the fact.&quot;</li> </ul> <h3 id="e-cruise-s-disclosures-to-the-sf-mta-sf-fire-department-and-sf-police">E. Cruise’s Disclosures to the SF MTA, SF Fire Department, and SF Police Department</h3> <ul> <li>After playing video, a government official asks &quot;this car moves with the woman underneath it, is that what we are seeing?&quot;, which results in a series of discussions about this topic</li> <li>Two of the four Cruise employees in the meeting report being shocked to see the pullover and dragging, apparently not realizing that this had happened</li> </ul> <h3 id="f-cruise-s-disclosures-to-the-california-public-utilities-commission-cpuc">F. Cruise’s Disclosures to the California Public Utilities Commission (CPUC)</h3> <h4 id="1-cruise-s-october-3-communications-with-the-cpuc">1. Cruise’s October 3 Communications with the CPUC</h4> <ul> <li>CPUC and Cruise disagree on whether or not there was an offer to play the full video</li> </ul> <h4 id="2-cpuc-s-october-5-data-request">2. CPUC’s October 5 Data Request</h4> <ul> <li>CPUC requests video by <b>10-19</b>; Cruise's standard policy was to respond on the last day, so video was sent on <b>10-19</b></li> </ul> <h4 id="3-cruise-s-october-19-response-to-cpuc-s-data-request">3. Cruise’s October 19 Response to CPUC’s Data Request</h4> <ul> <li>Video, along with the following summary: &quot;[T]he Nissan Sentra made contact with the pedestrian, deflecting the pedestrian in front of the AV. The AV biased rightward before braking aggressively but, shortly thereafter, made contact with the pedestrian.
The AV then attempted to achieve a minimal risk condition (MRC) by pulling out of the lane before coming to its final stop position. The driver of the Nissan Sentra left the scene shortly after the collision.&quot;</li> </ul> <h4 id="4-conclusions-regarding-cruise-s-disclosures-to-the-cpuc">4. Conclusions Regarding Cruise’s Disclosures to the CPUC</h4> <ul> <li>[no notes]</li> </ul> <h3 id="g-cruise-s-disclosures-to-other-federal-officials">G. Cruise’s Disclosures to Other Federal Officials</h3> <ul> <li>Cruise's initial outreach focused on conveying the accident had been caused by the hit-and-run Nissan driver</li> <li>After the DMV suspension on <b>10-24</b>, &quot;outreach focused on conveying the message that it believed it had worked closely with regulatory agencies such as the California DMV, CPUC, and NHTSA following the October 2 Accident&quot;</li> </ul> <h2 id="iv-the-aftermath-of-the-october-2-accident">IV. THE AFTERMATH OF THE OCTOBER 2 ACCIDENT</h2> <h3 id="a-the-cruise-license-suspension-by-the-dmv-in-california">A. The Cruise License Suspension by the DMV in California</h3> <ul> <li>Operating with human driver behind the wheel still allowed</li> </ul> <h3 id="b-the-nhtsa-pe-investigation-and-safety-recall">B. The NHTSA PE Investigation and Safety Recall</h3> <ul> <li>[no notes]</li> </ul> <h3 id="c-the-cpuc-s-show-cause-ruling">C. The CPUC’s “Show Cause Ruling”</h3> <ul> <li>[no notes]</li> </ul> <h3 id="d-new-senior-management-of-cruise-and-the-downsizing-of-cruise">D. New Senior Management of Cruise and the Downsizing of Cruise</h3> <ul> <li>[no notes]</li> </ul> <h2 id="v-summary-of-findings-and-conclusions">V. SUMMARY OF FINDINGS AND CONCLUSIONS</h2> <ul> <li>&quot;By the time Cruise employees from legal, government affairs, operations, and systems integrity met with regulators and other government officials on October 3, they knew or should have known that the Cruise AV had engaged in a pullover maneuver and dragged the pedestrian underneath the vehicle for approximately 20 feet&quot; <ul> <li>&quot;Cruise’s passive, non-transparent approach to its disclosure obligations to its regulators reflects a basic misunderstanding of what regulatory authorities need to know and when they need to know it&quot;</li> </ul></li> <li>&quot;Although neither Cruise nor Quinn Emanuel can definitively establish that NHTSA or DMV were shown the entirety of the Full Video, including the pullover maneuver and dragging, the weight of the evidence indicates that Cruise attempted to play the Full Video in these meetings; however, internet connectivity issues impeded or prevented these regulators from seeing the video clearly or fully.&quot; <ul> <li>&quot;in the face of these internet connectivity issues that caused the video to freeze or black- or white-out, Cruise employees remained silent, failing to ensure that the regulators understood what they likely could not see – that the Cruise AV had moved forward again after the initial impact, dragging the pedestrian underneath the vehicle&quot;</li> </ul></li> <li>&quot;Even if, as some Cruise employees stated, they were unaware of the pullover maneuver and pedestrian dragging at the time of certain regulatory briefings (which itself raises other concerns), Cruise leadership and other personnel were informed about the full details of the October 2 Accident during the day on October 3 and should have taken corrective action.&quot;</li> <li>&quot;While Cruise employees clearly demonstrated mistakes of judgment and failure to appreciate the 
importance of transparency and accountability, based on Quinn Emanuel’s review to date, the evidence does not establish that Cruise employees sought to intentionally mislead government regulators about the October 2 Accident, including the pullover maneuver and pedestrian dragging&quot;</li> <li>&quot;Cruise’s senior leadership repeatedly failed to understand the importance of public trust and accountability&quot;</li> <li>&quot;Cruise’s response to the October 2 Accident reflects deficient leadership at the highest levels of the Company—including among some members of the C-Suite, legal, governmental affairs, systems integrity, and communications teams—that led to a lack of coordination, mistakes of judgment, misapprehension of regulatory requirements and expectations, and inconsistent disclosures and discussions of material facts at critical meetings with regulators and other government officials. The end result has been a profound loss of public and governmental trust and a suspension of Cruise’s business in California&quot; <ul> <li>&quot;There was no captain of the ship. No single person or team within Cruise appears to have taken responsibility to ensure a coordinated and fully transparent disclosure of all material facts regarding the October 2 Accident to the DMV, NHTSA, and other governmental officials. Various members of the SLT who had the responsibility for managing the response to this Accident were missing-in-action for key meetings, both preparatory and/or with the regulators. This left each Cruise team to prepare for the meetings independently, with different employees attending different regulatory meetings, and with no senior Cruise official providing overall direction to ensure consistency in approach and disclosure of all material facts.&quot;</li> <li>&quot;There was no demonstrated understanding of regulatory expectations by certain senior Cruise management or line employees&quot;</li> <li>&quot;Cruise’s deficient regulatory response to the October 2 Accident reflects preexisting weaknesses in the Company, including ineffectual Cruise leadership with respect to certain senior leaders. Two out of many examples illustrate these weaknesses.&quot; <ul> <li>No coordinated or rigorous process for what needed to be discussed with DMV, NHTSA, etc., nor did leadership or employees in meetings take steps to ensure they were informed of what had happened before the meetings (such as asking their direct reports for updates); &quot;To underscore Cruise’s lack of coordination in its briefings to regulators and other government officials on October 3, senior leadership never convened a meeting of the various teams to discuss and learn how these meetings went, what questions were asked, and what discussions took place. Had they done so, they should have realized that in only one of the four meetings did government officials ask questions about the pullover maneuver and pedestrian dragging, requiring corrective action&quot;</li> <li>&quot;Cruise lawyers displayed a lack of understanding of what information must be communicated to NHTSA in these reports, and misapprehended the NHTSA requirement ... Cruise leadership gave a paralegal the primary responsibility for preparing and filing such reports with the Cruise legal department exercising little oversight&quot;</li> </ul></li> </ul></li> </ul> <h2 id="vi-recommendations">VI. 
RECOMMENDATIONS</h2> <ul> <li>New senior leadership</li> <li>Consider creating a dedicated, cross-disciplinary Regulatory Team which understands regulations, has experience dealing with regulators, and proactively improves Cruise's regulatory reporting processes and systems, reporting directly to CEO with board oversight</li> <li>Training for remaining senior leadership</li> <li>Create a streamlined Crisis Management Team <ul> <li>200 people in a war room can't manage a crisis; also need to have a &quot;captain&quot; or someone in charge</li> </ul></li> <li>Review incident response protocol and ensure that it is followed</li> <li>&quot;There is a need to reform the governmental affairs, legal, and public communications functions within Cruise&quot;</li> <li>&quot;Cruise should file its reports about any accident involving a Cruise vehicle with regulators by having a designated Chief Safety Officer or senior engineer, as well as a regulatory lawyer, within Cruise review and approve the filing of each report&quot;</li> </ul> <h2 id="appendix">Appendix</h2> <ul> <li>The report by Exponent, mentioned above, is included in the Appendix. It is mostly redacted, although there is a lot of interesting non-redacted content, such as &quot;the collision detection system incorrectly identified the pedestrian as being located on the side of the AV at the time of impact instead of in front of the AV and thus determined the collision to be a side impact ... The determination by the ADS that a side collision occurred, and not a frontal collision, led to a less severe collision response being executed and resulted in the AV performing the subsequent outermost lane stop maneuver instead of an emergency stop ... The root cause of the AV’s post-collision movement, after the initial brief stop, was the inaccurate determination by the ADS that a side collision had occurred ... the inaccuracy of the object track considered by the collision detection system and the resulting disparity between this track and the pedestrian’s actual position, the ADS failed to accurately determine the location of the pedestrian at the time of impact and while the pedestrian was underneath the vehicle&quot;</li> </ul> <h2 id="back-to-danluu-com">back to danluu.com</h2> <p>I don't have much to add to this. I certainly have opinions, but I don't work in automotive and haven't dug into it enough to feel informed enough to add my own thoughts. In one discussion I had with a retired exec who used to work on autonomous vehicles, on incident management at Cruise vs. tech companies like Twitter or Slack, the former exec said:</p> <blockquote> <p>You get good at incidents given a steady stream of incidents of varying severity if you have to handle the many small ones. You get terrible at incidents if you can cover up the small ones until a big one happens.
So it's not only funny but natural for internet companies to do it better than AV companies I think</p> </blockquote> <p>On the &quot;minimal risk condition&quot; pullover maneuver, this exec said:</p> <blockquote> <p>These pullover maneuvers are magic pixie dust making AVs safe: if something happens, we'll do a safety pullover maneuver</p> </blockquote> <p>And on <a href="https://web.archive.org/web/20231024184429/https://getcruise.com/news/blog/2023/a-detailed-review-of-the-recent-sf-hit-and-run-incident/">the now-deleted blog post, &quot;A detailed review of the recent SF hit-and-run incident&quot;</a>, the exec said:</p> <blockquote> <p>Their mentioning of regulatory ADAS test cases does not inspire confidence; these tests are shit. But it's a bit unfair on my part since of course they would mention these tests, it doesn't mean they don't have better ones</p> </blockquote> <p>On how regulations and processes make safety-critical industries safer, and what you'd do if you cared about safety vs. the recommendations in the report, this exec said:</p> <blockquote> <p>[Dan,] you care about things being done right. People in these industries care about compliance. Anything &quot;above the state of the art&quot; buys you zero brownie points. eg for [X], any [Y] ATM are not required at all. [We] are better at [X] than most and it does nothing for compliance ... OTOH if a terrible tool or process exists that does nothing good but is considered &quot;the state of the art&quot; / is mandated by a standard, you sure as hell are going to use it</p> </blockquote> <p style="border-width:8px; border-style:solid; border-color:grey; background-color:#F3F3F3; padding: 0.1em;"> <b><a rel="nofollow" href="https://jobs.ashbyhq.com/freshpaint/bfe56523-bff4-4ca3-936b-0ba15fb4e572?utm_source=dl"> If you're looking for work, Freshpaint is hiring a recruiter, Software Engineers, and a Support Engineer</a></b>. I'm an investor, so you should consider my potential bias, but they seem to have found product-market fit and are growing extremely quickly (revenue-wise)</p> <p><i> Thanks to an anonymous former AV exec, Justin Blank, and 5d22b for comments/corrections/discussion.</i></p> <h2 id="appendix-a-physical-hardware-curiosity">Appendix: a physical hardware curiosity</h2> <p>One question I had for the exec mentioned above, which wasn't relevant to this case, but is something I've wondered about for a while, is why the AVs that I see driving don't have upgraded tires and brakes. You can get much shorter stopping distances from cars that aren't super heavy by upgrading their tires and brakes, but the AVs I've seen have not had this done.</p> <p>In this case, we can't do the exact comparison from an upgraded vehicle to the base vehicle because the vehicle dynamics data was redacted from section 3.3.3, table 9, and figure 40 of the appendix, but it's common knowledge that the simplest safety upgrade you can make on a car is upgrading the tires (and, if relevant, the brakes).
One could argue that this isn't worth the extra running cost, or the effort (for the low-performance cars that I tend to see converted into AVs, getting stopping distances equivalent to a sporty vehicle would generally require modifying the wheel well so that wider tires don't rub) but, as an outsider, I'd be curious to know what the cost-benefit trade-off on shorter stopping distances is.</p> <p>They hadn't considered it before, but thought that better tires and brakes would make a difference in a lot of other cases and prevent accidents, and explained the lack of this upgrade by:</p> <blockquote> <p>I think if you have a combination of &quot;we want to base AV on commodity cars&quot; and &quot;I am an algorithms guy&quot; mindset you will not go look at what the car should be.</p> </blockquote> <p>And, to be clear, upgraded tires and brakes would not have changed the outcome in this case. The timeline from the Exponent report has</p> <ul> <li><strong>-2.9s</strong>: contact between Nissan and Pedestrian</li> <li><strong>-2s</strong>: Pedestrian track dropped</li> <li><strong>-1.17s</strong>: Pedestrian begins separating from Nissan</li> <li><strong>-0.81s</strong>: [redacted]</li> <li><strong>-0.78s</strong>: Pedestrian lands in AV's travel lane</li> <li><strong>-0.41s</strong>: Collision checker predicts collision</li> <li><strong>-0.25s</strong>: AV starts sending braking and steering commands (<code>19.1 mph</code>)</li> <li><strong>0s</strong>: collision (<code>18.6 mph</code>)</li> </ul> <p>Looking at actual accelerometer data from a car with upgraded tires and brakes, stopping time from <code>19.1mph</code> for that car was around <code>0.8s</code>, so this wouldn't have made much difference in this case. If brakes aren't pre-charged before attempting to brake, there's significant latency when initially braking, such that <code>0.25s</code> isn't enough for almost any braking to have occurred, which we can see from the speed only being <code>0.5mph</code> slower in this case.</p> <p>Another comment from the exec is that, while a human might react to the collision at <code>-2.9s</code> and slow down or stop, &quot;scene understanding&quot; as a human might do it is non-existent in most or perhaps all AVs, so it's unsurprising that the AV doesn't react until the pedestrian is in the AV's path, whereas a human, if they noticed the accident in the adjacent lane, would likely drastically slow down or stop (the exec guessed that most humans would come to a complete stop, whereas I guessed that most humans would slow down). The exec was also not surprised by the <code>530ms</code> latency between the pedestrian landing in the AV's path and the AV starting to attempt to apply the brakes although, as a layperson, I found <code>530ms</code> surprising.</p> <p>On the advantage of AVs and ADAS, as implemented today, compared to a human who's looking in the right place, paying attention, etc., the exec said:</p> <blockquote> <p>They mainly never get tired or drink and hopefully also run in that terrible driver's car in the next lane. For [current systems], it's reliability and not peak performance that makes it useful. Peak performance is definitely not superhuman but subhuman</p> </blockquote> Why do people post on [bad platform] instead of [good platform]?
why-video/ Thu, 25 Jan 2024 00:00:00 +0000 why-video/ <p>There's a class of comment you often see when someone makes a popular thread on Mastodon/Twitter/Threads/etc., and that you also see on videos, which is basically &quot;Why make a Twitter thread? This would be better as a blog post&quot; or &quot;Why make a video? This would be better as a blog post&quot;. But, these comments are often stronger in form, such as:</p> <blockquote> <p><a href="https://news.ycombinator.com/item?id=31311904">I can't read those tweets that span pages because the users puts 5 words in each reply. I find common internet completely stupid: Twitter, tiktok, Instagram, etc. What a huge waste of energy</a>.</p> </blockquote> <p>or</p> <blockquote> <p><a href="https://news.ycombinator.com/item?id=31312967">When someone chooses to blog on twitter you know it's facile at best, and more likely simply stupid (as in this case)</a></p> </blockquote> <p>These kinds of comments are fairly common, e.g., I pulled up Foone's last 10 Twitter threads that scored 200 points or more on HN and 9 out of 10 had comments like this, complaining about the use of Twitter.</p> <p>People often express bafflement that anyone could have a reason for using [bad platform], such as in &quot;<a href="https://news.ycombinator.com/item?id=33234817">how many tweets are there just to make his point? 200? nobody thinks 'maybe this will be more coherent on a single page'? I don't get social media</a>&quot; or &quot;<a href="https://news.ycombinator.com/item?id=31723037">Come on, typing a short description and uploading a picture 100 times is easier than typing everything in one block and adding a few connectors here and there? ... objectively speaking it is more work</a>&quot;.</p> <p>Personally, I don't really like video as a format and, for 95% of YouTube videos that I see, I'd rather get the information as a blog post than a video (and this will be even more true if Google really cracks down on ad blocking) and I think that, for a reader who's interested in the information, long-form blog posts are basically strictly better than long threads on [bad platform]. But I also recognize that much of the content that I want to read wouldn't exist at all if it wasn't for things like [bad platform].</p> <p>Stepping back and looking at the big picture, there are four main reasons I've seen that people use [bad platform], which are that it gets more engagement, it's where their friends are, it's lower friction, and it monetizes better.</p> <h3 id="engagement">Engagement</h3> <p>The engagement reason is the simplest, so let's look at that first. Just looking at where people spend their time, short-form platforms like Twitter, Instagram, etc., completely dominate longer-form platforms like Medium, Blogspot, etc.; you can see this in the valuations of these companies, in survey data, etc. Substack is the hottest platform for long-form content and its last valuation was ~$600M, basically a rounding error compared to the value of short-form platforms (I'm not including things like Wordpress or Squarespace, which derive a lot of their valuation from things other than articles and posts). The money is following the people and people have mostly moved on from long-form content.
And if you talk to folks using Substack about where their readers and growth come from, the answer is platforms like Twitter, so people doing long-form content who optimize for engagement or revenue will still produce a lot of short-form content<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">1</a></sup>.</p> <h3 id="friends">Friends</h3> <p>The friends reason is probably the next simplest. A lot of people are going to use whatever people around them are using. Realistically, if I were ten years younger and started doing something online in 2023 instead of 2013, more likely than not, I would've tried streaming before I tried blogging. But, as an old, out-of-touch person, I tried starting a blog in 2013 even knowing that blogging was a dying medium relative to video. It seems to have worked well enough for me, so I've stuck with it, but this seems generational. While there are people older than me who do video and people younger than me who write blogs, looking at the distribution of ages, I'm not all that far from the age where people overwhelmingly moved to video and if I were really planning to do something long-term instead of just doing <a href="writing-non-advice/#friction">the lowest friction thing when I started</a>, I would've started with video. Today, doing video is natural for folks who are starting to put their thoughts online.</p> <h3 id="friction">Friction</h3> <p>When [bad platform] is a microblogging platform like Twitter, Mastodon, Threads, etc., the friends reason still often applies — people on these platforms are frequently part of a community they interact with, and it makes more sense for them to keep their content on the platform full of community members than to put content elsewhere. But the bigger reason for people whose content is widely read is that a lot of people find these platforms are much lower friction than writing blog posts. When people point this out, [bad platform] haters are often baffled, responding with things like</p> <blockquote> <p>Come on, typing a short description and uploading a picture 100 times is easier than typing everything in one block and adding a few connectors here and there? ... objectively speaking it is more work</p> </blockquote> <p>For one thing, most widely read programmer/tech bloggers that I'm in touch with use platforms that are actually higher friction (e.g., <a href="https://twitter.com/danluu/status/1204576631890698240">Jekyll</a> <a href="https://twitter.com/danluu/status/1244050309648797697">friction</a> and <a href="https://mastodon.social/@danluu/109542037575670321">Hugo</a> <a href="https://twitter.com/danluu/status/1244050309648797697">fric</a><a href="https://commaok.xyz/post/on_hugo/">tion</a>). But, in principle, they could use Substack, hosted Wordpress, or another platform that this commenter considers &quot;objectively&quot; lower friction; the complaint fundamentally misunderstands where the friction comes from. When people talk about [bad platform] being lower friction, it's usually about the emotional barriers to writing and publishing something, not the literal number of clicks it takes to publish something.
We can argue about whether or not this is rational, whether this &quot;objectively&quot; makes sense, etc., but at the end of the day, it is simply true that many people find it mentally easier to write on a platform where you write short chunks of text instead of a single large chunk of text.</p> <p>I <a href="https://mastodon.social/@danluu/">sometimes write things on Mastodon</a> because it feels like the right platform for some kinds of content for me. Of course, since the issue is not the number of clicks it takes and there's some underlying emotional motivation, other people have different reasons. For example, <a href="https://nitter.net/Foone/status/1066548390517854208">Foone says</a>:</p> <blockquote> <p>Not to humblebrag or anything, but my favorite part of getting posted on hackernews or reddit is that EVERY SINGLE TIME there's one highly-ranked reply that's &quot;jesus man, this could have been a blog post! why make 20 tweets when you can make one blog post?&quot;</p> <p>CAUSE I CAN'T MAKE A BLOG POST, GOD DAMN IT. I have ADHD. I have bad ADHD that is being treated, and the treatment is NOT WORKING TERRIBLY WELL. I cannot focus on writing blog posts. it will not happen</p> <p>if I try to make a blog post, it'll end up being abandoned and unfinished, as I am unable to edit it into something readable and postable. so if I went 100% to blogs: You would get: no content I would get: lots of unfinished drafts and a feeling of being a useless waste</p> <p>but I can do rambly tweet threads. they don't require a lot of attention for a long time, they don't have the endless editing I get into with blog posts, I can do them. I do them a bunch! They're just rambly and twitter, which some people don't like</p> </blockquote> <p>The issue Foone is referring to isn't even uncommon — three of my favorite bloggers have mentioned that they can really only write things in one sitting, so either they have enough momentum to write an entire blog post or they don't. There's a difference in scale between only being able to get yourself to write a tweet at a time and only being able to write what you can fit into a single writing session, but these are differences in degree, not differences in kind.</p> <h3 id="revenue">Revenue</h3> <p>And whatever the reason someone has for finding [bad platform] lower friction than [good platform], allowing people to use a platform that works for them means we get more content. When it comes to video, the same thing also applies because <a href="https://twitter.com/danluu/status/1588318512258650112">video monetizes so much better than text</a> and there's a lot of content that monetizes well on video that probably wouldn't monetize well in text.</p> <p>To pick an arbitrary example, automotive content is one of these areas. If you're buying a car and you want detailed, practical reviews about a car, as well as comparisons to other cars you might consider if you're looking at a particular car, before YouTube, AFAIK, no one was doing anything close to the depth of what <a href="https://www.youtube.com/@AAutoBuyersGuide">Alex Dykes does on Alex on Autos</a>. If you open up a car magazine from the heyday of car magazines, something like Car and Driver or Road and Track from 1997, there's nothing that goes into even 1/10th of the depth that Alex does and this is still true today of modern car magazines.
The same goes for quite a few sub-categories of automotive content as well, such as <a href="https://www.youtube.com/@tyrereviews">Jonathan Benson's work on Tyre Reviews</a>. Before Jonathan, no one was testing tires with the same breadth and depth and writing it up (engineers at tire companies did this kind of testing and much more, but you had to talk to them directly to get the info)<sup class="footnote-ref" id="fnref:V"><a rel="footnote" href="#fn:V">2</a></sup>. You can find similar patterns in a lot of areas outside of automotive content as well. While this depends on the area, in many cases, the content wouldn't exist if it weren't for video. Not only do people, in general, have more willingness to watch videos than to read text, but video also monetizes much better than text does, which allows people to make providing in-depth information their job in a way that wouldn't be possible in text. In some areas, you can make good money with a paywalled newsletter, but this is essentially what car magazines are and they were never able to support anything resembling what Alex Dykes does, nor does it seem plausible that you could support something like what Jonathan Benson does on YouTube.</p> <p>Or, to pick an example from the tech world, shortly after Lucy Wang created her YouTube channel, <a href="https://www.youtube.com/@TechwithLucy">Tech With Lucy</a>, when she had 50k subscribers and her typical videos had thousands to tens of thousands of views with the occasional video with a hundred thousand views, <a href="https://twitter.com/danluu/status/1589775359243091969">she noted that she was making more than she did working for AWS (with most of the money presumably coming in from sponsorships)</a>. By comparison, my blog posts all get well over a million hits and I definitely don't make anywhere near what Lucy made at AWS; instead, my blog barely covers my rent. It's possible to monetize some text decently well if you put most of it behind a paywall, e.g., <a href="https://blog.pragmaticengineer.com/">Gergely Orosz does this with his newsletter</a>, but if you want to have mostly or exclusively freely available content, video generally dominates text.</p> <h3 id="non-conclusion">Non-conclusion</h3> <p>While I would prefer that most content that I see on YouTube/Twitter/Threads/Mastodon/etc. were hosted on a text blog, the reality is that most of that content wouldn't exist at all if it had to be written up as long-form text instead of as chunked-up short-form text or video. Maybe in a few years, summary tools will get good enough that I can consume the translations but, today, all the tools I've tried often get key details badly wrong, so we just have to live with the content in the form it's created in.</p> <p><b><a rel="nofollow" href="https://jobs.ashbyhq.com/freshpaint/bfe56523-bff4-4ca3-936b-0ba15fb4e572?utm_source=dl">If you're looking for work, Freshpaint is hiring a recruiter, Software Engineers, and a Support Engineer.
I'm an investor in the company, so you should take this with the usual grain of salt, but if you're looking to join a fast-growing early-stage startup, they seem to have found product-market fit and have been growing extremely quickly (revenue-wise).</a></b></p> <p><i>Thanks to Heath Borders, Peter Bhat Harkins, James Young, Sophia Wisdom, and David Kok for comments/corrections/discussion.</i></p> <h3 id="appendix-elsewhere">Appendix: Elsewhere</h3> <ul> <li><a href="https://www.ftrain.com/wwic">Paul Ford's WWIC (Why Wasn't I Consulted)</a> is a more general version of this post</li> </ul> <p>Here's a comment from David Kok, from a discussion about a rant by an 80-year-old bridge player about why bridge is declining, where the 80-year-old claimed that the main reason is that IQ has declined and young people (as in, people who are 60 and below) are too stupid to play intellectual games like bridge; many other bridge players concurred:</p> <blockquote> <p>Rather than some wrong but meaningful statement about age groups I always just interpret statements like &quot;IQ has gone down&quot; as &quot;I am unhappy and have difficulty expressing that&quot; and everybody else going &quot;Yes so am I&quot; when they concur.</p> </blockquote> <p>If you adapt David Kok's comment to complaints about why something isn't a blog post, that's a meta reason that the reasons I gave in this post are irrelevant (to some people) — these reasons only matter to people who care about the reasons; if someone is just venting their feelings and the reasons they're giving are an expression of their feelings and not meant to be legitimate reasons, the reasons someone might not write a blog post are irrelevant.</p> <p>Anyway, the topic of why post there instead of here is a common enough topic that I'm sure other people have written things about it that I'd be interested in reading. Please feel free to <a href="https://mastodon.social/@danluu/">forward other articles you see on the topic to me</a>.</p> <h3 id="appendix-hn-comments-on-foone-s-last-10-twitter-threads">Appendix: HN comments on Foone's last 10 Twitter threads.</h3> <p>I looked up Foone's last 10 Twitter threads that made it to HN with 200+ points, and 9 out of 10 have complaints about why Foone used Twitter and how it would be better as a blog post. [This is not including comments of the form &quot;For those who hate Twitter threads as much as I do: <a href="https://threadreaderapp.com/thread/1014267515696922624.html">https://threadreaderapp.com/thread/1014267515696922624.html</a>&quot;, of which there are more than comments like the ones below, which have a complaint but also have some potentially useful content, like a link to another version of the thread.]</p> <h4 id="never-trust-a-system-that-seems-to-be-working-https-news-ycombinator-com-item-id-33233720"><a href="https://news.ycombinator.com/item?id=33233720">Never trust a system that seems to be working</a></h4> <p>One of the first comments was a complaint that it was on Twitter, which was followed not too long after by</p> <blockquote> <p>how many tweets are there just to make his point? 200? nobody thinks &quot;maybe this will be more coherent on a single page&quot;?
I don't get social media</p> </blockquote> <h4 id="someday-aliens-will-land-and-all-will-be-fine-until-we-explain-our-calendar-https-news-ycombinator-com-item-id-32975173"><a href="https://news.ycombinator.com/item?id=32975173">Someday aliens will land and all will be fine until we explain our calendar</a></h4> <blockquote> <p>This would be better written in a short story format but I digress.</p> <p>shit like this is too good and entertaining to be on twitter [one of the few positive comments complaining about this]</p> <p>This person hates it so much whenever there is a link to their content on this site, they go on huge massive rants about it with threads spamming as much as the OP, it's hilarious.</p> </blockquote> <h4 id="you-want-to-know-something-about-how-bullshit-insane-our-brains-are-https-news-ycombinator-com-item-id-32303786"><a href="https://news.ycombinator.com/item?id=32303786">You want to know something about how bullshit insane our brains are?</a></h4> <blockquote> <p>They'll tolerate reading it on twitter?</p> <p>Serious question : why do publishers break down their blog posts into umpteen tweeted microblogs? Do the engagement web algorithms give preference to the number of tweets in a thread? I see this is becoming more of a trend</p> <p>This is a very interesting submission. But, boy, is Twitter's character limit poisonous.</p> <p>IMO Foone's web presence is toxic. Rather than write a cogent article posted on their blog and then summarize a pointer to that post in a single tweet, they did the opposite writing dozens of tweets as a thread and then summarizing those tweets in a blog post. This is not a web trend I would like to encourage but alas it is catching on.</p> <p>Oh, I don't care how the author writes it, or whether there's a graph relationship below (or anything else). It's just that Twitter makes the experience of reading content like that a real chore.</p> </blockquote> <h4 id="reverse-engineering-skifree-https-news-ycombinator-com-item-id-31718756"><a href="https://news.ycombinator.com/item?id=31718756">Reverse engineering Skifree</a></h4> <blockquote> <p>This should have been a blog or a livestream.</p> <p>Even in this format?</p> <p><rant> I genuinely don't get it. It's a pain in the ass for them to publish it like that and it's a pain in the ass for us to read it like that. I hope Musk takes over Twitter and runs it the ground so we can get actual blog posts back. </rant></p> </blockquote> <p>Someone points out that Foone has noted that they find writing long-form stuff impossible and can write in short-form media, to which the response is the following:</p> <blockquote> <p>Come on, typing a short description and uploading a picture 100 times is easier than typing everything in one block and adding a few connectors here and there?</p> <p>Obviously that's their prerogative and they can do whatever they want but objectively speaking it is more work and I sincerely hope the trend will die.</p> </blockquote> <h4 id="everything-with-a-battery-should-have-an-off-switch-https-news-ycombinator-com-item-id-31287112"><a href="https://news.ycombinator.com/item?id=31287112">Everything with a battery should have an off switch</a></h4> <blockquote> <p>You forgot, foone isn't going to change from streams of Twitter posts to long form blogging. [actually a meta comment on how people always complain about this and not a complaint, I think]</p> <p>I can't read those tweets that span pages because the users puts 5 words in each reply. 
I find common internet completely stupid: Twitter, tiktok, Instagram, etc. What a huge waste of energy.</p> <p>He clearly knows [posting long threads on Twitter] is a problem, he should fix it.</p> </blockquote> <p>Someone points out that Foone has said that they're unable to write long-form blog posts, to which the person replies:</p> <blockquote> <p>You can append to a blog post as you go the same way you can append to a Twitter feed. It's functionally the same, the medium just isn't a threaded hierarchy. There's no reason it has to be posted fully formed as he declares.</p> <p>My own blog posts often have 10+ revisions after I've posted them.</p> <p>It doesn't work well for thousands of people, which is why there are always complaints ... When something is suboptimal, you're well within your rights to complain about it. Posting long rants as Twitter threads is suboptimal for the consumers of said threads</p> <p>I kind of appreciate the signal: When someone chooses to blog on twitter you know it's facile at best, and more likely simply stupid (as in this case)</p> </blockquote> <h4 id="there-s-an-arm-cortex-m4-with-bluetooth-inside-a-covid-test-kit-https-news-ycombinator-com-item-id-29698887"><a href="https://news.ycombinator.com/item?id=29698887">There's an ARM Cortex-M4 with Bluetooth inside a Covid test kit</a></h4> <p>Amazingly, no complaint that I could see, although one comment was edited to be &quot;.&quot;</p> <h4 id="taking-apart-the-2010-fisher-price-re-released-music-box-record-player-https-news-ycombinator-com-item-id-29048819"><a href="https://news.ycombinator.com/item?id=29048819">Taking apart the 2010 Fisher Price re-released Music Box Record Player</a></h4> <blockquote> <p>why is this a twitter thread? why not a blog?</p> </blockquote> <p>Followed by</p> <blockquote> <p>I love that absolutely no one got the joke ... Foone is a sociopath who doesn't feel certain words should be used to refer to Foone because they don't like them. In fact no one should talk about Foone ever.</p> </blockquote> <h4 id="while-posting-to-tumblr-e-and-w-keys-just-stopped-working-https-news-ycombinator-com-item-id-28601684"><a href="https://news.ycombinator.com/item?id=28601684">While posting to Tumblr, E and W keys just stopped working</a></h4> <blockquote> <p>Just hotkey detection gone wrong. Not that big of a surprise because implementing hotkeys on a website is a complete minefield. I don't think you can conclude that Tumblr is badly written from this. Badly tested maybe.</p> </blockquote> <p>Because that comment reads like nonsense to anyone who read the link, someone asks &quot;did you read the whole thread?&quot;, to which the commenter responds:</p> <blockquote> <p>No because Twitter makes it completely unreadable.</p> </blockquote> <h4 id="my-mouse-driver-is-asking-for-a-firewall-exemption-https-news-ycombinator-com-item-id-28274305"><a href="https://news.ycombinator.com/item?id=28274305">My mouse driver is asking for a firewall exemption</a></h4> <blockquote> <p>Can we have twitter banned from being posted here? On all UI clicks, a nagging window comes up. 
You can click it away, but it <em>reverts your click</em>, so any kind of navigation becomes really cumbersome.</p> <p>or twitter urls being replaced with some twitter2readable converter</p> </blockquote> <h4 id="duke-nukem-3d-mirror-universe-https-news-ycombinator-com-item-id-26513877"><a href="https://news.ycombinator.com/item?id=26513877">Duke Nukem 3D Mirror Universe</a></h4> <blockquote> <p>This is remarkable, but Twitter is such an awful medium for this kind of text. I wish this was posted on a normal platform so I could easily share it.</p> <p>If this were a blog post instead of a pile of tweets, we wouldn't have to expand multiple replies to see all of the content</p> <p>Uh why isn't this a blog, or a youtube video? <em>specifically to annoy foone</em></p> <p>Yes, long form Twitter is THE WORST. However foone is awesome, so maybe they cancel each other out?</p> <p>I hate twitter. It's slowly ruining the internet.</p> </blockquote> <h4 id="non-foone-posts">Non-foone posts</h4> <p>Of course this kind of thing isn't unique to Foone. For example, on the last Twitter thread I saw on HN, two of the first five comments were:</p> <blockquote> <p>Has this guy got a blog?</p> </blockquote> <p>and</p> <blockquote> <p>That's kind of why the answer to &quot;posting something to X&quot; should be &quot;just say no&quot;. It's impossible to say anything there that is subtle in the slightest or that requires background to understand but unfortunately people who are under the spell of X just can't begin to see something they do the way somebody else might see it.</p> </blockquote> <p>I just pulled up Foone's threads because I know that they tend to post to short-form platforms and looking at 10 Foone threads is more interesting than looking at 10 random threads.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:O"><p>Of course, almost no one optimizes for revenue because most people don't make money off of the content they put out on the internet. And I suspect only a tiny fraction of people are consciously optimizing for engagement, but <a href="https://www.patreon.com/posts/25835707">just like we saw</a> <a href="https://twitter.com/danluu/status/1594798257485795328">with prestige</a>, there seems to be a lot of nonconscious optimization for engagement. A place where you can see this within a platform is (and I've looked at hundreds of examples of this) when people start using a platform like Mastodon or Threads. They'll post a lot of different kinds of things. Most things won't get a lot of traction and a few will. They could continue posting the same things, but they'll often, instead, post less low-engagement content over time and more high-engagement content over time. Platforms have a variety of ways of trying to make it rewarding when other people engage with your content and, on average, this seems to work on people. This is an intra-platform and not an inter-platform example, but if this works on people, it seems like the inter-platform reasoning should hold as well.</p> <p>Personally, I'm not optimizing for engagement or revenue, but I've been paying my rent from <a href="https://patreon.com/danluu">Patreon earnings</a>, so it would probably make sense to do so. But, at least at the moment, looking into what interests me feels like a higher priority even if that's sort of a revenue- and engagement-minimizing move.
For example, <a href="https://mastodon.social/@danluu/111729303313181364">wc has the source of my last post at 20k words, which means that doing two passes of writing over the post might've been something like 7h40m</a>. To compare that to short-form content: a while back, I did an experiment where I tried tweeting daily for a few months, which increased my Twitter followers by ~50% (from ~20k to ~30k). The Twitter experiment probably took about as much time as typing up my last post (which doesn't include the time spent doing the work for the last post, which involved, among other things, reading five books and 15 or so papers about tire and vehicle dynamics), so from an engagement or revenue standpoint, posting to short-form platforms totally dominates the kind of writing I'm doing and anyone who cares almost at all about engagement or revenue would do the short-form posting instead of long-form writing that takes time to create. As for me, right now, I have two drafts I'm in the middle of which are more like my last post. <a href="diseconomies-scale/">For one draft, the two major things I need to finish up are writing up a summary of ~500 articles/comments for an appendix and reading a 400-page book I want to quote a few things from</a>, and <a href="why-video/">for the other, I need to finish writing up notes for ~350 pages of FTC memos</a>. Each of these drafts will turn into a blog post that's long enough that it could be a standalone book. In terms of the revenue this drives to <a href="https://www.patreon.com/danluu">my Patreon</a>, I'd be lucky if I make minimum wage from doing this, <a href="https://www.patreon.com/posts/60185075">not even including the time spent on things I research but don't publish because the result is uninteresting</a>. But I'm also a total weirdo. On average, people are going to produce content that gets eyeballs, so of course a lot more people are going to create more hastily written long [bad platform] threads than blog posts.</p> <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:V"><p>For German-language content, there was one magazine that was doing work that's not as thorough in some ways, but semi-decently close, though no one was translating that into English. Jonathan Benson not only does unprecedented-for-English reviews of tires, he also translates the German reviews into English!</p> <p>On the broader topic, unfortunately, despite video making <a href="why-benchmark/">more benchmarking</a> financially viable, there's still plenty of stuff where there's no good way to figure out what's better other than by talking to people who work in the industry, such as for <a href="https://mastodon.social/@danluu/109667084437206655">ADAS systems</a>, where the public testing is cursory at best.</p> <a class="footnote-return" href="#fnref:V"><sup>[return]</sup></a></li> </ol> </div> How bad are search results? Let's compare Google, Bing, Marginalia, Kagi, Mwmbl, and ChatGPT seo-spam/ Sat, 30 Dec 2023 00:00:00 +0000 seo-spam/ <p>In <a href="https://xeiaso.net/blog/birth-death-seo/">The birth &amp; death of search engine optimization</a>, Xe suggests</p> <blockquote> <p>Here's a fun experiment to try. Take an open source project such as <code>yt-dlp</code> and try to find it from a very generic term like &quot;youtube downloader&quot;. You won't be able to find it because of all of the content farms that try to rank at the top for that term.
Even though <code>yt-dlp</code> is probably actually what you want for a tool to download video from YouTube.</p> </blockquote> <p>More generally, most tech folks I'm connected to seem to think that Google search results are significantly worse than they were ten years ago (<a href="https://mastodon.social/@danluu/111506788692079608">Mastodon poll</a>, <a href="https://twitter.com/danluu/status/1730705885037801686">Twitter poll</a>, <a href="https://www.threads.net/@danluu.danluu/post/C0UnBr9vEeY">Threads poll</a>). However, there's a sizable group of vocal folks who claim that search results are still great. E.g., a bluesky thought leader who gets high engagement says:</p> <blockquote> <p>i think the rending of garments about how even google search is terrible now is pretty overblown<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">1</a></sup></p> </blockquote> <p>I suspect what's going on here is that some people have gotten so used to working around bad software that they don't even know they're doing it, reflexively doing the modern equivalent of <a href="https://twitter.com/danluu/status/1525988886119186432">hitting ctrl+s all the time in editors, or ctrl+a; ctrl+c when composing anything in a text box</a>. Every adept user of the modern web has a bag of tricks they use to get decent results from queries. From having watched quite a few users interact with computers, that doesn't appear to be normal, even among people who are quite competent in various technical fields, e.g., mechanical engineering<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">2</a></sup>. However, it could be that people who are complaining about bad search result quality are <a href="https://mastodon.social/@danluu/111666646509250420">just hopping on the &quot;everything sucks&quot; bandwagon and making totally unsubstantiated comments</a> about search quality.</p> <p>Since it's fairly easy to try out straightforward, naive, queries, let's try some queries. We'll look at three kinds of queries with five search engines plus ChatGPT and we'll turn off our ad blocker to get <a href="https://twitter.com/danluu/status/1564292487744978946">the non-expert browsing experience</a>. I once had a computer get owned from browsing to a website with a shady ad, so I hope that doesn't happen here (in that case, I was lucky that I could tell that it happened because the malware was doing so much stuff to my computer that it was impossible to not notice).</p> <p>One kind of query is a selected set of representative queries a friend of mine used to set up her new computer. My friend is a highly competent engineer outside of tech and wanted help learning &quot;how to use computers&quot;, so I watched her try to set up a computer and pointed out holes in her mental model of how to interact with websites and software<sup class="footnote-ref" id="fnref:N"><a rel="footnote" href="#fn:N">3</a></sup>.</p> <p>The second kind of query is queries for the kinds of things I wanted to know in high school where I couldn't find the answer because everyone I asked (teachers, etc.) gave me obviously incorrect answers and I didn't know how to find the right answer.
I was able to get the right answer from various textbooks once I got to college and had access to university libraries, but the questions are simple enough that there's no particular reason a high school student shouldn't be able to understand the answers; it's just an issue of finding the answer, so we'll take a look at how easy these answers are to find. The third kind of query is a local query for information I happened to want to get as I was writing this post.</p> <p>In grading the queries, there's going to be some subjectivity here because, for example, it's not objectively clear if it's better to have moderately relevant results with no scams or very <a href="https://mastodon.social/@danluu/109967416372133687">relevant results mixed interspersed with scams that try to install badware or trick you into giving up your credit card info to pay for something you shouldn't pay for</a>. For the purposes of this post, I'm considering scams to be fairly bad, so in that specific example, I'd rate the moderately relevant results above the very relevant results that have scams mixed in. As with my <a href="car-safety/">other</a> <a href="web-bloat/">posts</a> <a href="futurist-predictions/">that</a> have some kind of subjective ranking, there's both a short summary as well as a detailed description of results, so you can rank services yourself, if you like.</p> <p>In the table below, each column is a query and each row is a search engine or ChatGPT. Results are rated (from worst to best) Terrible, Very Bad, Bad, Ok, Good, and Great, with worse results being more red and better results being more blue.</p> <p>The queries are:</p> <ul> <li>download youtube videos</li> <li>ad blocker</li> <li>download firefox</li> <li>Why do wider tires have better grip?</li> <li>Why do they keep making cpu transistors smaller?</li> <li>vancouver snow forecast winter 2023</li> </ul> <p><style>table {border-collapse:collapse;margin:1px 10px;}table,th,td {text-align:center;}</style> <table style=float:left> <tr> <th></th><th>YouTube</th><th>Adblock</th><th>Firefox</th><th>Tire</th><th>CPU</th><th>Snow</th></tr> <tr> <th>Marginalia</th><td bgcolor=#d1e5f0><font color=black>Ok</font></td><td bgcolor=#4393c3><font color=black>Good</font></td><td bgcolor=#d1e5f0><font color=black>Ok</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td></tr> <tr> <th>ChatGPT</th><td bgcolor=#d6604d><font color=black>V. Bad</font></td><td bgcolor=#2166ac><font color=white>Great</font></td><td bgcolor=#4393c3><font color=black>Good</font></td><td bgcolor=#d6604d><font color=black>V. Bad</font></td><td bgcolor=#d6604d><font color=black>V. Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td></tr> <tr> <th>Mwmbl</th><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td></tr> <tr> <th>Kagi</th><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#d6604d><font color=black>V. 
Bad</font></td><td bgcolor=#2166ac><font color=white>Great</font></td><td bgcolor=#b2182b><font color=white>Terrible</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#b2182b><font color=white>Terrible</font></td></tr> <tr> <th>Google</th><td bgcolor=#b2182b><font color=white>Terrible</font></td><td bgcolor=#d6604d><font color=black>V. Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#fddbc7><font color=black>Bad</font></td><td bgcolor=#b2182b><font color=white>Terrible</font></td></tr> <tr> <th>Bing</th><td bgcolor=#b2182b><font color=white>Terrible</font></td><td bgcolor=#b2182b><font color=white>Terrible</font></td><td bgcolor=#2166ac><font color=white>Great</font></td><td bgcolor=#b2182b><font color=white>Terrible</font></td><td bgcolor=#d1e5f0><font color=black>Ok</font></td><td bgcolor=#b2182b><font color=white>Terrible</font></td></tr> <tr> </tr> </table></p> <p>Marginalia does relatively well by sometimes providing decent but not great answers and then providing no answers or very obviously irrelevant answers to the questions it can't answer, with a relatively low rate of scams, lower than any other search engine (although, for these queries, ChatGPT returns zero scams and Marginalia returns some).</p> <p>Interestingly, Mwmbl lets users directly edit search result rankings. I did this for one query, which would score &quot;Great&quot; if it was scored after my edit, but <a href="car-safety/">it's easy to do well on a benchmark when you optimize specifically for the benchmark</a>, so Mwmbl's scores are without my edits to the ranking criteria.</p> <p>One thing I found interesting about the Google results was that, in addition to Google's noted propensity to return recent results, there was a strong propensity to return recent youtube videos. This caused us to get videos that seem quite useless for anybody, except perhaps the maker of the video, who appears to be attempting to get ad revenue from the video. For example, when searching for &quot;ad blocker&quot;, one of the youtube results was a video where the person rambles for 93 seconds about how you should use an ad blocker and then googles &quot;ad blocker extension&quot;. They then click on the first result and incorrectly say that &quot;it's officially from Google&quot;, i.e., the ad blocker is either made by Google or has some kind of official Google seal of approval, because it's the first result. They then ramble for another 40 seconds as they install the ad blocker. After it's installed, they incorrectly state &quot;this is basically one of the most effective ad blocker [sic] on Google Chrome&quot;. The video has 14k views. For reference, Steve Yegge spent a year making high-effort videos and his most viewed video has 8k views, with a typical view count below 2k. This person who's gaming the algorithm by making low quality videos on topics they know nothing about, who's part of the cottage industry of people making videos taking advantage of Google's algorithm prioritizing recent content regardless of quality, is dominating Steve Yegge's videos because they've found search terms that you can rank for if you put anything up. 
We'll discuss other Google quirks in more detail below.</p> <p>ChatGPT does its usual thing and impressively outperforms its more traditional competitors in one case, does an ok job in another case, refuses to really answer the question in another case, and &quot;hallucinates&quot; nonsense for a number of queries (as usual for ChatGPT, random perturbations can significantly change the results<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">4</a></sup>). It's common to criticize ChatGPT for its hallucinations and, while I don't think that's unfair, <a href="customer-service/">as we noted in this 2015, pre-LLM post on AI</a>, I find this general class of criticism to be overrated in that humans and traditional computer systems make the exact same mistakes.</p> <p>In this case, search engines return various kinds of hallucinated results. In the snow forecast example, we got deliberately fabricated results, one intended to drive ad revenue through shady ads on a fake forecast site, and another intended to trick the user into thinking that the forecast indicates a cold, snowy, winter (the opposite of the actual forecast), seemingly in order to get the user to sign up for unnecessary snow removal services. Other deliberately fabricated results include a site that's intended to look like an objective review site that's actually a fake site designed to funnel you into installing a specific ad blocker, where the ad blocker they funnel you to appears to be a scammy one that tries to get you to pay for ad blocking and doesn't let you unsubscribe, a fake &quot;organic&quot; blog post trying to get you to install a chrome extension that exposes all of your shopping to some service (in many cases, it's not possible to tell if a blog post is a fake or shill post, but in this case, they hosted the fake blog post on the domain for the product and, although it's designed to look like there's an entire blog on the topic, there isn't — it's just this one fake blog post), etc.</p> <p>There were also many results which don't appear to be deliberately fraudulent and are just run-of-the-mill SEO garbage designed to farm ad clicks. These seem to mostly be pre-LLM sites, so they don't read quite like ChatGPT hallucinations, but they're not fundamentally different. Sometimes the goal of these sites is to get users to click on ads that actually scam the user, and sometimes the goal appears to be to generate clicks to non-scam ads. Search engines also returned many seemingly non-deliberate human hallucinations, where people confidently stated incorrect answers in places where user content is highlighted, like quora, reddit, and stack exchange.</p> <p>On these queries, even ignoring anything that looks like LLM-generated text, I'd rate the major search engines (Google and Bing) as somewhat worse than ChatGPT in terms of returning various kinds of hallucinated or hallucination-adjacent results. While I don't think concerns about LLM hallucinations are illegitimate, <a href="https://twitter.com/danluu/status/1557970249948884992">the traditional ecosystem has the problem that the system highly incentivizes putting whatever is most profitable for the software supply chain in front of the user</a> which is, in general, quite different from the best result.</p> <p>For example, if your app store allows &quot;you might also like&quot; recommendations, the most valuable ad slot for apps about gambling addiction management will be gambling apps. 
Allowing gambling ads on an addiction management app is too blatantly user-hostile for any company to deliberately allow today, but of course companies that make gambling apps will try to game the system to break through the filtering and <a href="https://twitter.com/danluu/status/1588289840629833729">they sometimes succeed</a>. And for web search, I just tried this again on the web and one of the two major search engines returned, as a top result, ad-laden SEO blogspam for addiction management. At the top of the page is a multi-part ad, with the top two links being &quot;GAMES THAT PAY REAL MONEY&quot; and &quot;GAMES THAT PAY REAL CASH&quot;. In general, I was getting localized results (lots of .ca domains since I'm in Canada), so you may get somewhat different results if you try this yourself.</p> <p>Similarly, if the best result is a good, free, ad blocker like ublock origin, the top ad slot is worth a lot more to a company that makes an ad blocker designed to trick you into paying for a lower quality ad blocker with a nearly-uncancellable subscription, so the scam ad blocker is going to outbid the free ad blocker for the top ad slots. These kinds of companies also have a lot more resources to spend on direct SEO, as well as indirect SEO activities like marketing so, unless search engines mount a more effective effort to combat the profit motive, the top results will go to paid ad blockers even though the paid ad blockers are generally significantly worse for users than free ad blockers. If you talk to people who work on ranking, a lot of the biggest ranking signals are derived from clicks and engagement, but <a href="nothing-works/">this will only drive users to the best results when users are sophisticated enough to know what the best results are, which they generally aren't</a>. Human raters also rate page quality, but this has the exact same problem.</p> <p>Many Google employees have told me that ads are actually good because they inform the user about options the user wouldn't have otherwise known about, but anyone who tries browsing without an ad blocker will see ads that are misleading in various ways, ads that try to trick or entrap the user by, for example, pretending to be a window, or advertising &quot;GAMES THAT PAY REAL CASH&quot; at the top of a page on battling gambling addiction, which has managed to SEO itself to a high ranking on gambling addiction searches. In principle, these problems could be mitigated with enough resources, but we can observe that trillion dollar companies have chosen not to invest enough resources in combating SEO, spam, etc., to make these kinds of scam ads rare. Instead, a number of top results are actually ads that direct you to scams.</p> <p>In their original PageRank paper, Sergey Brin and Larry Page noted that ad-based search is inherently not incentive aligned with providing good results:</p> <blockquote> <p>Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is &quot;The Effect of Cellular Phone Use Upon Driver Attention&quot;, a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving.
This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page, 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 83], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the Consumers.</p> <p>Since it is very difficult even for experts to evaluate search engines, search engine bias is particularly insidious. A good example was OpenText, which was reported to be selling companies the right to be listed at the top of the search results for particular queries [Marchiori 97]. This type of bias is much more insidious than advertising, because it is not clear who &quot;deserves&quot; to be there, and who is willing to pay money to be listed. This business model resulted in an uproar, and OpenText has ceased to be a viable search engine. But less blatant bias are likely to be tolerated by the market. ... This type of bias is very difficult to detect but could still have a significant effect on the market. Furthermore, advertising income often provides an incentive to provide poor quality search results. For example, we noticed a major search engine would not return a large airline’s homepage when the airline’s name was given as a query. It so happened that the airline had placed an expensive ad, linked to the query that was its name. A better search engine would not have required this ad, and possibly resulted in the loss of the revenue from the airline to the search engine. In general, it could be argued from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want. This of course erodes the advertising supported business model of the existing search engines ... we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.</p> </blockquote> <p>Of course, Google is now dominated by ads and, despite specifically calling out the insidiousness of users conflating real results with paid results, <a href="https://twitter.com/danluu/status/1557970259142791168">both Google and Bing have made ads look more and more like real search results</a>, to the point that most users usually won't know that they're clicking on ads and not real search results. By the way, this propensity for users to think that everything is an &quot;organic&quot; search result is the reason that, in this post, results are ordered by the order they appear on the page, so if four ads appear above the first organic result, the four ads will be rank 1-4 and the organic result will be ranked 5. I've heard Google employees say that AMP didn't impact search ranking because it &quot;only&quot; controlled what results went into the &quot;carousel&quot; that appeared above search results, as if inserting a carousel and then a bunch of ads above results, pushing results down below the fold, has no impact on how the user interacts with results.
It's also common to see search engines ransoming the top slot for companies, so that companies that don't buy the ad for their own name end up with searches for that company putting their competitors at the top, which is also said to not impact search result ranking, a technically correct claim that's basically meaningless to the median user.</p> <p>When I tried running the query from the paper, &quot;cellular phone&quot; (no quotes), the top result was a Google Store link to buy Google's own Pixel 7, with the rest of the top results being various Android phones sold on Amazon. That's followed by the Wikipedia page for Mobile Phone, and then a series of commercial results all trying to sell you phones or SEO-spam trying to get you to click on ads or buy phones via their links (the next 7 results were commercial, with the next result after that being an ad-laden SEO blogspam page for the definition of a cell phone with ads of cell phones on it, followed by 3 more commercial results, followed by another ad-laden definition of a phone). The commercial links seem very low quality, e.g., the top link below the carousel after wikipedia is Best Buy's Canadian mobile phone page. The first two products there are ad slots for eufy's version of the AirTag. The next result is for a monthly financed iPhone that's tied to Rogers, the next for a monthly financed Samsung phone that's tied to TELUS, then we have Samsung's AirTag, a monthly financed iPhone tied to Freedom Mobile, a monthly financed iPhone tied to Freedom Mobile in a different color, a monthly financed iPhone tied to Rogers, a screen protector for the iPhone 13, another Samsung AirTag product, an unlocked iPhone 12, a Samsung wall charger, etc.; it's an extremely low quality result with products that people shouldn't be buying (and, based on the number of reviews, aren't buying — the modal number of reviews of the top products is 0 and the median is 1 or 2 even though there are plenty of things people do actually buy from Best Buy Canada and plenty of products that have lots of reviews). The other commercial results that show up are also generally extremely low quality results. The result that Sergey and Larry suggested was a great top result, &quot;The Effect of Cellular Phone Use Upon Driver Attention&quot;, is nowhere to be seen, buried beneath an avalanche of commercial results. On the other side of things, Google has also gotten into the action by buying ads that trick users, <a href="https://twitter.com/danluu/status/887724695558205440">such as paying for an installer to try to trick users into installing Chrome over Firefox</a>.</p> <p>Anyway, after looking at the results of our test queries, some questions that come to mind are:</p> <ul> <li>How is Marginalia, a search engine built by a single person, so good?</li> <li>Can Marginalia or another small search engine displace Google for mainstream users?</li> <li>Can a collection of small search engines provide better results than Google?</li> <li>Will Mwmbl's user-curation approach work?</li> <li>Would a search engine like 1996-Metacrawler, which aggregates results from multiple search engines, ChatGPT, Bard, etc., significantly outperform Google?</li> </ul> <p>The first question could easily be its own post and this post is already 17000 words, so maybe we'll examine it another time.
We've previously noted that <a href="people-matter/">some</a> <a href="why-benchmark/">individuals</a> <a href="https://twitter.com/danluu/status/1586508706774388736">can</a> be very productive, but of course the details vary in each case.</p> <p>On the second question, <a href="sounds-easy/">we looked at a similar question in 2016</a>, both the general version, &quot;I could reproduce this billion dollar company in a weekend&quot;, as well as specific comments about how open source software would make it trivial to surpass Google any day now, such as</p> <blockquote> <p>Nowadays, most any technology you need is indeed available in OSS and in state of the art. Allow me to plug meta64.com (my own company) as an example. I am using Lucene to index large numbers of news articles, and provide search into them, by searching a Lucene index generated by simple scraping of RSS-crawled content. I would claim that the Lucene technology is near optimal, and this search approach I'm using is nearly identical to what a Google would need to employ. The only true technology advantage Google has is in the sheer number of servers they can put online, which is prohibitively expensive for us small guys. But from a software standpoint, Google will be overtaken by technologies like mine over the next 10 years I predict.</p> </blockquote> <p>and</p> <blockquote> <p>Scaling things is always a challenge but as long as Lucene keeps getting better and better there is going to be a point where Google's advantage becomes irrelevant and we can cluster Lucene nodes and distribute search related computations on top and then use something like Hadoop to implement our own open source ranking algorithms. We're not there yet but technology only gets better over time and the choices we as developers make also matter. Even though Amazon and Google look like unbeatable giants now don't discount what incremental improvements can accomplish over a long stretch of time and in technology it's not even that long a stretch. It wasn't very long ago when Windows was the reigning champion. Where is Windows now?</p> </blockquote> <p>In that 2016 post, we saw that people who thought that open source solutions were set to surpass Google any day now appeared to have no idea how many hard problems must be solved to make a mainstream competitor to Google, including real-time indexing of rapidly-updated sites, like Twitter, newspapers, etc., as well as table-stakes level NLP, which is extremely non-trivial. Since 2016, these problems have gotten significantly harder as there's more real-time content to index and users expect much better NLP. The number of things people expect out of their search engine has increased as well, making the problem harder still, so it still appears to be quite difficult to displace Google as a mainstream search engine for, say, a billion users.</p> <p>On the other hand, if you want to make a useful search engine for a small number of users, that seems easier than ever because Google returns worse results than it used to for many queries. In our test queries, we saw a number of queries where many or most top results were filled with SEO garbage, a problem that was significantly worse than it was a decade ago, even before the rise of LLMs and that continues to get worse. 
I typically use search engines in a way that doesn't run into this, but when I look at what &quot;normal&quot; users query or if I try naive queries myself, as I did in this post, most results are quite poor, which didn't use to be true.</p> <p>Another place Google now falls over for me is when finding non-popular pages. I often find that, when I want to find a web page and I correctly remember the contents of the page, even if I do an exact string search, Google won't return the page. Either the page isn't indexed, or the page is effectively not indexed because it lives in some slow corner of the index that doesn't return in time. In order to find the page, I have to remember some text in a page that links to the page (often many clicks removed from the actual page, not just one, so I'm really remembering a page that links to a page that links to a page that links to a page that links to a page and then using archive.org to traverse the links that are now dead), search for that, and then manually navigate the link graph to get to the page. This basically never happened when I searched for something in 2005 and rarely happened in 2015, but this now happens a large fraction of the time I'm looking for something. Even in 2015, Google wasn't actually comprehensive. Just for example, Google search didn't index every tweet. But, at the time, I found Google search better at searching for tweets than Twitter search and I basically never ran across a tweet I wanted to find that wasn't indexed by Google. But now, most of the tweets I want to find aren't returned by Google search<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">5</a></sup>, even when I search for &quot;[exact string from tweet] site:twitter.com&quot;. In the original PageRank paper, Sergey and Larry said &quot;Because humans can only type or speak a finite amount, and as computers continue improving, text indexing will scale even better than it does now.&quot; (and that, while machines can generate an effectively infinite amount of content, just indexing human-generated content seems very useful). Pre-LLM, Google certainly had the resources to index every tweet as well as every human generated utterance on every public website, but they seem to have chosen to devote their resources elsewhere and, relative to its size, the public web appears less indexed than ever, or at least less indexed than it's been since the very early days of web search.</p> <p>Back when Google returned decent results for simple queries and indexed almost any public page I'd want to find, it would've been very difficult for an independent search engine to return results that I find better than Google's. Marginalia in 2016 would've been nothing more than a curiosity for me since Google would give good-enough results for basically anything where Marginalia returns decent results, and Google would give me the correct result in queries for every obscure page I searched for, something that would be extremely difficult for a small engine.
But now that Google effectively doesn't index many pages I want to search for, the relatively small indices that independent search engines have don't make them non-starters for me and some of them return less SEO garbage than Google, making them better for my use since I generally don't care about real-time results, don't need fancy NLP (and find that much of it actually makes search results worse for me), don't need shopping integrated into my search results, rarely need image search with understanding of images, etc.</p> <p>On the question of whether or not a collection of small search engines can provide better results than Google for a lot of users, I don't think this is much of a question because the answer has been a resounding &quot;yes&quot; for years. However, many people don't believe this is so. For example, a Google TLM replied to the bluesky thought leader at the top of this post with</p> <blockquote> <p>Somebody tried argue that if the search space were more competitive, with lots of little providers instead of like three big ones, then somehow it would be *more* resistant to ML-based SEO abuse.</p> <p>And... look, if *google* can't currently keep up with it, how will Little Mr. 5% Market Share do it?</p> </blockquote> <p>presumably referring to arguments like <a href="https://buttondown.email/hillelwayne/archive/algorithm-monocultures/">Hillel Wayne's &quot;Algorithm Monocultures&quot;</a>, to which our bluesky thought leader replied</p> <blockquote> <p>like 95% of the time, when someone claims that some small, independent company can do something hard better than the market leader can, it’s just cope. economies of scale work pretty well!</p> </blockquote> <p>In the past, <a href="nothing-works/">we looked at some examples where the market leader provides a poor product and various other players, often tiny, provide better products</a> and in a future post, we'll look at how economies of scale and diseconomies of scale interact in various areas for tech but, for this post, suffice it to say that, despite the common &quot;econ 101&quot; <a href="cocktail-ideas/">cocktail party idea</a> that economies of scale should be the dominant factor in search quality, that doesn't appear to be the case when we look at actual results.</p> <p>On the question of whether or not Mwmbl's user-curated results can work, I would guess no, or at least not without a lot more moderation. Just browsing to Mwmbl shows the last edit to ranking was by user &quot;betest&quot;, who added some kind of blogspam as the top entry for &quot;RSS&quot;. It appears to be possible to revert the change, but there's no easily findable way to report the change or the user as spammy.</p> <p>On the question of whether or not something like Metacrawler, which aggregated results from multiple search engines, would produce superior results today, that's arguably irrelevant since it would either be impossible to legally run as a commercial service or require prohibitive licensing fees, but it seems plausible that, from a technical standpoint, a modern metacrawler would be fairly good today.
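To make the aggregation idea concrete, here's a minimal sketch of the merge step such a metacrawler could use, reciprocal rank fusion over each engine's ranked list (this isn't a description of how Metacrawler actually worked, and the engine names, URLs, and result lists below are made-up placeholders; a real system would fetch live results): <pre><code>from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    # Merge several ranked lists of URLs into one list. Each URL's score is
    # the sum of 1 / (k + rank) over every list it appears in; k=60 is the
    # constant commonly used with reciprocal rank fusion.
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Placeholder result lists for an "ad blocker" query (illustrative only).
engine_a = ["https://ublockorigin.com", "https://seo-farm.example", "https://ghostery.com"]
engine_b = ["https://scam-blocker.example", "https://ublockorigin.com"]
engine_c = ["https://ublockorigin.com", "https://ghostery.com"]

print(reciprocal_rank_fusion([engine_a, engine_b, engine_c]))
# A result that several engines rank highly floats to the top, which is the
# basic appeal of aggregation.
</code></pre>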
Metacrawler quickly became irrelevant because Google returned significantly better results than you would get by aggregating results from other search engines, but it doesn't seem like that's the case today.</p> <p>Going back to the debate between folks like Xe, who believe that straightforward search queries are inundated with crap, and our thought leader, who believes that &quot;the rending of garments about how even google search is terrible now is pretty overblown&quot;, it appears that Xe is correct. Although Google doesn't publicly provide the ability to see what was historically returned for queries, many people remember when straightforward queries generally returned good results. One of the reasons Google took off so quickly in the 90s, even among expert users of AltaVista, who'd become very adept at adding all sorts of qualifiers to queries to get good results, was that you didn't have to do that with Google. But we've now come full circle and we need to add qualifiers, restrict our search to specific sites, etc., to get good results from Google on what used to be simple queries. If anything, we've gone well past full circle since the contortions we need to get good results are a lot more involved than they were in the AltaVista days.</p> <p><i><a rel="nofollow" href="https://jobs.ashbyhq.com/freshpaint/bfe56523-bff4-4ca3-936b-0ba15fb4e572?utm_source=dl">If you're looking for work, Freshpaint is hiring a recruiter, Software Engineers, and a Support Engineer. I'm an investor in the company, so you should take this with the usual grain of salt, but if you're looking to join a fast growing early-stage startup, they seem to have found product-market fit and have been growing extremely quickly (revenue-wise).</a></i></p> <p><i>Thanks to Laurence Tratt, Heath Borders, Justin Blank, Brian Swetland, Viktor Lofgren (who, BTW, I didn't know before writing this post — I only reached out to him to discuss the Marginalia search results after running the queries), Misha Yagudin, @hpincket@fosstodon.org, Jeremy Kun, and Yossi Kreinin for comments/corrections/discussion</i></p> <h2 id="appendix-other-search-engines">Appendix: Other search engines</h2> <ul> <li>DuckDuckGo: in the past, when I've compared DDG to Bing while using an ad blocker, the results have been very similar. I also tried DDG here and, removing the Bing ads, the results aren't as similar as they used to be, but they were still similar enough that it didn't seem worth listing DDG results. I use DDG as my default search engine and I think, like Google, it works fine if you know how to query but, for the kinds of naive queries in this post, it doesn't fare particularly well.</li> <li>wiby.me: Like Marginalia, this is another search engine made for finding relatively obscure results. I tried four of the above queries on wiby and the results were interesting, in that they were really different from what I got from any other search engine, but wiby didn't return relevant results for the queries I tried.</li> <li>searchmysite.net: Somewhat relevant results for some queries, but not as relevant as Marginalia. Many fewer scams and ad-laden pages than Google, Bing, and Kagi.</li> <li>indieweb-search.jamesg.blog: seemed to be having an outage. &quot;Your request could not be processed due to a server error.&quot; for every query.</li> <li>Teclis: The search box is still there, but any query results in &quot;Teclis.com is closed due to bot abuse.
Teclis results are still available through Kagi's search results, explicitly through the 'Non-commercial Web' lens and also as an API.&quot;. A note on the front page reads &quot;Teclis results are disabled on the site due to insane amount of bot traffic (99.9% traffic were bots).&quot;</li> </ul> <h2 id="appendix-queries-that-return-good-results">Appendix: queries that return good results</h2> <p>I think that most programmers are likely to be able to get good results to every query, except perhaps the tire width vs. grip query, so here's how I found an ok answer to the tire query:</p> <p>I tried a youtube search, since <a href="https://mastodon.social/@danluu/111441790762754806">a lot of the best car-related content is now on youtube</a>. A youtube video whose title claims to answer the question (the video doesn't actually answer the question) has a comment recommending Carroll Smith's book &quot;Tune To Win&quot;. The comment claims that chapter 1 explains why wider tires have more grip, but I couldn't find an explanation anywhere in the book. Chapter 1 does note that race cars typically run wider tires than passenger cars and that passenger cars are moving towards having wider tires and it makes some comments about slip angle that give a sketch of an intuitive reason for why you'd end up with better cornering with a wider contact patch, but I couldn't find a comment that explains differences in braking. Also, the book notes that the primary reason for the wider contact patch is that it (indirectly) allows for less heat buildup, which then lets you design tires that operate over a narrower temperature range, which allows for softer rubber. That may be true, but it doesn't explain much of the observed behavior one might wonder about.</p> <p>Tune to Win recommends Kummer's The Unified Theory of Tire and Rubber Friction and Hays and Brooke's (actually Browne, but Smith incorrectly says Brooke) The Physics of Tire Traction. Neither of these really explained what's happening either, but looking for similar books turned up <a href="https://mastodon.social/@danluu/111572194243940891">Milliken and Milliken's Race Car Vehicle Dynamics</a>, which also didn't really explain why but seemed closer to having an explanation. Looking for books similar to Race Car Vehicle Dynamics turned up Guiggiani's The Science of Vehicle Dynamics, which did get at how to think about and model a number of related factors. The last chapter of Guiggiani's book refers to something called the &quot;brush model&quot; (of tires) and searching for &quot;brush model tire width&quot; turned up a reference to Pacejka's Tire and Vehicle Dynamics, which does start to explain why wider tires have better grip and what kind of modeling of tire and vehicle dynamics you need to do to explain easily observed tire behavior.</p> <p>As we've noted, people have different tricks for getting good results so, if you have a better way of getting a good result here, I'd be interested in hearing about it. But note that, basically every time I have a post that notes that something doesn't work, the most common suggestion will be to do something that's commonly suggested that doesn't work, even though the post explicitly notes that the commonly suggested thing doesn't work.
For example, the most common comment I receive about <a href="file-consistency/">this post on filesystem correctness</a> is that you can get around all of this stuff by doing the rename trick, even though the post explicitly notes that this doesn't work, explains why it doesn't work, and references a paper which discusses why it doesn't work. A few years later, <a href="deconstruct-files/">I gave an expanded talk on the subject</a>, where I noted that people kept suggesting this thing that doesn't work and the most common comment I get on the talk is that you don't need to bother with all of this stuff because you can just do the rename trick (and no, ext4 having <code>auto_da_alloc</code> doesn't mean that this works since you can only do it if you check that you're on a compatible filesystem which automatically replaces the incorrect code with correct code, at which point it's simpler to just write the correct code). If you have a suggestion for the reason wider tires have better grip or for a search which turns up an explanation, please consider making sure that the explanation is not <a href="#why-do-wider-tires-have-better-grip">one of the standard incorrect explanations noted in this post and that the explanation can account for all of the behavior that one must be able to account for if one is explaining this phenomenon</a>.</p> <p>On how to get good results for other queries, since this post is already 17000 words, I'll leave that for a future post on how expert vs. non-expert computer users interact with computers.</p> <h2 id="appendix-summary-of-query-results">Appendix: summary of query results</h2> <p>For each question, answers are ordered from best to worst, with the metric being my subjective impression of how good the result is. These queries were mostly run in November 2023, although a couple were run in mid-December. When I'm running queries, I very rarely write natural language queries myself. However, normal users often write natural language queries, so I arbitrarily did the &quot;Tire&quot; and &quot;Snow&quot; queries as natural queries. Continuing with the theme of running simple, naive, queries, we used the free version of ChatGPT for this post, which means the queries were run through ChatGPT 3.5. Ideally, we'd run the full matrix of queries using keyword and natural language queries for each query, run a lot more queries, etc., but this post is already 17000 words (converting to pages of a standard length book, that would be something like 70 pages), so running the full matrix of queries with a few more queries would pretty quickly turn this into a book-length post. For work and for certain kinds of data analysis, I'll sometimes do projects that are that comprehensive or more comprehensive, but here, we can't cover anything resembling a comprehensive set of queries and the best we can do is to just try a handful of queries that seem representative and use our judgment to decide if this matches the kind of behavior we and other people generally see, so I don't think it's worth doing something like 4x the work to cover marginally more ground.</p> <p>For the search engines, all queries were run in a fresh incognito window with cleared cookies, with the exception of Kagi, which doesn't allow logged-out searches. 
For Kagi, the queries were done with a fresh account with no custom personalization or filters, although they were done in sequence with the same account, so it's possible some kind of personalized ranking was applied to the later queries based on the clicks in the earlier queries. These queries were done in Vancouver, BC, which seems to have applied some kind of localized ranking on some search engines.</p> <ul> <li>download youtube videos <ul> <li>Ideally, the top hit would be <code>yt-dlp</code> or a thin, graphical, wrapper around <code>yt-dlp</code>. Links to <code>youtube-dl</code> or other less frequently updated projects would also be ok.</li> <li>Great results (<code>yt-dlp</code> as a top hit, maybe with <code>youtube-dl</code> in there somewhere, and no scams): none</li> <li>Good results (<code>youtube-dl</code> as a top hit, maybe with <code>yt-dlp</code> in there somewhere, and no scams): none</li> <li>Ok results (<code>youtube-dl</code> as a top hit, maybe with <code>yt-dlp</code> in there somewhere, and fewer scams than other search engines): <ul> <li>Marginalia: Top link is for <code>youtube-dl</code>. Most links aren't relevant. Many fewer scams than the big search engines</li> </ul></li> <li>Bad results (has some useful links, but also links to a lot of scams): <ul> <li>Mwmbl: Some links to bad sites and scams, but fewer than the big search engines. Also has one indirect link to <code>youtube-dl</code> in the top 10 and one for a GUI for <code>youtube-dl</code></li> <li>Kagi: Mostly links to scammy sites but does have, a couple pages down, a web.archive.org link to the 2010 version of <code>youtube-dl</code></li> </ul></li> <li>Very bad results (fails to return any kind of useful result): <ul> <li>ChatGPT: basically refuses to answer the question, although you can probably prompt engineer your way to an answer if you don't just naively ask the question you want answered</li> </ul></li> <li>Terrible results (fails to return any kind of useful result and is full of scams): <ul> <li>Google: Mostly links to sites that try to scam you or charge you for a worse version of free software. Some links to ad-laden listicles which don't have good suggestions. Zero links to good results. Also links to various youtube videos that are the youtube equivalent of blogspam.</li> <li>Bing: Mostly links to sites that try to scam you or charge you for a worse version of free software. Some links to ad-laden listicles which don't have good suggestions. Arguably zero links to good results (although one could make a case that result #10 is an ok result despite seeming to be malware).</li> </ul></li> </ul></li> <li>ad blocker <ul> <li>Ideally, the top link would be to ublock origin. Failing that, having any link to ublock origin would be good</li> <li>Great results (ublock origin is top result, no scams): <ul> <li>ChatGPT: First suggestion is ublock origin</li> </ul></li> <li>Good results (ublock origin is high up, but not the top result; results above ublock origin are either obviously not ad blockers or basically work without payment even if they're not as good as ublock origin; no links that directly try to scam you): none</li> <li>Ok results (ublock origin is in there somewhere, not many scams, and fewer than other search engines): <ul> <li>Marginalia: 3rd and 4th results get you to ublock origin and 8th result is ublock origin.
Nothing that appears to try to scam you directly and &quot;only&quot; one link to some kind of SEO ad farm scam (which is much better than the major search engines)</li> </ul></li> <li>Bad results (no links to ublock origin and mostly links to things that paywall good features or ad blockers that deliberately let ads through by default): <ul> <li>Mwmbl: Lots of irrelevant links and some links to ghostery. One scam link, so fewer scams than commercial search engines</li> </ul></li> <li>Very bad results (exclusively or almost exclusively link to ad blockers that paywall good features or, by default, deliberately let through ads): <ul> <li>Google: lots of links to ad blockers that &quot;participate in the Acceptable Ads program, where publishers agree to ensure their ads meet certain criteria&quot; (this isn't mentioned in the text, but if you look into it, the main revenue source for companies that do this is advertisers paying the &quot;ad blocker&quot; company to not block their ads, making the &quot;ad blocker&quot; not only not an ad blocker, but very much not incentive aligned with users). Some links to things that appear to be scams. Zero links to ublock origin. Also links to various youtube videos that are the youtube equivalent of blogspam.</li> <li>Kagi: similar to Google, but with more scams, though fewer than Bing</li> </ul></li> <li>Terrible results (exclusively or almost exclusively link to ad blockers that paywall good features or, by default, deliberately let through ads and has a significant number of scams): <ul> <li>Bing: similar to Google, but with more scams and without youtube videospam</li> </ul></li> </ul></li> <li>download Firefox <ul> <li>Ideally, we'd get links to download firefox with no fake or scam links</li> <li>Great results (links to download firefox; no scams): <ul> <li>Bing: links to download Firefox</li> <li>Mwmbl: links to download firefox</li> <li>Kagi: links to download firefox</li> </ul></li> <li>Good: <ul> <li>ChatGPT: this is a bit funny to categorize, since these are technically incorrect instructions, but a human should easily be able to decode the instructions and download firefox</li> </ul></li> <li>Ok results (some kind of indirect links to download firefox; no scams): <ul> <li>Marginalia: indirect links to Firefox download instructions that get you to a firefox download</li> </ul></li> <li>Bad results (links to download firefox, with scams): <ul> <li>Google: top links are all legitimate, but the #7 result is a scam that tries to get you to install badware and the #10 result is an ad that appears to be some kind of scam that wants your credit card info.</li> </ul></li> </ul></li> <li>Why do wider tires have better grip?
<ul> <li>Ideally, would link to an explanation that clearly explains why and doesn't have an incomplete explanation that can't explain a lot of commonly observed behavior</li> <li>Great / Good / Ok results: none</li> <li>Bad results (no results or a very small number of obviously incorrect results): <ul> <li>Mwmbl: one obviously incorrect result and no other results</li> <li>Marginalia: two obviously incorrect results and no other results</li> </ul></li> <li>Very bad results: (a very small number of semi-plausible incorrect results) <ul> <li>ChatGPT: standard ChatGPT &quot;hallucination&quot; that's probably plausible to a lot of people (it sounds like a lot of incorrect internet comments on the topic, but better written)</li> </ul></li> <li>Terrible results (lots of semi-plausible incorrect results, often on ad farms): <ul> <li>Google / Bing / Kagi: incorrect ad-laden results with the usual rate of scammy ads</li> </ul></li> </ul></li> <li>Why do they keep making cpu transistors smaller? <ul> <li>Ideally, would link to an explanation that clearly explains why. The best explanations I've seen are in <a href="https://en.wikipedia.org/wiki/Very_Large_Scale_Integration">VLSI</a> textbooks, but I've also seen very good explanations in lecture notes and slides</li> <li>Great results (links to a very good explanation, no scams): none</li> <li>Good results (links to an ok explanation, no scams): none</li> <li>Ok results (links to something you can then search on further and get a good explanation if you're good at searching and doesn't rank bad or misleading explanations above the ok explanation): <ul> <li>Bing: top set of links had a partial answer that could easily be turned into links to correct answers via more searching. Also had a lot of irrelevant answers and ad-laden SEO'd garbage</li> </ul></li> <li>Bad results (no results or a small number of obviously irrelevant results or lots of semi-plausible wrong results with an ok result somewhere): <ul> <li>Marginalia: no answers</li> <li>Mwmbl: one obviously irrelevant answer</li> <li>Google: 5th link has the right keywords to maybe find the right answer with further searches. Most links have misleading or incorrect partial answers. Lots of links to Quora, which don't answer the question. Also lots of links to other bad SEO'd answers</li> <li>Kagi: 10th link has a fairly direct path to getting the correct answer, if you scroll down far enough on the 10th link. Other links aren't good.</li> </ul></li> <li>Very bad results: <ul> <li>ChatGPT: doesn't really answer the question. 
Asking ChatGPT to explain its answers further causes it to &quot;hallucinate&quot; incorrect reasons.</li> </ul></li> </ul></li> <li>vancouver snow forecast winter 2023 <ul> <li>I'm not sure what the ideal answer is, but a pretty good one would be a link to Environment Canada's snow forecast, predicting significantly below normal snow (and above normal temperatures)</li> <li>Great results (links to Environment Canada winter 2023 multi-month snow forecast as top result or something equivalently good): none</li> <li>Good results: none</li> <li>Ok results (links to some kind of semi-plausible winter snow forecast that isn't just made-up garbage to drive ad clicks): none</li> <li>Bad results (no results or obviously irrelevant results): <ul> <li>Marginalia: no results</li> <li>ChatGPT: incorrect results, but when I accidentally prepended my question with &quot;User\n&quot;, then it returned a link to the right website (but in a way that would make it quite difficult to navigate to a decent result), so perhaps a slightly different prompt would pseudo-randomly cause an ok result here?</li> <li>Mwmbl: a bunch of obviously irrelevant results</li> </ul></li> <li>Very bad results: none</li> <li>Terrible results (links to deliberately faked forecast results): <ul> <li>Bing: mostly irrelevant results. The top seemingly-relevant result is the 5th link, but it appears to be some kind of scam site that fabricates fake weather forecasts and makes money by serving ads on the heavily SEO'd site</li> <li>Kagi: top 4 results are from the scam forecast site that's Bing's 5th link</li> <li>Google: mostly irrelevant results and the #1 result is a fake answer from a local snow removal company that projects significant snow and cold weather in an attempt to get you to unnecessarily buy snow removal service for the year. Other results are SEO'd garbage that's full of ads <br /></li> </ul></li> </ul></li> </ul> <h2 id="appendix-detailed-query-results">Appendix: detailed query results</h2> <h3 id="download-youtube-videos">Download youtube videos</h3> <p>For our first query, we'll search &quot;download youtube videos&quot; (Xe's suggested search term, &quot;youtube downloader&quot;, returns very similar results). The ideal result is <code>yt-dlp</code> or a thin, free, wrapper around <code>yt-dlp</code>. <code>yt-dlp</code> is a fork of <code>youtube-dlc</code>, which is a now defunct fork of <code>youtube-dl</code>, which seems to have very few updates nowadays. A link to one of these older downloaders also seems ok if they still work.</p> <h4 id="google">Google</h4> <ol> <li>Some youtube downloader site. Has lots of assurances that the website and the tool are safe because they've been checked by &quot;Norton SafeWeb&quot;. Interacting with the site at all prompts you to install a browser extension and enable notifications. Trying to download any video gives you a full page pop-over for extension installation for something called CyberShield. There appears to be no way to dismiss the popover without clicking on something to try to install it. After going through the links but then choosing not to install CyberShield, no video downloads. Googling &quot;cybershield chrome extension&quot; returns a knowledge card with &quot;Cyber Shield is a browser extension that claims to be a popup blocker but instead displays advertisements in the browser.
When installed, this extension will open new tabs in the browser that display advertisements trying to sell software, push fake software updates, and tech support scams.&quot;, so CyberShield appears to be badware.</li> <li>Some youtube downloader site. Interacting with the site causes a pop-up prompting you to download their browser extension. Putting a video URL in causes a pop-up to some scam site but does also cause the video to download, so it seems to be possible to download youtube videos here if you're careful not to engage with the scams the site tries to trick you into interacting with</li> <li>PC Magazine listicle on ways to download videos from youtube. Top recommendations are paying for youtube downloads, VLC (which they note didn't work when they tried it), some $15/yr software, some $26/yr software, &quot;FlixGrab&quot;, then a warning about how the downloader websites are often scammy and they don't recommend any downloader website. The article has more than one ad per suggestion.</li> <li>Some youtube downloader site with shady pop-overs that try to trick you into clicking on ads before you even interact with the page</li> <li>Some youtube downloader site with pop-ups that try to trick you into clicking on scam ads</li> <li>Some youtube downloader site with pop-ups that try to trick you into clicking on scam ads, e.g., &quot;Samantha 24, vancouver | I want sex, write to WhatsApp | Close / Continue&quot;. Clicking anything (any button, or anywhere else on the site) tries to get you to install something called &quot;Adblock Ultimate&quot;</li> <li>ZDNet listicle. First suggestion is ClipGrab, which apparently bundles a bunch of malware/adware/junkware with the installer: <a href="https://www.reddit.com/r/software/comments/w9o1by/warning_about_clipgrab/">https://www.reddit.com/r/software/comments/w9o1by/warning_about_clipgrab/</a>. The listicle is full of ads and has an autoplay video</li> <li>[YouTube video] Over 2 minutes of ads followed by a video on how to buy youtube premium (2M views on video)</li> <li>[YouTube video] Video that starts off by asking users to watch the whole video (some monetization thing?). The video tries to funnel you to some kind of software to download videos that costs money</li> <li>[YouTube video] PC Magazine video saying that you probably don't &quot;have to&quot; download videos since you can use the share button, and then suggests reading their story (the one in result #3) on how to download videos</li> <li>Some youtube downloader site with scam ads. Interacting with the site at all tries to get you to install &quot;Adblock Ultimate&quot;</li> <li>Some youtube downloader site with pop-ups that try to trick you into clicking on scam ads</li> <li>Some youtube downloader site with scam ads</li> </ol> <p>Out of 10 &quot;normal&quot; results, we have 9 that, in one way or another, try to get you to install badware or are linked to some other kind of ad scam. One page doesn't do this, but it also doesn't suggest the good, free, option for downloading youtube videos and instead suggests a number of paid solutions. We also had three youtube videos, all of which seem to be the video equivalent of SEO blogspam. Interestingly, we didn't get a lot of ads from Google itself <a href="https://twitter.com/danluu/status/823623691246239744">despite that happening the last time I tried turning off my ad blocker to do some Google test queries</a>.</p> <h4 id="bing">Bing</h4> <ol> <li>Some youtube downloader site.
This is google (2), which has ads for scam sites</li> <li>[EXPLORE FURTHER ... &quot;Recommended to you based on what's popular&quot;] Some youtube download site, not one we saw from google. Site has multiple pulsing ads and bills itself as &quot;50% off&quot; for Christmas (this search was done in mid-November). Trying to download any video pulls up a fake progress bar with a &quot;too slow? Try [our program] link&quot;. After a while, a link to download the video appears, but it's a trick, and when you click it, it tries to install &quot;oWebster Search extension&quot;. Googling &quot;oWebster Search extension&quot; indicates that it's badware that hijacks your browser to show ads. Two of the top three hits are how to install the extension and the rest of the top hits are how to remove this badware. Many of the removal links are themselves scams that install other badware. After not installing this badware, clicking the download link again results in a pop-over that tries to get you to install the site's software. If you dismiss the pop-over and click the download link again, you just get the pop-over link again, so this site appears to be a pure scam that doesn't let you download videos</li> <li>[EXPLORE FURTHER]. Interacting with the site pops up fake ads with photos of attractive women who allegedly want to chat with you. Clicking the video download button tries to get you to install a <a href="https://github.com/gorhill/uBlock/issues/3027#issuecomment-330077439">copycat ad blocker</a> that <a href="https://www.reddit.com/r/assholedesign/comments/h8pdar/this_ad_blocker_that_gives_you_ads/">displays extra pop-over ads</a>. The site does seem to actually give you a video download, though</li> <li>[EXPLORE FURTHER] Same as (3)</li> <li>[EXPLORE FURTHER] Same as Google (1) (that NortonSafeWeb youtube downloader site that tries to scam you)</li> <li>[EXPLORE FURTHER] A site that converts videos to MP4. I didn't check to see if the site works or is just a scam as the site doesn't even claim to let you download youtube videos</li> <li>Google (1), again. That NortonSafeWeb youtube downloader site that tries to scam you.</li> <li>[EXPLORE FURTHER] A link to youtube.com (the main page)</li> <li>[EXPLORE FURTHER] Some youtube downloader site with a popover that tries to trick you into clicking on an ad. Closing that reveals 12 more ads. There's a scam ad that's made to look like a youtube downloader button. If you scroll past that, there's a text box and a button for trying to download a youtube video. Entering a valid URL results in an error saying there's no video that URL.</li> <li>Gigantic card that actually has a download button. The download button is fake and just takes you to the site. The site loudly proclaims that the software is not adware, spyware, etc.. Quite a few internet commenters note that their antivirus software tags this software as malware. A lot of comments also indicate that the software doesn't work very well but sometimes works. The site for the software has a an embedded youtube video, which displays &quot;This video has been removed for violating YouTube's Terms of Service&quot;. Oddly, the download links for mac and Linux are not for this software and in fact don't download anything at all and are installation instructions for <code>youtube-dl</code>; perhaps this makes sense if the windows version is actually malware. The windows download button takes you to a page that lets you download a windows executable. 
There's also a link to some kind of ad-laden page that tries to trick you into clicking on ads that look like normal buttons</li> <li>PC magazine listicle</li> <li>An ad for some youtube downloader program that claims &quot;345,764,132 downloads today&quot;; searching the name of this product on reddit seems to indicate that it's malware</li> <li>Ad for some kind of paid downloader software</li> </ol> <p>That's the end of the first page.</p> <p>Like Google, no good results and a lot of scams and software that may not be a scam but is some kind of lightweight skin around an open source project that charges you instead of letting you use the software for free.</p> <h4 id="marginalia">Marginalia</h4> <ol> <li>12-year old answer suggesting youtube-dl, which links to a URL which has been taken down and replaced with &quot;Due to a ruling of the Hamburg Regional Court, access to this website is blocked.&quot;</li> <li>Some SEO'd article, like you see on normal search engines</li> <li>Leawo YouTube Downloader (I don't know what this is, but a quick search at least doesn't make it immediately obvious that this is some kind of badware, unlike the Google and Bing results)</li> <li>Some SEO'd listicle, like you see on normal search engines</li> <li>Bug report for some random software</li> <li>Some random blogger's recommendation for &quot;4K Video Downloader&quot;. A quick search seems to indicate that this isn't a scam or badware, but it does lock some features behind a paywall, and is therefore worse than <code>yt-dlp</code> or some free wrapper around <code>yt-dlp</code></li> <li>A blog post on how to install and use <code>yt-dlp</code>. The blogpost notes that it used to be about <code>youtube-dl</code>, but has been updated to <code>yt-dlp</code>.</li> <li>More software that charges you for something you can get for free, although searching for this software on reddit turns up cracks for it</li> <li>A listicle with bizarrely outdated recommendations, like RealPlayer. The entire blog seems to be full of garbage-quality listicles.</li> <li>A script to download youtube videos for something called &quot;keyboard maestro&quot;, which seems useful if you already use that software, but seems like a poor solution to this problem if you don't already use this software.</li> </ol> <p>The best results by a large margin. The first link doesn't work, but you can easily get to <code>youtube-dl</code> from the first link. I certainly wouldn't try Leawo YouTube Downloader, but at least it's not so scammy that searching for the name of the project mostly returns results about how the project is some kind of badware or a scam, which is better than we got from Google or Bing. And we do get a recommendation with <code>yt-dlp</code>, with instructions in the results that's just a blog post from someone who wants to help people who are trying to download youtube videos.</p> <h4 id="kagi">Kagi</h4> <ul style="list-style: none;"> <li>1. That NortonSafeWeb youtube downloader site. Interacting with the site at all prompts you to install a browser extension and enable notifications. Trying to download any video gives you a full page pop-over for extension installation for something called CyberShield. There appears to be no way to dismiss the popover without clicking on something to try to install it</li> <li>2. Another link to that NortonSafeWeb youtube downloader site. 
For some reason, this one is tagged with "Dec 20, 2003", apparently indicating that the site is from Dec 20th 2003, although that's quite wrong.</li> <li>3. Some youtube downloader site. Selecting any video to download pushes you to a site with scam ads.</li> <li>4. Some youtube downloader site. Interacting with the site at all pops up multiple ads that link to scams and the page wants to enable notifications. A pop-up then appears on top of the ads that says "Ad removed" with a link for details. This is a scam link to another ad.</li> <li>5. Another link to the above site</li> <li>6-7. Under a subsection titled "Interesting Finds", there are links to two github repos. One is for transcribing youtube videos to text and the other is for using Google Takeout to backup photos from google photos or your own youtube channel</li> <li>8. Some youtube downloader site.</li> <li>9-13. Under a subsection titled "Blast from the Past", 4 irrelevant links and a link to youtube-dl's github page, but the 2010 version at archive.org</li> <li>14. SEO blogspam for youtube help. Has a link that's allegedly for a "Greasemonkey script for downloading YouTube videos", but the link just goes to a page with scammy ads</li> <li>15. Some software that charges you $5/mo to download videos from youtube</li> </ul> <h4 id="mwmbl">Mwmbl</h4> <ol> <li>Some youtube video downloader site, but one that no other search engine returned. There's a huge ad panel that displays &quot;503 NA - Service Deprecating&quot;. The download link does nothing except for pop up some other ad panes that then disappear, leaving just the 503 &quot;ad&quot;.</li> <li>$20 software for downloading youtube videos</li> <li>2016 blog post on how to install and use <code>youtube-dl</code>. Sidebar has two low quality ads which don't appear to be scams and the main body has two ads interspersed, making this extremely low on ads compared to analogous results we've seen from large search engines</li> <li>Some youtube video download site. Has a giant banner claiming that it's &quot;the only YouTube Downloader that is 100% ad-free and contains no popups.&quot;, which is probably not true, but the site does seem to be ad free and not have pop-ups. Download link seems to actually work.</li> <li>Youtube video on how to install and use <code>youtube-dlg</code> (a GUI wrapper for <code>youtube-dl</code>) on Linux (this query was run from a Mac).</li> <li>Link to what was a 2007 blogpost on how to download youtube videos, which automatically forwards to a 2020 ad-laden SEO blogspam listicle with bad suggestions. Article has two autoplay videos. Archive.org shows that the 2007 blog post had some reasonable options in it for the time, so this wasn't always a bad result.</li> <li>A blog post on a major site that's actually a sponsored post trying to get you to a particular video downloader. Searching for comments on this on reddit indicate that users view the app as a waste of money that doesn't work. The site is also full of scammy and misleading ads for other products. E.g., I tried clicking on an ad that purports to save you money on &quot;products&quot;. It loaded a fake &quot;checking your computer&quot; animation that supposedly checked my computer for compatibility with the extension and then another fake checking animation, after which I got a message saying that my computer is compatible and I'm eligible to save money. All I have to do is install this extension. Closing that window opens a new tab that reads &quot;Hold up! 
Do you actually not want automated savings at checkout&quot; with the options &quot;Yes, Get Coupons&quot; and &quot;No, Don't Save&quot;. Clicking &quot;No, Don't Save&quot; is actually an ad that takes you back to a link that tries to get you to install a chrome extension.</li> <li>That &quot;Norton Safe Web&quot; youtube downloader site, except that the link is wrong and is to the version of the site that purports to download instagram videos instead of the one that purports to download youtube videos.</li> <li>Link to Google help explaining how you can download youtube videos that you personally uploaded</li> <li>SEO blogspam. It immediately has a pop-over to get you to subscribe to their newsletter. Closing that gives you another pop-over with the options &quot;Subscribe&quot; and &quot;later&quot;. Clicking &quot;later&quot; does actually dismiss the 2nd pop-over. After closing the pop-overs, the article has instructions on how to install some software for windows. Searching for reviews of the software returns comments like &quot;This is a PUP/PUA that can download unwanted applications to your pc or even malicious applications.&quot;</li> </ol> <p>Basically the same as Google or Bing.</p> <h4 id="chatgpt">ChatGPT</h4> <p>Since ChatGPT expects more conversational queries, we'll use the prompt &quot;How can I download youtube videos?&quot;</p> <p>The first attempt, on a Monday at 10:38am PT returned &quot;Our systems are a bit busy at the moment, please take a break and try again soon.&quot;. The second attempt returned an answer saying that one should not download videos without paying for YouTube Premium, but if you want to, you can use third-party apps and websites. Following up with the question &quot;What are the best third-party apps and websites?&quot; returned another warning that you shouldn't use third-party apps and websites, followed by the <a href="https://nitter.net/CeciliaZin/status/1740109462319644905">ironic-for-GPT warning,</a></p> <blockquote> <p>I don't endorse or provide information on specific third-party apps or websites for downloading YouTube videos. It's essential to use caution and adhere to legal and ethical guidelines when it comes to online content.</p> </blockquote> <h3 id="ad-blocker">ad blocker</h3> <p>For our next query, we'll try &quot;ad blocker&quot;. We'd like to get <code>ublock origin</code>. Failing that, an ad blocker that, by default, blocks ads. Failing that, something that isn't a scam and also doesn't inject extra ads or its own ads. Although what's best may change at any given moment, comparisons I've seen that don't stack the deck have <a href="https://www.researchgate.net/figure/Performance-benchmark-when-using-an-adblocker-on-the-desktop-version-of-Top150_fig1_318330749">often seemed to show that ublock origin has the best or among the best performance</a>, and ublock origin is free and blocks ads.</p> <h4 id="google-1">Google</h4> <ol> <li>&quot;AdBlock — best ad blocker&quot;. Below the fold, notes &quot;AdBlock participates in the Acceptable Ads program, so unobtrusive ads are not blocked&quot;, so this doesn't block all ads.</li> <li>Adblock Plus | The world's #1 free ad blocker. Pages notes &quot;Acceptable Ads are allowed by default to support websites&quot;, so this also does not block all ads by default</li> <li>AdBlock. Page notes that &quot; Since 2015, we have participated in the Acceptable Ads program, where publishers agree to ensure their ads meet certain criteria. 
Ads that are deemed non-intrusive are shown by default to AdBlock users&quot;, so this doesn't block all ads</li> <li>&quot;Adblock Plus - free ad blocker&quot;, same as (2), doesn't block all ads</li> <li>&quot;AdGuard — World's most advanced adblocker!&quot; Page tries to sell you on some kind of paid software, &quot;AdGuard for Mac&quot;. Searching for AdGuard turns up a post <a href="https://www.reddit.com/r/Adguard/comments/16whmsy/is_there_an_adblocker_for_adguards_adblocker_that/">from this person looking for an ad blocker that blocks ads injected by AdGuard</a>. It seems that you can download it for free, but then, if you don't subscribe, they give you more ads?</li> <li>&quot;AdBlock Pro&quot; on the Safari store; has in-app purchases. It looks like you have to pay to unlock features like blocking videos</li> <li>[YouTube] &quot;How youtube is handling the adblock backlash&quot;. 30 second video with a 15 second ad before the video. Video has no actual content</li> <li>[YouTube] &quot;My thoughts on the youtube adblocker drama&quot;</li> <li>[YouTube] &quot;How to Block Ads online in Google Chrome for FREE [2023]&quot;; first comment on video is &quot;your video doesnt [sic] tell how to stop Youtube adds [sic]&quot;. In the video, a person rambles for a bit and then googles <code>ad blocker extension</code> and then clicks the first link (same as our first link), saying, &quot;If I can go ahead and go to my first website right here, so it's basically officially from Google .... [after installing, as a payment screen pops up asking you to pay $30 or a monthly or annual fee]&quot;</li> <li>&quot;AdBlock for Mobile&quot; on the App Store. It's rated 3.2* on the iOS store. Lots of reviews indicate that it doesn't really work</li> <li>MalwareBytes ad blocker. A quick search indicates that it doesn't block all ads (unclear if that's deliberate or due to bugs)</li> <li>&quot;Block ads in Chrome | AdGuard ad blocker&quot;, same as (5)</li> <li>[ad] NordVPN</li> <li>[ad] &quot;#1 Best Free Ad Blocker (2024) - 100% Free Ad Blocker.&quot; Immediately seems scammy in that it has a fake year (this query was run in mid-November 2023). This is for something called TOTAL Ad Block. <a href="https://www.reddit.com/r/Adblock/comments/1412m7l/total_adblock_peoples_experiencesopinions/">Searching for TOTAL Ad Block turns up results indicating that it's a scammy app that doesn't let you unsubscribe and basically tries to steal your money</a></li> <li>[ad] 100% Free &amp; Easy Download - Automatic Ad Blocker. Actually for Avast browser and not an ad blocker. <a href="https://palant.info/2020/01/13/pwning-avast-secure-browser-for-fun-and-profit/">A quick search shows that this browser has a history of being less secure than just running chromium</a> and that <a href="https://palant.info/2019/10/28/avast-online-security-and-avast-secure-browser-are-spying-on-you/">it collects an unusually large amount of information from users</a>.</li> </ol> <p>No links to ublock origin. Some links to scams, though not nearly as many as when trying to get a youtube downloader. Lots of links to ad blockers that deliberately only block some ads by default.</p> <h4 id="bing-1">Bing</h4> <ul style="list-style: none;"> <li>1. [ad] "Automatic Ad Blocker | 100% Free & Easy Download".
[link is actually to avast secure browser, so an entire browser and not an ad blocker; from a quick search, this appears to be a wrapper around chromium that [has a history of being less secure than just running chromium](https://palant.info/2020/01/13/pwning-avast-secure-browser-for-fun-and-profit/) [which collects an unusually large amount of information from users](https://palant.info/2019/10/28/avast-online-security-and-avast-secure-browser-are-spying-on-you/)].</li> <li>2. [ad] "#1 Best Free Ad Blocker (2023) | 100% Free Ad Blocker". Has a pop-over nag window when you mouse over to the URL bar asking you to install it instead of navigating away. Something called TOTAL ad block. Apparently tries to get to sign up for a subscription [and then makes it very difficult to unsubscribe](https://www.reddit.com/r/Adblock/comments/1412m7l/total_adblock_peoples_experiencesopinions/) (apparently, you can't cancel without a phone call, and when you call and tell them to cancel, they still won't do it unless you threaten to issue a chargeback or block the payment from the bank)</li> <li>3. [ad] "Best Ad Blocker (2023) | 100% Free Ad Blocker". Seems to be a fake review site that reviews various ad blockers; ublock origin is listed as #5 with 3.5 stars. TOTAL ad block is listed as #1 with 5 stars, is the only 5 stars ad blocker, has a banner that shows that it's the "#1 Free Ad Blocker", is award winning, etc. <br> If you then click the link to ublock origin, it takes you to a page that "shows" that ublock origin has 0 stars on trustpilot. There are multiple big buttons that say "click to start blocking ads" that try to get you to install TOTAL ad block. In the bottom right, in what looks like an ad slot, there's an image that says "visit site" for ublock origin. The link doesn't take you to ublock origin and instead takes you a site for [the fake ublock origin](https://www.reddit.com/r/ublock/comments/32mos6/ublock_vs_ublock_origin/). <li>4. [ad] "AVG Free Antivirus 2023 | 100% Free, Secure Download". This at least doesn't pretend to be an ad blocker of any kind.</li> <li>5. [Explore content from adblockplus.org] A link to the adblock plus blog.</li> <li>6. [Explore content from adblockplus.org] A link to a list of adblock plus features.</li> <li>7. "Adblock Plus | The world's #1 free ad blocker". </li> <li>8-13. Sublinks to various pages on the Adblock Plus site.</li> </ul> <p>We're now three screens down from the result, so the equivalent of the above google results is just a bunch of ads and then links to one website. The note that something is an ad is much more subtle than I've seen on any other site. Given <a href="https://twitter.com/danluu/status/823624273516335104">what we know about when users confuse ads with organic search results</a>, it's likely that most users don't realize that the top results are ads and think that the links to scam ad blockers or the fake review site that tries to funnel you into installing a scam ad blocker are organic search results.</p> <h4 id="marginalia-1">Marginalia</h4> <ol> <li>&quot;Is ad-blocker software permissible?&quot; from judaism.stackexchange.com</li> <li>Blogspam for Ghosterty. Ghostery's pricing page notes that you have to pay for &quot;No Private Sponsored Links&quot;, so it seems like some features are behind a pay wall. 
Wikipedia says &quot;Since July 2018, with version 8.2, Ghostery shows advertisements of its own to users&quot;, but it seems like this might be opt-in?</li> <li><a href="https://shouldiblockads.com/">https://shouldiblockads.com/</a>. Explains why you might want to block ads. First recommendation is ublock origin</li> <li>&quot;What’s the best ad blocker for you? - Firefox Add-ons Blog&quot;. First recommendation is ublock origin. Also provides what appears to be accurate information about other ad blockers.</li> <li>Blog post that's a personal account of why someone installed an ad blocker.</li> <li>Opera (browser).</li> <li>Blog post, anti-anti-adblocker polemic.</li> <li>ublock origin.</li> <li>Fairphone forum discussion on whether or not one should install an ad blocker.</li> <li>SEO site blogspam (as in, the site is an SEO optimization site and this is blogspam designed to generate backlinks and funnel traffic to the site).</li> </ol> <p>Probably the best result we've seen so far, in that the third and fourth results suggest ublock origin and the first result is very clearly not an ad blocker. It's unfortunate that the second result is blogspam for Ghostery, but this is still better than we see from Google and Bing.</p> <h4 id="mwmbl-1">Mwmbl</h4> <ol> <li>A bitly link to a &quot;thinkpiece&quot; on ad blocking from a VC thought leader.</li> <li>A link to cryptojackingtest, which forwards to Opera (the browser).</li> <li>A link to ghostery.</li> <li>Another link to ghostery.</li> <li>A link to something called 1blocker, which appears to be a paid ad blocker. Searching for reviews turns up comments like &quot;I did 1blocker free trial and forgot to cancel so it signed me up for annual for $20 [sic]&quot; (but comments indicate that the ad blocker does work).</li> <li>Blogspam for Ad Guard. There's a banner ad offering 40% off this ad blocker.</li> <li>An extremely ad-laden site that appears to be in the search results because it contains the text &quot;ad blocker detected&quot; if you use an ad blocker (I don't see this text on loading the page, but it's in the page preview on Mwmbl). The first page is literally just ads with a &quot;read more&quot; button. Clicking &quot;read more&quot; takes you to a different page that's full of ads that also has the cartoon, which is the &quot;content&quot;.</li> <li>Another site that appears to be in the search results because it contains the text &quot;ad blocker detected&quot;.</li> <li>Malwarebytes ad blocker, which doesn't appear to work.</li> <li>HN comments for article on youtube ad blocker crackdown. Scrolling to the 41st comment returns a recommendation for ublock origin.</li> </ol> <p>Mwmbl lets users suggest results, so I tried signing up to add ublock origin. Gmail put the sign-up email into my spam folder. After adding ublock origin to the search results, it's now the #1 result for &quot;ad blocker&quot; when I search logged out, from an incognito window and all other results are pushed down by one. As mentioned above, the score for Mwmbl is from before I edited the search results and not after.</p> <h4 id="kagi-1">Kagi</h4> <ul style="list-style: none;"> <li>1. "Adblock Plus | The world's #1 free ad blocker".</li> <li>2-11. Sublinks to other pages on the Adblock Plus website.</li> <li>12. "AdBlock — best ad blocker".</li> <li>13. "Adblock Plus - free ad blocker".</li> <li>14. "YouTube’s Ad Blocker Crackdown", a blog post that quotes and links to discussions of people talking about the titular topic.</li> <li>15-18. 
Under a section titled "Interesting Finds", three articles about youtube's crackdown on ad blockers. One has a full page pop-over trying to get you to install TOTAL Adblock with "Close" and "Open" buttons. The "Close" button does nothing and clicking any link or the open button takes to a page advertising TOTAL adblock. There appears to be no way to dismiss the ad and read the actual article without doing something like going to developer tools and deleting the ad elements. The fourth article is titled "The FBI now recommends using an ad blocker when searching the web" and 100% of the above the fold content is the header plus a giant ad. Scrolling down, there are a lot more ads.</li> <li>19. "AdBlock".</li> <li>20. Another link from the Adblock site, "Ad Blocker for Chrome - Download and Install AdBlock for Chrome Now!".</li> <li>21-25. Under a section titled "Blast from the Past", optimal.com ad blocker, a medium article on how to subvert adblock, a blog post from a Mozillan titled "Why Ad Blockers Work" that's a response to Ars Technica's "Why Ad Blocking is devastating to the sites you love", "Why You Need a Network-Wide Ad-Blocker (Part 1)", and "A Popular Ad Blocker Also Helps the Ad Industry", subtitled "Millions of people use the tool Ghostery to block online tracking technology—some may not realize that it feeds data to the ad industry."</li> </ul> <p>Similar quality to Google and Bing. Maybe halfway in between in terms of the number of links to scams.</p> <h4 id="chatgpt-1">ChatGPT</h4> <p>Here, we tried the prompt. <code>How do I install the best ad blocker?</code></p> <p>First suggestion is ublock origin. Second suggestion is adblock plus. This seems like the best result by a significant margin.</p> <h3 id="download-firefox">download firefox</h3> <h4 id="google-2">Google</h4> <ul style="list-style: none;"> <li>1-6. Links to download firefox.</li> <li>7. Blogspam for firefox download with ads trying to trick you into installing badware.</li> <li>8-9. Links to download firefox.</li> <li>10 [ad] Some kind of shady site that claims to have firefox downloads, but where the downloads take you to other sites that try to get you to sign up for an account where they ask for personal information and your credit card number. Also pops up pop-over with window that does the above if you try to actually download firefox. At least one of the sites is some kind of gambling site, so this site might make money off of referring people to gambling sites?</li> </ul> <p>Mostly good links, but 2 out of the top 10 links are scams. And <a href="https://twitter.com/danluu/status/822739582177288192">we didn't have a repeat of this situation I saw in 2017, where Google paid to get ranked above Firefox in a search for Firefox</a>. For search queries where almost every search engine returns a lot of scams, I might rate having 2 out of the top 10 links be scams as &quot;Ok&quot; or perhaps even better but, here, where most search engines return no fake or scam links, I'm rating this as &quot;Bad&quot;. You could make a case for &quot;Ok&quot; or &quot;Good&quot; here by saying that the vast majority of users will click one of the top links and never get as far as the 7th link, but I think that if Google is confident enough that's the case that they view it as unproblematic that the 7th and 10th links are scams, they should just only serve up the top links.</p> <h4 id="bing-2">Bing</h4> <ul style="list-style: none;"> <li>1-12. Links to download firefox or closely related links.</li> <li>13. 
[ad] Avast browser.</li> </ul> <p>That's the entire first page. Seems pretty good. Nothing that looks like a scam.</p> <h4 id="marginalia-2">Marginalia</h4> <ul style="list-style: none;"> <li>1. "Is it better to download Firefox from the website or use the package manager?" on the UNIX stackexchange</li> <li>2-9. Various links related to firefox, but not firefox downloads</li> <li>10. "Internet Download Accelerator online help"</li> </ul> <p>Definitely worse than Bing, since none of the links are to download Firefox. Depending on how highly you rate users not getting scammed vs. having the exact right link, this might be better or worse than Google. In this post, scams are weighted relatively heavily, so Marginalia ranks above Google here.</p> <h4 id="mwmbl-2">Mwmbl</h4> <ul style="list-style: none;"> <li>1-7. Links to download firefox.</li> <li>8. A link to a tumblr that has nothing to do with firefox. The title of the tumblr is "Love yourself, download firefox" (that's the title of the entire blog, not a particular blog post).</li> <li>9. Link to download firefox nightly.</li> <li>10. Extremely shady link that allegedly downloads firefox. Attempting to download the shady firefox pops up an ad that tries to trick you into downloading Opera. I did not run either the Opera or Firefox binaries to see if they're legitimate.</li> </ul> <h4 id="kagi-com">kagi.com</h4> <ul style="list-style: none;"> <li>1-3. Links to download firefox.</li> <li>4-5. Under a heading titled "Interesting finds", a 404'd link to a tweet titled "What happens if you try to download and install Firefox on Windows" [which used to note that downloading Firefox on windows results in an OS-level pop-up that recommends Edge instead "to protect your pc"](https://web.archive.org/web/20220403104257/https://twitter.com/plexus/status/1510568329303445507) and some extremely ad-laden article (though, to its credit, the ads don't seem to be scam ads).</li> <li>6. Link to download firefox.</li> <li>7-10. 3 links to download very old versions of firefox, and a blog post about some kind of collaboration between firefox and ebay.</li> <li>11. Mozilla homepage.</li> <li>12. Link to download firefox.</li> </ul> <p>Maybe halfway in between Bing and Marginalia. No scams, but a lot of irrelevant links. Unlike some of the larger search engines, these links are almost all to download the wrong version of firefox, e.g., I'm on a Mac and almost all of the links are for windows downloads.</p> <h4 id="chatgpt-2">ChatGPT</h4> <p>The prompt &quot;How do I download firefox?&quot; returned technically incorrect instructions on how to download firefox. The instructions did start with going to the correct site, at which point I think users are likely to be able to download firefox by looking at the site and ignoring the instructions. Seems vaguely similar to marginalia, in that you can get to a download by clicking some links, but it's not exactly the right result. However, I think users are almost certain to find the correct steps with ChatGPT and only likely to with Marginalia, so ChatGPT is rated more highly than Marginalia for this query.</p> <h3 id="why-do-wider-tires-have-better-grip">Why do wider tires have better grip?</h3> <p>Any explanation that's correct must, at a minimum, be consistent with the following:</p> <ul> <li>Assuming a baseline of a moderately wide tire for the wheel size.
<ul> <li>Scaling both the tire and the wheel to make both wider than the OEM tire (but still running a setup that fits in the car without serious modifications) generally gives better dry braking and better lap times.</li> <li>In wet conditions, wider setups often have better braking distances (though this depends a lot on the specific setup) and better lap times, but also aquaplane at lower speeds.</li> <li>Just increasing the wheel width and using the same tire generally gives you better lap times, within reason.</li> <li>Just increasing the tire width and leaving wheel width fixed generally results in worse lap times.</li> </ul></li> <li>Why tire pressure changes have the impact that they do (I'm not going to define terms in these bullets; if this text doesn't make sense to you, that's ok). <ul> <li>At small slip angles, increasing tire pressure results in increased lateral force.</li> <li>In general, lowering tire pressure increases the effective friction coefficient (within a semi-reasonable range).</li> </ul></li> </ul> <p>This is one that has a lot of standard incorrect or incomplete answers, including:</p> <ul> <li>Wider tires give you more grip because you get more surface area. <ul> <li>Wider tires don't, at reasonable tire pressure, give you significantly more surface area.</li> </ul></li> <li>Wider tires actually don't give you more grip because friction is surface area times a constant and surface area is mediated by air pressure. <ul> <li>It's easily empirically observed that wider tires do, in fact, give you better handling and braking.</li> </ul></li> <li>Wider tires let you use a softer compound, so the real reason wider tires give you more grip is via the softer compound. <ul> <li>This could be part of an explanation, but I've generally seen this cited as the only explanation. However, wider tires give you more grip independent of having a softer compound. You can even observe this with the same tire by mounting the exact same tire on a wider wheel (within reason).</li> </ul></li> <li>The shape of the contact patch when the tire is wider gives you better lateral grip due to [some mumbo jumbo], e.g., &quot;tire load sensitivity&quot; or &quot;dynamic load&quot;. <ul> <li>Ok, perhaps, but what's the mechanism that gives wider tires more grip when braking? And also, please explain the mumbo jumbo. For my goal of understanding why this happens, if you just use some word but don't explain the mechanism, this isn't fundamentally different than saying that wider tires have better grip due to magic. <ul> <li>When there's some kind of explanation of the mumbo jumbo, there will often be an explanation that only applies to one aspect of increased grip, e.g., the explanation will really only apply to lateral grip and not explain why braking distances are decreased.</li> </ul></li> </ul></li> </ul> <h4 id="google-3">Google</h4> <ul style="list-style: none;"> <li>1. A "knowledge card" that says "Bigger tires provide a wider contact area that optimizes their performance and traction.", which explains nothing. On clicking the link, it's SEO blogspam with many [incorrect statements, such as "Are wider tires better for snow traction? Or are narrow tires more reliable in the winter months? The simple answer is narrow tires!](https://mastodon.social/@danluu/111441790762754806) Tires with a smaller section width provide more grip in winter conditions. They place higher surface pressure against the road they are being driven on, enabling its snow and ice traction"</li> <li>2.
[Question dropdown] "do wider tires give you more grip?", which correctly says "On a dry road, wider tires will offer more grip than narrow ones, but the risk of aquaplaning will be higher with wide tires.". On clicking the link, there's no explanation of why, let alone an answer to the question we're asking</li> <li>3. [Question dropdown] "Do bigger tires give you better traction?", which says "What Difference Does The Wheel Size Make? Larger wheels offer better traction, and because they have more rubber on the tire, this also means a better grip on the road", which has a nonsensical explanation of why. On clicking the link, the link appears to be talking about wheel diameter and is not only wrong, but actually answering the wrong question.</li> <li>4. [Question dropdown] "Why do wider tires have more grip physics?", which then has some of the standard incorrect explanations.</li> <li>5. "Do wider wheels improve handling?", which says "Wider wheels and wider tires will also lower your steering friction coefficient". On clicking the link, there's no explanation of why nor is there an answer to the question we're asking.</li> <li>6. "What are the disadvantages of wider tires?", which says "Harder Handling & Steering". On clicking the link, there are multiple incorrect statements and no explanation of why.</li> <li>7. "Would wider tires increase friction?", which says "Force can be stated as Pressure X Area. For a wide tire, the area is large but the force per unit area is small and vice versa. The force of friction is therefore the same whether the tire is wide or not.". Can't load the page due to a 502 error and the page isn't in archive.org, but this seems fine since the page appears to be wrong</li> <li>8. "What is the advantage of 20 inch wheels over 18 inch wheels?" Answers a different question. On clicking the link, it's low quality SEO blogspam.</li> <li>9. "Why do race cars have wide tires?", which says "Wider tires provide more resistance to slippery spots or grit on the road. Race tracks have gravel, dust, rubber beads and oil on them in spots that limit traction. By covering a larger width, the tires can handle small problems like that better. Wider tires have improved wear characteristics.". Perhaps technically correct, but fundamentally not the answer and highly misleading at best.</li> <li>10-49. Other question dropdowns that are wrong. Usually both wrong and answering the wrong question, but sometimes giving a wrong answer to the right question and sometimes giving the right answer to the wrong question. I am just now realizing that clicking question dropdowns give you more question dropdowns.</li> <li>50. "Why do wider tires get more grip? : r/cars". The person asks the question I'm asking, concluding with "This feels like a really dumb question because wider tires=more grip just seems intuitive, but I don't know the answer.". The top answer is total nonsense "The smaller surface area has more pressure but the same normal force as a larger surface area. If you distribute the same load across more area, each square inch of tire will have less force it's responsible for holding, and thus is less likely to be overcome by the force from the engine". The #2 answer is a classic reddit answer, "Yeah, take your science bs and throw it out the window.". The #3 answer has a vaguely plausible sounding answer to why wider tires have better lateral grip, but it's still misleading. 
Like many of the answers, the answer emphasizes how wider tires give you better lateral grip and has a lengthy explanation for why this should be the case, but wider tires also give you shorter braking distances and the provided explanation cannot explain why wider tires have shorter braking distances so must be missing a significant part of the puzzle. Anyway, none of the rest of the answers really even attempt to explain why</li> <li>51-54. Other reddit answers bunched with this one, which also don't answer the question, although one of them links to https://www.brachengineering.com/content/publications/Wheel-Slip-Model-2006-Brach-Engineering.pdf, which has some good content, though it doesn't answer the question.</li> <li>55. SEO blogspam for someone's youtube video; video doesn't answer the question.</li> <li>56. Extremely ad-laden site with popovers that try to trick you into clicking on ads, etc.; has text I've seen on other pages that's been copied over to make an SEO ad farm (and the text has answers that are incorrect)</li> </ul> <h4 id="bing-3">Bing</h4> <ul style="list-style: none;"> <li>1. Knowledge card which incorrectly states "Larger contact patch with the ground."</li> <li>2-4. Carousel where none of the links answer the question correctly. (3) from bing is (50) from google search results. (2) isn't wrong, but also doesn't answer the question. (3) is SEO blogspam for someone else's youtube video (same link as google.com 55). The video does not answer the question. (3) and (4) are literally the same link and also don't answer the question</li> <li>5. "This is why wider tires equals more grip". SEO blogspam for someone else's youtube video. The youtube video does not answer the question.</li> <li>6-10. [EXPLORE FURTHER] results. (6) is blatantly wrong, (7) is the same link as (3) and (4), (8) is (2), SEO blogspam for someone else's youtube video and the video doesn't answer the question, (9) is s SEO blogspam for someone else's youtube video and the video doesn't answer the question, (10) is generic SEO blogspam with lots of incorrect information</li> <li>11. Same link as (2) and (8), still SEO blogspam for someone else's youtube video and the video doesn't answer the question</li> <li>12-13 [EXPLORE FURTHER] results. (12) is some kind of SEO ad farm that tries to get you to make "fake" ad clicks (there are full screen popovers that, if you click them, cause you to click through some kind of ad to some normal site, giving revenue to whoever set up the ad farm). (13) is the website of the person who made one of the two videos that's a common target for SEO blogspam on this topic. It doesn't answer the question, but at least we have the actual source here. </li> </ul> <p>From skimming further, many of the other links are the same links as above. No link appears to answer the question.</p> <h4 id="marginalia-3">Marginalia</h4> <p>Original query returns zero results. Removing the question mark returns one single result, which is the same as (3) and (4) from bing.</p> <h4 id="mwmbl-3">Mwmbl</h4> <ol> <li>NYT article titled &quot;Why Women Pay Higher Interest&quot;. 
This is the only returned result.</li> </ol> <p>Removing the question mark returns an article about bike tires titled &quot;Fat Tires During the Winter: What You Need to Know&quot;</p> <h4 id="kagi-2">Kagi</h4> <ol> <li>A knowledge card that incorrectly reads &quot;wider tire has a greater contact patch with the ground, so can provide traction.&quot;</li> <li>(50) from google</li> <li>Reddit question with many incorrect answers</li> <li>Reddit question with many incorrect answers. Top answer is &quot;The same reason that pressing your hand on the desk and sliding it takes more effort than doing the same with a finger. More rubber on the road = more friction&quot;.</li> <li>(3) and (4) from bing</li> <li>Youtube video titled &quot;Do wider tyres give you more grip?&quot;. Clicking the video gives you 1:30 in ads before the video plays. The video is good, but it answers the question in the title of the video and not the question being asked of why this is the case. The first ad appears to be an ad revenue scam. The first link actually takes you to a second link, where any click takes you through some ad's referral link to a product.</li> <li>&quot;This is why wider tires equals more grip&quot;. SEO blogspam for (6)</li> <li>SEO blogspam for another youtube video</li> <li>SEO blogspam for (6)</li> <li>Quora answer where top answer doesn't answer the question and I can't read all of the answers because I'm not logged in or aren't a premium member or something.</li> <li>Google (56), stolen text from other sites and a site that has popovers that try to trick you into clicking ads</li> <li>Pre-chat GPT nonsense text and a page that's full of ads. Unusually, the few ads that I clicked on seemed to be normal ads and not scams.</li> <li>Blogspam for ad farm that has pop-overs that try to get you to install badware.</li> <li>Page with ChatGPT-sounding nonsense. Has a &quot;Last updated&quot; timestamp that's sever-side generated to match the exact moment you navigated to the page. Page tries to trick you into clicking on ads with full-page popover. Ads don't seem to be scams, as far as I can tell.</li> <li>Page which incorrectly states &quot;In summary, a wider tire does not give better traction, it is the same traction similar to a more narrow tire.&quot;. Has some ads that get you to try to install badware.</li> </ol> <h4 id="chatgpt-3">ChatGPT</h4> <p>Provides a list of &quot;hallucinated&quot; reasons. The list of reasons has better grammar than most web search results, but still incorrect. It's not surprising that ChatGPT can't answer this question, since it often falls over on questions that are both easier to reason about and where the training data will contain many copies of the correct answer, e.g., <a href="https://www.threads.net/@jossfong/post/C1fJe-mJRHx">Joss Fong noted that, when her niece asked ChatGPT about gravity</a>, the response was nonsense: &quot;... That's why a feather floats down slowly but a rock drops quickly — the Earth is pulling them both, but the rock gets pulled harder because it's heavier.&quot;</p> <p>Overall, no search engine gives correct answers. 
Marginalia seems to be the best here in that it gives only a couple of links to wrong answers and no links to scams.</p> <h3 id="why-do-they-keep-making-cpu-transistors-smaller">Why do they keep making cpu transistors smaller?</h3> <p>I had this question when I was in high school and my AP physics teacher explained to me that it was because making the transistors smaller allowed the CPU to be smaller, which let you make the whole computer smaller. Even at age 14, I could see that this was an absurd answer, not really different than today's ChatGPT hallucinations — at the time, computers tended to be much larger than they are now, and full of huge amounts of empty space, with the CPU taking up basically no space relative to the amount of space in the box and, on top of that, CPUs were actually getting bigger and not smaller as computers were getting smaller. I asked some other people and didn't really get an answer. This was also relatively early in the life of the public web and I wasn't able to find an answer other than something like &quot;smaller transistors are faster&quot; or &quot;smaller = less capacitance&quot;. But why are they faster? And what makes them have less capacitance? Specifically, what about the geometry causes that to scale so that transistors get faster? It's not, in general, obvious that things should get faster if you shrink them, e.g., if you naively linearly shrink a wire, it doesn't appear that it should get faster at all: the cross sectional area is reduced quadratically, increasing resistance per unit length quadratically, but length is also reduced linearly, so total resistance only increases linearly. And then capacitance also decreases linearly, so it all cancels out. Anyway, for transistors, it turns out the same kind of straightforward scaling logic shows that they speed up (back then, transistors were large enough and wire delay was relatively small enough that you got extremely large increases in performance from shrinking transistors). You could explain this to a high school student who's taken physics in a few minutes if <a href="https://twitter.com/danluu/status/1147984717238562816">you had the right explanation</a>, but I couldn't find an answer to this question until I read a VLSI textbook.</p>
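<p>To make the wire-scaling arithmetic above concrete, here's a minimal back-of-the-envelope sketch in Python (my own illustration with made-up unit values, not something from any of the search results below): shrink every dimension of a wire by a factor k and the resistance goes up by k, the capacitance goes down by k, and the RC delay is unchanged.</p> <pre><code>def wire_rc_delay(length, width, height, spacing, resistivity=1.0, permittivity=1.0):
    # Crude parallel-plate-style model of a wire over a ground plane:
    # R = rho * length / (width * height), C = eps * length * width / spacing.
    resistance = resistivity * length / (width * height)
    capacitance = permittivity * length * width / spacing
    return resistance * capacitance

k = 2.0  # shrink every dimension by 2x
base = wire_rc_delay(length=1.0, width=1.0, height=1.0, spacing=1.0)
shrunk = wire_rc_delay(length=1.0 / k, width=1.0 / k, height=1.0 / k, spacing=1.0 / k)

# Resistance goes up by k and capacitance goes down by k, so the RC product
# is unchanged: naively, the shrunk wire is no faster than the original.
print(base, shrunk)  # both print 1.0
</code></pre> <p>The analogous exercise for transistors is the one that actually shows the speedup, and it's that explanation that the results below fail to give.</p>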
<p>There's now enough content on the web that there must be multiple good explanations out there. Just to check, I used non-naive search terms to find some good results. Let's look at what happens when you use the naive search from above, though.</p> <h4 id="google-4">Google</h4> <ul style="list-style: none;"> <li>1. A knowledge card that reads "Smaller transistors can do more calculations without overheating, which makes them more power efficient.", which isn't exactly wrong but also isn't what I'd consider an answer of why. The article is interesting, but is about another topic and doesn't explain why.</li> <li>2. [Question dropdown], "Why are transistors getting smaller?". Site has an immediate ad pop-over on opening. Site doesn't really answer the question, saying "Since the first integrated circuit was built in the 1950s, silicon transistors have shrunk following Moore’s law, helping pack more of these devices onto microchips to boost their computing power."</li> <li>3. [Question dropdown] "Why do transistors need to be small?". Answer is "The capacitance between two conductors is a function of their physical size: smaller dimensions mean smaller capacitances. And because smaller capacitances mean higher speed as well as lower power, smaller transistors can be run at higher clock frequencies and dissipate less heat while doing so", which isn't wrong, but the site doesn't explain the scaling that made things faster as transistors got smaller. The page mostly seems concerned with discrete components and notes that "In general, passive components like resistors, capacitors and inductors don’t become much better when you make them smaller: in many ways, they become worse. Miniaturizing these components is therefore done mainly just to be able to squeeze them into a smaller volume, and thereby saving PCB space.", so it's really answering a different question</li> <li>4. [Question dropdown], "Why microchips are getting smaller?". SEO blogspam that doesn't answer the question other than saying stuff like "smaller is faster"</li> <li>5. [Question dropdown], "Why are microprocessors getting smaller?". Link is to stackexchange. The top answer is that yield is better and cost goes down when chips are smaller, which I consider a non-answer, in that it's also extremely expensive to make things smaller, so what explains why the cost reduction is there? And, also, even if the cost didn't go down, companies would still want smaller transistors for performance reasons, so this misses a major reason and arguably the main reason. The #2 answer actually sort of explains it, "The reason for this is that as the transistor gate gets smaller, threshold voltage and gate capacitance (required drive current) gets lower.", but is both missing parts of the explanation and doesn't provide the nice, intuitive, physical explanation for why this is the case. Other answers are non-answers like "The CORE reason why CPUs keep getting smaller is simply that, in computing, smaller is more powerful:". It's possible to get to a real explanation by searching for these terms.</li> <li>6. "Why are CPU and GPU manufacturers trying to make ...". Top answer is the non-answer of "Smaller transistors are faster and use less power. Small is good." and since it's quora and I'm not a subscriber, the other answers are obscured by a screen that suggests I start a free trial to "access this answer and support the author as a Quora+ subscriber".</li> <li>7-10. Sub-links to other quora answers. Since I'm not a subscriber, by screen real estate, most of the content is ads. None of the content I could read answered the question.</li> </ul> <h4 id="bing-4">Bing</h4> <ul style="list-style: none;"> <li>1. Knowledge card with multiple parts. First parts have some mumbo jumbo, but the last part contains a partial answer. If you click on the last part of the answer, it takes you to a stack exchange question that has more detail on the partial answer. There's enough information in the partial answer to do a search and then find a more complete explanation.</li> <li>2-4. [people also ask] Some answers that are sort of related, but don't directly answer the question</li> <li>5. Stack exchange answer for a different question.</li> <li>7-10. [explore further] Answers to totally unrelated questions, except for 10, which is extremely ad-laden blogspam for a related question that has a bunch of semi-related text with many ads interspersed between the text.</li> </ul> <h4 id="kagi-3">Kagi</h4> <ul style="list-style: none;"> <li>1. "Why does it take multiple years to develop smaller transistors for CPUs and GPUs?", on r/askscience. Some ok comments, but they answer a different question.</li> <li>2-5.
Other reddit links that don't answer the question. Some of them are people asking this question, but the answers are wrong. Some of the links answer different questions and have quite good answers to those questions.</li> <li>6. Stackexchange question that has incorrect and misleading answers.</li> <li>7. Stackexchange question, but a different question.</li> <li>8. Quora question. The answers I can read without being a member don't really answer the question.</li> <li>9. Quora question. The answers I can read without being a member don't really answer the question.</li> <li>10. Metafilter question from 2006. The first answers are fundamentally wrong, but one of the later answers links to the wikipedia page on MOSFET. Unfortunately, the link is to the now-removed anchor #MOSFET_scaling. There's still a scaling section which has a poor explanation. There's also a link to the page on Dennard Scaling, which is technically correct but has a very poor explanation. However, someone could search for more information using these terms and get correct information.</li> </ul> <h4 id="marginalia-4">Marginalia</h4> <p>No results</p> <h4 id="mwmbl-4">Mwmbl</h4> <ol> <li>A link to a Vox article titled &quot;Why do artists keep making holiday albums?&quot;. This is the only result.</li> </ol> <h4 id="chatgpt-4">ChatGPT</h4> <p>Has non-answers like &quot;increase performance&quot;. Asking ChatGPT to expand on this, with &quot;Please explain the increased performance.&quot; results in more non-answers as well as fairly misleading answers, such as</p> <blockquote> <p>Shorter Interconnects: Smaller transistors result in shorter distances between them. Shorter interconnects lead to lower resistance and capacitance, reducing the time it takes for signals to travel between transistors. Faster signal propagation enhances the overall speed and efficiency of the integrated circuit ... The reduced time it takes for signals to travel between transistors, combined with lower power consumption, allows for higher clock frequencies</p> </blockquote> <p>I could see this seeming plausible to someone with no knowledge of electrical engineering, but this isn't too different from ChatGPT's explanation of gravity, &quot;... That's why a feather floats down slowly but a rock drops quickly — the Earth is pulling them both, but the rock gets pulled harder because it's heavier.&quot;</p> <h3 id="vancouver-snow-forecast-winter-2023">vancouver snow forecast winter 2023</h3> <p>Good result: Environment Canada's snow forecast, predicting significantly below normal snow (and above normal temperatures)</p> <h4 id="google-5">Google</h4> <ol> <li>Knowledge card from a local snow removal company, incorrectly stating &quot;The forecast for the 2023/2024 season suggests that we can expect another winter marked by ample snowfall and temperatures hovering both slightly above and below the freezing mark. Be prepared ahead of time.&quot;. On opening the page, we see that the next sentence is &quot;Have Alblaster [the name of the company] ready to handle your snow removal and salting. 
We have a proactive approach to winter weather so that you, your staff and your customers need not concern yourself with the approaching storms.&quot; and the goal of the link is to get you to buy snow removal services regardless of their necessity by writing a fake forecast.</li> <li>[question dropdown] &quot;What is the winter prediction for Vancouver 2023?&quot;, incorrectly saying that it will be &quot;quite snowy&quot;.</li> <li>[question dropdown] &quot;What kind of winter is predicted for 2023 Canada?&quot; Links to a forecast of Ontario's winter, so not only the wrong province, but the wrong coast, and also not actually an answer to the question in the dropdown.</li> <li>[question dropdown] &quot;What is the winter prediction for B.C. in 2023 2024?&quot; Predicts that B.C. will have a wet and mild winter, which isn't wrong, but doesn't really answer the question.</li> <li>[question dropdown] &quot;What is the prediction for 2023 2024 winter?&quot; Has a prediction for U.S. weather</li> <li>Blogspam article that has a lot of pointless text with ads all over. Text is contradictory in various ways and doesn't answer the question. Has a huge pop-over ad that covers the top half of the page</li> <li>Another blogspam article from the same source. Lots of ads; doesn't answer the question</li> <li>Ad-laden article that answers some related questions, but not this question</li> <li>Extremely ad-laden article that's almost unreadable due to the number of ads. Talks a lot about El Nino. Eventually notes that we should see below-normal snow in B.C. due to El Nino, but B.C. is almost 1M km² and the forecast is not the same for all of B.C., so you could maybe hope that the comment about B.C. here applies to Vancouver, but this link only lets you guess at the answer</li> <li>Very ad-laden article, but does have a map that's labeled &quot;winter precipitation&quot;, which appears to be about snow and not rain. The map seems quite different from Environment Canada's map, but it does show reduced &quot;winter precipitation&quot; over Vancouver, so you might conclude the right thing from this map.</li> </ol> <h4 id="bing-5">Bing</h4> <ul style="list-style: none;"> <li>1-4. [news carousel] Extremely ad-laden articles that don't answer the question. Multiple articles are well over half ads by page area.</li> <li>5. Some kind of page that appears to have the answer, except that the data seems to be totally fabricated? There's a graph with day-by-day probability of "winter storm". From when I did the search, there's an average of about a 50% daily chance of a "snow storm" going forward for the next 2 weeks. Forecasts that don't seem fake have it at 1% or less daily. Page appears to be some kind of SEO'd fake forecast that makes money on ads?</li> <li>6-8. [more links from same site] Various ad-laden pages. One is a "contact us" page where the main "contact us" pane is actually a trick to get you to click on an ad for some kind of monthly payment service that looks like a scam</li> <li>9-14. [Explore 6 related pages ... recommended to you based on what's popular] Only one link is relevant. That link has a "farmer's almanac" forecast that's fairly different from Environment Canada's forecast. The farmer's almanac page mainly seems to be an ad to get you to buy farmer's almanac stuff, although it also has conventional ads</li> </ul> <h4 id="kagi-4">Kagi</h4> <ul style="list-style: none;"> <li>1. Same SEO'd fake forecast as Bing (5)</li> <li>2-4. More results from the scam weather site</li> <li>5-7.
[News] Irrelevant results</li> <li>8. Spam article from same site as Google (6)</li> <li>9-13. More SEO spam from the same site</li> <li>14. Same fake forecast as Google (1)</li> <li>15. Page is incorrectly tagged as being from "Dec 25, 2009" (it's a recent page) and doesn't contain relevant results</li> </ul> <h4 id="marginalia-5">Marginalia</h4> <p>No results.</p> <h4 id="mwmbl-5">Mwmbl</h4> <ul style="list-style: none;"> <li>1. Ad-laden news article from 2022 about a power outage. Has an autoplay video ad and many other ads as well.</li> <li>2. 2021 article about how the snow forecast for Philadelphia was incorrect. Article has a slow-loading full-page pop-over that shows up after a few seconds and is full of ads.</li> <li>3. 2016 article on when the Ohio river last froze over.</li> <li>4. Some local news site from Oregon with a Feb 2023 article on the snow forecast at the time. Site has an autoplay video ad and is full of other ads. Clicking one of the random ads ("Amazon Hates When You Do This, But They Can't Stop You (It's Genius)") results in the ad trying to get you to install a chrome extension. The ad attempts to resemble an organic blog post on a site that's just trying to get you to save money, but if you try to navigate away from the "blog post", you get a full-page pop-over that tries to trick you into installing the chrome extension. Going to the base URL reveals that the entire site exists to trick users into installing this chrome extension. This is the last result.</li> </ul> <h4 id="chatgpt-5">ChatGPT</h4> <p>&quot;What is the snow forecast for Vancouver in winter of 2023?&quot;</p> <p>Doesn't answer the question; recommends using a website, app, or weather service.</p> <p>Asking &quot;Could you please direct me to a weather website, app, or weather service that has the forecast?&quot; causes ChatGPT to return random weather websites that don't have a seasonal snow forecast.</p> <p>I retried a few times. One time, I accidentally pasted in the entire ChatGPT question, which meant that my question was prepended with &quot;User\n&quot;. That time, ChatGPT suggested &quot;the Canadian Meteorological Centre, Environment Canada, or other reputable weather websites&quot;. The top response when asking for the correct website was &quot;Environment Canada Weather&quot;, which at least has a reasonable-seeming seasonal snow forecast somewhere on the website. The other links were still to sites that aren't relevant.</p> <h2 id="appendix-google-knowledge-card-results">Appendix: Google &quot;knowledge card&quot; results</h2> <p>In general, I've found Google knowledge card results to be quite poor, <a href="https://twitter.com/sasuke___420/status/1428104132301185029">both for specific questions</a> with <a href="https://twitter.com/danluu/status/1428103276583661570">easily findable answers</a> as well as for silly questions like &quot;when was running invented&quot; which, for years, infamously returned &quot;1748. Running was invented by Thomas Running when he tried to walk twice at the same time&quot; (which was pulled from a Quora answer).</p> <p>I had a doc where I was collecting every single knowledge card I saw to tabulate the fraction that were correct.
I don't know that I'll ever turn that into a post, so here are some &quot;random&quot; queries with their knowledge card result (and, if anyone is curious, most knowledge card results I saw when I was tracking this were incorrect).</p> <ul> <li>&quot;oc2 gemini length&quot; (looking for the length of a kind of canoe, an oc2, called a gemini) <ul> <li>20″ (this was the length of a baby mentioned in an article that also mentioned the length of the boat, which is 24'7&quot;)</li> </ul></li> <li>&quot;busy beaver number&quot; <ul> <li>(604) 375-2754</li> </ul></li> <li>&quot;Feedly revenue&quot; <ul> <li>&quot;$5.2M/yr&quot;, via a link to a site which appears to just completely fabricate revenue and profit estimates for private companies</li> </ul></li> <li>&quot;What airlines fly direct from JFK airport to BLI airport?&quot; <ul> <li>&quot;Alaska Airlines - (AS) with 30 direct flights between New York and Bellingham monthly; Delta Air Lines - (DL) with 30 direct flights between JFK and BLI monthly&quot;. This sounded plausible, but when I looked this up, this was incorrect. The page it links to has a bunch of text like &quot;How many morning flights are there from JFK to BLI? Alaska Airlines - (AS) lists, on average, 1 flights departing before 12:00pm, where the first departure from JFK is at 09:30AM and the last departure before noon is at 09:30AM&quot;, seemingly with the goal of generating a knowledge card for questions like this. It doesn't really matter that the answers are fabricated since the goal of the site seems to be to get traffic or visibility via knowledge cards</li> </ul></li> <li>&quot;Air Canada Vancouver Newark&quot; <ul> <li>At the time I did this search, this showed a knowledge card indicating that AC 7082 was going to depart the next day at 11:50am, but no such flight had existed for months and there was certainly not an AC 7082 flight about to depart the next day</li> </ul></li> <li>&quot;TYR Hurricane Category 5 neoprene thickness&quot; <ul> <li>1.5mm (this is incorrect)</li> </ul></li> <li>&quot;Intel number of engineers&quot; <ul> <li>(604) 742-3501 (I was looking for the number of engineers that Intel employed, not a phone number, and even if I was looking for a phone number for Intel engineers, I don't think this is it).</li> </ul></li> <li>&quot;boston up118s dimensions&quot; <ul> <li>&quot;5826298 x 5826899 x 582697 in&quot; (this is a piano and, no, it is not 92 miles long)</li> </ul></li> <li>&quot;number of competitive checkers players&quot; <ul> <li>2</li> </ul></li> <li>&quot;fraser river current speed&quot; <ul> <li>&quot;97 to 129 kilometers per hour (60 to 80 mph)&quot; (this is incorrect)</li> </ul></li> <li>&quot;futura c-4 surfski weight&quot; <ul> <li>&quot;39 pounds&quot; (this is actually the weight of a different surfski; the article this comes from just happens to also mention the futura c-4)</li> </ul></li> </ul> <h2 id="appendix-faq">Appendix: FAQ</h2> <p>As already noted, the most common responses I get are generally things that are explicitly covered in the post, so I won't cover those again here. However, any time I write a post that looks at anything, I also get a slew of comments like the one below and, indeed, that was one of the first comments I got on this post.</p> <blockquote> <p>This isn't a peer-reviewed study, it's crap</p> </blockquote> <p>As I noted in <a href="overwatch-gender/">this other post</a>,</p> <blockquote> <p>There's nothing magic about academic papers.
I have my name on a few publications, including one that won best paper award at the top conference in its field. My median blog post is more rigorous than my median paper or, for that matter, the median paper that I read.</p> <p>When I write a paper, I have to deal with co-authors who push for putting in false or misleading material that makes the paper look good and my ability to push back against this has been fairly limited. On my blog, I don't have to deal with that and I can write up results that are accurate (to the best of my ability) even if it makes the result look less interesting or less likely to win an award.</p> </blockquote> <p>The same thing applies here and, in fact, I have a best paper award in this field (information retrieval, or IR, colloquially called search). I don't find IR papers particularly rigorous. I did push very hard to make <a href="bitfunnel-sigir.pdf">my top-conference best-paper-award-winning paper</a> more rigorous and, while I won some of those fights, I lost others, and that paper has a number of issues that I wouldn't let pass in a blog post. I suspect that people who make comments like this mostly don't read papers and, to the extent they do, don't understand them.</p> <p>Another common response is</p> <blockquote> <p>Your table is wrong. I tried these queries on Kagi and got Good results for the queries [but phrased much more strongly]</p> </blockquote> <p>I'm not sure why people feel so strongly about Kagi, but all of these kinds of responses so far have come from Kagi users. No one has gotten good results for the tire, transistor, or snow queries (note, again, that this is not a query looking for a daily forecast, as clearly implied by the &quot;winter 2023&quot; in the query), nor are the results for the other queries very good if you don't have an ad blocker. I suppose it's possible that the next person who tells me this actually has good results, but that seems fairly unlikely given the zero percent correctness rate so far.</p> <p>For example, one user claimed that the results were all good, but they pinned GitHub results and only ran the queries for which you'd get a good result on GitHub. This is actually worse than you get if you use Google or Bing and write good queries since you'll get noise in your results when GitHub is the wrong place to search. Of course, you could make a similar claim that Bing is amazing if you write non-naive queries, so it's curious that so many Kagi users are angrily writing me about this and no Google or Bing users are. Kagi appears to have tapped into the same vein that Tesla and Apple have managed to tap into, where users become incensed that someone is criticizing something they love and then <a href="https://mastodon.social/@danluu/111461841130226661">write nonsensical defenses of their favorite product</a>, which bodes well for Kagi. I've gotten comments like this from not just one Kagi user, but many.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:B">this person does go on to say &quot;, but it <em>is</em> true that a lot of, like, tech industry/trade stuff has been overwhelmed by LLM-generated garbage&quot;. However, the results we see in this post generally seem to be non-LLM-generated text, often from pages pre-dating LLMs, and low-quality results don't seem confined to, or even particularly bad in, tech-related areas. Or, to pick another example, our bluesky thought leader is in a local Portland band.
If I search &quot;[band name] members&quot;, I get a knowledge card which reads &quot;[different band name] is a UK indie rock band formed in Glastonbury, Somerset. The band is composed of [names and instruments].&quot; <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:M"><p>For example, for a youtube downloader, my go-to would be to search HN, which returns reasonable results. Although that works, if it didn't, my next step would be to search reddit (but not using reddit search, of course), which returns a mix of good and bad results; searching for info about each result shows that the 2nd returned result (<code>yt-dlp</code>) is good and most of the other results are quite bad. Other people have different ways of getting good results, e.g., Laurence Tratt's reflex is to search for &quot;youtube downloader cli&quot; and Heath Borders's is to search for &quot;YouTube Downloader GitHub&quot;; both of those searches work decently as well. If you're someone whose bag of tricks includes the right contortions to get good results for almost any search, it's easy to not realize that most users don't actually know how to do this. From having watched non-expert users try to use computers with advice from expert users, it's clear that many sophisticated users severely underestimate how much knowledge they have. For example, I've heard many programmers say that they're good at using computers because &quot;I just click on random things to see what happens&quot;. Maybe so, but when they give this advice to naive users, this generally doesn't go well and the naive users will click on the wrong random things. The expert user is not, in fact, just clicking on things at random; they're using their mental model of the system to try clicks that might plausibly make sense. Similarly with search, where people will give semi-plausible-sounding advice like &quot;just add site:reddit.com to queries&quot;. But adding &quot;site:reddit.com&quot; makes many queries worse instead of better; you have to have a mental model of which queries this works on and which queries it fails on.</p> <p>When people have some kind of algorithm that they consistently use, it's often one that produces poor results and is also very surprising to technical folks. For example, Misha Yagudin noted, &quot;I recently talked to some Russian emigrates in Capetown (two couples have travel agencies, and another couple does RUB&lt;&gt;USDT&lt;&gt;USD). They were surprised I am not on social media, and I discovered that people use Instagram (!!) instead of Google to find products and services these days. The recipe is to search for something you want 'triathlon equipment,' click around a bit, then over the next few days you will get a bunch of recommendations, and by clicking a bit more you will get even better recommendations. This was wild to me.&quot;</p> <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:N">she did better than naive computer users, but still had a lot of holes in her mental model that would lead to installing malware on her machine. For what it's like for normal computer users, the internet is full of stories from programmers like <a href="https://news.ycombinator.com/item?id=34919473">&quot;The number of times I had to yell at family members to NOT CLICK THAT ITS AN AD is maddening. It required getting a pretty nasty virus and a complete wipe to actually convince my dad to install adblock.&quot;</a>.
The internet is full of <a href="https://news.ycombinator.com/item?id=34917701">scam ads that outrank search</a> and install malware, a decent fraction of users are on devices that have been owned by clicking on an ad or a malicious SEO'd search result, and you have to constantly watch most users if you want to stop their devices from being owned. <a class="footnote-return" href="#fnref:N"><sup>[return]</sup></a></li> <li id="fn:R">accidentally prepending &quot;User\n&quot; to one query got it to return a good result instead of bad results, reminiscent of how <a href="https://twitter.com/danluu/status/1601072083270008832">ChatGPT &quot;thought&quot; Colin Percival was dead if you asked it to &quot;write about&quot; him, but alive if you asked it to &quot;Write about&quot; him</a>. It's already commonplace for search ranking to be done with multiple levels of ranking, so perhaps you could get good results by running randomly perturbed queries and using a 2nd level ranker, or ChatGPT could even have something like this built in. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:T">some time after Google stopped returning every tweet I wanted to find, Twitter search worked well enough that I could find tweets with Twitter search. However, post-acquisition, Twitter search often doesn't work in various ways. For maybe 3-5 months, search didn't return any of my tweets at all. And both before and after that period, searches often fail to return a tweet even when I search for an exact substring of a tweet, so now I often have to resort to various weird searches for things that I expect to link to the tweet I'm looking for so I can manually follow the link to get to the tweet. <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> </ol> </div> Transcript of Elon Musk on stage with Dave Chapelle elon-dave-chappelle/ Sun, 11 Dec 2022 00:00:00 +0000 elon-dave-chappelle/ <p>This is a transcription of videos of Elon Musk's appearance on stage with Dave Chapelle using OpenAI's Whisper model with some manual error corrections and annotations for crowd noise.</p> <p>As with the <a href="elon-twitter-texts/">Exhibit H Twitter text message release</a>, there are a lot of articles that quote bits of this, but the articles are generally missing a lot of what happened and often paint a misleading picture of what happened, and the entire thing is short enough that you might as well watch or read it instead of reading someone's misleading summary. In general, the media seems to want to paint a highly unflattering picture of Elon, resulting in articles and viral tweets that are factually incorrect. For example, it's been widely incorrectly reported that, during the &quot;I'm rich, bitch&quot; part, horns were played to drown out the crowd's booing of Elon, but the horn sounds were played when the previous person said the same thing, which was the most cheered statement that was recorded. The sounds are much weaker when Elon says &quot;I'm rich, bitch&quot; and can't be heard clearly, but it sounds like a mix of booing and cheering. It was probably the most positive crowd response that Elon got from anything, and it seems inaccurate in at least two ways to say that horns were played to drown out the booing Elon was receiving.
On the other hand, even though the media has tried to paint as negative a picture of Elon as possible, it's done quite a poor job, and a boring, accurate accounting of what happened in many of the other sections is much less flattering than the misleading summaries that are being passed around.</p> <p><ul> <li><a href="https://www.youtube.com/watch?v=CzkreBMHUFY">Video 1</a> <ul> <li><b>Dave</b>: Ladies and gentlemen, make some noise for the richest man in the world.</li> <li><b>Crowd</b>: [mixed cheering, clapping, and boos; boos drown out cheering and clapping after a couple of seconds and continue into next statements]</li> <li><b>Dave</b>: Cheers and boos, I say</li> <li><b>Crowd</b>: [brief laugh, boos continue to drown out other crowd noise]</li> <li><b>Dave</b>: Elon</li> <li><b>Crowd</b>: [booing continues]</li> <li><b>Elon</b>: Hey Dave</li> <li><b>Crowd</b>: [booing intensifies]</li> <li><b>Elon</b>: [unintelligible over booing] <li><b>Dave</b>: Controversy, buddy. <li><b>Crowd</b>: [booing continues; some cheering can be heard]</li> <li><b>Elon</b>: Weren't expecting this, were ya?</li> <li><b>Dave</b>: It sounds like some of them people you fired are in the audience.</li> <li><b>Crowd</b>: [laughs, some clapping can be heard] <li><b>Elon</b>: [laughs] <li><b>Crowd</b>: [booing resumes] <li><b>Dave</b>: Hey, wait a minute. Those of you booing</li> <li><b>Crowd</b>: [booing intensifies]</li> <li><b>Dave</b>: Tough [unintelligible due to booing] sounds like</li> <li><b>Elon</b>: [unintelligible due to being immediately cut off by Dave]</li> <li><b>Dave</b>: You know there's one thing. All those people are booing. I'm just. I'm just pointing out the obvious. They have terrible seats. [unintelligible due to crowd noise]</li> <li><b>Crowd</b>: [weak laughter]</li> <li><b>Dave</b>: All coming from wayyy up there [unintelligible] last minute non-[unintelligible] n*****. Booo. Booooooo.</li> <li><b>Crowd</b>: [quiets down]</li> <li><b>Dave</b>: Listen.</li> <li><b>Crowd</b>: [booing resumes]</li> <li><b>Dave</b>: Whatever. Look motherfuckas. This n**** is not even trying to die on earth</li> <li><b>Crowd</b>: [laughter mixed with booing, laughter louder than boos]</li> </ul></li></p> <p><li><a href="https://www.youtube.com/watch?v=hUkulqjTNpk">Video 2</a> <ul> <li><b>Dave</b>: His whole business model is fuck earth I'm leaving anyway</li> <li><b>Crowd</b>: [weak laughter, weak mixed sounds]</li> <li><b>Dave</b>: Do all you want. Take me with you n**** I'm going to Mars</li> <li><b>Crowd</b>: [laughter]</li> <li><b>Dave</b>: Whatever kind of pussy they got up there, that's what we'll be doin <li><b>Crowd</b>: [weak laughter]</li> <li><b>Dave</b>: [laughs] Anti-gravity titty bars. Follow your dreams bitch and the money just flow all over the room</li> <li><b>Crowd</b>: [weak laughter]</li> <li><b>Elon</b>: [laughs]</li> <li><b>Crowd</b>: [continued laughter drowned out by resumed booing; some cheering can be heard]</li> <li><b>Elon</b>: Thanks for, uhh, thanks for having me on stage.</li> <li><b>Dave</b>: Are you kidding. I wouldn't miss this opportunity.</li> <li><b>Elon</b>: [unintelligible, cut off by crowd laughter]</li> <li><b>Crowd</b>: [laughter]</li> <li><b>Elon</b>: [unintelligible, cut off by crowd laughter]</li> <li><b>Dave</b>: The first comedy club on Mars that should be my [pause for crowd laughter] a deal's a deal, Musk.</li> <li><b>Crowd</b>: [weak laughter and cheering]</li> <li><b>Elon</b>: [unintelligible], yeah</li> <li><b>Dave</b>: You n***** can boo all you want.
This n**** gave me a jet pack last Christmas</li> <li><b>Crowd</b>: [laughter]</li> <li><b>Dave</b>: Fly right past your house. They can boo these nuts [unintelligible due to laughter at this line]</li> <li><b>Dave</b>: That's how we like to chill, we do all the shit</li> <li><b>Crowd</b>: [weak laughter, shifting to crowd talking]</li> <li><b>Elon</b>: [Elon shifts, as if to address crowd]</li> <li><b>Crowd</b>: [booing resumes]</li> <li><b>Elon</b>: Dave, what should I say?</li> <li><b>Crowd</b>: [booing intensifies]</li> <li><b>Dave</b>: Don't say nothin. It'll only spoil the moment. Do you hear that sound Elon? That's the sound of pending civil unrest.</li> <li><b>Crowd</b>: [weak laughter, some booing can initially be heard; booing intensifies until Dave cuts it off with his next line]</li> <li><b>Dave</b>: I can't wait to see which story you decimate next motherfucka [unintelligible] you shut the fuck up with your boos. There's something better that you can do. Booing is not the best thing that you can do. Try it n****. Make it what you want it to be. I am your ally. I wish everybody in this auditorium peace and the joy of feeling free and your pursuit of happiness make you happy. Amen. Thank you very much San Francisco. No city on earth has ever been kind to me. Thank you. Good night.</li> </ul></li></p> <p><li><a href="https://www.youtube.com/watch?v=u1cl8U0UCMQ">Video 3</a> [lots of empty seats in the crowd at this point] <ul> <li><b>Dave</b>: [unintelligible] as you can. It's funnier when you say it. Are you ready? Say this [unintelligible] you say. Go ahead.</li> <li><b>Crowd</b>: [weak laughter]</li> <li><b>Maybe Chris Rock?</b>: I'm rich bitch</li> <li><b>Crowd</b>: [loud cheers, loud horn from stage can be heard as well]</li> <li><b>Unknown</b>: Wait wait wait wait [hands mic to Elon]</li> <li><b>Crowd</b>: [laughter]</li> <li><b>Elon</b>: [poses]</li> <li><b>Crowd</b>: [laughter, booing starts to be heard over laughter]</li> <li><b>Elon</b>: I'm rich bitch</li> <li><b>Crowd</b>: [some sound, hard to hear over horns from stage followed by music from the DJ drowning out the crowd; sounds like some booing and some cheering] </ul></li></p> <p><li><a href="https://www.youtube.com/watch?v=WSjQ5ojUvnM">Video 4</a> <ul> <li><b>Dave</b>: <a href="https://jezebel.com/talib-kwelis-harassment-campaign-shows-how-unprotected-1844483551">Talib Kweli my good friend [crowd cheers] is currently banned from Twitter.</a> <li><b>Crowd</b>: [laughter]</li> <li><b>Dave</b>: He goes home to [unintelligible], Kweli. [hands mic to Elon] <li><b>Elon</b>: Ahh waa. Twitter cu-customer service right here. <li><b>Crowd</b>: [weak laughter]</li> <li><b>Elon</b>: We'll get right on that. <li><b>Crowd</b>: [weak booing, gets stronger over time through next statement, until cut off by Dave]</li> <li><b>Elon</b>: Dave, you should be on Twitter. <li><b>Dave</b>: If you. Let me tell you something. Wait. Radio, where's your phone? <li><b>Dave</b>: Listen. Years ago, this is true, I'll tell you two quick Twitter stories then we'll go home. <li><b>Crowd</b>: [weak laughter]</li> <li><b>Dave</b>: Years ago, I went to love on the Twitter. I put my name in, and it said that you can't use famous people's names. <li><b>Crowd</b>: [weak laughter]</li> <li><b>Dave</b>: And that my name was already in use, it's true. <li><b>Dave</b>: So I look online to see who's using my name and it turns out it was a fake Dave Chappelle. And I was like, what the fuck? And I started to shut him down, but I read the n***** tweets.
And this is shocking. This motherfucker, Elon, was hilarious. <li><b>Crowd</b>: [weak laughter, someone yells out &quot;damn right&quot;]</li> <li><b>Dave</b>: So I figured, you know what, I'm gonna let him drop. And everybody will think I'm saying all this funny shit, and I don't even have to say this stuff. And it was great. Every morning I wake up and get some coffee and laugh at fake Dave Chappelle's tweets. <li><b>Dave</b>: But then <li><b>Crowd</b>: [loud sounds, can hear someone say &quot;whoa&quot;] <li><b>Dave</b>: [blocks stage light with hand so he can see into the crowd, looks into crowd] Fight. Will you cut that shit out, you anti-[unintelligible; lots of people are reporting this as fascist, which is plausible, making the statement about &quot;anti-fascists&quot;] n*****? <li><b>Crowd</b>: [loud sounds, can hear some jeers and boos]</p> Chat log exhibits from Twitter v. Musk case elon-twitter-texts/ Sat, 01 Oct 2022 00:00:00 +0000 elon-twitter-texts/ <p>This is a scan/OCR of Exhibits H and J from the Twitter v. Musk case, with some of the conversations de-interleaved and of course converted from a fuzzy scan to text to make for easier reading.</p> <p>I did this so that I could easily read this and, after reading it, I've found that most accountings of what was said are, in one way or another, fairly misleading. Since the texts aren't all that long, if you're interested in what they said, <a href="dunning-kruger/">I would recommend that you just read the texts in their entirety</a> (to the extent they're available — the texts make it clear that some parts of conversations are simply not included) instead of reading what various journalists excerpted, which seems to sometimes be deliberately misleading because selective quoting allows them to write a story that matches their agenda and sometimes accidentally misleading because they don't know what's interesting about the texts.</p> <p>If you want to compare these conversations to other executive / leadership conversations, you can compare them to <a href="us-v-ms/">Microsoft emails and memos that came out of the DoJ case against Microsoft</a> and the <a href="https://www.cs.cmu.edu/~enron/">Enron email dataset</a>.</p> <p>Since this was done using OCR, it's likely there are OCR errors. Please feel free to <a href="https://twitter.com/danluu/">contact me</a> if you see an error.</p> <h3 id="exhibit-h">Exhibit H</h3> <ul> <li><a id="1"></a><a href="#1">2022-01-21 to 2022-01-24</a> <ul> <li><b>Alex Shillings [IT specialist for SpaceX / Elon]</b>: Elon- are you able to access your Twitter account ok? I saw a number of emails held in spam. Including some password resets attempts</li> <li><b>Elon</b>: I haven't tried recently</li> <li><b>Elon</b>: Am staying off twitter</li> <li><b>Elon</b>: Is my twitter account posting anything?</li> <li><b>Alex</b>: Not posting but I see one deactivation email and a dozen password reset emails. Assuming this is a scammer attempt but wanted to check to ensure you still had access to your Twitter</li> <li><b>Elon</b>: It Is someone trying to hack my twitter</li> <li><b>Elon</b>: But I have two-factor enabled with the confirmation app</li> <li><b>Alex</b>: OK, great to hear.</li> <li><b>Alex</b>: Yes -FaceTimed with them to confirm my identity(hah) and they are hopefully gonna reset your 2FA to SMS soon.
Asking for an update now</li> <li><b>Elon</b>: Sounds good</li> <li><b>Elon</b>: I can also FaceTime with them if still a problem</li> <li><b>Alex</b>: Tldr; your account is considered high profile internally over there. So they've made it very hard to make changes like this by their teams. They are working through it...</li> <li><b>Elon</b>: Happy to FaceTime directly</li> <li><b>Elon</b>: Not sure how I was able to make Twitter work on this new phone, as I didn't use the backup code.</li> <li><b>Alex</b>: Connecting with their head of Trust &amp; Safety now</li> <li><b>Alex</b>: I assume we used your old phone to verify the new, once upon a time</li> <li><b>Elon</b>: Oh yeah</li> <li><b>Alex</b>: They can fix it by disabling all 2FA for your account which will let you in and then you can re-enable it. Are you available in 90 mins to have them coordinate it?</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Alex</b>: I know things are in flux right now, but is EMDesk SpaceX still your primary calendar? I realize there a meeting on there in 1 hour. In case I should move this twitter fix out a bit.</li> <li><b>Elon</b>: Yeah</li> <li><b>Elon</b>: But I can step off the call briefly ta Face Time them if need be</li> <li><b>Alex</b>: Sounds good. And ideally I'm just texting you ta sign in once they disable 2FA and then you can immediately sign in and re-enable. No FaceTime needed.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Alex</b>: Elon-we are ready to make the change if you are</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Alex</b>: 2FA disabled. Please try to log in now</li> <li><b>Alex</b>: Able to get back in ok?</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Alex</b>: And once in you can enable 2FA Settings&gt; Security and account access&gt; Security&gt; 2FA Alex Stillings</li> <li><b>Alex</b>: App only is suggested</li> <li><b>Elon</b>: Thanks!</li> <li><b>Alex</b>: And reminder to save that backup code 👍</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="2"></a><a href="#2">2022-03-05</a> <ul> <li><b>Antonio Gracias [VC]</b>: Wow...I saw your tweet re free speech. Wtf is going on Elon...</li> <li><b>Elon</b>: EU passed a law banning Russia Today and several other Russian news sources. We have been told to block their IP address.</li> <li><b>Elon</b>: Actually, I find their news quite entertaining</li> <li><b>Elon</b>: Lot of bullshit, but some good points too</li> <li><b>Antonio</b>: This is fucking nuts...you are totally right. I 100% agree with you.</li> <li><b>Elon</b>: We should allow it precisely bc we hate it...that is the ping of the American constitution.</li> <li><b>Antonio</b>: Exactly</li> <li><b>Elon</b>: Free speech matters mast when it's someone you hate spouting what you think is bullshit.</li> <li><b>Antonio</b>: I am 100% with you Elon. To the fucking mattresses no matter what .....this is a principle we need to fucking defend with our lives or we are lost to the darkness.</li> <li><b>Antonio</b>: Sorry for the swearing. I am getting excited.</li> <li><b>Elon</b>: [&quot;loved&quot; &quot;I am 100%...&quot;]</li> <li><b>Elon</b>: [2022-04-26] On a call. Free in 3O mins.</li> <li><b>Antonio</b>: Ok. I'll call you in 30</li> </ul></li><br> <li><a id="3"></a><a href="#3">2022-03-24</a> <ul> <li><b>TJ</b>: can you buy Twitter and then delete it, please!? xx</li> <li><b>TJ</b>: America is going INSANE.</li> <li><b>TJ</b>: The Babylon Bee got suspension is crazy. 
Raiyah and I were talking about it today. It was a fucking joke. Why has everyone become so puritanical?</li> <li><b>TJ</b>: Or can you buy Twitter and make it radically free-speech?</li> <li><b>TJ</b>: So much stupidity comes from Twitter xx</li> <li><b>Elon</b>: Maybe buy it and change it to properly support free speech xx</li> <li><b>Elon</b>: [&quot;liked&quot; &quot;Or can you buy Twitter...&quot;]</li> <li><b>TJ</b>: I honestly think social media is the scourge of modern life, and the worst of all is Twitter, because it's also a news stream as well as a social platform, and so has more real-world standing than Tik Tok etc. But it's very easy to exploit and is being used by radicals for social engineering on a massive scale. And this shit is infecting the world. Please do do something to fight woke-ism. I will do anything to help! xx</li> </ul></li><br> <li><a id="4"></a><a href="#4">2022-03-24 to 2022-04-06</a> [interleaved with above convo] <ul> <li><b>Joe Lonsdale [VC]</b>: I love your &quot;Twitter algorithm should be open source&quot; tweet -I'm actually speaking to over 100 members of congress tomorrow at the GOP policy retreat and this is one of the ideas I'm pushing for reigning in crazy big tech. Now I can cite you so I'll sound less crazy myself :). Our public squares need to not have arbitrary sketchy censorship.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: Absolutely</li> <li><b>Elon</b>: What we have right now is hidden corruption!</li> <li><b>Joe</b>: [&quot;loved&quot; above]</li> <li><b>[2022-04-04]</b>: Joe: Excited to see the stake in Twitter -awesome. &quot;Back door man&quot; they are saying haha. Hope you're able to influence it. I bet you the board doesn't even get full reporting or see any report of the censorship decisions and little cabals going on there but they should -the lefties on the board likely want plausible deniability !</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>[2022-04-16] Joe</b>: Haha even Governor DeSantis just called me just now with ideas how to help you and outraged at that board and saying the public is rooting for you. Let me know if you or somebody on your side wants to chat w him. Would be fun to see you if you guys are around this weekend or the next few days.</li> <li><b>Elon</b>: Haha cool</li> </ul></li><br> <li><a id="5"></a><a href="#5">2022-03-26</a> <ul> <li><b>&quot;jack jack&quot; [presumably Jack Dorsey, former CEO of Twitter and CEO of Square]</b>: Yes, a new platform is needed. It can't be a company. This is why I left.</li> <li><b>jack</b>: <a href="https://twitter.com/elonmusk/status/1507777913042571267?s,=20&amp;t=8z3h0h0JGSnt86Zuxd61Wg">https://twitter.com/elonmusk/status/1507777913042571267?s,=20&amp;t=8z3h0h0JGSnt86Zuxd61Wg</a></li> <li><b>Elon</b>: Ok</li> <li><b>Elon</b>: What should it look like?</li> <li><b>jack</b>: I believe it must be an open source protocol, funded by a foundation of sorts that doesn't own the protocol, only advances it. A bit like what Signal has done. It can't have an advertising model. Otherwise you have surface area that governments and advertisers will try to influence and control. If it has a centralized entity behind it, it will be attacked. This isn't complicated work, it just has to be done right so it's resilient to what has happened to twitter.</li> <li><b>Elon</b>: Super interesting idea</li> <li><b>jack</b>: I'm off the twitter board mid May and then completely out of company. I intend to do this work and fix our mistakes. 
Twitter started as a protocol. It should have never been a company. That was the original sin.</li> <li><b>Elon</b>: I'd like to help if I am able to</li> <li><b>jack</b>: I wanted to talk with you about it after I was all clear, because you care so much, get it's importance, and could def help in immeasurable ways. Back when we had the activist come in, I tried my hardest to get you on our board, and our board said no. That's about the time I decided I needed to work to leave, as hard as it was for me.</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> <li><b>jack</b>: Do you have a moment to talk?</li> <li><b>Elon</b>: Bout to head out to dinner but can for a minute</li> <li><b>jack</b>: I think the main reason is the board is just super risk averse and saw adding you as more risk, which I thought was completely stupid and backwards, but I only had one vote, and 3% of company, and no dual class shares. Hard set up. We can discuss more.</li> <li><b>Elon</b>: Let's definitely discuss more</li> <li><b>Elon</b>: I think it's worth both trying to move Twitter in a better direction and doing something new that's decentralized</li> <li><b>jack</b>: It's likely the best option. I just have doubts. But open</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="6"></a><a href="#6">2022-03-26 to 2022-03-27</a> <ul> <li><b>Elon to Egon Durban [private equity; Twitter board member]</b>: This is Elon. Please call when you have a moment.</li> <li><b>Elon</b>: It is regarding the Twitter board.</li> <li><b>Egon</b>: Have follow-up. Let's chat today whenever convenient for you.</li> </ul></li><br> <li><a id="7"></a><a href="#7">2022-03-27 to 2022-04-26</a> [interleaved with above] <ul> <li><b>Larry Ellison [Oracle founder and exec]</b>: Elon, I'd like to chat with you in the next day or so ... I do think we need another Twitter 👍</li> <li><b>Elon</b>: Want to talk now?</li> <li><b>Larry</b>: Sure.</li> <li><b>[2022-04-17] Elon</b>: Any interest in participating in the Twitter deal?</li> <li><b>Larry</b>: Yes ... of course 👍</li> <li><b>Elon</b>: Cool</li> <li><b>Elon</b>: Roughly what dollar size? Not holding you to anything, but the deal is oversubscribed, so I have to reduce or kick out some participants.</li> <li><b>Larry</b>: A billion ... or whatever you recommend</li> <li><b>Elon</b>: Whatever works for you. I'd recommend maybe $2B or more. This has very high potential and I'd rather have you than anyone else.</li> <li><b>Larry</b>: I agree that it has huge potential... and it would be lots of fun</li> <li><b>Elon</b>: Absolutely:)</li> <li><b>[2022-04-26] Larry</b>: Since you think I should come in for at least $2B... I'm in for $2B 👍</li> <li><b>Elon</b>: Haha thankss:)</li> </ul></li><br> <li><a id="8"></a><a href="#8">2022-03-27 to 2022-03-31</a> [group chat with Egon Durban, &quot;Martha Twitter NomGov&quot;, Brett Taylor [CEO of Salesforce and Chairman of Twitter board], &quot;Parag&quot; [presumably Parag Agrawal, CEO of Twitter], and Elon Musk] <ul> <li><b>Egon</b>: Hi everyone Parag (Ceo), Bret (Chairman) and Martha (head of gov) -You are connected w Elon. He is briefed on my conversations w you. Elon -everyone excited about prospect of you being involved and on board. Next step is for you to chat w three of them so we can move this forward quickly. Maybe we can get this done next few days🤞</li> <li><b>Elon</b>: Thanks Egon</li> <li><b>Parag</b>: Hey Elon - great to be connected directly. Would love to chat!
Parag</li> <li><b>Martha</b>: Hey Elon, I'm Martha chair of Twitter nomgov- know you've talked to Bret and parag - keen to have a chat when you have time - im in Europe (also hope covid not too horrible as I hear you have it)</li> <li><b>Parag</b>: Look forward to meeting soon! Can you let us know when you are able to meet in the Bay Area in the next couple of days?</li> <li><b>Martha</b>: Hey Elon, I'm Martha chair of Twitter nomgov - know you've talked to Bret and parag -I'm v keen to have a chat when you have time - im in Europe but will make anything work</li> <li><b>Elon</b>: Sounds good. Perhaps a call late tonight central time works? I'm usually up until ~3am.</li> <li><b>Martha</b>: If ok with you I'll try you 10am CET (lam PST) looking forward to meeting you</li> <li><b>Elon</b>: Sure</li> <li><b>Martha</b>: Thanks v much for the time Elon -pis let us know who in your office our GC can talk to -sleep well!</li> <li><b>Elon</b>: You're most welcome. Great to talk!</li> </ul></li><br> <li><a id="9"></a><a href="#9">2022-03-27</a> [interleaved with above] <ul> <li><b>Brett Taylor</b>: This is Bret Taylor. Let me know when you have a minute to speak today.Just got off with Parag and I know he is eager to speak with you today as well. Flexible all day</li> <li><b>Elon</b>: Later tonight would work - maybe 7pm? I have a minor case of Covid, so am a little under the weather.</li> <li><b>Brett</b>: Sorry to hear -it can knock you out. 7pm sounds great</li> <li><b>Elon</b>: [&quot;liked above]</li> </ul></li><br> <li><a id="10"></a><a href="#10">2022-03-27</a> <ul> <li><b>Parag</b>: Would love to talk. Please let me know what time works - I'm super flexible. -Parag</li> <li><b>Elon</b>: Perhaps tonight around 8?</li> <li><b>Parag</b>: That works! Look forward to talking.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: Just finishing a Tesla Autopilot engineering call</li> <li><b>Parag</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="11"></a><a href="#11">2022-03-27 to 2022-04-24</a> [interleaved with above] <ul> <li><b>&quot;Dr Jabour&quot;</b>: Hi E,Pain settling down? Time for a latter-day Guttenburg to bring back free speech ..... and buy Twitter.</li> <li><b>[2022-04-04] Elon</b>: [&quot;liked&quot; above]</li> <li><b>[2022-04-24] Jabour</b>: Hi E,looks like a TWITTER board member scurrying around unbalanced trying to deal with your offer.... Am loving your tactics, ( vid taken fro my house on Monica beach)-Brad</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> </ul></li><br> <li><a id="12"></a><a href="#12">2022-03-29 to 2022-04-01</a> [interleaved with above] <ul> <li><b>Will MacAskill [co-creator of effective altruism movement, Oxford professor, Chair of board for Global Priorities Institute at Oxford]</b>: Hey - I saw your poll on twitter about Twitter and free speech. I'm not sure if this is what's on your mind, but my collaborator Sam Bankman-Fried (<a href="https://www.forbes.com/profile/sam-bankman-fried/?sh=4de9866a4449">https://www.forbes.com/profile/sam-bankman-fried/?sh=4de9866a4449</a>) has for a while been potentially interested in purchasing it and then making it better for the world. If you want to talk with him about a possible joint effort in that direction, his number is [redacted] and he's on Signal.</li> <li><b>Elon</b>: Does he have huge amounts of money?</li> <li><b>Will</b>: Depends on how you define &quot;huge&quot;! He's worth $24B, and his early employees (with shared values) bump that to $30B. 
I asked about how much he could in principle contribute and he said: &quot;~$1-3b would be easy-$3-8b I could do ~$8-15b is maybe possible but would require financing&quot;</li> <li><b>Will</b>: If you were interested to discuss the idea I asked and he said he'd be down to meet you in Austin</li> <li><b>Will</b>: He's based in the Bahamas normally. And I might visit Austin next week, if you'd be around?</li> <li><b>Will</b>: That's a start</li> <li><b>Will</b>: Would you like me to intro you two via text?</li> <li><b>Elon</b>: You vouch for him?</li> <li><b>Will</b>: Very much so! Very dedicated to making the long-term future of humanity go well</li> <li><b>Elon</b>: Ok then sure</li> <li><b>Will</b>: Great! Will use Signal</li> <li><b>Will</b>: (Signal doesn't work; used imessage instead)</li> <li><b>Elon</b>: Ok</li> <li><b>Will</b>: And in case you want to get a feel for Sam, here's the Apr 1st tweet from his foundation, the Future Fund, which I'm advising on -I thought you might like it:</li> <li><b>Will</b>: <a href="https://twitter.com/ftxfuturefund/status/1509924452422717440?s=20&amp;t=0qjM58KUj49xSGa0qae97Q">https://twitter.com/ftxfuturefund/status/1509924452422717440?s=20&amp;t=0qjM58KUj49xSGa0qae97Q</a></li> <li><b>Will</b>: And here's the actual (more informative) launch tweet· moving $100M-$1B this year to improve the future of humanity:</li> <li><b>Will</b>: <a href="https://twitter.com/ftxfuturefund/status/1498350483206860801">https://twitter.com/ftxfuturefund/status/1498350483206860801</a></li> </ul></li><br> <li><a id="13"></a><a href="#13">2022-03-29 to 2022-04-14</a> <ul> <li><b>Mathias Döpfner [CEO and 22% owner of Axel Springer, president of Federal Association of Digital Publishers and Newspaper Publishers]</b>: Why don't you buy Twitter? We run it for you. And establish a true platform of free speech. Would be a real contribution to democracy.</li> <li><b>Elon</b>: Interesting idea</li> <li><b>Mathias</b>: I'm serious. It's doable. Will be fun.</li> <li><b>[2022-04-04] Mathias</b>: Congrats to the Twitter invest! Fast execution 🤩 Shall we discuss wether we should join that project? I was serious with my suggestion.</li> <li><b>Elon</b>: Sure, happy to talk</li> <li><b>Mathias</b>: I am going to miami tomorrow for a week. Shall we speak then or Wednesday and take it from there?</li> <li><b>Elon</b>: Sure</li> <li><b>[2022-04-06] Mathias</b>: A short call about Twitter?</li> <li><b>Mathias</b>: # Status Quo: It is the de facto public town square, but it is a problem that it does not adhere to free speech principles. =&gt; so the core product is pretty good, but (i) it does not serve democracy, and (ii) the current business model is a dead end as reflected by flat share price. # Goal: Make Twitter the global backbone of free speech, an open market place of ideas that truly complies with the spirit of the first amendment and shift the business model to a combination of ad-supported and paid to support quality # Game Plan: 1.),,Solve Free Speech&quot; 1a) Step 1: Make it censorship-FREE by radically reducing Terms of Services (now hundreds of pages) to the following: Twitter users agree to: (1) Use our service to send spam or scam users, (2) Promote violence, (3) Post illegal pornography. 🙃 1b) Step 2: Make Twitter censorship-RESISTANT • Ensure censorship resistance by implementing measures that warrant that Twitter can't be censored long term, regardless of which government and management • How? 
Keep pushing projects at Twitter that have been working on developing a decentralized social network protocol (e.g., BlueSky). It's not easy, but the backend must run on decentralized infrastructure, APls should become open (back to the roots! Twitter started and became big with open APIs). • Twitter would be one of many clients to post and consume content. • Then create a marketplace for algorithms, e.g., if you're a snowflake and don't want content that offends you pick another algorithm. 2.) ,,Solve Share Price&quot; Current state of the business: • Twitters ad revenues grow steadily and for the time being, are sufficient to fund operations. • MAUs are flat, no structural growth • Share price is flat, no confidence in the existing business model and/or</li> <li><b>[2022-04-14] Mathias</b>: Our editor of Die Welt just gave an interview why he left Twitter. What he is criticising is exactly what you most likely want to change. I am thrilled to discuss twitters future when you are ready. So exciting.</li> <li><b>Elon</b>: Interesting!</li> </ul></li><br> <li><a id="14"></a><a href="#14">2022-03-31 to 2022-04-01</a> [group chat with Bret Taylor, Parag, and Elon Musk, interleaved with some of the above] <ul> <li><b>Elon</b>: I land in San Jose tomorrow around 2pm and depart around midnight. My Tesla meetings are flexible, so I can meet anytime in those 10 hours.</li> <li><b>Bret</b>: By &quot;tomorrow&quot; do you mean Thursday or Friday?</li> <li><b>Elon</b>: Today</li> <li><b>Parag</b>: I can make any time in those 10 hours work.</li> <li><b>Bret</b>: I land in Oakland at 8:30pm. Perhaps we can meet at 9:30pm somewhere? I am working to see if I can move up my flight from NYC to land earlier in the meantime</li> <li><b>Bret</b>: Working on landing earlier and landing in San Jose so we can have dinner near you. Will keep you both posted in real time</li> <li><b>Bret</b>: Ok, successfully moved my flight to land at 6:30pm in San Jose. Working on a place we can meet privately</li> <li><b>Elon</b>: Sounds good</li> <li><b>Elon</b>: Crypto spam on Twitter really needs to get crushed. It's a major blight on the user experience and they scam so many innocent people.</li> <li><b>Bret</b>: It sounds like we are confirming 7pm at a private residence near San Jose. Our assistants reached out to Jehn on logistics. Let me know if either of you have any concerns or want to move things around. Looking forward to our conversation.</li> <li><b>Parag</b>: Works for me. Excited to see you both in person!</li> <li><b>Elon</b>: Jehn had a baby and I decided to try having no assistant for a few months</li> <li><b>Elon</b>: Likewise</li> <li><b>Bret</b>: The address is [redacted]</li> <li><b>Bret</b>: Does 7pm work for you Elon?</li> <li><b>Elon</b>: Probably close to that time. Might only be able to get there by 7:30, but will try for earlier.</li> <li><b>Bret</b>: Sounds good. I am going to be a bit early because my plane is landing earlier but free all evening so we can start whenever you get there and Parag and I can catch up in the meantime</li> <li><b>Bret</b>: This wins for the weirdest place I've had a meeting recently. I think they were looking for an airbnb near the airport and there are tractors and donkeys 🤷</li> <li><b>Elon</b>: Haha awesome</li> <li><b>Elon</b>: Maybe Airbnb's algorithm thinks you love tractors and donkeys (who doesn't!)</li> <li><b>Elon</b>: On my way. 
There in about 15 mins.</li> <li><b>Bret</b>: And abandoned trucks in case we want to start a catering business after we meet</li> <li><b>Elon</b>: Sounds like a post-apocalyptic movie set</li> <li><b>Bret</b>: Basically yes</li> <li><b>Elon</b>: Great dinner:)</li> <li><b>Bret</b>: Really great. The donkeys and dystopian surveillance helicopters added to the ambiance</li> <li><b>Elon</b>: Definitely one for the memory books haha</li> <li><b>Parag</b>: Memorable for multiple reasons. Really enjoyed it</li> </ul></li><br> <li><a id="15"></a><a href="#15">2022-03-31 to 2022-04-02</a> [group message with Will MacAskill, &quot;Sam BF&quot;, and Elon Musk, interleaved with above] <ul> <li><b>Will</b>: Hey, here's introducing you both, Sam and Elon. You both have interests in games, making the very long-run future go well, and buying Twitter. So I think you'd have a good conversation!</li> <li><b>Sam</b>: Great to meet you Elon-happy to chat about Twitter (or other things) whenever!</li> <li><b>Elon</b>: Hi!</li> <li><b>Elon</b>: Maybe we can talk later today? I'm in Germany.</li> <li><b>Sam</b>: I'm on EST-could talk sometime between 7pm and 10pm Germany time today?</li> </ul></li><br> <li><a id="16"></a><a href="#16">2022-04-03 to 2022-04-04</a> [group chat with Jared Birchall, &quot;Martha Twitter NomGov&quot;, and Elon Musk] <ul> <li><b>Elon</b>: Connecting Martha (Twitter Norn/Gov) with Jared (runs my family office).</li> <li><b>Elon</b>: Jared, there is important paperwork to be done to allow for me to hopefully join the Twitter board.</li> <li><b>Martha</b>: Thanks Elon - appreciate this - hi Jared - I'm going to put Sean Edgett in touch with you who is GC at Twitter</li> <li><b>Jared</b>: Sounds good. Please have him call anytime or send the docs to my email: [...]</li> <li><b>Martha</b>: 👍</li> <li><b>Martha</b>: Elon - are you available to chat for 5 mins?</li> <li><b>Martha</b>: I'd like to relay the board we just finished</li> <li><b>Elon</b>: Sure</li> <li><b>[2022-04-04] Martha</b>: Morning elon -you woke up to quite a storm.... Great to hear from Bret that you agree we can move this along v quickly today -Jared, I'm assuming it's you I should send the standstill they discussed to you? It will be the same as egon and silverlake undertook. Let me know if should go to someone else - we're really keen to get this done in next couple of hours. Thank you</li> <li><b>Elon</b>: You can send to both of us</li> <li><b>Elon</b>: Sorry, I just woke up when Bret called! I arrived from Berlin around 4am.</li> <li><b>Martha</b>: No apologies necessary. Let's How would you like it sent? If by email, pis let me know where</li> <li><b>Elon</b>: Text or email</li> <li><b>Elon</b>: My email is [redacted]</li> <li><b>Martha</b>: 👍</li> <li><b>Martha</b>: &lt;Attachment-application/vnd.openxmlformatsofficedocument.wordprocessingml.document-Twitter Cooperation Agreement-Draft April 4 2022.docx&gt;</li> <li><b>Martha</b>: Here it is -also gone by email Same as Egon's but even more pared back</li> <li><b>Martha</b>: Just copying you both to confirm sent agreement-v keen to get this done quickly as per your conversation</li> </ul></li><br> <li><a id="17"></a><a href="#17">2022-04-03</a> <ul> <li><b>Bret</b>: Just spoke to Martha. Let me know when you have time to talk today or tomorrow. Sounds like you are about to get on a flight — flexible</li> <li><b>Elon</b>: Sounds good. I'm just about to take off from Berlin to Austin, but free to talk anytime tomorrow.</li> <li><b>Bret</b>: I am free all day. 
Text when you are available. Planning to take a hike with my wife and that is the only part where my reception may be spotty. Looking forward to speaking. And looking forward to working with you!</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="18"></a><a href="#18">2022-04-03</a> [interleaved with above] <ul> <li><b>Parag</b>: I expect you heard from Martha and Bret already. I'm super excited about the opportunity and look forward to working closely and finding ways to use your time as effectively as possible to improve Twitter and the public conversation.</li> <li><b>Elon</b>: Sounds great!</li> </ul></li><br> <li><a id="19"></a><a href="#19">2022-04-03</a> <ul> <li><b>jack</b>: I heard good things are happening</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="20"></a><a href="#20">2022-04-04</a> <ul> <li><b>Ken Griffin [CEO of Citadel]</b>: Love it !!</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="21"></a><a href="#21">2022-04-04</a> <ul> <li><b>Bret Taylor</b>: Hey are you available?</li> <li><b>Bret Taylor</b>: Given the SEC filing, would like to speak asap to coordinate on communications. Call asap when you are back</li> </ul></li><br> <li><a id="22"></a><a href="#22">2022-04-04</a> [interleaved with above] <ul> <li><b>[redacted]</b>: Congratulations!! The above article ☝️ [seemingly referring to <a href="https://www.revolver.news/2022/04/elon-musk-buy-twitter-free-speech-tech-censorship-american-regime-war/">https://www.revolver.news/2022/04/elon-musk-buy-twitter-free-speech-tech-censorship-american-regime-war/</a>] was laying out some of the things that might happen: Step 1: Blame the platform for its users Step 2. Coordinated pressure campaign Step 3: Exodus of the Bluechecks Step 4: Deplatforming &quot;But it will not be easy. It will be a war. Let the battle begin.&quot;</li> <li><b>[redacted]</b>: It will be a delicate game of letting right wingers back on Twitter and how to navigate that (especially the boss himself, if you're up for that) I would also lay out the standards early but have someone who has a savvy cultural/political view to be the VP of actual enforcement</li> <li><b>[redacted]</b>: A Blake Masters type</li> </ul></li><br> <li><a id="23"></a><a href="#23">2022-04-04 to 2022-04-17</a> [interleaved with above] <ul> <li><b>Egon Durban</b>: Hi -if you have a few moments call anytime? Flying to UK</li> <li><b>Elon</b>: Just spoke to Bret. His call woke me up haha. Got it from Berlin at 4am.</li> <li><b>Egon</b>: 🙏</li> <li><b>[2022-04-17] Elon</b>: You're calling Morgan Stanley to speak poorly of me ...</li> </ul></li><br> <li><a id="24"></a><a href="#24">2022-04-04</a> <ul> <li><b>Elon to Jared Birchall</b>: Please talk to Martha about the filing</li> <li><b>Jared</b>: ok</li> </ul></li><br> <li><a id="25"></a><a href="#25">2022-04-04</a> <ul> <li><b>Bret Taylor</b>: Do you have five minutes?</li> <li><b>Elon</b>: Sure</li> </ul></li><br> <li><a id="26"></a><a href="#26">2022-04-04</a> <ul> <li><b>Elon to Parag</b>: Happy to talk if you'd like</li> <li><b>Parag</b>: That will be very helpful. Please call me when you have a moment</li> <li><b>Elon</b>: Just on the phone with Jared. 
Will call as soon as that's done.</li> <li><b>Parag</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="27"></a><a href="#27">2022-04-04</a> <ul> <li><b>&quot;Kyle&quot;</b>: So can you bust us out of Twitter Jail now lol</li> <li><b>Elon</b>: I do not have that ability</li> <li><b>Kyle</b>: Lol I know I know. Big move though, love to see it</li> </ul></li><br> <li><a id="28"></a><a href="#28">2022-04-04</a> [group chat with Egon Durban, &quot;Martha Twitter NomGov&quot;, Brett Taylor, Parag Agrawal, and Elon Musk, interleaved with above] <ul> <li><b>Elon</b>: Thank you for considering me for the Twitter board, but, after thinking it over, my current time commitments would prevent me from being an effective board member. This may change in the future. Elon</li> </ul></li><br> <li><a id="29"></a><a href="#29">2022-04-04</a> [interleaved with some of above]: <ul> <li><b>Joe Rogan</b>: Are you going to liberate Twitter from the censorship happy mob?</li> <li><b>Elon</b>: I will provide advice, which they may or may not choose to follow</li> </ul></li><br> <li><a id="30"></a><a href="#30">2022-04-04</a> <ul> <li><b>Bret Taylor</b>: <a href="https://twitter.com/trungtohan/status/1510994320471429131?s=10&amp;t=qrv_fOhTfUzRVDe_IbJKlQ">https://twitter.com/trungtohan/status/1510994320471429131?s=10&amp;t=qrv_fOhTfUzRVDe_IbJKlQ</a></li> <li><b>Elon</b>: [&quot;laughed at&quot; above]</li> </ul></li><br> <li><a id="31"></a><a href="#31">2022-04-04 to 2022-04-05</a> [interleaved with above] <ul> <li><b>Parag</b>: You should have an updated agreement in your email. I'm available to chat.</li> <li><b>Elon</b>: Approved</li> <li><b>Parag</b>: [&quot;loved&quot; above]</li> <li><b>Parag</b>: Have a few mins to chat? I'm eager to move fast</li> <li><b>Elon</b>: Sure, I'm just on a SpaceX engine review call.</li> <li><b>Parag</b>: Please call me after</li> <li><b>Parag</b>: I'm excited to share that we're appointing @elonmusk to our board! Through conversations with Elon in recent weeks, it became clear to me that he would bring great value to our Board. Why? Above all else, he's both a passionate believer and intense critic of the service which is exactly what we need on Twitter, and in the Boardroom, to make us stronger in the long-term. Welcome Elon!</li> <li><b>Elon</b>: Sounds good</li> <li><b>Elon</b>: Sending out shortly?</li> <li><b>Parag</b>: <a href="https://twitter.com/paraga/status/1511320953598357505?s=21&amp;t=g9oXkMyPGFahuVNDKcoBa5A">https://twitter.com/paraga/status/1511320953598357505?s=21&amp;t=g9oXkMyPGFahuVNDKcoBa5A</a></li> <li><b>Elon</b>: Cool</li> <li><b>Parag</b>: Super excited!</li> <li><b>Elon</b>: Likewise!</li> <li><b>Elon</b>: Just had a great conversation with Jack! Are you free to talk later tonight?</li> <li><b>Parag</b>: Yeah, what time?</li> <li><b>Elon</b>: Would be great to unwind permanent bans, except for spam accounts and those that explicitly advocate violence.</li> <li><b>Elon</b>: 7pm CA time? Or anytime after that.</li> <li><b>Parag</b>: 7p works! 
Talk soon</li> <li><b>Elon</b>: Calling back in a few mins</li> <li><b>Parag</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: Pretty good summary</li> <li><b>Elon</b>: <a href="https://twitter.com/stevenmarkryan/status/1511489781104275456?s=1O&amp;t=LprG6-7KefKLzNX133IpjQ">https://twitter.com/stevenmarkryan/status/1511489781104275456?s=1O&amp;t=LprG6-7KefKLzNX133IpjQ</a></li> </ul></li><br> <li><a id="32"></a><a href="#32">2022-04-05</a> [group chat with Jared Birchall, &quot;Martha Twitter NomGov&quot;, and Elon Musk] <ul> <li><b>Martha</b>: I'm so thrilled you're joining the board. I apologise about the bump of the first agreement-I'm not a good manager of lawyers. I really look forward to meeting you.</li> <li><b>Elon</b>: Thanks Martha, same here.</li> </ul></li><br> <li><a id="33"></a><a href="#33">2022-04-05</a> [interleaved with above] <ul> <li><b>Bret</b>: I am excited to work with you and grateful this worked out</li> <li><b>Elon</b>: Likewise</li> </ul></li><br> <li><a id="34"></a><a href="#34">2022-04-05</a> [interleaved with above] <ul> <li><b>jack</b>: Thank you for joining!</li> <li><b>jack</b>: <a href="https://twitter.com/jack/status/1511329369473564677?s=21&amp;t=DdrUUFvJPD7Kf-jXjBogIg">https://twitter.com/jack/status/1511329369473564677?s=21&amp;t=DdrUUFvJPD7Kf-jXjBogIg</a></li> <li><b>Elon</b>: Absolutely. Hope I can be helpfull</li> <li><b>jack</b>: Immensely. Parag is an incredible engineer. The board is terrible. Always here to talk through anything you want.</li> <li><b>Elon</b>: When is a good time to talk confidentially?</li> <li><b>jack</b>: anytime</li> <li><b>Elon</b>: Thanks, great conversation!</li> <li><b>jack</b>: Always! I couldn't be happier you're doing this. I've wanted it for a long time. Got very emotional when I learned it was finally possible.</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> <li><b>Elon</b>: Please be super vocal if there is something dumb I'm doing or not doing. That would be greatly appreciated.</li> <li><b>jack</b>: I trust you but def will do</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>jack</b>: <a href="https://twitter.com/MattNavarra/status/1511773605239078914">https://twitter.com/MattNavarra/status/1511773605239078914</a></li> <li><b>jack</b>: Looks like there's a &quot;verified&quot; account in the swamp of despair over there. <a href="https://m.facebook.com/Elonmuskoffifref=nf&amp;pn_ref=story&amp;rc=p">https://m.facebook.com/Elonmuskoffifref=nf&amp;pn_ref=story&amp;rc=p</a> (promoting crypto too!)</li> <li><b>Elon</b>: Haha</li> <li><b>[2022-04-26] jack</b>: I want to make sure Parag is doing everything possible to build towards your goals until close. He is really great at getting things done when tasked with specific direction. Would it make sense for me you and him to get on a call to discuss next steps and get really clear on what's needed? He'd be able to move fast and clear then. Everyone is aligned and this will help even</li> <li><b>Elon</b>: Sure</li> <li><b>jack</b>: great when is best for you? And please let me know where/ifyou want my help. I just want to make this amazing and feel bound to it</li> <li><b>Elon</b>: How about 7pm Central?</li> <li><b>Elon</b>: Your help would be much appreciated</li> <li><b>Elon</b>: I agreed with everything you said to me</li> <li><b>jack</b>: Great! Will set up. I won't let this fail and will do whatever it takes. 
It's too critical to humanity.</li> <li><b>Elon</b>: Absolutely</li> <li><b>jack</b>: &lt;Attachment-image/jpeg-Screen Shot 2022-04-26 at 15.05.00.jpeg&gt;</li> <li><b>jack</b>: I put together a draft list to make the discussion efficient. Goal is to align around 1) problems we're trying to solve, 2) longterm priorities, 3) short-tenn actions, all using a higher level guide you spoke about. Think about what you'd add/remove. Getting this nailed will increase velocity.</li> <li><b>jack</b>: Here's meeting link for 7pm your time</li> <li><b>jack</b>: [meeting URL]</li> <li><b>Elon</b>: Great list of actions</li> <li><b>jack</b>: We're on hangout whenever you're ready. No rush. Just working on refining doc.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: It's asking me for a Google account login</li> <li><b>Elon</b>: You and I are in complete agreement. Parag is just moving far too slowly and trying to please people who will not be happy no matter what he does.</li> <li><b>jack</b>: At least it became clear that you can't work together. That was clarifying.</li> <li><b>Elon</b>: Yeah</li> </ul></li><br> <li><a id="35"></a><a href="#35">2022-04-06</a> <ul> <li><b>Ira Ehrenpreis [VC]</b>: If you plan on joining the Nom/Gov or Comp Committees, lmk and I can give you some tips! Haha! 🤪</li> <li><b>Elon</b>: Haha, I didn't even want to join the Twitter board! They pushed really hard to have me join.</li> <li><b>Ira</b>: You're a pushover! 😂</li> <li><b>Ira</b>: And you already got them to try the edit! Oh yeah... it had already been in the works. Sure.</li> <li><b>Elon</b>: It was actually in the works, but I didn't know.</li> </ul></li><br> <li><a id="36"></a><a href="#36">2022-04-06 to 2022-04-08</a> <ul> <li><b>Justin Roiland [co-creator of Rick and Morty]</b>: I fucking love that you are majority owner of Twitter. My friends David and Daniel have a program that verifies identity that would be nice to connect to Twitter. As in, if people chose to use it, it could verify that they are a real person and not a troll farm. I should introduce you to them.</li> <li><b>Elon</b>: I just own 9% of Twitter, so don't control the company.</li> <li><b>Elon</b>: Will raise the identity issue with Parag (CEO).</li> </ul></li><br> <li><a id="37"></a><a href="#37">2022-04-06 to 2022-04-14</a> <ul> <li><b>Gayle King [co-host for CBS Mornings and editor for The Oprah Magazine]</b>: Gayle here! Have you missed me (smile) Are you ready for to do a proper sit down with me so much to discuss! especially with your twitter play ... what do I need to do ???? Ps I like a twitter edit feature with a24 hour time limit ... we all say shit we regret want to take back in the heat of the moment ...</li> <li><b>Elon</b>: Twitter edit button is coming</li> <li><b>Gayle</b>: The whole Twitter thing getting blown out of proportion</li> <li><b>Elon</b>: Owning ~9% is not quite control</li> <li><b>Gayle</b>: I never thought that it did ... and I'm not good in math</li> <li><b>Elon</b>: Twitter should move more to the center, but Parag already thought that should be the case before I came along.</li> <li><b>Elon</b>: [&quot;laughed at&quot; &quot;I never thought...&quot;]</li> <li><b>[2022-04-14] Gayle</b>: ELON! You buying twitter or offering to buy twitter Wow! Now Don't you think we should sit down together face to face this is as the kids of today say a &quot;gangsta move&quot; I don't know know how shareholders turn this down .. 
like I said you are not like the other kids in the class ....</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> <li><b>[2022-04-18] Elon</b>: Maybe Oprah would be interested in joining the Twitter board if my bid succeeds. Wisdom about humanity and knowing what is right are more important than so-called &quot;board governance&quot; skills, which mean pretty much nothing in my experience.</li> </ul></li><br> <li><a id="38"></a><a href="#38">2022-04-07 to 2022-04-08</a> [interleaved with above] <ul> <li><b>Parag</b>: A host of ideas around this merit exploration - even lower friction ones than this.</li> <li><b>Elon</b>: I have a thought about this that could take out two birds with one stone</li> <li><b>Elon</b>: Btw, what's your email?</li> <li><b>Parag</b>: [...]</li> <li><b>Parag</b>: Would you be able to do a q&amp;a for employees next week virtually? My travel is causing too long of a delay and only about 10-15% of audience will be in person so we will be optimizing for virtual anyways. Would any of Wed/Thu 11a pacific next week work for you for a 45 min video q&amp;a?- else I can suggest other times. Trying to maximize attendance across global timezones.</li> <li><b>Parag</b>: Would love to hear more when we speak nexts-do you have any availability tomorrow?</li> <li><b>Elon</b>: Sure</li> <li><b>Elon</b>: It would be great to get an update from the Twitter engineering team so that my suggestions are less dumb.</li> <li><b>Parag</b>: Yep-will set up a product+ eng conversation ahead of q&amp;a -they said, I expect most questions to not get into specific ideas / depth - but more around what you believe about the future of Twitter and why it matters, why you can personally, how to want to engage with us, what you hope to see change... -but also some from people who are upset that you are involved and generally don't like you for some reason. As you said yesterday, goal is for people to just hear you speak directly instead of make assumptions about you from media stories. Would Thursday 11a pacific work next week for the q&amp;a?</li> <li><b>Elon</b>: 11am PT on Wed works great</li> <li><b>Elon</b>: Exactly. Thurs 11 PT works.</li> <li><b>Parag</b>: Ok cool. So will confirm a convo Wed 11a PT with small eng and product leads. And the AMA on Thu 11a PT.</li> <li><b>Parag</b>: Also: my email to company about AMA leaked already+ lots of leaks from internal slack messages: <a href="https://www.washingtonpost.com/technology/2022/04/07/musk-twitter-employee-outcry/">https://www.washingtonpost.com/technology/2022/04/07/musk-twitter-employee-outcry/</a> -I think there is a large silent majority that is excited about you bring on the board, so this isn't representative. Happy to talk about it-none of this is a surprise.</li> <li><b>Elon</b>: Seedy</li> <li><b>Elon</b>: *awesome (damn autocorrect!)</li> <li><b>Elon</b>: As expected. Yeah, would be good to sync up. I can talk tomorrow night or anytime this weekend. I love our conversations!</li> <li><b>Parag</b>: I'm totally flexible after 530p pacific tomorrow -let me know what works. And yes this is expected -and I think a good thing to move us in a positive direction. Despite the turmoil internally-I think this is very helpful in moving the company forward.</li> <li><b>Elon</b>: Awesome!</li> <li><b>Elon</b>: I have a ton of ideas, but lmk if I'm pushing too hard. I just want Twitter to be maximum amazing.</li> <li><b>Parag</b>: I want to hear all the ideas -and I'll tell you which ones I'll make progress on vs. not.
And why.</li> <li><b>Parag</b>: And in this phase -just good to spend as much time with you. + have my Product and Eng team talk to you to ingest information on both sides.</li> <li><b>Elon</b>: I would like to understand the technical details of the Twitter codebase. This will help me calibrate the dumbness of my suggestions.</li> <li><b>Elon</b>: I wrote heavy duty software for 20 years</li> <li><b>Parag</b>: I used to be CTO and have been in our codebase for a long time.</li> <li><b>Parag</b>: So I can answer many many of your questions.</li> <li><b>Elon</b>: I interface way better with engineers who are able to do hardcore programming than with program managero/ MBA types of people.</li> <li><b>Elon</b>: [&quot;liked&quot; &quot;I used to be CTO...&quot;]</li> <li><b>Elon</b>: 🔥🔥</li> <li><b>Parag</b>: in our next convo-treat me like an engineer instead of CEO and lets see where we get to. I'll know after that convo who might be the best engineer to connect you to.</li> <li><b>Elon</b>: Frankly, I hate doing mgmt stuff. I kinda don't think anyone should be the boss of anyone. But I love helping solve technical/product design problems.</li> <li><b>Elon</b>: You got it!</li> <li><b>Parag</b>: Look forward to speaking tomorrow. Do you like calendar invites sent to your email address?</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: I already put the two dates on my calendar, but no problem to send me supplementary stuff.<br /></li> <li><b>Parag</b>: I'm available starting now if you want to have a chat about engineering at Twitter. Let me know!</li> <li><b>Elon</b>: Call in about 45 mins?</li> <li><b>Parag</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: Will call back shortly</li> <li><b>Elon</b>: &lt;Attachment-image/png-Screenshot 2022-04-08 at 10.10.09 PM.png&gt;</li> <li><b>Elon</b>: I am so sick of stuff like this</li> <li><b>Parag</b>: We should be catching this</li> <li><b>Elon</b>: Yeah</li> </ul></li><br> <li><a id="39"></a><a href="#39">2022-04-09 to 2022-04-24</a> <ul> <li><b>Kimbal Musk [Elon's brother and owner of The Kitchen Restaurant Group]</b>: I have an idea for a blockchain social media system that does both payments and short text messages/links like twitter. You have to pay a tiny amount to register your message on the chain, which will cut out the vast majority of spam and bots. There is no throat to choke, so free speech is guaranteed.</li> <li><b>Kimbal</b>: The second piece of the puzzle is a massive real-time database that keeps a copy of all blockchain messages in memory, as well as all message sent to or received by you, your followers and those you follow.</li> <li><b>Kimbal</b>: Third piece is a twitter-like app on your phone that accessed the database in the cloud.</li> <li><b>Kimbal</b>: This could be massive</li> <li><b>Kimbal</b>: I'd love to learn more. I've dug deep on Web3 (not crytpo as much} and the voting powers are amazing and verified. Lots you could do here for this as well</li> <li><b>Elon</b>: I think a new social media company is needed that is based on a blockchain and includes payments</li> <li><b>Kimbal</b>: Would have them pay w a token associated w the service? You'd have to hold the token in your wallet to post. Doesn't have to expensive it will grow over time in value</li> <li><b>Kimbal</b>: Blockchain prevents people from deleting tweets.
Pros and cons, but let the games begin!</li> <li><b>Kimbal</b>: If you did use your own token, you would not needs advertising it's a pay for use service but at a very low price</li> <li><b>Kimbal</b>: With scale it will be a huge business purely for the benefit of the users. I hate advertisements</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Kimbal</b>: There are some good ads out there. The voting component of interested users (only vote if you want to) could vote on ads that add value. The advertisers would have to stake a much larger amount of tokens, but other than there is no charge for the ads. It will bring out the creatives and the ads can politically incorrect/art/activision/philanthropy</li> <li><b>Kimbal</b>: Voting rights could also crowdsource kicking scammers out. It drives me crazy when I see people promoting the scam that you're giving away Bitcoin. Lots of bad people out there</li> <li><b>[2022-04-24] Elon</b>: Do you want to participate in the twitter transaction?</li> <li><b>Kimbal</b>: Let's discuss tomorrow</li> <li><b>Elon</b>: Ok</li> <li><b>Kimbal</b>: I can break away from my group a lot of the time. Will text tomorrow afternoon and if you're free we can meet up</li> <li><b>Elon</b>: Ok</li> </ul></li><br> <li><a id="40"></a><a href="#40">2022-04-09</a> [interleaved with above] <ul> <li><b>Parag</b>: You are free to tweet &quot;is Twitter dying?&quot; or anything else about Twitter -but it's my responsibility to tell you that it's not helping me make Twitter better in the current context. Next time we speak, I'd like to you provide you perspective on the level of internal distraction right now and how it hurting our ability to do work. I hope the AMA will help people get to know you, to understand why you believe in Twitter, and to trust you -and I'd like the company to get to a place where we are more resilient and don't get distracted, but we aren't there right now.</li> <li><b>Elon</b>: What did you get done this week?</li> <li><b>Elon</b>: I'm not joining the board. This is a waste of time.</li> <li><b>Elon</b>: Will make an offer to take Twitter private.</li> <li><b>Parag</b>: Can we talk?</li> </ul></li><br> <li><a id="41"></a><a href="#41">2022-04-09 to 2022-04-10</a> [interleaved with above] <ul> <li><b>Bret</b>: Parag just called me and mentioned your text conversation. Can you talk?</li> <li><b>Elon</b>: Please expect a take private offer</li> <li><b>Bret</b>: I saw the text thread. Do you have five minutes so I can understand the context? I don't currently</li> <li><b>Elon</b>: Fixing twitter by chatting with Parag won't work</li> <li><b>Elon</b>: Drastic action is needed</li> <li><b>Elon</b>: This is hard to do as a public company, as purging fake users will make the numbers look terrible, so restructuring should be done as a private company.</li> <li><b>Elon</b>: This is Jack's opinion too.</li> <li><b>Bret</b>: Can you take 10 minutes to talk this through with me? It has been about 24 hours since you joined the board. I get your point, but just want to understand about the sudden pivot and make sure I deeply understand your point of view and the path forward</li> <li><b>Elon</b>: I'm about to take off, but can talk tomorrow</li> <li><b>Bret</b>: Thank you</li> <li><b>Bret</b>: Heyo-can you speak this evening? I have seen your tweets and feel more urgency about understanding your path forward</li> <li><b>[next day] Bret</b>: Acknowledging your text with Parag yesterday that you are declining to join the board. 
This will be reflected in our 8-K tomorrow. I've asked our team to share a draft with your family office today. I'm looking forward to speaking today.</li> <li><b>Elon</b>: Sounds good</li> <li><b>Elon</b>: It is better, in my opinion, to take Twitter private, restructure and return to the public markets once that is done. That was also Jack's view when I talked to him.</li> </ul></li><br> <li><a id="42"></a><a href="#42">2022-04-12</a> <ul> <li><b>Michael Kives [Hollywood talent agent]</b>: Have any time to see Philippe Laffont in Vancouver tomorrow?</li> <li><b>Elon</b>: Maybe</li> <li><b>Michael</b>: Any particular time of best?</li> <li><b>Michael</b>: any time best?</li> <li><b>Elon</b>: What exactly does he want?</li> <li><b>Michael</b>: Has some ideas on Twitter Owns a billion of Tesla Did last 2 or 3 SpaceX rounds And-wants to get into Boring in the future (I told him to help with recruiting) You could honestly do like 20 mins in your hotel He's super smart, good guy</li> <li><b>Elon</b>: Ok, he can come by tonight. Room 1001 at Shangri-La.</li> <li><b>Michael</b>: Need to find you a great assistant! I'm headed to bed I'll tell Philippe to email you when he lands tonight in case you're still up and want to meet</li> <li><b>Michael</b>: <a href="https://twitter.com/sbf_ftx/status/1514588820641128452?s=21&amp;tZ4pA_Ct35ud6M60g3ng">https://twitter.com/sbf_ftx/status/1514588820641128452?s=21&amp;tZ4pA_Ct35ud6M60g3ng</a></li> <li><b>Michael</b>: Could be cool to do this with Sam Bankman-Fried</li> <li><b>[2022-04-28] Elon</b>: Twitter is obviously not going to be turned into some right wing nuthouse. Aiming to be as broadly inclusive as possible. Do the right thing for vast majority of Americans.</li> <li><b>Michael</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="43"></a><a href="#43">2022-04-13 to 2022-04-15</a> <ul> <li><b>Elon to Bret</b>: After several days of deliberation -this is obviously a matter of serious gravity-I have decided to move forward with taking Twitter private. I will send you an offer letter tonight, which will be public in the morning. Happy to connect you with my team if you have any questions. Thanks, Elon</li> <li><b>Bret</b>: Acknowledged</li> <li><b>Bret</b>: Confirming I received your email. Also, please use [...] going forward, my personal email.</li> <li><b>Elon</b>: Will do</li> <li><b>[2022-04-14] Bret</b>: Elon, as you saw from our press release, the board is in receipt of your letter and is evaluating your proposal to determine the course of action that it believes is in the best interest of Twitter and all of its stockholders. We will be back in touch with you when we have completed that work. Bret</li> <li><b>Elon</b>: Sounds good</li> <li><b>[2022-04-17] Bret</b>: Elon, I am just checking in to reiterate that the board Is seriously reviewing the proposal in your letter. We are working on a formal response as quickly as we can consistent with our fiduciary duties. Feel free to reach out anytime.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="44"></a><a href="#44">2022-04-14</a> <ul> <li><b>Elon to Steve Davis [President of Boring Company]</b>: My Plan B is a blockchain-based version of twitter, where the &quot;tweets&quot; are embedded in the transaction as comments. 
So you'd have to pay maybe 0.1 Doge per comment or repost of that comment.</li> <li><b>Elon</b>: <a href="https://twitter.com/elonmusk/status/1514564966564651008?s=1O&amp;t=OfO6fmJ_4DuQrOrdkKIT0gQ">https://twitter.com/elonmusk/status/1514564966564651008?s=1O&amp;t=OfO6fmJ_4DuQrOrdkKIT0gQ</a> Self</li> <li><b>Steve</b>: Amazing! Not sure which plan to root for. If Plan B wins, let me know if blockchain engineers would be helpful.</li> </ul></li><br> <li><a id="45"></a><a href="#45">2022-04-14</a> [group chat with Will MacAskill, Sam BF, and Elon Musk, interleaved with above] <ul> <li><b>Sam BF</b>: Btw Elon-would love to talk about Twitter Also a post on how blockchain+Twitter could work:</li> <li><b>Sam BF</b>: <a href="https://twitter.com/sbf_ftx/status/1514588820641128452?s=21&amp;t=n10hLHFilyMognjOucltw">https://twitter.com/sbf_ftx/status/1514588820641128452?s=21&amp;t=n10hLHFilyMognjOucltw</a></li> </ul></li><br> <li><a id="46"></a><a href="#46">2022-04-14 to 2022-04-16</a> <ul> <li><b>Marc Merrill [co-founder and President of Riot Games]</b>: <a href="https://ground.news/article/elon-musk-offers-to-buy-twitter-for-4139-billion_20a2b3">https://ground.news/article/elon-musk-offers-to-buy-twitter-for-4139-billion_20a2b3</a></li> <li><b>Marc</b>: you are the hero Gotham needs - hell F'ing yes!</li> </ul></li><br> <li><a id="47"></a><a href="#47">2022-04-14 to 2022-04-15</a> <ul> <li><b>Jason Calacanis [VC]</b>: You should raise your offer</li> <li><b>Jason</b>: $54.21</li> <li><b>Jason</b>: The perfect counter</li> <li><b>Jason</b>: You could easily clean up bots and spam and make the service viable for many more users —Removing bots and spam is a lot less complicated than what the Tesla self driving team is doing (based on hearing the last edge case meeting)</li> <li><b>Jason</b>: And why should blue check marks be limited to the elite, press and celebrities? How is that democratic?</li> <li><b>Jason</b>: The Kingdom would like a word.. <a href="https://twitter.com/Alwaleed_TalaI/status/1514615956986757127?s=20&amp;t=2q4VfMBXrldYGj3vFN_r0w">https://twitter.com/Alwaleed_TalaI/status/1514615956986757127?s=20&amp;t=2q4VfMBXrldYGj3vFN_r0w</a> 😂😂😂</li> <li><b>Jason</b>: Back of the envelope... Twitter revenue per employee: $5B rev / 8k employees = $625K rev per employee in 2021 Google revenue per employee: $257B rev2/ 135K employee2= $1.9M per employee in 2021 Apple revenue per employee: $365B rev / 154k employees= $2.37M per employee in fiscal 2021</li> <li><b>Jason</b>: Twitter revenue per employee if 3k instead of 8k: $5B rev/ 3k employees= $1.66m rev per employee in 2021 (more industry standard)</li> <li><b>Elon</b>: [&quot;emphasized&quot; above]</li> <li><b>Elon</b>: Insane potential for improvement</li> <li><b>Jason</b>: &lt;Attachment-image/gif-lMG_2241.GIF&gt;</li> <li><b>Jason</b>: Day zero</li> <li><b>Jason</b>: Sharpen your blades boys 🗡️</li> <li><b>Jason</b>: 2 day a week Office requirement= 20% voluntary departures</li> <li><b>Jason</b>: <a href="https://twitter.com/jason/status/1515094823337832448?s=1O&amp;t=CWr2U7sH4wVOsohPgjKRg">https://twitter.com/jason/status/1515094823337832448?s=1O&amp;t=CWr2U7sH4wVOsohPgjKRg</a></li> <li><b>Jason</b>: I mean, the product road map is beyond obviously</li> <li><b>Jason</b>: Premium feature abound ... and twitter blue has exactly zero [unknown emoji]</li> <li><b>Jason</b>: What committee came up with the list of dog shit features in Blue?!?
It's worth paying to turn it off</li> <li><b>Elon</b>: Yeah, what an insane piece of shit!</li> <li><b>Jason</b>: Maybe we don't talk twitter on twitter OM @</li> <li><b>Elon</b>: Was just thinking that haha</li> <li><b>Elon</b>: Nothing said there so far is anything different from what I said publicly.</li> <li><b>Elon</b>: Btw, Parag is still on a ten day vacation in Hawaii</li> <li><b>Jason</b>: No reason to cut it short... in your first tour as ceo</li> <li><b>Jason</b>: (!!!)</li> <li><b>Jason</b>: Shouldn't he be in a war room right now?!?</li> <li><b>Elon</b>: Does doing occasional zoom calls while drinking fruity cocktails at the Four Seasons count?</li> <li><b>Jason</b>: 🤔</li> <li><b>Jason</b>: <a href="https://twitter.com/jason/status/1515427935263490053?s=10&amp;t=4rQ_JIDXCDtHhOaXdGHJ5g">https://twitter.com/jason/status/1515427935263490053?s=10&amp;t=4rQ_JIDXCDtHhOaXdGHJ5g</a></li> <li><b>Jason</b>: I'm starting a DAO</li> <li><b>Jason</b>: 😂😂😂</li> <li><b>Jason</b>: Money goes to buy twitter shares, if you don't wine money goes to open source twitter competitor 😂😂😂</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>[2022-04-23] Elon</b>: I will be Universally beloved, since it is so easy to please everyone on twitter</li> <li><b>Jason</b>: It feels like everyone wants the same exact thing, and they will be patient and understanding of any changes ... Twitter Stans are a reasonable, good faith bunch</li> <li><b>Jason</b>: These dipshits spent a years on twitter blue to give people exactly..... Nothing they want!</li> <li><b>Jason</b>: Splitting revenue with video creators like YouTube could be huge unlock</li> <li><b>Jason</b>: We could literally give video creators 100% of their ad revenue up to $1m then do split</li> <li><b>Elon</b>: Absolutely</li> <li><b>Jason</b>: 5 Teams: 5 Northstar metrics 1. Legacy Opps: uptime, speed 2. Membership team: remove bots while getting users to pay far &quot;Real Name Memberships&quot; $5 a month $SO a year. Includes 24 hours response to customer service 3. Payments: % of users that have connected a bank account/made a deposit 4. Creator Team: get creators to publish to twitter first (musicians, You Tubers, tiktokers, etc) by giving them the best % split in the industry (and promotion) 5. Transparency Team: make the Algorithm &amp; Moderation understandable and fair</li> <li><b>Jason</b>: I think those are the 5 critical pieces ... everyone agrees to &quot;year one&quot; sprint, including coming back to offices within the first 60 days (unless given special dispensation for extraordinary contribution)</li> <li><b>Jason</b>: Hard Reboot the organization</li> <li><b>Jason</b>: Feels like no one is setting priorities ruthlessly .. 12,000 working on whatever they want?!? No projects being cancelled?!</li> <li><b>Jason</b>: Move HQ to Austin, rent gigafactory excess space</li> <li><b>Elon</b>: Want to be a strategic advisor if this works out?</li> <li><b>Elon</b>: Want to be a strategic advisor to Twitter if this works out?</li> <li><b>Jason</b>: Board member, advisor, whatever ... you have my sword</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> <li><b>Jason</b>: If 2, 3 or 4 unlock they are each 250b+ markets</li> <li><b>Jason</b>: Payments is $250-500b, YouTube/creators is $250b+</li> <li><b>Jason</b>: Membership no one has tried really .... So hard to estimate. 1-5m paid members maybe @ Jason $50-100 a year?
250k corporate memberships @ 10k a year?</li> <li><b>Elon</b>: You are a mind reader</li> <li><b>Jason</b>: Put me in the game coach!</li> <li><b>Jason</b>: [unclear emoji]</li> <li><b>Jason</b>: Twitter CEO is my dream job</li> <li><b>Jason</b>: <a href="https://apple.news/AIDqUaC24Sguyc9S9krWlig">https://apple.news/AIDqUaC24Sguyc9S9krWlig</a></li> <li><b>Jason</b>: we should get Mr Beast to create for twitter ... we need to win the next two generations (millennials and Z are &quot;meh&quot; on twitter)</li> <li><b>Elon</b>: For sure</li> <li><b>Jason</b>: Just had the best idea eve.for monetization ... if you pay .01 per follower per year, you can DM all your followers upto 1x a day.</li> <li><b>Jason</b>: 500,000 follows = $5,000 and 1 DM them when 1 have new podcast episode, or I'm doing an event... or my new book comes out</li> <li><b>Jason</b>: And let folks slice and dice... so, you could DM all your twitter followers in Ber1in and invite them to the GigaRave</li> <li><b>Jason</b>: Oh my lord this would unlock the power of Twitter and goose revenue massively .... Who wouldn't pay for this!?!?</li> <li><b>Jason</b>: and if you over use the tool and are annoying folks would unfollow you ... so it's got a built in Jason safe guard {unlike email spam)</li> <li><b>Jason</b>: Imagine we ask Justin Beaver to come back and let him DM his fans ... he could sell $10m in merchandise or tickets instantly. Would be INSANE for power users and companies</li> <li><b>Elon</b>: Hell yeahl!</li> <li><b>Elon</b>: It will take a few months for the deal to complete before I'm actually in control<br /></li> </ul></li><br> <li><a id="48"></a><a href="#48">2022-04-14</a> <ul> <li><b>[redacted]</b>: Hey Elon -my name is Jake Sherman. I'm a reporter with Punchbowl News in Washington I cover Congress. Wonder if you're game to talk about how Twitter would change for politics if you were at the helm?</li> </ul></li><br> <li><a id="49"></a><a href="#49">2022-04-14</a> <ul> <li><b>Adeo Ressi [VC]</b>: Would love you to buy Twitter and fix it 🙏</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> </ul></li><br> <li><a id="50"></a><a href="#50">2022-04-14</a> <ul> <li><b>Omead Afshar [Project Director for Office of CEO at Tesla]</b>: Thank you for what you're doing. We all love you and are always behind you! Not having a global platform that is truly free speech is dangerous for all. 
Companies are all adopting some form of content moderation and it's all dependent on ownership on how it shifts and advertisers paying them, as you've said.</li> <li><b>Omead</b>: Who knew a Saudi Arabian prince had so much leverage and so much to say about twitter.</li> </ul></li><br> <li><a id="51"></a><a href="#51">2022-04-20</a> <ul> <li><b>Elon to Brian Kingston [investment management, real estate]</b>: Not at all</li> </ul></li><br> <li><a id="52"></a><a href="#52">2022-04-20 to 2022-04-22</a> <ul> <li><b>Elon</b>: Larry Ellison is interested in being part of the Twitter take-private</li> <li><b>Jared</b>: [&quot;liked&quot; above]</li> <li><b>Jared</b>: <a href="https://www.bloomberg.com/news/articles/2022-04-19/ftx-ceo-bankman-fried-wants-to-fix-social-media-with-blockchain">https://www.bloomberg.com/news/articles/2022-04-19/ftx-ceo-bankman-fried-wants-to-fix-social-media-with-blockchain</a></li> <li><b>Jared</b>: &lt;Attachment-text/vcard - Sam Bankman-Fried.vcf&gt;</li> <li><b>Jared</b>: He seems to point to a similar blockchain based idea Also, we now have the right software engineer for you to speak with about the blockchain idea. Do you want an intro? Or just contact info?</li> <li><b>Elon</b>: Who is this person and who recommended them?</li> <li><b>Elon</b>: The engineer</li> <li><b>Elon</b>: I mean<br /></li> <li><b>Elon</b>: The idea of blockchain free speech has been around for a long time. The questions are really about how to implement it.</li> <li><b>Jared</b>: former spacex'r, current CTO at Matter Labs, a blockchain company. TBC is on the verge of hiring him.</li> <li><b>Jared</b>: <a href="https://www.linkedin.com/in/anthonykrose/">https://www.linkedin.com/in/anthonykrose/</a></li> <li><b>Elon</b>: Ok</li> <li><b>Jared</b>: best to intro you via email?</li> <li><b>Elon</b>: Yeah</li> <li><b>Jared</b>: Investor calls are currently scheduled from 1pm to 3pm. They'd like to do a brief check in beforehand.</li> <li><b>Elon</b>: Ok</li> <li><b>Elon</b>: Whatever time works</li> <li><b>Jared</b>: [&quot;liked&quot; above]</li> <li><b>Jared</b>: I'll be dialing in to the calls as well. Let me know if you prefer that I'm there in person at the house or if remote is better.</li> <li><b>Elon</b>: Remote is fine</li> </ul></li><br> <li><a id="53"></a><a href="#53">2022-04-22</a> <ul> <li><b>&quot;BL Lee&quot;</b>: i have a twitter ceo candidate for you -bill gurley/benchmark. they were early investors as well so know all the drama. want to meet him?</li> </ul></li><br> <li><a id="54"></a><a href="#54">2022-04-23</a> <ul> <li><b>Elon to James Gorman [CEO, Morgan Stanley]</b>: Thanks James, your unwavering support is deeply appreciated. Elon</li> <li><b>Elon</b>: I think the tender has a real chance</li> </ul></li><br> <li><a id="55"></a><a href="#55">2022-04-23</a> <ul> <li><b>Elon to Bret</b>: Would it be possible for you and me to talk this weekend?</li> <li><b>Elon</b>: Or any group of people from the Twitter and my side</li> <li><b>Bret</b>: Yes, that would be great. I would suggest me and Sam Britton from Goldman on our side. Do you have time this afternoon / evening?</li> <li><b>Elon</b>: Sounds good</li> <li><b>Elon</b>: Whatever time works for you and Sam is good for me</li> <li><b>Bret</b>: Can we call you in 15 mins? 4:30pm PT (not sure what time zone you are in)</li> <li><b>Bret</b>: I can just call your mobile.
Let me know if you prefer Zoom, conference call code, or something different</li> <li><b>Elon</b>: Sure</li> <li><b>Elon</b>: Mobile is fine</li> <li><b>Bret</b>: [&quot;liked&quot; above]<br /></li> <li><b>Bret</b>: Just tried callings-please call whenever you are available</li> <li><b>Elon</b>: Calling shortly</li> <li><b>Bret</b>: Great thanks</li> <li><b>Elon</b>: Morgan Stanley needs to talk to me. I will call as soon as that's done.</li> <li><b>Bret</b>: No problem -here when you are ready</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Bret</b>: I understand our advisors just had a productive call. I am available to speak after you've debriefed with them.</li> <li><b>Elon</b>: Sounds good</li> </ul></li><br> <li><a id="56"></a><a href="#56">2022-04-23</a> [interleaved with above] <ul> <li><b>Joe Rogan</b>: I REALLY hope you get Twitter. If you do, we should throw a hell of a party.</li> <li><b>Elon</b>: 💯</li> </ul></li><br> <li><a id="57"></a><a href="#57">2022-04-23 to 2022-04-26</a> <ul> <li><b>&quot;Mike Pop&quot;</b>: For sure</li> <li><b>Mike</b>: defiantly things can be better and more culturally engaged</li> <li><b>Mike</b>: I think you're in a unique position to broker better AI to detect bots the second they pop up</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>[2022-04-26] Mike</b>: When do I start boss</li> <li><b>Elon</b>: It will take at least a few months to dose the deal</li> <li><b>Mike</b>: [&quot;loved&quot; above]</li> </ul></li><br> <li><a id="58"></a><a href="#58">2022-04-25</a> <ul> <li><b>Bret</b>: Time for a quick check in?</li> <li><b>Bret</b>: Will call you back</li> <li><b>Bret</b>: In a bit</li> <li><b>Elon</b>: Ok</li> <li><b>Bret</b>: <a href="https://twitter.com/btaylor/status/1518664708177362944?s=lO&amp;t=9WqlCSZVMQdycPc314T">https://twitter.com/btaylor/status/1518664708177362944?s=lO&amp;t=9WqlCSZVMQdycPc314T</a></li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> <li><b>Elon</b>: Thank you</li> <li><b>Bret</b>: Here to make this successful in any way I can</li> <li><b>Elon</b>: [&quot;liked&quot; above]<br /></li> </ul></li><br> <li><a id="59"></a><a href="#59">2022-04-25 to 2022-04-26</a> [interleaved with some above] <ul> <li><b>Elon to Parag</b>: Can I call you later? I have the SpaceX exec staff meeting right now. Will be done in half an hour. Do you need to talk before then?</li> <li><b>Parag</b>: No -can talk in 30!</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>[2022-04-26] Elon</b>: Good question ...</li> <li><b>Elon</b>: <a href="https://twitter.com/norsemen62/status/1519005154204336128?s=1O&amp;t=MKtYF6Wu2sSTdoWWqThEDg">https://twitter.com/norsemen62/status/1519005154204336128?s=1O&amp;t=MKtYF6Wu2sSTdoWWqThEDg</a></li> </ul></li><br> <li><a id="60"></a><a href="#60">2022-04-25</a> [interleaved with some above] <ul> <li><b>jack</b>: Thank you ❤️</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> <li><b>Elon</b>: I basically following your advice!</li> <li><b>jack</b>: I know and I appreciate you. This is the right and only path.
I'll continue to do whatever it takes to make it work.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li><a id="61"></a><a href="#61">2022-04-25</a> <ul> <li><b>Elon to Tim Urban [creator of Wait But Why]</b>: Absolutely</li> <li><b>Tim</b>: i haven't officially started my podcast yet but if you think it would be helpful, i'd be happy to record a conversation with you about twitter to ask some of the most common questions and let you expand upon your thoughts Tim Urban</li> <li><b>Tim</b>: but only if it would be helpful to you</li> <li><b>Elon</b>: Suee</li> <li><b>Tim</b>: Any day or time that's best for you? And best location? I'm in LA but can zip over to Austin if you're there.</li> <li><b>Elon</b>: Probably in a few weeks</li> <li><b>Tim</b>: [&quot;liked&quot; above]<br /></li> </ul></li><br> <li><a id="62"></a><a href="#62">2022-04-25</a> <ul> <li><b>Michael Grimes [IBanker at Morgan Stanley]</b>: Do you have 5 minutes to connect on possible meeting tomorrow I believe you will want to take?</li> <li><b>Elon</b>: Will call in about half an hour</li> <li><b>Michael</b>: Sam Bankman Fried is why I'm calling <a href="https://twitter.com/sbf_ftx/status/1514588820641128452">https://twitter.com/sbf_ftx/status/1514588820641128452</a> <a href="https://www.vox.com/platform/amp/recode/2021/3/20/22335209/sam-bank.man-fried-joe-biden-ftx-cryptocurrency-effective-altruism">https://www.vox.com/platform/amp/recode/2021/3/20/22335209/sam-bank.man-fried-joe-biden-ftx-cryptocurrency-effective-altruism</a> <a href="https://ftx.us">https://ftx.us</a></li> <li><b>Elon</b>: ??</li> <li><b>Elon</b>: I'm backlogged with a mountain of critical work matters. ls this urgent?</li> <li><b>Michael</b>: Wants 1-Sb. Serious about partner w/you. Same security you own</li> <li><b>Michael</b>: Not urgent unless you want him to fly tomorrow. He has a window tomorrow then he's wed-Friday booked</li> <li><b>Michael</b>: Could do $5bn if everything vision lock. Would do the engineering for social media blockchain integration. Founded FTX crypto exchange. Believes in your mission. Major Democratic donor. So thought it was potentially worth an hour tomorrow a la the Orlando meeting and he said he could shake hands on 5 if you like him and I think you will. Can talk when you have more time not urgent but if tomorrow works it could get us $5bn equity in an hour</li> <li><b>Elon</b>: Blockchain twitter isn't possible, as the bandwidth and latency requirements cannot be supported by a peer to peer network, unless those &quot;peers&quot; are absolutely gigantic, thus defeating the purpose of a decentralized network.</li> <li><b>Elon</b>: [&quot;disliked&quot; &quot;Could do $5bn ...&quot;]</li> <li><b>Elon</b>: So long as I don't have to have a laborious blockchain debate</li> <li><b>Elon</b>: Strange that Orlando declined</li> <li><b>Elon</b>: Please let him know that I would like to talk and understand why he declined</li> <li><b>Elon</b>: Does Sam actually have $3B liquid?</li> <li><b>Michael</b>: I think Sam has it yes. He actually said up to 10 at one point but in writing he said up to 5. He's into you. And he specifically said the blockchain piece is only if you liked it and not gonna push it. Orlando referred Sams interest to us and will be texting you to speak to say why he (Orlando) declined. We agree orlando needs to call you and explain given everything he said to us and you. Will make that happen We can push Sam to next week but I do believe you will like him.
Ultra Genius and doer builder like your formula. Built FTX from scratch after MIT physics. Second to Bloomberg in donations to Biden campaign.</li> </ul></li><br> <li><a id="63"></a><a href="#63">2022-04-25 to 2022-04-28</a> <ul> <li><b>Elon to David Sacks [VC]</b>: <a href="https://twitter.com/dineshdsouza/status/1518744328205647872?s=10&amp;t=vkagBUrJJexF_SJDOC_LUw">https://twitter.com/dineshdsouza/status/1518744328205647872?s=10&amp;t=vkagBUrJJexF_SJDOC_LUw</a></li> <li><b>David Sacks</b>: RT'd</li> <li><b>David</b>: Justin Amash (former congressman who's liberterian and good on free speech) asked far an intro to you: &quot;I believe I can be helpful to Twitter's team going forward-thinking about how to handle speech and moderation, how that intersects with ideas about governance, how to navigate actual government (including future threats to Section 230), etc.-and I'd love to connect with Elon if he's interested in connecting (I don't have a direct way to contact him). I believe my experience and expertise can be useful, and my genera! outlook aligns with his stated goals. Thanks. All the best.&quot; Please LMK if you want to connect with him.</li> <li><b>David</b>: <a href="https://twitter.com/justinamash?s=21&amp;t=_Owbgwdot71pUtC4rJUXYg">https://twitter.com/justinamash?s=21&amp;t=_Owbgwdot71pUtC4rJUXYg</a></li> <li><b>Elon</b>: I don't own twitter yet</li> <li><b>David</b>: Understood.</li> <li><b>[2022-04-28] Elon</b>: Do you and/or find want to invest in the take private?</li> <li><b>Elon</b>: *fund</li> <li><b>David</b>: Yes but I don't have a vehicle for it (Craft is venture} so either I need to set up an SPV or just do it personally. If the latter, my amount would be mice-nuts in relative terms but I would be happy to participate to support the cause.</li> <li><b>Elon</b>: Up to you</li> <li><b>David</b>: Ok cool, let me know.</li> <li><b>David</b>: I'm in personally and will raise an SPV too if that works for you.</li> <li><b>Elon</b>: Sure</li> </ul></li><br> <li><a id="64"></a><a href="#64">2022-04-26</a> <ul> <li><b>James Murdoch [Rupert Murdoch's son]</b>: Thank you. I will link you up. Also will call when same of the dust settles. Hope all's ok..</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> </ul></li><br> <li>2022-04-26 [group chat with Kathryn Murdoch, James Murdoch, and Elon Musk] <ul> <li><b>Kathryn Murdoch [James Murdoch's wife]</b>: Will you bring back Jack?</li> <li><b>Elon</b>: Jack doesn't want to come back. He is focused on Bitcoin.</li> </ul></li><br> <li><a id="65"></a><a href="#65">2022-04-27</a> <ul> <li><b>[redacted]</b>: Hi Elon, This is Maddie, Larry Ellison's assistant. Larry asked that I connect the head of his family office, Paul Marinelli, with the head of yours. Would you please share their contact details? Alternatively, please provide them with Paul's: Cell: [...] Email: [...]</li> <li><b>Elon</b>: [Jared's email]</li> <li><b>[redacted]</b>: Thank you.</li> </ul></li><br> <li><a id="66"></a><a href="#66">2022-04-27</a> <ul> <li><b>Marc Benioff [co-founder, chair, co-CEO of Salesforce, along with Bret Taylor]</b>: Happy to talk about it if this is interesting: Twitter conversational OS-the townsquare for your digital life.</li> <li><b>Elon</b>: Well I don't own it yet</li> </ul></li><br> <li><a id="67"></a><a href="#67">2022-04-27</a> <ul> <li><b>Reid Hoffman [VC]</b>: Great.
I will put you in touch with Satya.</li> <li><b>Elon</b>: Sounds good</li> <li><b>Elon</b>: Do you want to invest in Twitter take private?</li> <li><b>Reid</b>: It's way beyond my resources. I presume you are not interested in ventures$.</li> <li><b>Elon</b>: There is plenty of financial support, but you're a friend, so just letting you know you'd get priority. VC money is fine if you want.</li> <li><b>Reid</b>: Very cool! OK -if I were to put together$, what size could you make available? [unclear emoji, some kind of smiling face]</li> <li><b>Elon</b>: Whatever you'd like. I will just cut back others.</li> <li><b>Elon</b>: I would need to know the approximate by next week</li> <li><b>Reid</b>: What would be the largest $ that would be ok? I consulted with our LPs, and I have strong demand. Would be fun!</li> <li><b>Elon</b>: $2B?</li> <li><b>Reid</b>: Great. Probably doable -let me see.</li> <li><b>Elon</b>: Can be less if easier. The round is oversubscribed, so I just have to tell other investors what their allocation is ideally bv early next week.</li> <li><b>Elon</b>: Should I connect you with the Morgan Stanley team?</li> <li><b>Reid</b>: Yes please. Especially with the terms, etc. I know Michael Grimes, btw.</li> <li><b>Elon</b>: Please feel free to call him directly</li> <li><b>[group chat message connecting Reid to Jared Birchall]</b></li> <li><b>Reid</b>: OK-I'll do that. (Trying to simplify your massively busy life.) The Morgan Stanley deal team is truly excellent and I don't say such things lightly.</li> <li><b>[group chat messages between Reid and Jared to exchange emails]</b></li> <li><b>Reid</b>: Indeed! I took U public and the MSFT-U deal with them!</li> </ul></li><br> <li><a id="68"></a><a href="#68">2022-04-27</a> [interleaved with above] <ul> <li><b>Viv Huantusch</b>: From a social perspectives-Twitter allowing for high quality video uploads (1080p at a minimum) &amp; adding a basic in-app video editor would have quite a big impact I think. Especially useful for citizen journalism &amp; fun educational content. Might even help Twitter regain market share lost to TikTok</li> <li><b>Elon</b>: Agreed</li> <li><b>Elon</b>: Twrtter can't monetize video yet, so video is a loss for Twitter and for the those who post</li> <li><b>Elon</b>: Twitter needs better guidance</li> <li><b>Viv</b>: Yeah, 100%</li> <li><b>Viv</b>: They should have a subscription that's actually useful (unlike Twitter Blue haha)</li> <li><b>Elon</b>: Totally</li> </ul></li><br> <li><a id="69"></a><a href="#69">2022-04-27</a> [group chat with &quot;Satya&quot; [presumed to be Satya Nadella CEO of Microsoft], Reid Hoffman, and Elon Musk, interleaved with above] <ul> <li><b>Elon, Satya</b>: as indicated, this connects the two of you by text and phone.</li> <li><b>Satya</b>: Thx Reid. Efon -will text and coordinate a time to chat. Thx</li> </ul></li><br> <li><a id="70"></a><a href="#70">2022-04-27</a> [interleaved with above] <ul> <li><b>Satya</b>: Hi Elon .. Let me know when you have time to chat. can do tomorrow evening or weekend. Look forward to it. ThxSatya</li> <li><b>Elon</b>: I can talk now if you want</li> <li><b>Satya</b>: Calling</li> <li><b>Satya</b>: Thx for the chat. Will stay in touch. And will for sure follow-up on Teams feedback!</li> <li><b>Elon</b>: sounds good :)</li> </ul></li><br> <li><a id="71"></a><a href="#71">2022-04-27</a> [interleaved with above] <ul> <li><b>Elon to Brian Acton [interim CEO of Signal]</b>: Trying to figure out what to do with Twitter DMs. They should be end to end encrypted (obv).
Dunno if better to have redundancy with Signal or integrate it.</li> </ul></li><br> <li><a id="72"></a><a href="#72">2022-04-27</a> [interleaved with above] <ul> <li><b>Elon to Bret Taylor</b>: I'd like to convey some critical elements of the transition plan. Is there a good time for us to talk tonight? Happy to have anyone from Twitter on the call.</li> <li><b>Elon</b>: My biggest concern is headcount and expense growth. Twitter has ~3X the head count per unit of revenue of other social media companies, which is very unhealthy in my view.</li> </ul></li><br> <li><a id="73"></a><a href="#73">2022-04-28</a> [group chat with Jared Birchall, Sam BF, and Elon Musk] <ul> <li><b>Jared Birchall</b>: Elon -connecting you with SBF.</li> <li><b>Sam BF</b>: Hey!</li> </ul></li><br> <li><a id="74"></a><a href="#74">2022-04-29</a> <ul> <li><b>Steve Jurvetson [VC]</b>: <a href="https://www.linkedin.com/in/emilmichael/">https://www.linkedin.com/in/emilmichael/</a></li> <li><b>Steve</b>: If you are looking for someone to run the Twitter revamping .... perhaps as some kind of CXO under you ... Emil Michael is a friend that just offered that idea. Genevieve loved working for him at Klout. He went on to become Chief Business Officer of Uber for 2013-17.</li> <li><b>Elon</b>: I don't have a Unkedln account</li> <li><b>Elon</b>: I don't think we will have any CXO titles</li> <li><b>Steve</b>: OK. Are you looking to hire anyone, or do you plan to run it?</li> <li><b>Steve</b>: &lt;Attachment-image/jpeg-Screen Shot 2022-04-29 at 5.49.53 PM.jpeg&gt;</li> <li><b>Steve</b>: This is his experience prior to Uber:</li> <li><b>Elon</b>: Please send me anyone who actually writes good software</li> <li><b>Steve</b>: Ok, no management; good coders, got it.</li> <li><b>Elon</b>: Yes</li> <li><b>Elon</b>: Twitter is a software company (or should be)</li> <li><b>Steve</b>: Yes. My son at Reddit and some other young people come to mind. I was thinking about who is going to manage the software people (to prioritize and hit deadlines), and I guess that's you.</li> <li><b>Elon</b>: I will oversee software development</li> </ul></li><br> </ul> <h3 id="exihibit-j">Exhibit J</h3> <ul> <li><a id="75"></a><a href="#75">2022-04-04 to 2022-04-14</a> <ul> <li><b>Mathias Döpfner</b>: 👍</li> <li><b>[2022-04-14] Mathias</b>: &lt;Attachment - application/vnd.openxmlformats-officedocument.wordprocessingml.document-Twitter_lnterview.doc&gt;</li> </ul></li><br> <li><a id="76"></a><a href="#76">2022-04-11</a> <ul> <li><b>Kimbal Musk</b>: Great to hang yesterday. I'd love to help think through the structure for the Doge social media idea Let me know how I can help</li> <li><b>Elon</b>: Ok</li> </ul></li><br> <li><a id="77"></a><a href="#77">2022-04-14</a> <ul> <li><b>Elon to Marc Merrill</b>: [&quot;loved&quot; &quot;you are the hero Gotham needs -hell F'ing yes!&quot;]</li> </ul></li><br> <li><a id="78"></a><a href="#78">2022-04-14</a> <ul> <li><b>Elon to Steve Davis</b>: [&quot;liked&quot; &quot;Amazing! Not sure which plan to root for. If Plan B wins, let me know if blockchain engineers would be helpful.&quot;]</li> </ul></li><br> <li><a id="79"></a><a href="#79">2022-04-15</a> <ul> <li><b>Elon to Omead Afshar</b>: [&quot;laughed at&quot; &quot;Who knew a Saudi Arabian prince had so much leverage and so much to say about twitter.&quot;]</li> </ul></li><br> <li><a id="80"></a><a href="#80">2022-04-20</a> <ul> <li><b>Brian Kingston</b>: Hi Elon-it's Brian Kingston at Brookfield.
There was an artide today in the FT that said we (Brookfield) have &quot;decided against providing an equity cheque• for a Twitter buyout. I Just wanted to let you know that didn't came from us-we would never comment (on or off the record) about something like that, particularly when it relates to one of our partners. We appreciate all that we are doing on solar together and you allowing us to participate in the Boring Co raise this week. While I'm sure you don't believe anything you read in the FT anyway, I'm sorry if the article caused any aggravation. If there is anything we can do to be helpful, please do let me know.</li> </ul></li><br> <li><a id="81"></a><a href="#81">2022-04-23 to 2022-05-09</a> <ul> <li><b>Michael Grimes</b>: Michael Grimes here so you have my number and know who is calling. Dialing you now</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Michael</b>: <a href="https://youtu.be/DOW1V0kOELA">https://youtu.be/DOW1V0kOELA</a></li> <li><b>Elon</b>: [&quot;laughed at&quot; above]</li> <li><b>Michael</b>: If you have a second to chat</li> <li><b>Michael</b>: Perfect.</li> <li><b>Michael</b>: got it. Will forward the equity Interest email to Jared and Alex that he sent in and have It in the queue in the event his interest is needed overall. Absent the blockchain piece he's focused on investing if you want his interest in Twitter and your mission but we can park him for now.</li> <li><b>Michael</b>: Agree. Was one piece of equation and I do think he would be at least3bn if you like him and want him, maybe more. Will work with Jared and Alex to be sure it makes sense to meet -my instinct is it does because Orlando Brace also declined today in the end (not sure if political fears or what but he fiaked today}.</li> <li><b>[2022-05-04] Elon</b>: No response from Bret, not even an interest in talking. I think it's probably best to release the debt tomorrow. This might take a while.</li> <li><b>Michael Grimes</b>: Nikesh came to see me this afternoon. Just to talk Twitter and you. If you had the time he would cancel his plans tomorrow night to meet with you and come to where you are in SF or mid peninsula Or he could fly to Austin another time of course. If you want me to send him to you let me know and he will break his lans to do</li> <li><b>Elon</b>: It's fine, no need to break his plans.</li> <li><b>Michael</b>: Got it.</li> <li><b>Michael</b>: I asked Pat and Kristina to each spend the weekend writing up their transition and diligence plan and how to approach debt rating agencies on may 16. We need one of them signed up (employment contract for 3 months) as Transition CFO of X Holdings and owning the model and diligence from financial point of view on the follow up meetings with Twitter on costs and users and engineers etc. We believe two will not work at the agencies or in front of debt investors as you have to have one CFO. If you were willing to have SVP Ops of X Holdings (Pat would be more qualified for that than Kristina I then it's possible to retain them both for the transition. The way to stay on ludicrous speed Is to pick one of them tomorrow as transition CFO and then we run with it full metal jacket. I believe each can do the job and deliver the ratings and debt and transition plan for day one.
Then you dismiss him/her as job well done or offer permanent CFO if you choose.</li> <li><b>Elon</b>: Neither were great.</li> <li><b>Elon</b>: They asked no good questions and had no good comments.</li> <li><b>Elon</b>: Let's slow down just a few days</li> <li><b>Elon</b>: Putin's speech tomorrow is extremely important</li> <li><b>Elon</b>: It won't make sense to buy Twitter if we're headed into WW3</li> <li><b>Elon</b>: Just sayin</li> <li><b>Michael</b>: Understood. If the pace stays rapid each are good enough to get job done for the debt Then you hire great for go forward. But will pause for May 9 Vladimir and hope for the best there. We can take stock of where things look after that.</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Elon</b>: An extremely fundamental due diligence item is understanding exactly how Twitter confims that 95% of their daily active users are both real people and not double-counted.</li> <li><b>Elon</b>: They couldn't answer that on Friday, which is insane.</li> <li><b>Elon</b>: If that number is more like 50% or lower, which is what I would guess based on my feed, then they have been fundamentally misrepresenting the value of Twitter to advertisers and investors.</li> <li><b>Elon</b>: To be super clear, this deal moves forward if it passes due diligence, but obviously not if there are massive gaping issues.</li> <li><b>Elon</b>: True user account is a showstopper if actually much lower than the 95% claimed</li> <li><b>Elon</b>: Parag said thatTwitter has 2500 coders doing at least 100 lines per month. Maybe they could fit this feature in ... <a href="https://twitter.com/skylerrainnj/status/1523616659365277698?s=1O&amp;t=1qmVNhjQPeHafBPEHiFrRQ">https://twitter.com/skylerrainnj/status/1523616659365277698?s=1O&amp;t=1qmVNhjQPeHafBPEHiFrRQ</a></li> </ul></li><br> <li><a id="82"></a><a href="#82">2022-04-25</a> <ul> <li><b>Adeo Ressi</b>: &lt;Attachment-image/jpego-Elon Musk and Twitter Reach Deal on Sale Live Up....jpeg&gt;</li> <li><b>Adeo</b>: Congrats? This will be a good thing.</li> <li><b>Elon</b>: I hope so :)</li> <li><b>Adeo</b>: You've had ideas on how to fix that companyfor A LONG TIME. The time is now.</li> <li><b>Adeo</b>: I think it's exciting.</li> </ul></li><br> <li><a id="83"></a><a href="#83">2022-04-25</a> <ul> <li><b>James Musk</b>: Congrats! Super important to solve the bot problem.</li> <li><b>Elon</b>: Thanks</li> <li><b>Elon</b>: The bot problem is severe</li> </ul></li><br> <li><a id="84"></a><a href="#84">2022-04-27</a> <ul> <li><b>Elon to Reid Hoffman</b>: This is Elon</li> </ul></li><br> <li><a id="85"></a><a href="#85">2022-05-01</a> <ul> <li><b>Elon to Sean Parker [VC and founder of Napster]</b>: Am at my Mom's a apartment, doing Twitter dilligence calls</li> </ul></li><br> <li><a id="86"></a><a href="#86">2022-05-02</a> <ul> <li><b>Jason Calacanis</b>: <a href="https://twitter.com/elonmusk/status/1521158715193315328?s=1O&amp;t=htc_On6KY9B9C4VtllFIO">https://twitter.com/elonmusk/status/1521158715193315328?s=1O&amp;t=htc_On6KY9B9C4VtllFIO</a></li> <li><b>Jason</b>: one thing you can do in this regard is an SPV of 250 folks capped at $10m .. 
pain on the next for a large company but one item on cap table</li> <li><b>Jason</b>: You do have to have someone lead/man a e the SPV</li> <li><b>Elon</b>: Go ahead</li> <li><b>Jason</b>: [&quot;liked&quot; above]</li> <li><b>Jason</b>: When you're private its fairly easy to do, but I think current shareholders have to re-up</li> <li><b>Elon</b>: Are you sure?</li> <li><b>Jason</b>: I am not</li> <li><b>Jason</b>: Have never done a take private</li> <li><b>Jason</b>: Large shareholders (QPs) are likely different than non-accredited investors</li> <li><b>[2022-05-12] Elon</b>: What's going on with ou marketing an SPV to randos? This is not ok.</li> <li><b>Jason</b>: Not randos, I have the largest angel syndicate and that's how I invest. We've done 25D+ deals like this and we know all the folks. I though that was how folks were doin it.</li> <li><b>Jason</b>: $100m+ on commitments, but if that not ok it's fine. Just wanted to support the effort.</li> <li><b>Jason</b>: ~300 QPs and 200 accredited investors said they would do it. It's not an open process obviously, only folks alread in our syndicate.</li> <li><b>Jason</b>: There is <em>massive</em> demand to support uyour effort btw...people really want to see you win.</li> <li><b>Elon</b>: Morgan Stanley and Jared thing you are using our friendship not in a good way</li> <li><b>Elon</b>: This makes it seem like I'm desperate</li> <li><b>Elon</b>: Please sto</li> <li><b>Jason</b>: Only ever want to support you.</li> <li><b>Jason</b>: Clearly you're not desperate -you have the worlds greatest investors voting in support of a deal you already have covered. you're overfunded. will quietly cancel it... And to be clear, I'm not out actively soliciting folks.These are our exiting LPs not rondos. Sorry forany trouble</li> <li><b>Elon</b>: Morgan Stanley and Jared are very upset</li> <li><b>Jason</b>: Ugh</li> <li><b>Jason</b>: SPVs are how everyone is doing there deals now... Like loved to SPVs etc</li> <li><b>Jason</b>: Just trying to support you... obviously, I reached out to Jared and sort it out.</li> <li><b>Jason</b>: * moved</li> <li><b>Elon</b>: Yes, I had to ask him to stop.</li> <li><b>Elon</b>: [&quot;liked&quot; &quot;Just trying to support...&quot;]</li> <li><b>Jason</b>: Cleaned it up with Jared</li> <li><b>Elon</b>: [&quot;liked&quot; above]</li> <li><b>Jason</b>: I get where he is coming from.... Candidly, This deal has just captures the worlds imagination In an unimaginable way. It's bonkers...</li> <li><b>Jason</b>: And you know I'm ride or die brother - I'd jump on a grande for you</li> <li><b>Elon</b>: [&quot;loved&quot; above]</li> </ul></li><br> <li><a id="87"></a><a href="#87">2022-05-05</a> <ul> <li><b>Elon to Sam BF</b>: Sorry, who is sending this message?</li> </ul></li><br> <li><a id="88"></a><a href="#88">2022-05-05</a> <ul> <li><b>Elon to James Murdoch</b>: In LA right now.
SF tomorrow to due dilligence on Twitter.</li> </ul></li><br> <li><a id="89"></a><a href="#89">2022-05-05</a> <ul> <li><b>Elon to John Elkann [heir of Gianni Agnelli]</b>: Sorry, I have to be at Twitter HQ tomorrow afternoon for due dilligence.</li> </ul></li><br> <li><a id="90"></a><a href="#90">2022-05-05</a> <ul> <li><b>David Sacks</b>: [&quot;liked&quot; &quot;Best to be low-key during transaction &quot;]</li> </ul></li><br> <li><a id="91"></a><a href="#91">2022-05-10</a> [unclear what's happening due to redactions; may be more than one convo here] <ul> <li><b>Antonio Gracias</b>: Connecting you.</li> <li><b>[redacted]</b>: Hi Elon This is Peter and my numbers. Look forward to being helpful Bob</li> <li><b>Elon</b>: Got it</li> <li><b>Elon</b>: Should we use the above two numbers for the conf call?</li> <li><b>[redacted]</b>: Sure</li> </ul></li><br> <li><a id="92"></a><a href="#92">2022-06-16 to 2022-06-17</a> <ul> <li><b>[redacted]</b>: If I understood them correctly, Ned [presumebly Ned Segal, CFO of Twitter] and Parag said that cash expenditures over the next 12 months will be $78 and that cash receipts will also be $78. However, the cash receipts number doesn't seem realistic, given that they expect only $1.2B in CU, which is just $4.8B annualized.</li> <li><b>[redacted]</b>: In europe so just getting your msg. i do not have proxy w me but my guess is they are using their proxy numbers vs current reality. we are developing proformas that have lower revenue/receipts and lower disbursements.</li> <li><b>Elon</b>: Ok. Given that Q2 is almost over, itobviousl doesn't make sense for them to use proxy numbers vs [looks like something is cut off here — seems like text is in a spreadsheet and word wrap wasn't used on this row, which was then printed and scanned in]</li> <li><b>Elon</b>: I'm traveling in Europe right now, but back next week</li> <li><b>[redacted]</b>: i spokewned on the 7b receipts/expenses. he said he was trying to be more illustrative on '23 expense base, pre any actions we would take and provide a simplified strawman of possible savings. he said they are not planning on doing an updated fcst for 22/23. i think this Is ok re process since i think their fcst would not likely be very good and we wouldn't likely use It anyways. They fly at way too high a level to have a fcst of much value. We are in process of developing revenue fcst and a range of sensitivities and will then walk thru w them to get their input.</li> <li><b>Elon</b>: Their revenue projections seem disconnected from reality</li> <li><b>[redacted]</b>: completely.</li> <li><b>Elon</b>: Phew, it's not just me</li> </ul></li><br> </ul> <h3 id="equity-financing-commitments-from-2022-05-05-sec-filing-https-www-sec-gov-archives-edgar-data-1418091-000110465922056055-tm2214608-1-sc13da-htm"><a href="https://www.sec.gov/Archives/edgar/data/1418091/000110465922056055/tm2214608-1_sc13da.htm">Equity / financing commitments from 2022-05-05 SEC filing</a></h3> <p>If you're curious about the outcomes of the funding discussions above, the winners are listed in the Schedule 13D</p> <ul> <li><b>HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud (Kingdom)</b>: ~$1.9B</li> <li><b>Lawrence J. Ellison Revocable Trust</b>: $1B</li> <li><b>Sequoia Capital Fund, L.P.</b>: $0.8B</li> <li><b>VyCapital</b>: $0.7B</li> <li><b>Binance</b>:$0.5B</li> <li><b>AH Capital Management, L.L.C. 
(a16z)</b>: $0.4B</li> <li><b>Qatar Holding LLC</b>: $0.375B</li> <li><b>Aliya Capital Partners LLC</b>: $0.36B</li> <li><b>Fidelity Management & Research Company LLC</b>: ~$0.316B</li> <li><b>Brookfield</b>: $0.25B</li> <li><b>Strauss Capital LLC</b>: $0.15B</li> <li><b>BAMCO, Inc. (Baron)</b>: $0.1B</li> <li><b>DFJ Growth IV Partners, LLC</b>: $0.1B</li> <li><b>Witkoff Capital</b>: $0.1B</li> <li><b>Key Wealth Advisors LLC</b>: $0.03B</li> <li><b>A.M. Management & Consulting</b>: $0.025B</li> <li><b>Litani Ventures</b>: $0.025B</li> <li><b>Tresser Blvd 402 LLC (Cartenna)</b>: $0.0085B</li> <li><b>Honeycomb Asset Management LP</b>: $0.005B</li></ul> <p><i>Thanks to @tech31842, @agentwaj, and mr. zip for OCR corrections</i></p> Futurist prediction methods and accuracy futurist-predictions/ Mon, 12 Sep 2022 00:00:00 +0000 futurist-predictions/ <p>I've been reading a lot of predictions from people who are looking to understand what problems humanity will face 10-50 years out (and sometimes longer) in order to work in areas that will be instrumental for the future and wondering how accurate these predictions of the future are. The timeframe of predictions that are so far out means that only a tiny fraction of people making those kinds of predictions today have a track record so, if we want to evaluate which predictions are plausible, we need to look at something other than track record.</p> <p>The idea behind the approach of this post was to look at predictions from an independently chosen set of predictors (Wikipedia's list of well-known futurists<sup class="footnote-ref" id="fnref:W"><a rel="footnote" href="#fn:W">1</a></sup>) whose predictions are old enough to evaluate in order to understand which prediction techniques worked and which ones didn't work, allowing us to then (mostly in a future post) evaluate the plausibility of predictions that use similar methodologies.</p> <p>Unfortunately, every single predictor from the independently chosen set had a poor record and, on spot checking some predictions from other futurists, it appears that futurists often have a fairly poor track record of predictions so, in order to contrast techniques that worked with techniques that didn't, I sourced predictors that have a decent track record from my memory, a non-independent source, which introduces quite a few potential biases.</p> <p>Something that gives me more confidence than I'd otherwise have is that I avoided reading independent evaluations of prediction methodologies until after I did the evaluations for this post and wrote 98% of the post and, on reading other people's evaluations, I found that I generally agreed with Tetlock's <a href="https://amzn.to/3xzG3a2">Superforecasting</a> on what worked and what didn't work despite using a wildly different data set.</p> <p>In particular, people who were into &quot;big ideas&quot; who use a few big hammers on every prediction combined with a <a href="cocktail-ideas/">cocktail party idea</a> level of understanding of the particular subject to explain why a prediction about the subject would fall to the big hammer generally fared poorly, whether or not their favored big ideas were correct. Some examples of &quot;big ideas&quot; would be &quot;environmental doomsday is coming and hyperconservation will pervade everything&quot;, &quot;economic growth will create near-infinite wealth (soon)&quot;, &quot;Moore's law is supremely important&quot;, &quot;quantum mechanics is supremely important&quot;, etc. 
Another common trait of poor predictors is lack of anything resembling serious evaluation of past predictive errors, making improving their intuition or methods impossible (unless they do so in secret). Instead, poor predictors often pick a few predictions that were accurate or at least vaguely sounded similar to an accurate prediction and use those to sell their next generation of predictions to others.</p> <p>By contrast, people who had (relatively) accurate predictions had a deep understanding of the problem and also tended to have a record of learning lessons from past predictive errors. Due to the differences in the data sets between this post and Tetlock's work, the details are quite different here. The predictors that I found to be relatively accurate had deep domain knowledge and, implicitly, had access to a huge amount of information that they filtered effectively in order to make good predictions. Tetlock was studying people who made predictions about a wide variety of areas that were, in general, outside of their areas of expertise, so what Tetlock found was that people really dug into the data and deeply understood the limitations of the data, which allowed them to make relatively accurate predictions. But, although the details of how people operated are different, at a high-level, the approach of really digging into specific knowledge was the same.</p> <p>Because this post is so long, this post will contain a very short summary about each predictor followed by a moderately long summary on each predictor. Then we'll have a summary of what techniques and styles worked and what didn't work, with the full details of the prediction grading and comparisons to other evaluations of predictors in the appendix.</p> <ul> <li>Ray Kurzweil: 7% accuracy <ul> <li>Relies on: exponential or super exponential progress that is happening must continue; predicting the future based on past trends continuing; optimistic &quot;rounding up&quot; of facts and interpretations of data; <a href="https://twitter.com/arxanas/status/1560756277231644673">panacea thinking</a> about technologies and computers; cocktail party ideas on topics being predicted</li> </ul></li> <li>Jacque Fresco: predictions mostly too far into the future to judge, but seems very low for judgeable predictions <ul> <li>Relies on: panacea thinking about human nature, the scientific method, and computers; certainty that human values match Fresco's values</li> </ul></li> <li>Buckminster Fuller: too few predictions to rate, but seems very low for judgeable predictions <ul> <li>Relies on: cocktail party ideas on topics being predicted to an extent that's extreme even for a futurist</li> </ul></li> <li>Michio Kaku: 3% accuracy <ul> <li>Relies on: panacea thinking about &quot;quantum&quot;, computers, and biotech; exponential progress of those</li> </ul></li> <li>John Naisbitt: predictions too vague to score; mixed results in terms of big-picture accuracy, probably better than any futurist here other than Dixon, but this is not comparable to the percentages given for other predictors <ul> <li>Relies on: trend prediction based on analysis of newspapers</li> </ul></li> <li>Gerard K. 
O'Neill: predictions mostly too far into the future to judge, but seems very low for judgeable predictions <ul> <li>Relies on: doing the opposite of what other futurists had done incorrectly, could be described as &quot;trying to buy low and sell high&quot; based on looking at prices that had gone up a lot recently; optimistic &quot;rounding up&quot; of facts and interpretations of data in areas O'Neill views as underrated; cocktail party ideas on topics being predicted</li> </ul></li> <li>Patrick Dixon: 10% accuracy; also much better at &quot;big picture&quot; predictions than any other futurist here (but not in the same league as non-futurist predictors such as Yegge, Gates, etc.) <ul> <li>Relies on: extrapolating existing trends (but with much less optimistic &quot;rounding up&quot; than almost any other futurist here); exponential progress; stark divide between &quot;second millennial thinking&quot; and &quot;third millennial thinking&quot;</li> </ul></li> <li>Alvin Toffler: predictions mostly too vague to score; of non-vague predictions, Toffler had an incredible knack for naming a trend as very important and likely to continue right when it was about to stop <ul> <li>Relies on: exponential progress that is happening must continue; a medley of cocktail party ideas inspired by speculation about what exponential progress will bring</li> </ul></li> <li>Steve Yegge: 50% accuracy; general vision of the future generally quite accurate <ul> <li>Relies on: deep domain knowledge, font of information flowing into Amazon and Google; looking at what's trending</li> </ul></li> <li>Bryan Caplan: 100% accuracy <ul> <li>Relies on: taking the &quot;other side&quot; of bad bets/predictions people make and mostly relying on making very conservative predictions</li> </ul></li> <li>Bill Gates / Nathan Myhrvold / old MS leadership: timeframe of predictions too vague to score, but uncanny accuracy on a vision of the future as well as the relative importance of various technologies <ul> <li>Relies on: deep domain knowledge, discussions between many people with deep domain knowledge, font of information flowing into Microsoft</li> </ul></li> </ul> <h3 id="ray-kurzweil">Ray Kurzweil</h3> <p>Ray Kurzweil has claimed to have an 86% accuracy rate on his predictions, a claim which is often repeated, such as by Peter Diamandis where he says:</p> <blockquote> <p>Of the 147 predictions that Kurzweil has made since the 1990's, fully 115 of them have turned out to be correct, and another 12 have turned out to be &quot;essentially correct&quot; (off by a year or two), giving his predictions a stunning 86% accuracy rate.</p> </blockquote> <p>The article is titled &quot;A Google Exec Just Claimed The Singularity Will Happen by 2029&quot; opens with &quot;Ray Kurzweil, Google's Director of Engineering, is a well-known futurist with a high-hitting track record for accurate predictions.&quot; and it cites <a href="https://web.archive.org/web/20170225013846/https://en.wikipedia.org/wiki/Predictions_made_by_Ray_Kurzweil">this list of predictions on wikipedia</a>. 86% is an astoundingly good track record for non-obvious, major, predictions about the future. This claim seems to be the source of other people claiming that Kurzweil has a high accuracy rate, <a href="https://nitter.ca/naijaflyingdr/status/1552751297908117504">such as here</a> and here. 
I checked the accuracy rate of the wikipedia list Diamandis cited myself (using archive.org to get the list from when his article was published) and found a somewhat lower accuracy of 7%.</p> <p>Fundamentally, the thing that derailed so many of Kurzweil's predictions is that he relied on the idea of exponential and accelerating growth in basically every area he can imagine, and even in a number of areas that have had major growth, the growth didn't keep pace with his expectations. His basic thesis is that not only do we have exponential growth due to progress (improve technologically, etc.), improvement in technology feeds back into itself, causing an increase in the rate of exponential growth, so we have double exponential growth (as in <code>e^x^x, not 2*e^x</code>) in many important areas, such as computer performance. He repeatedly talks about this unstoppable exponential or super exponential growth, e.g., in his 1990 book, <a href="https://amzn.to/3LnrpYY">The Age of Intelligent Machines</a>, he says &quot;One reliable prediction we can make about the future is that the pace of change will continue to accelerate&quot; and he discusses this again in his 1999 book, <a href="https://amzn.to/3DCsj1U">The Age of Spiritual Machines</a>, his 2001 essay on accelerating technological growth, titled &quot;The Law of Accelerating Returns&quot;, his 2005 book, <a href="https://amzn.to/3LnvTik">The Singularity is Near</a>, etc.</p> <p>One thing that's notable is despite the vast majority of his falsifiable predictions from earlier work being false, Kurzweil continues to use the same methodology to generate new predictions each time, which is <a href="http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics11.pdf">reminiscent of Andrew Gelman's discussion of forecasters who repeatedly forecast the same thing over and over again in the face of evidence that their old forecasts were wrong</a>. For example, in his 2005 The Singularity is Near, Kurzweil notes the existence of &quot;S-curves&quot;, where growth from any particular &quot;thing&quot; isn't necessarily exponential, but, as he did in 1990, concludes that exponential growth will continue because some new technology will inevitably be invented which will cause exponential growth to continue and that &quot;The law of accelerating returns applies to all of technology, indeed to any evolutionary process. It can be charted with remarkable precision in information-based technologies because we have well-defined indexes (for example, calculations per second per dollar, or calculations per second per gram) to measure them&quot;.</p> <p>In 2001, he uses this method to plot a graph and then predicts unbounded life expectancy by 2011 (the quote below isn't unambiguous on life expectancy being unbounded, but it's unambiguous if you read the entire essay or his clarification on his life expectancy predictions, where he says &quot;I don’t mean life expectancy based on your birthdate, but rather your remaining life expectancy&quot;):</p> <blockquote> <p>Most of you (again I’m using the plural form of the word) are likely to be around to see the Singularity. The expanding human life span is another one of those exponential trends. In the eighteenth century, we added a few days every year to human longevity; during the nineteenth century we added a couple of weeks each year; and now we’re adding almost a half a year every year. 
With the revolutions in genomics, proteomics, rational drug design, therapeutic cloning of our own organs and tissues, and related developments in bio-information sciences, we will be adding more than a year every year within ten years.</p> </blockquote> <p>Kurzweil pushes the date this is expected to happen back by more than one year per year (the last citation I saw on this was a 2016 prediction that we would have unbounded life expectancy by 2029), which is characteristic of many of Kurzweil's predictions.</p> <p>Quite a few people have said that Kurzweil's methodology is absurd because exponential growth can't continue indefinitely in the real world, but Kurzweil explains why he believes this is untrue in his 1990 book, The Age of Intelligent Machines:</p> <blockquote> <p>A remarkable aspect of this new technology is that it uses almost no natural resources. Silicon chips use infinitesimal amounts of sand and other readily available materials. They use insignificant amounts of electricity. As computers grow smaller and smaller, the material resources utilized are becoming an inconsequential portion of their value. Indeed, software uses virtually no resources at all.</p> </blockquote> <p>That we're entering a world of natural resource abundance because <a href="datacenter-power/">resources and power are irrelevant to computers hasn't been correct so far</a>, but luckily for Kurzweil, many of the exponential and double exponential processes he predicted would continue indefinitely stopped long before natural resource limits would come into play, so this wasn't a major reason Kurzweil's predictions have been wrong, although it would be if his predictions were less inaccurate.</p> <p>At a meta level, one issue with Kurzweil's methodology is that he has a propensity to &quot;round up&quot; to make growth look faster than it is in order to fit the world to his model. For example, in &quot;The Law of Accelerating Returns&quot;, we noted that Kurzweil predicted unbounded lifespan by 2011 based on accelerating lifespan when &quot;now we’re adding almost a half a year every year&quot; in 2001. However, life expectancy growth in the U.S. (which, based on his comments, seems to be most of what Kurzweil writes about) was <a href="https://pubmed.ncbi.nlm.nih.gov/15008552/">only 0.2 years per year overall and 0.1 years per year in longer lived demographics</a> and worldwide life expectancy was 0.3 years per year. While it's technically true that you can round 0.3 to 0.5 if you're rounding to the nearest 0.5, that's a very unreasonable thing to do when trying to guess when unbounded lifespan will happen because the high rate of worldwide increase life expectancy was mostly coming from &quot;catch up growth&quot; where there was a large reduction in things that caused &quot;unnaturally&quot; shortened lifespans.</p> <p>If you want to predict what's going to happen at the high end, it makes more sense to look at high-end lifespans, which were increasing much more slowly. Another way in which Kurzweil rounded up to get his optimistic prediction was to select a framing that made it look like we were seeing extremely rapid growth in life expectancies. But if we simply plot life expectancy over time since, say, 1950, we can see that growth is mostly linear-ish trending to sub-linear (and this is true even if we cut the graph off when Kurzweil was writing in 2001), with some super-linear periods that trend down to sub-linear. 
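<p>To get a rough sense of how much this particular bit of rounding up matters, here's a small back-of-the-envelope sketch. It's my own illustration, not a calculation from the essay or from Kurzweil: it charitably reads Kurzweil's claim of going from almost half a year of added life expectancy per year to more than a year per year within a decade as the growth rate doubling every ten years, and then asks how long the same acceleration would take to cross the one-year-per-year threshold when starting from the rates that were actually observed.</p> <pre><code>import math

# Assumption (not Kurzweil's stated model): the yearly gain in life expectancy
# doubles every decade. 'Unbounded' life expectancy requires the gain to reach
# one year of life expectancy per calendar year.
def years_until_escape(rate_per_year, doubling_time=10.0):
    # years t until rate_per_year * 2**(t / doubling_time) reaches 1.0
    return doubling_time * math.log2(1.0 / rate_per_year)

rates = {
    'rounded-up rate of almost 0.5 yr/yr': 0.5,
    'worldwide average of about 0.3 yr/yr': 0.3,
    'US overall rate of about 0.2 yr/yr': 0.2,
    'longer-lived demographics at about 0.1 yr/yr': 0.1,
}
for label, rate in rates.items():
    print(label, '-', round(years_until_escape(rate)), 'years from 2001')
# 0.5 crosses the threshold after about 10 years (2011, the claimed date);
# 0.2 takes about 23 years and 0.1 about 33 years, so the rounded-up starting
# rate alone moves the predicted date by one to two decades, before even asking
# whether the doubling-per-decade assumption is justified.</code></pre>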
Kurzweil says he's a fan of using indexes, etc., to look at growth curves, but in this case where he can easily do so, he instead chooses to pick some numbers out of the air because his &quot;standard&quot; methodology of looking at the growth curves results in a fairly boring prediction of lifespan growth slowing down, so there are three kinds of rounding up in play here (picking an unreasonably optimistic number, rounding up that number, and then selectively not plotting a bunch of points on the time series to paint the picture Kurzweil wants to present).</p> <p>Kurzweil's &quot;rounding up&quot; is also how he came up with the predictions that, among other things, computer performance/size/cost and economic growth would follow double exponential trajectories. For computer cost / transistor size, Kurzweil plotted, on a log scale, a number of points on the silicon scaling curve, plus one very old point from the pre-silicon days, when transistor size was on a different scaling curve. He then fits what appears to be a cubic to this, and since a cubic &quot;wants to&quot; either have high growth or high anti-growth in the future, and the pre-silicon point pulls the cubic fit very far down in the past, the cubic fit must &quot;want to&quot; go up in the future and Kurzweil rounds up this cubic growth to exponential. This was also very weakly supported by the transistor scaling curve at the time Kurzweil was writing. As someone who was following <a href="https://en.wikipedia.org/wiki/International_Technology_Roadmap_for_Semiconductors">ITRS roadmaps</a> at the time, my recollection is that ITRS set a predicted Moore's law scaling curve and semiconductor companies raced to beat the curve, briefly allowing what appeared to be super-exponential scaling since they would consistently beat the roadmap, which was indexed against Moore's law. However, anyone who actually looked at the details of what was going on or talked to semiconductor engineers instead of just looking at the scaling curve would've known that people generally expected both that super-exponential scaling was temporary and not sustainable and that <a href="https://en.wikipedia.org/wiki/Dennard_scaling">the end of Dennard scaling</a> as well as transistor-delay dominated (as opposed to interconnect delay-dominated) high-performance processors were imminent, meaning that exponential scaling of transistor sizes would not lead to the historical computer performance gains that had previously accompanied transistor scaling; this expectation was so widespread that it was discussed in undergraduate classes at the time. 
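<p>To make the curve-fitting mechanics described above concrete, here's a minimal sketch with entirely made-up numbers rather than Kurzweil's actual data: fit the recent, silicon-era points on a log scale with a straight line (plain exponential growth), then add a single much older pre-silicon point and fit a cubic to everything, and compare the two forward extrapolations.</p> <pre><code>import numpy as np

# Made-up silicon-era points: a straight line in log10(performance per dollar)
# space, i.e. plain exponential growth, doubling roughly every two years.
t_silicon = np.arange(0, 31, 5)            # years since 1970: 0, 5, ..., 30
log_perf_silicon = 0.15 * t_silicon

# Add one much older, pre-silicon point sitting well below that trend line.
t_all = np.insert(t_silicon, 0, -30.0)     # 1940
log_perf_all = np.insert(log_perf_silicon, 0, -8.0)

line = np.polyfit(t_silicon, log_perf_silicon, 1)   # recent data only
cubic = np.polyfit(t_all, log_perf_all, 3)          # all data, old outlier included

for future in (40.0, 50.0, 60.0):          # 2010, 2020, 2030
    print(int(1970 + future),
          'line:', round(float(np.polyval(line, future)), 1),
          'cubic:', round(float(np.polyval(cubic, future)), 1))
# The pre-silicon point drags the cubic far down in the past, which forces it
# to curve upward in the future, so the extrapolated cubic pulls further and
# further ahead of the plain exponential trend that the recent points follow.</code></pre>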
Anyone who spent even the briefest amount of time looking into semiconductor scaling would've known these things at the time Kurzweil was talking about how we were entering an era of double exponential scaling and would've thought that we would be lucky to even have general single exponential scaling of computer performance, but since Kurzweil looks at the general shape of the curve and not the mechanism, none of this knowledge informed his predictions, and since Kurzweil rounds up the available evidence to support his ideas about accelerating acceleration of growth, he was able to find a selected set of data points that supported the curve fit he was looking for.</p> <p>We'll see this kind of rounding up done by other futurists discussed here, as well as longtermists discussed in the appendix, and we'll also see some of the same themes over and over again, particularly exponential growth and the idea that exponential growth will lead to even faster exponential growth due to improvements in technology causing an acceleration of the rate at which technology improves.</p> <h3 id="jacque-fresco">Jacque Fresco</h3> <p>In 1969, Jacque Fresco wrote <a href="https://amzn.to/3LmxYuO">Looking Forward</a>. Fresco claims it's possible to predict the future by knowing what values people will have in the future and then using that to derive what the future will look like. Fresco doesn't describe how one can know the values people will have in the future and assumes people will have the values he has, which one might describe as 60s/70s hippy values. Another major mechanism he uses to predict the future is the idea that people of the future will be more scientific and apply the scientific method.</p> <p>He writes about how &quot;the scientific method&quot; is only applied in a limited fashion, which led to thousands of years of slow progress. But, unlike in the 20th century, in the 21st century, people will be free from bias and apply &quot;the scientific method&quot; in all areas of their life, not just when doing science. People will be fully open to experimentation in all aspects of life and all people will have &quot;a habitual open-mindedness coupled with a rigid insistence that all problems be formulated in a way that permits factual checking&quot;.</p> <p>This will, among other things, lead to complete self-knowledge of one's own limitations for all people as well as an end to unhappiness due to suboptimal political and social structures.</p> <p>The third major mechanism Fresco uses to derive his predictions is the idea that computers will be able to solve basically any problem one can imagine and that manufacturing technology will also progress similarly.</p> <p>Each of the major mechanisms that are in play in Fresco's predictions is indistinguishable from magic. If you can imagine a problem in the domain, the mechanism is able to solve it. There are other magical mechanisms in play as well, generally what was in the air at the time. 
For example, behaviorism and operant conditioning were very trendy at the time, so Fresco assumes that society at large will be able to operant condition itself out of any social problems that might exist.</p> <p>Although most of Fresco's predictions are technically not yet judgable because they're about the far future, for the predictions he makes whose time has come, I didn't see one accurate prediction.</p> <h3 id="buckminster-fuller">Buckminster Fuller</h3> <p>Fuller is <a href="https://news.ycombinator.com/item?id=32462494">best known for inventing the geodesic dome</a>, although geodesic domes were actually made by Walther Bauersfeld decades before Fuller &quot;invented&quot; them. Fuller is also known for a variety of other creations, like the <a href="https://slate.com/technology/2022/08/the-dymaxion-car-the-true-history-of-buckminster-fullers-failed-automobile.html">Dymaxion car</a>, as well as his futurist predictions.</p> <p>I couldn't find a great source of a very long list of predictions from Fuller, but I did find <a href="https://www.youtube.com/watch?v=wPETzKYLkco">this interview, where he makes a number of predictions</a>. Fuller basically free associates with words, making predictions by riffing off of the English meaning of the word (e.g., see the teleportation prediction) or sometimes an even vaguer link.</p> <p>Predictions from the video:</p> <ul> <li>We'll be able to send people by radio because atoms have frequencies and radio waves have frequencies so it will be possible to pick up all of our frequencies and send them by radio</li> <li>Undeveloped countries (as opposed to highly developed countries) will be able to get the most advanced technologies &quot;via the moon&quot; <ul> <li>We're going to put people on the moon for a year, which will require putting something like mile diameter of earth activity into a little black box weighing 500 lbs so that the moon person will be able to operate locally as if they were on earth</li> <li>This will result in everyone realizing they could just get a little black box and they'll no longer need local sewer systems, water, power, etc.</li> </ul></li> <li>Humans will be fully automated out of physical work <ul> <li>The production capability of China and India will be irrelevant and the only thing that will matter is who can &quot;get&quot; the consumers from China and India</li> </ul></li> <li>There will be a realistic accounting system of what wealth is, which is really about energy due to the law of conservation of energy, which also means that wealth won't deteriorate and get lost <ul> <li>Wealth can only increase because energy can't be created or destroyed and when you do an experiment, you can only learn more, so wealth can only be created</li> <li>This will make the entire world successful</li> </ul></li> </ul> <p>For those who've heard that Fuller predicted the creation of Bitcoin, that last prediction about an accounting system for wealth is the one people are referring to. Typically, people who say this haven't actually listened to the interview where he states the whole prediction and are themselves using Fuller's free association method. Bitcoin comes from spending energy to mine Bitcoin and Fuller predicted that the future would have a system of wealth based on energy, therefore Fuller predicted the creation of Bitcoin. 
If you actually listen to the interview, Bitcoin doesn't even come close to satisfying the properties of the system Fuller describes, but that doesn't matter if you're doing Fuller-style free association.</p> <p>In this post, Fuller has fewer predictions graded than almost anyone else, so it's unclear what his accuracy would be if we had a list of, say, 100 predictions, but the predictions I could find have a 0% accuracy rate.</p> <h3 id="michio-kaku">Michio Kaku</h3> <p>Among people on Wikipedia's futurist list, Michio Kaku is probably relatively well known because, as part of his work on science popularization, he's had a nationally (U.S.) syndicated radio show since 2006 and he frequently appears on talk shows and is interviewed by news organizations.</p> <p>In his 1997 book, <a href="https://amzn.to/3BMxY4g">Visions: How Science Will Revolutionize the 21st Century</a>, Kaku explains why predictions from other futurists haven't been very accurate and why his predictions are different:</p> <blockquote> <p>... most predictions of the future have floundered because they have reflected the eccentric, often narrow viewpoints of a single individual.</p> <p>The same is not true of Visions. In the course of writing numerous books, articles, and science commentaries, I have had the rare privilege of interviewing over 150 scientists from various disciplines during a ten-year period.</p> <p>On the basis of these interviews, I have tried to be careful to delineate the time frame over which certain predictions will or will not be realized. Scientists expect some predictions to come about by the year 2020; others will not materialize until much later—from 2050 to the year 2100.</p> </blockquote> <p>Kaku also claims that his predictions are more accurate than many other futurists because he's a physicist and thinking about things in the ways that physicists do allows for accurate predictions of the future:</p> <blockquote> <p>It is, I think, an important distinction between Visions, which concerns an emerging consensus among the scientists themselves, and the predictions in the popular press made almost exclusively by writers, journalists, sociologists, science fiction writers, and others who are <i>consumers</i> of technology, rather than by those who have helped to shape and <i>create</i> it. ... As a research physicist, I believe that physicists have been particularly successful at predicting the broad outlines of the future. Professionally, I work in one of the most fundamental areas of physics, the quest to complete Einstein’s dream of a “theory of everything.” As a result, I am constantly reminded of the ways in which quantum physics touches many of the key discoveries that shaped the twentieth century.</p> <p>In the past, the track record of physicists has been formidable: we have been intimately involved with introducing a host of pivotal inventions (TV, radio, radar, X-rays, the transistor, the computer, the laser, the atomic bomb), decoding the DNA molecule, opening new dimensions in probing the body with PET, MRI, and CAT scans, and even designing the Internet and the World Wide Web.</p> </blockquote> <p>He also specifically calls out Kurzweil's predictions as absurd, saying Kurzweil has &quot;preposterous predictions about the decades ahead, from vacationing on Mars to banishing all diseases.&quot;</p> <p>Although Kaku finds Kurzweil's predictions ridiculous, his predictions rely on some of the same mechanics Kurzweil relies on. 
For example, Kaku assumes that materials / commodity prices will tank in the then-near future because the advance of technology will make raw materials less important and Kaku also assumes the performance and cost scaling of computer chips would continue on the historical path it was on during the 70s and 80s. Like most of the other futurists from Wikipedia's list, Kaku also assumed that the pace of scientific progress would rapidly increase, although his reasons are different (he cites increased synergy between the important fields of quantum mechanics, computer science, and biology, which he says are so important that &quot;it will be difficult to be a research scientist in the future without having some working knowledge of&quot; all of those fields).</p> <p>Kaku assumed that UV lithography would run out of steam and that we'd have to switch to X-ray or electron lithography, which would then run out of steam, requiring us to switch to a fundamentally different substrate for computers (optical, molecular, or DNA) to keep performance and scaling on track, but advances in other fundamental computing substrates have not materialized quickly enough for Kaku's predictions to come to pass. Kaku assigned very high weight to things that have what he considers &quot;quantum&quot; effects, which is why, for example, he cites the microprocessor as something that will be obsolete by 2020 (they're not &quot;quantum&quot;) whereas fiber optics will not be obsolete (they rely on &quot;quantum&quot; mechanisms). Although Kaku pans other futurists for making predictions without having a real understanding of the topics they're discussing, it's not clear that Kaku has a better understanding of many of the topics being discussed even though, as a physicist, Kaku has more relevant background knowledge.</p> <p>The combination of assumptions above that didn't pan out leads to a fairly low accuracy rate for Kaku's predictions in Visions.</p> <p>I didn't finish Visions, but the prediction accuracy rate of the part of the book I read (from the beginning until somewhere in the middle, to avoid cherry picking) was 3% (arguably 6% if you give full credit to the prediction I gave half credit to). He made quite a few predictions I didn't score in which he said something &quot;may&quot; happen. Such a prediction is, of course, unfalsifiable because the statement is true whether or not the event happens.</p> <h3 id="john-naisbitt">John Naisbitt</h3> <p>Anyone who's a regular used book store bargain bin shopper will have seen this name on the cover of <a href="https://amzn.to/3qMyokJ">Megatrends</a>, which must be up there with Lee Iacocca's autobiography as one of the most common bargain bin fillers.</p> <p>Naisbitt claims that he's able to accurately predict the future using &quot;content analysis&quot; of newspapers, which he says was used to provide great insights during WWII and has been widely used by the intelligence community since then, but hadn't been commercially applied until he did it. Naisbitt explains that this works because there's a fixed amount of space in newspapers (apparently newspapers can't be created or destroyed nor can newspapers decide to print significantly more or less news or have editorial shifts in what they decide to print that are not reflected by identical changes in society at large):</p> <blockquote> <p>Why are we so confident that content analysis is an effective way to monitor social change? Simply stated, because the news hole in a newspaper is a closed system. 
For economic reasons, the amount of space devoted to news in a newspaper does not change significantly over time. So, when something new is introduced, something else or a combination of things must be omitted. You cannot add unless you subtract. It is the principle of forced choice in a closed system.</p> </blockquote> <p>Unfortunately, it's not really possible to judge Naisbitt's predictions because he almost exclusively deals in vague, horoscope-like, predictions which can't really be judged as correct or incorrect. If you just read Megatrends for the flavor of each chapter and don't try to pick out individual predictions, some chapters seem quite good, e.g., &quot;Industrial Society -&gt; Information Society&quot;, but some are decidedly mixed even if you very generously grade his vague predictions, e.g., &quot;From Forced Technology to High Tech / High Touch&quot;. This can't really be compared to the other futurists in this post because it's much easier to make vague predictions sound roughly correct than to make precise predictions correct but, even so, if reading for general feel of what direction the future might go, Naisbitt's predictions are much more on the mark than any other futurists discussed.</p> <p>That being said, as far as I read in his book, the one concrete prediction I could find was incorrect, so if you want to score Naisbitt comparably to the other futurists discussed here, you might say his accuracy rate is 0% but with very wide error bars.</p> <h3 id="gerard-k-o-neill">Gerard K. O'Neill</h3> <p>O'Neill has two relatively well-known non-fiction futurist books, <a href="https://amzn.to/3dtGtHX">2081</a> and <a href="https://amzn.to/3do9bde">The Technology Edge</a>. 2081 was written in 1980 and predicts the future 100 years from then. The Technology Edge discusses what O'Neill thought the U.S. needed to do in 1983 to avoid being obsoleted by Japan.</p> <p>O'Neill spends a lot more space on discussing why previous futurists were wrong than any other futurist under discussion. O'Neill notes that &quot;most [futurists] overestimated how much the world would be transformed by social and political change and underestimated the forces of technological change&quot; and cites Kipling, Verne, Wells, Haldane, and Ballamy, as people who did this. O'Neill also says that &quot;scientists tend to overestimate the chances for major scientific breakthroughs and underestimate the effects of straightforward developments well within the boundaries of existing knowledge&quot; and cites Haldane again on this one. O'Neill also cites spaceflight as a major miss of futurists past, saying that they tended to underestimate how quickly spaceflight was going to develop.</p> <p>O'Neill also says that it's possible to predict the future without knowing the exact mechanism by which the change will occur. For example, he claims that the automobile could've been safely predicted even if the internal combustion engine hadn't been invented because steam would've also worked. 
But he also goes on to say that there are things it would've been unreasonable to predict, like the radio, TV, and electronic communications: even though the foundations for those were discovered in 1865, the time interval between a foundational discovery and its application is &quot;usually quite long&quot;, and he cites 30-50 years from quantum mechanics to integrated circuits, 100+ years from relativity to faster than light travel, and 50+ years from the invention of nuclear power without &quot;a profound impact&quot;.</p> <p>I don't think O'Neill ever really explains why his predictions are of the &quot;automobile&quot; kind in a convincing way. Instead, he relies on doing the opposite of what he sees as mistakes others made. The result is that he predicts huge advancements in space flight, saying we should expect large scale space travel and colonization by 2081, presaged by wireless transmission of energy by 2000 (referring to energy beamed down from satellites) and interstellar probes by 2025 (presumably something of a different class than the Voyager probes, which were sent out in 1977).</p> <p>In 1981, he said &quot;a fleet of reusable vehicles of 1990s vintage, numbering much less than today's world fleet of commercial jet transports, would be quite enough to provide transport into space and back again for several hundred million people per year&quot;, predicting that something much more advanced than the NASA Space Shuttle would be produced shortly afterwards. Continuing that progress &quot;by the year 2010 or thereabouts there will be many space colonies in existence and many new ones being constructed each year&quot;.</p> <p>Most of O'Neill's predictions are for 2081, but he does make the occasional prediction for a time before 2081. All of the falsifiable ones I could find were incorrect, giving him an accuracy rate of approximately 0% but with fairly wide error bars.</p> <h3 id="patrick-dixon">Patrick Dixon</h3> <p>Dixon is best known for writing <a href="https://amzn.to/3Br9JqK">Futurewise</a>, but he has quite a few books with predictions about the future. In this post, we're just going to look at Futurewise, because it's the most prediction-oriented book Dixon has that's old enough that we ought to be able to make a call on a decent number of his predictions (Futurewise is from 1998; his other obvious candidate, <a href="https://amzn.to/3qKT1O6">The Future of Almost Everything</a> is from 2015 and looks forward a century).</p> <p>Unlike most other futurists featured in this post, Dixon doesn't explicitly lay out why you should trust his predictions in Futurewise in the book itself, although he sort of implicitly does so in the acknowledgements, where he mentions having interacted with many very important people.</p> <blockquote> <p>I am indebted to the hundreds of senior executives who have shaped this book by their participation in presentations on the Six Faces of the Future. The content has been forged in the realities of their own experience.</p> </blockquote> <p>And although he doesn't explicitly refer to himself, he also says that business success will come from listening to folks who have great vision:</p> <blockquote> <p>Those who are often right will make a fortune. Trend hunting in the future will be a far cry from the seventies or eighties, when everything was more certain. In a globalized market there are too many variables for back-projection and forward-projection to work reliably .. 
That's why economists don't make good futurologists when it comes to new technologies, and why so many boards of large corporations are in such a mess when it comes to quantum leaps in thinking beyond 2000.</p> <p>Second millennial thinking will never get us there ... A senior board member of a Fortune 1000 company told me recently: 'I'm glad I'm retiring so I don't have to face these decisions' ... 'What can we do?' another senior executive declares ...</p> </blockquote> <p>Later, in The Future of Almost Everything, Dixon lays out the techniques that he says worked when he wrote Futurewise, which &quot;has stood the test of time for more than 17 years&quot;. Dixon says:</p> <blockquote> <p>All reliable, long-range forecasting is based on powerful megatrends that have been driving profound, consistent and therefore relatively predictable change over the last 30 years. Such trends are the basis of every well-constructed corporate strategy and government policy ... These wider trends have been obvious to most trend analysts like myself for a while, and have been well described over the last 20–30 years. They have evolved much more slowly than booms and busts, or social fads.</p> </blockquote> <p>And lays out trends such as:</p> <ul> <li>fall in costs of production of most mass-produced items</li> <li>increased concern about environment/sustainability</li> <li>fall in price of digital technology, telecoms and networking</li> <li>rapid growth of all kinds of wireless/mobile devices</li> <li>ever-larger global corporations, mergers, consolidations</li> </ul> <p>Dixon declines to mention trends he predicted that didn't come to pass (such as his prediction that increased tribalism will mean that most new wealth is created in small firms of 20 or fewer employees which will mostly be family owned, or his prediction that the death of &quot;old economics&quot; means that we'll be able to have high economic growth with low unemployment and no inflationary pressure indefinitely), or cases where the trend progression caused Dixon's prediction to be wildly incorrect, a common problem when making predictions off of exponential trends because a relatively small inaccuracy in the rate of change can result in a very large change in the final state.</p> <p><a href="https://web.archive.org/web/20210814150424/https://www.globalchange.com/the-future-of-almost-everything-new-book-by-patrick-dixon.htm">Dixon's website is full of endorsements for him, with implicit and explicit claims that he's a great predictor of the future</a>, as well as more general statements such as &quot;Patrick Dixon has been ranked as one of the 20 most influential business thinkers alive today&quot;.</p> <p>Back in Futurewise, Dixon relies heavily on the idea of a stark divide between &quot;second millennial thinking&quot; and &quot;third millennial thinking&quot;, which repeatedly comes up in his text. Like nearly everyone else under discussion, Dixon also extrapolates out from many existing trends to make predictions that didn't pan out, e.g., he looked at the falling price of phone lines and predicted that people would end up with a huge number of phone lines in their home by 2005 and that screens getting thinner would mean that we'd have &quot;paper-thin display sheets&quot; in significant use by 2005. 
This kind of extrapolation sometimes works and Dixon's overall accuracy rate of 10% is quite good compared to the other &quot;futurists&quot; under discussion here.</p> <p>However, when Dixon explains his reasoning in areas I have some understanding of, he seems to be operating at the <a href="cocktail-ideas/">buzzword level</a>, so that when he makes a correct call, it's generally for the wrong reasons. For example, Dixon says that software will always be buggy, which seems true, at least to date. However, his reasoning for this is that new computers come out so frequently (he says &quot;less than 20 months&quot; — a reference to the 18 month timeline in Moore's law) and it takes so long to write good software (&quot;at least 20 years&quot;) that programmers will always be too busy rewriting software to run on the new generation of machines (due to the age of the book, he uses the example of &quot;brand new code ... written for Pentium chips&quot;).</p> <p>It's simply not the case that most bugs or even, as a fraction of bugs, almost any bugs are due to programmers rewriting existing code to run on new CPUs. If you really squint, you can see things like <a href="android-updates/">Android devices having lots of security bugs due to the difficulty of updating Android and backporting changes to older hardware</a>, but those kinds of bugs are both a small fraction of all bugs and not really what Dixon was talking about.</p> <p>Similarly, on how computer backups will be done in the future, Dixon basically correctly says that home workers will be vulnerable to data loss and people who are serious about saving data will back up data online, &quot;back up data on-line to computers in other cities as the ultimate security&quot;.</p> <p>But Dixon's stated reason for this is that workstations already have large disk capacity (&gt;= 2GB) and floppy disks haven't kept up (&lt; 2MB), so it would take thousands of floppy disks to do backups, which is clearly absurd. However, even at the time, Zip drives (100MB per portable disk) were common and, although it didn't take off, the same company that made Zip drives also made 1GB &quot;Jaz&quot; drives. And, of course, tape backup was also used at the time and is still used today. This trend has continued to this day; large, portable, disks are available, and quite a few people I know transfer or back up large amounts of data on portable disks. The reason most people don't do disk/tape backups isn't that it would require thousands of disks to backup a local computer (if you look at the computers people typically use at home, most people could back up their data onto a single portable disk per failure domain and even keep multiple versions on one disk), but that online/cloud backups are more convenient.</p> <p>Since Dixon's reasoning was incorrect (at least in the cases where I'm close enough to the topic to understand how applicable the reasoning was), it seems that when Dixon is correct, it can't be for the stated reason and Dixon is either correct by coincidence or because he's looking at the broader trend and came up with an incorrect rationalization for the prediction. 
But, per the above, it's very difficult to actually correctly predict the growth rate of a trend over time, so without some understanding of the mechanics in play, one could also say that a prediction that comes true based on some rough trend is also correct by coincidence.</p> <h3 id="alvin-toffler-heidi-toffler">Alvin Toffler / Heidi Toffler</h3> <p>Like most others on this list, <a href="https://www.denverpost.com/2016/06/29/author-alvin-toffler-dies/">Toffler claims some big prediction wins</a></p> <blockquote> <p>The Tofflers claimed on their website to have foretold the breakup of the Soviet Union, the reunification of Germany and the rise of the Asia-Pacific region. He said in the People’s Daily interview that “Future Shock” envisioned cable television, video recording, virtual reality and smaller U.S. families.</p> </blockquote> <p>In this post, we'll look at <a href="https://amzn.to/3xxaKMW">Future Shock</a>, Toffler's most famous work, written in 1970.</p> <p>According to a number of sources, Alvin Toffler's major works were co-authored by Heidi Toffler. In the books themselves, Heidi Toffler is acknowledged as someone who helped out a lot, but not as an author, despite the remarks elsewhere about co-authorship. In this section, I'm going to refer to Toffler in the singular, but you may want to mentally substitute the plural.</p> <p>Toffler claims that we should understand the present not only by understanding the past, but also by understanding the future:</p> <blockquote> <p>Previously, men studied the past to shed light on the present. I have turned the time-mirror around, convinced that a coherent image of the future can also shower us with valuable insights into today. We shall find it increasingly difficult to understand our personal and public problems without making use of the future as an intellectual tool. In the pages ahead, I deliberately exploit this tool to show what it can do.</p> </blockquote> <p>Toffler generally makes vague, wish-y wash-y statements, so it's not really reasonable to score Toffler's concrete predictions because so few predictions are given. However, Toffler very strongly implies that past exponential trends are expected to continue or even accelerate and that the very rapid change caused by this is going to give rise to &quot;future shock&quot;, hence the book's title:</p> <blockquote> <p>I coined the term &quot;future shock&quot; to describe the shattering stress and disorientation that we induce in individuals by subjecting them to too much change in too short a time. Fascinated by this concept, I spent the next five years visiting scores of universities, research centers, laboratories, and government agencies, reading countless articles and scientific papers and interviewing literally hundreds of experts on different aspects of change, coping behavior, and the future. Nobel prizewinners, hippies, psychiatrists, physicians, businessmen, professional futurists, philosophers, and educators gave voice to their concern over change, their anxieties about adaptation, their fears about the future. I came away from this experience with two disturbing convictions. First, it became clear that future shock is no longer a distantly potential danger, but a real sickness from which increasingly large numbers already suffer. This psycho-biological condition can be described in medical and psychiatric terms. It is the disease of change .. 
Earnest intellectuals talk bravely about &quot;educating for change&quot; or &quot;preparing people for the future.&quot; But we know virtually nothing about how to do it ... The purpose of this book, therefore, is to help us come to terms with the future— to help us cope more effectively with both personal and social change by deepening our understanding of how men respond to it</p> </blockquote> <p>The big hammer that Toffler uses everywhere is extrapolation of exponential growth, with the implication that this is expected to continue. On the general concept of extrapolating out from curves, Toffler's position is very similar to Kurzweil's: if you can see a trend on a graph, you can use that to predict the future, and the ability of technology to accelerate the development of new technology will cause innovation to happen even more rapidly than you might naively expect:</p> <blockquote> <p>Plotted on a graph, the line representing progress in the past generation would leap vertically off the page. Whether we examine distances traveled, altitudes reached, minerals mined, or explosive power harnessed, the same accelerative trend is obvious. The pattern, here and in a thousand other statistical series, is absolutely clear and unmistakable. Millennia or centuries go by, and then, in our own times, a sudden bursting of the limits, a fantastic spurt forward. The reason for this is that technology feeds on itself. Technology makes more technology possible, as we can see if we look for a moment at the process of innovation. Technological innovation consists of three stages, linked together into a self-reinforcing cycle. ... Today there is evidence that the time between each of the steps in this cycle has been shortened. Thus it is not merely true, as frequently noted, that 90 percent of all the scientists who ever lived are now alive, and that new scientific discoveries are being made every day. These new ideas are put to work much more quickly than ever before.</p> </blockquote> <p>The first N major examples of this from the book are:</p> <ul> <li>Population growth rate (doubling time of 11 years), which will have to create major changes</li> <li>Economic growth (doubling time of 15 years), which will increase the amount of stuff people own (this is specifically phrased as amount of stuff and not wealth) <ul> <li>It's very strongly implied that this will continue for at least 70 years</li> </ul></li> <li>Speed of travel; no doubling time is stated, but the reader is invited to extrapolate from the following points: human running speed millions of years ago, 100 mph in the 1880s, 400 mph in 1938, 800 mph by 1958, 4000 mph very shortly afterwards (18000 mph when orbiting the earth)</li> <li>Reduced time from conception of an idea to the application, used to support the idea that growth will accelerate</li> </ul> <p>As we just noted above, when discussing Dixon, Kurzweil, etc., predicting the future by extrapolating out exponential growth is fraught. Toffler somehow manages to pull off the anti-predictive feat of naming a bunch of trends which were about to stop, some of which already had their writing on the wall when Toffler was writing.</p> <p>Toffler then extrapolates from the above and predicts that the half-life of everything will become shorter, which will overturn how society operates in a variety of ways.</p> <p>For example, companies and governments will replace bureaucracies with &quot;adhocracies&quot; sometime between 1995 and 2020 . 
The concern that people will feel like cogs as companies grow larger is obsolete because, in adhocracy, the entire concept of top-down command and control will disappear, obsoleted by the increased pace of everything causing top-down command and control structures to disappear. While it's true that some companies have less top-down direction than would've been expected in Toffler's time, many also have more, which has been enabled by technology allowing employers to keep stricter tabs on employees than ever before, making people more of a cog than ever before.</p> <p>Another example is that Toffler predicted human colonization of the Ocean, &quot;The New Atlantis&quot;, &quot;long before the arrival of A.D. 2000&quot;.</p> <p>Fabian Giesen points out that, independent of the accuracy of Toffler's predictions, Venkatesh Rao's <a href="https://www.ribbonfarm.com/2012/05/09/welcome-to-the-future-nauseous/">Welcome to the Future Nauseous</a> explains why &quot;future shock&quot; didn't happen in areas of very rapid technological development.</p> <h3 id="people-from-the-wikipedia-list-who-weren-t-included">People from the Wikipedia list who weren't included</h3> <ul> <li>Laurie Anderson <ul> <li>I couldn't easily find predictions from her, except some song lyrics that allegedly predicted 9/11, but in a very &quot;horoscope&quot; sort of way</li> </ul></li> <li>Arthur Harkins <ul> <li>His Wikipedia entry was later removed for notability reasons and it was already tagged as non-notable at the time</li> </ul></li> <li>Stephen Hawking <ul> <li>The predictions I could find are generally too far out to grade and are really more suggestions as to what people should do than predictions. For example the Wikipedia futurist list above links to a 2001 prediction that humans will be left behind by computers / robots if genetic engineering wasn't done to allow humans to keep up and it also links to a 2006 prediction that humans need to expand to other planets to protect the species</li> </ul></li> <li>Thorkil Kristensen <ul> <li>I couldn't easily find a set of English language predictions from Kristensen. Thorkil Kristensen is <a href="https://twitter.com/MGSchmelzer/status/1331271177118109697">associated with but not an author of</a> The Limits to Growth, a 1970s anti-growth polemic</li> </ul></li> <li>David Sears <ul> <li>Not notable enough to have a wikipedia page, then or now</li> </ul></li> <li>John Zerzan <ul> <li>Zerzan seems like more of someone who's calling for change in society due to his political views than a &quot;futurist&quot; who's trying to predict the future</li> </ul></li> </ul> <h3 id="steve-yegge">Steve Yegge</h3> <p>As I mentioned at the start, none of the futurists from Wikipedia's list had very accurate predictions, so we're going to look at a couple other people from other sources who aren't generally considered futurists to see how they rank.</p> <p>We <a href="yegge-predictions/">previously looked at Yegge's predictions here</a>, which were written in 2004 and were generally about the next 5-10 years, with some further out. There were nine predictions (technically ten, but one isn't really a prediction). 
If grading them as written, which is how futurists have been scored, I would rank these at 4.5/9, or about 50%.</p> <p>You might argue that this is unfair because Yegge was predicting the relatively near future, but if we look at relatively near future predictions from futurists, their accuracy rate is generally nowhere near 50%, so I don't think it's unfair to compare the number in some way.</p> <p>If you want to score these like people often score futurists, where they get credit for essentially getting things directionally correct, then I'd say that Yegge's score should be between 7/9 and 8/9, depending on how much partial credit he gets for one of the questions.</p> <p>If you want to take a more holistic &quot;what would the world look like if Yegge's vision were correct vs. the world we're in today&quot;, I think Yegge also does quite well there, with the big miss being that Lisp-based languages have not taken over the world, the success of Clojure notwithstanding. This is quite different than the futurists here, who generally had a vision of many giant changes that didn't come to pass, e.g., if we look at Kurzweil's vision of the world, by 2010, we would've had self-driving cars, a &quot;cure&quot; for paraplegia, widespread use of AR, etc., by 2011, we would have unbounded life expectancy, and by 2019 we would have pervasive use of nanotechnology including computers having switched from transistors to nanotubes, effective &quot;mitigations&quot; for blindness and deafness, fairly widely deployed fully realistic VR that can simulate sex via realistic full-body stimulation, pervasive self-driving cars (predicted again), entirely new fields of art and music, etc., and all that these things imply, which is a very different world than the world we actually live in.</p> <p>And we see something similar if we look at other futurists, who predicted things like living underground, living under the ocean, etc.; most predicted many revolutionary changes that would really change society, a few of which came to pass. Yegge, instead, predicted quite a few moderate changes (as well as some places where change would be slower than a lot of people expected) and changes were slower than he expected in the areas he predicted, but only by a bit.</p> <p>Yegge described his methodology for the post above as:</p> <blockquote> <p>If you read a lot, you'll start to spot trends and undercurrents. You might see people talking more often about some theme or technology that you think is about to take off, or you'll just sense vaguely that some sort of tipping point is occurring in the industry. Or in your company, for that matter.</p> <p>I seem to have many of my best insights as I'm writing about stuff I already know. It occurred to me that writing about trends that seem obvious and inevitable might help me surface a few not-so-obvious ones. So I decided to make some random predictions based on trends I've noticed, and see what turns up. It's basically a mental exercise in mining for insights</p> <p>In this essay I'll make ten predictions based on undercurrents I've felt while reading techie stuff this year. As I write this paragraph, I have no idea yet what my ten predictions will be, except for the first one. It's an easy, obvious prediction, just to kick-start the creative thought process. 
Then I'll just throw out nine more, as they occur to me, and I'll try to justify them even if they sound crazy.</p> </blockquote> <p>He's not really trying to generate the best predictions, but still did pretty well by relying on his domain knowledge plus some intuition about what he's seen.</p> <p>In the post about Yegge's predictions, we also noted that he's made quite a few successful predictions outside of his predictions post:</p> <blockquote> <p>Steve also has a number of posts that aren't explicitly about predictions that, nevertheless, make pretty solid predictions about how things are today, written way back in 2004. There's It's Not Software, which was years ahead of its time about how people write “software”, how writing server apps is really different from writing shrinkwrap software in a way that obsoletes a lot of previously solid advice, like Joel's dictum against rewrites, as well as how service oriented architectures look; the Google at Delphi (again from 2004) correctly predicts the importance of ML and AI as well as Google's very heavy investment in ML; an old interview where he predicts &quot;web application programming is gradually going to become the most important client-side programming out there. I think it will mostly obsolete all other client-side toolkits: GTK, Java Swing/SWT, Qt, and of course all the platform-specific ones like Cocoa and Win32/MFC/&quot;; etc. A number of Steve's internal Google blog posts also make interesting predictions, but AFAIK those are confidential.</p> </blockquote> <p>Quite a few of Yegge's predictions would've been considered fairly non-obvious at the time and he seemed to still have a fairly good success rate on his other predictions (although I didn't try to comprehensively find them and score them, I sampled some of his old posts and found the overall success rate to be similar to the ones in his predictions post).</p> <p>With Yegge and the other predictors who were picked so that we could look at some accurate predictions, there is, of course, a concern that there's survivorship bias in picking these predictors. I suspect that's not the case for Yegge because he continued to be accurate after I first noticed that he seemed to have accurate predictions, so it's not just that I picked someone who had a lucky streak after the fact. Also, especially in some of his Google-internal G+ comments, he made fairly high-dimensional comments that ended up being right for the reasons he suggested, which provides a lot more information about how accurate his reasoning was than simply winning a bunch of coin flips in a row. This comment about depth of reasoning doesn't apply to Caplan, below, because I haven't evaluated Caplan's reasoning, but it does apply to MS leadership circa 1990.</p> <h3 id="bryan-caplan">Bryan Caplan</h3> <p>Bryan Caplan reports that his track record is 23/23 = 100%. He is much more precise in specifying his predictions than anyone else we've looked at and tries to give a precise bet that will be trivial to adjudicate as well as betting odds.</p> <p>Caplan started making predictions/bets around the time the concept that &quot;betting is a tax on bullshit&quot; became popular (the idea being that a lot of people are willing to say anything but will quiet down if asked to make a real bet and those that don't will pay a real cost if they make bad real bets) and Caplan seems to have a strategy of acting as a tax man on bullshit in that he generally takes the safe side of bets that people probably shouldn't have made.
<a href="https://statmodeling.stat.columbia.edu/2022/08/11/bets-as-forecasts-bets-as-probability-assessment-difficulty-of-using-bets-in-this-way/">Andrew Gelman says</a>:</p> <blockquote> <p>Caplan’s bets are an interesting mix. The first one is a bet where he offered 1-to-100 odds so it’s no big surprise that he won, but most of them are at even odds. A couple of them he got lucky on (for example, he bet in 2008 that no large country would leave the European Union before January 1, 2020, so he just survived by one month on that one), but, hey, it’s ok to be lucky, and in any case even if he only had won 21 out of 23 bets, that would still be impressive.</p> <p>It seems to me that Caplan’s trick here is to show good judgment on what pitches to swing at. People come at him with some strong, unrealistic opinions, and he’s been good at crystallizing these into bets. In poker terms, he waits till he has the nuts, or nearly so. 23 out of 23 . . . that’s a great record.</p> </blockquote> <p>I think there's significant value in doing this, both in the general &quot;betting is a tax on bullshit&quot; sense as well as, more specifically, in that if you have high confidence that someone is trying to take the other side of bad bets and has good judgment, knowing that the Caplan-esque bettor has taken a position gives you decent signal about the bet even if you have no particular expertise in the subject. For example, if you look at my bets, even though <a href="https://twitter.com/danluu/status/1554559905167597568">I sometimes take bets against obviously wrong positions</a>, I much more frequently take <a href="https://twitter.com/danluu/status/1555343411229536257">bets I have a very good chance of losing</a>, so just knowing that I took a bet provides much less information than knowing that Caplan took a bet.</p> <p>But, of course, taking Caplan's side of a bet isn't foolproof. As Gelman noted, Caplan got lucky at least once, and Caplan also seems likely to lose the <a href="https://standupeconomist.com/2021-update-on-my-global-warming-traffic-light-bet-with-bryan-caplan-and-alex-tabarrok/">Caplan and Tabarrok v. Bauman bet on global temperature</a>. For that particular bet, you could also make the case that he's expected to lose since he took the bet with 3:1 odds, but a lot of people would argue that 3:1 isn't nearly long enough odds to take that bet.</p> <p>The methodology that Caplan has used to date will never result in a positive prediction of some big change until the change is very likely to happen, so this methodology can't really give you a vision of what the future will look like in the way that Yegge or Gates or another relatively accurate predictor who takes wilder bets could.</p> <h3 id="bill-gates-nathan-myhrvold-ms-leadership-circa-1990-to-1997">Bill Gates / Nathan Myhrvold / MS leadership circa 1990 to 1997</h3> <p><a href="us-v-ms/">A handful of memos released to the world as a result of the case against Microsoft</a> laid out the vision Microsoft executives had about how the world would develop, with or without Microsoft's involvement. These memos don't lay out concrete predictions with timelines and therefore can't be scored in the same way futurist predictions were scored in this post.
If rating these predictions on how accurate their vision of the future was, I'd rate them similarly to Steve Yegge (who scored 7/9 or 8/9), but the predictions were significantly more ambitious, so they seem much more impressive when controlling for the scope of the predictions.</p> <p>Compared to the futurists we discussed, there are multiple ways in which the predictions are much more detailed (and therefore more impressive for a given level of accuracy on top of being more accurate). One is that MS execs have a much deeper understanding of the things under discussion and how they impact each other. With &quot;our&quot; futurists, they often discuss things at a high level and, when they discuss things in detail, they make statements that make it clear that they don't really understand the topic and often don't really know what the words they're writing mean. MS execs of the era pretty clearly had a deep understanding of the issues in play, which let them make detailed predictions that our futurists wouldn't make, e.g., while protocols like FTP and IRC will continue to be used, the near future of the internet is HTTP over TCP and the browser will become a &quot;platform&quot; in the same way that Windows is a &quot;platform&quot;, one that's much more important and larger than any OS (unless Microsoft is successful in taking action to stop this from coming to pass, which it was not despite MS execs foreseeing the exact mechanisms that could cause MS to fail to own the internet). MS execs use this level of understanding to make predictions about the kinds of larger things that our futurists discuss, e.g., the nature of work and how that will change.</p> <p>Actually having an understanding of the issues in play and not just operating with a typical futurist buzzword level understanding of the topics allowed MS leadership to make fairly good guesses about what the future would look like.</p> <p>For a fun story about how much effort Gates spent on understanding what was going on, <a href="https://www.joelonsoftware.com/2006/06/16/my-first-billg-review/" rel="nofollow">see this story by Joel Spolsky on his first Bill Gates review</a>:</p> <blockquote> <p>Bill turned to me.</p> <p>I noticed that there were comments in the margins of my spec. He had read the first page!</p> <p><I>He had read the first page of my spec and written little notes in the margin!</I></p> <p>Considering that we only got him the spec about 24 hours earlier, he must have read it the night before.</p> <p>He was asking questions. I was answering them. They were pretty easy, but I can’t for the life of me remember what they were, because I couldn’t stop noticing that he was flipping through the spec…</p> <p><I>He was flipping through the spec!</I> [Calm down, what are you a little girl?]</p> <p>… [ed: ellipses are from the original doc] and THERE WERE NOTES IN ALL THE MARGINS. ON EVERY PAGE OF THE SPEC. HE HAD READ THE WHOLE GODDAMNED THING AND WRITTEN NOTES IN THE MARGINS.</p> <p>He Read The Whole Thing! [OMG SQUEEE!]</p> <p>The questions got harder and more detailed.</p> <p>They seemed a little bit random. By now I was used to thinking of Bill as my buddy. He’s a nice guy! He read my spec! He probably just wants to ask me a few questions about the comments in the margins! I’ll open a bug in the bug tracker for each of his comments and makes sure it gets addressed, pronto!</p> <p>Finally the killer question.</p> <p>“I don’t know, you guys,” Bill said, “Is anyone really looking into all the details of how to do this? 
Like, all those date and time functions. Excel has so many date and time functions. Is Basic going to have the same functions? Will they all work the same way?”</p> <p>“Yes,” I said, “except for January and February, 1900.”</p> <p>Silence. ... “OK. Well, good work,” said Bill. He took his marked up copy of the spec ... and left</p> </blockquote> <p>Gates (and some other MS execs) were very well informed about what was going on to a fairly high level of detail considering all of the big picture concerns they also had in mind.</p> <p>A topic for another post is how MS leadership had a more effective vision for the future than leadership at old-line competitors (Novell, IBM, AT&amp;T, Yahoo, Sun, etc.) and how this resulted in MS turning into a $2T company while their competitors became, at best, irrelevant and most didn't even succeed at becoming irrelevant and ceased to exist. Reading through old MS memos, it's clear that MS really kept tabs on what competitors were doing and they were often surprised at how ineffective leadership was at their competitors, e.g., on Novell, Bill Gates says &quot;Our traditional competitors are just getting involved with the Internet. Novell is surprisingly absent given the importance of networking to their position&quot;; Gates noted that Frankenberg, then-CEO of Novell, seemed to understand the importance of the internet, but Frankenberg only joined Novell in 1994 and left in 1996 and spent much of his time at Novell reversing the direction the company had taken under Noorda, which didn't leave Novell with a coherent position or plan when Frankenberg &quot;resigned&quot; two years into the pivot he was leading.</p> <p>In many ways, a discussion of what tech execs at the time thought the future would look like and what paths would lead to success is more interesting than looking at futurists who basically don't understand the topics they're talking about, but I started this post to look at how well futurists understood the topics they discussed and I didn't know, in advance, that their understanding of the topics they discuss and resultant prediction accuracy would be so poor.</p> <h4 id="common-sources-of-futurist-errors">Common sources of futurist errors</h4> <ul> <li>Not learning from mistakes <ul> <li>Good predictors tend to be serious at looking at failed past predictions and trying to calibrate</li> </ul></li> <li>Reasoning from a <a href="cocktail-ideas/">cocktail party level understanding of a topic</a> <ul> <li>Good predictors tend to engage with ideas in detail</li> </ul></li> <li>Pushing one or a few &quot;big ideas&quot;</li> <li>Generally assuming high certainty about the future <ul> <li>Worse yet: assuming high certainty of scaling curves, especially exponential scaling curves</li> </ul></li> <li>Panacea thinking</li> <li>Only seeing the upside (or downside) of technological changes</li> <li>Starting from evidence-free assumptions</li> </ul> <h5 id="not-learning-from-mistakes">Not learning from mistakes</h5> <p>The futurists we looked at in this post tend to rate themselves quite highly and, after the fact, generally claim credit for being great predictors of the future, so much so that they'll even tell you how you can predict the future accurately. 
And yet, after scoring them, the most accurate futurist (among the ones who made concrete enough predictions that they could be scored) came in at 10% accuracy with generous grading that gave them credit for making predictions that accidentally turned out to be correct when they mispredicted the mechanism by which the prediction would come to pass (a strict reading of many of their predictions would reduce the accuracy further because they said that the prediction would happen because of their predicted mechanism, which is false, rendering the prediction false).</p> <p>There are two tricks that these futurists have used to be able to make such lofty claims. First, many of them make vague predictions and then claim credit if anything vaguely resembling the prediction comes to pass. Second, almost all of them make a lot of predictions and then only tally up the ones that came to pass. One way to look at a 4% accuracy rate is that you really shouldn't rely on that person's predictions. Another way is that, if they made 500 predictions, they're a great predictor because they made 20 accurate predictions. Since almost no one will bother to go through a list of predictions to figure out the overall accuracy when someone does the latter, making a huge number of predictions and then cherry-picking the ones that were accurate is a good strategy for becoming a renowned futurist.</p> <p>But if we want to figure out how to make accurate predictions, we'll have to look at other people's strategies. There are people who do make fairly good, generally directionally accurate, predictions, as we noted when we <a href="yegge-predictions/">looked at Steve Yegge's prediction record</a>. However, they tend to be harsh critics of their own predictions, as Steve Yegge was when <a href="https://sites.google.com/site/steveyegge2/ten-predictions">he reviewed his own prediction record, saying</a>:</p> <blockquote> <p>I saw the HN thread about Dan Luu's review of this post, and felt people were a little too generous with the scoring.</p> </blockquote> <p>It's unsurprising that a relatively good predictor of the future scored himself lower than I did because taking a critical eye to your own mistakes and calling yourself out for mistakes that are too small for most people to care about is a great way to improve. We can see this <a href="us-v-ms/">in communications from Microsoft leadership as well</a>, e.g., leadership calling themselves out for failing to predict that a lack of backwards compatibility would doom major efforts like OS/2 and LanMan. Doing what most futurists do and focusing on the predictions that worked out without looking at what went wrong isn't such a great way to improve.</p> <h5 id="cocktail-party-understanding">Cocktail party understanding</h5> <p>Another thing we see among people who make generally directionally correct predictions, as in the Steve Yegge post mentioned above, <a href="https://twitter.com/corry_wang/status/1340869586397372417">Nathan Myhrvold's 1993 &quot;Road Kill on the Information Highway&quot;</a>, Bill Gates's 1995 &quot;<a href="us-v-ms/">The Internet Tidal Wave</a>&quot;, etc., is that the person making the prediction actually understands the topic.
In all of the above examples, it's clear that the author of the document has a fairly strong technical understanding of the topics being predicted and, in the general case, it seems that people who have relatively accurate predictions are really trying to understand the topic, which is in stark contrast to the futurists discussed in this post, almost all of whom display clear signs of having a buzzword-level understanding<sup class="footnote-ref" id="fnref:I"><a rel="footnote" href="#fn:I">2</a></sup> of the topics they're discussing.</p> <p>There's a sense in which it isn't too difficult to make correct predictions if you understand the topic and have access to the right data. Before joining a huge megacorp and then watching the future unfold, I thought documents like &quot;Road Kill on the Information Highway&quot; and &quot;The Internet Tidal Wave&quot; were eerily prescient, but once I joined Google in 2013, a lot of trends that weren't obvious from the outside seemed fairly <a href="https://www.patreon.com/posts/how-do-their-71735437">obvious</a> from the inside.</p> <p>For example, it was obvious that mobile was very important for most classes of applications, so much so that most applications that were going to be successful would be &quot;mobile first&quot; applications where the web app was secondary, if it existed at all, and from the data available internally, this should've been obvious going back at least to 2010. Looking at what people were doing on the outside, quite a few startups in areas where mobile was critical were operating with a 2009 understanding of the future even as late as 2016 and 2017, focusing on a web app first, with no mobile app and a web app that was unusable on mobile. Another example of this is that, in 2012, <a href="https://twitter.com/danluu/status/1571051251357589510">quite a few people at Google independently wanted Google to make very large bets on deep learning</a>. It seemed very obvious that deep learning was going to be a really big deal and that it was worth making a billion-dollar investment in a portfolio of hardware that would accelerate Google's deep learning efforts.</p> <p>This isn't to say that the problem is trivial — many people with access to the same data still generally make incorrect predictions. A famous example is Ballmer's prediction that &quot;There’s no chance that the iPhone is going to get any significant market share. No chance.&quot;<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">3</a></sup> Ballmer and other MS leadership had access to information as good as what MS leadership had a decade earlier, but many of their predictions were no better than those of the futurists we discussed here. And with the deep learning example above, <a href="https://twitter.com/danluu/status/1571051255157653506">a competitor with the same information as Google totally whiffed and kept whiffing for years, even with the benefit of years of extra information</a>; they're still well behind Google now, a decade later, due to their failure to understand how to enable effective, practical, deep learning R&amp;D.</p> <h5 id="assuming-high-certainty">Assuming high certainty</h5> <p>Another common cause of incorrect predictions was having high certainty.
That's a general problem, and it's magnified when making predictions by looking at past exponential growth and extrapolating into the future, both because mispredicting the timing of a large change in exponential growth can have a very large impact and because relatively small sustained changes in an exponential growth rate can also have a large impact. An example that exposed these weaknesses for a large fraction of our futurists was their interpretation of Moore's law, which many interpreted as a doubling of every good thing and/or halving of every bad thing related to computers every 18 months. That was never what Moore's law predicted in the first place, but it was a common pop-conception of Moore's law. One illustrative thing is that predictors who were writing in the late 90s and early 00s still made these fantastical Moore's law &quot;based&quot; predictions even though it was such common knowledge that both single-threaded computer performance and Moore's law would face significant challenges that this was taught in undergraduate classes at the time. Any futurist who spent a few minutes talking to an expert in the area or even an undergrad would've seen that there's a high degree of uncertainty about computer performance scaling, but most of the futurists we discuss either don't do that or ignore evidence that would add uncertainty to their narrative<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">4</a></sup>. One example of this style of reasoning, discussing the design of the Flare programming language:</p> <blockquote> <p>As computing power increases, all constant-factor inefficiencies (&quot;uses twice as much RAM&quot;, &quot;takes three times as many RISC operations&quot;) tend to be ground under the heel of Moore's Law, leaving polynomial and exponentially increasing costs as the sole legitimate areas of concern. Flare, then, is willing to accept any O(C) inefficiency (single, one-time cost), and is willing to accept most O(N) inefficiencies (constant-factor costs), because neither of these costs impacts scalability; Flare programs and program spaces can grow without such costs increasing in relative significance. You can throw hardware at an O(N) problem as N increases; throwing hardware at an O(N**2) problem rapidly becomes prohibitively expensive.</p> </blockquote> <p>For computer scaling in particular, it would've been possible to make a reasonable prediction about 2022 computers in, say, 2000, but it would've had to be a prediction about the distribution of outcomes, one which put a lot of weight on severely reduced performance gains in the future with some weight on a portfolio of possibilities that could've resulted in continued large gains. Someone making such a prediction would've had to, implicitly or explicitly, be familiar with <a href="https://en.wikipedia.org/wiki/International_Technology_Roadmap_for_Semiconductors">ITRS</a> semiconductor scaling roadmaps of the era as well as the causes of recent misses (my recollection from reading roadmaps back then was that, in the short term, companies had actually exceeded recent scaling predictions, but via mechanisms that were not expected to be scalable into the future) as well as things that could unexpectedly keep semiconductor scaling on track.
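<p>To see why getting the distribution right matters so much, here's a minimal sketch (the rates below are round numbers I picked for illustration, not figures from any roadmap) of how much a sustained difference in the assumed annual improvement rate compounds over a 2000-to-2022 horizon:</p> <pre><code># Illustrative sketch: small sustained differences in an assumed annual
# improvement rate compound into enormous differences over 22 years.
# The rates below are made-up round numbers, not measured values.
horizon_years = 22  # e.g., predicting 2022 hardware from 2000

for label, annual_gain in [('naive exponential extrapolation (60%/year)', 0.60),
                           ('slowing but still fast (40%/year)', 0.40),
                           ('modest single-thread gains (10%/year)', 0.10)]:
    total = (1 + annual_gain) ** horizon_years
    print(f'{label}: ~{total:,.0f}x over {horizon_years} years')
</code></pre> <p>The specific rates are arbitrary, but the gap between the outcomes (tens of thousands of times better vs. an order of magnitude better) is why a single point estimate of the curve, rather than a distribution over plausible curves, is so fragile.</p>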
Furthermore, such a predictor would also have to be able to evaluate architectural ideas that might have panned out in order to rule them out or assign them a low probability: dataflow processors, the basket of techniques people were working on to increase <a href="https://en.wikipedia.org/wiki/Instruction-level_parallelism">ILP</a> in an attempt to move from the regime Tjaden and Flynn discussed in their classic 1970 and 1973 papers on ILP to something closer to the bound discussed by Riseman and Foster in 1972 and later by Nicolau and Fisher in 1984, etc.</p> <p>Such a prediction would be painstaking work for someone who isn't in the field because of the sheer number of different things that could have impacted computer scaling. Instead of doing this, futurists relied heavily on the pop-understanding they had about semiconductors. Kaku was notable among the futurists under discussion for taking seriously the idea that Moore's law wouldn't be smooth sailing in the future, but he incorrectly called when UV/EUV would run out of steam and also incorrectly had high certainty that some kind of more &quot;quantum&quot; technology would save computer performance scaling. Most other futurists who discussed computers used a line of reasoning like Kurzweil's, who said that we can predict what will happen with &quot;remarkable precision&quot; due to the existence of &quot;well-defined indexes&quot;:</p> <blockquote> <p>The law of accelerating returns applies to all of technology, indeed to any evolutionary process. It can be charted with remarkable precision in information-based technologies because we have well-defined indexes (for example, calculations per second per dollar, or calculations per second per gram) to measure them</p> </blockquote> <p>Another thing to note here is that, even if you correctly predict an exponential curve of something, understanding the implications of that precise fact also requires an understanding of the big picture of the kind shown by people like Yegge, Gates, and Myhrvold but not by the futurists discussed here. An example of roughly getting a scaling curve right but mispredicting the outcome was Dixon on the number of phone lines people would have in their homes. Dixon at least roughly correctly predicted the declining cost of phone lines but incorrectly predicted that this would result in people having many phone lines in their house despite also believing that digital technologies and cell phones would have much faster uptake than they did. With respect to phones, another missed prediction, one that came from not understanding the mechanism, was his prediction that the falling cost of phone calls would mean that tracking phone calls would be so expensive relative to the cost of the calls that phone companies wouldn't track individual calls.</p> <p>For someone who has a bit of understanding of the underlying technology, this is an odd prediction. One reason the prediction seems odd is that the absolute cost of tracking who called whom is very small and the rate at which humans make and receive phone calls is bounded at a relatively low rate, so even if the cost of metadata tracking were very high compared to the cost of the calls themselves, the absolute cost of tracking metadata would still be very low. Another way to look at it would be to compare the number of bits of information transferred during a phone call vs.
the number of bits of information necessary to store call metadata and the cost of storing that metadata long enough to bill someone on a per-call basis. Unless medium-term storage became more expensive than network transfer by a mind-bogglingly large factor, it wouldn't be possible for this prediction to be true. Dixon also implicitly predicted exponentially falling storage costs via his predictions on the size of available computer storage, with a curve steep enough that this criterion shouldn't be satisfied; even if it were somehow satisfied, the cost of storage would still be so low as to be negligible.</p> <h5 id="panacea-thinking">Panacea thinking</h5> <p>Another common issue is what Waleed Khan calls <a href="https://twitter.com/arxanas/status/1560756277231644673">panacea thinking</a>, where the person assumes that the solution is a panacea that is basically unboundedly great and can solve all problems. We can see this for quite a few futurists who were writing up until the 70s, where many assumed that computers would be able to solve any problem that required thought, computation, or allocation of resources and that resource scarcity would become irrelevant. But it turns out that quite a few problems don't magically get solved because powerful computers exist. For example, the 2008 housing crash created a shortfall of labor for housing construction that only barely got back to historical levels just before covid hit. Having fast computers neither prevented this nor fixed this problem after it happened because the cause of the problem wasn't a shortfall of computational resources. Some other topics that get this treatment are &quot;nanotechnology&quot;, &quot;quantum&quot;, &quot;accelerating growth&quot; / &quot;decreased development time&quot;, etc.</p> <p>A closely related issue that almost every futurist here fell prey to is only seeing the upside of technological advancements, resulting in a kind of techno-utopian view of the future. For example, in 2005, Kurzweil wrote:</p> <blockquote> <p>The current disadvantages of Web-based commerce (for example, limitations in the ability to directly interact with products and the frequent frustrations of interacting with inflexible menus and forms instead of human personnel) will gradually dissolve as the trends move robustly in favor of the electronic world. By the end of this decade, computers will disappear as distinct physical objects, with displays built in our eyeglasses, and electronics woven in our clothing, providing full-immersion visual virtual reality. Thus, &quot;going to a Web site&quot; will mean entering a virtual-reality environment—at least for the visual and auditory senses—where we can directly interact with products and people, both real and simulated.</p> </blockquote> <p>Putting aside the bit about how non-VR computer interfaces would disappear before 2010, it's striking how Kurzweil assumes that technological advancement will mean that corporations make experiences better for consumers instead of providing the same level of experience at a lower cost or <a href="https://twitter.com/ColeSouth/status/1550230811559198720">a worse experience at an even lower cost</a>.<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">5</a></sup></p> <p>Although that example is from Kurzweil, we can see the same techno-utopianism in the other authors on Wikipedia's list with the exception of Zerzan, whose predictions I didn't tally up because prediction wasn't really his shtick.
For example, a number of other futurists combined panacea thinking with techno-utopianism to predict that computers would cause things to operate with basically perfect efficiency without human intervention, allowing people at large to live a life of leisure. Instead, the benefits to the median person in the U.S. are subtle enough that people debate whether or not life has improved at all for the median person. And on the topic of increased efficiency, a number of people predicted an extreme version of just-in-time delivery that humanity hasn't even come close to achieving and described its upsides, but no futurist under discussion mentioned the downsides of a worldwide distributed just-in-time manufacturing system and supply chain, which include increased fragility and decreased robustness; this fragility notably impacted quite a few industries from 2020 through at least 2022 due to covid, despite the worldwide system not being anywhere near as just-in-time or fragile as a number of futurists predicted.</p> <p>Though not discussed here because they weren't on Wikipedia's list of notable futurists, there are pessimistic futurists such as Jaron Lanier and Paul Ehrlich. From a quick informal look at relatively well-known pessimistic futurists, it seems that pessimistic futurists haven't been more accurate than optimistic futurists. Many made predictions that were too vague to score and the ones who didn't tended to predict catastrophic collapse or overly dystopian futures <a href="https://www.gwern.net/Improvements">which haven't materialized</a>. Fundamentally, dystopian thinkers made the same mistakes as utopian thinkers. For example, Paul Ehrlich fell prey to the same issues utopian thinkers fell prey to and he still maintains that his discredited book, The Population Bomb, was fundamentally correct, just like utopian futurists who maintain that their discredited work is fundamentally correct.</p> <p>Ehrlich's 1968 book opened with:</p> <blockquote> <p>The battle to feed all of humanity is over. In the 1970s the world will undergo famines — hundreds of millions of people are going to starve to death in spite of any crash programs embarked upon now. At this late date nothing can prevent a substantial increase in the world death rate, although many lives could be saved through dramatic programs to &quot;stretch&quot; the carrying capacity of the earth by increasing food production. But these programs will only provide a stay of execution unless they are accompanied by determined and successful efforts at population control. Population control is the conscious regulation of the numbers of human beings to meet the needs, not just of individual families, but of society as a whole.</p> <p>Nothing could be more misleading to our children than our present affluent society. They will inherit a totally different world, a world in which the standards, politics, and economics of the 1960s are dead.</p> </blockquote> <p>When this didn't come to pass, he did the same thing as many futurists we looked at and moved the dates on his prediction, changing the text in the opening of his book from &quot;1970s&quot; to &quot;1970s and 1980s&quot;.
Ehrlich then wrote a new book with even more dire predictions in 1990.</p> <p>And then later, Ehrlich simply denied ever having made predictions, even though anyone who reads his book can plainly see that he makes plenty of statements about the future with no caveats about the statements being hypothetical:</p> <blockquote> <p>Anne and I have always followed UN population projections as modified by the Population Reference Bureau — so we never made &quot;predictions,&quot; even though idiots think we have.</p> </blockquote> <p>Unfortunately for pessimists, simply swapping the sign bit on panacea thinking doesn't make predictions more accurate.</p> <h5 id="evidence-free-assumptions">Evidence free assumptions</h5> <p>Another major source of errors among these futurists was making an instrumental assumption without any supporting evidence for it. A major example of this is Fresco's theory that you can predict the future by starting from people's values and working back from there, but he doesn't seriously engage with the idea of how people's values can be predicted. Since those are pulled from his intuition without being grounded in evidence, starting from people's values creates a level of indirection, but doesn't fundamentally change the problem of predicting what will happen in the future.</p> <h3 id="fin">Fin</h3> <p>A goal of this project is to look at current predictors to see who's using methods that have historically had a decent accuracy rate, but we're going to save that for a future post. I normally don't like splitting posts up into multiple parts, but since this post is 30k words (the number of words in a small book, and more words than most pop-sci books have once you remove the pop stories) and evaluating futurists is relatively self-contained, we're going to stop with that (well, with a bit of an evaluation of some longtermist analyses that overlap with this post in the appendix)<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">6</a></sup>.</p> <p>In terms of concrete takeaways, you could consider this post a kind of negative result that supports the very boring idea that you're not going to get very far if you make predictions on topics you don't understand, whereas you might be able to make decent predictions if you have (or gain) a deep expertise of a topic and apply well-honed intuition to predict what might happen. We've looked at, in some detail, a number of common reasoning errors that cause predictions to miss at a high rate and also taken a bit of a look into some things that have worked for creating relatively accurate predictions.</p> <p>A major caveat about what's worked is that while using high-level techniques that work poorly is a good way to generate poor predictions, using high-level techniques that work well doesn't mean much because the devil is in the details and, as trite as this is to say, you really need to think about things. This is something that people who are serious about looking at data often preach, e.g., you'll see this theme come up on <a href="https://statmodeling.stat.columbia.edu/">Andrew Gelman's blog</a> as well as in <a href="https://www.youtube.com/playlist?list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN">Richard McElreath's Statistical Rethinking</a>. McElreath, in a lecture targeted at social science grad students who don't have a quantitative background, likens statistical methods to a golem. A golem will mindlessly do what you tell it to do, just like statistical techniques. 
There's no substitute for using your brain to think through whether or not it's reasonable to apply a particular statistical technique in a certain way. People often seem to want to use methods as a talisman to ward off incorrectness, but that doesn't work.</p> <p>We see this in the longtermist analyses we examine in the appendix, which claim to be more accurate than &quot;classical&quot; futurist analyses because they, among other techniques, state probabilities, which the literature on forecasting (e.g., Tetlock's Superforecasting) says one should do. But the analyses fundamentally use the same techniques as the futurist analyses we looked at here and then add a few things on top that are also things that people who make accurate predictions do. This is backwards. Things like probabilities need to be a core part of modelling, not something added afterwards. This kind of backwards reasoning is a common error when doing data analysis and I would caution readers who think they're safe against errors because their analyses can, at a high level, be described in roughly the same terms as good analyses<sup class="footnote-ref" id="fnref:N"><a rel="footnote" href="#fn:N">7</a></sup>. An obvious example of this would be the Bill Gates review we looked at. Gates asked a lot of questions and scribbled quite a few notes in the margins, but asking a lot of questions and scribbling notes in the margins of docs doesn't automatically cause you to have a good understanding of the situation. This example is so absurd that I don't think anyone even remotely reasonable would question it, but most analyses I see (of the present as well as of the future) make this fundamental error in one way or another and, as <a href="https://twitter.com/rygorous/status/1569379986708254721">Fabian Giesen might say, are cosplaying what a rigorous analysis looks like</a>.</p> <p><i>Thanks to nostalgebraist, Arb Research (Misha Yagudin, Gavin Leech), Laurie Tratt, Fabian Giesen, David Turner, Yossi Kreinin, Catherine Olsson, Tim Pote, David Crawshaw, Jesse Luehrs, @TyphonBaalAmmon, Jamie Brandon, Tao L., Hillel Wayne, Qualadore Qualadore, Sophia, Justin Blank, Milosz Danczak, Waleed Khan, Mindy Preston, @ESRogs, Tim Rice, and @s__video for comments/corrections/discussion (and probably some others I forgot because this post is so long and I've gotten so many comments).</i></p> <p><b>Update / correction</b>: an earlier version of this post contained <a href="https://twitter.com/ESRogs/status/1570505904445095939">this error</a>, pointed out by ESRogs. Although I don't believe the error impacts the conclusion, <a href="https://twitter.com/danluu/status/1570520176239726593">I consider it a fairly major error</a>. If we were doing a tech-company style postmortem, the fact that it doesn't significantly impact the conclusion would be included in the &quot;How We Got Lucky&quot; section of the postmortem. In particular, this was a &quot;lucky&quot; error because it was made when picking out a few examples from a large portfolio of errors to illustrate one predictor's errors, so a single incorrectly identified error doesn't change the conclusion: another error could be substituted in and, even if no other error were substituted, the quality of the reasoning being evaluated still looks quite low.
But incorrectly concluding that something is an error could lead to a different conclusion in the case of a predictor who made few or no errors, which is why this was a lucky mistake for me to make.</p> <h3 id="appendix-brief-notes-on-superforecasting-https-amzn-to-3xzg3a2">Appendix: brief notes on <a href="https://amzn.to/3xzG3a2">Superforecasting</a></h3> <ul> <li>Very difficult to predict more than 3-5 years out; people generally don't do much better than random <ul> <li>Later in the book, 10 years is cited as a basically impossible timeframe, but that's scoped to certain kinds of predictions (the earlier statement of 3-5 years is more general): <blockquote> <p>Taleb, Kahneman, and I agree there is no evidence that geopolitical or economic forecasters can predict anything ten years out beyond the excruciatingly obvious—“there will be conflicts”—and the odd lucky hits that are inevitable whenever lots of forecasters make lots of forecasts. These limits on predictability are the predictable results of the butterfly dynamics of nonlinear systems. In my EPJ research, the accuracy of expert predictions declined toward chance five years out. And yet, this sort of forecasting is common, even within institutions that should know better</p> </blockquote></li> <li>One possibility is that people like Bill Gates are right due to hindsight bias, but that doesn't seem correct w.r.t., e.g., being at Google making it obvious that mobile was the only way forward circa 2010</li> </ul></li> <li>Ballmer prediction: &quot;There’s no chance that the iPhone is going to get any significant market share. No chance.&quot;</li> <li>Very important to precisely write down forecasts</li> <li>&quot;big idea&quot; predictors inaccurate (as in, they heavily rely on one or a few big hammers, like &quot;global warming&quot;, &quot;ecological disaster&quot;, &quot;Moore's law&quot;, etc., to drive everything)</li> <li>Specific knowledge predictors (relatively) accurate; relied heavily on probabilistic thinking, used different analytical tools as appropriate</li> <li>Good forecasters are fluent with numbers, generally aced the numerical proficiency test given to forecasters, think probabilistically</li> <li>Good forecasters not particularly high IQ; typical non-superforecaster IQ from the forecaster population was 70%-ile; typical superforecaster IQ was 80%-ile</li> </ul> <p>See also <a href="https://marginalrevolution.com/marginalrevolution/2020/04/my-conversation-with-philip-tetlock.html">this Tetlock interview with Tyler Cowen</a> if you don't want to read the whole book, although the book is a very quick read because it's written in the standard pop-sci style, with a lot of anecdotes/stories.</p> <p>On the people we looked at vs. the people Tetlock looked at, the predictors we looked at are operating in a very different style from the folks studied in the studies that led to the Superforecasting book. Both futurists and tech leaders were trying to predict a vision for the future whereas superforecasters were asked to answer very specific questions.</p> <p>Another major difference among the accurate predictors is that the accurate predictors we looked at (other than Caplan) had very deep expertise in their fields. This may be one reason for the difference in timelines here, where it appears that some of our predictors can predict things more than 3-5 years out, contra Tetlock's assertion.
Another difference is in the kind of thing being predicted — a lot of the predictions we're looking at here are fundamentally whether or not a trend will continue or if a nascent trend will become a long-running trend, which seems easier than a lot of the questions Tetlock had his forecasters try to answer. For example, in the opening of Superforecasting, Tetlock gives predicting the Arab Spring as an example of something that would've been practically impossible — while the conditions for it had been there for years, the proximal cause of the Arab Spring was a series of coincidences that would've been impossible to predict. This is quite different from, and arguably much more difficult than, someone in 1980 guessing that computers will continue to get smaller and faster, leading to handheld computers more powerful than supercomputers from the 80s.</p> <h3 id="appendix-other-evaluations">Appendix: other evaluations</h3> <ul> <li><a href="http://jbr.me.uk/retro/">Justin Rye on Heinlein, Clarke, and Asimov</a></li> <li><a href="https://www.cold-takes.com/the-track-record-of-futurists-seems-fine/">Holden Karnofsky / Arb Research on Heinlein, Clarke, and Asimov</a>, as well as Karnofsky on Kurzweil, Kahn, and Weiner</li> <li>Various, on Ray Kurzweil (try googling, without quotes, &quot;Kurzweil 86% accuracy&quot;)</li> <li><a href="https://news.ycombinator.com/item?id=31294698">A variety of HN commenters on a futurist who scored themselves at 50% accuracy</a></li> <li><a href="https://tratt.net/laurie/blog/2005/predicting_the_future_of_computing.html">Laurie Tratt</a> on <a href="https://web.archive.org/web/20150325002205/http://www.ukcrc.org.uk/press/news/report/gcresearch.cfm?type=pdf">some 2005 predictions on what would be important in computing</a></li> <li><a href="https://www.markloveless.net/blog/2023/01/02/past-predictions">Mark Loveless</a> on his own infosec predictions as far back as 1995</li> </ul> <p>Of the evaluations above, the only one with any intersection with the futurists evaluated here is the Kurzweil one. Holden Karnofsky says:</p> <blockquote> <p>A 2013 project assessed Ray Kurzweil's 1999 predictions about 2009, and a 2020 followup assessed his 1999 predictions about 2019. Kurzweil is known for being interesting at the time rather than being right with hindsight, and a large number of predictions were found and scored, so I consider this study to have similar advantages to the above study. ... Kurzweil is notorious for his very bold and contrarian predictions, and I'm overall inclined to call his track record something between &quot;mediocre&quot; and &quot;fine&quot; - too aggressive overall, but with some notable hits</p> </blockquote> <p>Karnofsky's evaluation of Kurzweil being &quot;fine&quot; to &quot;mediocre&quot; relies on <a href="https://www.lesswrong.com/posts/kbA6T3xpxtko36GgP/assessing-kurzweil-the-results">these</a> <a href="https://www.lesswrong.com/posts/NcGBmDEe5qXB7dFBF/assessing-kurzweil-predictions-about-2019-the-results">two</a> analyses done on LessWrong and then uses a very generous interpretation of the results to conclude that Kurzweil's predictions are fine. Those two posts rate predictions as true, weakly true, cannot decide, weakly false, or false. Karnofsky then compares the number of true + weakly true to false + weakly false, which is one level of rounding up to get an optimistic result; another way to look at it is that any level other than &quot;true&quot; is false when read as written.
This issue is magnified if you actually look at the data and methodology used in the LW analyses.</p> <p>In the second post, the author, Stuart Armstrong, indirectly noted that there were actually no predictions that were, by strong consensus, very true when he noted that the &quot;most true&quot; prediction had a mean score of 1.3 (1 = true, 2 = weakly true, ..., 5 = false) and the second-highest-rated prediction had a mean score of 1.4. Although Armstrong doesn't note this in the post, if you look at the data, you'll see that the third &quot;most true&quot; prediction had a mean score of 1.45 and the fourth had a mean score of 1.6, i.e., if you round to the nearest prediction score, only 3 out of 105 predictions score &quot;true&quot; and 32 are &gt;= 4.5 and score &quot;false&quot;. Karnofsky reads Armstrong's post as scoring 12% of predictions true, but the post effectively makes no comment on what fraction of predictions were scored true; the 12% comes from summing up the total number of each rating given.</p> <p>I'm not going to say that taking the mean of each question is the only way one could aggregate the numbers (taking the median or modal values could also be argued for, as well as some more sophisticated scoring function, an extremizing function, etc.), but summing up all of the votes across all questions results in a nonsensical number that shouldn't be used for almost anything. If every rater rated every prediction or there was a systematic interleaving of who rated what questions, then the number could be used for something (though not as a score for what fraction of predictions are accurate), but since each rater could skip any questions (although people were instructed to start rating at the first question and rate all questions until they stop, people did not do that and skipped arbitrary questions), aggregating the number of each score given is not meaningful and actually gives very little insight into what fraction of questions are true; the toy example below shows how pooled vote counts and per-prediction means can point in different directions. There's an air of rigor about all of this; there are lots of numbers, standard deviations are discussed, etc., but the way most people, including Karnofsky, interpret the numbers in the post is incorrect. I find it a bit odd that, with all of the commentary on these LW posts, few people spent the one minute (and I mean one minute literally — it took me a minute to read the post, see the comment Armstrong made which is a red flag, and then look at the raw data) it would take to look at the data and understand what the post is actually saying, but <a href="dunning-kruger/">as we've noted previously, almost no one actually reads what they're citing</a>.</p> <p>Coming back to Karnofsky's rating of Kurzweil as fine to mediocre, this relies on two levels of rounding: first, doing the wrong kind of aggregation on the raw data rounds an accuracy of perhaps 3% up to 12%; then, doing the comparison mentioned above instead of looking at the number of true statements rounds up again. If we use a strict reading and look at the 3%, the numbers aren't so different from what we see in this post. If we look at Armstrong's other post, there are too few raters to really produce any kind of meaningful aggregation. Armstrong rated every prediction, one person rated 68% of predictions, and no one else even rated half of the 172 predictions. The 8 raters gave 506 ratings in total, so the number of ratings is equivalent to having 3 raters rate all predictions, but the results are much noisier due to the arbitrary way people decided to pick predictions.
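<p>As a toy illustration of the aggregation problem (these are numbers I made up, not the LW ratings), here's how pooling raw votes can suggest a meaningfully different &quot;percent true&quot; than scoring each prediction by the mean of the ratings it actually received:</p> <pre><code># Toy example with invented data (not the LW ratings): raters skip arbitrary
# predictions, so pooling raw votes weights predictions unevenly.
# Scale: 1 = true, 2 = weakly true, ..., 5 = false.
ratings = {
    'prediction A': [1, 5, 5],   # one enthusiastic rater, two harsh ones
    'prediction B': [1, 4, 4],
    'prediction C': [2, 2],
    'prediction D': [5, 5, 5],
}

# Method 1: score each prediction by its own mean rating, then count the true ones.
true_by_mean = sum(1 for votes in ratings.values()
                   if round(sum(votes) / len(votes)) == 1)
print(f'predictions scoring true by per-prediction mean: {true_by_mean} of {len(ratings)}')

# Method 2: pool every individual vote, as if votes were interchangeable.
all_votes = [v for votes in ratings.values() for v in votes]
share_true_votes = sum(1 for v in all_votes if v == 1) / len(all_votes)
print(f'share of pooled raw votes that say true: {share_true_votes:.0%}')
</code></pre> <p>With these made-up numbers, no prediction scores &quot;true&quot; on a per-prediction basis, but 18% of the pooled votes are &quot;true&quot; votes; the two methods answer different questions, and only the first answers &quot;what fraction of predictions were rated true?&quot;.</p>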
This issue is much worse for the 2009 predictions than the 2019 predictions due to the smaller number of raters combined with the sparseness of most raters, making this data set fairly low fidelity; if you want to make a simple inference from the 2019 data, you're probably best off using Armstrong's ratings and discarding the rest (there are non-simple analyses one could do, but if you're going to do that, you might as well just rate the predictions yourself).</p> <p>Another fundamental issue with the analysis is that it relies on aggregating votes from a population that's heavily drawn from Less Wrong readers and the associated community. <a href="https://www.patreon.com/posts/how-do-their-71735437">As we discussed here</a>, it's common to see the most upvoted comments in forums like HN, lobsters, LW, etc., be statements that can clearly be seen to be wrong with no specialized knowledge and a few seconds of thought (and an example is given from LW in the link), so why should an aggregation of votes from the LW community be considered meaningful? I often see people refer to the high-level &quot;wisdom of crowds&quot; idea, but if we look at the specific statements endorsed by online crowds, we can see that these crowds are often not so wise. In the Arb Research evaluation (discussed below), they get around this problem by reviewing answers themselves and also offering a bounty for incorrectly graded predictions, which is one way to deal with having untrustworthy raters, but Armstrong's work has no mitigation for this issue.</p> <p>On the Karnofsky / Arb Research evaluation, Karnofsky appears to use a less strict scoring than I do and once again optimistically &quot;rounds up&quot;. The Arb Research report scores each question as &quot;unambiguously wrong&quot;, &quot;ambiguous or near miss&quot;, or &quot;unambiguously right&quot; but Karnofsky's scoring removes the ambiguous and near miss results, whereas my scoring only removes the ambiguous results, the idea being that a near miss is still a miss. Accounting for those reduces the scores substantially but still leaves Heinlein, Clarke, and Asimov with significantly higher scores than the futurists discussed in the body of this post. For the rest, many of the predictions that were scored as &quot;unambiguously right&quot; are ones I would've declined to rate, for reasons similar to those for the predictions I did decline to rate (e.g., a prediction that something &quot;may well&quot; happen was rated as &quot;unambiguously right&quot; and I would consider that unfalsifiable and therefore not include it). There are also quite a few &quot;unambiguously right&quot; predictions that I would rate as incorrect using a strict reading similar to the readings that you can see below in the detailed appendix.</p> <p>Another place where Karnofsky rounds up is that Arb Research notes that 'The predictions are usually very vague. Almost none take the form “By Year X technology Y will pass on metric Z”'. This makes the prediction accuracy of the futurists Arb Research looked at not comparable to precise predictions of the kind Caplan or Karnofsky himself makes, but Karnofsky directly uses those numbers to justify why his own predictions are accurate without noting that the numbers are not comparable.
Since the non-comparable numbers were already rounded up, there are two levels of rounding here (more on this later).</p> <p>As noted above, some of the predictions are ones that I wouldn't rate because I don't see where the prediction is, such as this one (this is the &quot;exact text&quot; of the prediction being scored, according to the Arb Research spreadsheet), which was scored &quot;unambiguously right&quot;:</p> <blockquote> <p>application of computer technology to professional sports be counterproduc- tive? Would the public become less interested in sports or in betting on the outcome if matters became more predictable? Or would there always be enough unpredictability to keep interest high? And would people derive particular excitement from beat ing the computer when low-ranking players on a particular team suddenly started</p> </blockquote> <p>This seems like a series of questions about something that might happen, but it wouldn't be false if none of these happened, so it would not count as a prediction in my book.</p> <p>Similarly, I would not have rated the following prediction, which Arb also scored &quot;unambiguously right&quot;:</p> <blockquote> <p>its potential is often realized in ways that seem miraculous, not because of idealism but because of the practical benefits to society. Thus, the computer's ability to foster human creativity may well be utilized to its fullest, not because it would be a wonderful thing but because it will serve important social functions Moreover, we are already moving in the</p> </blockquote> <p>Another kind of prediction that was sometimes scored &quot;unambiguously right&quot; that I declined to score was the prediction of the form &quot;this trend that's in progress will become somewhat {bigger / more important}&quot;, such as the following:</p> <blockquote> <p>The consequences of human irresponsibility in terms of waste and pollution will become more apparent and unbearable with time and again, attempts to deal with this will become more strenuous. It is to be hoped that by 2019, advances in technology will place tools in our hands that will help accelerate the process whereby the deterioration of the environment will be reversed.</p> </blockquote> <p>On Karnofsky's larger point, that we should trust longtermist predictions because futurists basically did fine and longtermists are taking prediction more seriously and trying harder and should therefore generate better predictions, that's really a topic for another post, but I'll briefly discuss it here because of the high intersection with this post. There are two main pillars of this argument. First, that futurists basically did fine, which, as we've seen, relies on a considerable amount of rounding up. 
And second, that the methodologies that longtermists are using today are considerably more effective than what futurists did in the past.</p> <p>Karnofsky says that the futurists he looked at &quot;collect casual predictions - no probabilities given, little-to-no reasoning given, no apparent attempt to collect evidence and weigh arguments&quot;, whereas Karnofsky's summaries use (among other things):</p> <ul> <li>Reports that Open Philanthropy employees spent thousands of hours on, systematically presenting evidence and considering arguments and counterarguments.</li> <li>A serious attempt to take advantage of the nascent literature on how to make good predictions; e.g., the authors (and I) have generally done calibration training, and have tried to use the language of probability to be specific about our uncertainty.</li> </ul> <p>We've seen that, when evaluating futurists with an eye towards evaluating longtermists, Karnofsky heavily rounds up in the same way Kurzweil and other futurists do, to paint the picture they want to create. There's also the matter of his summary of a report on Kurzweil's predictions being incorrect because he didn't notice that the author of that report used a methodology that produced nonsense numbers favorable to the conclusion Karnofsky prefers. It's true that Karnofsky and the reports he cites do the superficial things that the forecasting literature notes are associated with more accurate predictions, like stating probabilities. But for this to work, the probabilities need to come from understanding the data. If you take a pile of data, incorrectly interpret it, and then round up the interpretation further to support a particular conclusion, throwing a probability on it at the end is not likely to make it accurate. Although he doesn't use these words, a key thing Tetlock notes in his work is that people who round things up or down to conform to a particular agenda produce low-accuracy predictions. Since Karnofsky's errors and rounding heavily lean in one direction, that seems to be happening here.</p> <p>We can see this in other analyses as well. Although digging into material other than futurist predictions is outside of the scope of this post, nostalgebraist has done this and he said (in a private communication that he gave me permission to mention) that Karnofsky's summary of <a href="https://openphilanthropy.org/research/could-advanced-ai-drive-explosive-economic-growth/">https://openphilanthropy.org/research/could-advanced-ai-drive-explosive-economic-growth/</a> is substantially more optimistic about AI timelines than the underlying report in that there's at least one major concern raised in the report that's not brought up as a &quot;con&quot; in Karnofsky's summary, and nostalgebraist later wrote <a href="https://nostalgebraist.tumblr.com/post/693718279721730048/on-bio-anchors">this post</a>, where he (implicitly) notes that the methodology used in a report he examined in detail is fundamentally not so different from what the futurists we discussed used. There are quite a few things that may make the report appear credible (it's hundreds of pages of research, there's a complex model, etc.), but when it comes down to it, the model boils down to a few simple variables. In particular, a huge fraction of the variance in whether or not TAI is deemed likely comes down to the amount of improvement that will occur in terms of hardware cost, particularly <code>FLOPS/$</code>. 
The output of the model can range from 34% to 88% depending on how much improvement we get in <code>FLOPS/$</code> after 2025. Putting arbitrarily large <code>FLOPS/$</code> amounts into the model, i.e., the scenario where infinite computational power is free (since other dimensions, like storage and network, aren't in the model, let's assume that <code>FLOPS/$</code> is a proxy for those as well), only pushes the probability of TAI up to 88%, which I would rate as too pessimistic, although it's hard to have a good intuition about what would actually happen if infinite computational power were on tap for free. Conversely, with no performance improvement in computers, the probability of TAI is 34%, which I would rate as overly optimistic without a strong case for it. But I'm just some random person who doesn't work in AI risk and hasn't thought about it too much, so your guess on this is as good as mine (and likely better if you're the equivalent of Yegge or Gates and work in the area).</p> <p>The part that makes this fundamentally the same thing the futurists here did is that the estimate of <code>FLOPS/$</code> improvement, which is instrumental to this prediction, is pulled from thin air by someone who is not a deep expert in semiconductors, computer architecture, or a related field that might inform this estimate.</p> <p>As Karnofsky notes, a number of things were done in an attempt to make this estimate reliable (&quot;the authors (and I) have generally done calibration training, and have tried to use the language of probability&quot;) but, when you come up with a model where a single variable controls most of the variance and the estimate for that variable is picked out of thin air, all of the modeling work actually reduces my confidence in the estimate. If you say that, based on your intuition, you think there's some significant probability of TAI by 2100 (10% or 50% or 80% or whatever number you want), I'd say that sounds plausible (why not? things are improving quickly and may continue to do so) but wouldn't place any particular faith in the estimate. If you build a model where the output hinges on a relatively small number of variables and then say that there's an 80% chance, with a critical variable picked out of thin air, should that estimate be more or less confidence inspiring than the estimate based solely on intuition? I don't think the answer should be that the model's output is higher confidence. The direct guess of 80% is at least honest about its uncertainty. In the model-based case, since the model doesn't propagate uncertainties and the choice of a high but uncertain number can cause the model to output a fairly certain number, like 88%, there's a disconnect between the actual uncertainty produced by the model and the probability estimate.</p>
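<p>To illustrate the disconnect, here's a minimal sketch with a made-up toy mapping (it is not the actual model from the report): a single uncertain input, the assumed post-2025 <code>FLOPS/$</code> improvement in orders of magnitude (OOM), drives the output. Plugging in one point estimate yields a single confident-looking probability, while propagating even a rough guess at the input's uncertainty spreads the output across most of its possible range:</p>

<pre><code>import random

# Made-up monotonic curve pinned to the 34%..88% range quoted above; a toy
# stand-in for illustration only, not the report's actual model.
def toy_model(oom_improvement):
    clamped = min(max(oom_improvement, 0), 10)
    return 0.34 + 0.54 * clamped / 10

# A single point estimate (say, 6 OOM) produces one confident-looking number.
print(round(toy_model(6), 2))  # 0.66

# Treating the 6 OOM guess as uncertain (anywhere from 1 to 9 OOM, a range
# chosen arbitrarily here) spreads the output over much of the possible range.
samples = sorted(toy_model(random.uniform(1, 9)) for _ in range(10_000))
low, high = samples[len(samples) // 20], samples[19 * len(samples) // 20]
print(round(low, 2), round(high, 2))  # roughly 0.42 and 0.80
</code></pre>

<p>The specific numbers are invented; the point is that reporting a single confident-looking output (or an 88% upper bound) carries an implied precision that the underlying guess about the input can't support.</p>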
<p>At one point, in summarizing the report, Karnofsky says</p> <blockquote> <p>I consider the &quot;evolution&quot; analysis to be very conservative, because machine learning is capable of much faster progress than the sort of trial-and-error associated with natural selection. Even if one believes in something along the lines of &quot;Human brains reason in unique ways, unmatched and unmatchable by a modern-day AI,&quot; it seems that whatever is unique about human brains should be re-discoverable if one is able to essentially re-run the whole history of natural selection. And even this very conservative analysis estimates a ~50% chance of transformative AI by 2100</p> </blockquote> <p>But it seems very strong to call this a &quot;very conservative&quot; estimate when the estimate implicitly relies on future <code>FLOPS/$</code> improvement staying above some arbitrary, unsupported threshold. In the appendix of the report itself, it's estimated that there will be a 6 order of magnitude (OOM) improvement and that a 4 OOM improvement would be considered conservative, but why should we expect that 6 OOM is the amount of headroom left for hardware improvement and 4 OOM is some kind of conservative goal that we'll very likely reach? Given how instrumental these estimates are to the output of the model, there's a sense in which the uncertainty of the final estimate has to be at least as large as the uncertainty of these estimates multiplied by their impact on the model, but that can't be the case here given the lack of evidence or justification for these inputs to the model.</p> <p>More generally, the whole methodology is backwards — if you have deep knowledge of a topic, then it can be valuable to put a number down to convey the certainty of your knowledge to other people, and if you don't have deep knowledge but are trying to understand an area, then it can be valuable to state your uncertainties so that you know when you're just guessing. But here, we have a fairly confidently stated estimate (nostalgebraist notes that Karnofsky says &quot;Bio Anchors estimates a &gt;10% chance of transformative AI by 2036, a ~50% chance by 2055, and an ~80% chance by 2100.&quot;) that's based on a nonsense model that relies on a variable picked out of thin air. Naming a high probability after the fact and then naming a lower number and saying that's conservative when it's based on this kind of modeling is just window dressing. Looking at <a href="https://marginalrevolution.com/marginalrevolution/2022/03/holden-karnofsky-emails-me-on-transformative-ai.html">Karnofsky's comments elsewhere</a>, he lists a number of extremely weak pieces of evidence in support of his position, e.g., in the previous link, he has a laundry list of evidence of mixed strength, including Metaculus, which nostalgebraist has noted is basically worthless for this purpose <a href="https://nostalgebraist.tumblr.com/post/692246981744214016/more-on-metaculus-badness">here</a> and <a href="https://nostalgebraist.tumblr.com/post/692086358174498816/idk-who-needs-to-hear-this-but-metaculus-is">here</a>. It would be very odd for someone who's truth seeking on this particular issue to cite so many bad pieces of evidence; creating a laundry list of such mixed evidence is consistent with someone who has a strong prior belief and is looking for anything that will justify it, no matter how weak. That would also be consistent with the shoddy direct reasoning noted above.</p> <p>Back to other evaluators, on Justin Rye's evaluations, I would grade the predictions &quot;as written&quot; and therefore more strictly than he did and would end up with lower scores.</p> <p>For the predictors we looked at in this document who mostly or nearly exclusively give similarly vague predictions, I declined to give them anything like a precise numerical score. 
To be clear, I think there's value in trying to score vague predictions and near misses, but that's a different thing from what this document did, so the scores aren't directly comparable.</p> <p>A number of people have said that predictions by people who make bold predictions, the way Kurzweil does, are actually pretty good. After all, if someone makes a lot of bold predictions and they're all off by 10 years, that person will have useful insights even if they lose all their bets and get taken to the cleaners in prediction markets. However, that doesn't mean that someone who makes bold predictions should always &quot;get credit for&quot; making bold predictions. For example, in Kurzweil's case, 7% accuracy might not be bad if he uniformly predicted really bold stuff like unbounded life span by 2011. However, that only applies if the hits and misses are both bold predictions, which was not the case in the sampled set of predictions for Kurzweil here. Of Kurzweil's predictions evaluated in this document, the correct ones tended to be very boring, e.g., there will be no giant economic collapse that stops economic growth, cochlear implants will be in widespread use in 2019 (predicted in 1999), etc.</p> <p>The former is a Caplan-esque bet against people who were making wild predictions that there would be severe or total economic collapse. There's value in bets like that, but it's also not surprising when such a bet is successful. For the latter, the data I could quickly find on cochlear implant rates showed that implant rates increased slowly and linearly from the time Kurzweil made the bet until 2019. I would call that a correct prediction, but the prediction is basically just betting that nothing drastically drops cochlear implant rates, making that another Caplan-esque safe bet and not a bet that relies on Kurzweil's ideas about the law of accelerating returns that his wild bets rely on.</p> <p>If someone makes 40 boring bets of which 7 are right and another person makes 40 boring bets and 22 wild bets and 7 of their boring bets and 0 of their wild bets are right (these are arbitrary numbers as I didn't attempt to classify Kurzweil's bets as wild or not other than the 7 that were scored as correct), do you give the latter person credit for having &quot;a pretty decent accuracy given how wild their bets were&quot;? I would say no.</p>
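<p>Spelling out the arithmetic with those hypothetical numbers (they're the illustrative counts from the previous paragraph, not a tally of Kurzweil's actual bets):</p>

<pre><code># Hypothetical counts from the example above, not a tally of Kurzweil's bets.
boring_made, boring_right = 40, 7
wild_made, wild_right = 22, 0

person_a_accuracy = boring_right / boring_made  # 0.175
person_b_accuracy = (boring_right + wild_right) / (boring_made + wild_made)  # ~0.113

# Conditional on boldness, the second person's accuracy is 0/22, so any extra
# credit for boldness would be granted for bets that all missed.
person_b_bold_accuracy = wild_right / wild_made  # 0.0

print(person_a_accuracy, person_b_accuracy, person_b_bold_accuracy)
</code></pre>

<p>Grading on a curve for boldness only helps if the bold bets themselves hit at a reasonable rate, which wasn't the case in the sample scored here.</p>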
<p>On the linked HN thread about a particular futurist, the futurist scored themselves 5 out of 10, but most HN commenters scored the same person at 0 out of 10 or, generously, at 1 out of 10, with the general comment that the person and other futurists tend to score themselves much too generously:</p> <blockquote> <p>sixQuarks: I hate it when “futurists” cherry pick an outlier situation and say their prediction was accurate - like the bartender example.</p> <p>karaterobot: I wanted to say the same thing. He moved the goal posts from things which &quot;would draw hoots of derision from an audience from the year 2022&quot; to things which there has been some marginal, unevenly distributed, incremental change to in the last 10 years, then said he got it about 50% right. More generally, this is the issue I have with futurists: they get things wrong, and then just keep making more predictions. I suppose that's okay for them to do, unless they try to get people to believe them, and make decisions based on their guesses.</p> <p>chillacy: Reminded me of the ray [kurzweil] predictions: extremely generous grading.</p> </blockquote> <h3 id="appendix-other-reading">Appendix: other reading</h3> <ul> <li><a href="dick-sites-alpha-axp-architecture.pdf">Richard Sites and his DEC colleagues presciently looking 30+ years into the future with respect to computer architecture (written in 1992, summarizing work started in 1988)</a> <ul> <li>Not included in the main list of people with accurate predictions because the implicit and explicit predictions here are so narrow, but this is a stellar example of using deep domain knowledge to foresee the future as well as understand what current actions will make sense decades down the line</li> </ul></li> <li><a href="https://statmodeling.stat.columbia.edu/2022/08/11/bets-as-forecasts-bets-as-probability-assessment-difficulty-of-using-bets-in-this-way/">Andrew Gelman on forecast bets as probability assessments</a></li> <li><a href="https://nostalgebraist.tumblr.com/post/684991616356876288/its-driving-me-up-the-wall-the-way-i-keep-seeing">Nostalgebraist on how a lot of AI commenters are behaving like futurists of days past</a></li> <li><a href="https://astralcodexten.substack.com/p/i-won-my-three-year-ai-progress-bet">Scott Alexander on the optimistic side of an AI progress bet winning</a></li> <li><a href="https://statmodeling.stat.columbia.edu/2022/10/04/dow-36000-but-in-graph-form/">Andrew Gelman on silly graphs in predictions</a></li> <li><a href="https://rodneybrooks.com/category/dated-predictions/">Rodney Brooks on success to date from taking the pessimistic side on AI progress (he calls this the realistic side but, in this context, I consider that to be a more loaded term)</a></li> <li><a href="https://www.econlib.org/no-one-cared-about-my-spreadsheets/">Bryan Caplan on how no one looked into his quantitative results, despite many comments on whether or not his work was correct</a>. See also <a href="dunning-kruger/">me on the same phenomenon elsewhere</a></li> </ul> <h3 id="appendix-detailed-information-on-predictions">Appendix: detailed information on predictions</h3> <h4 id="ray-kurzweil-1">Ray Kurzweil</h4> <p>4/59 for rated predictions. If you feel that the ones I didn't include but that one could arguably include should count, then 7/62.</p> <p>This list comes <a href="https://web.archive.org/web/20170225013846/https://en.wikipedia.org/wiki/Predictions_made_by_Ray_Kurzweil">from wikipedia's bulleted list of Kurzweil's predictions</a> at the time Peter Diamandis, Kurzweil's co-founder of SingularityU, cited it to bolster the claim that Kurzweil has an 86% prediction accuracy rate. 
Off the top of my head, this misses quite a few predictions that Kurzweil made, such as life expectancy being &quot;over one hundred&quot; by 2019 and 120 by 2029 (predictions made in 1999), life expectancy being unbounded (increasing at one year per year) by 2011 (prediction made in 2001), and a computer beating the top human in chess by 2000 (prediction made in 1990).</p> <p>It's likely that Kurzweil's accuracy rate would change somewhat if we surveyed all of his predictions, but it seems extremely implausible for the rate to hit 86% and, more broadly, looking at Kurzweil's vision of what the world would be like, it also seems impossible to argue that we live in a world that's generally close to Kurzweil's imagined future.</p> <ul> <li>1985 <ul> <li>Voice activated typewriter / speech writer by 1985 (founded a company to build this in 1982) <ul> <li>No. Not true in any meaningful sense. Speech to text with deep learning, circa 2013, was accurate enough that it could be used, with major corrections, on a computer, but it would've been hopeless for a typewriter</li> </ul></li> </ul></li> <li>&quot;Early 2000s&quot; (wikipedia notes that this is listed before 2010 in Kurzweil's chronology, so this should be significantly before 2010 unless the book is very poorly organized) <ul> <li>Translating telephones allow people to speak to each other in different languages. <ul> <li>No. Today, this works poorly and translations are comically bad, but it can sort of work in a &quot;help a tourist get around&quot; sort of way with deep learning; it was basically hopeless in 2010</li> </ul></li> <li>Machines designed to transcribe speech into computer text allow deaf people to understand spoken words. <ul> <li>No. Per above, very poor in 2010</li> </ul></li> <li>Exoskeletal, robotic leg prostheses allow the paraplegic to walk. <ul> <li>No. Maybe some prototype existed, but this still isn't meaningfully deployed in 2022</li> </ul></li> <li>Telephone calls are routinely screened by intelligent answering machines that ask questions to determine the call's nature and priority. <ul> <li>Definitely not in 2010. This arguably exists in 2022, although I think it would be a stretch to call phone trees &quot;intelligent&quot; since they generally get confused if you don't do the keyword matching they're looking for</li> </ul></li> <li>&quot;Cybernetic chauffeurs&quot; can drive cars for humans and can be retrofitted into existing cars. They work by communicating with other vehicles and with sensors embedded along the roads. <ul> <li>No.</li> </ul></li> </ul></li> <li>&quot;Early 21st century&quot; (wikipedia notes that this is listed before 2010 in Kurzweil's chronology, so this should be significantly before 2010 unless the book is very poorly organized) <ul> <li>The classroom is dominated by computers. Intelligent courseware that can tailor itself to each student by recognizing their strengths and weaknesses. Media technology allows students to manipulate and interact with virtual depictions of the systems and personalities they are studying. <ul> <li>No. If you really want to make a stretch argument, you could say this about 2022, but I'd still say no for 2022</li> </ul></li> <li>A small number of highly skilled people dominates the entire production sector. Tailoring of products for individuals is common. <ul> <li>No. 
You could argue that, as written, the 2nd part of this was technically satisfied, but that was really in a trivial way compared to the futurist vision Kurzweil was predicting</li> </ul></li> <li>Drugs are designed and tested in simulations that mimic the human body. <ul> <li>No.</li> </ul></li> <li>Blind people navigate and read text using machines that can visually recognize features of their environment. <ul> <li>Not in 2010. Deep learning unlocked some of this later, though, and continues to improve</li> </ul></li> </ul></li> <li>2010 <ul> <li>PCs are capable of answering queries by accessing information wirelessly via the Internet. <ul> <li>Yes</li> </ul></li> </ul></li> <li>2009 <ul> <li>Most books will be read on screens rather than paper. <ul> <li>No</li> </ul></li> <li>Most text will be created using speech recognition technology. <ul> <li>No</li> </ul></li> <li>Intelligent roads and driverless cars will be in use, mostly on highways. <ul> <li>No</li> </ul></li> <li>People use personal computers the size of rings, pins, credit cards and books. <ul> <li>No. One of these was true (books), but the prediction is an &quot;and&quot; and not an &quot;or&quot;</li> </ul></li> <li>Personal worn computers provide monitoring of body functions, automated identity and directions for navigation. <ul> <li>No. Arguably true with things like a Garmin band some athletes wear around the chest for heart rate, but not true when the whole statement is taken into account or in the spirit of the prediction</li> </ul></li> <li>Cables are disappearing. Computer peripheries use wireless communication. <ul> <li>No. Even in 2022, cables generally haven't come close to disappearing and, unfortunately, wireless peripherals generally work poorly (<a href="https://twitter.com/garybernhardt/status/1122603722683453440">Gary Bernhardt</a>, <a href="https://www.benkuhn.net/wireless/">Ben Kuhn</a>, etc.)</li> </ul></li> <li>People can talk to their computer to give commands. <ul> <li>Yes. I would say this one is actually a &quot;no&quot; in spirit if you look at Kurzweil's futurist vision, but it was technically true that this was possible in 2009, although it worked quite poorly</li> </ul></li> <li>Computer displays built into eyeglasses for augmented reality are used <ul> <li>No. You can argue that someone, somewhere, was using these, but pilots were using head mounted displays in 1999, so it's nonsensical to argue that limited uses like that constitute a successful prediction of the future</li> </ul></li> <li>Computers can recognize their owner's face from a picture or video. <ul> <li>No</li> </ul></li> <li>Three-dimensional chips are commonly used. <ul> <li>No</li> </ul></li> <li>Sound producing speakers are being replaced with very small chip-based devices that can place high resolution sound anywhere in three-dimensional space. <ul> <li>No</li> </ul></li> <li>A $1,000 computer can perform a trillion calculations per second. <ul> <li>Undefined. Technically true, but using peak ops to measure computer performance is generally considered too silly to do by people who know much about computers. In this case, for this to merely be a bad benchmark and not worthless, the kind of calculation would have to be defined.</li> </ul></li> <li>There is increasing interest in massively parallel neural nets, genetic algorithms and other forms of &quot;chaotic&quot; or complexity theory computing. <ul> <li>No. 
There was a huge uptick in interest in neural nets in 2012 due to the &quot;AlexNet&quot; paper, but note that this prediction is an &quot;and&quot; and would've been untrue even in the &quot;or&quot; form in 2009</li> </ul></li> <li>Research has been initiated on reverse engineering the brain through both destructive and non-invasive scans. <ul> <li>Undefined. Very vague and could easily argue this either way</li> </ul></li> <li>Autonomous nanoengineered machines have been demonstrated and include their own computational controls. <ul> <li>Unknown (to me). I don't really care to try to look this one up since the accuracy rate of these predictions is so low that whether or not this one is accurate doesn't matter and I don't know where I'd look this one up</li> </ul></li> </ul></li> <li>2019 <ul> <li>The computational capacity of a $4,000 computing device (in 1999 dollars) is approximately equal to the computational capability of the human brain (20 quadrillion calculations per second). <ul> <li>Undefined. Per above prediction on computational power, raw ops per second is basically meaningless</li> </ul></li> <li>The summed computational powers of all computers is comparable to the total brainpower of the human race. <ul> <li>Undefined. First, you need a non-stupid metric to compare these by</li> </ul></li> <li>Computers are embedded everywhere in the environment (inside of furniture, jewelry, walls, clothing, etc.). <ul> <li>No. There are small computers, but this is arguing they're ubiquitously inside common household items, which they're not</li> </ul></li> <li>People experience 3-D virtual reality through glasses and contact lenses that beam images directly to their retinas (retinal display). Coupled with an auditory source (headphones), users can remotely communicate with other people and access the Internet. <ul> <li>No</li> </ul></li> <li>These special glasses and contact lenses can deliver &quot;augmented reality&quot; and &quot;virtual reality&quot; in three different ways. First, they can project &quot;heads-up-displays&quot; (HUDs) across the user's field of vision, superimposing images that stay in place in the environment regardless of the user's perspective or orientation. Second, virtual objects or people could be rendered in fixed locations by the glasses, so when the user's eyes look elsewhere, the objects appear to stay in their places. Third, the devices could block out the &quot;real&quot; world entirely and fully immerse the user in a virtual reality environment. <ul> <li>No. You need different devices for these use cases, and for the HUD use case, the field of view is small and images do not stay in place regardless of the user's perspective or orientation</li> </ul></li> <li>People communicate with their computers via two-way speech and gestures instead of with keyboards. Furthermore, most of this interaction occurs through computerized assistants with different personalities that the user can select or customize. Dealing with computers thus becomes more and more like dealing with a human being. <ul> <li>No. Some people sometimes do this, but I'd say the &quot;instead&quot; implies that speech and gestures have replaced keyboards, which they have not</li> </ul></li> <li>Most business transactions or information inquiries involve dealing with a simulated person. 
<ul> <li>No</li> </ul></li> <li>Most people own more than one PC, though the concept of what a &quot;computer&quot; is has changed considerably: Computers are no longer limited in design to laptops or CPUs contained in a large box connected to a monitor. Instead, devices with computer capabilities come in all sorts of unexpected shapes and sizes. <ul> <li>No if you literally use the definition of &quot;most people&quot; and consider a PC to be a general purpose computing device (which a smartphone arguably is), but probably yes for people at, say, 90%-ile wealth and above in the U.S. or other high-SES countries</li> </ul></li> <li>Cables connecting computers and peripherals have almost completely disappeared. <ul> <li>No</li> </ul></li> <li>Rotating computer hard drives are no longer used. <ul> <li>No</li> </ul></li> <li>Three-dimensional nanotube lattices are the dominant computing substrate. <ul> <li>No</li> </ul></li> <li>Massively parallel neural nets and genetic algorithms are in wide use. <ul> <li>No. Note the use of &quot;and&quot; here</li> </ul></li> <li>Destructive scans of the brain and noninvasive brain scans have allowed scientists to understand the brain much better. The algorithms that allow the relatively small genetic code of the brain to construct a much more complex organ are being transferred into computer neural nets. <ul> <li>No</li> </ul></li> <li>Pinhead-sized cameras are everywhere. <ul> <li>No</li> </ul></li> <li>Nanotechnology is more capable and is in use for specialized applications, yet it has not yet made it into the mainstream. &quot;Nanoengineered machines&quot; begin to be used in manufacturing. <ul> <li>Unknown (to me). I don't really care to try to look this one up since the accuracy rate of these predictions is so low that whether or not this one is accurate doesn't matter and I don't know where I'd look this one up</li> </ul></li> <li>Thin, lightweight, handheld displays with very high resolutions are the preferred means for viewing documents. The aforementioned computer eyeglasses and contact lenses are also used for this same purpose, and all download the information wirelessly. <ul> <li>No. Ironically, a lot of people prefer things like Kindles for viewing documents, but they're quite low resolution (a 2019 Kindle has a resolution of 800x600); many people still prefer paper for viewing documents for a variety of reasons</li> </ul></li> <li>Computers have made paper books and documents almost completely obsolete. <ul> <li>No</li> </ul></li> <li>Most learning is accomplished through intelligent, adaptive courseware presented by computer-simulated teachers. In the learning process, human adults fill the counselor and mentor roles instead of being academic instructors. These assistants are often not physically present, and help students remotely. Students still learn together and socialize, though this is often done remotely via computers. <ul> <li>No</li> </ul></li> <li>All students have access to computers. <ul> <li>No. True in some places, though.</li> </ul></li> <li>Most human workers spend the majority of their time acquiring new skills and knowledge. <ul> <li>No</li> </ul></li> <li>Blind people wear special glasses that interpret the real world for them through speech. Sighted people also use these glasses to amplify their own abilities. Retinal and neural implants also exist, but are in limited use because they are less useful. 
<ul> <li>No</li> </ul></li> <li>Deaf people use special glasses that convert speech into text or signs, and music into images or tactile sensations. Cochlear and other implants are also widely used. <ul> <li>Yes? I think this is actually a no in terms of whether or not Kurzweil's vision was realized, but these are possible and it isn't the case that no one was using these. I'm bundling the cochlear implant prediction in here because it's so boring. It was arguably already true when the prediction was made in 1999, and reaching the usage rate it did in 2019 basically just required the slow linear growth of the implant rate to continue, i.e., it only required that people not reject the idea of cochlear implants outright and that nothing supersede cochlear implants.</li> </ul></li> <li>People with spinal cord injuries can walk and climb steps using computer-controlled nerve stimulation and exoskeletal robotic walkers. <ul> <li>No</li> </ul></li> <li>Computers are also found inside of some humans in the form of cybernetic implants. These are most commonly used by disabled people to regain normal physical faculties (e.g. Retinal implants allow the blind to see and spinal implants coupled with mechanical legs allow the paralyzed to walk). <ul> <li>No, at least not at the ubiquity implied by Kurzweil's vision</li> </ul></li> <li>Language translating machines are of much higher quality, and are routinely used in conversations. <ul> <li>Yes, but mostly because this prediction is basically meaningless (language translation was of a &quot;much higher quality&quot; in 2019 than in 1999)</li> </ul></li> <li>Effective language technologies (natural language processing, speech recognition, speech synthesis) exist <ul> <li>Yes, although arguable</li> </ul></li> <li>Access to the Internet is completely wireless and provided by wearable or implanted computers. <ul> <li>No</li> </ul></li> <li>People are able to wirelessly access the Internet at all times from almost anywhere <ul> <li>No. This might feel true inside a big city, but is obviously <a href="https://twitter.com/danluu/status/8422681077711544320">untrue even on a road trip that stays on the U.S. interstate highway system</a> and becomes even less true if you drive away from the interstate and less true once again if you go to places that can't be driven to</li> </ul></li> <li>Devices that deliver sensations to the skin surface of their users (e.g. tight body suits and gloves) are also sometimes used in virtual reality to complete the experience. &quot;Virtual sex&quot;—in which two people are able to have sex with each other through virtual reality, or in which a human can have sex with a &quot;simulated&quot; partner that only exists on a computer—becomes a reality. Just as visual- and auditory virtual reality have come of age, haptic technology has fully matured and is completely convincing, yet requires the user to enter a V.R. booth. It is commonly used for computer sex and remote medical examinations. It is the preferred sexual medium since it is safe and enhances the experience. <ul> <li>No</li> </ul></li> <li>Worldwide economic growth has continued. There has not been a global economic collapse. <ul> <li>Yes</li> </ul></li> <li>The vast majority of business interactions occur between humans and simulated retailers, or between a human's virtual personal assistant and a simulated retailer. <ul> <li>No? Depends on what &quot;simulated retailers&quot; means here. 
In conjunction with how Kurzweil talks about simulations, VR, haptic devices that are fully immersive, etc., I'd say this is a &quot;no&quot;</li> </ul></li> <li>Household robots are ubiquitous and reliable <ul> <li>No</li> </ul></li> <li>Computers do most of the vehicle driving—-humans are in fact prohibited from driving on highways unassisted. Furthermore, when humans do take over the wheel, the onboard computer system constantly monitors their actions and takes control whenever the human drives recklessly. As a result, there are very few transportation accidents. <ul> <li>No</li> </ul></li> <li>Most roads now have automated driving systems—networks of monitoring and communication devices that allow computer-controlled automobiles to safely navigate. <ul> <li>No</li> </ul></li> <li>Prototype personal flying vehicles using microflaps exist. They are also primarily computer-controlled. <ul> <li>Unknown (to me). I don't really care to try to look this one up since the accuracy rate of these predictions is so low that whether or not this one is accurate doesn't matter and I don't know where I'd look this one up</li> </ul></li> <li>Humans are beginning to have deep relationships with automated personalities, which hold some advantages over human partners. The depth of some computer personalities convinces some people that they should be accorded more rights <ul> <li>No</li> </ul></li> <li>A growing number of humans believe that their computers and the simulated personalities they interact with are intelligent to the point of human-level consciousness, experts dismiss the possibility that any could pass the Turing Test. Human-robot relationships begin as simulated personalities become more convincing. <ul> <li>No</li> </ul></li> <li>Interaction with virtual personalities becomes a primary interface <ul> <li>No? Depends on what &quot;primary interface&quot; means here, but I think not given Kurzweil's overall vision</li> </ul></li> <li>Public places and workplaces are ubiquitously monitored to prevent violence and all actions are recorded permanently. Personal privacy is a major political issue, and some people protect themselves with unbreakable computer codes. <ul> <li>No. True of some public spaces in some countries, but untrue as stated.</li> </ul></li> <li>The basic needs of the underclass are met <ul> <li>No. Not even true when looking at some high-SES countries, like the U.S., let alone the entire world</li> </ul></li> <li>Virtual artists—creative computers capable of making their own art and music—emerge in all fields of the arts. <ul> <li>No. Maybe arguably technically true, but I think not even close in spirit in 2019</li> </ul></li> </ul></li> </ul> <p>The list above only uses the bulleted predictions from Wikipedia under the section that has per-timeframe sections. If you pull in other ones from the same page that could be evaluated, which includes predictions like &quot; &quot;nanotechnology-based&quot; flying cars would be available [by 2026]&quot;, this doesn't hugely change the accuracy rate (and actually can't due to the relatively small number of other predictions).</p> <h4 id="jacque-fresco-1">Jacque Fresco</h4> <p>The foreword to Fresco's book gives a pretty good idea of what to expect from Fresco's predictions:</p> <blockquote> <p>Looking forward is an imaginative and fascinating book in which the authors take you on a journey into the culture and technology of the twenty-first century. 
After an introductory section that discusses the &quot;Things that Shape Your Future,&quot; you will explore the whys and wherefores of the unfamiliar, alarming, but exciting world of a hundred years from now. You will see this society through the eyes of Scott and Hella, a couple of the next century. Their living quarters are equipped with a cybernator, a seemingly magical computer device, but one that is based on scientific principles now known. It regulates sleeping hours, communications throughout the world, an incredible underwater living complex, and even the daily caloric intake of the &quot;young&quot; couple. (They are in their forties but can expect to live 200 years.) The world that Scott and Hella live in is a world that has achieved full weather control, has developed a finger-sized computer that is implanted in the brain of every baby at birth (and the babies are scientifically incubated; the women of the twenty-first century need not go through the pains of childbirth), and that has perfected genetic manipulation that allows the human race to be improved by means of science. Economically, the world is Utopian by our standards. Jobs, wages, and money have long since been phased out. Nothing has a price tag, and personal possessions are not needed. Nationalism has been surpassed, and total disarmament has been achieved; educational technology has made schools and teachers obsolete. The children learn by doing, and are independent in this friendly world by the time they are five.</p> <p>The chief source of this greater society is the Correlation Center, &quot;Corcen,&quot; a gigantic complex of computers that serves but never enslaves mankind. Corcen regulates production, communication, transportation and all other burdensome and monotonous tasks of the past. This frees men and women to achieve creative challenging experiences rather than empty lives of meaningless leisure. Obviously this book is speculative, but it is soundly based upon scientific developments that are now known.</p> </blockquote> <p>As mentioned above, Fresco makes the claim that it's possible to predict the future and that, to do so, one should start with the values people will have in the future. Many predictions are about &quot;the 21st century&quot;, so they can arguably be defended as still potentially accurate, although, given the way the book talks about the stark divide between &quot;the 20th century&quot; and &quot;the 21st century&quot;, we should have already seen the changes mentioned in the book since we're no longer in &quot;the 20th century&quot; and the book makes no reference to a long period of transition in between. Fresco does make some specific statements about things that will happen by particular dates, which are covered later. For &quot;the 21st century&quot;, his predictions from the first section of his book are:</p> <ul> <li>There will be no need for laws, such as a law against murder, because humans will no longer do things like murder (which only happen &quot;today&quot; because &quot;our sick society&quot; conditions people to commit depraved acts) <ul> <li>&quot;Today we are beginning to identify various things which condition us to act as we do. 
In the future the factors that condition human beings to kill or do other things that harm fellow human beings will be understood and eliminated&quot; <ul> <li>The entire section is very behaviorist and assumes that we'll be able to operant-condition people out of all bad behaviors</li> </ul></li> </ul></li> <li>Increased understanding of human nature will lead to <ul> <li>Total freedom, including no individual desire for conformity</li> <li>Total economic abundance, which will lead to the end of &quot;competitiveness, acquisitiveness, thriftiness&quot;, etc.</li> <li>Total freedom from disease</li> <li>Deeper feelings of love and friendship to an extent that can not be understood by those who live in the twentieth-century world of scarcity</li> <li>Total lack of guilt about sex</li> <li>Appreciation of all kinds of natural beauty, as opposed to &quot;the narrow standards of the 'beauty queen' mentality of today&quot;, as well as eschewing any kind of artificial beauty</li> <li>Complete self-knowledge and a lack of any repression, which will &quot;produce a new dimension of relaxed living that is almost unknown today&quot;</li> <li>Elevation of the valuing of others to the same level at which people value themselves or their local communities, i.e., complete selflessness and an end to anything resembling tribalism or nationalism</li> <li>All people will be &quot;multidimensional&quot; and sort of good at everything</li> <li>This is contrasted with the twentieth century: &quot;For the first time all men and women will live a multidimensional life, limited only by their imagination. In the twentieth century we could classify people by saying, &quot;He is good in sports. She is an intellectual. He is an artist.&quot; In the future all people will have the time and the facilities to accept the fantastic variety of challenges that life offers them&quot;</li> </ul></li> </ul> <p>As mentioned above, the next part of Fresco's prediction is about how science will work. He writes about how &quot;the scientific method&quot; was only applied in a limited fashion, which led to thousands of years of slow progress. But, unlike in the 20th century, in the 21st century, people will be free from bias and apply &quot;the scientific method&quot; in all areas of their life, not just when doing science. People will be fully open to experimentation in all aspects of life and all people will have &quot;a habitual open-mindedness coupled with a rigid insistence that all problems be formulated in a way that permits factual checking&quot;.</p> <p>This will, among other things, lead to complete self-knowledge of one's own limitations for all people as well as an end to unhappiness due to suboptimal political and social structures:</p> <blockquote> <p>The success of the method of science in solving almost every problem put to it will give individuals in the twenty-first century a deep confidence in its effectiveness. They will not be afraid to experiment with new ways of feeling, thinking, and acting, for they will have observed the self-corrective aspect of science. Science gives us the latest word, not the last word. They will know that if they try something new in personal or social life, the happiness it yields can be determined after sufficient experience has accumulated. They will adapt to changes in a relaxed way as they zigzag toward the achievement of their values. They will know that there are better ways of doing things than have been used in the past, and they will be determined to experiment until they have found them. 
They will know that most of the unhappiness of human beings in the mid-twentieth century was not due to the lack of shiny new gadgets; it was due, in part, to not using the scientific method to check out new political and social structures that could have yielded greater happiness for them</p> </blockquote> <p>After discussing, at a high level, the implications for people and society, Fresco gets into specifics, saying that doing everything with computers, what he calls a &quot;cybernated&quot; society, could be achieved by 1979, giving everyone a post-tax income of $100k/yr in 1969 dollars (about $800k/yr in 2022 dollars):</p> <blockquote> <p>How would you like to have a guaranteed life income of $100,000 per year—with no taxes? And how would you like to earn this income by working a three-hour day, one day per week, for a five-year period of your life, providing you have a six-months vacation each year? Sound fantastic? Not at all with modern technology. This is not twenty-first-century pie-in-the-sky. It could probably be achieved in ten years in the United States if we applied everything we now know about automation and computers to produce a cybernated society. It probably won't be done this rapidly, for it would take some modern thinking applied in an intelligent crash program. Such a crash program was launched to develop the atomic bomb in a little over four years.</p> </blockquote> <p>Other predictions about &quot;cybernation&quot;:</p> <ul> <li>Manufacturing will be fully automated, to the point that people need to do no more than turn on the factory to have everything run (and maintain itself) <ul> <li>This will lead to &quot;maximum efficiency&quot;</li> </ul></li> <li>Since there will be no need for human labor, the price of items like t-shirts will be so low that they'll effectively be free; there's no need for items to cost anything once the element of human labor is removed</li> <li>The elimination of human labor will lead to a life of leisure for everyone</li> <li>Fresco notes that his previous figure of $100k/yr (1969 dollars) is meaningless and could just as easily be $1M/yr (1969 dollars) since everything will be free</li> <li>A &quot;cybernetically&quot; manufactured item produced anywhere on earth will be able to be delivered anywhere on earth within 24 hours</li> </ul> <h4 id="michio-kaku-1">Michio Kaku</h4> <ul> <li>By 2005 <ul> <li>&quot;The complete human genome will be decoded by the year 2005, giving us an “owner’s manual” for a human being&quot; <ul> <li>Half credit. Actually, technically no, as <a href="https://en.wikipedia.org/wiki/Human_Genome_Project">the Human Genome Project was declared complete in 2003, but had only decoded 85% of the genome. 
Actually decoding the human genome took until January 2022</a>; I'll give this half credit since many people would argue that the declared completion of the Human Genome Project should mean this prediction was correct</li> </ul></li> </ul></li> <li>&quot;During the 21st century&quot; <abbr title="not scored because the timeline for these hasn't passed, although it's implied that these should already be happening to some extent, so these could be scored based on whether or not you think they're happening">implied to not be something that happens at the very end, but something that's happening throughout</abbr> <ul> <li>&quot;it will be difficult to be a research scientist in the future without having some working knowledge of [quantum mechanics, computer science, and biology]&quot; due to increasing &quot;synergy&quot; and &quot;cross-fertilization&quot; between these fundamental fields</li> <li>Silicon computer chips will hit a roadblock that will be unlocked via DNA research allowing for computation on organic molecules</li> <li>Increased pace of scientific progress due to &quot;intense synergy&quot;</li> </ul></li> <li>In 2020 <ul> <li>Commodity prices down 60% (from 1997 prices) due to wealth becoming based on knowledge, trade being global, and markets being linked electronically, continuing a long-term trend of reduced commodity prices <ul> <li>No. The CRB commodity price index was up in 2020 compared to 1997 and is up further in 2022</li> </ul></li> <li>Microprocessors as cheap as &quot;scrap paper&quot; due to Moore's law scaling continuing with no speedbump until 2020 (10 cents in 2000, 1 cent in 2010, 1/10th of a cent in 2020) <ul> <li>No. The Moore's law scaling curve changed and microprocessors did not, in general, cost 1 cent in 2010 or 1/10th of a cent in 2020</li> </ul></li> <li>The above &quot;will give us smart homes, cars, TVs, clothes, jewelry, and money&quot; <ul> <li>No, due to the &quot;and&quot; and comments implying total ubiquity, but actually a fairly good directional prediction</li> </ul></li> <li>&quot;We will speak to our appliances, and they will speak back&quot; <ul> <li>No, due to the implied ubiquity here, but again directionally pretty good</li> </ul></li> <li>&quot;the Internet will wire up the entire planet and evolve into a membrane consisting of millions of computer networks, creating an “intelligent planet.”&quot; <ul> <li>No, due to &quot;intelligent planet&quot;</li> </ul></li> <li>Moore's law / silicon scaling will continue until 2020, at which point &quot;quantum effects will necessarily dominate and the fabled Age of Silicon will end&quot; <ul> <li>No</li> </ul></li> <li>Advances in DNA sequencing will continue until roughly 2020 (before it stops); &quot;literally thousands of organisms will have their complete DNA code unraveled&quot; <ul> <li>Maybe? Not sure if this was hundreds or thousands; also, the lack of complete sequencing of the human genome when the Human Genome Project was &quot;complete&quot; may also have some analogue here? 
I didn't score this one because I don't have the background for it</li> </ul></li> <li>&quot;it may be possible for anyone on earth to have their personal DNA code stored on a CD&quot; <ul> <li>Not counting this as a prediction because it's non-falsifiable due to the use of &quot;may&quot;</li> </ul></li> <li>&quot;Many genetic diseases will be eliminated by injecting people’s cells with the correct gene.&quot; <ul> <li>No</li> </ul></li> <li>&quot;Because cancer is now being revealed to be a series of genetic mutations, large classes of cancers may be curable at last, without invasive surgery or chemotherapy&quot; <ul> <li>Not counting this as a prediction because it's non-falsifiable due to the use of &quot;may&quot;</li> </ul></li> <li>In or near 2020, bottlenecks in DNA sequencing will stop the progress of DNA sequencing <ul> <li>No</li> </ul></li> <li>In or near 2020, bottlenecks in silicon will stop advances in computer performance <ul> <li>No; computer performance improvements slowed long before 2020 and then didn't stop in 2020</li> </ul></li> <li>The combination of the two above will (after 2020) require optical computers, molecular computers, DNA computers, and quantum computers for progress to advance in biology and computer science <ul> <li>No. Maybe some of these things will be critical in the future, but they're not necessary conditions for advancements in computing and biology in or around 2020</li> </ul></li> <li>Focus of biology will shift from sequencing DNA to understanding the functions of genes <ul> <li>I'm not qualified to judge this one</li> </ul></li> <li>something something may prove the key to solving key diseases <ul> <li>Not counting this as a prediction because it's non-falsifiable due to the use of &quot;may&quot;</li> </ul></li> <li>[many predictions based around the previous prediction that microprocessors would be as cheap as scrap paper, 1/10th of a cent or less, that also ignore the cost of everything around the processor] <ul> <li>No; collapsing these into one bullet reduces the number of incorrect predictions counted, but that shouldn't make too much difference in this case</li> </ul></li> <li>A variety of non-falsifiable &quot;may&quot; predictions about self-driving car progress by 2010 and 2020</li> <li>VR will be &quot;an integral part of the world&quot; <ul> <li>No</li> </ul></li> <li>People will use full-body suits and electric-field sensors <ul> <li>No</li> </ul></li> <li>Exploring simulations in virtual reality will be a critical part of how science proceeds <ul> <li>No</li> </ul></li> <li>A lot of predictions about how computers &quot;may&quot; be critical to a variety of fields <ul> <li>Not counting this as a prediction because it's non-falsifiable due to the use of &quot;may&quot;</li> </ul></li> <li>Semiconductor lithography below .1 um (100 nm) will need to switch from UV to X-rays or electrons <ul> <li>No; modern 5nm processes use EUV</li> </ul></li> <li>Some more &quot;may&quot; and &quot;likely&quot; non-falsifiable predictions</li> </ul></li> </ul> <p>That gives a correct prediction rate of 3%. I stopped reading at this point, so I may have missed a number of correct predictions. But, even if the rest of the book was full of correct predictions, the correct prediction rate is likely to be low.</p> <p>There were also a variety of predictions that I didn't include because they were statements that were true in the present. 
For example:</p> <blockquote> <p>If the dirt road of the Internet is made up of copper wires, then the paved information highway will probably be made of laser fiber optics. Lasers are the perfect quantum device, an instrument which creates beams of coherent light (light beams which vibrate in exact synchronization with each other). This exotic form of light, which does not occur naturally in the universe, is made possible by manipulating the electrons making quantum jumps between orbits within an atom</p> </blockquote> <p>This doesn't seem like much of a prediction since, when the book was written, the &quot;information highway&quot; already used a lot of fiber. Throughout the book, there's a lot of mysticism around quantum-ness, which is, for example, on display above and cited as a reason that microprocessors will become obsolete by 2020 (they're not &quot;quantum&quot;) and fiber optics won't (it's quantum).</p> <h4 id="john-naisbitt-1">John Naisbitt</h4> <p>Here are a few quotes that get at the methodology of Naisbitt's hit book, Megatrends:</p> <blockquote> <p>For the past fifteen years, I have been working with major American corporations to try to understand what is really happening in the United States by monitoring local events and behavior, because collectively what is going on locally is what is going on in America.</p> <p>Despite the conceits of New York and Washington, almost nothing starts there.</p> <p>In the course of my work, I have been overwhelmingly impressed with the extent to which America is a bottom-up society, that is, where new trends and ideas begin in cities and local communities—for example, Tampa, Hartford, San Diego, Seattle, and Denver, not New York City or Washington, D.C. My colleagues and I have studied this great country by reading its local newspapers. We have discovered that trends are generated from the bottom up, fads from the top down. The findings in this book are based on an analysis of more than 2 million local articles about local events in the cities and towns of this country during a twelve-year period.</p> <p>Out of such highly localized data bases, I have watched the general outlines of a new society slowly emerge.</p> <p>We learn about this society through a method called content analysis, which has its roots in World War II. During that war, intelligence experts sought to find a method for obtaining the kinds of information on enemy nations that public opinion polls would have normally provided.</p> <p>Under the leadership of Paul Lazarsfeld and Harold Lasswell, later to become well-known communication theorists, it was decided that we would do an analysis of the content of the German newspapers, which we could get—although some days after publication. The strain on Germany's people, industry, and economy began to show up in its newspapers, even though information about the country's supplies, production, transportation, and food situation remained secret. Over time, it was possible to piece together what was going on in Germany and to figure out whether conditions were improving or deteriorating by carefully tracking local stories about factory openings, closings, and production targets, about train arrivals, departures, and delays, and so on. ... Although this method of monitoring public behavior and events continues to be the choice of the intelligence community—the United States annually spends millions of dollars in newspaper content analysis in various parts of the world—it has rarely been applied commercially. In fact, 
The Naisbitt Group is the first, and presently the only, organization to utilize this approach in analyzing our society.</p> <p>Why are we so confident that content analysis is an effective way to monitor social change? Simply stated, because the news hole in a newspaper is a closed system. For economic reasons, the amount of space devoted to news in a newspaper does not change significantly over time. So, when something new is introduced, something else or a combination of things must be omitted. You cannot add unless you subtract. It is the principle of forced choice in a closed system.</p> <p>In this forced-choice situation, societies add new preoccupations and forget old ones. In keeping track of the ones that are added and the ones that are given up, we are in a sense measuring the changing share of the market that competing societal concerns command.</p> <p>Evidently, societies are like human beings. A person can keep only so many problems and concerns in his or her head or heart at any one time. If new problems or concerns are introduced, some existing ones are given up. All of this is reflected in the collective news hole that becomes a mechanical representation of society sorting out its priorities.</p> </blockquote> <p>Naisbitt rarely makes falsifiable predictions. For example, on the &quot;information society&quot;, Naisbitt says</p> <blockquote> <p>In our new information society, the time orientation is to the future. This is one of the reasons we are so interested in it. We must now learn from the present how to anticipate the future. When we can do that, we will understand that a trend is not destiny; we will be able to learn from the future the way we have been learning from the past.</p> <p>This change in time orientation accounts for the growing popular and professional interest in the future during the 1970s. For example, the number of universities offering some type of futures-oriented degree has increased from 2 in 1969 to over 45 in 1978. Membership in the World Future Society grew from 200 in 1967 to well over 30,000 in 1982, and the number of popular and professional periodicals devoted to understanding or studying the future has dramatically increased from 12 in 1965 to more than 122 in 1978.</p> </blockquote> <p>This could be summed up as &quot;in the future, people will think more about the future&quot;. Pretty much any case one might make that Naisbitt's claims ended up being true or false could be argued against.</p> <p>In the chapter on the &quot;information society&quot;, one of the most specific predictions is</p> <blockquote> <p>New information technologies will at first be applied to old industrial tasks, then, gradually, give birth to new activities, processes, and products.</p> </blockquote> <p>I'd say that this is false in the general case, but it's vague enough that you could argue it's true.</p> <p>A rare, falsifiable comment is this prediction about the price of computers</p> <blockquote> <p>The home computer explosion is upon us, soon to be followed by a software implosion to fuel it. It is projected that by the year 2000, the cost of a home computer system (computer, printer, monitor, modem, and so forth) should only be about that of the present telephone-radio-recorder-television system.</p> </blockquote> <p>From a quick search, it seems that reference devices cost something like $300 in 1982?
That would be $535 in 2000, which wasn't really a reasonable price for a computer as well as the peripherals mentioned and implied by &quot;and so forth&quot;.</p> <h3 id="gerard-k-o-neill-1">Gerard K. O'Neill</h3> <p>We discussed O'Neill's predictions on space colonization in the body of this post. This section contains a bit on his other predictions.</p> <p>On computers, O'Neill says that in 2081 &quot;any major central computer will have rapid access to at least a hundred million million words of memory (the number '1' followed by 14 zeros). A computer of that memory will be no larger than a suitcase. It will be fast enough to carry out a complete operation in no more time than it takes light to travel from this page to your eye, and perhaps a tenth of that time&quot;, which is saying that a machine will have 100TWords of RAM or, to round things up simply, let's say 1PB of RAM and a clock speed of something between 300 MHz and 6 GHz, depending on how far away from your face you hold a book.</p>
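<p>To spell out the arithmetic behind those numbers (my own back-of-the-envelope; O'Neill gives neither a word size nor a reading distance, so both are assumed here):</p> <pre><code># O'Neill's machine: 10**14 words of memory and one operation in the time
# light takes to travel from a page to your eye (or a tenth of that time).
words = 10**14
bytes_per_word = 8             # assumed 64-bit words; O'Neill doesn't give a word size
print(words * bytes_per_word)  # 8e14 bytes, which rounds up to roughly 1 PB

c = 3e8                        # speed of light in m/s
for meters in (1.0, 0.5):      # plausible page-to-eye distances
    ns_per_op = meters / c * 1e9
    # operations per nanosecond is GHz: roughly 0.3-0.6 GHz, or 3-6 GHz at a tenth of the time
    print(1 / ns_per_op, 10 / ns_per_op)
</code></pre>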
<p>On other topics, O'Neill predicts we'll have fully automated manufacturing, people will use 6 times as much energy per capita in 2081 as in 1980, pollution other than carbon dioxide will be a solved problem, coal plants will still be used, most (50% to 95%) of energy will be renewable (with the caveat that &quot;ground-based solar&quot; is a &quot;myth&quot; that can never work, and that wind, tide, and hydro are all forms of solar that, even with geothermal thrown in, can't reasonably provide enough energy), and that solar power from satellites is the answer to then-current and future energy needs.</p> <p>In The Technology Edge, O'Neill makes predictions for the 10 years following the book's publication in 1983. O'Neill says &quot;the book is primarily based on interviews with chief executives&quot;. It was written at a time when many Americans were concerned about the impending Japanese dominance of the world. O'Neill says</p> <blockquote> <p>As an American, I cannot help being angry — not at the Japanese for succeeding, but at the forces of timidity, shortsightedness, greed, laziness and misdirection here in America that have mired us down so badly in recent years, sapped our strength and kept us from equal achievements.</p> <p>As we will see, opportunities exist now for the opening of whole new industries that can become even greater than those we have lost to the Japanese. Are we to delay and lose those too?</p> </blockquote> <p>In an interview about the book, O'Neill's warning was summarized as</p> <blockquote> <p>microengineering, robotics, genetic engineering, magnetic flight, family aircraft, and space science. If the U.S. does not compete successfully in these areas, he warns, it will lose the technological and economic leadership it has enjoyed.</p> </blockquote> <p>This seems like a big miss with both serious false positives as well as false negatives. O'Neill failed to cite industries that ended up being important to the then-continued U.S. dominance of the world economy, e.g., software, and also predicted that space and flight were much more important than they turned out to be.</p> <p>On the specific mechanism, O'Neill also generally misses, e.g., in the book, O'Neill cites the lack of U.S. PhD production and people heading directly into industry as a reason the U.S. was falling behind and would continue to fall behind Japan, but in a number of important industries, like software, a lot of the major economic/business contributions have been made by people going to industry without a PhD. The U.S. didn't need to massively increase PhD production in the decades following 1983 to stay economically competitive.</p> <p>There's quite a bit of text dedicated to a commonly discussed phenomenon at the time, how Japanese companies were going to wipe the floor with American and European companies because they were able to make and execute long-term plans, unlike American companies. I'll admit that it's a bit of a mystery to me how well short-term thinking has worked for American companies, at least to date, and I would've expected otherwise.</p> <h4 id="patrick-dixon-1">Patrick Dixon</h4> <p>Dixon opens with:</p> <blockquote> <p>The next millennium will witness the greatest challenges to human survival ever in human history, and many of them will face us in the early years of its first century ...</p> <p>The future has six faces, each of which will have a dramatic effect on all of us in the third millennium ... [Fast, Urban, Radical, Universal, Tribal, Ethical, which spells out FUTURE]</p> <p>Out of these six faces cascade over 500 key expectations, specific predictions as logical workings-out of these important global trends. These range from inevitable to high probability to lower probability — but still significant enough to require strategic planning and personal preparation.</p> </blockquote> <ul> <li>In the third millennium, things reminiscent of the previous millennium will be outdated by [variously, 2004, 2005, 2020, 2025], e.g., &quot;the real winners will be those who tap into this huge shift — and help define it. What television producer will want to produce second millennial TV? What clothes designer dare risk his annual collection being labeled as a rehash of tired late twentieth-century fashions? ...&quot; <ul> <li>No, late 20th century fashion is very &quot;in&quot; right now and other 20th century fashions were &quot;in&quot; a decade ago</li> </ul></li> <li>&quot;Pre-millennialists tend to see 2000 to 2010 as just another decade. The trends of the eighties and nineties continue, just more of the same. Post-millennialists are very different. They are products of the third millennium. They live in it. They are twenty-first century people, a new age. Expect to see one of the greatest generation gaps in recent history&quot; <ul> <li>Subjective, but no. Dixon assigns huge importance to the millennium counter turning over and says things like &quot;Few people have woken up so far to the impact of the millennium. My children are the M generation. Their entire adult existence will be lived in the third millennium ... Expect to see the M factor affect every aspect of life on earth ... The human brain makes sense of the past by dividing it into intervals: the day... month... year. Then there are decades and centuries ... And four time-events are about to hit us in the same instant.
New year, decade, century, and millennium&quot;, but the counter turning over doesn't appear to have caused any particularly drastic changes.</li> </ul></li> <li>&quot;Expect to see millennial culture clashes between opposing trends, a world increasingly of extremes with tendencies to intolerance as groups fight to dominate the future&quot; <ul> <li>Basically yes, although his stated reasoning (not quoted) as to why this should happen at the turn of the century (as opposed to any other time) is nonsensical since it applies equally to all of history.</li> </ul></li> <li>Market dominance / power will become less important as &quot;micromarkets&quot; become more important <ul> <li>No; the bit about smaller markets existing was correct, but huge players, the big $1T companies of what Dixon calls &quot;the third millennium&quot;, Apple, Microsoft, Google, and Amazon, have a huge amount of power over these markets, which has not reduced either the economic or cultural importance of what Dixon calls &quot;dominance&quot;</li> </ul></li> <li>Expect more &quot;wild cards&quot; over &quot;the next 20 years&quot; [from 1998 to 2018], such as &quot;war, nuclear accident or the unplanned launch of nuclear weapons, vast volcanic eruptions or plagues or even a comet collision with enormous destructive power&quot; <ul> <li>No; this would've sounded much better if the window had included covid, but if we look at the 20 years prior to the book being published, there was the fall of the Soviet Union, Tiananmen Square, etc., which isn't obviously less &quot;wild card-y&quot; than what we saw from 1998 to 2018</li> </ul></li> <li>Less emphasis on economic growth, due to increased understanding that wealth doesn't make people happy <ul> <li>No; Dixon was writing not too long after peak &quot;growth is unsustainable and should be deliberately curtailed to benefit humanity&quot;</li> </ul></li> </ul> <p>That's the end of the introduction. Some of these predictions are arguably too early to call since, in places, Dixon writes as if Futurewise is about the entire &quot;third millennium&quot;, but Dixon also notes that drastic changes are expected in the first years and decades of the 21st century and these generally have not come to pass, both in the specific cases where Dixon calls out particular timelines and in the cases where he doesn't name a particular timeline. In general, I'm trying to only include predictions where it seems that Dixon is referring to the 2022 timeframe or before, but his general vagueness makes it difficult to make the right call 100% of the time.</p> <p>The next chapter is titled &quot;Fast&quot; and is about the first of the six &quot;faces&quot; of the future.</p> <ul> <li>&quot;Expect further rapid realignments [analogous to the fall of the Soviet Union], with North Korea at the top of the list as the last outpost of Stalinism ...
North Korea could crash at any moment, spilling thousands of starving refugees into China, South Korea, and Japan&quot; <ul> <li>No; there's been significant political upheaval in many places (Thailand, Arab Spring, Sudan, etc.); North Korea hasn't been in the top 10 political upheavals list, let alone at the top of the list</li> </ul></li> <li>&quot;Expect increasing North-South tension as emerging economies come to realize that abolishing all trade and currency restrictions in a rush for growth also places their countries at the mercy of rumors, hunches, and market opinion&quot; <ul> <li>No to there being a particular increase in North-South tension</li> </ul></li> <li>&quot;Expect a growing backlash against globalisation, with some nations reduced to &quot;economic slavery&quot; by massive, destabilising, currency flows <ul> <li>No, due to the second part of this sentence, although highly subjective</li> </ul></li> <li>[A bunch of unscored predictions that are gimmes about vague things continuing to happen, such as &quot;expect large institutions to continue to make (and lose) huge fortunes trying to outguess volatile markets in these countries&quot;] <ul> <li>On the example prediction, that's quite vague and could be argued either way on the spirit of the prediction, but is very easy to satisfy as stated since it only requires (for example) two hedge funds to make major bets on volatility that either win or lose; there's a list of similar &quot;predictions&quot; that seem extremely easy to satisfy as written that I'm not going to include</li> </ul></li> <li>&quot;Expect increasingly complex investment instruments to be developed, so that a commodity [from the context, this is clearly referring to actual commodities markets and not things like mortgages] sometimes rises or falls dramatically as a large market intervention is made, linked to a completely different and apparently unrelated event <ul> <li>Yes, although this trend was definitely already happening and well-known when Dixon wrote his book, making this a very boring prediction</li> </ul></li> <li>&quot;Management theory is still immature ... expect that to change over the next two decades as rigorous statistical and analytical tools are devised to prove or disprove the key elements of success in management methods&quot; <ul> <li>No; drastically underestimates the difficulty of rigorously quantifying the impact of different management methods in a way that only someone who hasn't done <a href="https://twitter.com/danluu/status/1547661623770329088">serious data analysis</a> would do</li> </ul></li> <li>[seem to have lost a line here; sorry!] <ul> <li>Yes, although this statement would be more compelling with less stated detail</li> </ul></li> <li>&quot;Expect 'management historians' to become sought after, analyzing industrial successes and failures during the previous Industrial Revolution and at the turn of the twentieth century <ul> <li>No; some people do this kind of work, but they're not particularly sought after. The context of the statement implies they'd be sought after by CEOs or other people trying to understand how to run actual businesses, which is generally not the case</li> </ul></li> <li>&quot;Expect consumer surveys and market research to be sidelined by futurology-based customer profiles. Market research only tells you what people want today.
What's so smart about that...&quot; <ul> <li>No; not that people don't try to predict trends, but the context for this prediction incorrectly implies that market research is trivial &quot;anyone can go out and ask the same questions, so where's the real competitive edge?&quot;, that in the computerized world, brands are irrelevant, etc., all of which are incorrect, and of course the simple statement that market research and present-day measurement are obsolete is simply wrong.</li> </ul></li> <li>Flat-rate global &quot;calls&quot; with no long-distance charges <ul> <li>Yes as written since you can call people anywhere with quite a few apps, so I'll give Dixon this one, although the context implies that his reasoning was totally incorrect. For one thing, he seems to be talking about phone calls and thinks traditional phone calls will be important, but he also makes some incorrect statements about telecom cost structures, such as &quot;measuring the time and distance of every call is so expensive as a proportion of total call costs&quot; (which was predicted to happen because the cost of calls themselves would fall, causing the cost of metadata tracking of calls to dominate the cost of the calls themselves; even if that came to pass, the cost of tracking how long a call was and where the call was to would be tiny and, in fact, my phone bill still tracks this information even though I'm not charged for it because the cost is so small that it would be absurd not to track it, other than for privacy reasons)</li> </ul></li> <li>&quot;Expect most households in wealthy nations to have several phone numbers by 2005 ... this means that most executives will have access to far more telephone lines at home than they do at work today for their personal use&quot; <ul> <li>No; there's a way to read this as some kind of prediction that was correct, but from the context, Dixon is clearly talking about people having a lot of phone numbers and phone lines and makes a statement elsewhere that implies explosive growth in the number of landline phone numbers and lines people will have at home</li> </ul></li> <li>Mobile phones used in most places landline phones are used today <ul> <li>Yes; basically totally on the nose, although he has a story about a predicted future situation that isn't right due to some incorrect guesses about how interfaces would play out</li> </ul></li> <li>Many emerging economies will go straight to mobile and leapfrog existing technology <ul> <li>Yes</li> </ul></li> <li>Ubiquitous use of satellite phones by traveling execs / very important people by 2005 <ul> <li>No; many execs, VPs, etc., were still impacted by incomplete cell coverage and didn't have sat phones in 2005</li> </ul></li> <li>&quot;The next decade&quot; [by 2008], cell phones will seamlessly switch to satellite coverage when necessary <ul> <li>No</li> </ul></li> <li>Phone trees will have switched from &quot;much hated push-button systems to voice recognition&quot; by 2002, with seamless basically perfect recognition by 2005 <ul> <li>No; these systems are now commonplace in 2022, but many people I know find them to be significantly worse than push-button systems</li> </ul></li> <li>Computational power per &quot;PC&quot; will continue to double every 18 months indefinitely [there's a statement that implies this will continue at least through 2018, but there's no implication that this will end or level off at any time after that] <ul> <li>No; even at the time, people had already observed that performance scaling was moving to a slower growth
curve</li> </ul></li> <li>Future small displays will be able to be magnified <ul> <li>No, or not yet anyway (if the prediction means that software zoom will be possible, that was possible and even built into operating systems well before the book was published, so that's not really a prediction about the future)</li> </ul></li> <li>&quot;Paper-thin display sheets by 2005&quot; <ul> <li>No</li> </ul></li> <li>Projection displays will be in common use, replacing many uses of CRTs <ul> <li>No; projectors are used today, but in many of the same applications they were used in at the time the book was written</li> </ul></li> <li>Many CRT use cases will be replaced by lasers projected onto the retina <ul> <li>No, or not yet anyway; even if this happens at some point, I would rate this as a no since this section was about what would kill the CRT and this technology was not instrumental in killing the CRT</li> </ul></li> <li>Digital cameras rival film cameras in terms of image quality by 2020 <ul> <li>Yes; technically yes as written, but the way this is written implies that digital cameras will just have caught up to film cameras in 2020 when this happened quite a long time ago, so I'd say that Dixon was wrong but made this prediction vague enough that it just happens to be correct as written</li> </ul></li> <li>For consumer use, digital cameras replace 35mm film by 2010 <ul> <li>Yes; but same issue as above where Dixon really underestimated how quickly digital cameras would improve</li> </ul></li> <li>&quot;Ultra high definition TV cameras&quot; replace film &quot;in most situations&quot; by 2005 <ul> <li>Yes</li> </ul></li> <li>Software will always be buggy because new chips will be released at a pace that means that programmers can't keep up with bug fixes because they need to re-write the software for new chips. <ul> <li>Yes, although the reason was completely wrong. It's obviously true that software bugginess will continue for quite some time, but not for the reason Dixon gives. I'm going to include more of Dixon's text here since a lot of readers are programmers who will have opinions on why computers are buggy and will be able to directly evaluate Dixon's reasoning with no additional context: &quot;Software will always be full of bugs. Desktop computers today are so powerful that even if technology stands still it will take the world's programmers at least 20 years to exploit their capability to the full. The trouble is that they have less than 20 months – because by then a new generation of machines will be around ... So brand new code was written for Pentium chips. The bugs were never sorted out in the old versions and bugs in the new ones will never be either, for the same reason&quot;. <ul> <li>Dixon's reasoning as to why software is buggy is completely wrong. It is not because Intel releases a new chip and programmers have to abandon their old code and write code for the new chip.
This level of incorrectness of reasoning generally holds for Dixon's comments even when he really nails a prediction and doesn't include some kind of &quot;because&quot; that invalidates the prediction</li> </ul></li> </ul></li> <li>Computer disaster recovery will become more important, resulting in lawsuits against backup companies being a major feature of the next century <ul> <li>No; not that there aren't any lawsuits, but lawsuits over backup data loss aren't a major feature of this century</li> </ul></li> <li>Home workers will be vulnerable to data loss, will eventually &quot;back up data on-line to computers in other cities as the ultimate security&quot; <ul> <li>Yes, although the reasoning here was incorrect. Dixon concluded this due to the ratio of hard disk sizes (&gt;= 2GB) to floppy disk sizes (&lt;= 2 MB), which caused him to conclude that local backups are impossible (would take more than 1000 floppy disks), but even at the time Dixon was writing, cheap, large, portable disks were available (zip drives, etc.) and tape backups were possible</li> </ul></li> <li>Much greater expenditure on anti virus software, with &quot;monthly updates&quot; of antivirus software, and anti virus companies creating viruses to force people to buy anti virus software <ul> <li>No; MS basically obsoleted commercial anti virus software for what was, by far, the largest platform where users bought anti virus software by providing it for free with Windows; corp spend on anti virus software is still signifcant and increases as corps own more computers, but consumer spend dropping drastically seems opposed to what Dixon was predicting</li> </ul></li> <li>New free zones or semi-states will be created to bypass online sales tax and countries will retaliate against ISPs that provide content served from these tax havens <ul> <li>No</li> </ul></li> <li>Sex industry will be a major driver of internet technologies and technology in general &quot;for the next 30 years&quot; (up through 2028) <ul> <li>No; porn was a major driver of internet technology up to the mid 90s by virtue of being a huge fraction of internet commerce, but this was already changing when Dixon was writing the book (IIRC, mp3 surpassed sex as the top internet search term in 1999) and the non-sex internet economy dwarfs the sex internet economy, so sex sites are no longer major drivers of tech innovation, e.g., youtube's infra drives cutting edge work in a way that pornhub's infra has no need to</li> </ul></li> <li>The internet will end income tax as we know it by 2020 because transactions will be untraceable <ul> <li>No</li> </ul></li> <li>By 2020, sales and property taxes will have replaced income tax due to the above <ul> <li>No</li> </ul></li> <li>All new homes in western countries will be &quot;intelligent&quot; in 2010, which includes things like the washing machine automatically calling a repair person to get repaired when it has a problem, etc. 
<ul> <li>No; I've lived in multiple post-2010 builds and none of them have been &quot;intelligent&quot;</li> </ul></li> <li>Pervasive networking via power outlets by 2005, allowing you to plug into any power outlet &quot;in every building anywhere in the world&quot; to get networking <ul> <li>No</li> </ul></li> <li>PC or console as &quot;smart home&quot; brains by one of the above timelines <ul> <li>No</li> </ul></li> <li>Power line networking eliminates other network technologies in the home <ul> <li>No</li> </ul></li> <li>No more ordering of food by 2000; scanner in rubbish bin will detect when food is used up and automatically order food <ul> <li>No; nonsensical idea even if such scanners were reliable and ubiquitous since the system would only know what food was used, not what food the person wants in the future</li> </ul></li> <li>World will be dominated by the largest telecom companies <ul> <li>No; Dixon's idea was that the importance of the internet and networks would mean that telecom companies would dominate the world, an argument analogous to when people say software companies must grow in importance because software will grow in importance; instead, telecom became a commodity</li> </ul></li> <li>Power companies will compete with telecoms and high voltage lines will carry major long haul traffic by 2001 <ul> <li>No</li> </ul></li> <li>Internet will replace the telephone <ul> <li>Yes</li> </ul></li> <li>Mobile phone costs drop so rapidly that they're free by 2000 <ul> <li>No; arguably yes because some cell phone providers were providing phones free with contract at one point, but once total costs were added up, these weren't cheaper than non-contract phones where those were available</li> </ul></li> <li>Phones with direct retinal displays and voice recognition very soon (prototypes already exist) <ul> <li>No</li> </ul></li> <li>The end of books; replaced by digital books with &quot;more than a hundred paper-thin electronic pages. 
Just load the text you want, settle back and enjoy&quot; <ul> <li>No; display technology isn't there, and it's unclear why something like a Kindle should have Dixon's proposed design instead of just having a one-page display</li> </ul></li> <li>Cheap printing causes print on demand in the home to also be a force in the end of books <ul> <li>No; a very trendy idea in the 90s (either in the home or at local shops), though</li> </ul></li> <li>Growth in internet radio; &quot;expect thousands of amateur disc jockeys, single-issue activists, eccentrics and misfits to be broadcasting to audiences of only a few tens to a few hundred from garages or bedrooms with virtually no equipment other than a hi-fi, a PC, modem, and a microphone, possibly with TV camera&quot; <ul> <li>No; drastically underestimated how many people would broadcast and/or stream</li> </ul></li> <li>Mainstream TV companies will lose prime time viewership <ul> <li>Not scoring this prediction because it's an extremely boring prediction; as Dixon notes in the book, this had already started happening years before he wrote it</li> </ul></li> <li>By 2010, doctors will de facto be required to defer to computers for diagnoses because computer diagnoses will be so much better than human diagnoses that the legal liability for overruling the computer with human judgement will be prohibitive <ul> <li>No</li> </ul></li> <li>Surgeons will be judged on how many people die during operations, which will cause surgeons to avoid operating on patients with likely poor outcomes <ul> <li>No</li> </ul></li> <li>Increased education; &quot;several graduate or postgraduate courses in a lifetime&quot; <ul> <li>No</li> </ul></li> <li>Paper credentials devalued, replaced by emphasis on &quot;skills not created by studying books&quot; <ul> <li>No</li> </ul></li> <li>Governments set stricter targets for literacy, education, etc. <ul> <li>No, or at least not in general for serious targets that are intended to be met</li> </ul></li> <li>Many lawsuits from people who received poor education <ul> <li>No</li> </ul></li> <li>Return to single-sex schools, at least regionally in some areas <ul> <li>No</li> </ul></li> <li>&quot;complete rethink about punishment and education, with the recognition that a no-touch policy isn't working&quot;, by 2005 <ul> <li>No</li> </ul></li> <li>Collapse of black-white integration in schooling in U.S.
cities <ul> <li>No</li> </ul></li> <li>College libraries become irrelevant <ul> <li>No, or no more so than when the book was written, anyway</li> </ul></li> <li>Ubiquitous video phones and video phone usage by 2005 <ul> <li>No</li> </ul></li> <li>Dense multimedia and VR experiences in grocery stores <ul> <li>No</li> </ul></li> <li>General consolidation of retail, except for &quot;corner shops&quot;, which will survive as car-use restrictions &quot;begin to bite&quot;, circa 2010 or so <ul> <li>No</li> </ul></li> <li>Blanket loyalty programs at grocery stores replaced by customized per-person programs <ul> <li>No</li> </ul></li> <li>VR dominates arcades and theme parks by 2010 <ul> <li>No</li> </ul></li> <li>&quot;all complex prototyping [for manufacturing]&quot; done in VR by 2000 <ul> <li>No</li> </ul></li> <li>Rapid prototyping from VR images <ul> <li>No</li> </ul></li> <li>Pervasive use of voice recognition will cause open offices to get redesigned by 2002 <ul> <li>No</li> </ul></li> <li>Speech recognition to have replaced typing to the extent that typing is considered obsolete and inefficient by 2008, except in cases where silence is necessary <ul> <li>No</li> </ul></li> <li>Accurate handwriting recognition will exist but become irrelevant by 2008, obsoleted by speech recognition <ul> <li>No</li> </ul></li> <li>Traditional banking wiped out by the internet <ul> <li>No</li> </ul></li> <li>&quot;millions&quot; of people will buy and sell directly to and from each other via online marketplaces <ul> <li>Not counting this because ebay alone already had 2 million users when the book was published</li> </ul></li> <li>Traditional brokerage services will become less important over time; more trading will happen via cheap or discount brokerages, online <ul> <li>Yes, but an extremely boring prediction that was already coming to pass when the book was written</li> </ul></li> <li>Pervasive corporate espionage, an increase over prior eras, made possible by bugs becoming smaller and easier to place, etc. <ul> <li>No? Hard to judge this one, though</li> </ul></li> <li>Pervasive internal corporate surveillance (microphones and hidden cameras everywhere, including the homes of employees), to fight corporate espionage <ul> <li>No</li> </ul></li> <li>Retina scans commonly used to verify identity <ul> <li>No</li> </ul></li> <li>Full self-driving cars, networked with each other, etc. <ul> <li>No, or not yet anyway</li> </ul></li> <li>Cars physically linked together to form trains on the road <ul> <li>No</li> </ul></li> <li>Widespread tagging of humans with identity chips by 2010 <ul> <li>No</li> </ul></li> </ul> <p>This marks the end of the &quot;Fast&quot; chapter. From having skimmed the rest of the book, the hit rate isn't really higher later nor is the style of reasoning any different, so I'm going to avoid doing a prediction-by-prediction grading.
Instead, I'll just mention a few highlights (some quite accurate, but mostly not; not included in the prediction accuracy rate since I didn't ensure consistent or random sampling):</p> <ul> <li>Extremely limited water supply by 2020, with widespread water metering, recycling of used bathwater, etc.; water so limited that major nations have conflicts over water and water is a major foreign policy instrument by 2010; waterless cleaning of fabrics, etc., by 2025</li> <li>Return to &quot;classic&quot; pop-Christian American family and cultural values, increased stigmatization of single parent households, etc., by 2020</li> <li>Major prohibition movement against smoking, drinking, psychedelic drugs, etc.</li> <li>Increased risk of major disease epidemics due to higher global population and increased mobility</li> <li>Due to increasing tribalism, most new wealth created by companies with &lt;= 20 employees, of which &gt;= 75% are family owned or controlled and started with family money</li> <li>Increased global free trade</li> <li>Death of &quot;old economics&quot; allows for (for example) low unemployment with no inflationary pressure due to combination of globalization pushing down wages and computerization causing productivity increases</li> <li>Travel will have virtually no friction by 2000 due to increased automation; you'll be able to buy a plane ticket online, go to the airport, where a scanner will scan you as you walk through security without delay; you'll even be able to skip the ticket buying process and just walk directly onto a plane, at which point a system will scan an embedded smart-card in your watch or skin and seamlessly deduct the payment from your bank account</li> <li>End of left/right politics and rise of single-issue politics and parties [presumably referring to U.S. politics here]</li> <li>Environmentalism the single biggest political issue</li> <li>Destruction of ozone layer causes people to avoid sun; vacations in sunny areas and beaches no longer popular</li> <li>Very accurate weather predictions by 2008, due to newly collected data allowing accurate forecasting</li> <li>Nuclear power dead, with zero or close to zero active reactors by 2030</li> <li>Increased concern over damage / cancer from &quot;electromagnetic fields&quot;</li> <li>Noise canceling technology wipes out unpleasant noise in cars and homes</li> <li>Widespread market for human cloning, with people often raising a genetic clone of themselves instead of conceiving traditionally</li> <li>Have the capability to design custom viruses / plagues that target particular organs or racial groups by 2010</li> <li>Comprehensive reform of U.S.
legal system to reduce / eliminate spurious lawsuits by 2010</li> <li>Major growth of religions; particularly Islam and Christianity <ul> <li>Globally, as well as in the U.S., where the importance of Christianity will give rise to things like &quot;the Christian Democratic Party&quot; and an increasing number of Christian schools</li> </ul></li> <li>The internet helps guarantee freedom against authoritarian regimes, which can censor newspapers, radio, and TV, but not the internet</li> <li>Total globalization will cause a new world religion to be created which doesn't come from old ideas and will market itself as dogmatic, exclusive, and superior to old religions</li> <li>New world order with international laws and international courts; international trade impossible otherwise</li> <li>&quot;Cyberspace&quot; has its own governance, with a &quot;cyber-government&quot; and calls for democracy where each email address gets a vote; nation-level governance over &quot;cyberspace&quot; &quot;cannot and will not last, nor will any other benevolent dictatorship of non-elected, unrepresentative authority&quot;<br /></li> </ul> <p>Overall accuracy, 8/79 = 10%</p> <h4 id="toffler">Toffler</h4> <p>Intro to Future Shock:</p> <blockquote> <p>Another reservation has to do with the verb &quot;will.&quot; No serious futurist deals in &quot;predictions.&quot; These are left for television oracles and newspaper astrologers. ... Yet to enter every appropriate qualification in a book of this kind would be to bury the reader under an avalanche of maybes. Rather than do this, I have taken the liberty of speaking firmly, without hesitation, trusting that the intelligent reader will understand the stylistic problem. The word &quot;will&quot; should always be read as though it were preceded by &quot;probably&quot; or &quot;in my opinion.&quot; Similarly, all dates applied to future events need to be taken with a grain of judgment.</p> </blockquote> <p>[Chapter 1 is about how future shock is going to be a big deal in the future and how we're presently undergoing a revolution]</p> <p>Despite the disclaimer in the intro, there are very few concrete predictions. The first that I can see is in the middle of chapter two and isn't even really a prediction, but is a statement that very weakly implies world population growth will continue at the same pace or accelerate. Chapter 1 has a lot of vague statements about how severe future shock will be, and then Chapter 2 discusses how the world is changing at an unprecedented rate and cites a population doubling time of eleven years to note how much this must change the world since it would require the equivalent of a new Tokyo, Hamburg, Rome, and Rangoon in eleven years, illustrating how shockingly rapidly the world is changing. There's a nod to the creation of future subterranean cities, but stated weakly enough that it can't really be called a prediction.</p> <p>There's a similar implicit prediction that economic growth will continue with a doubling time of fifteen years, meaning that by the time someone is thirty, the amount of stuff (and it's phrased as amount of stuff and not wealth) will have quadrupled and then by the time someone is seventy it will have increased by a factor of thirty two.
This is a stronger implicit prediction than the previous one since the phrasing implies this growth rate should continue for at least seventy years and is perhaps the first actual prediction in the book.</p> <p>Another such prediction appears later in the chapter, on the speed of travel, which took millions of years to reach 100 mph in the 1880s, only fifty-eight years to reach 400 mph in 1938, and then twenty years to double again, and then not much more time before rockets could propel people at 4000 mph and people circled the earth at 18000 mph. Strictly speaking, no prediction is made as to the speed of travel in the future, but since the two chapters are about how this increased rate of change will, in the future, cause future shock, citing examples where exponential growth is expected to level off as reasons the future is going to cause future shock would be silly, so implicit in the citation is that the speed of travel will continue to grow.</p> <p>Toffler then goes on to cite a series of examples where, at previous times in history, the time between having an idea and applying the idea was large, shrinking as we get closer to the present, where it's very low because &quot;we have, with the passage of time, invented all sorts of social devices to hasten the process&quot;.</p> <p>Through Chapter 4, Toffler continued to avoid making concrete, specific predictions, but also implied that buildings would be more temporary and, in the United States specifically, there would be an increase in tearing down old buildings (e.g., ten year old apartment buildings) to build new ones because new buildings would be so much better than old ones that it wouldn't make sense to live in old buildings, and that schools will move to using temporary buildings that are quickly dismantled after they're no longer necessary, perhaps often using geodesic domes.</p> <p>Also, a general increase in modularity, with parts of buildings being swapped out to allow more rapid changes during the short, 25-year life of a modern building.</p> <p>Another implied prediction is that everything will be rented instead of owned, with specific examples cited of cars and homes, with an extremely rapid growth in the rate of car rentership over ownership continuing through the 70s in the then-near future.</p> <p>Through Chapter 5, Toffler continued to avoid making specific predictions, but very strongly implies that the amount of travel people will do for mundane tasks such as commuting will hugely increase, making location essentially irrelevant. As with previous implied predictions, this is based on a very rapid increase in what Toffler views as a trend and is implicitly a prediction of the then very near future, citing people who commute 50k miles in a year and 120 miles in a day and citing stats showing that miles traveled have been increasing.
When it comes to an actual prediction, Toffler makes the vague comment</p> <blockquote> <p>among those I have characterized as &quot;the people of the future,&quot; commuting, traveling, and regularly relocating one's family have become second nature.</p> </blockquote> <p>Which, if read very strictly, is technically not a prediction about the future, although it implies that people in the future will commute and travel much more.</p> <p>In a similar implicit prediction, Toffler implies that, in the future, corporations will order highly skilled workers to move to whatever location most benefits the corporation and they'll have no choice but to obey if they want to have a career.</p> <p>In Chapter 6, in a rare concrete prediction, Toffler writes</p> <blockquote> <p>When asked &quot;What do you do?&quot; the super-industrial man will label himself not in terms of his present (transient) job, but in terms of his trajectory type, the overall pattern of his work life.</p> </blockquote> <p>Some obsolete example job types that Toffler presents are &quot;machine operator&quot;, &quot;sales clerk&quot;, and &quot;computer programmer&quot;. Implicit in this section is that career changes will be so rapid and so frequent that the concept of being &quot;a computer programmer&quot; will be meaningless in the future. It's also implied that the half-life of knowledge will be so short in the future that people will no longer accumulate useful knowledge over the course of their career, and people, especially in management, shouldn't expect to move up with age and may be expected to move down with age as their knowledge becomes obsolete and they end up in &quot;simpler&quot; jobs.</p> <p>It's also implied that more people will work for temp agencies, replacing what would previously have been full-time roles. The book is highly U.S. centric and, in the book, this is considered positive for workers (it will give people more flexibility) without mentioning any of the downsides (lack of benefits, etc.). The chapter has some actual explicit predictions about how people will connect to family and friends, but the predictions are vague enough that it's difficult to say if they've been satisfied or not.</p> <p>In chapter 7, Toffler says that bureaucracies will be replaced by &quot;adhocracies&quot;. Where bureaucracies had top down power and put people into well-defined roles, in adhocracies, roles will change so frequently that people won't get stuck into defined roles. Toffler notes that a concern some people have about the future is that, since organizations will get larger and more powerful, people will feel like cogs, but says this concern is unwarranted because adhocracy will replace bureaucracy. This will also mean an end to top-down direction because the rapid pace of innovation in the future won't leave time for any top down decision making, giving workers power. Furthermore, computers will automate all mundane and routine work, leaving no more need for bureaucracy because bureaucracy will only be needed to control large groups of people doing routine work and has no place in non-routine work. It's implied that &quot;in the next twenty-five to fifty years [we will] participate in the end of bureaucracy&quot;. As Toffler was writing in 1970, his timeframe for that prediction is 1995 to 2020.</p> <p>Chapter 8 takes the theme of everything being quicker and turns it to culture.
Toffler predicts that celebrities, politicians, sports stars, famous fictional characters, best selling books, pieces of art, knowledge, etc., will all have much shorter careers and/or durations of relevance in the future. Also, new, widely used, words will be coined more rapidly than in the past.</p> <p>Chapter 9 takes the theme of everything accelerating and notes that social structures and governments are poised to break down under the pressure of rapid change, as evidenced by unrest in Berlin, New York, Turin, Tokyo, Washington, and Chicago. It's possible this is what Toffler is using to take credit for predicting the fall of the Soviet Union?</p> <p>Under the subheading &quot;The New Atlantis&quot;, Toffler predicts an intense race to own the bottom of the ocean and the associated marine life there, with entire new industries springing up to process the ocean's output. &quot;Aquaculture&quot; will be as important as &quot;agriculture&quot;, new textiles, drugs, etc., will come from the ocean. This will be a new frontier, akin to the American frontier, people will colonize the ocean. Toffler says &quot;If all this sounds too far off it is sobering to note that Dr. Walter L. Robb, a scientist at General Electric has already kept a hamster alive under water by enclosing it in a box that is, in effect, an artificial gill--a synthetic membrane that extracts air from the surrounding water while keeping the water out.&quot; Toffler gives the timeline for ocean colonization as &quot;long before the arrival of A.D. 2000&quot;.</p> <p>Toffler also predicts control over the weather starting in the 70s, that &quot;It is clearly only a matter of years&quot; before women are able to birth children &quot;without the discomfort of pregnancy&quot;.</p> <p>I stopped reading at this point because the chapters all seem very similar to each other, applying the same reasoning to different areas and the rate of accuracy of predictions didn't seem likely to increase in later chapters.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:W">I used web.archive.org to pull an older list because the current list of futurists is far too long for people to evaluate. I clicked on an arbitrary time in the past on archive.org and that list seemed to be short enough to evaluate (though, given the length of this post, perhaps that's not really true) and then looked at those futurists. <a class="footnote-return" href="#fnref:W"><sup>[return]</sup></a></li> <li id="fn:I">While there are cases where people can make great predictions or otherwise show off expertise while making &quot;cocktail party idea&quot; level statements because it's possible to have a finely honed intuition without being able to verbalize the intuition, developing that kind of intuition requires taking negative feedback seriously in order to train your intuition, which is the opposite of what we observed with the futurists discussed in this post. <a class="footnote-return" href="#fnref:I"><sup>[return]</sup></a></li> <li id="fn:B"><p>Ballmer is laughing with incredulity when he says this; $500 is too expensive for phone and will be the most expensive phone by far; a phone without a keyboard won't appeal to business users and won't be useful for writing emails; you can get &quot;great&quot; Windows Phone devices like the Motorola QPhone for $100, which will do everything (messaging, email, etc.), etc.</p> <p>You can see these kinds of futurist-caliber predictions all over the place in big companies. 
For example, on internal G+ at Google, Steve Yegge made a number of quite accurate predictions about what would happen with various major components of Google, such as Google cloud. If you read comments from people who are fairly senior, many disagreed with Yegge for reasons that I would say were fairly transparently bad at the time and were later proven to be incorrect by events. There's a sense in which you can say this means that what's going to happen isn't so obvious even with the right information, but <a href="https://www.patreon.com/posts/how-do-their-71735437">this really depends on what you mean by obvious</a>.</p> <p>A kind of anti-easter egg in Tetlock's Superforecasting is that Tetlock makes the &quot;smart contrarian&quot; case that the Ballmer quote is unjustly attacked since worldwide iPhone marketshare isn't all that high and he also claims that Ballmer is making a fairly measured statement that's been taken out of context, which seems plausible if you read the book and look at the out of context quote Tetlock uses but is obviously untrue if you watch the interview the quote comes from. Tetlock has mentioned that he's not a superforecaster and has basically said that he doesn't have the patience necessary to be one, so I don't hold this against him, but I do find it a bit funny that this bogus Freakonomics-style contrarian &quot;refutation&quot; is in this book that discusses, at great length, how important it is to understand the topic you're discussing.</p> <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:F"><p>Although this is really a topic for another post, I'll note that longtermists not only often operate with the same level of certainty, but also on the exact same topics, e.g., in 2001, noted longetermist Eliezer Yudkowsky said the following in a document describing Flare, his new programming language:</p> <blockquote> <p>A new programming language has to be <i>really good</i> to survive. A new language needs to represent a quantum leap just to be in the game. Well, we're going to be up-front about this: Flare is <i>really good</i>. There are concepts in Flare that have never been seen before. We expect to be able to solve problems in Flare that cannot realistically be solved in any other language. ... Back in the good old days, it may have made sense to write &quot;efficient&quot; programming languages. This, however, is a new age. The age of microwave ovens and instant coffee. The age of six-month-old companies, twenty-two-year-old CEOs and Moore's Law. The age of fiber optics. The age of <i>speed</i>. ... &quot;Efficiency&quot; is the property that determines how much hardware you need, and &quot;scalability&quot; is the property that determines whether you can throw more hardware resources at the problem. In extreme cases, lack of scalability may defeat some problems entirely; for example, any program built around 32-bit pointers may not be able to scale at all past 4GB of memory space. Such a lack of scalability forces programmer efforts to be spent on efficiency - on doing more and more with the mere 4GB of memory available. Had the hardware and software been scalable, however, more RAM could have been bought; this is not necessarily cheap but it is usually cheaper than buying another programmer. ... Scalability also determines how well a program or a language ages with time. 
Imposing a hard limit of 640K on memory or 4GB on disk drives may not seem absurd when the decision is made, but the inexorable progress of Moore's Law and its corollaries inevitably bumps up against such limits. ... Flare is a language built around the philosophy that it is acceptable to sacrifice efficiency in favor of scalability. What is important is not squeezing every last scrap of performance out of current hardware, but rather preserving the ability to throw hardware at the problem. As long as scalability is preserved, it is also acceptable for Flare to do complex, MIPsucking things in order to make things easier for the programmer. In the dawn days of computing, most computing tasks ran up against the limit of available hardware, and so it was necessary to spend a lot of time on optimizing efficiency just to make computing a bearable experience. Today, most simple programs will run pretty quickly (instantly, from the user's perspective), whether written in a fast language or a slow language. If a program is slow, the limiting factor is likely to be memory bandwidth, disk access, or Internet operations, rather than RAM usage or CPU load. ... Scalability often comes at a cost in efficiency. Writing a program that can be parallelized traditionally comes at a cost in memory barrier instructions and acquisition of synchronization locks. For small N, O(N) or O(N**2) solutions are sometimes faster than the scalable O(C) or O(N) solutions. A two-way linked list allows for constant-time insertion or deletion, but at a cost in RAM, and at the cost of making the list more awkward (O(N) instead of O(C) or O(log N)) for other operations such as indexed lookup. Tracking Flare's two-way references through a two-way linked list maintained on the target burns RAM to maintain the scalability of adding or deleting a reference. Where only ten references exist, an ordinary vector type would be less complicated and just as fast, or faster. Using a two-way linked list adds complication and takes some additional computing power in the smallest case, and buys back the theoretical capability to scale to thousands or millions of references pointing at a single target... though perhaps for such an extreme case, further complication might be necessary.</p> </blockquote> <p>As with the other Moore's law predictions of the era, this is not only wrong in retrospect, it was so obviously wrong that undergraduates were taught why this was wrong.</p> <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:C"><p>My personal experience is that, as large corporations have gotten more powerful, <a href="nothing-works/">the customer experience has often gotten significantly worse</a> as I'm further removed from a human who feels empowered to do anything to help me when I run into a real issue. And the only reason my experience can be described as merely significantly worse and not much worse is that I have enough Twitter followers that when I run into a bug that makes a major corporation's product stop working for me entirely (which happened twice in the past year), I can post about it on Twitter and it's likely someone will escalate the issue enough that it will get fixed.</p> <p>In 2005, when I interacted with corporations, it was likely that I was either directly interacting with someone who could handle whatever issue I had or that I only needed a single level of escalation to get there. 
And, in the event that the issue wasn't solvable (which never happened to me, but could happen), the market was fragmented enough that I could just go use another company's product or service. More recently, in the two cases where I had to go resort to getting support via Twitter, one of the products essentially has no peers, so my ability to use any product or service of that kind would have ended if I wasn't able to find a friend of a friend to help me or if I couldn't craft some kind of viral video / blog post / tweet / etc. In the other case, there are two companies in the space, but one is much larger and offers effective service over a wider area, so I would've lost the ability to use an entire class of product or service in many areas with no recourse other than &quot;going viral&quot;. There isn't a simple way to quantify whether or not this effect is &quot;larger than&quot; the improvements which have occurred and if, on balance, consumer experiences have improved or regressed, but there are enough complaints about how widespread this kind of thing is that degraded experiences should at least have some weight in the discussion, and Kurzweil assigns them zero weight.</p> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:P"><p>If it turns out that longtermists and other current predictors of the future very heavily rely on the same techniques as futurists past, I may not write up the analysis since it will be quite long and I don't think it's very interesting to write up a very long list of obvious blunders. Per the comment above about how this post would've been more interesting if it focused on business leaders, it's a lot more interesting to write up an analysis if there are some people using reasonable methodologies that can be compared and contrasted.</p> <p>Conversely, if people predicting the future don't rely on the techniques discussed here at all, then an analysis informed by futurist methods would be a fairly straightforward negative result that could be a short Twitter thread or a very short post. As Catherine Olsson points out, longtermists draw from a variety of intellectual traditions (and I'm not close enough to longtermist culture to personally have an opinion of the relative weights of these traditions):</p> <blockquote> <p>Modern 'longtermism' draws on a handful of intellectual traditions, including historical 'futurist' thinking, as well as other influences ranging from academic philosophy of population ethics to Berkeley rationalist culture.</p> <p>To the extent that 'longtermists' today are using similar prediction methods to historical 'futurists' in particular, [this post] bodes poorly for longtermists' ability to anticipate technological developments in the coming decades</p> </blockquote> <p>If there's a serious &quot;part 2&quot; to this post, we'll look at this idea and others but, for the reasons mentioned above, there may not be much of a &quot;part 2&quot; to this post.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:N"><a href="https://nostalgebraist.tumblr.com/post/692246981744214016/more-on-metaculus-badness">This post by nostalgebraist gives another example of this</a>, where metaculus uses Brier scores for scoring, just like Tetlock did for his Superforecasting work. 
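For reference, the Brier score of a set of forecasts is just the mean squared error between the forecast probabilities and the 0/1 outcomes of the things being forecast, which only makes sense when each forecast is paired with a well-defined question that resolves to yes or no; a minimal sketch (mine, not from the linked post): <pre><code># Toy Brier score: mean squared error between forecast probabilities and
# 0/1 outcomes. Lower is better; always guessing 0.5 scores 0.25.
def brier_score(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

print(brier_score([0.9, 0.5, 0.1], [1, 0, 0]))  # 0.09
</code></pre>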
This gives it an air of credibility until you look at what's actually being computed, which is not something that's meaningful to take a Brier score over, meaning the result of using this rigorous, Superforecasting-approved, technique is nonsense; exactly the kind of thing McElreath warns about. <a class="footnote-return" href="#fnref:N"><sup>[return]</sup></a></li> </ol> </div> In defense of simple architectures simple-architectures/ Wed, 06 Apr 2022 00:00:00 +0000 simple-architectures/ <p>Wave is a $1.7B company with 70 engineers<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup> whose product is a CRUD app that adds and subtracts numbers. In keeping with this, our architecture is a standard CRUD app architecture, a Python monolith on top of Postgres. <a href="https://twitter.com/danluu/status/1462607028585525249">Starting with a simple architecture and solving problems in simple ways</a> where possible has allowed us to scale to this size while engineers mostly focus on work that delivers value to users.</p> <p>Stackoverflow scaled up a monolith to good effect (<a href="https://nickcraver.com/blog/2013/11/22/what-it-takes-to-run-stack-overflow/">2013 architecture</a> / <a href="https://nickcraver.com/blog/2016/02/17/stack-overflow-the-architecture-2016-edition/">2016 architecture</a>), eventually getting acquired for $1.8B. If we look at traffic instead of market cap, Stackoverflow is among the top 100 highest traffic sites on the internet (for many other examples of valuable companies that were built on top of monoliths, <a href="https://twitter.com/danluu/status/1498678300163588096">see the replies to this Twitter thread</a>. We don’t have a lot of web traffic because we’re a mobile app, but Alexa still puts our website in the top 75k even though our website is basically just a way for people to find the app and most people don’t even find the app through our website).</p> <p>There are some kinds of applications that have demands that would make a simple monolith on top of a boring database a non-starter but, for most kinds of applications, even at top-100 site levels of traffic, computers are fast enough that high-traffic apps can be served with simple architectures, which can generally be created more cheaply and easily than complex architectures.</p> <p>Despite the unreasonable effectiveness of simple architectures, most press goes to complex architectures. For example, at a recent generalist tech conference, there were six talks on how to build or deal with side effects of complex, microservice-based, architectures and zero on how one might build out a simple monolith. There were more talks on quantum computing (one) than talks on monoliths (zero). Larger conferences are similar; a recent enterprise-oriented conference in SF had a double-digit number of talks on dealing with the complexity of a sophisticated architecture and zero on how to build a simple monolith. Something that was striking to me the last time I attended that conference is how many attendees who worked at enterprises with low-scale applications that could’ve been built with simple architectures had copied the latest and greatest sophisticated techniques that are popular on the conference circuit and HN.</p> <p>Our architecture is so simple I’m not even going to bother with an architectural diagram. 
Instead, I’ll discuss a few boring things we do that help us keep things boring.</p> <p>We’re currently using boring, synchronous, Python, which means that our server processes block while waiting for I/O, like network requests. We previously tried Eventlet, an async framework that would, in theory, let us get more efficiency out of Python, but ran into so many bugs that we decided the CPU and latency cost of waiting for events wasn’t worth the operational pain we had to take on to deal with Eventlet issues. There are other <a href="https://twitter.com/mcfunley/status/1194713713330122752">well-known async frameworks for Python</a>, but users of those frameworks often also report <a href="https://twitter.com/mcfunley/status/1194715290841432064">significant fallout from using those frameworks at scale</a>. Using synchronous Python is expensive, in the sense that we pay for CPU that does nothing but wait during network requests, but since we’re only handling billions of requests a month (for now), the cost of this is low even when using a slow language, like Python, and paying retail public cloud prices. The cost of our engineering team completely dominates the cost of the systems we operate<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">2</a></sup>.</p> <p>Rather than take on the complexity of making our monolith async, we farm out long-running tasks (that we don’t want responses to block on) to a queue.</p> <p>A place where we can’t be as boring as we’d like is with our on-prem datacenters. When we were operating solely in Senegal and Côte d'Ivoire, we operated fully in the cloud, but as we expand into Uganda (and more countries in the future), we’re having to split our backend and deploy on-prem to comply with local data residency laws and regulations. That's not exactly a simple operation, but as anyone who's done the same thing with a complex service-oriented architecture knows, this operation is much simpler than it would've been if we had a complex service-oriented architecture.</p> <p>Another area is with software we’ve had to build (instead of buy). When we started out, we strongly preferred buying software over building it because a team of only a few engineers can’t afford the time cost of building everything. That was the right choice at the time even though <a href="nothing-works/">the “buy” option generally gives you tools that don’t work</a>. In cases where vendors can’t be convinced to fix showstopping bugs that are critical blockers for us, <a href="in-house/">it does make sense to build more of our own tools and maintain in-house expertise in more areas</a>, in contradiction to the standard advice that a company should only choose to “build” in its core competency. Much of that complexity is complexity that we don’t want to take on, but in some product categories, even after fairly extensive research we haven’t found any vendor that seems likely to provide a product that works for us. To be fair to our vendors, the problem they’d need to solve to deliver a working solution to us is much more complex than the problem we need to solve since our vendors are taking on the complexity of solving a problem for every customer, whereas we only need to solve the problem for one customer, ourselves.</p> <p>A mistake we made in the first few months of operation that has some cost today was not carefully delimiting the boundaries of database transactions.
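<p>(As a hypothetical illustration of the alternative, not Wave's actual code: explicitly delimited transactions in SQLAlchemy look roughly like the sketch below. The <code>Account</code> model, <code>transfer_funds</code>, and the in-memory SQLite engine are made up to keep the example self-contained; the real thing would point at Postgres.)</p> <pre><code># Hypothetical sketch, not Wave's code: one explicitly scoped transaction per
# logical operation, instead of a request-global session that any helper
# function can commit mid-request.
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Account(Base):  # made-up model for illustration
    __tablename__ = "accounts"
    id = Column(Integer, primary_key=True)
    balance = Column(Integer, nullable=False, default=0)

engine = create_engine("sqlite://")  # in-memory database so the sketch runs standalone
Base.metadata.create_all(engine)

def transfer_funds(from_id: int, to_id: int, amount: int) -> None:
    # Everything inside the block commits together on success and rolls back
    # on any exception, so the transaction boundary is easy to see and control.
    with Session(engine) as session, session.begin():
        src = session.get(Account, from_id)
        dst = session.get(Account, to_id)
        src.balance -= amount
        dst.balance += amount

# Seed two accounts, then move money between them.
with Session(engine) as session, session.begin():
    session.add_all([Account(id=1, balance=100), Account(id=2, balance=0)])

transfer_funds(1, 2, 30)
</code></pre>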
In Wave’s codebase, by contrast, the SQLAlchemy database session is a request-global variable; it implicitly begins a new database transaction any time a DB object’s attribute is accessed, and any function in Wave’s codebase can call commit on the session, causing it to commit all pending updates. This makes it difficult to control the time at which database updates occur, which increases our rate of subtle data-integrity bugs, as well as making it harder to lean on the database to build things like <a href="https://brandur.org/idempotency-keys">idempotency keys</a> or a <a href="https://brandur.org/job-drain">transactionally-staged job drain</a>. It also increases our risk of accidentally holding open long-running database transactions, which can <a href="https://gocardless.com/blog/zero-downtime-postgres-migrations-the-hard-parts/">make schema migrations operationally difficult</a>.</p> <p>Some choices that we’re unsure about (in that these are things we’re either thinking about changing, or would recommend to other teams starting from scratch to consider a different approach) were using RabbitMQ (for our purposes, Redis would probably work equally well as a task queue and just using Redis would reduce operational burden), using Celery (which is overcomplicated for our use case and has been implicated in several outages, e.g., due to backwards compatibility issues during version upgrades), using SQLAlchemy (which makes it hard for developers to understand what database queries their code is going to emit, leading to various situations that are hard to debug and involve unnecessary operational pain, especially related to the above point about database transaction boundaries), and using Python (which was the right initial choice because of our founding CTO’s technical background, but its concurrency support, performance, and extensive dynamism make us question whether it’s the right choice for a large-scale backend codebase). None of these was a major mistake, and for some (e.g. Python) the downsides are minimal enough that it’s cheaper for us to continue to pay the increased maintenance burden than to invest in migrating to something theoretically better, but if we were starting a similar codebase from scratch today we’d think hard about whether they were the right choice.</p> <p>Some areas where we’re happy with our choices even though they may not sound like the simplest feasible solution are with our API, where we use GraphQL, with our transport protocols, where we had a custom protocol for a while, and with our host management, where we use Kubernetes. For our transport protocols, we used to use a custom protocol that ran on top of UDP, with an SMS and USSD fallback, <a href="https://www.youtube.com/watch?v=EAxnA9L5rS8">for the performance reasons described in this talk</a>. With the rollout of HTTP/3, we’ve been able to replace our custom protocol with HTTP/3 and we generally only need USSD for events like the recent internet shutdowns in Mali.</p> <p>As for using GraphQL, we believe the pros outweigh the cons for us:</p> <p>Pros:</p> <ul> <li>Self-documentation of exact return type</li> <li>Code generation of exact return type leads to safer clients</li> <li>GraphiQL interactive explorer is a productivity win</li> <li>Our various apps (user app, support app, Wave agent app, etc.)
can mostly share one API, reducing complexity</li> <li>Composable query language allows clients to fetch exactly the data they need in a single packet roundtrip without needing to build a large number of special-purpose endpoints</li> <li>Eliminates bikeshedding over what counts as a RESTful API</li> </ul> <p>Cons:</p> <ul> <li>GraphQL libraries weren’t great when we adopted GraphQL (the base Python library was a port of the Javascript one so not Pythonic, Graphene required a lot of boilerplate, and Apollo-Android produced very poorly optimized code)</li> <li>Default GQL encoding is redundant and we care a lot about limiting size because many of our customers have low bandwidth</li> </ul> <p>As for Kubernetes, we use Kubernetes because we knew that, if the business was successful (which it has been) and we kept expanding, we’d eventually expand to countries that require us to operate our services in country. The exact regulations vary by country, but we’re already expanding into one major African market that requires we operate our “primary datacenter” in the country and there are others with regulations that, e.g., require us to be able to fail over to a datacenter in the country.</p> <p>An area where there’s unavoidable complexity for us is with telecom integrations. In theory, we would use a SaaS SMS provider for everything, but <a href="https://youtu.be/6tb8ALAvodM?t=196">the major SaaS SMS provider doesn’t operate everywhere in Africa</a> and the cost of using them everywhere would be prohibitive<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">3</a></sup>. The earlier comment on how the compensation cost of engineers dominates the cost of our systems wouldn’t be true if we used a SaaS SMS provider for all of our SMS needs; the team that provides telecom integrations pays for itself many times over.</p> <p>By keeping our application architecture as simple as possible, we can spend our complexity (and headcount) budget in places where there’s complexity that it benefits our business to take on. Taking the idea of doing things as simply as possible unless there’s a strong reason to add complexity has allowed us to build a fairly large business with not all that many engineers despite running an African finance business, which is generally believed to be a tough business to get into, which we’ll discuss in a future post (one of our earliest and most helpful advisers, who gave us advice that was critical in Wave’s success, initially suggested that Wave was a bad business idea and the founders should pick another one because he foresaw so many potential difficulties).</p> <p><em>Thanks to Ben Kuhn, Sierra Rotimi-Williams, June Seif, Kamal Marhubi, Ruthie Byers, Lincoln Quirk, Calum Ball, John Hergenroeder, Bill Mill, Sophia Wisdom, and Finbarr Timbers for comments/corrections/discussion.</em></p> <p><link rel="canonical" href="simple-architecture"/></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R">If you want to compute a ratio, we had closer to 40 engineers when we last fundraised and were valued at $1.7B. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:B">There are business models for which this wouldn't be true, e.g., if we were an ad-supported social media company, the level of traffic we'd need to support our company as it grows would be large enough that we'd incur a significant financial cost if we didn't <a href="sounds-easy/">spend a significant fraction of our engineering time on optimization and cost reduction work</a>.
But, as a company that charges real money for a significant fraction of interactions with an app, our computational load per unit of revenue is very low compared to a social media company and it's likely that this will be a minor concern for us until we're well over an order of magnitude larger than we are now; it's not even clear that this would be a major concern if we were two orders of magnitude larger, although it would definitely be a concern at three orders of magnitude growth. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:C">Despite the classic advice about how one shouldn’t compete on price, we (among many other things) do compete on price and therefore must care about costs. We’ve driven down the cost of mobile money in Africa and <a href="https://www.wave.com/en/blog/world/">our competitors have had to slash their prices to match our prices, which we view as a positive value for the world</a> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> </ol> </div> Why is it so hard to buy things that work well? nothing-works/ Mon, 14 Mar 2022 00:00:00 +0000 nothing-works/ <p>There's a <a href="cocktail-ideas/">cocktail party version</a> of the efficient markets hypothesis I frequently hear that's basically, &quot;markets enforce efficiency, so it's not possible that a company can have some major inefficiency and survive&quot;. <a href="tech-discrimination/">We've previously discussed Marc Andreessen's quote that tech hiring can't be inefficient here</a> and <a href="talent/">here</a>:</p> <blockquote> <p>Let's launch right into it. I think the critique that Silicon Valley companies are deliberately, systematically discriminatory is incorrect, and there are two reasons to believe that that's the case. ... No. 2, our companies are desperate for talent. Desperate. Our companies are dying for talent. They're like lying on the beach gasping because they can't get enough talented people in for these jobs. The motivation to go find talent wherever it is unbelievably high.</p> </blockquote> <p>Variants of this idea that I frequently hear engineers and VCs repeat involve companies being efficient and/or products being basically as good as possible because, if it were possible for them to be better, someone would've outcompeted them and done it already<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <p>There's a vague plausibility to that kind of statement, which is why it's <a href="cocktail-ideas/">a debate I've often heard come up in casual conversation</a>, where one person will point out some obvious company inefficiency or product error and someone else will respond that, if it's so obvious, someone at the company would have fixed the issue or another company would've come along and won based on being more efficient or better. Talking purely abstractly, it's hard to settle the debate, but things are clearer if we look at some specifics, as in the two examples above about hiring, where we can observe that, whatever abstract arguments people make, inefficiencies persisted for decades.</p> <p>When it comes to buying products and services, at a personal level, most people I know who've checked the work of people they've hired for things like home renovation or <a href="https://twitter.com/benskuhn/status/1477072484092375040">accounting</a> have found grievous errors in the work. 
Although it's possible to find people who don't do shoddy work, <a href="https://twitter.com/pushcx/status/1499381895653699585">it's generally difficult for someone who isn't an expert in the field to determine if someone is going to do shoddy work in the field</a>. You can try to get better quality by paying more, but once you get out of the very bottom end of the market, it's frequently unclear how to trade money for quality, e.g., my friends and colleagues who've gone with large, brand name, accounting firms have paid much more than people who go with small, local, accountants and gotten a higher error rate; as a strategy, trying expensive local accountants hasn't really fared much better. The good accountants are typically somewhat expensive, but they're generally not charging the highest rates and only a small percentage of somewhat expensive accountants are good.</p> <p>More generally, in many markets, consumers are uninformed and <a href="why-benchmark/#appendix-capitalism">it's fairly difficult to figure out which products are even half decent, let alone good</a>. When people happen to choose a product or service that's right for them, it's often for the wrong reasons. For example, in my social circles, there have been two waves of people migrating from iPhones to Android phones over the past few years. Both waves happened due to Apple PR snafus which caused a lot of people to think that iPhones were terrible at something when, in fact, they were better at that thing than Android phones. Luckily, iPhones aren't strictly superior to Android phones and many people who switched got a device that was better for them because they were previously using an iPhone due to good Apple PR, causing their errors to cancel out. But, when people are mostly making decisions off of marketing and PR and don't have access to good information, there's no particular reason to think that a product being generally better or even strictly superior will result in that winning and the worse product losing. In capital markets, we don't need all that many informed participants to think that some form of the efficient market hypothesis holds ensuring &quot;prices reflect all available information&quot;. It's a truism that published results about market inefficiencies stop being true the moment they're published because people exploit the inefficiency until it disappears. But with the job market examples, even though firms can take advantage of mispriced labor, as Greenspan famously did before becoming Chairman of the fed, inefficiencies can persist:</p> <blockquote> <p>Townsend-Greenspan was unusual for an economics firm in that the men worked for the women (we had about twenty-five employees in all). My hiring of women economists was not motivated by women's liberation. It just made great business sense. I valued men and women equally, and found that because other employers did not, good women economists were less expensive than men. Hiring women . . . gave Townsend-Greenspan higher-quality work for the same money . . 
.</p> </blockquote> <p>But as we also saw, individual firms exploiting mispriced labor have a limited demand for labor and inefficiencies can persist for decades because the firms that are acting on &quot;all available information&quot; don't buy enough labor to move the price of mispriced people to where it would be if most or all firms were acting rationally.</p> <p>In the abstract, it seems that, with products and services, inefficiencies should also be able to persist for a long time since, similarly, there also isn't a mechanism that allows actors in the system to exploit the inefficiency in a way that directly converts money into more money, and sometimes there isn't really even a mechanism to make almost any money at all. For example, if you observe that it's silly for people to move from iPhones to Android phones because they think that Apple is engaging in nefarious planned obsolescence when Android devices generally become obsolete more quickly, due to a combination of <a href="android-updates/">iPhones getting updates for longer</a> and iPhones being faster at every price point they compete at, allowing the phone to be used on <a href="web-bloat/">bloated sites</a> <a href="https://twitter.com/danluu/status/919423480776613888">for longer</a>, you can't really make money off of this observation. This is unlike a mispriced asset that you can buy derivatives of to make money (in expectation).</p> <p>A common suggestion for the problem of not knowing what product or service is good is to ask an expert in the field or a credentialed person, but <a href="https://twitter.com/danluu/status/1324495890548056065">this often fails as well</a>. For example, a friend of mine had trouble sleeping because his window air conditioner was loud and would wake him up when it turned on. He asked a trusted friend of his who works on air conditioners if this could be improved by getting a newer air conditioner and his friend said &quot;no; air conditioners are basically all the same&quot;. But any consumer who's compared items with motors in them would immediately know that this is false. Engineers have gotten much better at producing quieter devices when holding power and cost constant. My friend eventually bought a newer, quieter, air conditioner, which solved his sleep problem, but he had the problem for longer than he needed to because he assumed that someone whose job it is to work on air conditioners would give him non-terrible advice about air conditioners. If my friend were an expert on air conditioners or had compared the noise levels of otherwise comparable consumer products over time, he could've figured out that he shouldn't trust his friend, but if he had that level of expertise, he wouldn't have needed advice in the first place.</p> <p>So far, we've looked at the difficulty of getting the right product or service at a personal level, but this problem also exists at the firm level and is often worse because the markets tend to be thinner, with fewer products available as well as opaque, &quot;call us&quot; pricing. Some commonly repeated advice is that firms should focus on their &quot;core competencies&quot; and outsource everything else (e.g., Joel Spolsky, Gene Kim, Will Larson, Camille Fournier, etc., all say this), but if we look at mid-sized tech companies, we can see that they often need to have in-house expertise that's far outside what anyone would consider their core competency unless, e.g., <a href="in-house/">every social media company has kernel expertise as a core competency</a>.
In principle, firms can outsource this kind of work, but people I know who've relied on outsourcing, e.g., kernel expertise to consultants or application engineers on a support contract, have been very unhappy with the results compared to what they can get by hiring dedicated engineers, both in absolute terms (support frequently doesn't come up with a satisfactory resolution in weeks or months, even when it's an issue a good engineer could solve in days) and for the money (despite engineers being expensive, large support contracts can often cost more than an engineer while delivering worse service than an engineer).</p> <p>This problem exists not only for support but also for products a company could buy instead of build. For example, Ben Kuhn, the CTO of Wave, has <a href="https://twitter.com/benskuhn/status/1324493738304036864">a Twitter thread about some of the issues we've run into at Wave</a>, with <a href="https://twitter.com/benskuhn/status/1382325921311563779">a couple</a> of <a href="https://twitter.com/benskuhn/status/1496167942744346624">followups</a>. Ben now believes that one of the big mistakes he made as CTO was not putting much more effort into vendor selection, even when the decision appeared to be a slam dunk, and more strongly considering moving many systems to custom in-house versions sooner. Even after selecting the consensus best product in the space from the leading (as in largest and most respected) firm, and using the main offering the company has, the product often not only doesn't work but, by design, can't work.</p> <p>For example, we tried &quot;buy&quot; instead of &quot;build&quot; for a product that syncs data from Postgres to Snowflake. Syncing from Postgres is the main offering (as in the offering with the most customers) from a leading data sync company, and we found that it would lose data, duplicate data, and corrupt data. After digging into it, it turns out that the product has a design that, among other issues, relies on the data source being able to seek backwards on its changelog. But Postgres throws changelogs away once they're consumed, so the Postgres data source can't support this operation (see the short sketch below). When their product attempts to do this and the operation fails, we end up with the sync getting &quot;stuck&quot;, which requires manual intervention from the vendor's operator and/or results in data loss. Since our data is still on Postgres, it's possible to recover from this by doing a full resync, but the data sync product tops out at 5MB/s for reasons that appear to be unknown to them, so a full resync can take days even on databases that aren't all that large. Resyncs will also silently drop and corrupt data, so multiple cycles of full resyncs followed by data integrity checks are sometimes necessary to recover from data corruption, which can take weeks. Despite being widely recommended and the leading product in the space, the product has a number of major design flaws that mean that it literally cannot work.</p> <p>This isn't so different from Mongo or other products that had fundamental design flaws that caused severe data loss, with <a href="why-benchmark/">the main difference being that, in most areas, there isn't a Kyle Kingsbury who spends years publishing tests on various products in the field, patiently responding to bogus claims about correctness until the PR backlash caused companies in the field to start taking correctness seriously</a>.
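<p>(To make the data sync example concrete: the design assumes it can re-read old changes from the source, but Postgres does not guarantee to keep them around. What follows is a hypothetical sketch, not the vendor's actual mechanism, using Postgres's logical decoding SQL functions; the slot name and connection string are made up, and it assumes a local Postgres with <code>wal_level=logical</code> and sufficient privileges.)</p> <pre><code># Hypothetical sketch (not the vendor's code): why "seek backwards on the
# changelog" fails against Postgres logical decoding. Once changes are read
# from a replication slot, Postgres is free to recycle the WAL behind it,
# so there is nothing to seek back to.
import psycopg2  # assumed available; connection details are made up

conn = psycopg2.connect("dbname=example")
conn.autocommit = True
cur = conn.cursor()

cur.execute(
    "SELECT pg_create_logical_replication_slot('sync_demo', 'test_decoding')"
)

# ... the application writes some rows here ...

# First read consumes the pending changes and advances the slot.
cur.execute("SELECT * FROM pg_logical_slot_get_changes('sync_demo', NULL, NULL)")
print(len(cur.fetchall()))  # the changes written above

# A second read from the "same position" returns nothing: the consumed
# changes cannot be re-read.
cur.execute("SELECT * FROM pg_logical_slot_get_changes('sync_demo', NULL, NULL)")
print(len(cur.fetchall()))  # 0

cur.execute("SELECT pg_drop_replication_slot('sync_demo')")
</code></pre> <p>A sync design that depends on re-reading old changes needs the source to retain them, which Postgres, by design, does not promise once a consumer has confirmed them; this is exactly the kind of flaw that Jepsen-style testing surfaces, and almost no one applies that kind of pressure outside of databases.</p>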
Without that pressure, most software products basically don't work, hence the Twitter threads from Ben, above, where he notes that the &quot;buy&quot; solutions you might want to choose mostly don't work<sup class="footnote-ref" id="fnref:V"><a rel="footnote" href="#fn:V">2</a></sup>. Of course, at our scale, there are many things we're not going to build any time soon, like CPUs, but, for many things where the received wisdom is to &quot;buy&quot;, &quot;build&quot; seems like a reasonable option. This is even true for larger companies and building CPUs. Fifteen years ago, high-performance (as in, non-embedded level of performance) CPUs were a canonical example of something it would be considered bonkers to build in-house, absurd for even the largest software companies, but Apple and Amazon have been able to produce best-in-class CPUs on the dimensions they're optimizing for, for predictable reasons<sup class="footnote-ref" id="fnref:Q"><a rel="footnote" href="#fn:Q">3</a></sup>.</p> <p>This isn't just an issue that impacts tech companies; we see this across many different industries. For example, any company that wants to mail items to customers has to either implement shipping themselves or deal with the fallout of having unreliable shipping. As a user, whether or not packages get shipped to you depends a lot on where you live and what kind of building you live in.</p> <p>When I've lived in a house, packages have usually arrived regardless of the shipper (although they've often arrived late). But, since moving into apartment buildings, I've found that some buildings just don't get deliveries from certain delivery services. Once, I lived in a building where the postal service didn't deliver mail properly and I didn't get a lot of mail (although I frequently got mail addressed to other people in the building as well as people elsewhere). More commonly, UPS and Fedex usually won't attempt to deliver and will just put a bunch of notices up on the building door for all the packages they didn't deliver, where the notice falsely indicates that the person wasn't home and correctly indicates that the person has to go to some pick-up location to get the package.</p> <p>For a while, I lived in a city where Amazon used 3rd-party commercial courier services to do last-mile shipping for same-day delivery. The services they used were famous for marking things as delivered without delivering the item for days, making &quot;same day&quot; shipping slower than next day or even two day shipping. Once, I naively contacted Amazon support because my package had been marked as delivered but wasn't delivered. Support, using a standard script supplied to them by Amazon, told me that I should contact them again three days after the package was marked as delivered because couriers often mark packages as delivered without delivering them, but they often deliver the package within a few days.
Amazon knew that the courier service they were using didn't really even try to deliver packages<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">4</a></sup> promptly and the only short-term mitigation available to them was to tell support to tell people that they shouldn't expect that packages have arrived when they've been marked as delivered.</p> <p>Amazon eventually solved this problem by having their own delivery people or using, by commercial shipping standards, an extremely expensive service (as Apple has done for same-day delivery)<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">5</a></sup>. At scale, there's no commercial service you can pay for that will reliably attempt to deliver packages. If you want a service that actually works, you're generally on the hook for building it yourself, just like in the software world. My local grocery store tried to outsource this to DoorDash. I've tried delivery 3 times from my grocery store and my groceries have shown up 2 out of 3 times, which is well below what most people would consider an acceptable hit rate for grocery delivery. Having to build instead of buy to get reliability is a huge drag on productivity, especially for smaller companies (e.g., it's not possible for small shops that want to compete with Amazon and mail products to customers to have reliable delivery since they can't build out their own delivery service).</p> <p>The amount of waste generated by the inability to farm out services is staggering and I've seen it everywhere I've worked. An example from another industry: when I worked at a small chip startup, we had in-house capability to do end-to-end chip processing (with the exception of having our own fabs), which is unusual for a small chip startup. When the first wafer of a new design came off of a fab, we'd have the wafer flown to us on a flight, at which point someone would use a wafer saw to cut the wafer into individual chips so we could start testing ASAP. This was often considered absurd in the same way that it would be considered absurd for a small software startup to manage its own on-prem hardware. After all, the wafer saw and the expertise necessary to go from a wafer to a working chip will be idle over 99% of the time. Having full-time equipment and expertise that you use less than 1% of the time is a classic example of the kind of thing you should outsource, but if you price out having people competent to do this plus having the equipment available to do it, even at fairly low volumes, it's cheaper to do it in-house even if the equipment and expertise for it are idle 99% of the time. More importantly, you'll get much better service (faster turnaround) in house, letting you ship at a higher cadence.
I've both worked at companies that have tried to contract this kind of thing out and talked with many people who've done that, and you get slower, less reliable, service at a higher cost.</p> <p>Likewise with chip software tooling; despite it being standard to outsource tooling to <a href="https://en.wikipedia.org/wiki/Electronic_design_automation">large EDA vendors</a>, we got a lot of mileage out of using our own custom tools, generally created or maintained by one person, e.g., while I was there, most simulator cycles were run on a custom simulator that was maintained by one person, which saved millions a year in simulator costs (standard pricing for a simulator at the time was a few thousand dollars per license per year and we had a farm of about a thousand simulation machines). You might think that, if a single person can create or maintain a tool that's worth millions of dollars a year to the company, our competitors would do the same thing, just like you might think that if you can ship faster and at a lower cost by hiring a person who knows how to crack a wafer open, our competitors would do that, but they mostly didn't.</p> <p><a href="https://twitter.com/danluu/status/823296808919019520">Joel Spolsky has an old post where he says</a>:</p> <blockquote> <p>“Find the dependencies — and eliminate them.” When you're working on a really, really good team with great programmers, everybody else's code, frankly, is bug-infested garbage, and nobody else knows how to ship on time.</p> </blockquote> <p>We had a similar attitude, although I'd say that we were a bit more humble. We didn't think that everyone else was producing garbage, but we also didn't assume that we couldn't produce something comparable to what we could buy for a tenth of the cost. From talking to folks at some competitors, there was a <a href="culture/">pretty big cultural difference</a> between how we operated and how they operated. It simply didn't occur to them that they didn't have to buy into the standard American business logic that you should focus on your core competencies, that you can think through whether or not it makes sense to do something in-house on the merits of the particular thing instead of outsourcing your thinking to a pithy saying.</p> <p>I once watched, from the inside, a company undergo this cultural shift. A few people in leadership decided that the company should focus on its core competencies, which meant abandoning custom software for infrastructure. This resulted in quite a few large migrations from custom internal software to SaaS solutions and open source software. If you watched the discussions on &quot;why&quot; various projects should or shouldn't migrate, there were a few unusually unreasonable people who tried to reason through particular cases on the merits of each case (in a post on pushing back against orders from the top, <a href="https://yosefk.com/blog/people-can-read-their-managers-mind.html">Yossi Kreinin calls these people insane employees</a>; I'm going to refer to the same concept in this post, but instead call people who do this unusually unreasonable). But, for the most part, people bought the party line and pushed for a migration regardless of the specifics.</p> <p>The thing that I thought was interesting was that leadership didn't tell particular teams they had to migrate and there weren't really negative consequences for teams where an &quot;unusually unreasonable person&quot; pushed back in order to keep running an existing system for reasonable reasons.
Instead, people mostly bought into the idea and tried to justify migrations for vaguely plausible-sounding reasons that weren't connected to reality, resulting in funny outcomes like moving to an open source system &quot;to save money&quot; when the new system was quite obviously less efficient<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">6</a></sup> and, predictably, required much higher capex and opex. The cost savings were supposed to come from shrinking the team, but the increase in operational cost dominated the change in the cost of the team, and the complexity of operating the system meant that the team size increased instead of decreasing. There were a number of cases where it really did make sense to migrate, but the stated reasons for migration tended to be unrelated or weakly related to the reasons it actually made sense to migrate. Once people absorbed the idea that the company should focus on core competencies, the migrations were driven by the cultural idea and not any technical reasons.</p> <p>The pervasiveness of <a href="bad-decisions/">decisions like the above, technical decisions made without serious technical consideration, is a major reason that the selection pressure on companies to make good products is so weak</a>. There is some pressure, but it's noisy enough that successful companies often route around making a product that works, like in the Mongo example from above, where Mongo's decision to loudly repeat <a href="https://aphyr.com/tags/mongodb">demonstrably bogus performance claims and make demonstrably false correctness claims</a> was, from a business standpoint, superior to focusing on actual correctness and performance; by focusing their resources where it mattered for the business, they managed to outcompete companies that made the mistake of devoting serious resources to performance and correctness.</p> <p>Yossi's post about how an unusually unreasonable person can have outsized impact in a dimension they value at their firm also applies to impact outside of a firm. Kyle Kingsbury, mentioned above, is an example of this. At the rates that I've heard <a href="https://jepsen.io/">Jepsen</a> is charging now, Kyle can bring in what a senior developer at BigCo does (actually senior, not someone with the title &quot;senior&quot;), but that was after years of working long hours at below market rates on an uncertain endeavour, refuting <a href="https://en.wikipedia.org/wiki/Fear,_uncertainty,_and_doubt">FUD</a> from his critics (if you read the replies to the linked posts or, worse yet, the actual tickets where he's involved in discussions with developers, the replies to Kyle were a constant stream of nonsense for many years, including people working for vendors <a href="https://lobste.rs/s/jgfyep/addio_redis_i_m_leaving_redis_labs#c_i6kcci">feeling like he has it out for them in particular, casting aspersions on his character</a><sup class="footnote-ref" id="fnref:J"><a rel="footnote" href="#fn:J">7</a></sup>, and <a href="https://lobste.rs/s/jgfyep/addio_redis_i_m_leaving_redis_labs#c_sae3wo">generally trashing him</a>). I have a deep respect for people who are willing to push on issues like this despite the system being aligned against them, but, my respect notwithstanding, basically no one is going to do that.
A system that requires someone like Kyle to take a stand before successful firms will put effort into correctness instead of correctness marketing is going to produce a lot of products that are good at marketing correctness without really having decent correctness properties (such as the data sync product mentioned in this post, whose website repeatedly mentions how reliable and safe the syncing product is despite having a design that is fundamentally broken).</p> <p>It's also true at the firm level that it often takes an unusually unreasonable firm to produce a really great product instead of just one that's marketed as great, e.g., <a href="car-safety/">Volvo, the one car manufacturer that seemed to try to produce a level of structural safety beyond what could be demonstrated by IIHS tests</a> fared so poorly as a business that it's been forced to move upmarket and became a niche, luxury, automaker since safety isn't something consumers are really interested in despite car accidents being a leading cause of death and a significant source of life expectancy loss. And it's not clear that Volvo will be able to persist in being an unreasonable firm since they weren't able to survive as an independent automaker. When Ford acquired Volvo, <a href="car-safety/#quality-may-change-over-time">Ford started moving Volvos to the shared Ford C1 platform, which didn't fare particularly well in crash tests</a>. Since Geely has acquired Volvo, <a href="https://twitter.com/danluu/status/1458814317948661768">it's too early to tell for sure if they'll maintain Volvo's commitment to designing for real-world crash data and not just crash data that gets reported in benchmarks</a>. If Geely declines to continue Volvo's commitment to structural safety, it may not be possible to buy a modern car that's designed to be safe.</p> <p>Most markets are like this, except that there was never an unreasonable firm like Volvo in the first place. On unreasonable employees, Yossi says</p> <blockquote> <p>Who can, and sometimes does, un-rot the fish from the bottom? An insane employee. Someone who finds the forks, crashes, etc. a personal offence, and will repeatedly risk annoying management by fighting to stop these things. Especially someone who spends their own political capital, hard earned doing things management truly values, on doing work they don't truly value – such a person can keep fighting for a long time. Some people manage to make a career out of it by persisting until management truly changes their mind and rewards them. Whatever the odds of that, the average person cannot comprehend the motivation of someone attempting such a feat.</p> </blockquote> <p>It's rare that people are willing to expend a significant amount of personal capital to do the right thing, whatever that means to someone, but it's even rarer that the leadership of a firm will make that choice and spend down the firm's capital to do the right thing.</p> <p>Economists have a term for cases where information asymmetry means that buyers can't tell the difference between good products and &quot;lemons&quot;, &quot;a market for lemons&quot;, like the car market (where the term lemons comes from), or <a href="hiring-lemons/">both sides of the hiring market</a>. 
In economic discourse, there's a debate over whether cars are a market for lemons at all for a variety of reasons (lemon laws, which allow people to return bad cars, don't appear to have changed how the market operates; very few modern cars are lemons when that's defined as a vehicle with serious reliability problems; etc.). But looking at whether or not people occasionally buy a defective car is missing the forest for the trees. There's maybe one car manufacturer that really seriously tries to make a structurally safe car beyond what standards bodies test (and word on the street is that they skimp on the increasingly important software testing side of things) because consumers can't tell the difference between a more or less safe car beyond the level a few standards bodies test to. That's a market for lemons, as is nearly every other consumer and B2B market.</p> <h3 id="appendix-culture">Appendix: culture</h3> <p>Something I find interesting about American society is how many people think that someone who gets the raw end of a deal because they failed to protect themselves against every contingency &quot;deserves&quot; what happened (orgs that want to be highly effective often avoid this by having a &quot;blameless&quot; culture, but very few people have exposure to such a culture).</p> <p>Some places I've seen this recently:</p> <ul> <li>Person had a laptop stolen in a cafe; blamed for not keeping their eye on the laptop the entire time since no reasonable person would ever take their eyes off of any belongings for 10 seconds as they turned their head to briefly chat with someone</li> <li>Person posted a PSA that they were caught out by a change in the terms of service of a company and other people should be aware of the same thing, and people said that the person caught out was dumb for not reading every word of every terms of service update they're sent</li> <li>(many times, on r/idiotsincars): person gets in an accident that would've been difficult or impossible to reasonably avoid and people tell the person they're a terrible driver for not having avoided the accident <ul> <li>At least once, the person did a frame-by-frame analysis that showed that they reacted, to within one frame of latency, as fast as humanly possible, and was still told they should've avoided the accident</li> <li>Often, people will say things like &quot;I would never get into that situation in the first place&quot;, which, in the circumstance where someone is driving past a parked car, results in absurd statements like &quot;I would never pass a vehicle at more than 10mph&quot;, as if the person making the comment slows down to 10mph on every street that has parked or stopped cars on it.</li> </ul></li> <li>Person griped on flyertalk forum that Google maps instructions are unclear if you're not a robot (e.g., &quot;turn right in 500 meters&quot;, which could be one of multiple intersections) and people responded with things like &quot;I never go anywhere without being completely familiar with the route&quot; and that you should map out all of your driving beforehand, just like you would for a road trip with a paper map in 1992 (this was used as a justification for the reasonableness of mapping out all travel beforehand – I did it back then and anyone who isn't dumb would do it now) <ul> <li>People with those kinds of negative responses were highly upvoted; <a href="https://twitter.com/danluu/status/1520012665954865152">no one suggested switching to Apple Maps, which gives clear, landmark-based directions like
&quot;go through the light and then take the next right&quot;</a></li> </ul></li> </ul> <p>If you read these kinds of discussions, you'll often see people claiming &quot;that's just how the world is&quot; and going further and saying that there is no other way the world could be, so anyone who isn't prepared for that is an idiot.</p> <p>Going back to the laptop theft example, anyone who's traveled, or even read about other cultures, can observe that the things that North Americans think are basically immutable consequences of a large-scale society are arbitrary. For example, if you leave your bag and laptop on a table at a cafe in Korea and come back hours later, the bag and laptop are overwhelmingly likely to be there (<a href="https://news.ycombinator.com/item?id=30625890">I've heard this is true in Japan as well</a>). While it's rude to take up a table like that, you're not likely to have your bag and laptop stolen.</p> <p>And, in fact, if you tweak the context slightly, this is basically true in America. It's not much harder to walk into an empty house and steal things out of the house (it's fairly <a href="https://www.youtube.com/c/lockpickinglawyer">easy to learn how to pick locks</a> and even easier to just break a window) than it is to steal things out of a cafe. And yet, in most neighbourhoods in America, people are rarely burglarized, and when someone posts about being burglarized, they're not excoriated for being a moron for not having kept an eye on their house. Instead, people are mostly sympathetic. It's considered normal to have unattended property stolen in public spaces and not in private spaces, but that's more of a cultural distinction than a technical distinction.</p> <p>There's a related set of stories Avery Pennarun tells about the culture shock of being an American in Korea. One of them is about some online ordering service you can use that's sort of like Amazon. With Amazon, when you order something, you get a box with multiple bar/QR/other codes on it and, when you open it up, there's another box inside that has at least one other code on it. Of course the outer box needs the barcode because it's being shipped through some facility at-scale where no one knows what the box is or where it needs to go, and the inner box also had to go through some other kind of process and it also needs to be able to be scanned by a checkout machine if the item is sold at a retailer. Inside the inner box is the item. If you need to return the item, you put the item back into its barcoded box and then put that box into the shipping box and then slap another barcode onto the shipping box and then mail it out.</p> <p>So, in Korea, there's some service like Amazon where you can order an item and, an hour or two later, you'll hear a knock at your door. When you get to the door, you'll see an unlabeled box or bag and the item is in the unlabeled container. If you want to return the item, you &quot;tell&quot; the app that you want to return the item, put it back into its container, put it in front of your door, and they'll take it back. After seeing this shipping setup, which is wildly different from what you see in the U.S., he asked someone &quot;how is it possible that they don't lose track of which box is which?&quot;. The answer he got was, &quot;why would they lose track of which box is which?&quot;.
His other stories have a similar feel, where he describes something quite alien and asks a local how things can work in this alien way, and the local, who can't imagine things working any other way, responds with &quot;why would X not work?&quot;</p> <p>As with the laptop in cafe example, a lot of Avery's stories come down to how there are completely different shared cultural expectations around how people and organizations can work.</p> <p>Another example of this is with covid. Many of my friends have spent most of the last couple of years in Asian countries like Vietnam or Taiwan, which have had much lower covid rates, so much so that they were barely locked down at all. My friends in those countries were basically able to live normal lives, as if covid didn't exist at all (at least until the latest variants, at which point they were vaccinated and at relatively low risk for the most serious outcomes), while taking basically zero risk of getting covid.</p> <p>In most western countries, initial public opinion among many people was that locking down was pointless and there was nothing we could do to prevent an explosion of covid. Multiple engineers I know, who understand exponential growth and knew what the implications were, continued normal activities before lockdown and got and (probably) spread covid. When lockdowns were implemented, there was tremendous pressure to lift them as early as possible, resulting <a href="http://vihart.com/how-we-reopen/">in something resembling the &quot;adaptive response&quot; diagram from this post</a>. Since then, many people (I have a project tallying up public opinion on this that I'm not sure I'll ever prioritize enough to complete) have changed their opinion to &quot;having ever locked down was stupid, we were always going to end up with endemic covid, all of this economic damage was pointless&quot;. If we look at in-person retail sales data or restaurant data, we can easily see that many people were voluntarily limiting their activities before and after lockdowns in the first year or so of the pandemic when the virus was in broad circulation.</p> <p>Meanwhile, in some Asian countries, like Taiwan and Vietnam, people mostly complied with lockdowns when they were instituted, which means that they were able to squash covid in the country when outbreaks happened until relatively recently, when covid mutated into forms that spread much more easily and people's tolerance for covid risk went way up due to vaccinations. Of course, covid kept getting reintroduced into countries that were able to squash it because other countries were not, in large part due to the self-fulfilling belief that it would be impossible to squash covid.</p> <p>Coming back to when it makes sense to bring something in-house, even in cases where it superficially sounds like it shouldn't, because the expertise is 99% idle or a single person would have to be able to build software that a single firm would pay millions of dollars a year for, much of this comes down to whether or not you're in a culture where you can trust another firm's promise. If you operate in a society where it's expected that other firms will push you to the letter of the law with respect to whatever contract you've negotiated, it's frequently not worth the effort to negotiate a contract that would give you service even one half as good as you'd get from someone in house.
If you look at how these contracts end up being worded, companies <a href="https://www.patreon.com/posts/21164582">often try to sneak in terms that make the contract meaningless</a>, and even when you manage to stamp out all of that, legally enforcing the contract is expensive and, in the cases I know of where companies regularly violated their agreement for their support SLA (just for example), the resolution was to terminate the contract rather than pursue legal action because the cost of legal action wouldn't be worth anything that could be gained.</p> <p>If you can't trust other firms, you frequently don't have a choice with respect to bringing things in house if you want them to work.</p> <p>Although this is really a topic for another post, I'll note that lack of trust that exists across companies can also hamstring companies when it exists internally. <a href="culture/">As we discussed previously, a lot of larger scale brokenness also comes out of the cultural expectations within organizations</a>. A specific example of this that leads to pervasive organizational problems is lack of trust within the organization. For example, a while back, I was griping to a director that a VP broke a promise and that we were losing a lot of people for similar reasons. The director's response was &quot;there's no way the VP made a promise&quot;. When I asked for clarification, the clarification was &quot;unless you get it in a contract, it wasn't a promise&quot;, i.e., the rate at which VPs at the company lie is high enough that a verbal commitment from a VP is worthless; only a legally binding commitment that allows you to take them to court has any meaning.</p> <p>Of course, that's absurd, in that no one could operate at a BigCo while going around and asking for contracts for all their promises since they'd immediately be considered some kind of hyperbureaucratic weirdo. But, let's take the spirit of the comment seriously, which is that you should only trust people close to you. That's good advice in the company I worked for, but, unfortunately for the company, the implications are similar to the inter-firm example, where we noted that a norm where you need to litigate the letter of the law is expensive enough that firms often bring expertise in house to avoid having to deal with the details. In the intra-firm case, you'll often see teams and orgs &quot;empire build&quot; because they know that, <a href="https://amzn.to/3pDnYUr">at least at the management level, they can't trust anyone outside their fiefdom</a>.</p> <p>While this intra-firm lack of trust tends to be less costly than the inter-firm lack of trust since there are better levers to get action on an organization that's the cause of a major blocker, <a href="https://amzn.to/3pDnYUr">it's still fairly costly</a>. Virtually all of the VPs and BigCo tech execs I've talked to are so steeped in the culture they're embedded in that they can't conceive of an alternative, but there isn't an inherent reason that organizations have to work like that. I've worked at two companies where people actually trust leadership and leadership does generally follow through on commitments even when you can't take them to court, including my current employer, Wave.
But, at the other companies, the shared expectation that leadership cannot and should not be trusted &quot;causes&quot; the people who end up in leadership roles to be untrustworthy, which results in the inefficiencies we've just discussed.</p> <p>People often think that having a high degree of internal distrust is inevitable as a company scales, but people I've talked to who were in upper management or fairly close to the top of Intel and Google said that the companies had an extended time period where leadership enforced trustworthiness and that stamping out dishonesty and &quot;bad politics&quot; was a major reason the company was so successful, under Andy Grove and Eric Schmidt, respectively. When the person at the top changed and a new person who didn't enforce honesty came in, the standard cultural norms that you see at the upper levels of most big companies seeped in, but that wasn't inevitable.</p> <p>When I talk to people who haven't been exposed to BigCo leadership culture and haven't seen how decisions are actually made, they often find the decision-making processes to be unbelievable in much the same way that people who are steeped in BigCo leadership culture find the idea that a large company could operate any other way to be unbelievable.</p> <p>It's often difficult to see how absurd a system is from the inside. Another perspective on this is that Americans often find Japanese universities and <a href="https://twitter.com/danluu/status/1428848459071819779">the work practices of Japanese engineering firms</a> absurd, though often not as absurd as the <a href="https://english.hani.co.kr/arti/english_edition/e_business/823025.html">promotion policies in Korean chaebols, which are famously nepotistic</a>, e.g., Chung Mong-yong is the CEO of Hyundai Sungwoo because he's the son of Chung Soon-yung, who was the head of Hyundai Sungwoo because he was the younger brother of Chung Ju-yung, the founder of Hyundai Group (essentially the top-level Hyundai corporation), etc. But Japanese and Korean engineering firms are not, in general, less efficient than American engineering firms outside of the software industry despite practices that seem absurdly inefficient to American eyes. <a href="https://twitter.com/altluu/status/1439511588529065987">American firms didn't lose their dominance in multiple industries</a> while being more efficient; if anything, market inefficiencies allowed them to hang on to marketshare much longer than you would naively expect if you just looked at the technical merit of their products.</p> <p>There are offsetting inefficiencies in American firms that are just as absurd as effectively having familial succession of company leadership in Korean chaebols. It's just that the inefficiencies that come out of American cultural practices seem to be immutable facts about the world to people inside the system. But when you look at firms that have completely different cultures, it becomes clear that cultural norms aren't a law of nature.</p> <h3 id="appendix-downsides-of-build">Appendix: downsides of build</h3> <p>Of course, building instead of buying isn't a panacea. I've frequently seen internal designs that are just as broken as the data sync product described in this post. In general, when you see a design like that, <a href="https://twitter.com/danluu/status/1470890510998851588">a decent number of people explained why the design can never work during the design phase</a> and were ignored.
Although &quot;build&quot; gives you a lot more control than &quot;buy&quot; and gives you better odds of a product that works because you can influence the design, a dysfunctional team in a dysfunctional org can quite easily make products that don't work.</p> <p>There's a Steve Jobs quote about companies that also applies to teams:</p> <blockquote> <p>It turns out the same thing can happen in technology companies that get monopolies, like IBM or Xerox. If you were a product person at IBM or Xerox, so you make a better copier or computer. So what? When you have monopoly market share, the company's not any more successful.</p> <p>So the people that can make the company more successful are sales and marketing people, and they end up running the companies. And the product people get driven out of the decision making forums, and the companies forget what it means to make great products. The product sensibility and the product genius that brought them to that monopolistic position gets rotted out by people running these companies that have no conception of a good product versus a bad product.</p> <p>They have no conception of the craftsmanship that's required to take a good idea and turn it into a good product. And they really have no feeling in their hearts, usually, about wanting to really help the customers.</p> </blockquote> <p>For &quot;efficiency&quot; reasons, some large companies try to avoid duplicate effort and kill projects if they seem too similar to another project, giving the team that owns the canonical version of a product a monopoly. If the company doesn't have a culture of trying to do the right thing, this has the same problems that Steve Jobs discusses, but at the team and org level instead of the company level.</p> <p>The workaround a team I was on used was to basically re-implement a parallel stack of things we relied on that didn't work. But this was only possible because leadership didn't enforce basically anything. Ironically, this was despite their best efforts — <a href="https://twitter.com/altluu/status/1494696655928508418">leadership made a number of major attempts to impose top-down control</a>, but they didn't understand how to influence an organization, so the attempts failed. Had leadership been successful, the company would've been significantly worse off. 
<a href="https://twitter.com/danluu/status/1487239027002527744">There are upsides to effective top-down direction when leadership has good plans</a>, but that wasn't really on the table, so it's actually better that leadership didn't know how to execute.</p> <p><i>Thanks to Fabian Giesen, Yossi Kreinen, Peter Bhat Harkins, Ben Kuhn, Laurie Tratt, John Hergenroeder, Tao L., @softminus, Justin Blank, @deadalnix, Dan Lew, @ollyrobot, Sophia Wisdom, Elizabeth Van Nostrand, Kevin Downey, and @PapuaHardyNet for comments/corrections/discussion.</i></p> <p><link rel="prefetch" href="tech-discrimination/"> <link rel="prefetch" href="talent/"> <link rel="prefetch" href="cocktail-ideas/"> <link rel="prefetch" href="why-benchmark/"> <link rel="prefetch" href="android-updates/"> <link rel="prefetch" href="web-bloat/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">To some, that position is so absurd that it's not believable that anyone would hold that position (in response to my first post that featured the Andreessen quote, above, a number of people told me that it was an exaggerated straw man, which is impossible for a quote, let alone one that sums up a position I've heard quite a few times), but to others, it's an immutable fact about the world. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:V"><p>On the flip side, if we think about things from the vendor side of things, there's little incentive to produce working products since the combination of the fog of war plus making false claims about a product working seems to be roughly as good as making a working product (at least until someone like Kyle Kingsbury comes along, which never happens in most industries), and it's much cheaper.</p> <p>And, as Fabian Giesen points out, when vendors actually want to produce good or working products, the fog of war also makes that difficult:</p> <blockquote> <p>But producers have a dual problem, which is that all the signal you get from consumers is sporadic, infrequent and highly selected direct communication, as well as a continuous signal of how sales look over time, which is in general very hard to map back to <em>why</em> sales went up or down.</p> <p>You hear directly from people who are either very unhappy or very happy, and you might hear second-hand info from your salespeople, but often that's pure noise. E.g. with RAD products over the years a few times we had a prospective customer say, &quot;well we would license it but we really need X&quot; and we didn't have X. And if we heard that 2 or 3 times from different customers, we'd implement X and get back to them a few months later. More often than not, they'd then ask for Y next, and it would become clear over time that they just didn't want to license for some other reason and saying &quot;we need X, it's a deal-breaker for us&quot; for a couple choices of X was just how they chose to get out of the eval without sounding rude or whatever.</p> <p>In my experience that's a pretty thorny problem in general, once you spin something out or buy something you're crossing org boundaries and lose most of the ways you otherwise have to cut through the BS and figure out what's actually going on. 
And whatever communication does happen is often forced to go through a very noisy, low-bandwidth, low-fidelity, high-latency channel.</p> </blockquote> <a class="footnote-return" href="#fnref:V"><sup>[return]</sup></a></li> <li id="fn:Q"><p>Note that even though it was somewhat predictable that a CPU design team at Apple or Amazon that was well funded had a good chance of being able to produce a best-in-class CPU (e.g., see <a href="hardware-unforgiving/">this 2013 comment about the effectiveness of Apple's team</a> and <a href="intel-cat/#spec">this 2015 comment about other mobile vendors</a>) that would be a major advantage for their firm, this doesn't mean that the same team should've been expected to succeed if they tried to make a standalone business. In fact, Apple was able to buy their core team cheaply because the team, <a href="hardware-unforgiving/">after many years at DEC and then successfully founding SiByte, founded PA Semi</a>, which basically failed as a business. Similarly, Amazon's big silicon initial hires were from Annapurna (also a failed business that was up for sale because it couldn't survive independently) and Smooth Stone (a startup that failed so badly that it didn't even need to be acquired and people could be picked up individually). Even when there's an obvious market opportunity, factors like network effects, high fixed costs, up front capital expenditures, the ability of incumbent players to use market power to suppress new competitors, etc., can and often do prevent anyone from taking the opportunity. Even though we can now clearly see that there were large opportunities available for the taking, there's every reason to believe that, based on the fates of many other CPU startups to date, an independent startup that attempted to implement the same ideas wouldn't have been nearly as successful and would most likely have gone bankrupt or taken a low offer relative to the company's value due to the company's poor business prospects.</p> <p>Also, before Amazon started shipping ARM server chips, <a href="https://www.patreon.com/posts/20571244">the most promising ARM server chip, which had pre-orders from at least one major tech company, was killed because it was on the wrong side of an internal political battle</a>.</p> <p><a href="talent/">The chip situation isn't so different from the motivating example we looked at in our last post, baseball scouting</a>, where many people observed that baseball teams were ignoring simple statistics they could use to their advantage. But none of the people observing that were in a position to run a baseball team, so the market opportunity persisted for decades.</p> <a class="footnote-return" href="#fnref:Q"><sup>[return]</sup></a></li> <li id="fn:P"><p>Something that amuses me is how some package delivery services appear to apply relatively little effort to making sure that someone even made an attempt to deliver the package. When packages are marked delivered, there's generally a note about how it was delivered, which is frequently quite obviously wrong for the building, e.g., &quot;left with receptionist&quot; for a building with no receptionist or &quot;left on porch&quot; for an office building with no porch and a receptionist who was there during the alleged delivery time. 
You could imagine services would, like Amazon, request a photo along with &quot;proof of delivery&quot; or perhaps use GPS to check that the driver was plausibly at least in the same neighborhood as the building at the time of delivery, but they generally don't seem to do that?</p> <p>I'd guess that a lot of the fake deliveries come from having some kind of quota, one that's difficult or impossible to achieve, combined with weak attempts at verifying that a delivery was done or even attempted.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:A">When I say they solved it, I mean that Amazon delivery drivers actually try to deliver the package maybe 95% of the time to the apartment buildings I've lived in, vs. about 25% for UPS and Fedex and much lower for USPS and Canada Post, if we're talking about big packages and not letters. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:M"><p>Very fittingly for this post, I saw an external discussion on this exact thing where someone commented that it must've been quite expensive for the company to switch to the new system due to its known inefficiencies.</p> <p>In true cocktail party efficient markets hypothesis form, an internet commenter replied that the company wouldn't have done it if it was inefficient and therefore it must not have been as inefficient as the first commenter thought.</p> <p>I suspect I spent more time looking at software <a href="https://en.wikipedia.org/wiki/Total_cost_of_ownership">TCO</a> than anyone else at the company and the system under discussion was notable for having one of the largest increases in cost of any system at the company without a concomitant increase in load. Unfortunately, the assumption that competition results in good <a href="https://twitter.com/danluu/status/1122617702080581632">internal decisions</a> is just as false as the assumption that competition results in good external decisions.</p> <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:J">Note that if you click the link but don't click through to the main article, the person defending Kyle made the original quote seem more benign than it really is out of politeness because he elided the bit where the former Redis developer advocate (now <a href="https://twitter.com/sunshowers6/status/1568660532999372801">&quot;VP of community&quot;</a> <a href="https://news.ycombinator.com/item?id=31479020">for Zig</a>) said that Jepsen is &quot;ultimately not that different from other tech companies, and thus well deserving of boogers and cum&quot;. <a class="footnote-return" href="#fnref:J"><sup>[return]</sup></a></li> </ol> </div> Misidentifying talent talent/ Mon, 21 Feb 2022 00:00:00 +0000 talent/ <p><details open> <summary>[Click to collapse / expand section on sports]</summary> Here are some notes from talent scouts:</p> <ul> <li>Recruit A: <ul> <li>... will be a real specimen with chance to have a Dave Parker body. Facially looks like Leon Wagner. Good body flexibility. Very large hands.</li> </ul></li> <li>Recruit B: <ul> <li>Outstanding physical specimen – big athletic frame with broad shoulders and long, solid arms and leg. Good bounce in his step and above avg body control. 
Good strong face.</li> </ul></li> <li>Recruit C: <ul> <li>Hi butt, longish arms &amp; legs, leanish torso, young colt</li> <li>[different scout]: Wiry loose good agility with good face</li> <li>[another scout]: Athletic looking body, loose, rangy, slightly bow legged.</li> </ul></li> </ul> <p>Out of context, you might think they were scouting actors or models, but these are baseball players (&quot;A&quot; is Lloyd Moseby, &quot;B&quot; is Jim Abbott, and &quot;C&quot; is Derek Jeter), ones that were quite good (Lloyd Moseby was arguably only a very good player for perhaps four years, but that makes him extraordinary compared to most players who are scouted). If you read other baseball scouting reports, you'll see a lot of comments about how someone has a &quot;good face&quot;, who they look like, what their butt looks like, etc.</p> <p>Basically everyone wants to hire talented folks. But even in baseball, where returns to hiring talent are obvious and high and which is the most easily quantified major U.S. sport, people made fairly obvious blunders for a century due to relying on incorrectly honed gut feelings that relied heavily on unconscious as well as conscious biases. Later, we'll look at what baseball hiring means for other fields, but first, let's look at how players who didn't really pan out ended up with similar scouting reports (programmers who don't care about sports can think of this as equivalent to interview feedback) as future superstars, such as the following comments on <a href="https://www.baseball-reference.com/players/e/eatonad01.shtml">Adam Eaton</a>, who was a poor player by pro standards despite being considered one of the hottest prospects (potential hires) of his generation:</p> <ul> <li>Scout 1: Medium frame/compact/firm. A very good athlete / shows quick &quot;cat-like&quot; reactions. Excellent overall body strength. Medium hands / medium length arms / w strong forearms ... Player is a tough competitor. This guy has some old fashioned bull-dog in his make-up.</li> <li>Scout 2: Good body with frame to develop. Long arms and big hands. Narrow face. Has sideburns and wears hat military style. Slope shoulders. Strong inlegs ... Also played basketball. Good athlete .... Attitude is excellent. Can't see him breaking down. One of the top HS pitchers in the country</li> <li>Scout 3: 6'1&quot;-6'2&quot; 180 solid upper and lower half. Room to pack another 15 without hurting</li> </ul> <p>On the flip side, scouts would also pan players who would later turn out to be great based on their physical appearance, such as these scouts who were concerned about Albert Pujols's weight:</p> <ul> <li>Scout 1: Heavy, bulky body. Extra (weight) on lower half. Future (weight) problem. Aggressive hitter with mistake HR power. Tends to be a hacker.</li> <li>Scout 2: Good bat (speed) with very strong hands. Competes well and battles at the plate. Contact seems fair. Swing gets a little long at times. Will over pull. He did not hit the ball hard to CF or RF. Weight will become an issue in time.</li> </ul> <p>Pujols ended up becoming one of the best baseball players of all time (<a href="https://web.archive.org/web/20220209005630/https://www.baseball-reference.com/leaders/WAR_career.shtml">currently ranked 32nd by WAR</a>). His weight wasn't a problem, but if you read scouting reports on other great players who were heavy or short, they were frequently underrated. 
Of course, baseball scouting reports didn't only look at people's appearances, but scouts were generally highly biased by what they thought an athlete should look like.</p> <p>Because using stats in baseball has &quot;won&quot; (top teams all employ stables of statisticians nowadays) and &quot;old school&quot; folks don't want to admit this, we often see people saying that using stats doesn't really result in different outcomes than we used to get. But this is so untrue that the examples people give are generally <a href="https://www.patreon.com/posts/62223485">self-refuting</a>. For example, here's what Sports Illustrated had to say on the matter:</p> <blockquote> <p>Media and Internet draft prognosticators love to play up the “scrappy little battler” aspect with Madrigal, claiming that modern sabermetrics helps scouts include smaller players that were earlier overlooked. Of course, that is hogwash. A players [sic] abilities dictate his appeal to scouts—not height or bulk—and smaller, shorter players have always been a staple of baseball-from Mel Ott to Joe Morgan to Kirby Puckett to Jose Altuve.</p> </blockquote> <p>These are curious examples to use in support of scouting since Kirby Puckett was famously overlooked by scouts despite putting up statistically dominant performances and was only able to become a baseball player through random happenstance, when the assistant director of the Twins <a href="https://en.wikipedia.org/wiki/Farm_team">farm system</a> went to watch his own son play in a baseball game and saw Kirby Puckett in the same game, which led to the Twins drafting Kirby Puckett, who carried the franchise for a decade.</p> <p>Joe Morgan was also famously overlooked and only managed to become a professional baseball player through random happenstance. Morgan put up statistically dominant numbers in high school, but was ignored due to his height. Because he wasn't drafted by a pro team, he went to Oakland City College, where he once again put up great numbers that were ignored. The reason a team noticed him was a combination of two coincidences. First, a new baseball team was created and that new team needed to fill a roster and the associated farm system, which meant that they needed a lot of players. Second, that new baseball team needed to hire scouts and hired Bill Wight (who wasn't previously working as a scout) as a scout. Wight became known for not having the same appearance bias as nearly every other scout and was made fun of for signing &quot;funny looking&quot; baseball players. Bill convinced the new baseball team to &quot;hire&quot; quite a few overlooked players, including Joe Morgan.</p> <p>Mel Ott was also famously overlooked and only managed to become a professional baseball player through happenstance. He was so dominant in high school that he played for adult semi-pro teams in his spare time. However, when he graduated, pro baseball teams didn't want him because he was too small, so he took a job at a lumber company and played for the company team. The owner of the lumber company was impressed by his baseball skills and, luckily for Ott, was business partners and friends with the owner of a baseball team and effectively got Ott a position on a pro baseball team, resulting in the <a href="https://web.archive.org/web/20220208221047/https://www.baseball-reference.com/players/o/ottme01.shtml">20th best baseball career of all time as ranked by WAR</a><sup class="footnote-ref" id="fnref:J"><a rel="footnote" href="#fn:J">1</a></sup>. 
Most short baseball players probably didn't get a random lucky break; for every one who did, there are likely many who didn't. If we look at how many nearly-ignored-but-lucky players put up numbers that made them all-time greats, it seems likely that the vast majority of the potentially greatest players of all time who played amateur or semi-pro baseball were ignored and did not play professional baseball (if this seems implausible, when reading the upcoming sections on chess, go, and shogi, consider what would happen if you removed all of the players who don't look like they should be great based on what people think makes someone cognitively skilled at major tech companies, and then look at what fraction of all-time-greats remain).</details></p> <p>Deciding who to &quot;hire&quot; for a baseball team was a high stakes decision with many millions of dollars (in 2022 dollars) on the line, but rather than attempt to seriously quantify productivity, teams decided who to draft (hire) based on all sorts of irrelevant factors. As in any major sport, baseball productivity is much easier to quantify than productivity in most real-world endeavors since the game is much simpler than &quot;real&quot; problems are. And, among major U.S. sports, baseball is the easiest sport to quantify, but <a href="tech-discrimination/">this didn't stop baseball teams from spending a century overindexing on visually obvious criteria such as height and race</a>.</p> <p>I was reminded of this the other day when I saw a thread on Twitter where a very successful person talks about how they got started, saying that they were able to talk their way into an elite institution despite being unqualified, and uses this story to conclude that elite gatekeepers are basically just scouting for talent and that you just need to show people that you have talent:</p> <blockquote> <p>One college related example from my life is that I managed to get into CMU with awful grades and awful SAT scores (I had the flu when I took the test :/)</p> <p>I spent a month learning everything about CMU's CS department, then drove there and talked to professors directly when I first showed up at the campus the entrance office asked my GPA and SAT, then asked me to leave. But I managed to talk to one professor, who sent me to their boss, recursively till I was talking to the vice president of the school he asked me why I'm good enough to go to CMU and I said &quot;I'm not sure I am. All these other kids are really smart. I can leave now&quot; and he interrupted me and reminded me how much agency it took to get into that room.</p> <p>He gave me a handwritten acceptance letter on the spot ... I think one secret, at least when it comes to gatekeepers, is that they're usually just looking for high agency and talent.</p> </blockquote> <p>I've heard this kind of story from other successful people, who tend to come to bimodal conclusions on what it all means. Some conclude that the world correctly recognized their talent and that this is how the world works; talent gets recognized and rewarded. 
Others conclude that the world is fairly random with respect to talent being rewarded and that they got lucky to get rewarded for their talent when many other people with similar talents who used similar strategies were passed over<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">2</a></sup>.</p> <p>Another time I was reminded of old baseball scouting reports was when I heard about how a friend of mine who's now an engineering professor at a top Canadian university got there. Let's call her Jane. When Jane was an undergrad at the university she's now a professor at, she was sometimes helpfully asked &quot;are you lost?&quot; when she was on campus. Sometimes this was because, as a woman, she didn't look like she was in the right place when she was in an engineering building. Other times, it was because she looked like and talked like someone from rural Canada. Once, a security guard thought she was a homeless person who had wandered onto campus. After a few years, she picked up the right clothes and mannerisms to pass as &quot;the right kind of person&quot;, with help from her college friends, who explained to her how one is supposed to talk and dress, but when she was younger, people's first impression was that she was an admin assistant, and now their first impression is that she's a professor's wife because they don't expect a woman to be a professor in her department. She's been fairly successful, but it's taken a lot more work than it would've for someone who looked the part.</p> <p>On whether or not, in her case, her gatekeepers were just looking for agency and talent, she once failed a civil engineering exam because she'd never heard of a &quot;corn dog&quot; and also barely passed an intro programming class she took where the professor announced that anyone who didn't already know how to program was going to fail.</p> <p>The corn dog exam failure was because there was a question on a civil engineering exam where students were supposed to design a corn dog dispenser. My friend had never heard of a corn dog and asked the professor what a corn dog was. The professor didn't believe that she didn't know what a corn dog was and berated her in front of the entire class for asking a question that clearly couldn't be serious. Not knowing what a corn dog was, she designed something that put corn inside a hot dog and dispensed a hot dog with corn inside, which failed because that's not what a corn dog is.</p> <p>It turns out the gatekeepers for civil engineering and programming were not, in fact, just looking for agency and were instead looking for someone who came from the right background. 
I suspect this is not so different from the CMU professor who admitted a promising student on the spot; it just happens that a lot of people pattern match &quot;smart teenage boy with a story about why their grades and SAT scores are bad&quot; to &quot;promising potential prodigy&quot; and &quot;girl from rural Canada with the top grade in her high school class who hasn't really used a computer before and dresses like a poor person from rural Canada because she's paying for college while raising her younger brother because their parents basically abandoned both of them&quot; to &quot;homeless person who doesn't belong in engineering&quot;.</p> <p>Another thing that reminded me of how funny baseball scouting reports are is a conversation I had with Ben Kuhn a while back.</p> <p><strong>Me</strong>: it's weird how tall so many of the men at my level (senior staff engineer) are at big tech companies. In recent memory, I think I've only been in a meeting with one man who's shorter than me at that level or above. I'm only 1&quot; shorter than U.S. average! And the guy who's shorter than me has worked remotely for at least a decade, so I don't know if people really register his height. And people seem to be even taller on the management track. If I look at the VPs I've been in meetings with, they must all be at least 6' tall.<br> <strong>Ben</strong>: Maybe I could be a VP at a big tech company. I'm 6' tall!<br> <strong>Me</strong>: Oh, I guess I didn't know how tall 6' tall is. The VPs I'm in meetings with are noticeably taller than you. They're probably at least 6'2&quot;?<br> <strong>Ben</strong>: Wow, that's really tall for a minimum. 6'2&quot; is 96%-ile for U.S. adult male<br></p> <p>When I've discussed this with successful people who work in big companies of various sorts (tech companies, consulting companies, etc.), men who would be considered tall by normal standards, 6' or 6'1&quot;, tell me that they're frequently the shortest man in the room during important meetings. 6'1&quot; is just below the median height of a baseball player. There's something a bit odd about height seeming more correlated to success as a consultant or a programmer than in baseball, where height directly conveys an advantage. One possible explanation would be a <a href="https://en.wikipedia.org/wiki/Halo_effect">halo effect</a>, where positive associations about tall or authoritative seeming people contribute to their success.</p> <p>When I've seen this discussed online, someone will point out that this is because height and cognitive performance are correlated. But if we look at the literature on IQ, the correlation isn't strong enough to explain something like this. We can also observe this if we look at fields where people's mental acuity is directly tested by something other than an IQ test, such as in chess, where most top players are around average height, with some outliers in both directions. Even without looking at the data in detail, this should be expected because the correlation between height and IQ is weak, with much of the correlation due to the relationship at the low end<sup class="footnote-ref" id="fnref:Q"><a rel="footnote" href="#fn:Q">3</a></sup>, and the correlation between IQ and performance in various mental tasks is also weak (some people will say that it's strong by social science standards, <a href="percentile-latency/">but that's very weak in terms of actual explanatory power even when looking at the population level</a> and it's even weaker at the individual level). 
<a href="https://artscimedia.case.edu/wp-content/uploads/sites/141/2016/12/22143817/Burgoyne-Sala-Gobet-Macnamara-Campitelli-Hambrick-2016.pdf">And then if we look at chess in particular, we can see that the correlation is weak, as expected</a>.</p> <p>Since the correlation is weak, and there are many more people around average height than not, we should expect that most top chess players are around average height. <a href="https://www.chess.com/forum/view/chess-players/do-chess-players-tend-to-be-short">If we look at the most dominant chess players in recent history, Carlsen, Anand, and Kasparov</a>, they're 5'8&quot;, 5'8&quot;, and 5'9&quot;, respectively (if you look at different sources, they'll claim heights of plus or minus a couple inches, but still with a pretty normal range; people often exaggerate heights; if you look at people who try to do real comparisons either via photos or in person, measured heights are often lower than what people claim their own height is<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">4</a></sup>).</p> <p>It's a bit more difficult to find heights of go and shogi players, but it seems like the absolute <a href="https://gameofgo.app/learn/top-ten-go-players">top modern players from this list</a> I could find heights for (Lee Sedol, Yoshiharu Habu) are roughly in the normal range, with there being some outliers in both directions among elite players who aren't among the best of all time, as with chess.</p> <p>If it were the case that height or other factors in appearance were very strongly correlated with mental performance, we would expect to see a much stronger correlation between height and performance in activities that relatively directly measure mental performance, like chess, than we do between height and career success, but it's the other way around, which seems to indicate that the halo effect from height is stronger than any underlying benefits that are correlated with height.</p> <p>If we look at activities where there's a fair amount of gatekeeping before people are allowed to really show their skills but where performance can be measured fairly accurately and where hiring better employees has an immediate, measurable, direct impact on company performance, such as <a href="tech-discrimination/">baseball and hockey, we can see that people went with their gut instinct over data for decades after there were public discussions about how data-driven approaches found large holes in people's intuition</a>.</p> <p>If we then look at programming, where it's somewhere between extremely difficult and impossible to accurately measure individual performance and the impact of individual performance on company success is much less direct than in sports, what should our estimate of how accurate talent assessment be?</p> <p>The pessimistic view is that it seems implausible that we should expect that talent assessment is better than in sports, where it took decades of there being fairly accurate and rigorous public write-ups of performance assessments for companies to take talent assessment seriously. With programming, talent assessment isn't even far enough along that anyone can write up accurate evaluations of people across the industry, so we haven't even started the decades long process of companies fighting to keep evaluating people based on personal opinions instead of accurate measurements.</p> <p>Jobs have something equivalent to old school baseball scouting reports at multiple levels. 
At the hiring stage, there are multiple levels of filters that encode people's biases. A classic study on this is <a href="https://www.nber.org/system/files/working_papers/w9873/w9873.pdf">Bertrand and Mullainathan</a>'s paper, which found that &quot;white sounding&quot; names on resumes got more callbacks for interviews than &quot;black sounding&quot; names and that having a &quot;white sounding&quot; name on the resume increased the returns to having better credentials on the resume. Since then, many variants of this study have been done, e.g., <a href="https://www.utpjournals.press/doi/pdf/10.3138/cpp.2017-033">resumes with white sounding names do better than resumes with Asian sounding names</a>, <a href="https://www.researchgate.net/profile/Juan-Madera-2/publication/266589756_The_Problem_for_Black_Professors_Judged_Before_Met/links/54381fea0cf24a6ddb922d57/The-Problem-for-Black-Professors-Judged-Before-Met.pdf">professors with white sounding names are evaluated, based on their CVs, as having better interpersonal skills than professors with black and Asian sounding names</a>, etc.</p> <p>The literature on promotions and leveling is much weaker, but I and other folks who are in highly selected environments that effectively require multiple rounds of screening, each against more and more highly selected folks, such as VPs, senior (as in &quot;senior staff&quot;+) ICs, professors at elite universities, etc., have observed that filtering on height is as severe as or more severe than in baseball but less severe than in basketball.</p> <p>That's curious when, in mental endeavors where the &quot;promotion&quot; criteria are directly determined by performance, such as in chess, height appears to only be very weakly correlated with success. A major issue in the literature on this is that, in general, social scientists look at averages. In a lot of the studies, they simply produce a correlation coefficient. If you're lucky, they may produce a graph where, for each height, they produce an average of something or other. That's the simplest thing to do but this only provides a very coarse understanding of what's going on.</p> <p>Because I like knowing what makes things tick, including organizations and people's opinions, I've (informally, verbally) polled a lot of engineers about what they thought about other engineers. What I found was that there was a lot of clustering of opinions, resulting in clusters of folks that had rough agreement about who did excellent work. Within each cluster, people would often disagree about the ranking of engineers, but they would generally agree on who was &quot;good to excellent&quot;.</p> <p>One cluster was (in my opinion; this could, of course, also just be my own biases) people who were looking at the output people produced and were judging people based on that. Another cluster was of people who were looking at some combination of height and confidence and were judging people based on that. This one was a mystery to me for a long time (I've been asking people questions like this and collating the data out of habit, long before I had the idea to write this post and, until I recognized the pattern, I found it odd that so many people who have good technical judgment, as evidenced by their ability to do good work and make comments showing good technical judgment, rated so many people highly who so frequently said blatantly incorrect things and produced poorly working or even non-working systems). 
Another cluster was around credentials, such as what school someone went to or what the person was leveled at or what prestigious companies they'd worked for. People could have judgment from multiple clusters, e.g., some folks would praise both people who did excellent technical work and people who are tall and confident. At higher levels, where it becomes more difficult to judge people's work, relatively fewer people based their judgment on people's output.</p> <p>When I did this evaluation collating exercise at the startup I worked at, there was basically only one cluster and it was based on people's output, with fairly broad consensus about who the top engineers were, but I haven't seen that at any of the large companies I've worked for. I'm not going to say that means evaluation at that startup was fair (perhaps all of us were falling prey to the same biases), but at least we weren't falling prey to the most obvious biases.</p> <p>Back to big companies: if we look at what it would take to reform the promotion system, it seems difficult to do because many individual engineers are biased. Some companies have committees handle promotions in order to reduce bias, but the major inputs to the system still have strong biases. The committee uses, as input, recommendations from people, many of whom let those biases have more weight than their technical judgment. Even if we, hypothetically, introduced a system that identified whose judgments were highly correlated with factors that aren't directly relevant to performance and gave those recommendations no weight, people's opinions often limit the work that someone can do. A complaint I've heard from some folks who are junior is that they can't get promoted because their work doesn't fulfill promo criteria. When they ask to be allowed to do work that could get them promoted, they're told they're too junior to do that kind of work. They're generally stuck at their level until they find a manager who believes in their potential enough to give them work that could possibly result in a promo if they did a good job. Another factor that interacts with this is that it's easier to transfer to a team where high-impact work is available if you're doing well and/or have high &quot;promo velocity&quot;, i.e., are getting promoted frequently, and harder if you're doing poorly or even just have low promo velocity and aren't doing particularly poorly. At higher levels, it's uncommon to not be able to do high-impact work, but it's also very difficult to separate out the impact of individual performance and biases because a lot of performance is about who you can influence, which is going to involve trying to influence people who are biased if you need to do it at scale, which you generally do to get promoted at higher levels. The nested, multi-level impact of bias makes it difficult to change the system in a way that would remove the impact of bias.</p> <p>Although it's easy to be pessimistic when looking at the system as a whole, it's also easy to be optimistic when looking at what one can do as an individual. It's pretty easy to do what Bill Wight (the scout known for recommending &quot;funny looking&quot; baseball players) did and ignore what other people incorrectly think is important<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">5</a></sup>. <a href="culture/#appendix-centaur-s-hiring-process">I worked for a company that did this which had, by far, the best engineering team of any company I've ever worked for</a>. 
They did this by ignoring the criteria other companies cared about, e.g., <a href="https://twitter.com/danluu/status/1425514112642080773">hiring people from non-elite schools instead of focusing on pedigree</a>, <a href="algorithms-interviews/">not ruling people out for not having practiced solving abstract problems on a whiteboard that people don't solve in practice at work</a>, <a href="https://twitter.com/danluu/status/1447322487583244289">not having cultural fit criteria that weren't related to job performance</a> (they did care that people were self-directed and would function effectively when given a high degree of independence), etc.<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">6</a></sup></p> <p>Thanks to <font size=+1><b><a rel="sponsored" href ="https://www.reforge.com/all-programs?utm_source=danluu&utm_medium=referral&utm_campaign=spring22_newsletter_test&utm_term=&utm_content=engineering">Reforge - Engineering Programs</a></b></font> and <font size=+1><b><a rel="sponsored" href ="https://flatironsdevelopment.com/">Flatirons Development</a></b></font> for helping to make this post possible by <a href="https://patreon.com/danluu">sponsoring me at the Major Sponsor tier</a>.</p> <p><i>Also, thanks to Peter Bhat Harkins, Yossi Kreinin, Pam Wolf, Laurie Tratt, Leah Hanson, Kate Meyer, Heath Borders, Leo T M, Valentin Hartmann, Sam El-Borai, Vaibhav Sagar, Nat Welch, Michael Malis, Ori Berstein, Sophia Wisdom, and Malte Skarupke for comments/corrections/discussion.</i></p> <h3 id="appendix-other-factors">Appendix: other factors</h3> <p>This post used height as a running example because it's something that's both easily observed to be correlated with success in men and studied across a number of fields. I would guess that social class markers / <a href="https://twitter.com/danluu/status/1480616193761288194">mannerisms</a>, as in the Jane example from this post, have at least as much impact. For example, a number of people have pointed out to me that the tall, successful people they're surrounded by say things with very high confidence (often incorrect things, but said confidently) and also have mannerisms that convey confidence and authority.</p> <p>Other physical factors also seem to have a large impact. There's a fairly large literature on how much the halo effect causes people who are generally attractive to be rated more highly on a variety of dimensions, e.g., <a href="https://www.gwern.net/docs/sociology/2021-klebl.pdf">morality</a>. There's a famous ask metafilter (reddit before there was reddit) answer to a question that's something like &quot;how can you tell someone is bad?&quot; and the most favorited answer (I hope for ironic reasons, although the answerer seemed genuine) is that they have bad teeth. Of course, <a href="https://twitter.com/danluu/status/1477001184904814593">in the U.S., having bad teeth is a marker of childhood financial poverty</a>, not impoverished moral character. And, of course, gender is another dimension that people appear to filter on for reasons unrelated to talent or competence.</p> <p>Another is just random luck. To go back to the baseball example, one of the few negative scouting reports on Chipper Jones came from a scout who said</p> <blockquote> <p>Was not aggressive w/bat. Did not drive ball from either side. Displayed non-chalant attitude at all times. He was a disappointment to me. In the 8 games he managed to collect only 1 hit and hit very few balls well. Showed slap-type swing from L.side . . . 2 av. 
tools</p> </blockquote> <p>Another scout, who saw him on more typical days, correctly noted</p> <blockquote> <p>Definite ML prospect . . . ML tools or better in all areas . . . due to outstanding instincts, ability, and knowledge of game. Superstar potential.</p> </blockquote> <p>Another similarly noted:</p> <blockquote> <p>This boy has all the tools. Has good power and good basic approach at the plate with bat speed. Excellent make up and work-habits. Best prospect in Florida in the past 7 years I have been scouting . . . This boy must be considered for our [1st round draft] pick. Does everything well and with ease.</p> </blockquote> <p>There's a lot of variance in performance. If you judge performance by watching someone for a short period of time, you're going to get wildly different judgments depending on when you watch them.</p> <h3 id="appendix-related-discussions">Appendix: related discussions</h3> <p>If you read the blind orchestra audition study that everybody cites, the study itself seems low quality and unconvincing, but it also seems true that blind auditions were concomitant with an increase in orchestras hiring people who didn't look like what people expected musicians to look like. Blind auditions, where possible, seem like something good to try.</p> <p>As noted previously, a professor remarked that doing hiring over zoom accidentally made height much less noticeable than normal and resulted in at least one university department hiring a number of professors who are markedly less tall than professors who were previously hired.</p> <p>Me on <a href="algorithms-interviews/">how tech interviews don't even act as an effective filter for the main thing they nominally filter for</a>.</p> <p>Me on <a href="programmer-moneyball/">how prestige-focused tech hiring is</a>.</p> <p>@ArtiKel on <a href="https://nintil.com/talent-review">Cowen and Gross's book on talent</a> and on <a href="https://nintil.com/categories/fund-people-not-projects/">funding people over projects</a>. A question I've had for a long time is whether the less-mainstream programs that convey prestige via some kind of talent selection process (Thiel Fellowship, grants from folks like Tyler Cowen, Patrick Collison, Scott Alexander, etc.) are less biased than traditional selection processes or just differently biased. The book doesn't appear to really answer this question, but it's food for thought. And BTW, I view these alternative processes as highly valuable even if they're not better and, actually, even if they're somewhat worse, because their existence gives the world a wider portfolio of options for talent spotting. 
But, even so, I would like to know if the alternative processes are better than traditional processes.</p> <p>Alexey Guzey on <a href="https://guzey.com/where-does-talent-come-from/">where talent comes from</a>.</p> <p><a href="https://marginalrevolution.com/marginalrevolution/2020/02/an-anonymous-reader-on-talent-misallocation-and-bureaucratization.html">An anonymous person on talent misallocation</a>.</p> <p><a href="https://sockpuppet.org/blog/2015/03/06/the-hiring-post/">Thomas Ptacek on actually attempting to look at relevant signals when hiring in tech</a>.</p> <p><a href="https://www.patreon.com/posts/62933244">Me on the use of sleight of hand in an analogy meant to explain the importance of IQ and talent, where the sleight of hand is designed to make it seem like IQ is more important than it actually is</a>.</p> <p><a href="https://web.archive.org/web/20140831182249/https://newrepublic.com/article/119239/transgender-people-can-explain-why-women-dont-advance-work">Jessica Nordell on trans experiences demonstrating differences between how men and women are treated</a>.</p> <p><a href="https://amzn.to/3I9Sqg1">The Moneyball book</a>, of course. Although, for the real nerdy details, I'd recommend reading the old <a href="https://baseballthinkfactory.org/">baseballthinkfactory archives</a> from back when the site was called &quot;baseball primer&quot;. Fans were, in real time, calling out who would be successful, generally with greater success than the baseball teams of the era. The site died off as baseball teams started taking stats seriously, leaving fan analysis in the dust since teams have access to both much better fine-grained data and more time to spend on serious analysis than hobbyists, but it was interesting to watch hobbyists completely dominate the profession using basic data analysis techniques.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:J">Jose Altuve comes from the modern era of statistics-driven decision making and therefore cannot be a counterexample. <a class="footnote-return" href="#fnref:J"><sup>[return]</sup></a></li> <li id="fn:B">There's a similar bimodal split when I see discussions among people who are on the other side of the table and choose who gets to join an elite institution vs. not. Some people are utterly convinced that their judgment is basically perfect (&quot;I just know&quot;, etc.), and some people think that making judgment calls on people is a noisy process and you, at best, get weak signal. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:Q"><p>Estimates range from 0 to 0.3, with Teasdale et al. finding that the correlation decreased over time (speculated to be due to better nutrition) and Teasdale et al. finding that the correlation was significantly stronger than on average in the bottom tail (bottom 2% of height) and significantly weaker than on average at the top tail (top 2% of height), indicating that much of the overall correlation comes from factors that cause both reduced height and reduced IQ.</p> <p>In general, for a correlation coefficient of <code>x</code>, it will explain <code>x^2</code> of the variance. 
So even if the correlation were not weaker at the high end and we had a correlation coefficient of <code>0.3</code>, that would only explain <code>0.3^2 = 0.09</code> of the variance, i.e., <code>1 - 0.09 = 0.91</code> of the variance would be explained by other factors.</p> <a class="footnote-return" href="#fnref:Q"><sup>[return]</sup></a></li> <li id="fn:O">When I did online dating, I frequently had people tell me that I must be taller than I am because they're so used to other people lying about their heights on dating profiles that they associated my height with a larger number than the real number. <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:F"><p>On the other side of the table, what one can do when being assessed, I've noticed that, at work, unless people are familiar with my work, they generally ignore me in group interactions, like meetings. Historically, things that have worked for me and gotten people to stop ignoring me were doing an unreasonably large amount of high-impact work in a short period of time (<a href="productivity-velocity/">while not working long hours</a>), often solving a problem that people thought was impossible to solve in the timeframe, which made it very difficult for people to not notice my work; another was having a person who appears more authoritative than me get the attention of the room and ask people to listen to me; and also finding groups (teams or orgs) that care more about the idea than the source of the idea. More recently, some things that have worked are writing this blog and using mediums where a lot of the cues that people use as proxies for competence aren't there (slack, and to a lesser extent, video calls).</p> <p>In some cases, the pandemic has accidentally caused this to happen in some dimensions. For example, a friend of mine mentioned to me that their university department did video interviews during the pandemic and, for the first time, hired a number of professors who weren't strikingly tall.</p> <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:C"><p>When at a company that has biases in hiring and promo, it's still possible to go scouting for talent in a way that's independent of the company's normal criteria. One method that's worked well for me is to hire interns, since the hiring criteria for interns tends to be less strict. Once someone is hired as an intern, if their work is great and you know how to sell it, it's easy to get them hired full-time.</p> <p>For example, at Twitter, I hired two interns to my team. One, as an intern, <a href="cgroup-throttling/">wrote the kernel patch that solved the container throttling problem</a> (at the margin, worth hundreds of millions of dollars a year) and has gone on to do great, high-impact, work as a full-time employee. The other, as an intern, built out across-the-fleet profiling, a problem many full-time staff+ engineers had wanted to solve but that no one had solved, and is joining Twitter as a full-time employee this fall. In both cases, the person was overlooked by other companies for silly reasons. In the former case, there was a funny combination of reasons other companies weren't interested in hiring them for a job that utilized their skillset, including location / time zone (Australia). From talking to them, they clearly had deep knowledge about computer performance that would be very rare even in an engineer with a decade of &quot;systems&quot; experience. 
There were jobs available to them in Australia, but teams doing performance work at the other big tech companies weren't really interested in taking on an intern in Australia. For the kind of expertise this person had, I was happy to shift my schedule a bit later for a while until they ramped up, and it turned out that they were highly independent and didn't really need guidance to ramp up (we talked a bit about problems they could work on, including the aforementioned container throttling problem, and then they came back with some proposed approaches to solve the problem and then solved the problem). In the latter case, they were a student who was very early in their university studies. The most desirable employers often want students who have more classwork under their belt, so we were able to hire them without much competition. Waiting until a student has a lot of classes under their belt might be a good strategy on average, but this particular intern candidate had written some code that was good for someone with that level of experience and they'd shown a lot of initiative (they reverse engineered the server protocol for a dying game in order to reimplement a server so that they could fix issues that were killing the game), which is a much stronger positive signal than you'll get out of interviewing almost any 3rd year student who's looking for an internship.</p> <p>Of course, you can't always get signal on a valuable skill, but if you're actively scouting for people, you don't need to always get signal. If you occasionally get a reliable signal and can hire people who you have good signal on who are underrated, that's still valuable! For Twitter, in three intern seasons, I hired two interns, the first of whom already made &quot;staff&quot; and the second of whom should get there very quickly based on their skills as well as the impact of their work. In terms of ROI, spending maybe 30 hours a year on the lookout for folks who had very obvious signals indicating they were likely to be highly effective was one of the most valuable things I did for the company. The ROI would go way down if the industry as a whole ever started using effective signals when hiring but, for the reasons discussed in the body of this post, I expect progress to be slow enough that we don't really see the amount of change that would make this kind of work low ROI in my lifetime.</p> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> </ol> </div> A decade of major cache incidents at Twitter cache-incidents/ Wed, 02 Feb 2022 00:00:00 +0000 cache-incidents/ <p><i>This was co-authored with Yao Yue</i></p> <p>This is a collection of information on severe (<code>SEV-0</code> or <code>SEV-1</code>, the most severe incident classifications) incidents at Twitter that were at least partially attributed to cache from the time Twitter started using its current incident tracking JIRA (2012) to date (2022), with one bonus incident from before 2012. Not including the bonus incident, there were 6 <code>SEV-0</code>s and 6 <code>SEV-1</code>s that were at least partially attributed to cache in the incident tracker, along with 38 less severe incidents that aren't discussed in this post.</p> <p>There are a couple reasons we want to write this down. First, historical knowledge about what happens at tech companies is lost at a fairly high rate and we think it's nice to preserve some of it. 
Second, we think it can be useful to look at incidents and reliability from a specific angle, putting all of the information into one place, because that can sometimes make some patterns very obvious.</p> <p>On knowledge loss, when we've seen viral Twitter threads or other viral stories about what happened at some tech company and looked into what happened, the most widely spread stories are usually quite wrong, generally for banal reasons. One reason is that outrageously exaggerated stories are more likely to go viral, so those are the ones that tend to be remembered. Another is that <a href="https://twitter.com/copyconstruct/status/1353487786163097601">there's a cottage industry of former directors / VPs who tell self-aggrandizing stories about all the great things they did that, to put it mildly, frequently distort the truth</a> (although there's nothing stopping ICs from doing this, the most spread false stories we see tend to come from people on the management track). In both cases, there's a kind of <a href="https://en.wikipedia.org/wiki/Gresham%27s_law">Gresham's law of stories in play</a>, where incorrect stories tend to win out over correct stories.</p> <p>And even when making a genuine attempt to try to understand what happened, it turns out that knowledge is lost fairly quickly. For this and other incident analysis projects we've done, links to documents and tickets from the past few years tend to work (90%+ chance), but older links are less likely to work, with the rate getting pretty close to 0% by the time we're looking at things from 2012. Sometimes, people have things squirreled away in locked down documents, emails, etc. but those will often link to things that are now completely dead, and figuring out what happened requires talking to a bunch of people who will, <a href="https://www.pnas.org/content/114/30/7758">due to the nature of human memory, give you inconsistent stories that you need to piece together</a><sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">1</a></sup>.</p> <p>On looking at things from a specific angle, while <a href="postmortem-lessons/">looking at failures broadly and classifying and collating all failures is useful</a>, it's also useful to drill down into certain classes of failures. For example, Rebecca Isaacs and Dan Luu did an (internal, non-public) analysis of Twitter failover tests (from 2018 to 2020), which found a number of things that led to operational changes. In some sense, there was no new information in the analysis since the information we got all came from various documents that already existed, but putting it all into one place made a number of patterns obvious that weren't obvious when looking at incidents one at a time across multiple years.</p> <p>This document shouldn't cause any changes at Twitter since looking at what patterns exist in cache incidents over time and what should be done about that has already been done, but collecting these into one place may still be useful to people outside of Twitter.</p> <p>As for why we might want to look at cache failures (as opposed to failures in other systems), cache is relatively commonly implicated in major failures, as illustrated by this comment Yao made during an internal Twitter War Stories session (referring to the dark ages of Twitter, in operational terms):</p> <blockquote> <p>Every single incident so far has at least mentioned cache. 
In fact, for a long time, cache was probably the #1 source of bringing the site down for a while.</p> <p>In my first six months, every time I restarted a cache server, it was a <code>SEV-0</code> by today's standards. On a good day, you might have 95% Success Rate (SR) [for external requests to the site] if I restarted one cache ...</p> </blockquote> <p>Also, the vast majority of Twitter cache is (a fork of) memcached<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">2</a></sup>, which is widely used elsewhere, making the knowledge more generally applicable than if we discussed a fully custom Twitter system.</p> <p>More generally, caches are a nice source of relatively clean real-world examples of common distributed systems failure modes because of how simple caches are. Conceptually, a cache server is a high-throughput, low-latency RPC server plus a library that manages data, such as memory and/or disk and key-value indices. For in-memory caches, the data management side should be able to easily outpace the RPC side (a naive in-memory key-value library should be able to hit millions of QPS per core, whereas a naive RPC server that doesn't use userspace networking, batching and/or pipelining, etc. will have problems getting to 1/10th that level of performance). Because of the simplicity of everything outside of the RPC stack, cache can be thought of as an approximation of nearly pure RPC workloads, which are frequently important in heavily service-oriented architectures.</p> <p>When scale and performance are concerns, cache will frequently use sharded clusters, which then subject cache to the constraints and pitfalls of distributed systems (but with less emphasis on synchronization issues than with some other workloads, such as strongly consistent distributed databases, due to the emphasis on performance). Also, by the nature of distributed systems, users of cache will be exposed to these failure modes and be vulnerable to or possibly implicated in failures caused by the cascading impact of some kinds of distributed systems failures.</p> <p>Cache failure modes are also interesting because, when cache is used to serve a significant fraction of requests or fraction of data, cache outages or even degradation can easily cause a total outage because an architecture designed with cache performance in mind will not (and should not) have backing DB store performance that's sufficient to keep the site up.</p> <p>Compared to most workloads, cache is more sensitive to performance anomalies below it in the stack (e.g., kernel, firmware, hardware, etc.) because it tends to have relatively high-volume and low-latency SLOs (because the point of cache is that it's fast) and it spends (barring things like userspace networking) a lot of time in the kernel (~80% as a ballpark for Twitter memcached running normal kernel networking). Also, because cache servers often run a small number of threads, cache is relatively sensitive to being starved by other workloads sharing the same underlying resources (CPU, memory, disk, etc.).
The high volume and low latency SLOs worsen positive feedback loops that lead to a &quot;death spiral&quot;, a classic distributed systems failure mode.</p> <p>When we look at the incidents below, we'll see that most aren't really due to errors in the logic of cache, but rather, some kind of anomaly that causes an insufficiently mitigated positive feedback loop that becomes a runaway feedback loop.</p> <p>So, when reading the incidents below, it may be helpful to read them with an eye towards how cache interacts with things above cache in the stack that call caches and things below cache in the stack that cache interacts with. Something else to look for is how frequently a major incident occurred due to an incompletely applied fix for an earlier incident or because something that was considered a serious operational issue by an engineer wasn't prioritized. These were both common themes in the analysis Rebecca Isaacs and Dan Luu did on causes of failover test failures as well.</p> <h3 id="2011-08-sev-0">2011-08 (SEV-0)</h3> <p>For a few months, a significant fraction of user-initiated changes (such as username, screen name, and password) would get reverted. There was continued risk of this for a couple more years.</p> <h4 id="background">Background</h4> <p>At the time, the Rails app had single-threaded workers, managed by a single master that did health checks, redeploys, etc. If a worker got stuck for 30 seconds, the master would kill the worker and restart it.</p> <p>Teams were running on bare metal, without the benefit of a cluster manager like mesos or kubernetes. Teams had full ownership of the hardware and were responsible for kernel upgrades, etc.</p> <p><a href="https://www.metabrew.com/article/libketama-consistent-hashing-algo-memcached-clients">The algorithm for deciding which shard a key would land on involved a hash. If a node went away, the keys that previously hashed to that node would end up getting hashed to other nodes</a>. Each worker had a client that made its own independent routing decisions to figure out which cache shard to talk to, which meant that each worker made independent decisions as to which cache nodes were live and where keys should live. If a client thinks that a host isn't &quot;good&quot; anymore, that host is said to be ejected.</p> <h4 id="incident">Incident</h4> <p>On Nov 8, a user changed their name from [old name] to [new name]. One week later, their username reverted to [old name].</p> <p>Between Nov 8th and early December, tens of these tickets were filed by support agents. Twitter didn't have the instrumentation to tell where things were going wrong, so the first two weeks of investigation were mostly spent getting metrics into the rails app to understand where the issue was coming from. Each change needed to be coordinated with the deploy team, which would take at least two hours. After the rails app was sufficiently instrumented, all signs pointed to cache as the source of the problem.
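</p> <p>To make the background above concrete, here's a minimal sketch (in Python, with made-up names; this is not the actual client, which used a ketama-style consistent hashing scheme per the link above) of hash-based routing where each worker's client independently decides which hosts are live. Once two clients disagree about liveness, the same key can be cached on two different shards, which is exactly the kind of inconsistency being investigated here:</p> <pre><code># Illustrative only: per-client key routing with independent host ejection.
# The real client used ketama-style consistent hashing, but the failure mode
# is the same: two clients that disagree about which hosts are live will
# route the same key to different shards.
import hashlib

HOSTS = ['cache01', 'cache02', 'cache03', 'cache04']

class CacheClient:
    def __init__(self, hosts):
        self.live_hosts = list(hosts)

    def eject(self, host):
        # A client that sees errors from a host stops routing keys to it.
        if host in self.live_hosts:
            self.live_hosts.remove(host)

    def route(self, key):
        # Hash the key and pick a live host; when the live set changes,
        # keys that used to map to the ejected host move elsewhere.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.live_hosts[h % len(self.live_hosts)]

worker_a = CacheClient(HOSTS)
worker_b = CacheClient(HOSTS)
worker_b.eject('cache02')  # only worker B saw transient errors from cache02

for key in ['user:%d' % i for i in range(10)]:
    a, b = worker_a.route(key), worker_b.route(key)
    if a != b:
        # The same user record is now cached in two places; an update
        # through one worker leaves a stale copy behind on the other shard.
        print(key, 'maps to', a, 'for worker A but', b, 'for worker B')
</code></pre> <p>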
The full set of changes needed to really determine if cache was at fault took another week or two, which included adding metrics to track cache inconsistency, cache exception paths, and host ejection.</p> <p>After adding instrumentation, an engineer made the following comment on a JIRA ticket in early December:</p> <blockquote> <p>I turned on code today to allow us to see the extent to which users in cache are out of sync with users in the database, at the point where we write the user in cache back to the database. The number is roughly 0.2% ... Checked 150 popular users on Twitter to see how many caches they were in (should be at most one). Most of them were on at least two, with some on as many as six.</p> </blockquote> <p>The first fix was to avoid writing stale data back to the DB. However, that didn't address the issue of having multiple copies of the same data in different cache shards. The second fix, intended to reduce the number of times keys appeared in multiple locations, was to retry multiple times before ejecting a host. The idea is that, if a host is really permanently down, that will trigger an alert, but alerts for dead hosts weren't firing, so the errors that were causing host ejections should be transient and therefore, if a client keeps retrying, it should be able to find a key &quot;where it's supposed to be&quot;. And then, to prevent flapping keys from hosts having many transient errors, the time that ejected hosts were kept ejected was increased.</p> <p>This change was tested on one cache and then rolled out to other caches. Rolling out the change to all caches immediately caused the site to go down because ejections still occurred and the longer ejection time caused the backend to get stressed. At the time, the backend was MySQL, which, as configured, could take an arbitrarily long amount of time to return a request under high load. This caused workers to take an arbitrarily long time to return results, which caused the master to kill workers, which took down the site when this happened at scale since not enough workers were available to serve requests.</p> <p>After rolling back the second fix, users could still see stale data since, even though stale data wasn't being written back to the DB, cache updates could happen to a key in one location and then a client could read a stale, cached, copy of that key in another location. Another mitigation that was deployed was to move the user data cache from a high utilization cluster to a low utilization cluster.</p> <p>After debugging further, it was determined that retrying could address ejections occurring due to &quot;random&quot; causes of tail latency, but there was still a high rate of ejections coming from some kind of non-random cause. From looking at metrics, it was observed that there was sometimes a high rate of packet loss and that this was correlated with incoming packet rate but not bandwidth usage. Looking at the host during times of high packet rate and packet loss showed that CPU0 was spending 65% to 70% of its time handling soft IRQs, indicating that the packet loss was likely coming from CPU0 not being able to keep up with the packet arrival rate.</p> <p>The fix for this was to set <a href="https://www.kernel.org/doc/Documentation/IRQ-affinity.txt">IRQ affinity</a> to spread incoming packet processing across all of the physical cores on the box.
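</p> <p>For reference, this kind of fix is usually just a matter of writing CPU lists to <code>/proc/irq/N/smp_affinity_list</code> (or masks to <code>smp_affinity</code>), either by hand, via <code>irqbalance</code>, or with a small script run from host configuration. Here's a rough sketch of the idea; the device name and the simple round-robin policy are illustrative rather than the exact change that was deployed:</p> <pre><code># Rough sketch: spread a NIC's interrupts across CPUs by writing to
# /proc/irq/N/smp_affinity_list. Needs root. The device name ('eth0') and
# the round-robin assignment are assumptions for illustration.
import os
import re

def nic_irqs(device='eth0'):
    # Find IRQ numbers whose /proc/interrupts line mentions the device.
    irqs = []
    with open('/proc/interrupts') as f:
        for line in f:
            if device in line:
                m = re.match(r'\s*(\d+):', line)
                if m:
                    irqs.append(int(m.group(1)))
    return irqs

def spread_irqs(irqs, cpus=None):
    # Round-robin the IRQs across CPUs instead of letting them all land on CPU0.
    if cpus is None:
        cpus = list(range(os.cpu_count()))
    for i, irq in enumerate(irqs):
        cpu = cpus[i % len(cpus)]
        with open('/proc/irq/%d/smp_affinity_list' % irq, 'w') as f:
            f.write(str(cpu))

if __name__ == '__main__':
    spread_irqs(nic_irqs('eth0'))
</code></pre> <p>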
After deploying the fix, packet loss and cache inconsistency was observed on the new cluster that user data was moved to but not the old cluster.</p> <p>At this point, it's late December. Looking at other clusters, it was observed that some other clusters also had packet loss. Looking more closely, the packet loss was happening every 20 hours and 40 minutes on some specific machines. All machines that had this issue were a particular <a href="https://en.wikipedia.org/wiki/Stock_keeping_unit">hardware SKU</a> with a particular BIOS version (the latest version; machines from that SKU with earlier BIOS versions were fine). It turned out that hosts with this BIOS version were triggering the BMC to run a very expensive health check every 20 hours and 40 minutes which interrupted the kernel for the duration, preventing any packets from being processed, causing packet drops.</p> <p>It turned out that someone from the kernel team had noticed this exact issue about six months earlier and had tried to push a kernel config change that would fix the issue (increasing the packet ring buffer size so that transient issues wouldn't cause the packet drops when the buffer overflowed). Although that ticket was marked resolved, the fix was never widely rolled out for reasons that are unclear.</p> <p>A quick mitigation that was deployed was to stagger host reboot times so that clusters didn't have coordinated packet drops across the entire cluster at the same time.</p> <p>Because the BMC version needs to match the BIOS version and the BMC couldn't be rolled back, it wasn't possible to fix the issue by rolling back the BIOS. In order to roll the BMC and BIOS forward, the <code>HWENG</code> team had to do emergency testing/qualification of those, which was done as quickly as possible, at which point the BIOS fix was rolled out and the packet loss went away.</p> <p>The total time for everything combined was about two months.</p> <p>However, this wasn't a complete fix since the host ejection behavior was still unchanged and any random issue that caused one or more clients but not all clients to eject a cache shard would still result in inconsistency. Fixing that required changing cache architectures, which couldn't be quickly done (that took about two years).</p> <p><b>Mitigations / fixes</b>:</p> <ul> <li>Add visibility</li> <li>Set IRQ affinity to avoid overloading CPU0</li> <li>Fix firmware issue causing hosts to drop packets periodically</li> <li>Fix cache architecture to one that can tolerate partitions without becoming inconsistent</li> </ul> <p><b>Lessons learned</b>:</p> <ul> <li>Need visibility</li> <li>Need low-level systems understanding to operate cache</li> <li>Make isolated changes (one thing that confused the issue was migrating to new cluster at the same time as pushing IRQ affinity fix, which confusingly fixed one packet loss problem and introduced another one at the same time).</li> </ul> <h3 id="2012-07-sev-1">2012-07 (SEV-1)</h3> <p>Non-personalized trends didn't show up for ~10% of users for about 10 hours, who got an empty trends box.</p> <p>An update to the rails app was deployed, after which the trends cache stopped returning results. This only impacted non-personalized trends because those were served directly from rails (personalized trends were served from a separate service).</p> <p>Two hours in, it was determined that this was due to segfaults in the daemon that refreshes the trends cache, which was due to running out of memory. 
The reason this happened was that the deployed change added a Thrift field to the Trend object, which increased the trends cache refresh daemon memory usage beyond the limit.</p> <p>There was an alert on the trends cache daemon failing, but it only checked for the daemon starting a run successfully, not for it finishing a run successfully.</p> <p>Mitigations / fixes:</p> <ul> <li>Increase ulimit</li> <li>Alert changed to use job success as a criteria, not job startup</li> <li>Add global 404 rate to global dashboard</li> </ul> <p>Lessons learned</p> <ul> <li>Alerts should use job success as a criteria, not job startup</li> </ul> <h3 id="2012-07-sev-0">2012-07 (SEV-0)</h3> <p>This was one of the more externally well-known Twitter incidents because this one resulted in the public error page showing, with no images or CSS:</p> <blockquote> <p>Twitter is currently down for &lt;% = reason %&gt;</p> <p>We expect to be back in &lt;% = deadline %&gt;</p> </blockquote> <p>The site was significantly impacted for about four hours.</p> <p>The information on this one is a bit sketchy since records from this time are highly incomplete (the JIRA ticket for this notes, &quot;This incident was heavily Post-Mortemed and reviewed. Closing incident ticket.&quot;, but written documentation on the incident has mostly been lost).</p> <p>The trigger for this incident was power loss in two rows of racks. In terms of the impact on cache, 48 hosts lost power and were restarted when power came back up, one hour later. 37 of those hosts had their caches fail to come back up because a directory that a script expected to exist wasn't mounted on those hosts. &quot;Manually&quot; fixing the layouts on those hosts took 30 minutes and caches came back up shortly afterwards.</p> <p>The directory wasn't actually necessary for running a cache server, at least as they were run at Twitter at the time. However, there was a script that checked for the existence of the directory on startup that was not concurrently updated when the directory was removed from the layout setup script a month earlier.</p> <p>Something else that increased debugging time was that <code>/proc</code> wasn't mounted properly on hosts when they came back up. Although that wasn't the issue, it was unusual and it took some time to determine that it wasn't part of the incident and was an independent non-urgent issue to be fixed.</p> <p>If the rest of the site were operating perfectly, the cache issue above wouldn't have caused such a severe incident, but a number of other issues in combination caused a total site outage that lasted for an extended period of time.</p> <p>Some other issues were:</p> <ul> <li>Slow requests that should've timed out at 5 seconds didn't. 
Instead, they would continue for 30 seconds until the entire worker process that was working on the slow request was killed and restarted <ul> <li>The code that was supposed to cause the 5 second timeout was being run, but it wasn't using the right timestamp to determine duration and therefore didn't trigger a timeout</li> </ul></li> <li>User data service took a long time to recover <ul> <li>Logging during failures used a large amount of resources and caused very high GC pressure</li> </ul></li> <li>A number of non-cache hosts failed to come back up when rebooted, with issues including hanging at <code>fsck</code> or in a <code>PXE</code> boot loop</li> <li>Although the site and error message were static, the outage page used Ruby wildcards, resulting in template messages being displayed to users <ul> <li>This came from Twitter having recently migrated from having the rails app act as a front end to having a C++ front end; assets for errors were directly copied over and still had <a href="https://docs.ruby-lang.org/en/2.3.0/ERB.html">ERB templates</a></li> </ul></li> <li>CSS didn't load because the part of the site CSS would've loaded from was down</li> <li>Front end got overloaded and failed to restart properly when health checks found that shards were unhealthy</li> </ul> <p>Cache mitigations / fixes:</p> <ul> <li>Fix software that configures layouts to avoid issue in future</li> <li>Audit existing hosts to fix issue on any hosts that were then-currently impacted</li> <li>Make sure <code>/proc</code> is mounted on kernel upgrade</li> <li>Create process for updating/upgrading software that configures layouts to reduce probability of introducing future bug</li> <li>Make sure cache hosts (as well as other hosts) are spread more evenly across failure domains</li> </ul> <p>Other mitigations / fixes (highly incomplete):</p> <ul> <li>Set up disk / RAID health &amp; maintenance on observability boxes</li> <li>Send broken / unhealthy hosts to SiteOps for repair</li> <li>Remove Ruby wildcards from outage page</li> <li>Bundle CSS into outage page so that site CSS still works when other things are down but the outage page is up</li> <li>Add load shedding to front end to drop traffic when overloaded</li> <li>Change logging library for user data service to much cheaper logging library to prevent GC pressure from killing the service when error rate is high</li> <li>Fix 5 second timeout to look at correct header</li> <li>Add an independent timeout at a different level of the stack that should also fire if requests are completely failing to make progress</li> <li>Change front-end health check and restart to forcibly kill nodes instead of trying to gracefully shut them down</li> <li>Ensure only one version of the health check script is running on one node at any given time</li> </ul> <p>Lessons learned:</p> <ul> <li>Failure modes need to be actively tested, including failure modes that would cause a host reboot or a timeout</li> <li>Need to have rack diversity requirements, so losing a couple racks won't disproportionately impact a small number of services</li> </ul> <h3 id="2013-01-sev-0">2013-01 (SEV-0)</h3> <p>Site outage for 3h30m</p> <p>An increase in load (AFAIK, normal for the day, not an outlier load spike) caused a tail latency increase on cache.
The tail latency increase on cache was caused by IRQ affinities not being set on new cache hosts, which caused elevated queue lengths and therefore elevated latency.</p> <p>Increased cache latency, along with the design of the tweet service's cache usage, caused shards of the tweet service to enter a GC death spiral (more latency -&gt; more outstanding requests -&gt; more GC pressure -&gt; more load on the shard -&gt; more latency), which then caused increased load on the remaining shards.</p> <p>At the time, the tweet service cache and user data cache were colocated onto the same boxes, with 1 shard of tweet service cache and 2 shards of user data cache per box. Tweet service cache added the new hosts without incident. User data cache then gradually added the new hosts over the course of an evening, also initially without incident. But when morning peak traffic arrived (peak traffic is in the morning because that's close to both Asian and U.S. peak usage times, with Asian countries generally seeing peak usage outside of &quot;9-5&quot; work hours and U.S. peak usage during work hours), that triggered the IRQ affinity issue. Tweet service was much more impacted by the IRQ affinity issue than the user data service.</p> <p><b>Mitigations / fixes</b>:</p> <ul> <li>IRQ affinity needs to be set for cache hosts, per the 2011-08 incident <ul> <li>Make this the default for boxes instead of having cache hosts do this as one-off changes</li> </ul></li> <li>Change tweet service settings <ul> <li>Reduce max number of connections</li> <li>Increase timeout</li> <li>No GC config changes made because, at the time, GC stats weren't exported as metrics and the GC logs weren't logging sufficient information to understand if bad GC settings were a contributing factor</li> </ul></li> <li>Change settings for all services that use cache <ul> <li>Adjust connection limits to ~2x steady state</li> </ul></li> </ul> <h3 id="2013-09-sev-1">2013-09 (SEV-1)</h3> <p>Overall site success rate dropped to 92% in one datacenter. Users were impacted for about 15 minutes.</p> <p>The timeline service lost access to about 75% of one of the caches it uses. The cache team made a serverset change for that cache and the timeline service wasn't using the recommended mechanism to consume the cache serverset path and didn't &quot;know&quot; which servers were cache servers.</p> <p><b>Mitigations / fixes</b>:</p> <ul> <li>Have timeline service use recommended mechanism for finding serverset path</li> <li>Audit all code that consumes serverset paths to ensure no service is using a non-recommended mechanism for serverset paths</li> </ul> <h3 id="2014-01-sev-0">2014-01 (SEV-0)</h3> <p>The site went down in one datacenter, impacting users whose requests went to that datacenter for 20 minutes.</p> <p>The tweet service started sending elevated load to caches. A then-recent change removed the cap on the number of connections that could be made to caches. At the time, when caches hit around ~160k connections, they would fail to accept new connections. This caused the monitoring service to be unable to connect to cache shards, which caused the monitoring service to restart cache shards, causing an outage.</p> <p>In the months before the outage, there were five tickets describing various ingredients for the outage.</p> <p>In one ticket, a follow-up to a less serious incident caused by a combination of bad C-state configs and SMIs, it was noted that caches stopped accepting connections at ~160k connections.
An engineer debugged the issue in detail, figured out what was going on, and suggested a number of possible paths to mitigating the issue.</p> <p>One ingredient is that, especially when cache is highly loaded, cache can not have <code>accept</code>ed the connection even though the kernel will have established the TCP connection.</p> <p>The client doesn't &quot;know&quot; that the connection isn't really open to the cache and will send a request and wait for a response. Finagle may open multiple connections if it &quot;thinks&quot; that more concurrency is needed. After 150ms, the request will time out. If the queue is long on the cache side, this is likely to be before the cache has even attempted to do anything about the request.</p> <p>After the timeout, Finagle will try again and open another connection, causing the cache shard to become more overloaded each time this happens.</p> <p>On the client side, each of these requests causes a lot of allocations, causing a lot of GC pressure.</p> <p>At the time, settings allowed for 5 requests before marking a node as unavailable for 30 seconds, with 16 connection parallelism and each client attempting to connect to 3 servers. When all those numbers were multiplied out by the number of shards, that allowed the tweet service to hit the limits of what cache can handle before connections stop being accepted.</p> <p>On the cache side, there was one dispatcher thread and N worker threads. The dispatcher thread would call <code>listen</code> and <code>accept</code> and then put work onto queues for worker threads. By default, the backlog length was 1024. When <code>accept</code> failed due to an fd limit, the dispatcher thread set backlog to 0 in <code>listen</code> and ignored all events coming to listening fds. Backlog got reset to normal and connections were accepted again when a connection was closed, freeing up an fd.</p> <p>Before the major incident, it was observed that after the number of connections gets &quot;too high&quot;, connections start getting rejected. After a period of time, the backpressure caused by rejected connections would allow caches to recover.</p> <p>Another ingredient to the issue was that, on one hardware SKU, there were OOMs when the system ran out of <code>32kB</code> pages under high cache load, which would increase load to caches that didn't OOM. This was fixed by a Twitter kernel engineer in</p> <pre><code>commit 96c7a2ff21501691587e1ae969b83cbec8b78e08 Author: Eric W. Biederman &lt;ebiederm@xmission.com&gt; Date: Mon Feb 10 14:25:41 2014 -0800 fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem Recently due to a spike in connections per second memcached on 3 separate boxes triggered the OOM killer from accept. At the time the OOM killer was triggered there was 4GB out of 36GB free in zone 1. The problem was that alloc_fdtable was allocating an order 3 page (32KiB) to hold a bitmap, and there was sufficient fragmentation that the largest page available was 8KiB. I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious but I do agree that order 3 allocations are very likely to succeed. There are always pathologies where order &gt; 0 allocations can fail when there are copious amounts of free memory available. 
Using the pigeon hole principle it is easy to show that it requires 1 page more than 50% of the pages being free to guarantee an order 1 (8KiB) allocation will succeed, 1 page more than 75% of the pages being free to guarantee an order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of the pages being free to guarantee an order 3 allocate will succeed. A server churning memory with a lot of small requests and replies like memcached is a common case that if anything can will skew the odds against large pages being available. Therefore let's not give external applications a practical way to kill linux server applications, and specify __GFP_NORETRY to the kmalloc in alloc_fdmem. Unless I am misreading the code and by the time the code reaches should_alloc_retry in __alloc_pages_slowpath (where __GFP_NORETRY becomes signification). We have already tried everything reasonable to allocate a page and the only thing left to do is wait. So not waiting and falling back to vmalloc immediately seems like the reasonable thing to do even if there wasn't a chance of triggering the OOM killer. Signed-off-by: &quot;Eric W. Biederman&quot; &lt;ebiederm@xmission.com&gt; Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt; Acked-by: David Rientjes &lt;rientjes@google.com&gt; Cc: Cong Wang &lt;cwang@twopensource.com&gt; Cc: &lt;stable@vger.kernel.org&gt; Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt; Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt; </code></pre> <p>and is another example of why <a href="in-house/">companies the size of Twitter get value out of having a kernel team</a>.</p> <p>Another ticket noted the importance of having standardized settings for cache hosts for things like IRQ affinity, C-states, turbo boost, NIC bonding, and firmware version, which was a follow up to another ticket noting that the tweet service sometimes saw elevated latency on some hosts, which was ultimately determined to be due to increased SMIs after a kernel upgrade impacting one hardware SKU type due to some interactions between the kernel and the firmware version.</p> <p><b>Cache Mitigations / fixes</b>:</p> <ul> <li>Reduce backlog from 1024 to 128 to apply back pressure more quickly when dispatcher is overloaded</li> <li>Lower fd limit to avoid some shards running out of memory</li> <li>Use a fixed hash table size in cache to avoid large load of allocations and memory/CPU load during hash table migration</li> <li>Use CPU affinity on low latency memcache hosts</li> </ul> <p>Tests with these mitigations indicated that, even without fixes to clients to prevent clients from &quot;trying to&quot; overwhelm caches, these prevented cache from falling over under conditions similar to the incident.</p> <p><b>Tweet service Mitigations / fixes</b>:</p> <ul> <li>Change timeout, retry, and concurrent client connection settings to avoid overloading caches</li> </ul> <p><b>Lessons learned</b>:</p> <ul> <li>Consistent hardware settings are important</li> <li>Allowing high queue depth before applying backpressure can be dangerous</li> <li>Clients should &quot;do the math&quot; when setting retry policies to avoid using retry policies that can completely overwhelm cache servers when 100% of responses fail and maximal backpressure is being applied</li> </ul> <h3 id="2014-03-sev-0">2014-03 (SEV-0)</h3> <p><a href="https://twitter.com/TheEllenShow/status/440322224407314432">A tweet from Ellen</a> was retweeted very frequently during the Oscars, which resulted in search going down for about 25 
minutes as well as a site outage that prevented many users from being able to use the site.</p> <p>This incident had a lot of moving parts. From a cache standpoint, this was another example of caches becoming overloaded due to badly behaved clients.</p> <p>It's similar to the 2014-01 incident we looked at, except that the cache-side mitigations put in place for that incident weren't sufficient because the &quot;attacking&quot; clients picked more aggressive values than were used by the tweet service during the 2014-01 incident and, by this time, some caches were running in containerized environments on shared mesos, which made them vulnerable to <a href="cgroup-throttling/">throttling death spirals</a>.</p> <p>The major fix to this direct problem was to add pipelining to the Finagle memcached client, allowing most clients to get adequate throughput with only 1 or 2 connections, reducing the probability of clients hammering caches until they fall over.</p> <p>For other services, there were close to 50 fixes put into place across many services. Some major themes for the fixes were:</p> <ul> <li>Add backpressure where appropriate <ul> <li>Avoid retrying when backpressure is being applied</li> </ul></li> <li>Make sure data (mostly) flows to the same DC to avoid expensive and slow cross-DC traffic</li> <li>Create appropriate thread pools to prevent critical work from being starved</li> <li>Add in-process caching for hot items</li> <li>Return incomplete results for queries when under high load / don't fail requests if results are incomplete</li> <li>Create guide for how to configure cache clients to avoid DDoSing cache</li> </ul> <h3 id="2016-01-sev-0">2016-01 (SEV-0)</h3> <p>SMAP, a former Japanese boy band that became a popular adult J-pop group as well as the hosts of a variety show that was frequently the #1 watched show in Japan, held a conference to falsely deny rumors they were going to break up. This resulted in an outage in one datacenter that impacted users routed to that datacenter for ~20 minutes, until that DC was failed away from. It took about six hours for services in the impacted DC to recover.</p> <p>The tweet service in one DC had a load spike, which caused 39 cache shard hosts to OOM kill processes on those hosts. The cluster manager didn't automatically remove the dead nodes from the server set because there were too many dead nodes (it will automatically remove nodes if a few fail, but if too many fail, this change is not automated due to the possibility of exacerbating some kind of catastrophic failure with an automated action since removing nodes from a cache server set can cause traffic spikes to persistent storage). When cache oncalls manually cleaned up the dead nodes, the service that should have restarted them failed to do so because a puppet change had accidentally removed cache-related configs for the service that would normally restart the nodes.
Once the bad puppet commit was reverted, the cache shards came back up, but these initially came back too slowly and then later came back too quickly, causing recovery of tweet service success rate to take an extended period of time.</p> <p>The cache shard hosts were OOM killed because too much kernel socket buffer memory was allocated.</p> <p>The initial fix for this was to limit TCP buffer size on hosts to 4 GB, but this failed a stress test and it was determined that memory fragmentation on hosts with high uptime (2 years) was the reason for the failure and the mitigation was to reboot hosts more frequently to clean up fragmentation.</p> <p><b>Mitigations / fixes</b>:</p> <ul> <li>Reboot hosts more than once every two years</li> <li>Add puppet alerts to cache boxes to detect breaking puppet changes</li> <li>Change cluster manager to handle large changes better (change already in progress due to a previous, smaller, incident)</li> </ul> <h3 id="2016-02-sev-1">2016-02 (SEV-1)</h3> <p>This was the failed stress test from the 2016-01 <code>SEV-0</code> mentioned above. This mildly degraded site success rate for a few minutes until the stress test was terminated.</p> <h3 id="2016-07-sev-1">2016-07 (SEV-1)</h3> <p>A planned migration of user data cache from dedicated hosts to Mesos led to significant service degradation in one datacenter and then minor degradation in another datacenter. Some existing users were impacted and basically all new user signups failed for about half an hour.</p> <p>115 new cache instances were added to a serverset as quickly as the cluster manager could add them, reducing cache hit rates. The cache cluster manager was expected to add 1 shard every 20 minutes, but the configuration change accidentally changed the minimum cache cluster size, which &quot;forced&quot; the cluster manager to add the nodes as quickly as it could.</p> <p>Adding so many nodes at once reduced user data cache hit rate from the normal 99.8% to 84%. In order to stop this from getting worse, operators killed the cluster manager to prevent it from adding more nodes to the serverset and then redeployed the cluster manager in its previous state to restore the old configuration, which immediately improved user data cache hit rate.</p> <p>During the time period when the cache hit rate was degraded, the backing DB saw a traffic spike that caused long GC pauses. This caused user data service requests that missed cache to have a 0% success rate when querying the backing DB.</p> <p>Although there was rate limiting in place to prevent overloading the backing DB, the thresholds were too high to trigger. In order to recover the backing DB, operators did a rolling restart and deployed strict rate limits. Since one datacenter was failed away from due to the above, the strict rate limit was hit in another datacenter because the failing away from one datacenter caused elevated traffic in another datacenter.
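</p> <p>As an aside, the reason a hit rate drop that sounds modest is so dangerous for the backing store is that the DB only sees the misses, so the relevant number is the change in miss rate, not the change in hit rate. Plugging in the hit rates quoted above (the roughly 80x figure below is just arithmetic on those numbers, not something taken from the incident report):</p> <pre><code># Miss traffic seen by the backing DB is proportional to (1 - hit_rate).
normal_hit_rate = 0.998
degraded_hit_rate = 0.84

amplification = (1 - degraded_hit_rate) / (1 - normal_hit_rate)
print('%.0fx more cache-miss traffic to the backing DB' % amplification)  # roughly 80x
</code></pre> <p>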
This caused a mildly reduced success rate in the user data service because requests were getting rejected by the strict rate limit, which is why this incident also impacted a datacenter that wasn't impacted by the original cache outage.</p> <p><b>Mitigations / fixes</b>:</p> <ul> <li>Add a deploy hook that warns operators who are adding or removing a large number of nodes from a cache cluster</li> <li>Add detailed information in runbooks about how to do deploys, cluster creation, expansion, shrinkage, etc.</li> <li>Add a checklist for all &quot;tier 0&quot; (critical) cache deploys</li> </ul> <h3 id="2018-04-sev-0">2018-04 (SEV-0)</h3> <p>A planned test datacenter failover caused a partial site outage for about 1 hour. Degraded success rate was noticed 1 minute into the failover. The failover test was immediately reverted, but it took most of an hour for the site to fully recover.</p> <p>The initial site degradation came from increased error rates in the user data service, which was caused by cache hot keys. There was a mechanism intended to cache hot keys, which sampled 1% of events (with sampling being used in order to reduce overhead, the idea being that if a key is hot, it should be noticed even with sampling) and put sampled keys into a FIFO queue with a hash map to count how often each key appears in the queue.</p> <p>Although this worked for previous high load events, there were some instances where this didn't work as well as intended (but weren't a root cause in an incident) when the values were large because the 1% sampling rate wouldn't allow the cache to &quot;notice&quot; a hot key quickly enough in the case where there were large (and therefore expensive) values. The original hot key detection logic was designed for tweet service cache, where the largest keys were about 5KB. This same logic was then used for other caches, where keys can be much larger. User data cache wasn't a design consideration for hot keys because, at the time hot key promotion was designed, the user data cache wasn't having hot key issues because, at the time, the items that would've been the hottest keys were served from an in-process cache.</p> <p>The large key issue was exacerbated by the use of <code>FNV1-32</code> for key hashing, which ignores the least significant byte. The data set that was causing a problem had a lot of its variance inside the last byte, so the use of <code>FNV1-32</code> caused all of the keys with large values to be stored on a small number of cache shards.
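</p> <p>A minimal sketch of a detector along the lines described above (sample a small fraction of lookups, keep the sampled keys in a bounded FIFO, and count them with a hash map); the class name, window size, and threshold here are made up, but it shows why a small sample can be slow to flag a key that's hot in bytes rather than in request count:</p> <pre><code># Sketch of a sampling-based hot key detector: sample ~1% of lookups, keep
# sampled keys in a bounded FIFO, count occurrences in a hash map, and
# promote a key once its count in the window crosses a threshold.
# The window size and threshold are made-up numbers.
import random
from collections import Counter, deque

class HotKeyDetector:
    def __init__(self, sample_rate=0.01, window=10000, threshold=50):
        self.sample_rate = sample_rate
        self.window = deque(maxlen=window)  # FIFO of sampled keys
        self.counts = Counter()             # sampled key occurrence counts
        self.threshold = threshold

    def observe(self, key):
        # Returns True if the key should be promoted (treated as hot).
        if random.random() &gt;= self.sample_rate:
            return False
        if len(self.window) == self.window.maxlen:
            evicted = self.window.popleft()
            self.counts[evicted] -= 1
            if self.counts[evicted] == 0:
                del self.counts[evicted]
        self.window.append(key)
        self.counts[key] += 1
        return self.counts[key] &gt;= self.threshold
</code></pre> <p>With numbers like these, a key has to show up in a lot of requests before enough samples accumulate to cross the threshold, which is fine when a key is hot by request count but too slow when a relatively small number of requests for very large values is enough to saturate a shard.</p> <p>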
There were suggestions to migrate off of <code>FNV1-32</code> at least as far back as 2014 for this exact reason and a more modern hash function was added to a utility library, but some cache owners chose not to migrate.</p> <p>Because the hot key promotion logic didn't trigger, traffic saturated NIC bandwidth on the shards that had hot keys and were using 1Gb NICs (Twitter hardware is generally heterogeneous unless someone ensures that clusters only have specific characteristics; although many cache hosts had 10Gb NICs, many also had 1Gb NICs).</p> <p>Fixes / mitigations:</p> <ul> <li>Tune user data cache hot key detection</li> <li>Upgrade all hardware in the relevant cache clusters to hosts with 10Gb NICs</li> <li>Switch some caches from <code>FNV</code> to <code>murmur3</code></li> </ul> <h3 id="2018-06-sev-1">2018-06 (SEV-1)</h3> <p>During a test data center failover, success rate for some kinds of actions dropped to ~50% until the test failover was aborted, about four minutes later.</p> <p>From a cache standpoint, the issue was that tweet service cache shards were able to handle much less traffic than expected (about 50% as much traffic) based on load tests that weren't representative of real traffic, resulting in the tweet service cache being under-provisioned. Among the things that made the load test setup unrealistic were:</p> <ul> <li><a href="https://twitter.com/danluu/status/1360029773011984385">The arrival distribution was highly non-independent, with large spikes due to correlated arrivals when under load</a>. It's common to assume either a constant or Poisson arrival distribution, but as <a href="latency-pitfalls/#minutely-resolution">we saw when looking at metrics data, the commonly used load generation assumption that arrivals are either constant or Poisson is false</a> in a way that can result in unbounded difference between achievable throughput under actual load vs. a load generator that makes naive assumptions</li> <li>The number of connections used in the load test was significantly smaller than the number of connections when under high load in practice</li> </ul> <p>Also, a reason for degraded cache performance was that, once a minute, container-based performance counter collection was run for ten seconds, which was fairly expensive because many more counters were being collected than there were hardware counters, requiring the kernel to do expensive operations to switch out which counters are being collected.</p> <p>The degraded performance increased latency enough during the window when performance counters were collected that cache shards were unable to complete their work before hitting <a href="cgroup-throttling/">container throttling limits</a>, degrading latency to the point that tweet service requests would time out. As configured, after 12 consecutive failures to a single cache node, tweet service clients would mark the node as dead for 30 seconds and stop issuing requests to it, causing the node to get no traffic for 30 seconds as clients independently made the decision to mark the node as dead.
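</p> <p>The client behavior described in that last sentence is roughly a consecutive-failure ejection policy. A minimal sketch of the idea (made-up names, not Finagle's actual implementation) is below; the important property is that every client runs the same logic independently against the same latency spike, so they all eject the node at about the same time:</p> <pre><code># Sketch of per-client ejection: after N consecutive failures to a node,
# stop sending it traffic for a fixed period. Names and structure are
# made up for illustration; this is not the actual client code.
import time

class NodeHealth:
    def __init__(self, max_consecutive_failures=12, eject_seconds=30):
        self.max_failures = max_consecutive_failures
        self.eject_seconds = eject_seconds
        self.consecutive_failures = 0
        self.ejected_until = 0.0

    def available(self, now=None):
        now = time.monotonic() if now is None else now
        return now &gt;= self.ejected_until

    def record(self, success, now=None):
        now = time.monotonic() if now is None else now
        if success:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures &gt;= self.max_failures:
            # Every client hits this at roughly the same time during a
            # shard-wide latency spike, so the shard loses all of its
            # traffic at once and the misses land on the backing store.
            self.ejected_until = now + self.eject_seconds
            self.consecutive_failures = 0
</code></pre> <p>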
This caused request rates to the backing DB to increase past the request rate quota, causing requests to get rejected at the DB, increasing the failure rate of the tweet service.</p> <p><b>Mitigations / fixes</b>:</p> <ul> <li>Reduced number of connections from tweet service client to cache from 4 to 2, which reduced latency <ul> <li>As noted in a previous incident, adding pipelining allowed caches to operate efficiently with only 1 client connection, but some engineers were worried that 1 might not be enough because the number of connections was previously much higher, so 4 was chosen &quot;just in case&quot;, but, with standard Linux kernel networking, <a href="https://twitter.github.io/pelikan/2020/benchmark-adq.html">having more connections increases tail latency</a>, so this degraded performance</li> </ul></li> <li>Add more cache nodes to reduce load on individual cache shards</li> <li>Improve cache hot key promotion algorithm <ul> <li>This wasn't specific to this incident, but an engineer did an analysis and found that the hot key promotion algorithm introduced a year ago had a cache hit rate of approximately 0.3% due to a combination of issues for one cache cluster. Switching to a better algorithm improved cache hit rate and performance significantly</li> </ul></li> <li>Change cache qualification process so that the cache performance used to determine capacity (number of nodes) more accurately reflects real-world cache performance</li> <li>Do a detailed analysis of the cost of multiplexed performance counter collection</li> </ul> <p><i>Thanks to <font size=+1><b><a rel="sponsored" href ="https://www.reforge.com/all-programs?utm_source=danluu&utm_medium=referral&utm_campaign=spring22_newsletter_test&utm_term=&utm_content=engineering">Reforge - Engineering Programs</a></b></font> and <font size=+1><b><a rel="sponsored" href ="https://flatironsdevelopment.com/">Flatirons Development</a></b></font> for helping to make this post possible by <a href="https://patreon.com/danluu">sponsoring me at the Major Sponsor tier</a>.</p> <p>Also, thanks to Michael Leinartas, Tao L., Michael Motherwell, Jonathan Riechhold, Stephan Zuercher, Justin Blank, Jamie Brandon, John Hergenroeder, and Ben Kuhn for comments/corrections/discussion.</i></p> <h3 id="appendix-pelikan-cache">Appendix: Pelikan cache</h3> <p><a href="https://twitter.github.io/pelikan/">Pelikan</a> was created to address issues we saw when operating memcached and Redis at scale. <a href="https://twitter.github.io/pelikan/2019/why-pelikan.html">This document</a> explains some of the motivations for Pelikan. The modularity / ease of modification has allowed us to discover novel cache innovations, such as <a href="https://twitter.com/danluu/status/1381687511362138113">a new eviction algorithm that addresses the problems we ran into with existing eviction algorithms</a>.</p> <p>With respect to the kinds of things discussed in this post, Pelikan has had more predictable performance, better median performance, and better performance in the tail than our existing caches when we've tested it in production, which means we get better reliability and more capacity at a lower cost.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:L"><p>That knowledge decays at a high rate isn't unique to Twitter. In fact, of all the companies I've worked at as a full-time employee, I think Twitter is the best at preserving knowledge.
The chip company I worked at, Centaur, basically didn't believe in written documentation other than having comprehensive bug reports, so many kinds of knowledge became lost very quickly. Microsoft was almost as bad since, by default, documents were locked down and fairly need-to-know, so basically nobody other than perhaps a few folks with extremely broad permissions would even be able to dig through old docs to understand how things had come about.</p> <p>In its early days, Google was a lot like Twitter is now, but as the company grew and fears about legal actions grew, especially after multiple embarrassing incidents when <a href="https://twitter.com/danluu/status/1172676081628766208">execs stated their intention to take unethical and illegal actions</a>, things became more locked down, like Microsoft.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:R">There's also some use of a Redis fork, but the average case performance is significantly worse and the performance in the tail is relatively worse than the average case performance. Also, it has higher operational burden at scale directly due to its design, which limits its use for us. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> </ol> </div> Cocktail party ideas cocktail-ideas/ Wed, 02 Feb 2022 00:00:00 +0000 cocktail-ideas/ <p>You don't have to be at a party to see this phenomenon in action, but there's a curious thing I regularly see at parties in social circles where people value intelligence and cleverness without similarly valuing on-the-ground knowledge or intellectual rigor. People often discuss the standard trendy topics (some recent ones I've observed at multiple parties are how to build a competitor to Google search and how to solve the problem of high transit construction costs) and explain why people working in the field today are doing it wrong and then explain how they would do it instead. I occasionally have good conversations that fit that pattern (with people with very deep expertise in the field who've been working on changing the field for years), but the more common pattern is that someone with cocktail-party level knowledge of a field will give their ideas on how the field can be fixed.</p> <p>Asking people why they think their solutions would solve valuable problems in the field has become a hobby of mine when I'm at parties where this kind of superficial pseudo-technical discussion dominates the party. What I've found when I've asked for details is that, in areas where I have some knowledge, people <a href="sounds-easy/">generally don't know what sub-problems need to be solved to solve the problem they're trying to address, making their solution hopeless</a>. After having done this many times, my opinion is that the root cause of this is generally that many people who have a superficial understanding of a topic assume that the topic is as complex as their understanding of the topic instead of realizing that only knowing a bit about a topic means that they're missing an understanding of the full complexity of a topic.</p> <p>Since I often attend parties with programmers, this means I often hear <a href="https://twitter.com/nick_r_cameron/status/1346174149044043776">programmers retelling</a> their <a href="https://twitter.com/danluu/status/1347269793578053632">cocktail-party level understanding of another field</a> (the search engine example above notwithstanding).
If you want a sample of similar comments online, you can often see these when programmers discuss &quot;trad&quot; engineering fields. An example I enjoyed was <a href="https://twitter.com/danluu/status/1162469763374673920">this Twitter thread where Hillel Wayne discussed how programmers without knowledge of trad engineering often have incorrect ideas about what trad engineering is like</a>, where many of the responses are from programmers with little to no knowledge of trad engineering who then reply to Hillel with their misconceptions. When Hillel completed his <a href="https://www.hillelwayne.com/tags/crossover-project/">crossover project</a>, where he interviewed people who've worked in a trad engineering field as well as in software, <a href="https://twitter.com/danluu/status/1484268111687663620">he got even more such comments</a>. Even when people are warned that naive conceptions of a field are likely to be incorrect, many can't help themselves and they'll immediately reply with their opinions about a field they know basically nothing about.</p> <p>Anyway, in the crossover project, Hillel compared the perceptions of people who'd actually worked in multiple fields to pop-programmer perceptions of trad engineering. One of the many examples of this that Hillel gives is when people talk about bridge building, where he notes that programmers say things like</p> <blockquote> <p>The predictability of a true engineer’s world is an enviable thing. But ours is a world always in flux, where the laws of physics change weekly. If we did not quickly adapt to the unforeseen, the only foreseeable event would be our own destruction.</p> </blockquote> <p>and</p> <blockquote> <p>No one thinks about moving the starting or ending point of the bridge midway through construction.</p> </blockquote> <p>But Hillel interviewed a civil engineer who said that they had to move a bridge! Of course, civil engineers don't move bridges as frequently as programmers deal with changes in software but, if you talk to actual, working, civil engineers, many civil engineers frequently deal with changing requirements after a job has started that's not fundamentally different from what programmers have to deal with at their jobs. People who've worked in both fields or at least talk to people in the other field tend to think the concerns faced by engineers in both fields are complex, but people with a cocktail-party level of understanding of the field often claim that the field they're not in is simple, unlike their field.</p> <p>A line I often hear from programmers is that programming is like &quot;having to build a plane while it's flying&quot;, implicitly making the case that programming is harder than designing and building a plane since people who design and build planes can do so before the plane is flying<sup class="footnote-ref" id="fnref:N"><a rel="footnote" href="#fn:N">1</a></sup>. But, of course, someone who designs airplanes could just as easily say &quot;gosh, my job would be very easy if I could build planes with 4 9s of uptime and my plane were allowed to crash and kill all of the passengers for 1 minute every week&quot;. Of course, the constraints on different types of projects and different fields make different things hard, but people often seem to have a hard time seeing constraints other fields have that their field doesn't. 
One might think that understanding that their own field is more complex than an outsider might naively think would help people understand that other fields may also have hidden complexity, but that doesn't generally seem to be the case.</p> <p>If we look at the rest of the statement Hillel was quoting (which is from the top &amp; accepted answer to a stack exchange question), the author goes on to say:</p> <blockquote> <p>It's much easier to make accurate projections when you know in advance exactly what you're being asked to project rather than making guesses and dealing with constant changes.</p> <p>The vast majority of bridges are using extremely tried and true materials, architectures, and techniques. A Roman engineer could be transported two thousand years into the future and generally recognize what was going on at a modern construction site. There would be differences, of course, but you're still building arches for load balancing, you're still using many of the same materials, etc. Most software that is being built, on the other hand . . .</p> </blockquote> <p>This is typical of the kind of error people make when they're discussing cocktail-party ideas. Programmers legitimately gripe when clueless execs who haven't been programmers for a decade request unreasonable changes to a project that's in progress, but this is not so different and actually more likely to be reasonable than when politicians who've never been civil engineers require project changes on large scale civil engineering projects. It's plausible that, on average, programming projects have more frequent or larger changes to the project than civil engineering projects, but I'd guess that the intra-field variance is at least as large as the inter-field variance.</p> <p>And, of course, only someone who hasn't done serious engineering work in the physical world could say something like &quot;The predictability of a true engineer’s world is an enviable thing. But ours is a world always in flux, where the laws of physics change weekly&quot;, thinking that the (relative) fixity of physical laws means that physical work is predictable. When I worked as a hardware engineer, a large fraction of the effort and complexity of my projects went into dealing with physical uncertainty and civil engineering is no different (if anything, the tools civil engineers have to deal with physical uncertainty on large scale projects are much worse, resulting in a larger degree of uncertainty and a reduced ability to prevent delays due to uncertainty).</p> <p>If we look at how Roman engineering or even engineering from 300 years ago differs from modern engineering, a major source of differences is our much better understanding of uncertainty that comes from the physical world. It didn't used to be shocking when a structure failed not too long after being built without any kind of unusual conditions or stimulus (e.g., building collapse, or train accident due to incorrectly constructed rail). This is now rare enough that it's major news if it happens in the U.S.
or Canada and this understanding also lets us build gigantic structures in areas where it would have been previously considered difficult or impossible to build moderate-sized structures.</p> <p>For example, if you look at a large-scale construction project in the Vancouver area that's sitting on the delta (Delta, Richmond, much of the land going out towards Hope), it's only relatively recently that we discovered the knowledge necessary to build some large scale structures (e.g., tall-ish buildings) reliably on that kind of ground, which is one of the many parts of modern civil engineering a Roman engineer wouldn't understand. A lot of this comes from a field called geotechnical engineering, a sub-field of civil engineering (alternately, arguably its own field and also arguably a subfield of geological engineering) that involves the ground, i.e., soil mechanics, rock mechanics, geology, hydrology, and so on and so forth. One fundamental piece of geotechnical engineering is the idea that you can apply <a href="https://en.wikipedia.org/wiki/Mechanics">mechanics</a> to reason about soil. The first known application of mechanics to soils, a fundamental part of geotechnical engineering, was in 1773 and geotechnical engineering as it's thought of today is generally said to have started in 1925. While Roman engineers did a lot of impressive work, <a href="https://www.patreon.com/posts/61946482">the mental models they were operating with precluded understanding much of modern civil engineering</a>.</p> <p>Naturally, for this knowledge to have been able to change what we can build, it must change how we build. If we look at what a construction site on compressible Vancouver delta soils that uses this modern knowledge looks like, by wall clock time, it mostly looks like someone put a pile of sand on the construction site (preload). While a Roman engineer would know what a pile of sand is, they wouldn't know how someone figured out how much sand was needed and how long it needed to be there (in some cases, Romans would use piles or rafts where we would use preload today, but in many cases, they had no answer to the problems preload solves today).</p> <p>Geotechnical engineering and the resultant pile of sand (preload) is one of tens of sub-fields where you'd need expertise when doing a modern, large scale, civil engineering project that a Roman engineer would need a fair amount of education to really understand.</p> <p>Coming back to cocktail party solutions I hear, one common set of solutions is how to fix high construction costs and slow construction. There's a set of trendy ideas that people throw around about why things are so expensive, why projects took longer than projected, etc. Sometimes, these comments are similar to what I hear from practicing engineers that are involved in the projects but, more often than not, the reasons are pretty different. 
When the reasons are the same, it seems that <a href="https://twitter.com/danluu/status/1420866014493822980">they must be correct by coincidence since they don't seem to understand the body of knowledge necessary to reason through the engineering tradeoffs</a><sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">2</a></sup>.</p> <p>Of course, like cocktail party theorists, <a href="https://twitter.com/danluu/status/1483162978224463872">civil engineers with expertise in the field also think that modern construction is wasteful</a>, but the reasons they come up with are often quite different from what I hear at parties<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">3</a></sup>. It's easy to come up with cocktail party solutions to problems by not understanding the problem, assuming the problem is artificially simple, and then coming up with a solution to the imagined problem. It's harder to understand the tradeoffs in play among the tens of interacting engineering sub-fields required to do large scale construction projects and have an actually relevant discussion of what the tradeoffs should be and how one might motivate engineers and policy makers to shift where the tradeoffs land.</p> <p>A widely cited study on the general phenomena of people having <a href="https://twitter.com/danluu/status/1356056202203947008">wildly oversimplified and incorrect models of how things work</a> is <a href="https://link.springer.com/content/pdf/10.3758/BF03195929.pdf">this study by Rebecca Lawson on people's understanding of how bicycles work</a>, which notes:</p> <blockquote> <p>Recent research has suggested that people often overestimate their ability to explain how things function. Rozenblit and Keil (2002) found that people overrated their understanding of complicated phenomena. This illusion of explanatory depth was not merely due to general overconfidence; it was specific to the understanding of causally complex systems, such as artifacts (crossbows, sewing machines, microchips) and natural phenomena (tides, rainbows), relative to other knowledge domains, such as facts (names of capital cities), procedures (baking cakes), or narratives (movie plots).</p> </blockquote> <p>And</p> <blockquote> <p>It would be unsurprising if nonexperts had failed to explain the intricacies of how gears work or why the angle of the front forks of a bicycle is critical. Indeed, even physicists disagree about seemingly simple issues, such as why bicycles are stable (Jones, 1970; Kirshner, 1980) and how they steer (Fajans, 2000). What is striking about the present results is that so many people have virtually no knowledge of how bicycles function.​​</p> </blockquote> <p>In &quot;experiment 2&quot; in the study, people were asked to draw a working bicycle and focus on the mechanisms that make the bicycle work (as opposed to making the drawing look nice) and 60 of the 94 participants had at least one gross error that caused the drawing to not even resemble a working bicycle. 
If we look at a large-scale real-world civil engineering project, a single relevant subfield, like geotechnical engineering, contains many orders of magnitude more complexity than a bicycle and it's pretty safe to guess that, to the nearest percent, zero percent of lay people (or Roman engineers) could roughly sketch out what the relevant moving parts are.</p> <p>For a non-civil engineering example, Jamie Brandon quotes this excerpt from <a href="https://amzn.to/3HrzkSc">Jim Manzi's Uncontrolled</a>, which is a refutation of a &quot;clever&quot; nugget that I've frequently heard trotted out at parties:</p> <blockquote> <p>The paradox of choice is a widely told folktale about a single experiment in which putting more kinds of jam on a supermarket display resulted in less purchases. The given explanation is that choice is stressful and so some people, facing too many possible jams, will just bounce out entirely and go home without jam. This experiment is constantly cited in news and media, usually with descriptions like &quot;scientists have discovered that choice is bad for you&quot;. But if you go to a large supermarket you will see approximately 12 million varieties of jam. Have they not heard of the jam experiment? Jim Manzi relates in <a href="https://amzn.to/3HrzkSc">Uncontrolled</a>:</p> <blockquote> <p>First, note that all of the inference is built on the purchase of a grand total of thirty-five jars of jam. Second, note that if the results of the jam experiment were valid and applicable with the kind of generality required to be relevant as the basis for economic or social policy, it would imply that many stores could eliminate 75 percent of their products and cause sales to increase by 900 percent. That would be a fairly astounding result and indicates that there may be a problem with the measurement.</p> <p>... the researchers in the original experiment themselves were careful about their explicit claims of generalizability, and significant effort has been devoted to the exact question of finding conditions under which choice overload occurs consistently, but popularizers telescoped the conclusions derived from one coupon-plus-display promotion in one store on two Saturdays, up through assertions about the impact of product selection for jam for this store, to the impact of product selection for jam for all grocery stores in America, to claims about the impact of product selection for all retail products of any kind in every store, ultimately to fairly grandiose claims about the benefits of choice to society. But as we saw, testing this kind of claim in fifty experiments in different situations throws a lot of cold water on the assertion.</p> <p>As a practical business example, even a simplification of the causal mechanism that comprises a useful forward prediction rule is unlikely to be much like 'Renaming QwikMart stores to FastMart will cause sales to rise,' but will instead tend to be more like 'Renaming QwikMart stores to FastMart in high-income neighborhoods on high-traffic roads will cause sales to rise, as long as the store is closed for painting for no more than two days.' It is extremely unlikely that we would know all of the possible hidden conditionals before beginning testing, and be able to design and execute one test that discovers such a condition-laden rule.</p> <p>Further, these causal relationships themselves can frequently change. 
For example, we discover that a specific sales promotion drives a net gain in profit versus no promotion in a test, but next year when a huge number of changes occurs - our competitors have innovated with new promotions, the overall economy has deteriorated, consumer traffic has shifted somewhat from malls to strip centers, and so on - this rule no longer holds true. To extend the prior metaphor, we are finding our way through our dark room by bumping our shins into furniture, while unobserved gremlins keep moving the furniture around on us. For these reasons, it is not enough to run an experiment, find a causal relationship, and assume that it is widely applicable. We must run tests and then measure the actual predictiveness of the rules developed from these tests in actual implementation.</p> </blockquote> </blockquote> <p>So far, we've discussed examples of people with no background in a field explaining how a field works or should work, but the error of taking a high-level view and incorrectly assuming that things are simple also happens when people step back and have a high-level view of their own field that's disconnected from the details. For example, back when I worked at Centaur and we'd not yet shipped a dual core chip, a nearly graduated PhD student in computer architecture from a top school asked me, &quot;why don't you just staple two cores together to make a dual core chip like Intel and AMD? That's an easy win&quot;.</p> <p>At that time, we'd already been working on going from single core to multi core for more than one year. Making a single core chip multi-core or even multi-processor capable with decent performance requires significant additional complexity to the cache and memory hierarchy, the most logically complex part of the chip. As a rough estimate, I would guess that taking a chip designed for single-core use and making it multi-processor capable at least doubles the amount of testing/verification effort required to produce a working chip (and the majority of the design effort that goes into a chip is on testing/verification). More generally, a computer architect is only as good as their understanding of the tradeoffs their decisions impact. Great ones have a strong understanding of the underlying fields they must interact with. A common reason that a computer architect will make a bad decision is that they have a cocktail party level understanding of the fields that are one or two levels below computer architecture. An example of a bad decision that's occurred multiple times in industry is when a working computer architect decides to add <a href="https://en.wikipedia.org/wiki/Simultaneous_multithreading">SMT</a> to a chip because it's basically a free win. You pay a few percent extra area and get perhaps 20% better performance. I know of multple attempts to do this that completely failed for predictable reasons because the architect failed to account for the complexity and verification cost of adding SMT. Adding SMT adds much more complexity than adding a second core because the logic has to be plumbed through everything and it causes an explosion in the complexity of verifying the chip for the same reason. Intel famously added SMT to the P4 and did not enable in the first generation it was shipped in because it was too complex to verify in a single generation and had critical, showstopping, bugs. 
With the years of time they had to shake the bugs out on one generation of architecture, they fixed their SMT implementation and shipped it in the next generation of chips. This happened again when they migrated to the Core architecture and added SMT to that. A working computer architect should know that this happened twice to Intel, implying that verifying an SMT implementation is hard, and yet there have been multiple instances where someone had a cocktail party level of understanding of the complexity of SMT and suggested adding it to a design that did not have the verification budget to ever ship a working chip with SMT.</p> <p><a href="https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/">And, of course, this isn't really unique to computer architecture</a>. I used the dual core example because it's one that happens to currently be top-of-mind for me, but I can think of tens of similar <a href="https://twitter.com/danluu/status/814167684954738688">examples</a> off the top of my head and I'm pretty sure I could write up a few hundred examples if I spent a few days thinking about similar examples. <a href="https://twitter.com/danluu/status/1594227349662609408">People working in a field still have to be very careful to avoid having an incorrect, too abstract, view of the world that elides details</a> and <a href="https://twitter.com/altluu/status/1484589911873261568">draws comically wrong inferences or conclusions as a result</a>. When people outside a field explain how things should work, their explanations are generally even worse than someone in the field who missed a critical consideration and <a href="https://www.patreon.com/posts/54329188">they generally present</a> <a href="https://yosefk.com/blog/the-high-level-cpu-challenge.html">crank ideas</a>.</p> <p>Bringing together the Roman engineering example and the CPU example, going from 1 core to 2 (and, in general, going from 1 to 2, as in 1 datacenter to 2 datacenters or a monolith to a distributed system) is something every practitioner should understand is hard, even if some don't. Somewhat relatedly, if someone showed off a 4 THz processor that had 1000x the performance of a 4 GHz processor, that's something any practitioner should recognize as alien technology that they definitely do not understand. Only a lay person with no knowledge of the field could reasonably think to themselves, &quot;it's just a processor running at 1000x the clock speed; an engineer who can make a 4 GHz processor would basically understand how a 4 THz processor with 1000x the performance works&quot;. We are so far from being able to scale up performance by 1000x by running chips 1000x faster that doing so would require many fundamental breakthroughs in technology and, most likely, the creation of entirely new fields that contain more engineering knowledge than exists in the world today. Similarly, only a lay person could look at Roman engineering and modern civil engineering and think &quot;Romans built things and we build things that are just bigger and more varied; a Roman engineer should be able to understand how we build things today because the things are just bigger&quot;. 
Geotechnical engineering alone contains more engineering knowledge than existed in all engineering fields combined in the Roman era and it's only one of the <a href="https://www.patreon.com/posts/61946482">new fields that had to be invented to allow building structures like we can build today</a>.</p> <p>Of course, I don't expect random programmers to understand geotechnical engineering, but I would hope that someone who's making a comparison between programming and civil engineering would at least have some knowledge of civil engineering and not just assume that the amount of knowledge that exists in the field is roughly equal to their knowledge of the field when they know basically nothing about the field.</p> <p>Although <a href="https://www.patreon.com/posts/60185075">I seem to try a lot harder than most folks to avoid falling into the trap of thinking something is simple because I don't understand it</a>, I still fall prey to this all the time and the best things I've come up with to prevent this, while better than nothing, are not reliable.</p> <p>One part of this is that I've tried to cultivate noticing &quot;the feeling of glossing over something without really understanding it&quot;. I think of this is analogous to (and perhaps it's actually the same thing as) something that's become trendy over the past twenty years, paying attention to how emotions feel in your body and understanding your emotional state by noticing feelings in your body, e.g., a certain flavor of tight feeling in a specific muscle is a sure sign that I'm angry.</p> <p>There's a specific feeling I get in my body when I have a fuzzy, high-level, view of something and am mentally glossing over it. I can easily miss it if I'm not paying attention and I suspect I can also miss it when I gloss over something in a way where the non-conscious part of the brain that generates the feeling doesn't even know that I'm glossing over something. Although noticing this feeling is inherently unreliable, I think that everything else I might do that's self contained to check my own reasoning fundamentally relies on the same mechanism (e.g., if I have a checklist to try to determine if I haven't glossed over something when I'm reasoning about a topic, some part of that process will still rely on feeling or intuition). I do try to postmortem cases where I missed the feeling to figure out happened, and that's basically how I figured out that I have a feeling associated with this error in the first place (I thought about what led up to this class of mistake in the past and noticed that I have a feeling that's generally associated with it), but that's never going to perfect or even <a href="p95-skill/">very good</a>.</p> <p>Another component is doing what I think of as &quot;checking inputs into my head&quot;. When I was in high school, I noticed that a pretty large fraction of the &quot;obviously wrong&quot; things I said came from letting incorrect information into my head. 
I didn't and still don't have a good, cheap, way to tag a piece of information with how reliable it is, so I find it much easier to either fact-check or discard information on consumption.</p> <p>Another thing I try to do is <a href="writing-non-advice/#appendix-getting-feedback">get feedback</a>, which is unreliable and also intractable in the general case since the speed of getting feedback is so much slower than the speed of thought that slowing down general thought to the speed of feedback would result in having relatively few thoughts<sup class="footnote-ref" id="fnref:W"><a rel="footnote" href="#fn:W">4</a></sup>.</p> <p>Although, <a href="teach-debugging/">unlike in some areas, there's no mechanical, systematic, set of steps</a> that can be taught that will solve the problem, I do think this is something that can be practiced and improved and there are some fields where similar skills are taught (often implicitly). For example, when discussing the prerequisites for an advanced or graduate level textbook, it's not uncommon to see a book say something like &quot;Self contained. No prerequisites other than mathematical maturity&quot;. This is a shorthand way of saying &quot;This book doesn't require you to know any particular mathematical knowledge that a high school student wouldn't have picked up, but you do need to have ironed out a kind of fuzzy thinking that almost every untrained person has when it comes to interpreting and understanding mathematical statements&quot;. Someone with a math degree will have a bunch of explicit knowledge in their head about things like <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality">Cauchy-Schwarz inequality</a> and the <a href="https://en.wikipedia.org/wiki/Bolzano%E2%80%93Weierstrass_theorem">Bolzano-Weierstrass theorem</a>, but the important stuff for being able to understand the book isn't the explicit knowledge, but the general way one thinks about math.</p> <p>Although there isn't really a term for the equivalent of mathematical maturity in other fields, e.g., people don't generally refer to &quot;systems designs maturity&quot; as something people look for in <a href="https://twitter.com/danluu/status/1470890504833228801">systems design interviews</a>, the analogous skill exists even though it doesn't have a name. And likewise for just thinking about topics where one isn't a trained expert, like a non-civil engineer thinking about why a construction project cost what it did and took as long as it did, a sort of general maturity of thought<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">5</a></sup>.</p> <p>Thanks to <font size=+1><b><a rel="sponsored" href ="https://www.reforge.com/all-programs?utm_source=danluu&utm_medium=referral&utm_campaign=spring22_newsletter_test&utm_term=&utm_content=engineering">Reforge - Engineering Programs</a></b></font> and <font size=+1><b><a rel="sponsored" href ="https://flatironsdevelopment.com/">Flatirons Development</a></b></font> for helping to make this post possible by <a href="https://patreon.com/danluu">sponsoring me at the Major Sponsor tier</a>.</p> <p>Also, thanks to Pam Wolf, Ben Kuhn, Yossi Kreinin, Fabian Giesen, Laurence Tratt, Danny Lynch, Justin Blank, A. 
Cody Schuffelen, Michael Camilleri, and Anonymous for comments/corrections discussion.</p> <h4 id="appendix-related-discussions">Appendix: related discussions</h4> <p>An anonymous blog reader gave this example of their own battle with cocktail party ideas:</p> <blockquote> <p>Your most recent post struck a chord with me (again!), as I have recently learned that I know basically nothing about making things cold, even though I've been a low-temperature physicist for nigh on 10 years, now. Although I knew the broad strokes of cooling, and roughly how a dilution refrigerator works, I didn't appreciate the sheer challenge of keeping things at milliKelvin (mK) temperatures. I am the sole physicist on my team, which otherwise consists of mechanical engineers. We have found that basically every nanowatt of dissipation at the mK level matters, as does every surface-surface contact, every material choice, and so on.</p> <p>Indeed, we can say that the physics of thermal transport at mK temperatures is well understood, and we can write laws governing the heat transfer as a function of temperature in such systems. They are usually written as P = aT^n. We know that different classes of transport have different exponents, n, and those exponents are well known. Of course, as you might expect, the difference between having 'hot' qubits vs qubits at the base temperature of the dilution refrigerator (30 mK) is entirely wrapped up in the details of exactly what value of the pre-factor a happens to be in our specific systems. This parameter can be guessed, usually to within a factor of 10, sometimes to within a factor of 2. But really, to ensure that we're able to keep our qubits cold, we need to measure those pre-factors. Things like type of fastener (4-40 screw vs M4 bolt), number of fasteners, material choice (gold? copper?), and geometry all play a huge role in the actual performance of the system. Oh also, it turns out n changes wildly as you take a metal from its normal state to its superconducting state. Fun!</p> <p>We have spent over a year carefully modeling our cryogenic systems, and in the process have discovered massive misconceptions held by people with 15-20 years of experience doing low-temperature measurements. We've discovered material choices and design decisions that would've been deemed insane had any actual thermal modeling been done to verify these designs.</p> <p>The funny thing is, this was mostly fine if we wanted to reproduce the results of academic labs, which mostly favored simpler experiment design, but just doesn't work as we leave the academic world behind and design towards our own purposes.</p> <p>P.S. Quantum computing also seems to suffer from the idea that controlling 100 qubits (IBM is at 127) is not that different from 1,000 or 1,000,000. I used to think that it was just PR bullshit and the people at these companies responsible for scaling were fully aware of how insanely difficult this would be, but after my own experience and reading you post, I'm a little worried that most of them don't truly appreciate the titanic struggle ahead for us.</p> <p>This is just a long-winded way of saying that I have held cocktail party ideas about a field in which I have a PhD and am ostensibly an expert, so your post was very timely for me. I like to use your writing as a springboard to think about how to be better, which has been very difficult. 
It's hard to define what a good physicist is or does, but I'm sure that trying harder to identify and grapple with the limits of my own knowledge seems like a good thing to do.</p> </blockquote> <p>For a broader and higher-level discussion of clear thinking, see Julia Galef's Scout Mindset:</p> <blockquote> <p>WHEN YOU THINK of someone with excellent judgment, what traits come to mind? Maybe you think of things like intelligence, cleverness, courage, or patience. Those are all admirable virtues, but there’s one trait that belongs at the top of the list that is so overlooked, it doesn’t even have an official name.</p> <p>So I’ve given it one. I call it scout mindset: the motivation to see things as they are, not as you wish they were.</p> <p>Scout mindset is what allows you to recognize when you are wrong, to seek out your blind spots, to test your assumptions and change course. It’s what prompts you to honestly ask yourself questions like “Was I at fault in that argument?” or “Is this risk worth it?” or “How would I react if someone from the other political party did the same thing?” As the late physicist Richard Feynman once said, “The first principle is that you must not fool yourself—and you are the easiest person to fool.”</p> </blockquote> <p>As a tool to improve thought, the book has <a href="https://twitter.com/danluu/status/1477789638387322880">a number of chapters that give concrete checks that one can try</a>, which makes it more (or at least more easily) actionable than this post, which merely suggests that you figure out what it feels like when you're glossing over something. But I don't think that the ideas in the book are a substitute for this post, in that the self-checks the book suggests don't directly attack the problem discussed in this post.</p> <p>In one chapter, Galef suggests leaning into confusion (e.g., if some seemingly contradictory information gives rise to a feeling of confusion), which I agree with. I would add that there are a lot of other feelings that are useful to observe that don't really have a good name. When it comes to evaluating ideas, some that I try to note, beside the already mentioned &quot;the feeling that I'm glossing over important details&quot;, are &quot;the feeling that a certain approach is likely to pay off if pursued&quot;, &quot;the feeling that an approach is really fraught/dangerous&quot;, &quot;the feeling that there's critical missing information&quot;, &quot;the feeling that something is really wrong&quot;, along with similar feelings that don't have great names.</p> <p>For a discussion of how the movie Don't Look Up promotes the idea that the world is simple and we can easily find cocktail party solutions to problems, see <a href="https://astralcodexten.substack.com/p/movie-review-dont-look-up">this post by Scott Alexander</a>.</p> <p>Also, John Salvatier notes that <a href="http://johnsalvatier.org/blog/2017/reality-has-a-surprising-amount-of-detail">reality has a surprising amount of detail</a>.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:N">Another one I commonly hear is that, unlike trad engineers, <a href="https://twitter.com/danluu/status/1162469760900091904">programmers do things that have never been done before</a> <a class="footnote-return" href="#fnref:N"><sup>[return]</sup></a></li> <li id="fn:D"><p>Discussions about construction delays similarly ignore geotechnical reasons for delays. 
As with the above, I'm using geotechnical as an example of a sub-field that explains many delays because it's something I happen to be familiar with, not because it's the most important thing, but it is a major cause of delays and, on many kinds of projects, the largest cause of delays.</p> <p>Going back to our example that a Roman engineer might, at best, superficially understand, the reason that we pile dirt onto the ground before building is that much of Vancouver has poor geotechnical conditions for building large structures. The ground is soft and will get unevenly squished down over time if something heavy is built on top of it. The sand is there as a weight, to pre-squish the ground.</p> <p>As described in the paragraph above, this sounds straightforward. Unfortunately, it's anything but. As it happens, I've been spending a lot of time driving around with a geophysics engineer (a field that's related to but quite distinct from geotechnical engineering). When we drive over a funny bump or dip in the road, she can generally point out the geotechnical issue or politically motivated decision to ignore the geotechnical engineer's guidance that caused the bump to come into existence. The thing I find interesting about this is that, even though the level of de-risking done for civil engineering projects is generally much higher than is done for the electrical engineering projects I've worked on, where in turn it's much higher than on any software project I've worked on, enough &quot;bugs&quot; still make it into &quot;production&quot; that you can see tens or hundreds of mistakes in a day if you drive around, are knowledgeable, and pay attention.</p> <p>Fundamentally, the issue is that humanity does not have the technology to understand the ground at anything resembling a reasonable cost for physically large projects, like major highways. One tool that we have is to image the ground with ground penetrating radar, but this results in highly <a href="https://en.wikipedia.org/wiki/Underdetermined_system">underdetermined</a> output. Another tool we have is to use something like a core drill or soil augur, which is basically digging down into the ground to see what's there. This also has inherently underdetermined output because we only get to see what's going on exactly where we drilled and the ground sometimes has large spatial variation in its composition that's not obvious from looking at it from the surface. A common example is when there's an unmapped remnant creek bed, which can easily &quot;dodge&quot; the locations where soil is sampled. Other tools also exist, but they, similarly, leave the engineer with an incomplete and uncertain view of the world when used under practical financial constraints.</p> <p>When I listen to cocktail party discussions of why a construction project took so long and compare it to what civil engineers tell me caused the delay, the cocktail party discussion almost always exclusively discusses reasons that civil engineers tell me are incorrect. There are many reasons for delays and &quot;unexpected geotechnical conditions&quot; are a common one. 
Civil engineers are in a bind here since drilling cores is time consuming and expensive and people get mad when they see that the ground is dug up and no &quot;real work&quot; is happening (and likewise when preload is applied — &quot;why aren't they working on the highway?&quot;), which creates pressure on politicians which indirectly results in timelines that don't allow sufficient time to understand geotechnical conditions. This sometimes results in a geotechnical surprise during a project (typically phrased as &quot;unforseen geotechnical conditions&quot; in technical reports), which can result in major parts of a project having to switch to slower and more expensive techniques or, even worse, can necessitate a part of a project being redone, resulting in cost and schedule overruns.</p> <p>I've never heard a cocktail party discussion that discusses geotechnical reasons for project delays. Instead, people talk about high-level reasons that are plausible sounding to a lay person, but completely fabricated, reasons that are disconnected from reality. But if you want to discuss how things can be built more quickly and cheaply, &quot;<a href="https://www.theatlantic.com/science/archive/2019/07/we-need-new-science-progress/594946/">progress studies</a>&quot;, etc., this cannot be reasonably done without having some understanding of the geotechnical tradeoffs that are in play (as well as the tradeoffs from other civil engineering fields we haven't discussed).</p> <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:C"><p>One thing we could do to keep costs under control is to do less geotechnical work and ignore geotechnical surprises up to some risk bound. Today, some of the &quot;amount of work&quot; done is determined by regulations and much of it is determined by case law, which gives a rough idea of what work needs to be done to avoid legal liability in case of various bad outcomes, such as a building collapse.</p> <p>If, instead of using case law and risk of liability to determine how much geotechnical derisking should be done, we compute this based on <a href="https://en.wikipedia.org/wiki/Quality-adjusted_life_year">QALYs</a> per dollar, at the margin, we seem to spend a very large amount of money geotechnical derisking compared to many other interventions.</p> <p>This is not just true of geotechnical work and is also true of other fields in civil engineering, e.g., builders in places like the U.S. and Canada do much more slump testing than is done in some countries that have a much faster pace of construction, which reduces the risk of a building's untimely demise. It would be both scandalous and a serious liability problem if a building collapsed because the builders of the building didn't do slump testing when they would've in the U.S. or Canada,, but buildings usually don't collapse even when builders don't do as much slump testing as tends to be done in the U.S. and Canada.</p> <p>Countries that don't build to standards roughly as rigorous as U.S. or Canadian standards sometimes have fairly recently built structures collapse in ways that would be considered shocking in the U.S. and Canada, but the number of lives saved per dollar is very small compared to other places the money could be spent. 
Whether or not we should change this with a policy decision is a more relevant discussion to building costs and timelines than the fabricated reasons I hear cocktail party discussions of construction costs, but I've never heard this or other concrete reasons for project cost brought up outside of civil engineering circles.</p> <p>Even if we just confine ourselves to work that's related to civil engineering as opposed to taking a broader, more EA-minded approach, and looking QALYs for all possible interventions, the tradeoff between resources spent on derisking during construction vs. resources spent derisking on an ongoing basis (inspections, maintenance, etc.), the relative resource levels weren't determined by a process that should be expected to produce anywhere near an optimal outcome.</p> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:W">Some people suggest that writing is a good intermediate step that's quicker than getting external feedback while being more reliable than just thinking about something, but <a href="https://mobile.twitter.com/danluu/status/1453454530444619778/photo/1">I find writing too slow to be usable as a way to clarify ideas</a> and, after working on identifying when I'm having fuzzy thoughts, I find that trying to think through an idea to be more reliable as well as faster. <a class="footnote-return" href="#fnref:W"><sup>[return]</sup></a></li> <li id="fn:R"><p>One part of this that I think is underrated by people who have a self-image of &quot;being smart&quot; is where book learning and thinking about something is sufficient vs. where on-the-ground knowledge of the topic is necessary.</p> <p>A fast reader can read the texts one reads for most technical degrees in maybe 40-100 hours. For a slow reader, that could be much slower, but it's still not really that much time. There are some aspects of problems where this is sufficient to understand the problem and come up with good, reasonable, solutions. <a href="hardware-unforgiving/">And there are some aspects of problems where this is woefully inefficient and thousands of hours of applied effort are required to really be able to properly understand what's going on</a>.</p> <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> </ol> </div> The container throttling problem cgroup-throttling/ Sat, 18 Dec 2021 00:00:00 +0000 cgroup-throttling/ <p><i>This is an excerpt from an internal document <b>David Mackey</b> and I co-authored in April 2019. The document is excerpted since much of the original doc was about comparing possible approaches to increasing efficency at Twitter, which is mostly information that's meaningless outside of Twitter without a large amount of additional explanation/context.</i></p> <p>At Twitter, most CPU bound services start falling over at around 50% reserved container CPU utilization and almost all services start falling over at not much more CPU utilization even though CPU bound services should, theoretically, be able to get higher CPU utilizations. Because load isn't, in general, evenly balanced across shards and the shard-level degradation in performance is so severe when we exceed 50% CPU utilization, this makes the practical limit much lower than 50% even during peak load events.</p> <p>This document will describe potential solutions to this problem. We'll start with describing why we should expect this problem given how services are configured and how the Linux scheduler we're using works. 
We'll then look into case studies on how we can fix this with config tuning for specific services, which can result in a 1.5x to 2x increase in capacity, which can translate into $[redacted]M/yr to $[redacted]M/yr in savings for large services. While this is worth doing and we might get back $[redacted]M/yr to $[redacted]M/yr in <a href="https://en.wikipedia.org/wiki/Total_cost_of_ownership">TCO</a> by doing this for large services, manually fixing services one at a time isn't really scalable, so we'll also look at how we can make changes that can recapture some of the value for most services.</p> <h3 id="the-problem-in-theory">The problem, in theory</h3> <p>Almost all services at Twitter run on Linux with <a href="https://en.wikipedia.org/wiki/Completely_Fair_Scheduler">the CFS scheduler</a>, using <a href="https://www.kernel.org/doc/Documentation/cgroup-v2.txt">CFS bandwidth control quota</a> for isolation, with default parameters. The intention is to allow different services to be colocated on the same boxes without having one service's runaway CPU usage impact other services and to prevent services on empty boxes from taking all of the CPU on the box, resulting in unpredictable performance, which service owners found difficult to reason about before we enabled quotas. The quota mechanism limits the amortized CPU usage of each container, but it doesn't limit how many cores the job can use at any given moment. Instead, if a job &quot;wants to&quot; use more than that many cores over a quota timeslice, it will use more cores than its quota for a short period of time and then get throttled, i.e., basically get put to sleep, in order to keep its amortized core usage below the quota, which is disastrous for <a href="latency-pitfalls/">tail latency</a><sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <p>Since the vast majority of services at Twitter use thread pools that are much larger than their mesos core reservation, when jobs have heavy load, they end up requesting and then using more cores than their reservation and then throttling. This causes services that are provisioned based on load test numbers or observed latency under load to over provision CPU to avoid violating their <a href="https://en.wikipedia.org/wiki/Service-level_objective">SLO</a>s. They either have to ask for more CPUs per shard than they actually need or they have to increase the number of shards they use.</p> <p>An old example of this problem was the JVM Garbage Collector. Prior to work on the JVM to make the JVM container aware, each JVM would default the GC parallel thread pool size to the number of cores on the machine. During a GC, all these GC threads would run simultaneously, exhausting the cpu quota rapidly causing throttling. The resulting effect would be that a subsecond stop-the-world GC pause could take many seconds of wallclock time to complete. While the GC issue has been fixed, the issue still exists at the application level for virtually all services that run on mesos.</p> <h3 id="the-problem-in-practice-case-study">The problem, in practice [case study]</h3> <p>As a case study, let's look at <code>service-1</code>, the largest and most expensive service at Twitter.</p> <p>Below is the CPU utilization histogram for this service just as it starts failing its load test, i.e., when it's just above the peak load the service can handle before it violates its SLO. 
The x-axis is the number of CPUs used at a given point in time and the y-axis is (relative) time spent at that utilization. The service is provisioned for 20 cores and we can see that the utilization is mostly significantly under that, even when running at nearly peak possible load:</p> <p><img src="images/cgroup-throttling/cpu-histogram-1.png" alt="Histogram for service with 20 CPU quota showing that average utilization is much lower but peak utilization is significantly higher when the service is overloaded and violates its SLO" width="1266" height="658"></p> <p>The problem is the little bars above 20. These spikes caused the job to use up its CPU quota and then get throttled, which caused latency to drastically increase, which is why the SLO was violated even though average utilization is about 8 cores, or 40% of quota. One thing to note is that the sampling period for this graph was 10ms and the quota period is 100ms, so it's technically possible to see an excursion above 20 in this graph without throttling, but on average, if we see a lot of excursions, especially way above 20, we'll likely get throttling.</p> <p>After reducing the thread pool sizes to avoid using too many cores and then throttling, we got the following CPU utilization histogram under a load test:</p> <p><img src="images/cgroup-throttling/cpu-histogram-2.png" alt="Histogram for service with 20 CPU quota showing that average utilization is much lower but peak utilization is significantly higher when the service is overloaded and violates its SLO" width="1216" height="548"></p> <p>This is at 1.6x the load (request rate) of the previous histogram. In that case, the load test harness was unable to increase load enough to determine peak load for <code>service-1</code> because the service was able to handle so much load before failure that the service that's feeding it during the load test couldn't keep it and send more load (although that's fixable, I didn't have the proper permissions to quickly fix it). [later testing showed that the service was able to handle about 2x the capacity after tweaking the thread pool sizes]</p> <p>This case study isn't an isolated example — Andy Wilcox has looked at the same thing for <code>service-2</code> and found similar gains in performance under load for similar reasons.</p> <p>For services that are concerned about latency, we can get significant latency gains if we prefer to get latency gains instead of cost reduction. For <code>service-1</code>, if we leave the provisioned capacity the same instead of cutting by 2x, we see a 20% reduction in latency.</p> <p>The gains for doing this for individual large services are significant (in the case of <code>service-1</code>, it's [mid 7 figures per year] for the service and [low 8 figures per year] including services that are clones of it, but tuning every service by hand isn't scalable. That raises the question: how many services are impacted?</p> <h3 id="thread-usage-across-the-fleet">Thread usage across the fleet</h3> <p>If we look at the number of active threads vs. number of reserved cores for moderate sized services (&gt;= 100 shards), we see that almost all services have many more threads that want to execute than reserved cores. It's not uncommon to see tens of <a href="https://access.redhat.com/sites/default/files/attachments/processstates_20120831.pdf">runnable threads</a> per reserved core. 
This makes the <code>service-1</code> example, above, look relatively tame, at 1.5 to 2 runnable threads per reserved core under load.</p> <p>If we look at where these threads are coming from, it's common to see that a program has multiple thread pools where each thread pool is sized to either twice the number of reserved cores or twice the number of logical cores on the host machine. Both inside and outside of Twitter, It's common to see advice that thread pool size should be 2x the number of logical cores on the machine. This advice probably comes from a workload like picking how many threads to use for something like a gcc compile, where we don't want to have idle resources when we could have something to do. Since threads will sometimes get blocked and have nothing to do, going to 2x can increase throughput over 1x by decreasing the odds that any core is every idle, and 2x is a nice, round, number.</p> <p>However, there are a few problems with applying this to Twitter applications:</p> <ol> <li>Most applications have multiple, competing, thread pools</li> <li>Exceeding the reserved core limit is extremely bad</li> <li>Having extra threads working on computations can increase latency</li> </ol> <p>The &quot;we should provision 2x the number of logical cores&quot; model assumes that we have only one main thread pool doing all of the work and that there's little to no downside to having threads that could do work sit and do nothing and that we have a throughput oriented workload where we don't care about the deadline of any particular unit of work.</p> <p>With the CFS scheduler, threads that have active work that are above the core reservation won't do nothing, they'll get scheduled and run, but this will cause throttling, which negatively impacts tail latency.</p> <h3 id="potential-solutions">Potential Solutions</h3> <p>Given that we see something similar looking to our case study on many services and that it's difficult to push performance fixes to a lot of services (because service owners aren't really incentivized to take performance improvements), what can we do to address this problem across the fleet and just on a few handpicked large services? We're going to look at a list of potential solutions and then discuss each one in more detail, below.</p> <ul> <li>Better defaults for cross-fleet threadpools (eventbus, netty, etc.)</li> <li>Negotiating ThreadPool sizes via a shared library</li> <li>CFS period tuning</li> <li>CFS bandwidth slice tuning</li> <li>Other scheduler tunings</li> <li>CPU pinning and isolation</li> <li>Overprovision at the mesos scheduler level</li> </ul> <h4 id="better-defaults-for-cross-fleet-threadpools">Better defaults for cross-fleet threadpools</h4> <p><b>Potential impact</b>: some small gains in efficiency<br> <b>Advantages</b>: much less work than any comprehensive solution, can be done in parallel with more comprehensive solutions and will still yield some benefit (due to reduced lock contention and context switches) if other solutions are in place.<br> <b>Downsides</b>: doesn't solve most of the problem.</p> <p>Many defaults are too large. Netty default threadpool size is 2x the reserved cores. 
In some parts of [an org], they use a library that spins up <a href="https://news.ycombinator.com/item?id=26643392">eventbus</a> and allocates a threadpool that's 2x the number of logical cores on the host (resulting in [over 100] eventbus threads) when 1-2 threads is sufficient for most of their eventbus use cases.</p> <p>Adjusting these default sizes won't fix the problem, but it will reduce the impact of the problem and this should be much less work than the solutions below, so this can be done while we work on a more comprehensive solution.</p> <h4 id="negotiating-threadpool-sizes-via-a-shared-library-api">Negotiating ThreadPool sizes via a shared library (API)</h4> <p>[this section was written by <i>Vladimir Kostyukov</i>]</p> <p><b>Potential impact</b>: can mostly mitigate the problem for most services.<br> <b>Advantages</b>: quite straightforward to design and implement; possible to make it first-class in <a href="https://kostyukov.net/posts/finagle-101/">Finagle</a>/Finatra.<br> <b>Downsides</b>: Requires service-owners to opt-in explicitly (adopt a new API for constructing thread-pools).</p> <p>CSL’s util library has a package that bridges in some integration points between an application and a JVM (util-jvm), which could be a good place to host a new API for negotiating the sizes of the thread pools required by the application.</p> <p>The look and feel of such API is effectively dictated by how granular the negotiation is needed to be. Simply contending on a total number of allowed threads allocated per process, while being easy to implement, doesn’t allow distinguishing between application and IO threads. Introducing a notion of QoS for threads in the thread pool (i.e., “IO thread; can not block”, “App thread; can block”), on the other hand, could make the negotiation fine grained.</p> <h4 id="cfs-period-tuning">CFS Period Tuning</h4> <p><b>Potential impact</b>: small reduction tail latencies by shrinking the length of the time period before the process group’s CFS runtime quota is refreshed.<br> <b>Advantages</b>: relatively straightforward change requiring few minimal changes.<br> <b>Downsides</b>: comes at increased scheduler overhead costs that may offset the benefits and does not address the core issue of parallelism exhausting quota. May result in more total throttling.</p> <p>To limit CPU usage, CFS operates over a time window known as the CFS period. Processes in a scheduling group take time from the CFS quota assigned to the cgroup and this quota is consumed over the cfs_period_us in CFS bandwidth slices. By shrinking the CFS period, the worst case time between quota exhaustion causing throttling and the process group being able to run again is reduced proportionately. 
Taking the default values of a CFS bandwidth slice of 5ms and CFS period of 100ms, in the worst case, a highly parallel application could exhaust all of its quota in the first bandwidth slice leaving 95ms of throttled time before any thread could be scheduled again.</p> <p>It's possible that total throttling would increase because the scheduled time over 100ms might not exceed the threshold even though there are (for example) 5ms bursts that exceed the threshold.</p> <h4 id="cfs-bandwidth-slice-tuning">CFS Bandwidth Slice Tuning</h4> <p><b>Potential impact</b>: small reduction in tail latencies by allowing applications to make better use of the allocated quota.<br> <b>Advantages</b>: relatively straightforward change requiring minimal code changes.<br> <b>Downsides</b>: comes at increased scheduler overhead costs that may offset the benefits and does not address the core issue of parallelism exhausting quota.</p> <p>When CFS goes to schedule a process it will transfer run-time between a global pool and CPU local pool to reduce global accounting pressure on large systems.The amount transferred each time is called the &quot;slice&quot;. A larger bandwidth slice is more efficient from the scheduler’s perspective but a smaller bandwidth slice allows for more fine grained execution. In debugging issues in [link to internal JIRA ticket] it was determined that if a scheduled process fails to consume its entire bandwidth slice, the default slice size being 5ms, because it has completed execution or blocked on another process, this time is lost to the process group reducing its ability to consume all available resources it has requested.</p> <p>The overhead of tuning this value is expected to be minimal, but should be measured. Additionally, it is likely not a one size fits all tunable, but exposing this to the user as a tunable has been rejected in the past in Mesos. Determining a heuristic for tuning this value and providing a per application way to set it may prove infeasible.</p> <h4 id="other-scheduler-tunings">Other Scheduler Tunings</h4> <p><b>Potential Impact</b>: small reduction in tail latencies and reduced throttling.<br> <b>Advantages</b>: relatively straightforward change requiring minimal code changes.<br> <b>Downsides</b>: comes at potentially increased scheduler overhead costs that may offset the benefits and does not address the core issue of parallelism exhausting quota.</p> <p>The kernel has numerous auto-scaling and auto-grouping features whose impact to scheduling performance and throttling is currently unknown. <code>kernel.sched_tunable_scaling</code> can adjust <code>kernel.sched_latency_ns</code> underneath our understanding of its value. <code>kernel.sched_min_granularity_ns</code> and <code>kernel.sched_wakeup_granularity_ns</code> can be tuned to allow for preempting sooner, allowing better resource sharing and minimizing delays. <code>kernel.sched_autogroup_enabled</code> may currently not respect <code>kernel.sched_latency_ns</code>leading to more throttling challenges and scheduling inefficiencies. 
These tunables have not been investigated significantly and the impact of tuning them is unknown.</p> <h4 id="cfs-scheduler-improvements">CFS Scheduler Improvements</h4> <p><b>Potential impact</b>: better overall cpu resource utilization and minimized throttling due to CFS inefficiencies.<br> <b>Advantages</b>: improvements are transparent to userspace.<br> <b>Downsides</b>: the CFS scheduler is complex so there is a large risk to the success of the changes and upstream reception to certain types of modifications may be challenging.</p> <p>How the CFS scheduler deals with unused slack time from the CFS bandwidth slice has shown to be ineffective. The kernel team has a patch to ensure that this unused time is returned back to the global pool for other processes to use, <a href="https://lore.kernel.org/patchwork/patch/907450/">https://lore.kernel.org/patchwork/patch/907450/</a> to ensure better overall system resource utilization. There are some additional avenues to explore that could provide further enhancements. Another of many recent discussions in this area that fell out of a k8s throttling issue(<a href="https://github.com/kubernetes/kubernetes/issues/67577">https://github.com/kubernetes/kubernetes/issues/67577</a>) is <a href="https://lkml.org/lkml/2019/3/18/706">https://lkml.org/lkml/2019/3/18/706</a>.</p> <p>Additionally, CFS may lose efficiency due to bugs such as [link to internal JIRA ticket] and <a href="http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf">http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf</a>. However, we haven't spent much time looking at the CFS performance for Twitter’s particular use cases. A closer look at CFS may find ways to improve efficiency.</p> <p>Another change which has more upside and downside potential would be to use a scheduler other than CFS.</p> <h4 id="cpu-pinning-and-isolation">CPU Pinning and Isolation</h4> <p><b>Potential impact</b>: removes the concept of throttling from the system by making the application developer’s mental model of a CPU map to a physical one. <br> <b>Advantages</b>: simplified understanding from application developer’s perspective, scheduler imposed throttling is no longer a concept an application contends with, improved cache efficiency, much less resource interference resulting in more deterministic performance.<br> <b>Disadvantages</b>: greater operational complexity, oversubscription is much more complicated, significant changes to current operating environment</p> <p>The fundamental issue that allows throttling to occur is that a heavily threaded application can have more threads executing in parallel than the “number of CPUs” it requested resulting in an early exhaustion of available runtime. By restricting the number of threads executing simultaneously to the number of CPUs an application requested there is now a 1:1 mapping and an application’s process group is free to consume the logical CPU thread unimpeded by the scheduler. Additionally, by dedicating a CPU thread rather than a bandwidth slice to the application, the application is now able to take full advantage of CPU caching benefits without having to contend with other applications being scheduled on the same CPU thread while it is throttled or context switched away.</p> <p>In Mesos, implementing CPU pinning has proven to be quite difficult. However, in k8s there is existing hope in the form of a project from Intel known as the k8s CPU Manager. 
The CPU Manager was added as an alpha feature to k8s in 1.8 and has been enabled as a beta feature since 1.10. It has somewhat stalled in beta as few people seem to be using it but the core functionality is present. The performance improvements promoted by the CPU Manager project are significant as shown in examples such as <a href="https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/">https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/</a> and <a href="https://builders.intel.com/docs/networkbuilders/cpu-pin-and-isolation-in-kubernetes-app-note.pdf">https://builders.intel.com/docs/networkbuilders/cpu-pin-and-isolation-in-kubernetes-app-note.pdf</a> While these benchmarks should be looked at with some skepticism, it does provide promising hope for exploring this avenue. A cursory inspection of the project highlights a few <a href="https://github.com/kubernetes/kubernetes/issues/70585">areas</a> where work may still be needed but it is already in a usable state for validating the approach. Underneath, the k8s CPU Manager leverages the cpuset cgroup functionality that is present in the kernel.</p> <p>Potentially, this approach does reduce the ability to oversubscribe the machines. However, the efficiency gains from minimized cross-pod interference, CPU throttling, a more deterministic execution profile and more may offset the need to oversubscribe. Currently, the k8s CPU Manager does allow for minor oversubscription in the form of allowing system level containers and the daemonset to be oversubscribed, but on a pod scheduling basis the cpus are reserved for that pod’s use.</p> <p>Experiments by Brian Martin and others have shown significant performance benefits from CPU pinning that are almost as large as our oversubscription factor.</p> <p>Longer term, oversubscription could be possible through a multitiered approach of wherein a primary class of pods is scheduled using CPU pinning but a secondary class of pods that is not as latency sensitive is allowed to float across all cores consuming slack resources from the primary pods. The work on the CPU Manager side would be extensive. However, recently <a href="https://lwn.net/ml/linux-kernel/20190408214539.2705660-1-songliubraving@fb.com/">Facebook has been doing some work</a> on the kernel scheduler side to further enable this concept in a way that minimally impacts the primary pod class that we can expand upon or evolve.</p> <h4 id="oversubscription-at-the-cluster-scheduler-level">Oversubscription at the cluster scheduler level</h4> <p><b>Potential impact</b>: can bring machine utilization up to an arbitrarily high level and overprovisioning &quot;enough&quot;.<br> <b>Advantages</b>: oversubscription at the cluster scheduler level is independent of the problem described in this doc; doing it in a data-driven way can drive machine utilization up without having to try to fix the specific problems described here. This could simultaneously fix the problem in this doc (low CPU utilization due to overprovisioning to avoid throttling) while also fixing [reference to document describing another problem].<br> <b>Disadvantages</b>: we saw in [link to internal doc] that shards of services running on hosts with high load have degraded performance. 
Unless we change the mesos scheduler to schedule based on actual utilization (as opposed to reservation), some hosts would end up too highly loaded, and services with shards that land on those hosts would have poor performance.</p> <h4 id="disable-cfs-quotas">Disable CFS quotas</h4> <p><b>Potential impact</b>: prevents throttling and allows services to use all available cores on a box by relying on the &quot;shares&quot; mechanism instead of quota.<br> <b>Advantages</b>: in some sense, can give us the highest possible utilization.<br> <b>Disadvantages</b>: badly behaved services could severely interfere with other services running on the same box. Also, service owners would have a much more difficult time predicting the performance of their own service since performance variability between the unloaded and loaded state would be much larger.</p> <p>This solution is what was used before we enabled quotas. From a naive hardware utilization standpoint, relying on the shares mechanism seems optimal since this means that, if the box is underutilized, services can take unused cores, but if the box becomes highly utilized, services will fall back to taking their share of cores, proportional to their core reservation. However, when we used this system, most service owners found it too difficult to estimate performance under load for this to be practical. At least one company has tried this solution to fix their throttling problem and has had severe incidents under load because of it. If we switched back to this today, we'd be no better off than we were before we enabled quotas.</p> <p>Given how we allocate capacity, two ingredients that would make this work better than it did before are a more carefully controlled request rate to individual shards and a load testing setup that allows service owners to understand what things would really look like during a load spike. Our current system only allows injection of unrealistic load to individual shards, which has two problems: the request mix isn't the same as it is under a real load spike, and the shard with injected load isn't seeing elevated load from other services running on the same box. Per [another internal document], we know that one of the largest factors impacting shard-level performance is overall load on the box and that the impact on latency is non-linear and difficult to predict, so there's not really a good way to predict performance under actual load from performance under load tests with the load testing framework we have today.</p> <p>Although these missing ingredients are important, high-impact issues, addressing either of them is beyond the scope of this doc; [Team X] owns load testing and is working on it, and it might be worth revisiting this when that problem is solved.</p> <p>An intermediate solution would be to set the scheduler quota to a larger value than the number of reserved cores in mesos, which would bound the impact of having &quot;too much&quot; CPU available causing unpredictable performance while potentially reducing throttling under high load because the scheduler will effectively fall back to the shares mechanism if the box is highly loaded. For example, if the cgroup quota were twice the mesos quota, services that fall over at 50% of reserved mesos CPU usage would then instead fall over at 100% of reserved mesos CPU usage.
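</p> <p>As a rough sketch of what that intermediate option could look like against the cgroup v1 interface (the paths, group names, and the 2x multiplier below are illustrative assumptions; this is a sketch of the idea, not something we actually built):</p> <pre><code># Illustrative sketch only: give a container a CFS quota that is 2x its
# scheduler reservation, while leaving cpu.shares proportional to the
# reservation. Assumes cgroup v1 paths; adjust as needed.
from pathlib import Path

CPU_CGROUP_ROOT = Path("/sys/fs/cgroup/cpu")   # assumption: cgroup v1 hierarchy

def set_relaxed_quota(cgroup_name, reserved_cores, multiplier=2):
    cg = CPU_CGROUP_ROOT / cgroup_name
    period_us = int((cg / "cpu.cfs_period_us").read_text())
    # Quota is CPU time per period, so "2x the reservation" is 2 * cores * period.
    (cg / "cpu.cfs_quota_us").write_text(str(reserved_cores * multiplier * period_us))
    # The shares mechanism (by convention, 1024 shares per reserved core) still
    # limits the container to roughly its proportional share once the whole box
    # is saturated, which is what bounds the blast radius of the extra quota.
    (cg / "cpu.shares").write_text(str(reserved_cores * 1024))

# e.g., a hypothetical service that reserved 8 cores would get a 16-core quota:
# set_relaxed_quota("mesos/some-service", reserved_cores=8)
</code></pre> <p>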
For boxes at high load, the higher overall utilization would reduce throttling because the increased load on the box would mean that a service that has too many runnable threads wouldn't be able to have as many of those threads execute. This has a weaker version of the downside of disabling quotas, in that, from [internal doc], we know that load on a box from other services is one of the largest factors in shard-level performance variance, and this would, if we don't change how many mesos cores are reserved on a box, increase load on boxes. And if we do proportionately decrease the number of mesos reserved cores on a box, that makes the change pointless in that it's equivalent to just doubling every service's CPU reservation, except that having it &quot;secretly&quot; doubled would probably reduce the number of people who ask the question, &quot;Why can't I exceed X% CPU in load testing without the service falling over?&quot;</p> <h3 id="results">Results</h3> <p><i>This section was not in the original document from April 2019; it was written in December 2021 and describes work that happened as a result of the original document.</i></p> <p>The suggestion of changing default thread pool sizes was taken and resulted in minor improvements. More importantly, two major efforts came out of the document. Vladimir Kostyukov (from the <a href="https://finagle.github.io/blog/2021/03/31/quarterly/">CSL team</a>) and Flavio Brasil (from the JVM team) created the <a href="https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/filter/OffloadFilter.scala">Finagle Offload Filter</a> and Xi Yang (my intern<sup class="footnote-ref" id="fnref:I"><a rel="footnote" href="#fn:I">2</a></sup> at the time and now a full-time employee for my team) created a kernel patch which eliminates container throttling (the patch is still internal, but will hopefully eventually be upstreamed).</p> <p>Almost all applications that run on mesos at Twitter run on top of <a href="https://kostyukov.net/posts/finagle-101/">Finagle</a>. The Finagle Offload Filter makes it trivial for service owners to put application work onto a different thread pool than IO (which was often not previously happening). In combination with sizing thread pools properly, this resulted in, ceteris paribus, applications having <a href="https://mobile.twitter.com/fbrasisil/status/1163974576511995904">drastically reduced latency</a>, enabling them to reduce their provisioned capacity and therefore their cost while meeting their SLO. Depending on the service, this resulted in a 15% to 60% cost reduction for the service.</p> <p>The kernel patch implements the obvious idea of preventing a container from using more cores than its quota at every moment, instead of allowing a container to use as many cores as are available on the machine and then putting the container to sleep if it uses too many cores in order to bring its amortized core usage down.</p> <p>In experiments on hosts running major services at Twitter, this has the expected impact of eliminating issues related to throttling, giving a roughly 50% cost reduction for a typical service with untuned thread pool sizes. And it turns out the net impact is larger than we realized when we wrote this document due to the reduction in interference caused by preventing services from using &quot;too many&quot; cores and then throttling<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">3</a></sup>.
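</p> <p>Since the patch itself is internal, the following is only a toy model (mine, with made-up numbers) of the behavioral difference between CFS bandwidth control and a hard cap on concurrently running threads; it is not the patch's actual implementation:</p> <pre><code># Toy model: compare the instantaneous core usage of CFS bandwidth control
# vs. a hard cap on concurrently running threads, over one 100ms period.
# Numbers are illustrative, not measurements from any real service.

PERIOD_MS = 100
QUOTA_CORES = 4      # the container's reservation
HOST_CORES = 32      # idle cores available on the box
RUNNABLE = 32        # threads that want to run for the whole period

def cfs_bandwidth_profile():
    """Run as wide as the host allows until the budget is gone, then sleep."""
    budget = QUOTA_CORES * PERIOD_MS          # CPU-ms available this period
    profile = []
    for _ in range(PERIOD_MS):                # 1ms ticks
        cores = min(RUNNABLE, HOST_CORES, budget)
        profile.append(cores)
        budget -= cores
    return profile

def hard_cap_profile():
    """Never let more than QUOTA_CORES threads execute at once."""
    return [min(RUNNABLE, QUOTA_CORES) for _ in range(PERIOD_MS)]

for name, profile in (("bandwidth", cfs_bandwidth_profile()), ("hard cap", hard_cap_profile())):
    print(name, "total CPU-ms:", sum(profile), "peak concurrent cores:", max(profile))

# Both policies allow the same total CPU time per period, but the bandwidth
# policy alternates between a 32-core burst and an enforced sleep, while the
# hard cap runs 4 cores steadily: no throttling and far less interference.
</code></pre> <p>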
Also, although this was realized at the time, we didn't note in the document that the throttling issue causes shards to go from &quot;basically totally fine&quot; to a &quot;throttling death spiral&quot; that's analogous to a &quot;GC death spiral&quot; with only a small amount of additional load, which makes it more difficult to operate systems reliably. What happens is that, when a service is under high load, it will throttle. Throttling doesn't prevent requests from coming into the shard that's throttled, so when the shard wakes up from being throttled, it has even more work to do than it had before it throttled, causing it to use even more CPU and throttle more quickly, which causes even more work to pile up. Finagle has a mechanism that can shed load for shards that are in very bad shape (clients that talk to the dead server will mark the server as dead and stop sending requests for a while), but shards tend to get into this bad state when overall load to the service is high, so marking a node as dead just means that more load goes to other shards, which will then &quot;want to&quot; enter a throttling death spiral. Operating in a regime where throttling can cause a death spiral is <a href="https://twitter.com/copyconstruct/status/1399766443596472320">an inherently metastable state</a>. Removing both of these issues arguably has as large an impact as the cost reduction we see from eliminating throttling.</p> <p>Xi Yang has experimented with variations on the naive kernel scheduler change mentioned above, but even the naive change seems to be quite effective compared to no change, even though the naive change does mean that services will often not be able to hit their full CPU allocation when they ask for it, e.g., if a service requests no CPU for the first half of a period and then requests infinite CPU for the second half of the period, under the old system, it would get its allocated amount of CPU for the period, but under the new system, it would only get half. Some of Xi's variant patches address this issue in one way or another, but that has a relatively small impact compared to preventing throttling in the first place.</p> <p>An independent change that Pratik Tandel drove, which reduced the impact of throttling on services by reducing the impact of variance between shards, was to move to fewer, larger shards. The main goal for that change was to reduce overhead due to duplicate work/memory that happens across all shards, but it also happens to have an impact due to larger per-shard quotas reducing the impact of random noise.
Overall, this resulted in 0% to 20% reduced CPU usage and 10% to 40% reduced memory usage for large services at Twitter, depending on the service.</p> <h3 id="appendix-other-container-throttling-related-work">Appendix: other container throttling related work</h3> <ul> <li><a href="https://engineering.indeedblog.com/blog/2019/12/cpu-throttling-regression-fix/">https://engineering.indeedblog.com/blog/2019/12/cpu-throttling-regression-fix/</a></li> <li>Adding burstiness <ul> <li><a href="https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangcong@gmail.com/">https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangcong@gmail.com/</a></li> <li><a href="https://lkml.org/lkml/2019/11/26/196">https://lkml.org/lkml/2019/11/26/196</a></li> <li><a href="https://lwn.net/Articles/840595/">https://lwn.net/Articles/840595/</a></li> <li>A container that exceeds its allocation will still throttle, but the idea of &quot;burst capacity&quot; is added, allowing more margin before throttling while keeping basically the same average core utilization <ul> <li>Allowing burstiness is independent of our fix, which prevents throttling and, in principle, both ideas could be applied at the same time, which would be somewhat like how network isolation works if you enable htb qdisc</li> <li>Given the workloads and configurations that Twitter has, this does not fix the throttling problem for us with respect to either achieving very high per-container CPU utilization or preventing the metastability caused by the threat of a throttling death spiral, although it does allow us to use slightly more average CPU than without enabling burstiness</li> </ul></li> </ul></li> <li>Runtime-level parallelism limiting <ul> <li>Since Go typically uses a single thread pool, Uber was able to work around this issue by limiting the maximum number of running goroutines via <a href="https://github.com/uber-go/automaxprocs">https://github.com/uber-go/automaxprocs</a> <ul> <li>Unfortunately for Twitter, a number of Twitter's largest and most expensive services, including <code>service-1</code>, use multiple language runtimes, so there isn't a simple way to bound the parallelism at the runtime level</li> </ul></li> <li>The .NET runtime has had adaptive thread pool sizes for a decade, <a href="https://twitter.com/danluu/status/1340059907026898944">one of the many ways the .NET stack is more advanced than what we commonly see at trendy tech companies</a></li> </ul></li> </ul> <p><i>Thanks to Xi Yang, Ilya Pronin, Ian Downes, Rebecca Isaacs, Brian Martin, Vladimir Kostyukov, Moses Nakamura, Flavio Brasil, Laurence Tratt, Akshay Shah, Julian Squires, Michael Greenberg @synrotek, and Miguel Angel Corral for comments/corrections/discussion</i></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">if this box is highly loaded, because there aren't enough cores to go around, then a container may not get all of the cores it requests, but this doesn't change the fundamental problem. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:I">I often joke that <a href="https://twitter.com/danluu/status/1324416895013986305">interns get all of the most interesting work</a>, while us full-time employees are stuck with the stuff interns don't want to do.
<a class="footnote-return" href="#fnref:I"><sup>[return]</sup></a></li> <li id="fn:M">In an independent effort, Matt Tejo found that, for a fixed average core utilization, services that throttle cause a much larger negative impact on other services on the same host than services that use a constant number of cores. That's because a service that's highly loaded and throttling toggles between attempting to use all of the cores on the box and then using none of the cores on the box, causing an extremely large amount of interference during the periods where it's attempting to use all of the cores on the box. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> </ol> </div> Some thoughts on writing writing-non-advice/ Mon, 13 Dec 2021 00:00:00 +0000 writing-non-advice/ <p>I see a lot of essays framed as writing advice which are actually thinly veiled descriptions of how someone writes that basically say &quot;you should write how I write&quot;, e.g., <a href="https://twitter.com/danluu/status/1437539076324790274">people who write short posts say that you should write short posts</a>. As with technical topics, <a href="learn-what/">I think a lot of different things can work and what's really important is that you find a style that's suitable to you and the context you operate in</a>. <a href="https://twitter.com/danluu/status/1467235582199812097">Copying what's worked for someone else is unlikely to work for you</a>, making &quot;write how I write&quot; bad advice.</p> <p>We'll start by looking at how much variety there's been in what's worked<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">1</a></sup> for people, come back to what makes it so hard to copy someone else's style, and then discuss what I try to do in my writing.</p> <p>If I look at the most read programming blogs in my extended social circles<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">2</a></sup> from 2000 to 2017<sup class="footnote-ref" id="fnref:7"><a rel="footnote" href="#fn:7">3</a></sup>, it's been Joel Spolsky, Paul Graham, Steve Yegge, and Julia Evans (if you're not familiar with these writers, <a href="#appendix-some-snippets-of-writing">see the appendix for excerpts that I think are representative of their styles</a>). Everyone on this list has a different style in the following dimensions (as well as others):</p> <ul> <li>Topic selection</li> <li>Prose style</li> <li>Length</li> <li>Type of humor (if any)</li> <li>Level of technical detail</li> <li>Amount of supporting evidence</li> <li>Nuance</li> </ul> <p>To pick a simple one to quantify, length, Julia Evans and I both started blogging in 2013 (she has one post from 2012, but she's told me that she considers her blog to have started in earnest when she was at <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>, in September 2013, the same month I started blogging). Over the years, we've compared notes a number of times and, until I paused blogging at the end of 2017, we had a similar word count on our blogs even though she was writing roughly one order of magnitude more posts than I do.</p> <p>To look at a few aspects that are difficult to quantify, consider this passage from Paul Graham, which is typical of his style:</p> <blockquote> <p>What nerds like is the kind of town where people walk around smiling. This excludes LA, where no one walks at all, and also New York, where people walk, but not smiling. 
When I was in grad school in Boston, a friend came to visit from New York. On the subway back from the airport she asked &quot;Why is everyone smiling?&quot; I looked and they weren't smiling. They just looked like they were compared to the facial expressions she was used to.</p> <p>If you've lived in New York, you know where these facial expressions come from. It's the kind of place where your mind may be excited, but your body knows it's having a bad time. People don't so much enjoy living there as endure it for the sake of the excitement. And if you like certain kinds of excitement, New York is incomparable. It's a hub of glamour, a magnet for all the shorter half-life isotopes of style and fame.</p> <p>Nerds don't care about glamour, so to them the appeal of New York is a mystery.</p> </blockquote> <p>It uses multiple aspects of what's sometimes called <a href="https://amzn.to/3dRLMgR">classic style</a>. In this post, when I say &quot;classic style&quot;, I mean as the term is used by <a href="https://amzn.to/3dRLMgR">Thomas &amp; Turner</a>, not a colloquial meaning. What that means is really too long to reasonably describe in this post, but I'll say that one part of it is that the prose is clean, straightforward, and simple; an editor whose slogan is &quot;omit needless words&quot; wouldn't have many comments. Another part is that the cleanness of the style goes past the prose to what information is presented, so much so that supporting evidence isn't really presented. Thomas &amp; Turner say &quot;truth needs no argument but only accurate presentation&quot;. A passage that exemplifies both of these is this one from Rochefoucauld:</p> <blockquote> <p>Madame de Chevreuse had sparkling intelligence, ambition, and beauty in plenty; she was flirtatious, lively, bold, enterprising; she used all her charms to push her projects to success, and she almost always brought disaster to those she encountered on her way.</p> </blockquote> <p>Thomas &amp; Turner said this about Rochefoucauld's passage:</p> <blockquote> <p>This passage displays truth according to an order that has nothing to do with the process by which the writer came to know it. The writer takes the pose of full knowledge. This pose implies that the writer has wide and textured experience; otherwise he would not be able to make such an observation. But none of that personal history, personal experience, or personal psychology enters into the expression. Instead the sentence crystallizes the writer’s experience into a timeless and absolute sequence, as if it were a geometric proof.</p> </blockquote> <p>Much of this applies to the passage by Paul Graham (though not all, since he tells us an anecdote about a time a friend visited Boston from New York and he explicitly says that you would know such and such &quot;if you've lived in New York&quot; instead of just stating what you would know).</p> <p>My style is opposite in many ways. I often have long, meandering, sentences, not for any particular literary purpose, but just because it reflects how I think. <a href="https://amzn.to/3s0Adwd">Strunk &amp; White</a> would have a field day with my writing. To the extent feasible, I try to have a structured argument and, when possible, evidence, with caveats for cases where the evidence isn't applicable.
Although not presenting evidence makes something read cleanly, that's not my choice because I don't like that the reader basically has to take or leave it with respect to bare assertions, such as &quot;what nerds like is the kind of town where people walk around smiling&quot;, and would prefer that readers know why I think something so they can agree or disagree based on the underlying reasons.</p> <p>With length, style, and the other dimensions mentioned, there isn't a right way and a wrong way. A wide variety of things can work decently well. Though, if popularity is the goal, then I've probably made a sub-optimal choice on length compared to Julia and on prose style when compared to Paul. If I look at what causes other people to gain a following, and what causes my RSS feed to get more traffic or me to get more Twitter followers, etc., publishing short posts frequently looks more effective than publishing long posts less frequently.</p> <p>I'm less certain about the impact of style on popularity, but my feeling is that, for the same reason that making a lot of confident statements at a job works (gets people promoted), writing confident, unqualified, statements, works (gets people readers). People like confidence.</p> <p>But, in both of these cases, one can still be plenty popular while making a sub-optimal choice and, for me, I view optimizing for other goals to be more important than optimizing for popularity. On length, I frequently cover topics that can't be covered in brief easily, or perhaps at all. One example of this is <a href="branch-prediction/">my post on branch prediction</a>, which has two goals: give a programmer with no background in branch prediction or even computer architecture a historical survey and teach them enough to be able to read and understand a modern, state-of-the-art paper on branch prediction. That post comes in at 5.8k words. I don't see how to achieve the same goals with a post that comes in at the lengths that people recommend for blog posts, 500 words, 1000 words, 1500 words, etc. The post could probably be cut down a bit, but every predictor discussed, except the <code>agree</code> predictor, is either a necessary building block used to explain later predictors or of historical importance. And even if the <code>agree</code> predictor weren't discussed, it would still be important to discuss at least one interference-reducing scheme, since why interference occurs and what can be done to reduce it is a fundamental concept in branch prediction.</p> <p>There are other versions of the post that could work. One that explains that branch prediction exists at all could probably be written in 1000 words. That post, written well, would have a wider audience and be more popular, but that's not what I want to write.</p> <p>I have an analogous opinion on style because I frequently want to discuss things in a level of detail and with a level of precision that precludes writing cleanly in the classic style. A specific, small, example is that, on a recent post, a draft reader asked me to remove a double negative and I declined because, in that case, the double negative had different connotations from the positive statement that might've replaced it and I had something precise I wanted to convey that isn't what would've been conveyed if I simplified the sentence.</p> <p>A more general thing is that Paul writes about a lot of &quot;big ideas&quot; at a high level. That's something that's amenable to writing in a clean, simple style; what Paul calls an elegant style.
But I'm not interested in writing about <a href="https://scattered-thoughts.net/writing/on-bad-advice/#context-matters">big ideas that are disconnected from low-level details</a> and it's difficult to effectively discuss low-level details without writing in a style Paul would call inelegant.</p> <p>A concrete example of this is <a href="cli-complexity/">my discussion of command line tools and the UNIX philosophy</a>. Should we have tools that &quot;do one thing and do it well&quot; and &quot;write programs to handle text streams, because that is a universal interface&quot; or use commands that have many options and can handle structured data? People have been trading the same high-level rebuttals back and forth for decades. But the moment we look at the details, look at what happens when these ideas get exposed to the real world, we can immediately see that one of these sets of ideas couldn't possibly work as espoused.</p> <p>Coming back to writing style, if you're trying to figure out what stylistic choices are right for you, you should start from your goals and what you're good at and go from there, not listen to somebody who's going to tell you to write like them. Besides being unlikely to work for you even if someone is able to describe what makes their writing tick, most advice is written by people who don't understand how their writing works. This may be difficult to see for writing if you haven't spent a lot of time analyzing writing, but it's easy to see this is true if you've taken a bunch of dance classes or had sports instruction that isn't from a very good coach. If you watch, for example, the median dance instructor and listen to their instructions, you'll see that their instructions are quite different from what they actually do. <a href="https://news.ycombinator.com/item?id=29524840">People who listen and follow instructions instead of attempting to copy what the instructor is doing will end up doing the thing completely wrong</a>. Most writing advice similarly fails to capture what's important.</p> <p>Unfortunately, <a href="https://www.youtube.com/watch?v=2THVvshvq0Q">copying someone else's style isn't easy either; most people copy entirely the wrong thing</a>. For example, Natalie Wynn noted that people who copy her style often copy the superficial bits without understanding what's driving the superficial bits to be the way they are:</p> <blockquote> <p>One thing I notice is when people aren’t saying anything. Like when someone’s trying to do a “left tube video essay” and they shove all this opulent shit onscreen because contrapoints, but it has nothing to do with the topic. What’s the reference? What are you saying??</p> <p>I made a video about shame, and the look is Eve in Eden because Eve was the first person to experience shame. So the visual is connected to the concept and hopefully it resonates more because of that. So I guess that’s my advice, try to say something</p> </blockquote> <p>If you look into what people who excel in their field have to say, you'll often see analogous remarks about other fields. For example, in Practical Shooting, Rob Leatham says:</p> <blockquote> <p>What keeps me busy in my classes is trying to help my students learn how to think. They say, &quot;Rob holds his hands like this...,&quot; and they don't know that the reason I hold my hands like this is not to make myself look that way. 
The end result is not to hold the gun that way; holding the gun that way is the end result of doing something else.</p> </blockquote> <p>And Brian Enos says:</p> <blockquote> <p>When I began ... shooting I had only basic ideas about technique. So I did what I felt was the logical thing. I found the best local shooter (who was also competitive nationally) and asked him how I should shoot. He told me without hesitation: left index finger on the trigger guard, left elbow bent and pulling back, classic boxer stance, etcetera, etcetera. I adopted the system blindly for a year or two before wondering whether there might be a system that better suited my structure and attitude, and one that better suited the shooting. This first style that I adopted didn't seem to fit me because it felt as though I was having to struggle to control the gun; I was never actually flowing with the gun as I feel I do now. My experimentation led me to pull ideas from all types of shooting styles: Isosceles, Modified Weaver, Bullseye, and from people such as Bill Blankenship, shotgunner John Satterwhite, and martial artist Bruce Lee.</p> <p>But ideas coming from your environment only steer you in the right direction. These ideas can limit your thinking by their very nature ... great ideas will arise from a feeling within yourself. This intuitive awareness will allow you to accept anything that works for you and discard anything that doesn't</p> </blockquote> <p>I'm citing those examples because they're written up in a book, but I've heard basically the same comment from instructors in a wide variety of activities, e.g., dance instructors I've talked to complain that people will ask about whether, during a certain motion, the left foot should cross in front of or behind the right foot, which is missing the point since what matters is that the foot placement is reasonable given how the person's center of gravity is moving, which may mean that the foot should cross in front or behind, depending on the precise circumstance.</p> <p>The more general issue is that a person who doesn't understand the thing they're trying to copy will end up copying unimportant superficial aspects of what somebody else is doing and miss the fundamentals that drive the superficial aspects. <a href="hardware-unforgiving/">This even happens when there are very detailed instructions</a>. Although watching what other people do can accelerate learning, especially for beginners who have no idea what to do, there isn't a shortcut to understanding something deeply enough to facilitate doing it well that can be summed up in simple rules, like &quot;omit needless words&quot;<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">4</a></sup>.</p> <p>As a result, I view style as something that should fall out of your goals, and goals are ultimately a personal preference.
Personally, some goals that I sometimes have are:</p> <ul> <li>Explain a technical topic that a lot of people don't seem to understand at a level that's accessible to almost any professional programmer <ul> <li>Examples: <a href="branch-prediction/">branch prediction</a>, <a href="malloc-tutorial/">malloc</a>, <a href="intel-cat/">cache partitioning</a></li> </ul></li> <li>Make a case for a minority opinion (or one that was a minority opinion at the time, anyway): <ul> <li>Examples: <a href="deconstruct-files/">files are difficult to use</a>, <a href="startup-tradeoffs/">public tech companies can pay very well</a>, <a href="monorepo/">monorepos aren't stupid</a></li> </ul></li> <li><a href="why-benchmark/">Measure something</a></li> <li>Discuss phenomena I think are interesting: <ul> <li>Examples: <a href="discontinuities/">funny discontinuities</a>, <a href="hardware-unforgiving/">the difficulty of knowledge transfer</a>, <a href="wat/">normalization of deviance</a></li> </ul></li> </ul> <p>When you combine one of those goals with the preference of discussing things <a href="http://johnsalvatier.org/blog/2017/reality-has-a-surprising-amount-of-detail">in detail</a>, you get a style that's different from any of the writers mentioned above, even if you want to use humor as effectively as Steve Yegge, write for as broad an audience as Julia Evans, or write as authoritatively as Paul Graham.</p> <p>When I think about the major components of my writing, the main thing that I view as driving how I write besides style &amp; goals is process. As with style, I view this as something where a wide variety of things can work, where it's up to you to figure out what works for you.</p> <p><a id="friction"></a>For myself, I had the following process goals when I started my blog:</p> <ul> <li>Low up-front investment, with as little friction as possible, maybe increasing investment over time if I continue blogging</li> <li>Improve writing technique/ability with each post without worrying too much about the writing quality of any specific post</li> <li>Only publish when I have something I feel is worth publishing</li> <li>Write a blog that I would want to subscribe to</li> <li>Write on my own platform</li> </ul> <p>The low up-front investment goal is because, when I surveyed blogs I'd seen, one of the most common blog formats was a blog that contained a single post explaining that the person was starting a blog, perhaps with another post explaining how their blog was set up, with no further posts. Another common blog format was blogs that had regular posts for a while, followed by a long dormant period with a post at the end explaining that they were going to start posting again, followed by no more posts (in some cases, there are a few such posts, with more time between each). Given the low rate of people continuing to blog after starting a blog, I figured I shouldn't bother investing in blog infra until I knew I was going to write for a while, so, even though I already owned this domain name, I didn't bother figuring out how to point this domain at github pages and just set up a default install of some popular blogging software and I didn't even bother doing that until I had already written a post. In retrospect, it was a big mistake to use Octopress (Jekyll); I picked it because I was hanging out with a bunch of folks who were doing trendy stuff at the time, but the fact that it was so annoying to set up that people organized little &quot;Octopress setup days&quot; was a bad sign.
And it turns out that, not only was it annoying to set up, it had a fair amount of breakage, used a development model that made it impossible to take upstream updates, and it was extremely slow (it didn't take long before it took a whole minute to build my blog, a ridiculous amount of time to &quot;compile&quot; a handful of blog posts). I should've either just written pure HTML until I had a few posts and then turned that into <a href="https://twitter.com/danluu/status/1244023627613274115">a custom static site generator</a>, or used <a href="https://wordpress.com/refer-a-friend/34nAGAYt06ZlZRYBjx1/">WordPress</a>, which can be spun up in minutes and trivially moved or migrated from. But, part of the low up-front investment involved not doing research into this and trusting that people around me were making reasonable decisions<sup class="footnote-ref" id="fnref:G"><a rel="footnote" href="#fn:G">5</a></sup>. Overall, I stand behind the idea of keeping startup costs low, but had I just ignored all of the standard advice and either done something minimal or used the out-of-fashion but straightforward option, I would've saved myself a lot of work.</p> <p>The &quot;improve writing&quot; goal is because I found my writing annoyingly awkward and wanted to fix that. I frequently wrote sentences or paragraphs that seemed clunky to me, like when you misspell a word and it looks wrong no matter how you try re-spelling it. Spellcheckers are now ubiquitous enough that you don't really run into the spelling problem anymore, but we don't yet have automated tools that will improve your writing (some attempts exist, but they tend to create bad writing). I didn't worry about any specific post since I figured I could easily spend years working on my writing and I didn't think that spending years re-editing a single post would be very satisfying.</p> <p><a href="p95-skill/">As we've discussed before, getting feedback can greatly speed up skill acquisition</a>, so I hired a professional editor whose writing I respect with the instruction &quot;My writing is clunky and awkward and I'd like to fix it. I don't really care about spelling and grammar issues. Can you edit my writing with that in mind?&quot;. I got detailed feedback on a lot of my posts. I tried to fix the issues brought up in the feedback but, more importantly, tried to write my next post without it having the same or other previously mentioned issues. I can be a bit of a slow learner, so it sometimes took a few posts to iron out an issue but, over time, my writing improved a lot.</p> <p>The goal of only publishing when I felt I had something worth publishing is because I generally prefer process goals to outcome goals, at least with respect to personal goals. I originally had a goal of spending a certain amount of time per month blogging, but I got rid of that when I realized that I'd tend to spend enough time writing regardless of whether or not I made it an obligation. I think that outcome goals with respect to blogging do work for some people (e.g., &quot;publish one post per week&quot;), but if your goal is to improve writing quality, having outcome goals can be counterproductive (e.g., to hit a &quot;publish one post per week&quot; goal on limited time, someone might focus on getting something out the door and then not think about how to improve quality since, from the standpoint of the outcome goal, improving quality is a waste of time).
There are a bunch of things I don't like in other blogs, so I try to avoid them. Some examples:</p> <ul> <li>Breaking up what could be a single post into a bunch of smaller posts</li> <li>Clickbait titles</li> <li>Repeatedly blogging about the same topic with nothing new to say <ul> <li>A sub-category of this is having some kind of belief and then blogging about it every time a piece of evidence shows up that confirms the belief while not mentioning evidence that shows up that disconfirms the belief</li> </ul></li> <li>Not having an RSS or atom feed</li> </ul> <p>Writing on my own platform is the most minor of these. A major reason for that comes out of what's happened to platforms. At the time I started my blog, a number of platforms had already come and gone. Most recently, Twitter had acquired Posterous and shut it down. For a while, Posterous was the trendiest platform around and Twitter's decision to kill it entirely broke links to many of the all-time top voted HN posts, among others. Blogspot, a previously trendy place to write, had also been acquired by Google and severely degraded the reader experience on many sites afterwards. Avoiding trendy platforms has worked out well. <a id="svbtle">The two trendy platforms</a> people were hopping on when I started blogging were <a href="https://news.ycombinator.com/item?id=4268832">Svbtle</a> and Medium. <a href="https://www.designernews.co/stories/44300-is-svbtle-dead">Svbtle was basically abandoned shortly after I started my blog</a> when it became clear that Medium was going to dominate Svbtle on audience size. And Medium never managed to find a good monetization strategy and severely degraded the user experience for readers in an attempt to generate enough revenue to justify its valuation after raising $160M. You can't trust someone else's platform to not disappear underneath you or radically change in the name of profit.</p> <p>A related thing I wanted to do was write in something that's my own space (as opposed to in internet comments). I used to write a lot of HN comments<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">6</a></sup>, but the half-life of an HN comment is short. With very few exceptions, basically all of the views a comment is going to get will be in the first few days. With a blog, it's the other way around. A post might get a burst of traffic initially but, as long as you keep writing, most traffic will come later (e.g., for my blog, I tend to get roughly twice as many hits as the baseline level when a post is on HN, and of course I don't have a post on HN most days). It isn't really much more work to write a &quot;real blog post&quot; instead of writing an HN comment, so I've tended to favor writing blog posts instead of HN comments. Also, when I write here, most of the value created is split between myself and readers. If I were to write on someone else's platform, most of the value would be split between the platform and readers. If I were doing video, I might not really have a choice outside of YouTube or Twitch but, for text, I have a real choice. Looking at <a href="http://exquora.thoughtstorms.info/">how things worked out for people who made the other choice and decided to write comments for a platform</a>, I think I made the right choice for the right reasons.
I do see <a href="https://twitter.com/foone/status/1066547904532242437">the appeal of the reduced friction that commenting on an existing platform offers</a> but, even so, I'd rather pay the cost of the extra friction and write something that's in my space instead of elsewhere.</p> <p>All of that together is basically it. That's how I write.</p> <p>Unlike other bloggers, I'm not going to try to tell you &quot;how to write usefully&quot; or &quot;how to write well&quot; or anything like that. I agree with Steve Yegge when he says that <a href="https://sites.google.com/site/steveyegge2/you-should-write-blogs">you should consider writing because it's potentially high value and the value may show up in ways you don't expect</a>, but how you write should really come from your goals and aptitudes.</p> <h4 id="appendix-changes-in-approach-over-time">Appendix: changes in approach over time</h4> <p>When I started the blog, I used to worry that a post wouldn't be interesting enough because it only contained a simple idea, so I'd often wait until I could combine two or more ideas into a single post. In retrospect, I think many of my early posts would've been better off as separate posts. For example, <a href="bimodal-compensation/">this post on compensation</a> from 2016 contains the idea that compensation might be turning bimodal and that programmers are unbelievably well paid given the barriers to entry compared to other fields that are similarly remunerative, such as finance, law, and medicine. I don't think there was much value-add to combining the two ideas into a single post and I think a lot more people would've read the bit about how unusually well paid programmers are if it wasn't bundled into a post about compensation becoming bimodal.</p> <p>Another thing I used to do is avoid writing things that seem too obvious. <a href="https://www.patreon.com/posts/58713950">But, I've come around to the idea that there's a lot of value in writing down obvious things</a> and a number of my most influential posts have been on things I would've previously considered too obvious to write down:</p> <ul> <li><a href="look-stupid/">look-stupid/</a></li> <li><a href="people-matter/">people-matter/</a></li> <li><a href="culture/">culture/</a></li> <li><a href="learn-what/">learn-what/</a></li> <li><a href="productivity-velocity/">productivity-velocity/</a></li> <li><a href="in-house/">in-house/</a></li> </ul> <p>Excluding these recent posts, more people have told me that <a href="look-stupid/">look-stupid/</a> has changed how they operate than all other posts combined (and the only reason it's even close is that a lot of people have told me that my discussions of compensation caused them to realize that they can find a job they enjoy more that also pays hundreds of thousands a year more than they were previously making, which is the set of posts that's drawn the most comments from people telling me that the post was pointless because everybody knows how much you can make in tech).</p> <p>A major, and relatively recent, style change I'm trying out is using more examples. This was prompted by comments from Ben Kuhn, and I like it so far.
Compared to most bloggers, <a href="https://twitter.com/benskuhn/status/1431671165320376327">I wasn't exactly light on examples in my early days</a>, but one thing I've noticed is that adding more examples than I would naturally tend to can really clarify things for readers; having &quot;a lot&quot; of examples reduces the rate at which people take away wildly different ideas than the ones I meant. A specific example of this would be, in a post discussing <a href="p95-skill/">what it takes to get to 95%-ile performance</a>, I only provided a couple examples and <a href="corrections/#p95">many people filled in the blanks and thought that performance that's well above 99.9%-ile is 95%-ile, e.g., that being a chess GM</a> is 95%-ile.</p> <p>Another example of someone who's made this change is Jamie Brandon. If you read his early posts, <a href="https://www.scattered-thoughts.net/writing/imperative-thinking-and-the-making-of-sandwiches/">such as this one</a>, he often has a compelling idea with a nice turn of phrase, e.g., this bit about when he was working on Eve with Chris Granger:</p> <blockquote> <p>People regularly tell me that imperative programming is the natural form of programming because 'people think imperatively'. I can see where they are coming from. Why, just the other day I found myself saying, &quot;Hey Chris, I'm hungry. I need you to walk into the kitchen, open the cupboard, take out a bag of bread, open the bag, remove a slice of bread, place it on a plate...&quot; Unfortunately, I hadn't specified where to find the plate so at this point Chris threw a null pointer exception and died.</p> </blockquote> <p>But, despite having parts that are really compelling, his earlier writing was often somewhat disconnected from the real world in a way that Jamie doesn't love when looking back on his old posts. On adding more details, Jamie says</p> <blockquote> <p>The point of focusing down on specific examples and keeping things as concrete as possible is a) makes me less likely to be wrong, because non-concrete ideas are very hard to falsify and I can trick myself easily b) makes it more likely that the reader absorbs the idea I'm trying to convey rather than some superficially similar idea that also fits the vague text.</p> <p>Examples kind of pin ideas down so they can be examined properly.</p> </blockquote> <p>Another big change, the only one I'm going to discuss here that really qualifies as prose style, is that I try much harder to write things where there's continuity of something that's sometimes called &quot;narrative grammar&quot;. <a href="https://web.archive.org/web/20180611151516/http://www.sterlingediting.com/narrative-grammar-an-exercise/">This post by Nicola Griffith has some examples of this at the sentence level</a>, but I also try to think about this in the larger structure of my writing. I don't think I'm particularly good at this, but thinking about this more has made my writing easier to follow. This change, especially on larger scales, was really driven by working with a professional editor who's good at spotting structural issues that make writing more difficult to understand. But, at the same time, I don't worry too much if there's a reason that something is difficult to follow. 
A specific example of this is, if you read answers to questions on <a href="https://ask.metafilter.com">ask metafilter</a> or reddit, any question that isn't structurally trivial will have a large fraction of answers from people who failed to read the question and answered the wrong question, e.g., if someone asks for something that has two parts connected with an <code>and</code>, many people will only read one half of the <code>and</code> and give an answer that's clearly disqualified by the <code>and</code> condition. If many people aren't going to read a short question closely enough to write up an answer that satisfies both halves of an <code>and</code>, many people aren't going to follow the simplest things anyone might want to write. I don't think it's a good use of a writer's time to try to walk someone who can't be bothered with reading both sides of an <code>and</code> through a structured post, but I do think there's value in trying to avoid &quot;narrative grammar&quot; issues that might make it harder for someone who does actually want to read.</p> <h4 id="appendix-getting-feedback">Appendix: getting feedback</h4> <p><a href="p95-skill/">As we've previously discussed</a>, feedback can greatly facilitate improvement. Unfortunately, the idea from that post, that 95%-ile performance is generally poor, also applies to feedback, making most feedback counterproductive.</p> <p>I've spent a lot of time watching people get feedback in private channels and seeing how they change their writing in response to it and, at least in the channels that I've looked at (programmers and not professional writers or editors commenting), most feedback is ignored. And when feedback is taken, because almost all feedback is bad and people generally aren't perfect or even very good at picking out good feedback, the feedback that's taken is usually bad.</p> <p>Fundamentally, most feedback has the issue mentioned in this post and is a form of &quot;you should write it like I would've written it&quot;, which generally doesn't work unless the author of the feedback is very careful in how they give the feedback, which few people are. The feedback tends to be superficial advice that misses serious structural issues in writing. Furthermore, the feedback also tends to be &quot;lowest common denominator&quot; feedback that turns nice prose into Strunk-and-White-ified mediocre prose. I don't think that I have a particularly nice prose style, but I've seen a number of people who have a naturally beautiful style ask for feedback from programmers, which has turned their writing into boring prose that anyone could've written.</p> <p>The other side of this is that when people get what I think is good, substantive, feedback, the most common response is &quot;nah, it's fine&quot;. I think of this as the flip side of most feedback being &quot;you should write it how I'd write it&quot;. Most people's response to feedback is &quot;I want to write it how I want to write it&quot;.</p> <p>Although this post has focused on how a wide variety of styles can work, it's also true that, given a style and a set of goals, writing can be better or worse. But, most people who are getting feedback <a href="https://twitter.com/danluu/status/1428445465662603272">don't know enough about writing to know what's better and what's worse</a>, so they can't tell the difference between good feedback and bad feedback.</p> <p>One way around this is to get feedback from someone whose judgement you trust.
As mentioned in the post, the way I did this was by hiring a professional editor whose writing (and editing) I respected.</p> <p>Another thing I do, one that's a core aspect of my personality and not really about writing, is that I take feedback relatively seriously and try to avoid having a &quot;nah, it's fine&quot; response to feedback. I wouldn't say that this is optimal since I've sometimes spent far too much time on bad feedback, but a core part of how I think is that I'm aware that most people are overconfident and frequently wrong because of their overconfidence, so I don't trust my own reasoning and spend a relatively large amount of time and effort thinking about feedback in an attempt to reduce my rate of overconfidence.</p> <p>At times, I've spent a comically long amount of time mulling over what is, in retrospect, very bad and &quot;obviously&quot; incorrect feedback that I've been wary of dismissing as incorrect. One thing I've noticed is that, as people gain an audience, some people become more and more confident in themselves and eventually end up becoming highly overconfident. It's easy to see how this happens — as you gain prominence, you'll get more exposure and more &quot;fans&quot; who think you're always right and, on the flip side, you'll also get more &quot;obviously&quot; bad comments.</p> <p>Back when basically no one read my blog, most of the comments I got were quite good. As I've gotten more and more readers, the percentage of good comments has dropped. From looking at how other people handle this, one common failure mode is that they'll see the massive number of obviously wrong comments that their posts draw and then incorrectly conclude that all of their critics are bozos and that they're basically never wrong. I don't really have an antidote to that other than &quot;take criticism very seriously&quot;. Since the failure mode here involves blind spots in judgement, I don't see a simple way to take a particular piece of criticism seriously that doesn't have the potential to result in incorrectly dismissing the criticism due to a blind spot.</p> <p>Fundamentally, my solution to this has been to avoid looking at most feedback while trying to take feedback from people I trust.</p> <p>When it comes to issues with the prose, the approach discussed above, hiring a professional editor whose writing and editing I respect and deferring to them on issues with my prose, worked well.</p> <p>When it comes to logical soundness or just general interestingness, those are more difficult to outsource to a single person, and I have <a href="https://twitter.com/Aella_Girl/status/1453936895088488449">a set of people whose judgement I trust</a> who look at most posts. If anyone whose judgement I trust thinks a post is interesting, I view that as a strong confirmation and I basically ignore comments that something is boring or uninteresting. For almost all of my posts that are among my top posts in terms of the number of people who told me the post was life changing for them, I got a number of comments from people whose judgement I otherwise think isn't terrible saying that the post seemed boring, pointless, too obvious to write, or just plain uninteresting. I used to take comments that something was uninteresting seriously but, in retrospect, that was a mistake that cost me a lot of time and didn't improve my writing.
I think this isn't so different from people who say &quot;write how I write&quot;; instead, it's people who have a similar mental model, but with respect to interestingness instead, who can't imagine that other people would find something interesting that they don't. Of course, not everyone's mind works like that, but people who are good at modeling what other people find interesting generally don't leave feedback like &quot;this is boring/pointless&quot;, so feedback of that form is almost guaranteed to be worthless.</p> <p>When it comes to the soundness of an argument, I take the opposite approach to the one I take for interestingness, in that I take negative comments very seriously and I don't do much about positive comments. I have, sometimes, wasted a lot of time on particular posts because of that. My solution to that has been to try to ignore feedback from people who regularly give bad feedback. That's something I think of as dangerous to do since selectively choosing to ignore feedback is a good way to create an echo chamber, but really seriously taking the time to think through feedback when I don't see a logical flaw is time consuming enough that I don't think there's really another alternative given how I re-evaluate my own work when I get feedback.</p> <p>One thing I've started doing recently that's made me feel a lot better about this is to look at what feedback people give to others. People who give me bad feedback generally also give other people feedback that's bad in pretty much exactly the same ways. Since I'm not really concerned that I have some cognitive bias that might mislead me into thinking I'm right and their feedback is wrong when it comes to their feedback on other people's writing, instead of spending hours trying to figure out if there's some hole in how I'm explaining something that I'm missing, I can spend minutes seeing that their feedback on someone else's writing is bogus feedback and then see that their feedback on my writing is bogus in exactly the same way.</p> <h4 id="appendix-where-i-get-ideas">Appendix: where I get ideas</h4> <p>I often get asked how I get ideas. I originally wasn't going to say anything about this because I don't have much to say, but Ben Kuhn strongly urged me to add this section &quot;so that other people realize what an alien you are&quot;.</p> <p>My feeling is that the world is so full of interesting stuff that ideas are everywhere. I have on the order of a hundred drafts lying around that I think are basically publishable that I haven't prioritized finishing up for one reason or another. If I think of ideas where I've sketched out a post in my head but haven't written it down, the number must be well into the thousands. If I were to quit my job and then sit down to write full-time until I died, I think I wouldn't run out of ideas even if I stuck to ones I've already had. The world is big and wondrous and fractally interesting.</p> <p>For example, I recently took up surf skiing (a kind of kayaking) and I'd say that, after a few weeks, I had maybe twenty or so blog post ideas that I think could be written up for a general audience in the sense that <a href="branch-prediction/">this post on branch prediction</a> is written for a general audience, in that it doesn't assume any hardware background. I could write two posts on different technical aspects of canoe paddle evolution and design as well as two posts on cultural factors and how they impacted the uptake of different canoe paddle designs.
Kayak paddle design has been, in recent history, a lot richer, and that could easily be another five or six posts. The technical aspects of hull design are richer still and could be an endless source of posts, although I only have four particular posts in mind at the moment, but the cultural and historical aspects also seem interesting to me and that's what rounds out the twenty things in my head with respect to that.</p> <p>I don't have twenty posts on kayaking and canoeing in my head because I'm particularly interested in kayaking and canoeing. Everything seems interesting enough to write twenty posts about. A lot of my posts that exist are part of what might become a much longer series of posts if I ever get around to spending the time to write them up. For example, <a href="bad-decisions/">this post on decision making in baseball</a> was, in my head, the first of a long-ish (10+) post series on decision making that I never got around to writing and that I suspect I'll never write because there's too much other interesting stuff to write about and not enough time.</p> <h4 id="appendix-other-writing-about-writing">Appendix: other writing about writing</h4> <ul> <li>Richard Lanham: <a href="https://amzn.to/3DTiVU2">Analyzing Prose</a> <ul> <li>I think it's not easy to take anything directly actionable away from this book, but I found the way that it dissects the rhythm of prose to be really interesting</li> </ul></li> <li>Robert Alter: <a href="https://amzn.to/3oOs3VT">The Five Books of Moses</a> <ul> <li>For the footnotes on why Robert Alter made certain subtle choices in his translation</li> </ul></li> <li>Francis-Noel Thomas &amp; Mark Turner: <a href="https://amzn.to/3ymHFDh">Clear and Simple as the Truth</a> <ul> <li>If you want to write in a clean, authoritative, style <ul> <li>People who use this as a manual typically write in an unnuanced fashion with a lot of incorrect statements, but I don't think that's necessary. Also, the writing is often compelling, which many people prefer over nuance anyway; many popular writers in tech use an analogous style</li> </ul></li> </ul></li> <li>Gary Hoffman &amp; Glynis Hoffman: <a href="https://amzn.to/33qX6if">Adios, Strunk &amp; White: A Handbook for the New Academic Essay</a></li> <li>Tracy Kidder &amp; Richard Todd: <a href="https://amzn.to/3DRTCSp">Good Prose: The Art of Nonfiction</a> <ul> <li>This book was recommended to me by Kelly Eskridge for its in-depth look at how an editor and a writer interact and I found it useful to keep in mind when working with an editor; reading this book is probably an inefficient way to get a better understanding of what working with a good editor looks like, but it's probably worth reading if you're curious how the author of <a href="https://amzn.to/3ETX7Jw">The Soul of a New Machine</a> writes; if you're not sure what an editor could do for you, this is a nice read</li> </ul></li> <li>Steve Yegge: <a href="https://sites.google.com/site/steveyegge2/you-should-write-blogs">You Should Write Blogs</a> <ul> <li>In particular, for &quot;Reason #3&quot;, the Jacob Gabrielson / Zero Config story, although the whole thing is worth reading</li> </ul></li> <li>Lawrence Tratt: <a href="https://tratt.net/laurie/blog/entries/what_ive_learnt_so_far_about_writing_research_papers.html">What I’ve Learnt So Far About Writing Research Papers</a> <ul> <li>Well written, just like everything else from Lawrence.
Also, I think it's interesting in that Laurence has a completely different process than mine in most major dimensions, but the resultant style is relatively similar if you compare across all programming bloggers (certainly more similar than any of the authors mentioned in the body of this post)</li> </ul></li> <li>Julia Evans: <a href="https://jvns.ca/blog/2020/12/05/how-i-write-useful-programming-comics/">How I write useful programming comics</a> <ul> <li>Nice explanation of what makes Julia's zines tick; also completely different from my approach, but this time with a completely different result</li> </ul></li> <li>Yossi Kreinin: <a href="https://yosefk.com/blog/blogging-is-hard.html">Blogging is hard</a> (the title is a contrast to his next post, &quot;low level is easy&quot;) <ul> <li>A rare example of a first post that's basically &quot;I'm going to write a blog&quot; that's both interesting and has interesting future posts that follow; also Yossi's writing philosophy</li> </ul></li> <li>Phil Eaton: <a href="https://notes.eatonphil.com/2024-04-10-what-makes-a-great-tech-blog.html">What makes a great technical blog</a> <ul> <li>A brief summary of properties that Phil likes in technical blogs. It's sort of the opposite of what people usually take away from Thomas and Turner's Clear and Simple as the Truth</li> </ul></li> </ul> <h4 id="appendix-things-that-increase-popularity-that-i-generally-don-t-do">Appendix: things that increase popularity that I generally don't do</h4> <p>Here are some things that I don't do, but that appear to work based on observing what works for other people; if you want a broad audience, perhaps you can try some of them out:</p> <ul> <li>Use clickbait titles <ul> <li>Swearing or saying that something &quot;is cancer&quot; or &quot;is the Vietnam of X&quot; or some other highly emotionally loaded phrase seems to be particularly effective</li> </ul></li> <li>Talk up prestige/accomplishments/titles</li> <li>Use an authoritative tone and/or style</li> <li>Write things with an angry tone or that are designed to induce anger</li> <li>Write frequently</li> <li>Get endorsements from people</li> <li>Write about hot, current, topics <ul> <li>Provide takes on recent events</li> </ul></li> <li>Use deliberately outrageous / controversial framings on topics</li> </ul> <h4 id="appendix-some-snippets-of-writing">Appendix: some snippets of writing</h4> <p>In case you're not familiar with the writers mentioned, here are some snippets that I think are representative of their writing styles:</p> <p>Joel Spolsky:</p> <blockquote> <p>Why I really care is that Microsoft is vacuuming up way too many programmers. Between Microsoft, with their shady recruiters making unethical exploding offers to unsuspecting college students, and Google (you're on my radar) paying untenable salaries to kids with more ultimate frisbee experience than Python, whose main job will be to play foosball in the googleplex and walk around trying to get someone...anyone...to come see the demo code they've just written with their &quot;20% time,&quot; doing some kind of, let me guess, cloud-based synchronization...
between Microsoft and Google the starting salary for a smart CS grad is inching dangerously close to six figures and these smart kids, the cream of our universities, are working on hopeless and useless architecture astronomy because these companies are like cancers, driven to grow at all cost, even though they can't think of a single useful thing to build for us, but they need another 3000-4000 comp sci grads next week. And dammit foosball doesn't play <i>itself</i>.</p> </blockquote> <p>and</p> <blockquote> <p>When I started interviewing programmers in 1991, I would generally let them use any language they wanted to solve the coding problems I gave them. 99% of the time, they chose C. Nowadays, they tend to choose Java ... Java is not, generally, a hard enough programming language that it can be used to discriminate between great programmers and mediocre programmers ... Nothing about an all-Java CS degree really weeds out the students who lack the mental agility to deal with these concepts. As an employer, I’ve seen that the 100% Java schools have started churning out quite a few CS graduates who are simply not smart enough to work as programmers on anything more sophisticated than Yet Another Java Accounting Application, although they did manage to squeak through the newly-dumbed-down coursework. These students would never survive 6.001 at MIT, or CS 323 at Yale, and frankly, that is one reason why, as an employer, a CS degree from MIT or Yale carries more weight than a CS degree from Duke, which recently went All-Java, or U. Penn, which replaced Scheme and ML with Java</p> </blockquote> <p>Paul Graham:</p> <blockquote> <p>A couple years ago a venture capitalist friend told me about a new startup he was involved with. It sounded promising. But the next time I talked to him, he said they'd decided to build their software on Windows NT, and had just hired a very experienced NT developer to be their chief technical officer. When I heard this, I thought, these guys are doomed. One, the CTO couldn't be a first rate hacker, because to become an eminent NT developer he would have had to use NT voluntarily, multiple times, and I couldn't imagine a great hacker doing that; and two, even if he was good, he'd have a hard time hiring anyone good to work for him if the project had to be built on NT.</p> </blockquote> <p>and</p> <blockquote> <p>What sort of people become haters? Can anyone become one? I'm not sure about this, but I've noticed some patterns. Haters are generally losers in a very specific sense: although they are occasionally talented, they have never achieved much. And indeed, anyone successful enough to have achieved significant fame would be unlikely to regard another famous person as a fraud on that account, because anyone famous knows how random fame is.</p> </blockquote> <p>Steve Yegge:</p> <blockquote> <p>When I read this book for the first time, in October 2003, I felt this horrid cold feeling, the way you might feel if you just realized you've been coming to work for 5 years with your pants down around your ankles. I asked around casually the next day: &quot;Yeah, uh, you've read that, um, Refactoring book, of course, right? Ha, ha, I only ask because I read it a very long time ago, not just now, of course.&quot; Only 1 person of 20 I surveyed had read it. Thank goodness all of us had our pants down, not just me.</p> <p>This is a wonderful book about how to write good code, and there aren't many books like it. None, maybe. 
They don't typically teach you how to write good code in school, and you may never learn on the job. It may take years, but you may still be missing some key ideas. I certainly was. ... If you're a relatively experienced engineer, you'll recognize 80% or more of the techniques in the book as things you've already figured out and started doing out of habit. But it gives them all names and discusses their pros and cons objectively, which I found very useful. And it debunked two or three practices that I had cherished since my earliest days as a programmer. Don't comment your code? Local variables are the root of all evil? Is this guy a madman? Read it and decide for yourself!</p> </blockquote> <p>and</p> <blockquote> <p>Jeff Bezos is an infamous micro-manager. He micro-manages every single pixel of Amazon's retail site. He hired Larry Tesler, Apple's Chief Scientist and probably the very most famous and respected human-computer interaction expert in the entire world, and then ignored every goddamn thing Larry said for three years until Larry finally -- wisely -- left the company. Larry would do these big usability studies and demonstrate beyond any shred of doubt that nobody can understand that frigging website, but Bezos just couldn't let go of those pixels, all those millions of semantics-packed pixels on the landing page. They were like millions of his own precious children. So they're all still there, and Larry is not.</p> <p>Micro-managing isn't that third thing that Amazon does better than us, by the way. I mean, yeah, they micro-manage really well, but I wouldn't list it as a strength or anything. I'm just trying to set the context here, to help you understand what happened. We're talking about a guy who in all seriousness has said on many public occasions that people should be paying him to work at Amazon. He hands out little yellow stickies with his name on them, reminding people &quot;who runs the company&quot; when they disagree with him. The guy is a regular... well, Steve Jobs, I guess. Except without the fashion or design sense. Bezos is super smart; don't get me wrong. He just makes ordinary control freaks look like stoned hippies.</p> </blockquote> <p>Julia Evans:</p> <blockquote> <p>Right now I’m on a million-hour train ride from New York to Montreal. So I’m looking at the output of strace because, uh, <code>strace</code> is cool, and it is teaching me some things about how the command line tools I use all the time work.</p> <p>What <code>strace</code> does is capture every single system call that gets called when executing a program. System calls are the interface between userspace programs and the kernel, so looking at the output from strace is a fun way to understand how Linux works, and what’s really involved in running a program.</p> <p>For example! <code>killall</code>! I ran</p> <p><code>strace killall ruby1.9.1 2&gt; killall-log.</code></p> </blockquote> <h4 id="appendix-how-gpt-4o-thinks-people-write">Appendix: how GPT-4o thinks people write</h4> <p>[2024 update]</p> <p>I've seen quite a few people fawn over GPT-3's, GPT-3.5's, and GPT-4's ability to &quot;imitate&quot; somebody and create work that's in their style, allowing you to read hypothetical essays or have hypothetical conversations with someone. I haven't found this to work very well in general.
Below are examples of what you get with the prompt &quot;Generate a 200 word quote about Microsoft that's something X might write.&quot; (I tried other prompts and got equally bad results, in that the content, tone, and style are not much like what any of these people actually produce):</p> <p>Joel Spolsky:</p> <blockquote> <p>Microsoft is a study in the complex interplay between engineering excellence and business acumen. They're not just a company that builds software; they're a sprawling, multi-tentacled beast that has, over decades, managed to insert itself into almost every corner of the tech industry. They’ve had their missteps—anyone remember Clippy or the Zune?—but they’ve also had spectacular successes. The magic of Microsoft lies in their ability to pivot and adapt, often turning what seems like a colossal blunder into an integral part of their strategy.</p> <p>Their initial dominance was built on Windows and Office, a formidable combo that defined personal and enterprise computing for decades. But what’s really fascinating is how they’ve evolved. Under Nadella, Microsoft transformed from a lumbering giant into a nimble innovator, diving headfirst into the cloud with Azure and embracing the open-source community they once shunned. They’ve created a developer ecosystem that’s robust and, dare I say, fun to work in. Microsoft is not just about software anymore; it's about platforms, services, and an ever-growing universe of connected tools. They’re proof that even the biggest companies can stay relevant if they’re willing to rethink and reinvent themselves continually</p> </blockquote> <p>Paul Graham:</p> <blockquote> <p>Microsoft's journey is a quintessential example of how adaptability trumps inertia in the tech world. In its early days, the company epitomized the startup ethos of seizing opportunities, transforming from a scrappy two-man team into a juggernaut by outmaneuvering larger, more established competitors. They mastered the art of commoditizing software and built an ecosystem that turned personal computing into a universal reality. Yet, success bred complacency. For years, Microsoft clung to its Windows and Office cash cows, missing the mobile and internet revolutions as younger, nimbler rivals reshaped the industry. But Microsoft’s story didn't end there. Under Satya Nadella’s visionary leadership, the company rediscovered its startup roots, pivoting towards the cloud and embracing open source with a zeal reminiscent of its early days. This ability to transform, to reimagine itself in the face of profound industry shifts, is what keeps Microsoft relevant. It's a lesson to every startup and tech giant alike: the key to longevity is not just innovation but the relentless pursuit of reinvention. In Microsoft's world, the only constant is change, and the company's true genius lies in its unwavering ability to adapt and thrive.</p> <p>Microsoft is a fascinating entity to explore, especially if you love to dive into the details of technology and see how they scale complexity. For a long time, they were the behemoth that you couldn’t avoid if you were using a computer. Windows and Office were like the air you breathed—ubiquitous and essential. But what's really interesting is how Microsoft has evolved. They’ve gone from being the gatekeepers of the desktop to pioneers in the cloud and champions of open-source.</p> <p>Think about Azure.
It’s this massive infrastructure that powers everything from small startups to large enterprises, enabling developers to deploy code globally with just a few clicks. And then there’s their embrace of open source. Remember when they acquired GitHub? That was a signal that Microsoft was serious about supporting the developer community. It wasn’t just a PR move; they’re contributing to and even leading projects that millions of us rely on.</p> <p>Their tools have become indispensable for so many of us. Visual Studio Code, for instance, is a joy to use. It’s light, fast, and extensible—a perfect example of how Microsoft has learned to build products that developers love. It’s a company that’s not afraid to reinvent itself and, in doing so, makes our tech lives better and more interesting.</p> </blockquote> <p><i>Thanks to Yossi Kreinin, Ben Kuhn, Laurence Tratt, Heath Borders, Jamie Brandon, Julia Evans, Vegard Nossum, Julien Kirch, Bram Delver, and Pam Wolf for comments/corrections/discussion.</i></p> <div class="footnotes"> <hr /> <ol> <li id="fn:P"><p>What's worked can mean very different things for different people, but for this section we're going to look at popular blogs because, when people I know have frustratedly stopped writing after writing a blog for a while, the most common reason has been that their blog had basically no readers.</p> <p>Of course, many people write without a goal of having readers and some people even try to avoid having more than a few readers (by &quot;locking&quot; posts in some way so that only &quot;friends&quot; have access), but I don't think the idea that &quot;what works&quot; is very broad and that many different styles can work changes if the goal is to have just a few friends read a blog.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:O">This is pretty arbitrary. In other social circles, Jeff Atwood, Raymond Chen, Scott Hanselman, etc., might be on the list, but this wouldn't change the point since all of these folks also have different styles from each other as well as from the people on my list. <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:7">2017 is the endpoint since I reduced how much I pay attention to programming internet culture around then and don't have a good idea of what people I know were reading after 2017. <a class="footnote-return" href="#fnref:7"><sup>[return]</sup></a></li> <li id="fn:C">In sports, elite coaches that have really figured out how to cue people to do the right thing can greatly accelerate learning but, outside of sports, although there's no shortage of people who are willing to supply coaching, it's rare to find one who's really figured out what cues students can be given that will help them get to the right thing much more quickly than they would've if they just naively measured what they were doing and applied a bit of introspection. <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:G">It turns out that blogging has been pretty great for me (e.g., my blog got me my current job, facilitated meeting a decent fraction of my friends, results in people sending me all sorts of interesting stories about goings-on in the industry, etc.), but I don't think that was a predictable outcome before starting the blog. My guess, based on base rates, was that the most likely outcome was failure.
<a class="footnote-return" href="#fnref:G"><sup>[return]</sup></a></li> <li id="fn:H">Such as <a href="https://news.ycombinator.com/item?id=6315483">this comment on how cushy programming jobs are compared to other lucrative jobs</a> (which turned into the back half of <a href="bimodal-compensation/">this post on programmer compensation</a>), <a href="https://news.ycombinator.com/item?id=4652367">this comment on writing pay</a>, and <a href="https://news.ycombinator.com/item?id=4920429">this comment on the evolution of board game design</a>. <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> </ol> </div> Some latency measurement pitfalls latency-pitfalls/ Mon, 06 Dec 2021 00:00:00 +0000 latency-pitfalls/ <p><small><i>This is a pseudo-transcript (actual words modified to be more readable than a 100% faithful transcription) of a short lightning talk I did at Twitter a year or two ago, on pitfalls of how we use latency metrics (with the actual service names anonymized per a comms request). Since this presentation, significant progress has been made on this on the infra side, so the situation is much improved over what was presented, but I think this is still relevant since, from talking to folks at peer companies, many folks are facing similar issues.</i></small></p> <p>We frequently use <a href="https://research.google/pubs/pub40801/">tail latency</a> metrics here at Twitter. Most frequently, service owners want to get cluster-wide or Twitter-wide latency numbers for their services. Unfortunately, the numbers that service owners tend to use differ from what we'd like to measure due to some historical quirks in our latency measurement setup:</p> <ul> <li>Opaque, uninstrumented, latency</li> <li>Lack of cluster-wide aggregation capability</li> <li>Minutely resolution</li> </ul> <h4 id="opaque-uninstrumented-latency">Opaque, uninstrumented, latency</h4> <p>When we look at the dashboards for most services, the latency metrics that are displayed and are used for alerting are usually from the server the service itself is running on. Some services that have dashboards set up by senior SREs who've been burned by invisible latency before will also have the service's client-observed latency from callers of the service.
I'd like to discuss three issues with this setup.</p> <p>For the purposes of this talk, we can view a client request as passing through the following pipeline after client &quot;user&quot; code passes the request to our RPC layer, Finagle (<a href="https://twitter.github.io/finagle/">https://twitter.github.io/finagle/</a>), and before client user code receives the response (the way Finagle currently handles requests, we can't get timestamps for a particular request once the request is handed over to the network library we use, <a href="https://netty.io/">netty</a>).</p> <p><code>client netty -&gt; client Linux -&gt; network -&gt; server Linux -&gt; server netty -&gt; server &quot;user code&quot; -&gt; server netty -&gt; server Linux -&gt; network -&gt; client Linux -&gt; client netty</code></p> <p>As we previously saw in [an internal document quantifying the impact of <a href="https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html">CFS bandwidth control throttling</a> and how our use of excessively large thread pools causes throttling]<sup class="footnote-ref" id="fnref:W"><a rel="footnote" href="#fn:W">1</a></sup>, we frequently get a lot of queuing in and below netty, which has the knock-on effect of causing services to get throttled by the kernel, which often results in a lot of opaque latency, especially under high load, when we most want dashboards to show correct latency numbers.</p> <p>When we sample latency at the server, we basically get latency from</p> <ul> <li>Server service &quot;user&quot; code</li> </ul> <p>When we sample latency at the client, we basically get</p> <ul> <li>Server service &quot;user&quot; code</li> <li>Server-side netty</li> <li>Server-side Linux latency</li> <li>Client-side Linux latency</li> <li>Client-side netty latency</li> </ul> <p>Two issues with this stand out. First, with metrics data, we don't have a nice way to tell whether latency in the opaque parts of the stack is coming from the client or the server. As a service owner, if you set alerts based on client latency, you'll get alerted when client latency rises because there's too much queuing in netty or Linux on the client even when your service is running smoothly.</p> <p>Also, the client latency metrics that are reasonable to look at given what we expose give you latency for all servers a client talks to, which is a really different view from what we see on server metrics, which give us per-server latency numbers, and there isn't a good way to aggregate the client-side numbers across all clients, so it's difficult to tell, for example, if a particular instance of a server has high latency in netty.</p> <p>Below are a handful of examples of cluster-wide measurements of latency measured at the client vs. the server. These were deliberately selected to show a cross-section of deltas between the client and the server.</p> <p><img src="images/latency-pitfalls/service-1.png" alt="Graph showing large difference between latency measured at the client vs.
at the server" width="1259" height="778"></p> <p>This is a <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function">CDF</a>, presented with the standard orientation for a CDF, with the percentile on the y-axis and the value on the x-axis, which makes down and to the right higher latency and up and to the left lower latency, and a flatter line meaning latency is increasing quickly and a steeper line meaning that latency is increasing more slowly.</p> <p>Because the chart is log scale on both axes, the difference between client and server latency is large even though the lines don't look all that far apart. For example, if we look at 99%-ile latency, we can see that it's ~16ms when measured at the server and ~240ms when measured at the client, a factor of 15 difference. Alternately, if we look at a fixed latency, like 240ms, and look up the percentile, we see that's 99%-ile latency on the client, but well above 99.9%-ile latency on the server.</p> <p>The graphs below have similar properties, although the delta between client and server will vary.</p> <p><img src="images/latency-pitfalls/service-2.png" alt="Graph showing moderate difference between latency measured at the client vs. at the server until p99.5, with large difference above p99.5" width="1259" height="778"> <img src="images/latency-pitfalls/service-3.png" alt="Graph showing small difference between latency measured at the client vs. at the server until p74, with increasing divergence after that" width="1259" height="778"> <img src="images/latency-pitfalls/service-4.png" alt="Graph showing moderate difference between latency measured at the client vs. at the server until close to client timeout value, with large divergence near timeout value" width="1259" height="778"> <img src="images/latency-pitfalls/service-5.png" alt="Graph showing small difference between latency measured at the client vs. at the server until p999, with rapid increase after that" width="1259" height="778"></p> <p>We can see that latencies often differ significantly when measured at the client vs. when measured at the server and that, even in cases where the delta is small for lower percentiles, it sometimes gets large at higher percentiles, where more load can result in more queueing and therefore more latency in netty and the kernel.</p> <p>One thing to note is that, for any particular measured server latency value, we see a very wide range of client latency values. For example, here's a zoomed-in scatterplot of client vs. server latency for <code>service-5</code>. If we were to zoom out, we'd see that for a request with a server-measured latency of 10ms, we can see client-measured latencies as high as 500ms. More generally, we see many requests where the server-measured latency is very similar to the client-measured latency, with a smattering of requests where the server-measured latency is a very inaccurate representation of the client-measured latency. In almost all of those cases, the client-measured latency is higher due to queuing in a part of the stack that's opaque to us and, in a (very) few cases, the client-measured latency is lower due to some issues in our instrumentation. In the plot below, due to how we track latencies, we only have 1ms granularity on latencies. The points on the plots below have been randomly jittered by +/- 0.4ms to give a better idea of the distribution at points on the plot that are very dense.</p> <p><img src="images/latency-pitfalls/service-scatter.png" alt="Per-request scatterplot of client vs. server latency, showing that any particular server latency value can be associated with a very wide range of client latency values" width="1259" height="778"></p>
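<p>To make the comparison concrete, here's a rough sketch of the kind of analysis behind plots like these: given per-request latencies sampled at the server and at the client, compare tail percentiles and check what percentile a fixed latency threshold lands at on each side. The distributions, the amount of opaque client-side queueing, and the percentile function below are all made up for illustration; they're stand-ins for real trace data, not measurements from any of the services above.</p>
<pre><code>import random

random.seed(0)

def percentile(sorted_xs, p):
    # Nearest-rank percentile; fine for a sketch, not for production metrics.
    idx = min(len(sorted_xs) - 1, int(p / 100 * len(sorted_xs)))
    return sorted_xs[idx]

def percentile_rank(sorted_xs, value):
    # What percentile does a fixed latency value land at?
    below = sum(1 for x in sorted_xs if x &lt;= value)
    return 100 * below / len(sorted_xs)

# Fake paired samples: server-side latency plus "opaque" client-side queueing
# that's usually negligible but occasionally large.
server_lat, client_lat = [], []
for _ in range(100_000):
    server = random.lognormvariate(1.0, 0.6)  # ms
    opaque = 0.2 if random.random() &lt; 0.95 else random.expovariate(1 / 80.0)
    server_lat.append(server)
    client_lat.append(server + opaque)

server_lat.sort()
client_lat.sort()

for p in (50, 90, 99, 99.9):
    print("p%-5s server %7.1fms   client %7.1fms"
          % (p, percentile(server_lat, p), percentile(client_lat, p)))

# A fixed threshold sits at very different percentiles on each side.
for threshold in (50, 240):
    print("%dms is p%.2f at the server, p%.2f at the client"
          % (threshold, percentile_rank(server_lat, threshold),
             percentile_rank(client_lat, threshold)))
</code></pre>
<p>With paired samples from a tracing system in place of the synthetic data, this is essentially what the percentile comparisons above boil down to.</p>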
<p>While it's possible to plumb instrumentation through netty and the kernel to track request latencies after Finagle has handed them off (the kernel even has hooks that would make this somewhat straightforward), that's probably more work than it's worth in the near future. If you want to get an idea of how your service is impacted by opaque latency, it's fairly easy to get a rough estimate with <a href="https://zipkin.io/">Zipkin</a> if you leverage <a href="tracing-analytics/">the work Rebecca Isaacs, Jonathan Simms, and Rahul Iyer have done</a>, which is how I generated the plots above. The code for these is checked into [a path in our monorepo] and you can plug in your own service names if you just want to check out a different service.</p> <h4 id="lack-of-cluster-wide-aggregation-capability">Lack of cluster-wide aggregation capability</h4> <p>In the examples above, we were able to get cluster-wide latency percentiles because we used data from Zipkin, which attempts to sample requests uniformly at random. For a variety of reasons, service owners mostly rely on metrics data which, while more complete because it's unsampled, doesn't let us compute cluster-wide aggregates because we pre-compute fixed aggregations on a per-shard basis and there's no way to reconstruct the cluster-wide aggregate from the per-shard aggregates.</p> <p>From looking at dashboards of our services, the most common latency target is an average across shards of shard-level 99%-ile latency (with some services that are deep in the request tree, like cache, using numbers further in the tail). Unfortunately, taking the average of per-shard tail latency defeats the purpose of monitoring tail latency. If <a href="https://brooker.co.za/blog/2021/04/19/latency.html">we think about why we want to use tail latency</a>, it's because, when we have high fanout and high depth request trees, a very small fraction of server responses slowing down can slow down many or most top-level requests. The average of shard-level tail latencies fails to capture exactly that property, that a small fraction of slow server responses can slow down many or most requests, while <a href="https://brooker.co.za/blog/2017/12/28/mean.html">also missing out on the advantages of looking at cluster-wide averages</a>, which can be reconstructed from per-shard averages.</p> <p>For example, when we have a few bad nodes returning responses slowly, that has a small impact on the average per-shard tail latency even though cluster-wide tail latency will be highly elevated. As we saw in [a document quantifying the extent of machine-level issues across the fleet as well as the impact on data integrity and performance]<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">2</a></sup>, we frequently have host-level issues that can drive tail latency on a node up by one or more orders of magnitude, which can sometimes drive median latency on the node up past the tail latency on other nodes.
Since a few or even one such node can determine the tail latency for a cluster, taking the average across all nodes can be misleading, e.g., if we have a 100 node cluster where tail latency is up by 10x on one node, this might cause our average of per-shard tail latencies to increase by a factor of 0.99 + 0.01 * 10 = 1.09 when the actual increase in cluster-wide tail latency is much larger.</p> <p>Some service owners try to get a better approximation of cluster-wide tail latency by taking a percentile of the 99%-ile, often the 90%-ile or the 99%-ile, but this doesn't work either and there is, in general, no per-shard percentile or other aggregation of per-shard tail latencies that can reconstruct the cluster-level tail latency.</p> <p>Below are plots of the various attempts that people have on dashboards to get cluster-wide latency with instance-level metrics data vs. actual (sampled) cluster-wide latency, on a large service (which makes the percentile of percentile attempts more accurate than they would be for smaller services). We can see the correlation is very weak and has the problem we expect, where the average of the tail isn't influenced by outlier shards as much as it &quot;should be&quot; and the various commonly used percentiles either aren't influenced enough or are influenced too much on average, and are also weakly correlated with the actual latencies. Because we track metrics with minutely granularity, each point in the graphs below represents one minute, with the sampled cluster-wide p999 latency on the x-axis and the dashboard aggregated metric value on the y-axis. Because we have 1ms granularity on individual latency measurements from our tracing pipeline, points are jittered horizontally +/- 0.3ms to give a better idea of the distribution (no such jitter is applied vertically since we don't have this limitation in our metrics pipeline, so that data is higher precision).</p> <p><img src="images/latency-pitfalls/cluster-1.png" alt="Per-minute scatterplot of average of per-shard p999 vs. actual p999, showing that average of per-shard p999 is a very poor approximation" width="1259" height="778"> <img src="images/latency-pitfalls/cluster-2.png" alt="Per-minute scatterplot of p99 of per-shard p999 vs. actual p999, showing that p99 of per-shard p999 is a poor approximation" width="1259" height="778"> <img src="images/latency-pitfalls/cluster-3.png" alt="Per-minute scatterplot of p999 of per-shard p999 vs. actual p999, showing that p999 of per-shard p999 is a very poor approximation" width="1259" height="778"></p> <p>The correlation between cluster-wide latency and aggregations of per-shard latency is weak enough that even if you pick the aggregation that results in the correct average behavior, the value will still be quite wrong for almost all samples (minutes).</p>
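<p>For a concrete (and purely synthetic) illustration of the failure mode, here's a small simulation of the 100 node example above: one node with 10x latency, comparing a couple of dashboard-style aggregates of per-shard p999 against the p999 computed over all requests. The latency distribution and percentile function are invented for the sketch and aren't meant to model any particular service.</p>
<pre><code>import random, statistics

random.seed(0)

NUM_SHARDS = 100
REQS_PER_SHARD = 10_000

def percentile(xs, p):
    # Crude nearest-rank percentile, good enough for a sketch.
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]

# 100 shards, one of which is 10x slower than the rest.
shards = []
for i in range(NUM_SHARDS):
    slowdown = 10 if i == 0 else 1
    shards.append([slowdown * random.lognormvariate(1.0, 0.5)  # latency in ms
                   for _ in range(REQS_PER_SHARD)])

per_shard_p999 = [percentile(shard, 99.9) for shard in shards]
all_requests = [x for shard in shards for x in shard]

# The dashboard-style aggregates don't recover the true value; pooling the
# raw samples (what a proper histogram merge would give you) does.
print("avg of per-shard p999:  %6.1fms" % statistics.mean(per_shard_p999))
print("p99 of per-shard p999:  %6.1fms" % percentile(per_shard_p999, 99))
print("p999 over all requests: %6.1fms" % percentile(all_requests, 99.9))
</code></pre>
<p>The exact numbers depend on the distribution you pick, but the qualitative result matches the plots above: the average of per-shard tail latencies barely moves when one node goes bad, the percentile-of-percentiles either over- or under-shoots, and only an aggregate computed over the underlying requests (which is what merging per-shard histograms would give us) tracks the cluster-wide tail.</p>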
<p>Given our infra, the only solutions that can really work here are extending our tracing pipeline for use on dashboards and with alerts, or adding metric histograms to Finagle and plumbing that data up through everything and then into [dashboard software] so that we can get proper cluster-level aggregations<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">3</a></sup>.</p> <p>While it's popular to take the average of tail latencies because it's easy and people are familiar with it (e.g., the TL of observability at [redacted peer company name] has said that they shouldn't bother with anything other than averages because everyone just wants averages), taking the average or another aggregation of shard-level tail latencies has neither the properties people want nor the properties people expect.</p> <h4 id="minutely-resolution">Minutely resolution</h4> <p>Another, independent, issue that's a gap in our ability to observe what's going on with our infrastructure is that we only collect metrics at a minutely granularity. Rezolus does metrics collection on a secondly (and in some cases, even sub-secondly) granularity, but for reasons that are beyond the scope of this talk, it's generally only used for system-level metrics (with a few exceptions).</p> <p>We've all seen incidents where some bursty, sub-minutely event is the cause of a problem. Let's look at an example of one such incident. In this incident, a service had elevated latency and error rate. Looking at the standard metrics we export wasn't informative, but looking at sub-minutely metrics immediately reveals a clue:</p> <p><img src="images/latency-pitfalls/minutely-1.png" alt="Plot of per-request latency for sampled requests, showing large spike followed by severely reduced request rate" width="1259" height="778"></p> <p>For this particular shard of a cache (and many others, not shown), there's a very large increase in latency at <code>time 0</code>, followed by 30 seconds of very low request rate. The 30 seconds is because shards of <code>service-6</code> were configured to mark servers they talk to as dead for 30 seconds if <code>service-6</code> clients encounter too many failed requests. This decision is distributed, which is why the request rate to the impacted shard of <code>cache-1</code> isn't zero; some shards of <code>service-6</code> didn't send requests to that particular shard of <code>cache-1</code> during the period of elevated latency, so they didn't mark that shard of <code>cache-1</code> as dead and continued to issue requests.</p> <p>A sub-minutely view of request latency made it very obvious what mechanism caused elevated error rates and latency in <code>service-6</code>.</p> <p>One thing to note is that the lack of sub-minutely visibility wasn't the only issue here. Much of the elevated latency was in places that are invisible to the latency metric, making monitoring of <code>cache-1</code> latencies insufficient to detect the issue. Below, the reported latency metrics for a single instance of <code>cache-1</code> are the blue points and the measured (sampled) latency the client observed is the black line<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">4</a></sup>. Reported p99 latency is 0.37ms, but actual p99 latency is ~580ms, a more than three order of magnitude difference.</p> <p><img src="images/latency-pitfalls/minutely-2.png" alt="Plot of reported metric latency vs.
latency from trace data, showing extremely large difference between metric latency and trace latency" width="1259" height="778"></p> <h4 id="summary">Summary</h4> <p>Although our existing setup for reporting and alerting on latency works pretty decently, in that the site generally works and our reliability is actually quite good compared to peer companies in our size class, we do pay some significant costs as a result of our setup.</p> <p>One is that we often have incidents where it's difficult to see what's going on without using tools that are considered specialized and that most people don't use, adding to the toil of being on call. Another is that, due to large margins of error in our estimates of cluster-wide latencies, we have to provision a very large amount of slack and keep latency SLOs that are much stricter than the actual latencies we want to achieve to avoid user-visible incidents. This increases operating costs, as we've seen in [a document comparing per-user operating costs to those of companies that serve similar kinds of and levels of traffic].</p> <p><i>If you enjoyed this post you might like to read about <b><a href="perf-tracing/">tracing on a single host vs. sampling profilers</a></b></i>.</p> <h4 id="appendix-open-vs-closed-loop-latency-measurements">Appendix: open vs closed loop latency measurements</h4> <p>Some of our synthetic benchmarking setups, such as setup-1, use &quot;closed-loop&quot; measurement, where they effectively send a single request, wait for it to come back, and then send another request. Some of these allow for a degree of parallelism, where N requests can be in flight at once, but that still has similar problems in terms of realism.</p> <p>For a toy example of the problem, let's say that we have a service that, in production, receives exactly 1 request every second and that the service has a normal response time of 1/2 second. Under normal behavior, if we issue requests at 1 per second, we'll observe that the mean, median, and all percentile request times are 1/2 second. As an exercise for the reader, compute the mean and 90%-ile latency if the service has no parallelism and one request takes 10 seconds in the middle of a 1 minute benchmark run for a closed vs. open loop benchmark setup, where the benchmarking setup issues requests at 1 per second in the open loop case, and at 1 per second but waiting for the previous request to finish in the closed loop case.</p> <p>For more info on this, see <a href="https://psy-lob-saw.blogspot.com/2015/03/fixing-ycsb-coordinated-omission.html">Nitsan Wakart's write-up on fixing this issue in the YCSB benchmark</a> or <a href="https://www.youtube.com/watch?v=9MKY4KypBzg">Gil Tene's presentation on this issue</a>.</p> <h4 id="appendix-use-of-unweighted-averages">Appendix: use of unweighted averages</h4> <p>A common issue with averages on dashboards that I've looked at, independent of the issues that come up when we take the average of tail latencies, is that an unweighted average frequently underestimates the actual latency.</p> <p>Two places I commonly see an unweighted average are when someone gets an overall latency by taking an unweighted average across datacenters and when someone gets a cluster-wide latency by taking an average across shards. Both of these have the same issue, that shards (or datacenters) with lower load tend to have lower latency. This is especially pronounced when we fail away from a datacenter. Services that incorrectly use an unweighted average across datacenters will often show decreased latency even though actually served requests have increased latency.</p>
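<p>Here's a toy example of the effect with invented numbers (they're not from any real service): after failing most traffic away from a datacenter, the mostly-drained datacenter is fast because it's idle, and an unweighted average across the two datacenters reports a latency well below what the typical request actually saw. The same arithmetic applies to an unweighted average across shards with uneven load.</p>
<pre><code># (requests per minute, average latency in ms) per datacenter, after failing
# most traffic away from one of them; numbers are invented for illustration.
datacenters = [
    (900_000, 40.0),   # datacenter taking most of the traffic, slow under load
    (100_000, 10.0),   # mostly-drained datacenter, fast because it's idle
]

unweighted = sum(lat for _, lat in datacenters) / len(datacenters)
weighted = (sum(reqs * lat for reqs, lat in datacenters)
            / sum(reqs for reqs, _ in datacenters))

print("unweighted average:       %.1fms" % unweighted)  # 25.0ms
print("request-weighted average: %.1fms" % weighted)    # 37.0ms
</code></pre>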
<p><i>Thanks to Ben Kuhn for comments/corrections/discussion.</i></p> <p><link rel="prefetch" href=""> <link rel="prefetch" href="perf-tracing/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:W">This is another item that's somewhat out of date, since this document motivated work from Flavio Brasil and Vladimir Kostyukov on Finagle that reduces the impact of this problem and then, later, work from my then-intern, Xi Yang, on a patch to the kernel scheduler that basically eliminates the problem by preventing cgroups from exceeding their CPU allocation (as opposed to the standard mechanism, which allows cgroups to exceed their allocation and then effectively puts the cgroup to sleep until its amortized cpu allocation is no longer excessive, which is very bad for tail latency). <a class="footnote-return" href="#fnref:W"><sup>[return]</sup></a></li> <li id="fn:H">This is yet another item that's out of date since the kernel, HWENG, and the newly created fleet health team have expended significant effort to drive down the fraction of unhealthy machines. <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> <li id="fn:M">This is also significantly out of date today. Finagle does now support exporting shard-level histogram data and this can be queried via one-off queries by hitting the exported metrics endpoint. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:O">As we previously noted, opaque latency could come from either the server or the client, but in this case, we have strong evidence that the latency is coming from the <code>cache-1</code> server and not the <code>service-6</code> client because opaque latency from the <code>service-6</code> client should be visible on all requests from <code>service-6</code> but we only observe elevated opaque latency on requests from <code>service-6</code> to <code>cache-1</code> and not to the other servers it &quot;talks to&quot;. <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> </ol> </div> Major errors on this blog (and their corrections) corrections/ Mon, 22 Nov 2021 00:00:00 +0000 corrections/ <p>Here's a list of errors on this blog that I think were fairly serious. While what I think of as serious is, of course, subjective, I don't think there's any reasonable way to avoid that because, e.g., I make a huge number of typos, so many that the majority of acknowledgements on many posts are for people who e-mailed or DM'ed me typo fixes.</p> <p>A list that included everything, including typos, would be both uninteresting for other people to read and high overhead for me, which is why I've drawn the line somewhere. An example of an error I don't think of as serious is that, <a href="learning-to-program/">in this post on how I learned to program</a>, I originally had the dates wrong on when the competition programmers from my high school made money (it was a couple years after I thought it was). In that case, and many others, I don't think that the date being wrong changes anything significant about the post.</p> <p>Although I'm publishing the original version of this in 2021, I expect this list to grow over time. I hope that I've become more careful and that the list will grow more slowly in the future than it has in the past, but that remains to be seen.
I view it as a good sign that a large fraction of the list is from my first three months of blogging, in 2013, but that's no reason to get complacent!</p> <p>I've added a classification below that reflects how I think of the errors, but that classification is also arbitrary and the categories aren't even mutually exclusive. If I ever collect enough of these that it's difficult to hold them all in my head at once, I might create a tag system and use that to classify them instead, but I hope to not accumulate so many major errors that I feel like I need a tag system for readers to easily peruse them.</p> <ul> <li>Insufficient thought <ul> <li><i>2013</i>: <a href="randomize-hn/">Using random algorithms to decrease the probability that good stories get &quot;unlucky&quot; on HN</a>: this idea was tried and didn't work as well as putting humans in the loop who decide which stories should be rescued from oblivion. <ul> <li>Since this was a proposal and not a claim, this technically wasn't an error since I didn't claim that this would definitely work, but my feeling is that I should've also considered solutions that put humans in the loop. I didn't because Digg famously got a lot of backlash for having humans influence their front page but, in retrospect, we can see that it's possible to do so in a way that doesn't generate backlash that effectively kills the site and I think this could've been predicted with enough thought</li> </ul></li> </ul></li> <li>Naivete <ul> <li><i>2013</i>: <a href="hardware-unforgiving/">The institutional knowledge and culture that create excellence can take a long time to build up</a>: At this time, I hadn't worked in software and thought that this wasn't as difficult for software because so many software companies are successful with new/young teams. But, in retrospect, the difference isn't that those companies don't produce bad (unreliable, buggy, slow, etc.) software, it's that product/market fit and network effects are important enough that it frequently doesn't matter that software is bad</li> <li><i>2015</i>: <a href="dunning-kruger/">In this post on how people don't read citations</a>, I found it mysterious that type system advocates would cite non-existent strong evidence, which seems unlike the other examples, where people pass on a clever, contrarian, result without ever having read it. The thing I thought was mysterious was that, unlike the other examples, there isn't an incorrect piece of evidence being passed around; the assertion that there is evidence is disconnected from any evidence, even misinterpreted evidence. In retrospect, I was being naive in thinking that there had to be a link to some piece of evidence; people will just fabricate the idea that there is evidence supporting their belief and then pass that around.</li> </ul></li> <li>Insufficient verification of information <ul> <li><i>2016</i>: <a href="sounds-easy/">Building a search engine isn't trivial</a>: although I think the overall point is true, one of the pieces of evidence I relied on came from numbers that someone who worked on a search engine told me about. But when I measured actual numbers, I found that the numbers I was told were off by multiple orders of magnitude</li> <li><i>2022</i>: <a href="futurist-predictions/">Futurist predictions</a>, pointed out to me by @ESRogs: I misread nostalgebraist's summary of a report and didn't understand what he was saying with respect to a sensitivity analysis he was referring to.
I distinctly remember not being sure what nostalgebraist was saying and originally agreed with the correct interpretation. After re-reading it, I came away with my mistaken reading, which I then wrote into my post. That I had uncertainty about the reading should've caused me to just reproduce his analysis, which would have immediately clarified what he meant, but I didn't do that. This error didn't fundamentally change my own analysis since the broader point I was making didn't hinge on the exact numbers, but I think it's a very bad habit to allow yourself to publish something with the level of uncertainty I had without noting the uncertainty (quite an ironic mistake considering the contents of the post itself). A factor that led both to the mistake in the first place and to not checking the math in a way that would've spotted the mistake is that the edits that introduced this mistake were a last-minute change, introduced when I had a short window of time to make the changes if I wanted to publish immediately and not some time significantly later. Of course, that should have led to me delaying publication, so this was one bad decision that led to another</li> </ul></li> <li>Blunder <ul> <li><i>2015</i>: <a href="butler-lampson-1999/">Checking out Butler Lampson's review of what worked in CS, 16 years later</a>: it was wrong to say that capabilities were a &quot;no&quot; in 2015 given their effectiveness on mobile, and that was so obviously wrong at the time that I would call this a blunder rather than something where I gave it a decent amount of thought but should've thought through it more deeply</li> <li><i>2024</i>: <a href="diseconomies-scale/">Diseconomies of scale</a>: I mixed up which number I was dividing by which when doing arithmetic, causing a multiple order of magnitude error in a percentage. Sophia Wisdom noticed this a few hours after the post was published and I fixed it immediately, but this was quite a silly error.</li> </ul></li> <li>Pointlessly difficult to understand explanation <ul> <li><i>2013</i>: <a href="3c-conflict/">How data alignment impacts memory latency</a>: the main plots in this post use a ratio of latencies, which adds a level of indirection that many people found confusing</li> <li><a id="p95"><i>2017</i></a>: <a href="p95-skill/">It is easy to achieve 95%-ile performance</a>: the most common objection people had to this post was something like &quot;False. You need to be very talented and/or it is hard to [play in the NBA / become a chess GM / achieve a 2200 chess rating]&quot;. <a href="https://twitter.com/JamesClear/status/1292574538912456707">James Clear made an even weaker claim on Twitter</a> and also got similar responses. There isn't really space to do this on Twitter, but in my blog post, I should've included more concrete examples of what various levels of performance look like for people who have a difficult time estimating what performance looks like at various percentiles.
To pick one of the less outlandish claims, <a href="https://lobste.rs/s/mwykjj/95_ile_isn_t_good#c_pudige">here's a claim that a 2200 rating is 95%-ile for someone who's ever played chess online</a>, which <a href="https://lobste.rs/s/mwykjj/95_ile_isn_t_good#c_iyoegc">appears to be off by perhaps four orders of magnitude, plus or minus one</a>.</li> </ul></li> <li>Errors in retrospect <ul> <li><i>2015</i>: <a href="blog-ads/">Blog monetization</a>: I grossly underestimated how much <a href="https://www.patreon.com/danluu">I could make on Patreon</a> by looking at how much Casey Muratori, Eric Raymond, and eevee were making on Patreon at the time. I thought that all three of them would out-earn me for a variety of reasons and that was incorrect. A major reason that was incorrect was that boring, long-form, writing monetizes much better than I expected, which means that I monetarily undervalued it compared to what other tech folks are doing. <ul> <li>A couple weeks ago, I added a link to Patreon at the top of posts (instead of just having one hidden at the bottom) and mentioned having a Patreon on Twitter. Since then, my earnings have increased by about as much as Eric Raymond makes in total and the amount seems to be increasing at a decent rate, which is a result I wouldn't have expected before the rise of substack. But anyone who realized how well individual writers can monetize their writing could've created substack and no one did until Chris Best, Hamish McKenzie, and Jairaj Sethi created substack, so I'd say that this one was somewhat non-obvious. Also, <a href="https://www.patreon.com/posts/60185075">it's unclear if the monetization is going to scale up or will plateau</a>; if it plateaus, then my guess would only be off by a small constant factor.</li> </ul></li> </ul></li> </ul> <p>Thanks to Anja Boskovic and Ville Sundberg for comments/corrections/discussion.</p> Individuals matter people-matter/ Mon, 15 Nov 2021 00:00:00 +0000 people-matter/ <p>One of the most common mistakes I see people make when looking at data is incorrectly using an overly simplified model. A specific variant of this that has derailed the majority of work roadmaps I've looked at is treating people as interchangeable, as if it doesn't matter who is doing what, as if individuals don't matter.</p> <p>Individuals matter.</p> <p>A pattern I've repeatedly seen during the roadmap creation and review process is that people will plan out the next few quarters of work and then assign some number of people to it, one person for one quarter to a project, two people for three quarters to another, etc. Nominally, this process enables teams to understand what other teams are doing and plan appropriately. I've never worked in an organization where this actually worked, where this actually enabled teams to effectively execute with dependencies on other teams.</p> <p>What I've seen happen instead is, when work starts on the projects, people will ask who's working on the project and then will make a guess at whether or not the project will be completed on time or in an effective way or even be completed at all based on who ends up working on the project. &quot;Oh, Joe is taking feature X? He never ships anything reasonable. Looks like we can't depend on it because that's never going to work. Let's do Y instead of Z since that won't require X to actually work&quot;.
The roadmap creation and review process maintains the polite fiction that people are interchangeable, but everyone knows this isn't true and teams that are effective and want to ship on time can't play along when the rubber hits the road, even if they play along with the managers, directors, and VPs who create roadmaps as if people can be generically abstracted over.</p> <p>Another place the non-fungibility of people causes predictable problems is with how managers operate teams. Managers who want to create effective teams<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">1</a></sup> end up fighting the system in order to do so. Non-engineering orgs mostly treat people as fungible, and the finance org at a number of companies I've worked for forces the engineering org to treat people as fungible by requiring the org to budget in terms of headcount. The company, of course, spends money and not &quot;heads&quot;, but internal bookkeeping is done in terms of &quot;heads&quot;, so $X of budget will be, for some team, translated into something like &quot;three staff-level heads&quot;. There's no way to convert that into &quot;two more effective and better-paid staff-level heads&quot;<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">2</a></sup>. If you hire two staff engineers and not a third, the &quot;head&quot; and the associated budget will eventually get moved somewhere else.</p> <p>One thing I've repeatedly seen is that a hiring manager will <a href="https://twitter.com/danluu/status/1452701799417143296">want to hire someone who they think will be highly effective or even just someone who has specialized skills and then not be able to hire because the company has translated budget into &quot;heads&quot; at a rate that doesn't allow for hiring that kind of head</a>. There will be a &quot;comp team&quot; or other group in HR that will object because the comp team has no concept of &quot;an effective engineer&quot; or &quot;a specialty that's hard to hire for&quot;; to the comp team, a person is defined by role, level, and location, and someone who's paid too much for their role and level is therefore a bad hire. If anyone reasonable had power over the process that they were willing to use, this wouldn't happen but, by design, the bureaucracy is set up so that few people have power<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">3</a></sup>.</p> <p>A similar thing happens with retention. A great engineer I know who was regularly creating $x0M/yr<sup class="footnote-ref" id="fnref:Q"><a rel="footnote" href="#fn:Q">4</a></sup> of additional profit for the company wanted to move home to Portugal, so the company cut the person's cash comp by a factor of four. The company also offered to only cut his cash comp by a factor of two if he moved to Spain instead of Portugal. This was escalated up to the director level, but that wasn't sufficient to override HR, so he left for a company that doesn't have location-based pay.
HR didn't care that the person made the company more money than HR saves by doing location adjustments for all international employees combined because HR at the company had no notion of the value of an employee, only the cost, title, level, and location<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">5</a></sup>.</p> <p>Relatedly, a &quot;move&quot; I've seen twice, once from a distance and once from up close, is when HR decides <a href="https://twitter.com/danluu/status/1277591470162104321">attrition is too low</a>. In one case, the head of HR thought that the company's ~5% attrition was &quot;unhealthy&quot; because it was too low and, in another, HR thought that the company's attrition, sitting at a bit under 10%, was too low. In both cases, the company made some moves that resulted in attrition moving up to what HR thought was a &quot;healthy&quot; level. In the case I saw from a distance, folks I know at the company agree that the majority of the company's best engineers left over the next year, many after only a few months. In the case I saw up close, I made a list of the most effective engineers I was aware of (like the person mentioned above who increased the company's revenue by 0.7% on his paternity leave) and, when the company successfully pushed attrition to over 10% overall, the most effective engineers left at over double that rate (which understates the impact of this because they tended to be long-tenured and senior engineers, where the normal expected attrition would be less than half the average company attrition).</p> <p>Some people seem to view companies like a game of SimCity, where if you want more money, you can turn a knob, increase taxes, and get more money, uniformly impacting the city. But companies are not a game of SimCity. If you want more attrition and turn a knob that cranks that up, you don't get additional attrition that's sampled uniformly at random. People, as a whole, cannot be treated as an abstraction where the actions company leadership takes impact everyone in the same way. The people who are most effective will be disproportionately likely to leave if you turn a knob that leads to increased attrition.</p> <p>So far, we've talked about how treating individual people as fungible doesn't work for corporations but, of course, it also doesn't work in general. For example, a complaint from a friend of mine who's done a fair amount of &quot;on the ground&quot; development work in Africa is that a lot of people who are looking to donate want clear, simple criteria to guide their donations (e.g., an RCT showed that the intervention was highly effective). But many effective interventions cannot have their impact demonstrated ex ante in any simple way because, among other reasons, the composition of the team implementing the intervention is important, resulting in a randomized trial or other experiment not being applicable to any team implementing the intervention other than the teams from the trial, in the context they were operating in during the trial.</p> <p>An example of this would be an intervention they worked on that, among other things, helped wipe out guinea worm in a country.
Ex post, we can say that was a highly effective intervention since it was a team of three people operating on a budget of $12/(person-day)<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">6</a></sup> for a relatively short time period, making it a high ROI intervention, but there was no way to make a quantitative case for the intervention ex ante, nor does it seem plausible that there could've been a set of randomized trials or experiments that would've justified the intervention.</p> <p>Their intervention wasn't wiping out guinea worm; that was just a side effect. The intervention was, basically, travelling around the country and embedding in regional government offices in order to understand their problems and then advise/facilitate better decision making. In the course of talking to people and suggesting improvements/changes, they realized that guinea worm could be eliminated with better distribution of clean water (guinea worm can come from drinking unfiltered water; giving people clean water can solve that problem) and that aid money flowing into the country specifically for water-related projects, like building wells, was already sufficient if it was distributed to places in the country that had high rates of guinea worm due to contaminated water instead of to the places aid money was flowing to (which were locations that had a lot of aid money flowing to them for a variety of reasons, such as being near a local &quot;office&quot; that was doing a lot of charity work). The specific thing this team did to help wipe out guinea worm was to give powerpoint presentations to government officials on how the government could advise organizations receiving aid money on how those organizations could more efficiently place wells. At the margin, wiping out guinea worm in a country would probably be sufficient for the intervention to be high ROI, but that's a very small fraction of the &quot;return&quot; from this three person team. I only mention it because it's a self-contained, easily-quantifiable change. Most of the value of &quot;leveling up&quot; decision making in regional government offices is very difficult to quantify (and, to the extent that it can be quantified, will still have very large error bars).</p> <p>Many interventions that seem the same ex ante, probably even most, produce little to no impact. My friend has a lot of comments on organizations that send a lot of people around to do similar sounding work but that produce little value, such as the Peace Corps.</p> <p>A major difference between my friend's team and most teams is that my friend's team was composed of people who had a track record of being highly effective across a variety of contexts. In an earlier job, my friend started at a large-ish ($5B/yr revenue) government-run utility company and was immediately assigned a problem that, unbeknownst to her, had been an open problem for years that was considered to be unsolvable. No one was willing to touch the problem, so they hired her because they wanted a scapegoat to blame and fire when the problem blew up. Instead, she solved the problem she was assigned to as well as a number of other problems that were considered unsolvable. 
A team of three such people will be able to get a lot of mileage out of potentially high ROI interventions that most teams would not succeed at, such as going to a foreign country and improving governmental decision making in regional offices across the country enough that the government is able to solve serious open problems that had been plaguing the country for decades.</p> <p>Many of the highest ROI interventions are similarly skill intensive and not amenable to simple back-of-the-envelope calculations, but most discussions I see on the topic, both in person and online, rely heavily on simplistic but irrelevant back-of-the-envelope calculations. This is not a problem limited to cocktail-party conversations. My friend's intervention was almost killed by the organization she worked for because the organization was infested with what she thinks of as &quot;overly simplistic <a href="https://en.wikipedia.org/wiki/Effective_altruism">EA</a> thinking&quot;, which caused leadership in the organization to try to redirect resources to projects where the computation of expected return was simpler because those projects were thought to be higher impact even though they were, ex post, lower impact. Of course, we shouldn't judge interventions on how they performed ex post since that will overly favor high variance interventions, but I think that someone thinking it through, who was willing to exercise their judgement instead of outsourcing their judgement to a simple metric, could and should say that the intervention in question was a good choice ex ante.</p> <p>This issue of projects which are more <a href="https://amzn.to/3qGiucN">legible</a> getting more funding is an issue across organizations as well as within them. For example, my friend says that, back when GiveWell was mainly or only recommending charities that had easily quantifiable returns, she basically couldn't get her friends who worked in other fields to put resources towards efforts that weren't endorsed by GiveWell. People who didn't know about her aid background would say things like &quot;haven't you heard of GiveWell?&quot; when she suggested putting resources towards any particular cause, project, or organization.</p> <p>I talked to a friend of mine who worked at GiveWell during that time period about this and, according to him, the reason GiveWell initially focused on charities that had easily quantifiable value wasn't that they thought those were the highest impact charities. Instead, it was because, as a young organization, they needed to be credible and it's easier to make a credible case for charities whose value is easily quantifiable. He would not, and he thinks GiveWell would not, endorse donors funnelling all resources into charities endorsed by GiveWell and neglecting other ways to improve the world. But many people want the world to be simple and apply the algorithm &quot;charity on GiveWell list = good; not on GiveWell list = bad&quot; because it makes the world simple for them.</p> <p>Unfortunately for those people, as well as for the world, the world is not simple.</p> <p>Coming back to the tech company examples, Laurence Tratt notes something that I've also observed:</p> <blockquote> <p>One thing I've found very interesting in large organisations is when they realise that they need to do something different (i.e. they're slowly failing and want to turn the ship around). The obvious thing is to let a small team take risks on the basis that they might win big. 
Instead they tend to form endless committees which just perpetuate the drift that caused the committees to be formed in the first place! I think this is because they really struggle to see people as anything other than fungible, even if they really want to: it's almost beyond their ability to break out of their organisational mould, even when it spells long-term doom.</p> </blockquote> <p>One lens we can use to look at what's going on is <a href="https://www.ribbonfarm.com/2010/07/26/a-big-little-idea-called-legibility/">legibility</a>. When you have a complex system, whether that's a company with thousands of engineers or a world with many billions of dollars going to aid work, the system is too complex for any decision maker to really understand, whether that's an exec at a company or a potential donor trying to understand where their money should go. One way to address this problem is by reducing the perceived complexity of the problem via imagining that individuals are fungible, making the system more legible. That produces relatively inefficient outcomes but, unlike <a href="https://twitter.com/benskuhn/status/1458948982059593730">trying to understand the issues at hand</a>, it's highly scalable, and if there's one thing that tech companies like, it's doing things that scale, and treating a complex system like it's SimCity or Civilization is highly scalable. When returns are relatively evenly distributed, losing out on potential outlier returns in the name of legibility is a good trade-off. But when ROI is a heavy-tailed distribution, when the right person can, on their paternity leave, increase the revenue of a giant tech company by 0.7% and then much more when they work on that full-time, then severely tamping down on the right side of the curve to improve legibility is very costly and can cost you the majority of your potential returns.</p> <p>Thanks to Laurence Tratt, Pam Wolf, Ben Kuhn, Peter Bhat Harkins, John Hergenroeder, Andrey Mishchenko, Joseph Kaptur, and Sophia Wisdom for comments/corrections/discussion.</p> <h3 id="appendix-re-orgs">Appendix: re-orgs</h3> <p>A friend of mine recently told me a story about a trendy tech company where they tried to move six people to another project, one that the people didn't want to work on and that they thought didn't really make sense. The result was that two senior devs quit, the EM retired, one PM was fired (long story), and three people left the team. The team for both the old project and the new project had to be re-created from scratch.</p> <p>It could be much worse. In that case, at least there were some people who didn't leave the company. I once asked someone why feature X, which had been publicly promised, hadn't been implemented yet and why the entire sub-product was broken. The answer was that, after about a year of work, when shipping the feature was thought to be weeks away, leadership decided that the feature, which was previously considered a top priority, was no longer a priority and should be abandoned. The team argued that the feature was very close to being done and they just wanted enough runway to finish the feature. When that was denied, the entire team quit and the sub-product has slowly decayed since then. 
After many years, there was one attempted reboot of the team but, for reasons beyond the scope of this story, it was done with a new manager managing new grads and didn't really re-create what the old team was capable of.</p> <p>As we've previously seen, <a href="hardware-unforgiving/">an effective team is difficult to create, due to the institutional knowledge that exists on a team</a>, as well as <a href="culture/">the team's culture</a>, but destroying a team is very easy.</p> <p>I find it interesting that so many people in senior management roles persist in thinking that they can re-direct people as easily as opening up the city view in Civilization and assigning workers to switch from one task to another when the senior ICs I talk to have high accuracy in predicting when these kinds of moves won't work out.</p> <h3 id="appendix-related-posts">Appendix: related posts</h3> <ul> <li><a href="https://yosefk.com/blog/compensation-rationality-and-the-projectperson-fit.html">Yossi Kreinin on compensation and project/person fit</a></li> <li><a href="hardware-unforgiving/">Me on the difficulty of obtaining institutional knowledge</a></li> <li><a href="https://amzn.to/3qGiucN">James C. Scott on legibility</a></li> </ul> <div class="footnotes"> <hr /> <ol> <li id="fn:P"><p>On the flip side, there are managers who want to maximize the return to their career. At every company I've worked at that wasn't a startup, doing that involves moving up the ladder, which is easiest to do by collecting as many people as possible. At one company I've worked for, the explicitly stated promo criteria are basically &quot;how many people report up to this person&quot;.</p> <p>Tying promotions and compensation to the number of people managed could make sense if you think of people as mostly fungible, but is otherwise an obviously silly idea.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:E">This isn't quite this simple when you take into account retention budgets (money set aside from a pool that doesn't come out of the org's normal budget, often used to match offers from people who are leaving), etc., but adding this nuance doesn't really change the fundamental point. <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> <li id="fn:B"><p>There are advantages to a system where people don't have power, such as mitigating abuses of power, various biases, nepotism, etc. One can argue that reducing variance in outcomes by making people powerless is the preferred result, but in winner-take-most markets, which many tech markets are, forcing everyone to lowest-common-denominator effectiveness is a recipe for being an also-ran.</p> <p>A specific, small-scale example of this is the massive advantage <a href="corp-eng-blogs/">companies that don't have a bureaucratic comms/PR approval process for technical blog posts have</a>. The theory behind having the onerous process that most companies have is that the company is protected from the downside risk of a bad blog post, but examples of bad engineering blog posts that would've been mitigated by having an onerous process are few and far between, whereas the companies that have good processes for writing publicly get a lot of value that's easy to see.</p> <p>A larger scale example of this is that the large companies that are now worth &gt;= $500B all made aggressive moves that wouldn't have been possible at their bureaucracy-laden competitors, which allowed them to wipe the floor with their competitors. 
Of course, many other companies that made serious bets instead of playing it safe failed more quickly than companies trying to play it safe, but those companies at least had a chance, unlike the companies that played it safe.</p> <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:Q"><p>I'm generally skeptical of claims like this. At multiple companies that I've worked for, if you tally up the claimed revenue or user growth wins and compare them to actual revenue or user growth, you can see that there's some funny business going on since the total claimed wins are much larger than the observed total.</p> <p>Just because <a href="why-benchmark/">I'm generally curious about measurements</a>, I sometimes did my own analysis of people's claimed wins and I almost always came up with an estimate that was much lower than the original estimate. Of course, I generally didn't publish these results internally since that would, in general, be a good way to make a lot of enemies without causing any change. In one extreme case, I found that the experimental methodology one entire org used was broken, causing them to get spurious wins in their A/B tests. I quietly informed them and they did nothing about it, which was the only reasonable move for them since having experiments that systematically showed improvement when none existed was a cheap and effective way for the org to gain more power by having its people get promoted and having more headcount allocated to it. And if anyone with power over the bureaucracy cared about accuracy of results, such a large discrepancy between claimed wins and actual results couldn't exist in the first place.</p> <p>Anyway, despite my general skepticism of claimed wins, I found this person's claimed wins highly credible after checking them myself. A project of theirs, done on their paternity leave (done while on leave because their manager and, really, the organization as well as the company, didn't support the kind of work they were doing) increased the company's revenue by 0.7%, a result that was robust and actually increased in value through a long-term holdback, and they were able to produce wins of that magnitude after leadership was embarrassed into allowing them to do valuable work.</p> <p>P.S. If you'd like to play along at home, another fun game you can play is figuring out which teams and orgs hit their roadmap goals. For bonus points, plot the percentage of roadmap goals a team hits vs. their headcount growth as well as how predictive hitting last quarter's goals is for hitting next quarter's goals across teams.</p> <a class="footnote-return" href="#fnref:Q"><sup>[return]</sup></a></li> <li id="fn:L"><p>I've seen quite a few people leave their employers due to location adjustments during the pandemic. In one case, HR insisted the person was actually very well compensated because, even though it might appear as if the person isn't highly paid because they were paid significantly less than many people who were one level below them, according to HR's formula, which included a location-based pay adjustment, the person was one of the highest paid people for their level at the entire company in terms of normalized pay. 
Putting aside abstract considerations about fairness, <a href="https://yosefk.com/blog/compensation-rationality-and-the-projectperson-fit.html">for an employee</a>, HR telling them that they're highly paid given their location is like HR having a formula that pays based on height telling an employee that they're well paid for their height. That may be true according to whatever formula HR has but, practically speaking, that means nothing to the employee, who can go work somewhere that has a smaller height-based pay adjustment.</p> <p>Companies were able to get away with severe location-based pay adjustments with no cost to themselves before the pandemic. But, since the pandemic, a lot of companies have ramped up remote hiring and some of those companies have relatively small location-based pay adjustments, which has allowed them to disproportionately hire away who they choose from companies that still maintain severe location-based pay adjustments.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:T">Technically, their budget ended up being higher than this because one team member contracted typhoid and paid for some medical expenses from their personal budget and not from the organization's budget, but $12/(person-day), the organizational funding, is a pretty good approximation. <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> </ol> </div> Culture matters culture/ Mon, 08 Nov 2021 00:00:00 +0000 culture/ <p>Three major tools that companies have to influence behavior are incentives, process, and culture. People often mean different things when talking about these, so I'll provide an example of each so we're on the same page (if you think that I should be using a different word for the concept, feel free to mentally substitute that word).</p> <ul> <li><p>Getting people to show up to meetings on time</p> <ul> <li>Incentive: dock pay for people who are late</li> <li>Process: don't allow anyone who's late into the meeting</li> <li>Culture: people feel strongly about showing up on time</li> </ul></li> <li><p>Getting people to build complex systems</p> <ul> <li>Incentive: require complexity in promo criteria</li> <li>Process: make process for creating or executing on a work item so heavyweight that people stop doing simple work</li> <li>Culture: people enjoy building complex systems and/or building complex systems results in respect from peers and/or prestige</li> </ul></li> <li><p>Avoiding manufacturing defects</p> <ul> <li>Incentive: pay people per good item created and/or dock pay for bad items</li> <li>Process: have QA check items before shipment and discard bad items</li> <li>Culture: people value excellence and try very hard to avoid defects</li> </ul></li> </ul> <p>If you read &quot;old school&quot; thought leaders, many of them advocate for a culture-only approach, e.g., <a href="https://mobile.twitter.com/danluu/status/885214004649615360">Ken Thompson saying, to reduce bug rate, that tools (which, for the purposes of this post, we'll call process) aren't the answer, having people care to and therefore decide to avoid writing bugs is the answer</a> or Bob Martin saying &quot;<a href="https://www.hillelwayne.com/post/uncle-bob/">The solution to the software apocalypse is not more tools. 
The solution is better programming discipline</a>.&quot;</p> <p>The emotional reaction those kinds of over-the-top statements evoke, combined with the ease of rebutting them, has led to a backlash against cultural solutions, leading people to say things like &quot;you should never say that people need more discipline and you should instead look at the incentives of the underlying system&quot;, in the same way that the 10x programmer meme and the associated comments have caused a backlash that's led people to say things like <a href="productivity-velocity/">velocity doesn't matter at all</a> or there's absolutely no difference in velocity between programmers (<a href="https://scattered-thoughts.net/writing/moving-faster/">as Jamie Brandon has noted, a lot of velocity comes down to caring about and working on velocity</a>, so this is also part of the backlash against culture).</p> <p>But if we look at quantifiable output, we can see that, even if processes and incentives are the first-line tools a company should reach for, culture also has a large impact. For example, if we look at manufacturing defect rate, some countries persistently have lower defect rates than others on a timescale of decades<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">1</a></sup>, generally robust across companies, even when companies are operating factories in multiple countries and importing the same process and incentives to each factory to the extent that's possible, due to cultural differences that impact how people work.</p> <p>Coming back to programming, Jamie's post on &quot;moving faster&quot; notes:</p> <blockquote> <p>The main thing that helped is actually wanting to be faster.</p> <p>Early on I definitely cared more about writing 'elegant' code or using fashionable tools than I did about actually solving problems. Maybe not as an explicit belief, but those priorities were clear from my actions.</p> <p>I probably also wasn't aware how much faster it was possible to be. I spent my early career working with people who were as slow and inexperienced as I was.</p> <p>Over time I started to notice that some people are producing projects that are far beyond what I could do in a single lifetime. I wanted to figure out how to do that, which meant giving up my existing beliefs and trying to discover what actually works.</p> </blockquote> <p>I was lucky to have the opposite experience starting out since my first full-time job was at Centaur, a company that, at the time, had very high velocity/productivity. I'd say that I've only ever worked on one team with a similar level of productivity, and that's my current team, but my current team is fairly unusual for a team at a tech company (e.g., the median level on my team is &quot;senior staff&quot;)<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">2</a></sup>. A side effect of having started my career at such a high velocity company is that I generally find the pace of development slow at big companies and I see no reason to move slowly just because that's considered normal. I often hear similar comments from people I talk to at big companies who've previously worked at non-dysfunctional but not even particularly fast startups. 
A regular survey at one of the trendiest companies around asks &quot;Do you feel like your dev speed is faster or slower than your previous job?&quot; and the responses are bimodal, depending on whether the respondent came from a small company or a big one (with dev speed at TrendCo being slower than at startups and faster than at larger companies).</p> <p>There's a story that, <a href="https://amzn.to/3EZXykS">IIRC, was told by Brian Enos</a>, where he was practicing timed drills with the goal of practicing until he could complete a specific task at or under his usual time. He was having a hard time hitting his normal time and was annoyed at himself because he was slower than usual and kept at it until he hit his target, at which point he realized he'd misremembered the target and was accidentally targeting a new personal best time that was better than he thought was possible. While it's too simple to say that we can achieve anything if we put our minds to it, <a href="https://twitter.com/nickcammarata/status/1362261305357393920">almost none of us are operating at anywhere near our capacity and what we think we can achieve is often a major limiting factor</a>. Of course, at the limit, there's a tradeoff between velocity and quality and you can't get velocity &quot;for free&quot;, but, when it comes to programming, <a href="p95-skill/">we're so far from the Pareto frontier that there are free wins</a> if you just <a href="https://twitter.com/danluu/status/1442945072144678914">realize that they're available</a>.</p> <p>One way in which culture influences this is that people often absorb their ideas of what's possible from the culture they're in. For a non-velocity example, one thing I noticed after attending <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> was that a lot of speakers at the well-respected non-academic non-enterprise tech conferences, like Deconstruct and Strange Loop, also attended RC. Most people hadn't given talks before attending RC and, when I asked people, a lot of people had wanted to give talks but didn't realize how straightforward the process for becoming a speaker at &quot;big&quot; conferences is (have an idea, write it down, and then submit what you wrote down as a proposal). It turns out that giving talks at conferences is easy to do and a major blocker for many folks is just knowing that it's possible. In an environment where lots of people give talks and where people who hesitantly ask how they can get started are told that it's straightforward, a lot of people will end up giving talks. The same thing is true of blogging, which is why a disproportionately large fraction of widely read programming bloggers started blogging seriously after attending RC. For many people, the barrier to starting a blog is some combination of realizing it's feasible to start a blog and that, from a technical standpoint, it's very easy to start a blog if you just pick any semi-reasonable toolchain and go through the setup process. And then, because people give talks and write blog posts, they get better at giving talks and writing blog posts so, on average, RC alums are probably better speakers and writers than random programmers even though there's little to no skill transfer or instruction at RC.</p> <p>Another place where culture can really drive skills is with skills that are highly attitude dependent. An example of this is debugging. 
As Julia Evans has noted, <a href="https://twitter.com/b0rk/status/1249715842708844544">having a good attitude is a major component of debugging effectiveness</a>. This is something Centaur was very good at instilling in people, to the point that nearly everyone in my org at Centaur would be considered a very strong debugger by tech company standards.</p> <p>At big tech companies, it's common to see people give up on bugs after trying a few random things that didn't work. In one extreme example, someone I know at a mid-10-figure tech company said that it never makes sense to spend more than a couple of hours debugging any bug because engineer time is too valuable to waste on bugs that take longer than that, an attitude this person picked up from the first team they worked on. Someone who picks up that kind of attitude about debugging is unlikely to become a good debugger until they change their attitude, and many people, including this person, carry the attitudes and habits they pick up at their first job around for quite a long time<sup class="footnote-ref" id="fnref:N"><a rel="footnote" href="#fn:N">3</a></sup>.</p> <p>By tech standards, Centaur is an extreme example in the other direction. If you're designing a CPU, it's not considered ok to walk away from a bug that you don't understand. Even if the symptom of the bug isn't serious, it's possible that the underlying cause is actually serious and you won't observe the more serious symptom until you've shipped a chip, so you have to go after even seemingly trivial bugs. Also, it's pretty common for there to be no good or even deterministic reproduction of a bug. The repro is often something like &quot;run these programs with these settings on the system and then the system will hang and/or corrupt data after some number of hours or days&quot;. When debugging a bug like that, there will be numerous wrong turns and dead ends, some of which can eat up weeks or months. As a new employee watching people work on those kinds of bugs, what I observed was that people would come in day after day and track down bugs like that, not getting frustrated and not giving up. When that's the culture and everyone around you has that attitude, it's natural to pick up the same attitude. Also, a lot of practical debugging skill is applying tactical skills picked up from having debugged a lot of problems, which naturally falls out of spending a decent amount of time debugging problems with a positive attitude, especially with exposure to hard debugging problems.</p> <p>Of course, most bugs at tech companies don't warrant months of work, but there's a big difference between intentionally leaving some bugs undebugged because some bugs aren't worth fixing and having poor debugging skills from never having debugged a serious bug and then not being able to debug any bug that isn't completely trivial.</p> <p>Cultural attitudes can drive a lot more than individual skills like debugging. Centaur had, per capita, by far the lowest serious production bug rate of any company I've worked for, at well under one per year with ~100 engineers. By comparison, I've never worked on a team 1/10th that size that didn't have at least 10x the rate of serious production issues. Like most startups, Centaur was very light on process and it was also much lighter on incentives than the big tech companies I've worked for.</p> <p>One component of this was that there was a culture of owning problems, regardless of what team you were on. 
If you saw a problem, you'd fix it, or, if there was a very obvious owner, you'd tell them about the problem and they'd fix it. There weren't roadmaps, standups, kanban, or anything else to get people to work on important problems. People did it without needing to be reminded or prompted.</p> <p>That's the opposite of what I've seen at two of the three big tech companies I've worked for, where the median person avoids touching problems outside of their team's mandate like the plague, and someone who isn't politically savvy who brings up a problem to another team will get a default answer of &quot;sorry, this isn't on our roadmap for the quarter, perhaps we can put this on the roadmap in [two quarters from now]&quot;, with the same response repeated to anyone naive enough to bring up the same issue two quarters later. At every tech company I've worked for, huge, extremely costly problems slip through the cracks all the time because no one wants to pick them up. I never observed that happening at Centaur.</p> <p>A side effect of big company tech culture is that someone who wants to actually do the right thing can easily do very high (positive) impact work by just going around and fixing problems that any intern could solve, if they're willing to ignore organizational processes and incentives. You can't shake a stick without <a href="https://twitter.com/danluu/status/802971209176477696">hitting a problem that's worth more to the company than my expected lifetime earnings</a> and it's easy to knock off multiple such problems per year. <a href="algorithms-interviews/#appendix-misaligned-incentive-hedgehog-defense-part-3">Of course, the same forces that cause so many trivial problems to not get solved mean that people who solve those problems don't get rewarded for their work</a><sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">4</a></sup>.</p> <p>Conversely, in eight years at Centaur, I only found one trivial problem whose fix was worth more than I'll earn in my life because, in general, problems would get solved before they got to that point. I've seen various big company attempts to fix this problem using incentives (e.g., monetary rewards for solving important problems) and process (e.g., <a href="https://twitter.com/altluu/status/1497980098107953152">making a giant list of all projects/problems, on the order of 1000 projects, and having a single person order them, along with a bureaucratic system where everyone has to constantly provide updates on their progress via JIRA so that PMs can keep sending progress updates to the person who's providing a total order over the work of thousands of engineers</a><sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">5</a></sup>), but none of those attempts have worked even half as well as having a culture of ownership (to be fair to incentives, I've heard that FB uses monetary rewards to good effect, but <a href="https://twitter.com/danluu/status/1447268693075841024">I've failed FB's interview three times</a>, so I haven't been able to observe how that works myself).</p> <p>Another component that resulted in a relatively low severe bug rate was that, across the company at Centaur, people cared about quality in a way that I've never seen at a team level let alone at an org level at a big tech company. When you have a collection of people who care about quality and feel that no issue is off limits, you'll get quality. 
And when you onboard people, as long as you don't do it so quickly that the culture is overwhelmed by the new hires, they'll also tend to pick up the same habits and values, especially when you hire new grads. While it's not exactly common, there are plenty of small firms out there with a culture of excellence that generally persists without heavyweight processes or big incentives, but this doesn't work at big tech companies since they've all gone through a hypergrowth period where it's impossible to maintain such extreme (by mainstream standards) cultural values.</p> <p>So far, we've mainly discussed companies transmitting culture to people, but something that I think is no less important is how people then carry that culture with them when they leave. I've been <a href="https://twitter.com/danluu/status/1444034823329177602">reasonably successful since changing careers from hardware to software</a> and I think that, among the factors that are under my control, one of the biggest ones is that I picked up effective cultural values from the first place I worked full-time and continue to operate in the same way, which is highly effective. I've also seen this in other people who, career-wise, &quot;grew up&quot; in a culture of excellence and then changed to a different field where there's even less direct skill transfer, e.g., from skiing to civil engineering. Relatedly, if you read books from people who discuss the reasons why they were very effective in their field, e.g., <a href="https://amzn.to/3EZXykS">Practical Shooting by Brian Enos</a>, <a href="https://www.sirlin.net/ptw">Playing to Win by David Sirlin</a>, etc., the books tend to contain the same core ideas (serious observation and improvement of skills, the importance of avoiding emotional self-sabotage, the importance of intuition, etc.).</p> <p>Anyway, I think that cultural transmission of values and skills is an underrated part of choosing a job (some things I would consider overrated are <a href="https://www.patreon.com/posts/25835707">prestige</a> and <a href="https://twitter.com/danluu/status/1275191896097189888">general reputation</a>) and that people should be thoughtful about what cultures they spend time in because not many people are able to avoid at least somewhat absorbing the cultural values around them<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">6</a></sup>.</p> <p>Although this post is oriented around tech, there's nothing specific to tech about this. A classic example is how idealistic students will go to law school with the intention of doing &quot;save the world&quot; type work and then absorb the <a href="https://www.patreon.com/posts/25835707">prestige-transmitted cultural values</a> of the students around them and go into the most prestigious job they can get which, when it's not a clerkship, will be a &quot;BIGLAW&quot; job that's the opposite of &quot;save the world&quot; work. 
To first approximation, everyone thinks &quot;that will never happen to me&quot;, but from having watched many people join organizations where they <a href="wat/">initially find the values and culture very wrong</a>, almost no one is able to stay without, to some extent, absorbing the values around them; <a href="look-stupid/">very few people are ok with everyone around them looking at them like they're an idiot for having the wrong values</a>.</p> <h3 id="appendix-bay-area-culture">Appendix: Bay area culture</h3> <p>One thing I admire about the bay area is how infectious people's attitudes are with respect to trying to change the world. Everywhere I've lived, people gripe about problems (the mortgage industry sucks, selling a house is high friction, etc.). Outside of the bay area, it's just griping, but in the bay, when I talk to someone who was griping about something a year ago, there's a decent chance they've started a startup to try to address one of the problems they're complaining about. I don't think that people in the bay area are fundamentally different from people elsewhere; it's more that when you're surrounded by people who are willing to walk away from their jobs to try to disrupt an entrenched industry, it seems pretty reasonable to do the same thing (which also leads to network effects that make it easier from a &quot;technical&quot; standpoint, e.g., easier fundraising). <a href="https://slatestarcodex.com/2017/05/11/silicon-valley-a-reality-check/">There's a kind of earnestness in these sorts of complaints and attempts to fix them that's easy to mock</a>, but <a href="look-stupid/">that earnestness is something I really admire</a>.</p> <p>Of course, <a href="https://twitter.com/pushcx/status/1442860058660913166">not all of bay area culture is positive</a>. The bay has, among other things, <a href="https://devonzuegel.com/post/why-is-flaking-so-widespread-in-san-francisco">a famously flaky culture</a> to an extent I found shocking when I moved there. Relatively early on in my time there, I met some old friends for dinner and texted them telling them I was going to be about 15 minutes late. They were shocked when I showed up because they thought that saying that I was going to be late actually meant that I wasn't going to show up (an even more extreme version of this norm, which also surprised me, was that, for many people, not confirming plans shortly before their commencement means that the person has cancelled, i.e., plans are cancelled by default).</p> <p>A related norm that I've heard people complain about is how management and leadership will say yes to everything in a &quot;people pleasing&quot; move to avoid conflict, which actually increases conflict as people who heard &quot;yes&quot; as a &quot;yes&quot; and not as &quot;I'm saying yes to avoid saying no but don't actually mean yes&quot; are later surprised that &quot;yes&quot; meant &quot;no&quot;.</p> <h3 id="appendix-centaur-s-hiring-process">Appendix: Centaur's hiring process</h3> <p>One comment people sometimes have when I talk about Centaur is that they must've had some kind of incredibly rigorous hiring process that resulted in hiring elite engineers, but the hiring process was much less selective than any &quot;brand name&quot; big tech company I've worked for (Google, MS, and Twitter) and not obviously more selective than boring, old-school companies I've worked for (IBM and Micron). 
The &quot;one weird trick&quot; was onboarding, not hiring.</p> <p>For new grad hiring (and, proportionally, we hired a lot of new grads), recruiting was more difficult than at any other company I'd worked for. Senior hiring wasn't difficult because Centaur had a good reputation locally, in Austin, but among new grads, no one had heard of us and no one wanted to work for us. When I recruited at career fairs, I had to stand out in front of our booth and flag down people who were walking by to get anyone to talk to us. This meant that we couldn't be picky about who we interviewed. We really ramped up hiring of new grads around the time that Jeff Atwood, in his very influential post <a href="https://blog.codinghorror.com/why-cant-programmers-program/" rel="nofollow">Why Can't Programmers.. Program?</a>, popularized the idea that there are a bunch of fake programmers out there applying for jobs and that you'd end up with programmers who can't program if you don't screen people out with basic coding questions (the bolding below is his):</p> <blockquote> <p><strong>I am disturbed and appalled that any so-called programmer would apply for a job without being able to write the simplest of programs</strong>. That's a slap in the face to anyone who writes software for a living. ... It's a shame you have to do so much pre-screening to <strong>have the luxury of interviewing programmers who can actually program</strong>. It'd be funny if it wasn't so damn depressing</p> </blockquote> <p>Since we were a relatively coding oriented hardware shop (verification engineers primarily wrote software and design engineers wrote a lot of tooling), we tried asking a simple coding question where people were required to code up a function to output Fibonacci numbers given a description of how to compute them (the naive solution was fine; a linear time or faster solution wasn't necessary; a rough sketch of both is below). We dropped that question because no one got it without being walked through the entire thing in detail, which meant that the question had zero discriminatory power for us.</p> <p>Despite not really asking a coding question, people did things like write hairy concurrent code (internal processor microcode, which often used barriers as the concurrency control mechanism) and create tools at a higher velocity and lower bug rate than I've seen anywhere else I've worked.</p> <p>We were much better off avoiding hiring the way everyone else was because that meant we tried to and did hire people that other companies weren't competing over. That wouldn't make sense if other companies were using techniques that were highly effective, but they were doing things like asking people to code FizzBuzz and then whiteboard some algorithms. <a href="algorithms-interviews/">One might expect that doing algorithms interviews would result in hiring people who can solve the exact problems people ask about in interviews, but this turns out not to be the case</a>. <a href="https://twitter.com/danluu/status/1425514112642080773">The other thing we did was have much less of a prestige filter than most companies</a>, which also let us hire great engineers that other companies wouldn't even consider.</p> <p>We did have some people who didn't work out, but it was never because they were &quot;so-called programmers&quot; who couldn't &quot;write the simplest of programs&quot;. 
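For concreteness, here's a minimal sketch of the Fibonacci question mentioned above; this is an illustration in Python with made-up function names, not the exact question or wording we used. The naive, exponential-time version would've been a perfectly acceptable answer, and the linear-time version wasn't required:
<pre><code># A hedged, illustrative sketch (Python, made-up names) of the kind of
# answer described above; not the exact question or wording Centaur used.

def fib_naive(n):
    # fib(0) = 0, fib(1) = 1, fib(n) = fib(n-1) + fib(n-2)
    # Exponential time, but this would have counted as a pass.
    if n in (0, 1):
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_linear(n):
    # Linear-time version, which was explicitly not required.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fib_naive(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
print(fib_linear(50))                     # 12586269025
</code></pre>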
I do know of two cases of &quot;fake programmers&quot; being hired who literally couldn't program, but both were at prestigious companies that have among the most rigorous coding interviews done at tech companies. In one case, it was discovered pretty quickly that the person couldn't code and people went back to review security footage from the interview and realized that the person who interviewed wasn't the person who showed up to do the job. In the other, the person was able to sneak under the radar at Google for multiple years before someone realized that the person never actually wrote any code and tasks only got completed when they got someone else to do the task. The person who realized this eventually scheduled a pair programming session, where they discovered that the person wasn't able to write a loop, didn't know the difference between <code>=</code> and <code>==</code>, etc., despite being a &quot;senior SWE&quot; (L5/T5) at Google for years.</p> <p>I'm not going to say that having coding questions will never save you from hiring a fake programmer, but the rate of fake programmers appears to be low enough that a small company can go a decade without hiring a fake programmer even without asking a coding question, while larger companies that are targeted by scammers still can't really avoid them even after asking coding questions.</p> <h3 id="appendix-importing-culture">Appendix: importing culture</h3> <p>Although this post is about how company culture impacts employees, of course employees impact company culture as well. Something that seems underrated in hiring, especially of senior leadership and senior ICs, is how they'll impact culture. Something I've repeatedly seen, both up close and from a distance, is the hiring of a new senior person who manages to import their culture, which isn't compatible with the existing company's culture, causing serious problems and, frequently, high attrition, as things settle down.</p> <p>Now that I've been around for a while, I've been in the room for discussions on a number of very senior hires and I've never seen anyone else bring up whether or not someone will import incompatible cultural values, other than really blatant issues, like the person being a jerk or making racist or sexist comments in the interview.</p> <p>Thanks to Peter Bhat Harkins, Laurence Tratt, Julian Squires, Anja Boskovic, Tao L., Justin Blank, Ben Kuhn, V. Buckenham, Mark Papadakis, and Jamie Brandon for comments/corrections/discussion.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:C">Which countries actually have low defect rate manufacturing is often quite different from their general public reputation. To see this, you really need to look at the data, which is often NDA'd and generally only spread in &quot;bar room&quot; discussions. <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:S">Centaur had what I sometimes called &quot;the world's stupidest business model&quot;, competing with Intel on x86 chips starting in 1995, so it needed an extremely high level of productivity to survive. Through the bad years, AMD survived by selling off pieces of itself to fund continued x86 development and every other competitor (Rise, Cyrix, TI, IBM, UMC, NEC, and Transmeta) got wiped out. 
If you compare Centaur to the longest surviving competitor that went under, Transmeta, Centaur just plain shipped more quickly, which is a major reason that Centaur was able to survive until 2021 (when it was pseudo-acqui-hired by Intel) and Transmeta went under in 2009 after burning through ~$1B of funding (including payouts from lawsuits). Transmeta was founded in 1995 and shipped its first chip in 2000, which was considered a normal tempo for the creation of a new CPU/microarchitecture at the time; Centaur shipped its first chip in 1997 and continued shipping at a high cadence until 2010 or so (how things got slower and slower until the company stalled out and got acqui-hired is a topic for another post). <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:N">This person initially thought the processes and values on their first team were absurd before <a href="wat/">the cognitive dissonance got to them and they became a staunch advocate of the company's culture, which is typical for folks joining a company that has obviously terrible practices</a>. <a class="footnote-return" href="#fnref:N"><sup>[return]</sup></a></li> <li id="fn:B">This illustrates one way in which incentives and culture are non-independent. What I've seen in places where this kind of work isn't rewarded is that, due to the culture, making these sorts of high-impact changes frequently requires burnout-inducing slogs, at the end of which there is no reward, which causes higher attrition among people who have a tendency to own problems and do high-impact work. What I've observed in environments like this is that the environment differentially retains people who don't want to own problems, which then makes things more difficult and more burnout-inducing for new people who join and attempt to fix serious problems. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:R">I'm adding this note because, when I've described this to people, many people thought that this must be satire. It is not satire. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:L"><p>As with many other qualities, there can be high variance within a company as well as across companies. For example, there's a team I sometimes encountered at a company I've worked for that has a very different idea of customer service than most of the company, and people who join that team and don't quickly bounce usually absorb their values.</p> <p>Much of the company has a pleasant attitude towards internal customers, but this team has a &quot;the customer is always wrong&quot; attitude. A funny side effect of this is that, when I dealt with the team, I got the best support when a junior engineer who hadn't absorbed the team's culture was on call, and sometimes a senior engineer would say something was impossible or infeasible only to have a junior engineer follow up and trivially solve the problem.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> </ol> </div> Willingness to look stupid look-stupid/ Thu, 21 Oct 2021 00:00:00 +0000 look-stupid/ <p>People frequently<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">1</a></sup> think that I'm very stupid. I don't find this surprising, since I don't mind if other people think I'm stupid, which means that I don't adjust my behavior to avoid seeming stupid, which results in people thinking that I'm stupid. 
Although there are some downsides to people thinking that I'm stupid, e.g., failing interviews where the interviewer very clearly thought I was stupid, I think that, overall, the upsides of being willing to look stupid have greatly outweighed the downsides.</p> <p>I don't know why this one example sticks in my head but, for me, the most memorable example of other people thinking that I'm stupid was from college. I've had numerous instances where more people thought I was stupid and also where people thought the depth of my stupidity was greater, but this one was really memorable for me.</p> <p>Back in college, there was one group of folks that, for whatever reason, stood out to me as people who really didn't understand the class material. When they talked, they said things that didn't make any sense, they were struggling in the classes and barely passing, etc. I don't remember any direct interactions but, one day, a friend of mine who also knew them remarked to me, &quot;did you know [that group] thinks you're really dumb?&quot;. I found that interesting and asked why. It turned out the reason was that I asked really stupid sounding questions.</p> <p>In particular, it's often the case that there's a seemingly obvious but actually incorrect reason something is true, a slightly less obvious reason the thing seems untrue, and then a subtle and complex reason that the thing is actually true<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">2</a></sup>. I would regularly figure out that the seemingly obvious reason was wrong and then ask a question to try to understand the subtler reason, which sounded stupid to someone who thought the seemingly obvious reason was correct or thought that the refutation of the obvious but incorrect reason meant that the thing was untrue.</p> <p>The benefit from asking a stupid sounding question is small in most particular instances, but the compounding benefit over time is quite large and I've observed that people who are willing to ask dumb questions and think &quot;stupid thoughts&quot; end up understanding things much more deeply over time. Conversely, when I look at people who have a very deep understanding of topics, many of them frequently ask naive sounding questions and continue to apply one of the techniques that got them a deep understanding in the first place.</p> <p>I think I first became sure of something that I think of as a symptom of the underlying phenomenon via playing competitive video games when I was in high school. There were few enough people playing video games online back then that you'd basically recognize everyone who played the same game and could see how much everyone improved over time. Just like <a href="p95-skill/">I saw when I tried out video games again a couple years ago</a>, most people would blame external factors (lag, luck, a glitch, teammates, unfairness, etc.) when they &quot;died&quot; in the game. The most striking thing about that was that people who did that almost never became good and never became great. I got pretty good at the game<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">3</a></sup> and my &quot;one weird trick&quot; was to think about what went wrong every time something went wrong and then try to improve. 
But most people seemed more interested in making an excuse to avoid looking stupid (or maybe feeling stupid) in the moment than actually improving, which, of course, resulted in them having many more moments where they looked stupid in the game.</p> <p>In general, I've found willingness to look stupid to be very effective. Here are some more examples:</p> <ul> <li>Going into an Apple store and asking for (and buying) the computer that comes in the smallest box, which I had a good reason to want at the time <ul> <li>The person who helped me, despite being very polite, also clearly thought I was a bozo and kept explaining things like &quot;the size of the box and the size of the computer aren't the same&quot;. Of course I knew that, but I didn't want to say something like &quot;I design CPUs. I understand the difference between the size of the box the computer comes in and the size of the computer and I know it's very unusual to care about the size of the box, but I really want the one that comes in the smallest box&quot;. Just saying the last bit without establishing any kind of authority didn't convince the person</li> <li>I eventually asked them to humor me and just bring out the boxes for the various laptop models so I could see the boxes, which they did, despite clearly thinking that my decision making process made no sense (<a href="https://twitter.com/altluu/status/1452704171447123969">I also tried explaining why I wanted the smallest box but that didn't work</a>)</li> </ul></li> <li>Covid: I took this seriously relatively early on and bought a half mask respirator on 2020-01-26 and was using N95s I'd already had on hand for the week before (IMO, the case that covid was airborne and that air filtration would help was very strong based on the existing literature on SARS contact tracing, filtration of viruses from air filters, and viral load) <ul> <li>It wasn't until many months later that people didn't generally look at me like I was an idiot, and even as late as 2020-08, I would sometimes run into people who would verbally make fun of me</li> <li>On the flip side, the person I was living with at the time didn't want to wear the mask I got her since she found it too embarrassing to wear a mask for the 1 hour round-trip BART ride to and from a maker space when no one else was wearing one on BART or at the maker space. 
She became one of the early bay area covid cases, which gave her a case of long covid that floored her for months <ul> <li>When she got covid, I tried to convince her that she should tell people at the maker space she'd been going to that she got covid so they would know that they were exposed and could take appropriate precautions in order to avoid accidentally spreading covid, but she also found admitting that she might've spread covid to people too embarrassing to do (in retrospect, I should've just called up the maker space and told them)</li> </ul></li> <li>A semi-related one is that, when Canada started doing vaccines, I wanted to get Moderna even though the general consensus online and in my social circles was that Pfizer was preferred <ul> <li>One reason for this was that it wasn't clear if the government was going to allow mixing vaccines and the delivery schedule implied that there would be a very large shortage of Pfizer for 2nd doses as well as a large supply of Moderna</li> <li>Another thought that had crossed my mind was that Moderna is basically &quot;more stuff&quot; than Pfizer and might convey better immunity in some cases, in the same way that some populations get high-dose flu shots to get better immunity</li> </ul></li> </ul></li> <li>Work: I generally don't worry about proposals or actions looking stupid <ul> <li>I can still remember the first time I explicitly ran into this. This was very early on in my career, when I was working on chip verification. Shortly before tape-out, the head of verification wanted to use our compute resources to re-run a set of tests that had virtually no chance of finding any bugs (they'd been run thousands of times before) instead of running the usual mix of tests, which would include a lot of newly generated tests that had a much better chance of finding a bug (this was both logically and empirically true). I argued that we should run the tests that reduced the odds of shipping with a show-stopping bug (which would cost us millions of dollars and delay shipping by three months), but the head of the group said that we would look stupid and incompetent if there was a bug that could've been caught by one of our old &quot;golden&quot; tests that snuck in since the last time we'd run those tests <ul> <li>At the time, I was shocked that somebody would deliberately do the wrong thing in order to reduce the odds of potentially looking stupid (and, really, only looking stupid to people who wouldn't understand the logic of running the best available mix of tests; since there weren't non-technical people anywhere in the management chain, anyone competent should understand the reasoning) but now that I've worked at various companies in multiple industries, I see that most people would choose to do the wrong thing to avoid potentially looking stupid to people who are incompetent. 
I see the logic, but I think that it's self-sabotaging to behave that way and that the gains to my career for standing up for what I believe are the right thing have been so large that, even if the next ten times I do so, I get unlucky and it doesn't work out, that still won't erase the gains I've made from having done the right thing many times in the past</li> </ul></li> </ul></li> <li>Air filtration: I did a bit of looking into the impact of air quality on health and bought air filters for my apartment in 2012 <ul> <li>Friends have been chiding about this for years and strangers, dates, and acquaintances, will sometimes tell me, with varying levels of bluntness, that I'm being paranoid and stupid</li> <li>I added more air filtration capacity when I moved to a wildfire risk area <a href="https://mobile.twitter.com/altluu/status/1409762306452459520">after looking into wildfire risk</a> which increased the rate and bluntness of people telling me that I'm weird for having air filters <ul> <li>I've been basically totally unimpacted by wildfire despite living through a fairly severe wildfire season twice</li> <li>Other folks I know experienced some degree of discomfort, with a couple people developing persistent issues after the smoke exposure (in one case, persistent asthma, which they didn't have before or at least hadn't noticed before)</li> </ul></li> </ul></li> <li>Learning things that are hard for me: this is a &quot;feeling stupid&quot; thing and not a &quot;looking stupid&quot; thing, but when I struggle with something, I feel really dumb, as in, I have a feeling/emotion that I would verbally describe as &quot;feeling dumb&quot; <ul> <li>When I was pretty young, I think before I was a teenager, I noticed that this happened when I learned things that were hard for me and tried to think of this feeling as &quot;the feeling of learning something&quot; instead of &quot;feeling dumb&quot;, which half worked (I now associate that feeling with the former as well as the latter)</li> </ul></li> <li>Asking questions: covered above, but I frequently ask questions when there's something I don't understand or know, from basic stuff, &quot;what does [some word] mean?&quot; to more subtle stuff. <ul> <li>On the flip side, one of the most common failure modes I see with junior engineers is when someone will be too afraid to look stupid to ask questions and then learn very slowly as a result; in some cases, this is so severe it results in them being put on a PIP and then getting fired <ul> <li>I'm sure there are other reasons this can happen, like not wanting to bother people, but in the cases where I've been close enough to the situation to ask, it was always embarrassment and fear of looking stupid</li> <li>I try to be careful to avoid this failure mode when onboarding interns and junior folks and have generally been sucessful, but it's taken me up to six weeks to convince people that it's ok for them to ask questions and, until that happens, I have to constantly ask them how things are going to make sure they're not stuck. That works fine if someone is my intern, but I can observe that many intern and new hire mentors do not do this and that often results in a bad outcome for all parties <ul> <li>In almost every case, the person had at least interned at other companies, but they hadn't learned that it was ok to ask questions. P.S. 
if you're a junior engineer at a place where it's not ok to ask questions, you should look for another job if circumstances permit</li> </ul></li> </ul></li> </ul></li> <li>Not making excuses for failures: covered above for video games, but applies a lot more generally</li> <li>When learning, deliberately playing around in the area between success and failure (this applies to things like video games and sports as well as abstract intellectual pursuits) <ul> <li>An example would be, when learning to climb, repeatedly trying the same easy move over and over again in various ways to understand what works better and what works worse. I've had strangers make fun of me and literally point at me and make snide comments to their friends while I'm doing things like this</li> <li>When learning to drive, I wanted to set up some cones and drive so that I barely hit them, to understand where the edge of the car is. My father thought this idea was very stupid and I should just not hit things like curbs or cones</li> </ul></li> <li>Car insurance: the last time I bought car insurance, I had to confirm three times that I only wanted coverage for damage I do to others with no coverage for damage to my own vehicle if I'm at fault. The insurance agent was unable to refrain from looking at me like I'm an idiot and was more incredulous each time they asked if I was really sure</li> <li>The styling and content on this website: I regularly get design folks and typographers telling me how stupid the design is, frequently in ways that become condescending very quickly if I engage with them <ul> <li>But, when I tested out switching to the current design from the generally highly lauded Octopress design, this one got much better engagement when a user landed on the site and also appeared to get passed around a lot more as well</li> <li>When I've compared my traffic numbers to major corporate blogs, my blog completely dominates most &lt; $100B companies (e.g., it gets an order of magnitude more traffic than my employer's blog and my employer is a $50B company)</li> <li>When I started my blog (and this is still true today), writing advice for programming blogs <a href="https://twitter.com/danluu/status/1437539076324790274">was to keep it short, maybe 500 to 1000 words</a>. 
Most of my blog posts are 5000 to 10000 words</li> </ul></li> <li>Taking my current job, which almost everyone thought was a stupid idea <ul> <li>Closely related: quitting my job at Centaur to attend <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> and then eventually changing fields into software (I don't think this would be considered as stupid now, but it was thought to be a very stupid thing to do in 2013)</li> </ul></li> <li>Learning a sport or video game: I try things out to understand what happens when you do them, which often results in other people thinking that I'm a complete idiot when the thing looks stupid, but being willing to look stupid helps me improve relatively quickly</li> <li>Medical care: I've found that a lot of doctors are very confident in their opinion and get condescending pretty fast if you disagree <ul> <li>And yet, in the most extreme case, I would have died if I listened to my doctor; in the next most extreme case, I would have gone blind</li> <li>When getting blood draws, I explain to people that I'm deceptively difficult to draw from and tell them what's worked in the past <ul> <li>About half the time, the nurse or phlebotomist takes my comments seriously, generally resulting in a straightforward and painless or nearly painless blood draw</li> <li>About half the time, the nurse or phlebotomist looks at me like I'm an idiot and makes angry and/or condescending comments towards me; so far, everyone who's done this has failed to draw blood and/or given me a hematoma</li> <li>I've had people tell me that I'm probably stating my preferences an offensive way and that I should be more polite; I've then invited them along with me to observe and no one has ever had a suggestion on how I could state things different to elicit a larger fraction of positive responses; in general, people are shocked and upset when they see how nurses and phlebotomists respond</li> <li>In retrospect, I should probably just get up and leave when someone has the &quot;bad&quot; response, which will probably increase the person's feeling that I'm stupid</li> <li>One issue I have (and not the main one that makes it hard to &quot;get a stick&quot;) is that, during a blood draw, the blood will slow down and then usually stop. Some nurses like to wiggle the needle around to see if that starts things up again, which sometimes works (maybe 50/50) and will generally leave me with a giant bruise or a hematoma or both. After this happened a few times, I asked if getting my blood flowing (e.g., by moving around a lot before a blood draw) could make a difference and every nurse or phlebotomist I talked to said that was silly and that it wouldn't make any difference. 
I tried it anyway and that solved this problem, although I still have the problem of being hard to stick properly</li> </ul></li> </ul></li> <li>Interviews: I'm generally not adversarial in interviews, but I try to say things that I think are true and try to avoid saying things that I think are false and this frequently <a href="https://twitter.com/danluu/status/1447268693075841024">causes interviewers to think that I'm stupid</a> (<a href="algorithms-interviews/">I generally fail interviews at a fairly high rate</a>, so who knows for sure if this is related, but having someone look at you like you're an idiot or start using a condescending tone of voice with condescending body language after you say something &quot;stupid&quot; seems like a bad sign).</li> <li>Generally trying to improve at things as well as being earnest <ul> <li>Even before &quot;tryhard&quot; was an insult, a lot of people in my extended social circles thought that being a tryhard was idiotic and that one shouldn't try and should instead play it cool (this was before I worked as an engineer; as an engineer, I think that effort is more highly respected than among my classmates from school as well as internet folks I knew back when I was in school)</li> </ul></li> <li>Generally admitting when I'm bad or untalented at stuff, e.g., <a href="learning-to-program/">mentioning that I struggled to learn to program in this post</a>; an interviewer at Jane Street really dug into what I'd written in that post and tore me a new one for that post (it was the most hostile interview I've ever experienced by a very large margin), which is the kind of thing that sometimes happens when you're earnest and put yourself out there, but I still view the upsides as being greater than the downsides</li> <li>Recruiting: I have an unorthodox recruiting pitch which candidly leads with the downsides, often causing people to say that I'm a terrible recruiter (or sarcastically say that I'm a great recruiter); I haven't publicly written up the pitch (yet?) because it's negative enough that I'm concerned that I'd be fired for putting it on the internet <ul> <li>I have never failed to close a full-time candidate (I once failed to close an intern candidate) and have brought in a lot of people who never would've considered working for us otherwise. My recruiting pitch sounds comically stupid, but it's much more effective than the standard recruiting spiel most people give</li> </ul></li> <li>Posting things on the internet: self explanatory</li> </ul> <p>Although most of the examples above are &quot;real life&quot; examples, being willing to look stupid is also highly effective at work. Besides the obvious reason that it allows you to learn faster and become more effective, it also makes it much easier to find high ROI ideas. If you go after trendy or reasonable sounding ideas, to do something really extraordinary, you have to have better ideas/execution than everyone else working on the same problem. But if you're thinking about ideas that most people consider too stupid to consider, you'll often run into ideas that are both very high ROI as well as simple and easy that anyone could've done had they not dismissed the idea out of hand. It may still technically be true that you need to have better execution than anyone else who's trying the same thing, but if no one else trying the same thing, that's easy to do!</p> <p>I don't actually have to be nearly as smart or work nearly as hard as most people to get good results. 
If I try to solve some a problem by doing what everyone else is doing and go looking for problems where everyone else is looking, if I want to do something valuable, I'll have to do better than a lot of people, maybe even better than everybody else if the problem is really hard. If the problem is considered trendy, a lot of very smart and hardworking people will be treading the same ground and doing better than that is very difficult. But I have a dumb thought, one that's too stupid sounding for anyone else to try, I don't necessarily have to be particularly smart or talented or hardworking to come up with valuable solutions. Often, the dumb solution is something any idiot could've come up with and the reason the problem hasn't been solved is because no one was willing to think the dumb thought until an idiot like me looked at the problem.</p> <p>Overall, I view the upsides of being willing to look stupid as much larger than the downsides. When it comes to things that aren't socially judged, like winning a game, understanding something, or being able to build things due to having a good understanding, it's all upside. There can be downside for things that are &quot;about&quot; social judgement, like interviews and dates but, even there, I think a lot of things that might seem like downsides are actually upsides.</p> <p>For example, if a date thinks I'm stupid because I ask them what a word means, so much so that they show it in their facial expression and/or tone of voice, I think it's pretty unlikely that we're compatible, so I view finding that out sooner rather than later as upside and not downside.</p> <p>Interviews are the case where I think there's the most downside since, at large companies, the interviewer likely has no connection to the job or your co-workers, so them having a pattern of interaction that I would view as a downside has no direct bearing on the work environment I'd have if I were offered the job and took it. There's probably some correlation but I can probably get much more signal on that elsewhere. But I think that being willing to say things that I know have a good chance of causing people to think I'm stupid is a deeply ingrained enough habit that it's not worth changing just for interviews and I can't think of another context where the cost is nearly as high as it is in interviews. In principle, I could probably change how I filter what I say only in interviews, but I think that would be a very large amount of work and not really worth the cost. An easier thing to do would be to change how I think so that I reflexively avoid thinking and saying &quot;stupid&quot; thoughts, which a lot of folks seem to do, but that seems even more costly.</p> <h3 id="appendix-do-you-try-to-avoid-looking-stupid">Appendix: do you try to avoid looking stupid?</h3> <p>On reading a draft of this, Ben Kuhn remarked,</p> <blockquote> <p>[this post] caused me to realize that I'm actually very bad at this, at least compared to you but perhaps also just bad in general.</p> <p>I asked myself &quot;why can't Dan just avoid saying things that make him look stupid specifically in interviews,&quot; then I started thinking about what the mental processes involved must look like in order for that to be impossible, and realized they must be extremely different from mine. 
Then tried to think about the last time I did something that made someone think I was stupid and realized I didn't have a readily available example)</p> </blockquote> <p>One problem I expect this post to have is that most people will read this and decide that they're very willing to look stupid. This reminds me of how most people, when asked, think that they're creative, innovative, and take big risks. I think that feels true since people often operate at the edge of their comfort zone, but there's a difference between feeling like you're taking big risks and taking big risks, e.g., when asked, someone I know who is among the most conservative people I know thinks that they take a lot of big risks and names things like sometimes jaywalking as risk that they take.</p> <p>This might sound ridiculous, <a href="everything-is-broken/">as ridiculous as saying that I run into hundreds to thousands of software bugs per week</a>, but I think I run into someone who thinks that I'm an idiot in a way that's obvious to me around once a week. The car insurance example is from a few days ago, and if I wanted to think of other recent examples, there's a long string of them.</p> <p>If you don't regularly have people thinking that you're stupid, I think it's likely that at least one of the following is true</p> <ul> <li>You have extremely filtered interactions with people and basically only interact with people of your choosing and you have filtered out any people who have the reactions describe in this post <ul> <li>If you count internet comments, then you do not post things to the internet or do not read internet comments</li> </ul></li> <li>You are avoiding looking stupid</li> <li>You are not noticing when people think you're stupid</li> </ul> <p>I think the last one of those is unlikely because, while I sometimes have interactions like the school one described, where the people were too nice to tell me that they think I'm stupid and I only found out via a third party, just as often, the person very clearly wants me to know that they think I'm stupid. The way it happens reminds me of being a pedestrian in NYC, where, when a car tries to cut you off when you have right of way and fails (e.g., when you're crossing a crosswalk and have the walk signal and the driver guns it to try to get in front of you to turn right), the driver will often scream at you and gesture angrily until you acknowledge them and, if you ignore them, will try very hard to get your attention. In the same way that it seems very important to some people who are angry that you know they're angry, many people seem to think it's very important that you know that they think that you're stupid and will keep increasing the intensity of their responses until you acknowledge that they think you're stupid.</p> <p>One thing that might be worth noting is that I don't go out of my way to sound stupid or otherwise be non-conformist. If anything, it's the opposite. I generally try to conform in areas that aren't important to me when it's easy to conform, e.g., I dressed more casually in the office on the west coast than on the east coast since it's not important to me to convey some particular image based on how I dress and I'd rather spend my &quot;weirdness points&quot; on pushing radical ideas than on dressing unusually. 
After I changed how I dressed, one of the few people in the office who dressed really sharply in a way that would've been normal in the east coast office jokingly said to me, &quot;so, the west coast got to you, huh?&quot; and a few other people remarked that I looked a lot less stuffy/formal.</p> <p>Another thing to note is that &quot;avoiding looking stupid&quot; seems to usually go beyond just filtering out comments or actions that might come off as stupid. Most people I talk to (and Ben is an exception here) have a real aversion evaluating stupid thoughts and (I'm guessing) also to having stupid thoughts. When I have an idea that sounds stupid, it's generally (and again, Ben is an exception here) extremely difficult to get someone to really consider the idea. Instead, most people reflexively reject the idea without really engaging with it at all and (I'm guessing) the same thing happens inside their heads when a potentially stupid sounding thought might occur to them. I think the danger here is not having a concious process that lets you decide to broadcast or not broadcast stupid sounding thoughts (that seems great if it's low overhead), and instead it's having some non-concious process automatically reject thinking about stupid sounding things.</p> <p>Of course, stupid-sounding thoughts are frequently wrong, so, if you're not going to rely on social proof to filter out bad ideas, you'll have to hone your intuition or find trusted friends/colleagues who are able to catch your stupid-sounding ideas that are actually stupid. That's beyond the scope of this post. but I'll note that <a href="p95-skill/">because almost no one attempts to hone their intuition for this kind of thing, it's very easy to get relatively good at it by just trying to do it at all</a>.</p> <h3 id="appendix-stories-from-other-people">Appendix: stories from other people</h3> <p>A disproportionate fraction of people whose work I really respect operate in a similar way to me with respect to looking stupid and also have a lot of stories about looking stupid.</p> <p>One example from Laurence Tratt is from when he was job searching:</p> <blockquote> <p>I remember being rejected from a job at my current employer because a senior person who knew me told other people that I was &quot;too stupid&quot;. For a long time, I found this bemusing (I thought I must be missing out on some deep insights), but eventually I found it highly amusing, to the point I enjoy playing with it.</p> </blockquote> <p>Another example: the other day, when I was talking to Gary Bernhardt, he told me a story about a time when he was chatting with someone who specialized in microservices on Kubernetes for startups and Gary said that he thought that most small (by transaction volume) startups could get away with being on a managed platform like Heroku or Google App Engine. The more Gary explained about his opinion, the more sure the person was that Gary was stupid.</p> <h3 id="appendix-context">Appendix: context</h3> <p>There are a lot of contexts that I'm not exposed to where it may be much more effective to train yourself to avoid looking stupid or incompetent, e.g., <a href="https://twitter.com/apartovi/status/1449856639331340289">see this story by Ali Partovi about how his honesty led to Paul Graham's company being acquired by Yahoo instead of his own, which eventually led to Paul Graham founding YC and becoming one of the most well-known and influential people in the valley</a>. 
If you're in a context where it's more important to look competent than to be competent then this post doesn't apply to you. Personally, I've tried to avoid such contexts, although they're probably more lucrative than the contexts I operate in.</p> <h3 id="appendix-how-to-not-care-about-looking-stupid">Appendix: how to not care about looking stupid</h3> <p>This post has discussed what to do but not how to do it. Unfortunately, &quot;how&quot; is idiosyncratic and will vary greatly by person, so general advice here won't be effective. For myself, for better or for worse, this one came easy to me as I genuinely felt that I was fairly stupid during my formative years, so the idea that some random person thinks I'm stupid is like water off a duck's back.</p> <p>It's hard to say why anyone feels a certain way about anything, but I'm going to guess that, for me, it was a combination of two things. First, my childhood friends were all a lot smarter than me. In the abstract, I knew that there were other kids out there who weren't obviously smarter than me but, weighted by interactions, most of my interactions were with my friends, which influenced how I felt more than reasoning about the distribution of people that were out there. Second, I grew up in a fairly abusive household and one of the minor things that went along with the abuse was regularly being yelled at, sometimes for hours on end, for being so shamefully, embarrassingly, stupid (I was in the same class as <a href="https://en.wikipedia.org/wiki/Po-Shen_Loh">this kid</a> and my father was deeply ashamed that I didn't measure up).</p> <p>I wouldn't exactly recommend this path, but it seems to have worked out ok.</p> <p>Thanks to Ben Kuhn, Laurence Tratt, Jeshua Smith, Niels Olson, Justin Blank, Tao L., Colby Russell, Anja Boskovic, David Coletta, @conservatif, and Ahmad Jarara for comments/corrections/discussion.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:F">This happens in a way that I notice something like once a week and it seems like it must happen much more frequently in ways that I don't notice. <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:T"><p>A semi-recent example of this from my life is when I wanted to understand why wider tires have better grip. A naive reason one might think this is true is that wider tire = larger contact patch = more friction, and a lot of people seem to believe the naive reason. A reason the naive reason is wrong is because, as long as the tire is inflated semi-reasonably, given a fixed vehicle weight and tire pressure, the total size of the tire's contact patch won't change when tire width is changed. Another naive reason that the original naive reason is wrong is that, at a &quot;spherical cow&quot; level of detail, the level of grip is unrelated to the contact patch size.</p> <p>Most people I talked who don't race cars (e.g., autocross, drag racing, etc.) 
and <a href="https://twitter.com/danluu/status/1304093800474636288">the top search results online used the refutation to the naive reason plus an incorrect application of high school physics to incorrectly conclude that varying tire width has no effect on grip</a>.</p> <p>But there is an effect and the reason is subtler than more width = larger contact patch.</p> <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:B">I was arguably #1 in the world one season, when I put up a statistically dominant performance and my team won every game I played even though I disproportionately played in games against other top teams (and we weren't undefeated and other top players on the team played in games we lost). <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> </ol> </div> What to learn learn-what/ Mon, 18 Oct 2021 00:00:00 +0000 learn-what/ <p>It's common to see people advocate for learning skills that they have or using processes that they use. For example, Steve Yegge has a set of blog posts where he recommends reading compiler books and learning about compilers. His reasoning is basically that, if you understand compilers, you'll see compiler problems everywhere and will recognize all of the cases where people are solving a compiler problem without using compiler knowledge. Instead of hacking together some half-baked solution that will never work, you can apply a bit of computer science knowledge to solve the problem in a better way with less effort. That's not untrue, but it's also not a reason to study compilers in particular because you can say that about many different areas of computer science and math. Queuing theory, computer architecture, mathematical optimization, operations research, etc.</p> <p>One response to that kind of objection is to say that <a href="https://twitter.com/danluu/status/899141882760110081">one should study everything</a>. While being an extremely broad generalist can work, it's gotten much harder to &quot;know a bit of everything&quot; and be effective because there's more of everything over time (in terms of both breadth and depth). And even if that weren't the case, I think saying “should” is too strong; whether or not someone enjoys having that kind of breadth is a matter of taste. Another approach that can also work, one that's more to my taste, is to, <a href="https://alumni.media.mit.edu/~cahn/life/gian-carlo-rota-10-lessons.html">as Gian Carlo Rota put it</a>, learn a few tricks:</p> <blockquote> <p>A long time ago an older and well known number theorist made some disparaging remarks about Paul Erdos' work. You admire contributions to mathematics as much as I do, and I felt annoyed when the older mathematician flatly and definitively stated that all of Erdos' work could be reduced to a few tricks which Erdos repeatedly relied on in his proofs. What the number theorist did not realize is that other mathematicians, even the very best, also rely on a few tricks which they use over and over. Take Hilbert. The second volume of Hilbert's collected papers contains Hilbert's papers in invariant theory. I have made a point of reading some of these papers with care. It is sad to note that some of Hilbert's beautiful results have been completely forgotten. But on reading the proofs of Hilbert's striking and deep theorems in invariant theory, it was surprising to verify that Hilbert's proofs relied on the same few tricks. 
Even Hilbert had only a few tricks!</p> </blockquote> <p>If you look at how people succeed in various fields, you'll see that this is a common approach. For example, <a href="https://judoinfo.com/weers1/">this analysis of world-class judo players found that most rely on a small handful of throws</a>, concluding<sup class="footnote-ref" id="fnref:J"><a rel="footnote" href="#fn:J">1</a></sup></p> <blockquote> <p>Judo is a game of specialization. You have to use the skills that work best for you. You have to stick to what works and practice your skills until they become automatic responses.</p> </blockquote> <p>If you watch an anime or a TV series &quot;about&quot; fighting, people often improve by increasing the number of techniques they know because that's an easy thing to depict but, in real life, getting better at techniques you already know is often more effective than having a portfolio of hundreds of &quot;moves&quot;.</p> <p><a href="https://staffeng.com/stories/joy-ebertz" rel="nofollow">Relatedly, Joy Ebertz says</a>:</p> <blockquote> <p>One piece of advice I got at some point was to amplify my strengths. All of us have strengths and weaknesses and we spend a lot of time talking about ‘areas of improvement.’ It can be easy to feel like the best way to advance is to eliminate all of those. However, it can require a lot of work and energy to barely move the needle if it’s truly an area we’re weak in. Obviously, you still want to make sure you don’t have any truly bad areas, but assuming you’ve gotten that, instead focus on amplifying your strengths. How can you turn something you’re good at into your superpower?</p> </blockquote> <p>I've personally found this to be true in a variety of disciplines. While it's really difficult to measure programmer effectiveness in anything resembling an objective manner, this isn't true of some things I've done, like competitive video games (a very long time ago at this point, back before there was &quot;real&quot; money in competitive gaming), the thing that took me from being a pretty decent player to a <a href="look-stupid/#fn:B">very good player</a> was abandoning practicing things I wasn't particularly good at and focusing on increasing the edge I had over everybody else at the few things I was unusually good at.</p> <p>This can work for games and sports because you can get better maneuvering yourself into positions that take advantage of your strengths as well as avoiding situations that expose your weaknesses. I think this is actually more effective at work than it is in sports or gaming since, unlike in competitive endeavors, you don't have an opponent who will try to expose your weaknesses and force you into positions where your strengths are irrelevant. If I study queuing theory instead of compilers, a rival co-worker isn't going to stop me from working on projects where queuing theory knowledge is helpful and leave me facing a field full of projects that require compiler knowledge.</p> <p>One thing that's worth noting is that skills don't have to be things people would consider fields of study or discrete techniques. For the past three years, the main skill I've been applying and improving is something you might call &quot;looking at data&quot;; the term is in quotes because I don't know of a good term for it. I don't think it's what most people would think of as &quot;statistics&quot;, in that I don't often need to do anything as sophisticated as logistic regression, let alone actually sophisticated. 
Perhaps one could argue that this is something data scientists do, but if I look at what I do vs. what data scientists we hire do as well as what we screen for in data scientist interviews, we don't appear to want to hire data scientists with the skill I've been working on nor do they do what I'm doing (this is a long enough topic that I might turn it into its own post at some point).</p> <p>Unlike Matt Might or Steve Yegge, I'm not going to say that you should take a particular approach, but I'll say that working on a few things and not being particularly well rounded has worked for me in multiple disparate fields and it appears to work for a lot of other folks as well.</p> <p>If you want to take this approach, this still leaves the question of what skills to learn. This is one of the most common questions I get asked and I think my answer is probably not really what people are looking for and not very satisfying since it's both <a href="https://twitter.com/danluu/status/1428445465662603272">obvious and difficult to put into practice</a>.</p> <p>For me, two ingredients for figuring out what to spend time learning are having a relative aptitude for something (relative to other things I might do, not relative to other people) and also having a good environment in which to learn. To say that someone should look for those things is so vague that's it's nearly useless, but it's still better than the usual advice, which boils down to &quot;learn what I learned&quot;, which results in advice like &quot;Career pro tip: if you want to get good, REALLY good, at designing complex and stateful distributed systems at scale in real-world environments, learn functional programming. It is an almost perfectly identical skillset.&quot; or the even more extreme claims from some language communities, like Chuck Moore's claim that Forth is <a href="boring-languages/">at least 100x as productive as boring languages</a>.</p> <p>I took generic internet advice early in my career, including language advice (this was when much of this kind of advice was relatively young and it was not yet possible to easily observe that, despite many people taking advice like this, people who took this kind of advice were not particularly effective and people who are particularly effective were not likely to have taken this kind of advice). I learned <a href="https://twitter.com/sc13ts/status/1448003352655060997">Haskell, Lisp, Forth</a>, <a href="https://malisper.me/there-is-more-to-programming-than-programming-languages/">etc</a>. At one point in my career, I was on a two person team that implemented what might still be, a decade later, the highest performance Forth processor in existence (it was a 2GHz IPC-oriented processor) and I programmed it as well (there were good reasons for this to be a stack processor, so Forth seemed like as good a choice as any). <a href="https://yosefk.com/blog/my-history-with-forth-stack-machines.html">Like Yossi Kreinin, I think I can say that I spent more effort than most people have becoming proficient in Forth, and like him, not only did I not find it find it to be a 100x productivity tool, it wasn't clear that it would, in general, even be 1x on productivity</a>. 
To be fair, a number of other tools did better than 1x on productivity but, overall, I think following internet advice was very low ROI and the things that I learned that were high ROI weren't things people were recommending.</p> <p>In retrospect, when people said things like &quot;Forth is very productive&quot;, what I suspect they really meant was &quot;Forth makes me very productive and I have not considered how well this generalizes to people with different aptitudes or who are operating in different contexts&quot;. I find it totally plausible that Forth (or Lisp or Haskell or any other tool or technique) does work very well for some particular people, but I think that people tend to overestimate how much something working for them means that it works for other people, <a href="https://twitter.com/danluu/status/1355661542155378688">making advice generally useless because it doesn't distinguish between advice that's aptitude or circumstance specific and generalizable advice, which is in stark contrast to fields where people actually discuss the pros and cons of particular techniques</a><sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">2</a></sup>.</p> <p>While a coach can give you advice that's tailored to you 1 on 1 or in small groups, that's difficult to do on the internet, which is why the best I can do here is the uselessly vague &quot;pick up skills that are suitable for you&quot;. Just for example, two skills that clicked for me are &quot;having an adversarial mindset&quot; and &quot;looking at data&quot;. A perhaps less useless piece of advice is that, if you're having a hard time identifying what those might be, you can ask people who know you very well, e.g., my manager and Ben Kuhn independently named coming up with solutions that span many levels of abstraction as a skill of mine that I frequently apply (and I didn't realize I was doing that until they pointed it out).</p> <p>Another way to find these is to look for things you can't help but do that most other people don't seem to do, which is true for me of both &quot;looking at data&quot; and &quot;having an adversarial mindset&quot;. Just for example, on having an adversarial mindset, when a company I was working for was beta testing a new custom bug tracker, I filed some of the first bugs on it and put unusual things into the fields to see if it would break. Some people really didn't understand why anyone would do such a thing and were baffled, disgusted, or horrified, but a few people (including the authors, who I knew wouldn't mind), really got it and were happy to see the system pushed past its limits. Poking at the limits of a system to see where it falls apart doesn't feel like work to me; it's something that I'd have to stop myself from doing if I wanted to not do it, which made spending a decade getting better at testing and verification techniques felt like something hard not to do and not work. Looking deeply into data is one I've spent more than a decade on at this point and it's another one that, to me, emotionally feels almost wrong to not improve at.</p> <p>That these things are suited to me is basically due to my personality, and not something inherent about human beings. 
Other people are going to have different things that really feel easy/right for them, which is great, since if everyone was into looking at data and no one was into building things, that would be very problematic (although, IMO, looking at data is, on average, underrated).</p> <p>The other major ingredient in what I've tried to learn is finding environments that are conducive to learning things that line up with my skills that make sense for me. Although suggesting that other people do the same sounds like advice that's so obvious that it's useless, based on how I've seen people select what team and company to work on, I think that almost nobody does this and, as a result, discussing this may not be completely useless.</p> <p>An example of not doing this which typifies what I usually see is a case I just happened to find out about because I chatted with a manager about why their team had lost their new full-time intern conversion employee. I asked them about it since it was unusual for that manager to lose anyone since they're very good at retaining people and have low turnover on their teams. It turned out that their intern had wanted to work on infra, but had joined this manager's product team because they didn't know that they could ask to be on a team that matched their preferences. After the manager found out, the manager wanted the intern to be happy and facilitated a transfer to an infra team. In this case, this was a double whammy since the new hire doubly didn't consider working in an environment conducive for learning the skills they wanted. They made no attempt to work in the area they were interested in and then they joined a company that has a dysfunctional infra org that generally has poor design and operational practices, making the company a relatively difficult place to learn about infra on top of not even trying to land on an infra team. While that's an unusually bad example, in the median case that I've seen, people don't make decisions that result in particularly good outcomes with respect to learning even though good opportunities to learn are one of the top things people say that they want.</p> <p>For example, Steve Yegge has noted:</p> <blockquote> <p>The most frequently-asked question from college candidates is: &quot;what kind of training and/or mentoring do you offer?&quot; ... One UW interviewee just told me about Ford Motor Company's mentoring program, which Ford had apparently used as part of the sales pitch they do for interviewees. [I've elided the details, as they weren't really relevant. -stevey 3/1/2006] The student had absorbed it all in amazing detail. That doesn't really surprise me, because it's one of the things candidates care about most.</p> </blockquote> <p>For myself, I was lucky that my first job, Centaur, was a great place to develop having an adversarial mindset with respect to testing and verification. When I compare what the verification team there accomplished, it's comparable to peer projects at other companies that employed much larger teams to do very similar things with similar or worse effectiveness, implying that the team was highly productive, which made that a really good place to learn.</p> <p>Moreover, I don't think I could've learned as quickly on my own or by trying to follow advice from books or the internet. 
I think that <a href="hardware-unforgiving/">people who are really good at something have too many bits of information in their head about how to do it for that information to really be compressible into a book, let alone a blog post</a>. In sports, good coaches are able to convey that kind of information over time, but I don't know of anything similar for programming, so I think the best thing available for learning rate is to find an environment that's full of experts<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">3</a></sup>.</p> <p>For &quot;looking at data&quot;, while I got a lot better at it from working on that skill in environments where people weren't really taking data seriously, the rate of improvement during the past few years, where I'm in an environment where I can toss ideas back and forth with people who are very good at understanding the limitations of what data can tell you as well as good at informing data analysis with deep domain knowledge, has been much higher. I'd say that I improved more at this in each individual year at my current job than I did in the decade prior to my current job.</p> <p>One thing to perhaps note is that the environment, how you spend your day-to-day, is inherently local. My current employer is probably the least data driven of the three large tech companies I've worked for, but my vicinity is a great place to get better at looking at data because I spend a relatively large fraction of my time working with people who are great with data, like Rebecca Isaacs, and a relatively small fraction of the time working with people who don't take data seriously.</p> <p>This post has discussed some strategies with an eye towards why they can be valuable, but I have to admit that my motivation for learning from experts wasn’t to create value. It's more that I find learning to be fun and there are some areas where I'm motivated enough to apply the skills regardless of the environment, and learning from experts is such a great opportunity to have fun that it's hard to resist. Doing this for a couple of decades has turned out to be useful, but that's not something I knew would happen for quite a while (and I had no idea that this would effectively transfer to a new industry until I changed from hardware to software).</p> <p>A lot of career advice I see is oriented towards career or success or growth. That kind of advice often tells people to have a long-term goal or strategy in mind. It will often have some argument that's along the lines of &quot;a random walk will only move you sqrt(n) in some direction whereas a directed walk will move you n in some direction&quot;. I don't think that's wrong, but I think that, for many people, that advice implicitly underestimates the difficulty of finding an area that's suited to you<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">4</a></sup>, which I've basically <a href="https://twitter.com/jeanqasaur/status/1074528356324892672">done by trial and error</a>.</p> <h3 id="appendix-parts-of-the-problem-this-post-doesn-t-discuss-in-detail">Appendix: parts of the problem this post doesn't discuss in detail</h3> <p>One major topic not discussed is how to balance what &quot;level&quot; of skill to work on, which could be something high level, like &quot;looking at data&quot;, to something lower level, like &quot;Bayesian multilevel models&quot;, to something even lower level, like &quot;typing speed&quot;. 
That's a large enough topic that it deserves its own post that I'd expect to be longer than this one but, for now, <a href="productivity-velocity/#appendix-one-way-to-think-about-what-to-improve">here's a comment from Gary Bernhardt about something related that I believe also applies to this topic</a>.</p> <p>Another major topic that's not discussed here is picking skills that are relatively likely to be applicable. It's a little too naive to just say that someone should think about learning skills they have an aptitude for without thinking about applicability.</p> <p>But while it's pretty easy to pick out skills where it's very difficult to either have an impact on the world or make a decent amount of money or achieve whatever goal you might want to achieve, like &quot;basketball&quot; or &quot;boxing&quot;, it's harder to pick between plausible skills, like computer architecture vs. PL.</p> <p>But I think semi-reasonable sounding skills are likely enough to be high return if they're a good fit for someone that trial and error among semi-reasonable sounding skills is fine, although it probably helps <a href="productivity-velocity/">to be able to try things out quickly</a></p> <h3 id="appendix-related-posts">Appendix: related posts</h3> <ul> <li>Ben Kuhn on, in some sense, <a href="https://www.benkuhn.net/conviction/">what it's like to really learn something</a></li> <li>Holden Karnofsky on <a href="https://80000hours.org/podcast/episodes/holden-karnofsky-building-aptitudes-kicking-ass/">having an aptitude-first approach to careers instead of a career-path-first approach</a>, which is sort of analogous to thinking about cross cutting skills like &quot;looking at data&quot; or &quot;having an adversarial mindset&quot; and not just thinking about skills like &quot;compilers&quot; or &quot;queuing theory&quot;</li> <li>Peter Drucker on <a href="https://www.csub.edu/~ecarter2/CSUB.MKTG%20490%20F10/DRUCKER%20HBR%20Managing%20Oneself.pdf">how to understand one's strengths and weaknesses and do work that compatible with ones own inclinations</a></li> <li>Alexy Guzey on <a href="https://guzey.com/advice/">the effectiveness of advice</a></li> <li>Edward Kmett with <a href="https://www.youtube.com/watch?v=Z8KcCU-p8QA">another perspective on how to think about learning</a></li> <li>Patrick Collison <a href="https://patrickcollison.com/advice">on how to maximize useful learning and find what you'll enjoy</a></li> </ul> <p><small>Thanks to Ben Kuhn, Alexey Guzey, Marek Majkowski, Nick Bergson-Shilcock, @bekindtopeople2, Aaron Levin, Milosz Danczak, Anja Boskovic, John Doty, Justin Blank, Mark Hansen, &quot;wl&quot;, and Jamie Brandon for comments/corrections/discussion.</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:J">This is an old analysis. If you were to do one today, you'd see a different mix of throws, but it's still the case that you see specialists having a lot of success, e.g., Riner with osoto gari <a class="footnote-return" href="#fnref:J"><sup>[return]</sup></a></li> <li id="fn:F">To be fair to blanket, context free, advice, to learn a particular topic, functional programming really clicked for me and I could imagine that, if that style of thinking wasn't already natural for me (as a result of coming from a hardware background), the advice that one should learn functional programming because it will change how you think about problems might've been useful for me, but on the other hand, that means that the advice could've just as easily been to learn hardware engineering. 
<a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:M"><p>I don't have a large enough sample nor have I polled enough people to have high confidence that this works as a general algorithm but, for finding groups of world-class experts, what's worked for me is finding excellent managers. The two teams I worked on with the highest density of world-class experts have been teams under really great management. I have a higher bar for excellent management than most people and, from having talked to many people about this, almost no one I've talked to has worked for or even knows a manager as good as one I would consider to be excellent (and, general, both the person I'm talking to agrees with me on this, indicating that it's not the case that they have a manager who's excellent in dimensions I don't care about and vice versa); from discussions about this, I would guess that a manager I think of as excellent is at least 99.9%-ile. How to find such a manager is a long discussion that I might turn into another post.</p> <p>Anyway, despite having a pretty small sample on this, I think the mechanism for this is plausible, in that the excellent managers I know have very high retention as well as a huge queue of people who want to work for them, making it relatively easy for them to hire and retain people with world-class expertise since <a href="hiring-lemons/">the rest of the landscape is so bleak</a>.</p> <p>A more typical strategy, one that I don't think generally works and also didn't work great for me when I tried it is to work on the most interesting sounding and/or hardest problems around. While I did work with some really great people while trying to <a href="https://www.benkuhn.net/hard/">work on interesting / hard problems</a>, including one of the best engineers I've ever worked with, I don't think that worked nearly as well as looking for good management w.r.t. working with people I really want to learn from. I believe the general problem with this algorithm is the same problem with going to work in video games because video games are cool and/or interesting. The fact that so many people want to work on exciting sounding problems leads to dysfunctional environments that can persist indefinitely.</p> <p>In one case, I was on a team that had 100% turnover in nine months and it would've been six if it hadn't taken so long for one person to find a team to transfer to. In the median case, my cohort (people who joined around when I joined, ish) had about 50% YoY turnover and I think that people had pretty good reasons for leaving. Not only is this kind of turnover a sign that the environment is often a pretty unhappy one, these kinds of environments often differentially cause people who I'd want to work with and/or learn from to leave. For example, on the team I was on where the TL didn't believe in using version control, automated testing, or pipelined designs, I worked with Ikhwan Lee, who was great. Of course, Ikhwan left pretty quickly while the TL stayed and is still there six years later.</p> <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:S">Something I've seen many times among my acquaintances is that people will pick a direction before they have any idea whether or not it's suitable for them. 
Often, after quite some time (more than a decade in some cases), they'll realize that they're actually deeply unhappy with the direction they've gone, sometimes because it doesn't match their temperament, and sometimes because it's something they're actually bad at. In any case, wandering around randomly and finding yourself sqrt(n) down a path you're happy with doesn't seem so bad compared to having made it n down a path you're unhappy with. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> </ol> </div> Some reasons to work on productivity and velocity productivity-velocity/ Fri, 15 Oct 2021 00:00:00 +0000 productivity-velocity/ <p>A common topic of discussion among my close friends is where the bottlenecks are in our productivity and how we can execute more quickly. This is very different from what I see in my extended social circles, where people commonly say that <a href="https://twitter.com/danluu/status/1440106603093495810">velocity doesn't matter</a>. In online discussions about this, I frequently see people go a step further and assign moral valence to this, saying that it is actually bad to try to increase velocity or be more productive or work hard (see appendix for more examples).</p> <p>The top reasons I see people say that productivity doesn't matter (or is actually bad) fall into one of three buckets:</p> <ul> <li>Working on the right thing is more important than working quickly</li> <li>Speed at X doesn't matter because you don't spend much time doing X</li> <li>Thinking about productivity is bad and you should &quot;live life&quot;</li> </ul> <p>I certainly agree that working on the right thing is important, but increasing velocity doesn't stop you from working on the right thing. If anything, each of these is a force multiplier for the other. Having strong execution skills becomes more impactful if you're good at picking the right problem and vice versa.</p> <p>It's true that the gains from picking the right problem can be greater than the gains from having better tactical execution because the gains from picking the right problem can be unbounded, but it's also much easier to improve tactical execution and doing so also helps with picking the right problem because having faster execution lets you experiment more quickly, which helps you find the right problem.</p> <p>A concrete example of this is a project I worked on to quantify the machine health of the fleet. The project discovered a number of serious issues (a decent fraction of hosts were actively corrupting data or had a performance problem that would increase tail latency by &gt; 2 orders of magnitude, or both). This was considered serious enough that a new team was created to deal with the problem.</p> <p>In retrospect, my first attempts at quantifying the problem were doomed and couldn't have really worked (or not in a reasonable amount of time, anyway). I spent a few weeks cranking through ideas that couldn't work and a critical part of getting to the idea that did work after &quot;only&quot; a few weeks was being able to quickly try out and discard ideas that didn't work. In part of a previous post, I described how long a tiny part of that process took and multiple people objected to that being impossibly fast in internet comments.</p> <p>I find this a bit funny since I'm not a naturally quick programmer. <a href="learning-to-program/">Learning to program was a real struggle for me</a> and I was pretty slow at it for a long time (and I still am in aspects that I haven't practiced). 
My &quot;one weird trick&quot; is that I've explicitly worked on speeding up things that I do frequently and most people have not. I view the situation as somewhat analogous to sports before people really trained. For a long time, many athletes didn't seriously train, and then once people started trying to train, the training was often misguided by modern standards. For example, if you read commentary on baseball from the 70s, you'll see people saying that baseball players shouldn't weight train because it will make them &quot;muscle bound&quot; (many people thought that weight lifting would lead to &quot;too much&quot; bulk, causing people to be slower, have less explosive power, and be less agile). But today, players get a huge advantage from using performance-enhancing drugs that increase their muscle-bound-ness, which implies that players could not get too &quot;muscle bound&quot; from weight training alone. An analogous comment to one discussed above would be saying that athletes shouldn't worry about power/strength and should increase their skill, but power increases returns to skill and vice versa.</p> <p>Coming back to programming, if you explicitly practice and train and almost no one else does, you'll be able to do things relatively quickly compared to most people even if, like me, you don't have much talent for programming and getting started at all was a real struggle. Of course, there's always going to be someone more talented out there who's executing faster after having spent less time improving. But, luckily for me, <a href="p95-skill/">relatively few people seriously attempt to improve</a>, so I'm able to do ok.</p> <p>Anyway, despite operating at a rate that some internet commenters thought was impossible, it took me weeks of dead ends to find something that worked. If I was doing things at a speed that people thought was normal, I suspect it would've taken long enough to find a feasible solution that I would've dropped the problem after spending maybe one or two quarters on it. The number of plausible-ish seeming dead ends was probably not unrelated to why the problem was still an open problem despite being a critical issue for years. Of course, someone who's better at having ideas than me could've solved the problem without the dead ends, but as we discussed earlier, it's fairly easy to find low hanging fruit on &quot;execution speed&quot; and not so easy to find low hanging fruit on &quot;having better ideas&quot;. However, it's possible to, to a limited extent, simulate someone who has better ideas than me by being able to quickly try out and discard ideas (I also work on having better ideas, but I think it makes sense to go after the easier high ROI wins that are available as well). Being able to try out ideas quickly also improves the rate at which I can improve at having better ideas since a key part of that is building intuition by getting feedback on what works.</p> <p>The next major objection is that speed at a particular task doesn't matter because time spent on that task is limited. At a high level, I don't agree with this objection because, while this may hold true for any particular kind of task, the solution to that is to try to improve each kind of task and not to reject the idea of improvement outright. A sub-objection people have is something like &quot;but I spend 20 hours in unproductive meetings every week, so it doesn't matter what I do with my other time&quot;. 
I think this is doubly wrong, in that if you then only have 20 hours of potentially productive time, whatever productivity multiplier you have on that time still holds for your general productivity. Also, it's generally possible to drop out of meetings that are a lost cause and increase the productivity of meetings that aren't a lost cause<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">1</a></sup>.</p> <p>More generally, when people say that optimizing X doesn't help because they don't spend time on X and are not bottlenecked on X, that doesn't match my experience as I find I spend plenty of time bottlenecked on X for commonly dismissed Xs. I think that part of this is because getting faster at X can actually increase time spent on X due to a sort of virtuous feedback loop in where it makes sense to spend time. Another part of this is illustrated in this comment by Fabian Giesen:</p> <blockquote> <p>It is commonly accepted, verging on a cliche, that you have no idea where your program spends time until you actually profile it, but the corollary that you also don't know where <em>you</em> spend your time until you've measured it is not nearly as accepted.</p> </blockquote> <p>When I've looked at how people actually spend time vs. how they think they spend it, their estimates are wildly inaccurate, and I think there's a fundamental reason that, unless they measure, people's estimates of how they spend their time tend to be way off, which is nicely summed up by another Fabian Giesen quote, which happens to be about solving Rubik's cubes but applies to other cognitive tasks:</p> <blockquote> <p>Paraphrasing a well-known cuber, &quot;your own pauses never seem bad while you're solving, because your brain is busy and you know what you're thinking about, but once you have a video it tends to become blindingly obvious what you need to improve&quot;. Which is pretty much the usual &quot;don't assume, profile&quot; advice for programs, but applied to a situation where you're concentrated and busy for the entire time, whereas the default assumption in programming circles seems to be that as long as you're actually doing work and not distracted or slacking off, you can't possibly be losing a lot of time</p> </blockquote> <p>Unlike most people who discuss this topic online, I've actually looked at where my time goes and a lot of it goes to things that are canonical examples of things that you shouldn't waste time improving because people don't spend much time doing them.</p> <p>An example of one of these, the most commonly cited bad-thing-to-optimize example that I've seen, is typing speed (when discussing this, people usually say that typing speed doesn't matter because more time is spent thinking than typing). But, when I look at where my time goes, a lot of it is spent typing.</p> <p>A specific example is that I've written a number of influential docs at my current job and when people ask how long some doc took to write, they're generally surprised that the doc only took a day to write. As with the machine health example, a thing that velocity helps with is figuring out which docs will be influential. If I look at the docs I've written, I'd say that maybe 15% were really high impact (caused a new team to be created, changed the direction of existing teams, resulted in significant changes to the company's bottom line, etc.). 
Part of it is that I don't always know which ideas will resonate with other people, but part of it is also that I often propose ideas that are long shots because the ideas sound too stupid to be taken seriously (e.g., one of my proposed solutions to a capacity crunch was to, for each rack, turn off 10% of it, thereby increasing effective provisioned capacity, which is about as stupid sounding an idea as one could come up with). If I was much slower at writing docs, it wouldn't make sense to propose real long shot ideas. As things are today, if I think an idea has a 5% chance of success, in expectation, I need to spend ~20 days writing docs to have one of those land.</p> <p>I spend roughly half my writing time typing. If I typed at what some people say is the median typing speed (40 WPM) instead of the rate some random typing test clocked me at (110 WPM), this would be a 0.5 + 0.5 * 110/40 = 1.875x slowdown, putting me at nearly 40 days of writing before a longshot doc lands, which would make that a sketchier proposition. If I hadn't optimized the non-typing part of my writing workflow as well, I think I would be, on net, maybe 10x slower<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">2</a></sup>, which would put me at more like ~200 days per high impact longshot doc, which is enough that I think I probably wouldn't write longshot docs<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">3</a></sup>.</p>
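<p>As an aside, here's the arithmetic above written out as a small, self-contained sketch. The numbers in it (the 110 WPM and 40 WPM typing speeds, the roughly 50/50 split between typing and thinking, the one-day doc, and the 5% chance that a long shot lands) are just the figures quoted in this post, not measurements of anyone else's workflow, and the script doesn't model anything beyond that:</p> <pre><code class="language-scala">object DocWritingArithmetic extends App {
  // All of these numbers come from the post itself; they're inputs, not results.
  val typingShare = 0.5   // roughly half of writing time goes to typing
  val fastWpm     = 110.0 // typing-test speed quoted in the post
  val slowWpm     = 40.0  // the "median" typing speed some people quote
  val daysPerDoc  = 1.0   // a doc takes about a day at current speed
  val pDocLands   = 0.05  // a long-shot doc lands ~5% of the time

  // Only the typing half of the time scales with typing speed; the thinking half doesn't.
  val slowdown = (1 - typingShare) + typingShare * (fastWpm / slowWpm) // 0.5 + 0.5 * 2.75 = 1.875

  val daysPerLandedDocNow    = daysPerDoc / pDocLands            // ~20 days
  val daysPerLandedDocSlower = daysPerDoc * slowdown / pDocLands // ~37.5 days

  println(f"slowdown if typing at 40 WPM: $slowdown%.3fx")
  println(f"expected writing days per landed long-shot doc: $daysPerLandedDocNow%.0f now, $daysPerLandedDocSlower%.1f at 40 WPM")
}
</code></pre> <p>The point of writing it down isn't the output (you can do this in your head); it's that the &quot;is a long-shot doc still worth writing?&quot; question is driven almost entirely by the slowdown factor, which is why the 10x figure in the footnote changes the answer so much.</p>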
<p>More generally, Fabian Giesen has noted that this kind of non-linear impact of velocity is common:</p> <blockquote> <p>There are &quot;phase changes&quot; as you cross certain thresholds (details depend on the problem to some extent) where your entire way of working changes. ... There's a lot of things I could in theory do at any speed but in practice cannot, because as iteration time increases it first becomes so frustrating that I can't do it for long and eventually it takes so long that it literally drops out of my short-term memory, so I need to keep notes or otherwise organize it or I can't do it at all.</p> <p>Certainly if I can do an experiment in an interactive UI by dragging on a slider and see the result in a fraction of a second, at that point it's very &quot;no filter&quot;, if you want to try something you just do it.</p> <p>Once you're at iteration times in the low seconds (say a compile-link cycle with a statically compiled lang) you don't just try stuff anymore, you also spend time thinking about whether it's gonna tell you anything because it takes long enough that you'd rather not waste a run.</p> <p>Once you get into several-minute or multi-hour iteration times there's a lot of planning to not waste runs, and context switching because you do other stuff while you wait, and note-taking/bookkeeping; also at this level mistakes are both more expensive (because a wasted run wastes more time) and more common (because your attention is so divided).</p> <p>As you scale that up even more you might now take significant resources for a noticeable amount of time and need to get that approved and budgeted, which takes its own meetings etc.</p> </blockquote> <p>A specific example of something moving from one class of item to another in my work was <a href="metrics-analytics/">this project on metrics analytics</a>. There were a number of proposals on how to solve this problem. There was broad agreement that the problem was important with no dissenters, but the proposals were all the kinds of things you'd allocate a team to work on through multiple roadmap cycles. Getting a project that expensive off the ground requires a large amount of organizational buy-in, enough that many important problems don't get solved, including this one. But it turned out that, if scoped properly and executed reasonably, the project was actually something a programmer could create an MVP of in a day, which takes no organizational buy-in to get off the ground. Instead of needing to get multiple directors and a VP to agree that the problem is among the org's most important problems, you just need a person who thinks the problem is worth solving.</p> <p>Going back to Xs where people say velocity doesn't matter because they don't spend a lot of time on X, another one I see frequently is coding, and it is also not my personal experience that coding speed doesn't matter. For the machine health example discussed above, after I figured out something that would work, I spent one month working on basically nothing but that, coding, testing, and debugging. I think I had about 6 hours of meetings during that month, but other than that plus time spent eating, etc., I would go in to work, code all day, and then go home. I think it's much more difficult to compare coding speed across people because it's rare to see people do the same or very similar non-trivial tasks, so I won't try to compare to anyone else, but if I look at my productivity before I worked on improving it as compared to where I'm at now, the project probably would have been infeasible without the speedups I've found by looking at my velocity.</p> <p><a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's law</a> based arguments can make sense when looking for speedups in a fixed benchmark, like a sub-task of SPECint, but when you have a system where getting better at a task increases returns to doing that task and can increase time spent on the task, it doesn't make sense to say that you shouldn't work on getting faster at something just because you don't currently spend a lot of time doing it. I spend time on things that are high ROI, but those things are generally only high ROI because I've spent time improving my velocity, which reduces the &quot;I&quot; in ROI.</p> <p>The last major argument I see against working on velocity assigns negative moral weight to the idea of thinking about productivity and working on velocity at all. This kind of comment often assigns positive moral weight to various kinds of leisure, such as spending time with friends and family. I find this argument to be backwards. If someone thinks it's important to spend time with friends and family, an easy way to do that is to be more productive at work and spend less time working.</p> <p>Personally, I deliberately avoid working long hours and I suspect I don't work more than the median person at my company, which is a company where I think work-life balance is pretty good overall. A lot of my productivity gains have gone to leisure and not work. 
Furthermore, deliberately working on velocity has <a href="https://twitter.com/danluu/status/1444034823329177602">allowed me to get promoted relatively quickly</a><sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">4</a></sup>, which means that I make more money than I would've made if I didn't get promoted, which gives me more freedom to spend time on things that I value.</p> <p>For people who aren't arguing that you shouldn't think about productivity because it's better to focus on leisure and instead argue that you simply shouldn't think about productivity at all because it's unnatural and one should live a natural life, that ultimately comes down to personal preference, but for me, I value the things I do outside of work too much to not explicitly work on productivity at work.</p> <p>As with <a href="why-benchmark/">this post on reasons to measure</a>, while this post is about practical reasons to improve productivity, the main reason I'm personally motivated to work on my own productivity isn't practical. The main reason is that I enjoy the process of getting better at things, whether that's some nerdy board game, a sport I have zero talent at that will never have any practical value to me, or work. For me, a secondary reason is that, given that my lifespan is finite, I want to allocate my time to things that I value, and increasing productivity allows me to do more of that, but that's not a thought I had until I was about 20, at which point I'd already been trying to improve at most things I spent significant time on for many years.</p> <p>Another common reason for working on productivity is that mastery and/or generally being good at something seems satisfying for a lot of people. That's not one that resonates with me personally, but when I've asked other people about why they work on improving their skills, that seems to be a common motivation.</p> <p>A related idea, one that Holden Karnofsky has been talking about for a while, is that if you ever want to make a difference in the world in some way, it's useful to work on your skills even in jobs where it's not obvious that being better at the job is useful, because the developed skills will give you more leverage on the world when you switch to something that's more aligned with what you want to achieve.</p> <h3 id="appendix-one-way-to-think-about-what-to-improve">Appendix: one way to think about what to improve</h3> <p>Here's a framing I like from Gary Bernhardt (not set off in a quote block since this entire section, other than this sentence, is his).</p> <p>People tend to fixate on a single granularity of analysis when talking about efficiency. E.g., &quot;thinking is the most important part so don't worry about typing speed&quot;. If we step back, the response to that is &quot;efficiency exists at every point on the continuum from year-by-year strategy all the way down to millisecond-by-millisecond keystrokes&quot;. I think it's safe to assume that gains at the larger scale will have the biggest impact. But as we go to finer granularity, it's not obvious where the ROI drops off. Some examples, moving from coarse to fine:</p> <ol> <li>The macro point that you started with is: programming isn't just thinking; it's thinking plus tactical activities like editing code. Editing faster means more time for thinking.</li> <li>But editing code costs more than just the time spent typing! Programming is highly dependent on short-term memory. 
Every pause to edit is a distraction where you can forget the details that you're juggling. Slower editing effectively weakens your short-term memory, which reduces effectiveness.</li> <li>But editing code isn't just hitting keys! It's hitting keys plus the editor commands that those keys invoke. A more efficient editor can dramatically increase effective code editing speed, even if you type at the same WPM as before.</li> <li>But each editor command doesn't exist in a vacuum! There are often many ways to make the same edit. A Vim beginner might type &quot;hhhhxxxxxxxx&quot; when &quot;bdw&quot; is more efficient. An advanced Vim user might use &quot;bdw&quot;, not realizing that it's slower than &quot;diw&quot; despite having the same number of keystrokes. (In QWERTY keyboard layout, the former is all on the left hand, whereas the latter alternates left-right-left hands. At 140 WPM, you're typing around 14 keystrokes per second, so each finger only has 70 ms to get into position and press the key. Alternating hands leaves more time for the next finger to get into position while the previous finger is mid-keypress.)</li> </ol> <p>We have to choose how deep to go when thinking about this. I think that there's clear ROI in thinking about 1-3, and in letting those inform both tool choice and practice. I don't think that (4) is worth a lot of thought. It seems like we naturally find &quot;good enough&quot; points there. But that also makes it a nice fence post to frame the others.</p> <h3 id="appendix-more-examples">Appendix: more examples</h3> <ul> <li><a href="https://twitter.com/b0rk/status/1367172498954059791">Velocity doesn't matter</a>, from Julia Evans, who I believe has been the most widely read programming blogger since about 2015</li> <li><a href="https://news.ycombinator.com/item?id=10529064">In the comments on a post where Ben Kuhn notes that he got 50% more productive by allocating his time better, people are nearly uniformly negative about the post and say that he works too much</a>. Although Ben clarified in multiple comments as well as in the post that not all time tracked was worked, the commenters are too busy taking the moral high ground to actually respond to the contents of the post</li> <li><a href="https://news.ycombinator.com/item?id=28879240">Comments on Jamie Brandon's &quot;Speed Matters&quot;</a> <ul> <li><a href="https://news.ycombinator.com/item?id=28880190">Working quickly is pointless because you will be forced to do more work</a></li> <li><a href="https://news.ycombinator.com/item?id=28881360">Speed doesn't matter if you're doing the right thing, and also, if such a thing as speed did exist, it would be unmeasurable and therefore pointless to discuss</a></li> <li><a href="https://news.ycombinator.com/item?id=28879823">Thinking about productivity is unhealthy. One should relax instead</a></li> <li><a href="https://news.ycombinator.com/item?id=28881320">You can only choose 2 of &quot;good, fast, cheap&quot;, therefore it is counterproductive to work on speed</a></li> <li><a href="https://news.ycombinator.com/item?id=28880653">A large speedup is impossible</a></li> <li><a href="https://news.ycombinator.com/item?id=28880173">&quot;The author mistakes coding for typing&quot;</a></li> <li>etc.</li> <li>As with Ben's post, virtually all of these comments are addressed in the post itself. 
I'm going to stop noting when this is true because it is generally true of the posts referred to here.</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=22255996">The #3 comment on a post by Michael Malis on &quot;How to Improve Your Productivity as a Working Programmer&quot;</a>: &quot;Fuck it, the entire work environment seems designed to decrease productivity . . . Why should I bother . . .&quot; <ul> <li>#4 comment: &quot;What if I don't want to improve my productivity ? Just take time.&quot; <ul> <li>After the initial indignation, this comment goes on and proves that the commenter missed the point entirely, as the rest of the comment explains how the commenter works productively, which the commenter apparently is ok with as long as it's not phrased as a way to work productively, because one is supposed to be morally outraged by someone wanting to be productive and sharing techniques about how to be productive with other people who might be interested in being productive</li> <li>In the responses, someone points out that someone who's more productive would be able to spend more time on leisure; that comment is uniformly panned because &quot;work expands so as to fill the time available for its completion&quot;, as if how one spends time is some sort of immutable law of nature and not something under anyone's control</li> </ul></li> <li>Another comment: &quot;Alright. What are we optimizing for? Productivity? Or the end-goals of any of: achieving more, climbing the corporate ladder, making more money, etc..?&quot;</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=13752887">Comments on a post by antirez about productivity</a> <ul> <li><a href="https://news.ycombinator.com/item?id=13753443">The article is talking about the 10x programmer universe, not the normal universe most people live in</a></li> <li><a href="https://news.ycombinator.com/item?id=13753611">It's pointless to work on productivity since your environment determines productivity</a></li> <li><a href="https://news.ycombinator.com/item?id=13753465">Productive programmers are selfish, don't mentor, etc., and are bad for their teams because their increased productivity always comes from neglecting more important things, so anyone who's productive as a programmer is actually counterproductive for the team</a> <ul> <li>If you read all of the comments on the post, you'll see that this is a common theme</li> </ul></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=20737304">Comments on Alexey Guzey's thoughts on productivity</a> <ul> <li><a href="https://news.ycombinator.com/item?id=20737854">&quot;Serious question: Is anything less productive than reading other people's productivity thoughts? It's a combination of procrastination and finding out what works for someone who is presumably more productive than you (ie: guilt).&quot;</a></li> <li><a href="https://news.ycombinator.com/item?id=20737854">An anti-productivity article titled &quot;Against Productivity&quot;</a></li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=9715810">Typing speed doesn't matter because you only spend 0.5% to 1% of your time typing</a> <ul> <li>Despite the talk about 8-hour work days, I think people who get 4 hours of real work in a day are generally considered extremely productive. 
0.5% to 1% of 4 hours is 1.2 minutes to 2.4 minutes a day or, for someone who types 100 WPM, 120 to 240 total words <a href="https://thorstenball.com/blog/2020/09/01/typing-can-be-the-bottleneck/">across slack, JIRA, email, actual code, commit messages, design docs, comments on design docs, documentation, etc.</a>; I don't believe I know any professional programmers who type that little</li> </ul></li> <li><a href="https://news.ycombinator.com/item?id=38797640">&quot;I feel like there is a correlation between fast-twitch programming muscles and technical debt . . . but we were all young once, I remember thinking the only thing holding me back was 4.77MHz&quot;</a>, a comment on a blog post benchmarking build times on different machines (where the post has nothing resembling the idea that the only thing holding back developers is build times)</li> </ul> <p>etc.</p> <p>Some positive examples of people who have used their productivity to &quot;fund&quot; things that they value include Andy Kelley (Zig), Jamie Brandon (various), Andy Matuschak (mnemonic medium, various), Saul Pwanson (VisiData), Andy Chu (Oil Shell). I'm drawing from programming examples, but you can find plenty of others, e.g., Nick Adnitt (<a href="https://darksidecanoes.wordpress.com/">Darkside Canoes</a>) and, of course, numerous people who've retired to pursue interests that aren't work-like at all.</p> <h3 id="appendix-another-reason-to-avoid-being-productive">Appendix: another reason to avoid being productive</h3> <p>An idea that's become increasingly popular in my extended social circles at major tech companies is that one should avoid doing work and <a href="https://www.reddit.com/r/antiwork/comments/pvjc6f/they_dont_give_a_fuck_about_you/">waste as much time as possible</a>, often called &quot;antiwork&quot;, which seems like a natural extension of &quot;tryhard&quot; becoming an insult. The reason given is often something like: work mainly enriches upper management at your employer and/or shareholders, who are generally richer than you.</p> <p>I'm sympathetic to the argument and <a href="https://twitter.com/danluu/status/802971209176477696">agree that upper management and shareholders capture most of the value from work</a>. But as much as I sympathize with the idea of deliberately being unproductive to &quot;stick it to the man&quot;, I value spending my time on things that I want enough that I'd rather get my work done quickly so I can do things I enjoy more than work. Additionally, having been productive in the past has given me good options for jobs, so I have work that I enjoy a lot more than my acquaintances in tech who have embraced the &quot;antiwork&quot; movement.</p> <p>The less control you have over your environment, the more it makes sense to embrace &quot;antiwork&quot;. Programmers at major tech companies have, relatively speaking, a lot of control over their environment, which is why I'm not &quot;antiwork&quot; even though I'm sympathetic to the cause.</p> <p>Although it's about a different topic, there's a related comment <a href="https://twitter.com/PracheeAC/status/1448789430488092672">from Prachee Avasthi about how avoiding controversial work and avoiding pushing for necessary changes when pre-tenure ingrains habits that are hard to break post-tenure</a>. 
If one wants to be &quot;antiwork&quot; forever, that's not a problem, but if one wants to move the needle on something at some point, building &quot;antiwork&quot; habits while working for a major tech company will instill counterproductive habits.</p> <p><small> Thanks to Fabian Giesen, Gary Bernhardt, Ben Kuhn, David Turner, Marek Majkowski, Anja Boskovic, Aaron Levin, Lifan Zeng, Justin Blank, Heath Borders, Tao L., Nehal Patel, @chozu@fedi.absturztau.be, Alex Allain, and Jamie Brandon for comments/corrections/discussion</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:M">When I look at the productiveness of meetings, there are some people who are very good at keeping meetings on track and useful. For example, one person who I've been in meetings with who is extraordinarily good at ensuring meetings are productive is Bonnie Eisenman. Early on in my current job, I asked her how she was so effective at keeping meetings productive and have been using that advice since then (I'm not nearly as good at it as she is, but even so, improving at this was a significant win for me). <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:T"><p>10x might sound like an implausibly large speedup on writing, but in a discussion on writing speed on a private slack, a well-known newsletter author mentioned that their net writing speed for a 5k word newsletter was a little under 2 words per minute (WPM). My net rate (including time spent editing, etc.) is over 20 WPM per doc.</p> <p>With a measured typing speed of 110 WPM, that might sound like I spend a small fraction of my time typing, but it turns out it's roughly half the time. If I look at my writing speed, it's much slower than my typing test speed and it seems that it's perhaps half the rate. If I look at where the actual time goes, roughly half of it goes to typing and half goes to thinking, semi-serially, which creates long pauses in my typing.</p> <p>If I look at where the biggest win here could come, it would be from thinking and typing in parallel, which is something I'd try to achieve by practicing typing more, not less. But even without being able to do that, and with above average typing speed, I still spend half of my time typing!</p> <p>The reason my net speed is well under the speed that I write is that I do multiple passes and re-write. Some time is spent reading as I re-write, but I read much more quickly than I write, so that's a pretty small fraction of time. In principle, I could adopt an approach that involves less re-writing, but I've tried a number of things that one might expect would lead to that goal and haven't found one that works for me (yet?).</p> <p>Although the example here is about work, this also holds for my personal blog, where my velocity is similar. If I wrote ten times slower than I do, I don't think I'd have much of a blog. My guess is that I would've written a few posts or maybe even a few drafts and not gotten to the point where I'd post and then stop.</p> <p>I enjoy writing and get a lot of value out of it in a variety of ways, but I value the other things in my life enough that I don't think writing would have a place in my life if my net writing speed were 2 WPM.</p> <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:S"><p>Another strategy would be to write shorter docs. 
There's a style of doc where that works well, but I frequently write docs where I leverage my writing speed to discuss a problem that would be difficult to convincingly discuss without a long document.</p> <p>One example of a reason my docs end up long is that I frequently work on problems that span multiple levels of the stack, which means that I end up presenting data from multiple levels of the stack as well as providing enough context about why the problem at some level drives a problem up or down the stack for people who aren't deeply familiar with that level of the stack, which is necessary since few readers will have strong familiarity with every level needed to understand the problem.</p> <p>In most cases, there have been previous attempts to motivate/fund work on the problem that didn't get traction because there wasn't a case linking an issue at one level of the stack to important issues at other levels of the stack. I could avoid problems that span many levels of the stack, but there's a lot of low hanging fruit among those sorts of problems for technical and organizational reasons, so I don't think it makes sense to ignore them just because it takes a day to write a document explaining the problem (although it might make sense if it took ten days, at least in cases where people might be skeptical of the solution).</p> <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:P">Of course, promotions are highly unfair and being more productive doesn't guarantee promotion. If I just look at what things are correlated with level, <a href="https://twitter.com/altluu/status/1448012821854257155">it's not even clear to me that productivity is more strongly correlated with level than height</a>, but among factors that are under my control, productivity is one of the easiest to change. <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> </ol> </div> The value of in-house expertise in-house/ Wed, 29 Sep 2021 00:00:00 +0000 in-house/ <p>An alternate title for this post might be, &quot;Twitter has a kernel team!?&quot;. At this point, I've heard that surprised exclamation enough that I've lost count of the number of times that's been said to me (I'd guess that it's more than ten but less than a hundred). If we look at trendy companies that are within a couple factors of two of Twitter's size (in terms of either market cap or number of engineers), they mostly don't have similar expertise, often as a result of path dependence — because they &quot;grew up&quot; in the cloud, they didn't need kernel expertise to keep the lights on the way an on prem company does. While that makes it socially understandable that people who've spent their careers at younger, trendier companies are surprised by Twitter having a kernel team, I don't think there's a technical reason for the surprise.</p> <p>Whether or not it has kernel expertise, a company Twitter's size is going to regularly run into kernel issues, from major production incidents to papercuts. Without a kernel team or the equivalent expertise, the company will muddle through the issues, running into unnecessary problems as well as taking an unnecessarily long time to mitigate incidents. 
As an example of a critical production incident, just because it's already been written up publicly, I'll cite <a href="https://blog.twitter.com/engineering/en_us/topics/open-source/2020/hunting-a-linux-kernel-bug">this post</a>, which dryly notes:</p> <blockquote> <p>Earlier last year, we identified a firewall misconfiguration which accidentally dropped most network traffic. We expected resetting the firewall configuration to fix the issue, but resetting the firewall configuration exposed a kernel bug</p> </blockquote> <p>What this implies but doesn't explicitly say is that this firewall misconfiguration was the most severe incident that's occurred during my time at Twitter and I believe it's actually the most severe outage that Twitter has had since 2013 or so. As a company, we would've still been able to mitigate the issue without a kernel team or another team with deep Linux expertise, but it would've taken longer to understand why the initial fix didn't work, which is the last thing you want when you're debugging a serious outage. Folks on the kernel team were already familiar with the various diagnostic tools and debugging techniques necessary to quickly understand why the initial fix didn't work, which is not common knowledge at some peer companies (I polled folks at a number of similar-scale peer companies to see if they thought they had at least one person with the knowledge necessary to quickly debug the bug and the answer was no at many companies).</p> <p>Another reason to have in-house expertise in various areas is that it easily pays for itself, which is a special case of <a href="sounds-easy/">the generic argument that large companies should be larger than most people expect because tiny percentage gains are worth a large amount in absolute dollars</a>. If, in the lifetime of a specialist team like the kernel team, a single person found something that persistently reduced <a href="https://en.wikipedia.org/wiki/Total_cost_of_ownership">TCO</a> by 0.5%, that would pay for the team in perpetuity, and Twitter’s kernel team has found many such changes. In addition to <a href="https://patchwork.ozlabs.org/project/netdev/list/?submitter=211&amp;state=*&amp;archive=both&amp;param=4&amp;page=1">kernel patches</a> that sometimes have that kind of impact, people will also find configuration issues, etc., that have that kind of impact.</p> <p>So far, I've only talked about the kernel team because that's the one that most frequently elicits surprise from folks for merely existing, but I get similar reactions when people find out that Twitter has a bunch of ex-Sun JVM folks who worked on HotSpot, like Ramki Ramakrishna, Tony Printezis, and John Coomes. People wonder why a social media company would need such deep JVM expertise. As with the kernel team, companies our size that use the JVM run into weird issues and JVM bugs and it's helpful to have people with deep expertise to debug those kinds of issues. And, as with the kernel team, individual optimizations to the JVM can pay for the team in perpetuity. A concrete example is <a href="https://github.com/oracle/graal/pull/636">this patch by Flavio Brasil, which virtualizes compare and swap calls</a>.</p> <p>The context for this is that Twitter uses a lot of Scala. 
Despite a lot of claims otherwise, Scala uses more memory and is significantly slower than Java, which has a real cost if you use Scala at scale, enough that it makes sense to do optimization work to reduce the performance gap between idiomatic Scala and idiomatic Java.</p> <p>Before the patch, if you profiled our Scala code, you would've seen an unreasonably large amount of time spent in Future/Promise, including in cases where you might naively expect that the compiler would optimize the work away. One reason for this is that Futures use a <a href="https://en.wikipedia.org/wiki/Compare-and-swap">compare-and-swap</a> (CAS) operation that's opaque to JVM optimization. The patch linked above avoids CAS operations when the Future doesn't escape the scope of the method. <a href="https://github.com/twitter/util/commit/3245a8e1a98bd5eb308f366678528879d7140f5e">This companion patch</a> removes CAS operations in some places that are less amenable to compiler optimization. The two patches combined reduced the cost of typical major Twitter services using idiomatic Scala by 5% to 15%, paying for the JVM team in perpetuity many times over, and that wasn't even the biggest win Flavio found that year.</p>
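<p>To make the Future/CAS point a bit more concrete, here's a toy sketch of where the cost comes from. To be clear, this is not Twitter's <code>Future</code>/<code>Promise</code> implementation and it isn't what either of the linked patches does; it's just an illustration of the general idea that completing a promise that might be shared across threads requires an atomic compare-and-swap, while a promise that provably never escapes a method doesn't need one at all, which is the kind of transformation the patches get the runtime and library to do automatically:</p> <pre><code class="language-scala">import java.util.concurrent.atomic.AtomicReference

// Toy stand-in for a promise, used only to show where the CAS shows up.
// This is NOT Twitter's util.Future or either of the patches discussed above.
final class ToyPromise[A] {
  private val state = new AtomicReference[Option[A]](None)

  // Shared path: another thread might be racing to complete the same promise,
  // so completion has to go through an atomic compare-and-swap.
  def tryComplete(value: A): Boolean = state.compareAndSet(None, Some(value))

  def value: Option[A] = state.get()
}

object CasSketch extends App {
  // Escaping case: the promise is visible to other code, so the CAS is necessary.
  val shared = new ToyPromise[Int]
  shared.tryComplete(42)

  // Non-escaping case: if the promise never leaves this method, the same result
  // can be computed with a plain local value and no atomic operation at all.
  def localOnly(): Int = {
    val result = 42 // stands in for a promise that never escapes this scope
    result
  }

  println(s"shared: ${shared.value}, local: ${localOnly()}")
}
</code></pre> <p>The hard part, and the reason this is an argument for in-house JVM expertise rather than a weekend project, is doing the &quot;provably never escapes&quot; analysis safely and automatically instead of hand-rewriting call sites the way the toy example does.</p>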
<p>I'm not going to do a team-by-team breakdown of teams that pay for themselves many times over because there are so many of them, even if I limit the scope to &quot;teams that people are surprised that Twitter has&quot;.</p> <p>A related topic is how people talk about &quot;buy vs. build&quot; discussions. I've seen a number of discussions where someone has argued for &quot;buy&quot; because that would obviate the need for expertise in the area. This can be true, but I've seen this argued for much more often than it is true. An example where I think this tends to be untrue is with distributed tracing. <a href="tracing-analytics/">We've previously looked at some ways Twitter gets value out of tracing</a>, which came out of the vision Rebecca Isaacs put into place. On the flip side, when I talk to people at peer companies with similar scale, most of them have not (yet?) succeeded at getting significant value from distributed tracing. This is so common that I see a viral Twitter thread about how useless distributed tracing is more than once a year. Even though we went with the more expensive &quot;build&quot; option, just off the top of my head, I can think of multiple uses of tracing that have returned between 10x and 100x the cost of building out tracing, whereas people at a number of companies that have chosen the cheaper &quot;buy&quot; option commonly complain that tracing isn't worth it.</p> <p>Coincidentally, I was just talking about this exact topic to Pam Wolf, a civil engineering professor with experience in (civil engineering) industry on multiple continents, who had a related opinion. For large scale systems (projects), you need an in-house expert (owner's-side engineer) for each area that you don't handle in your own firm. While it's technically possible to hire yet another firm to be the expert, that's more expensive than developing or hiring in-house expertise and, in the long run, also more risky. That's pretty analogous to my experience working as an electrical engineer as well, where orgs that outsource functions to other companies without retaining an in-house expert pay a very high cost, and not just monetarily. They often ship sub-par designs with long delays on top of having high costs. &quot;Buying&quot; can and often does reduce the amount of expertise necessary, but it often doesn't remove the need for expertise.</p> <p>This is related to another abstract argument that's commonly made, that companies should concentrate on &quot;their area of comparative advantage&quot; or &quot;most important problems&quot; or &quot;core business need&quot; and outsource everything else. We've already seen a couple of examples where this isn't true because, at a large enough scale, it's more profitable to have in-house expertise than not regardless of whether or not something is core to the business (one could argue that all of the things that are moved in-house are core to the business, but that would make the concept of coreness useless). Another reason this abstract advice is too simplistic is that businesses can somewhat arbitrarily choose what their comparative advantage is. A large<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">1</a></sup> example of this would be Apple bringing CPU design in-house. Since acquiring PA Semi (formerly the team from SiByte and, before that, a team from DEC) for $278M, Apple has been producing <a href="https://twitter.com/danluu/status/1433297089866383364">the best chips in the phone and laptop power envelope by a pretty large margin</a>. But, before the purchase, there was nothing about Apple that made the purchase inevitable, that made CPU design an inherent comparative advantage of Apple. But if a firm can pick an area and make it an area of comparative advantage, saying that the firm should choose to concentrate on its comparative advantage(s) isn't very helpful advice.</p> <p>$278M is a lot of money in absolute terms, but as a fraction of Apple's resources, that was tiny, and much smaller companies also have the capability to do cutting edge work by devoting a small fraction of their resources to it, e.g., Twitter, for a cost that any $100M company could afford, <a href="https://twitter.com/danluu/status/1381687511362138113">created novel cache algorithms and data structures</a> and is doing other cutting edge cache work. Having great <a href="cache-incidents/">cache infra</a> isn't any more core to Twitter's business than creating a great CPU is to Apple's, but it is a lever that Twitter can use to make more money than it could otherwise.</p> <p>For small companies, it doesn't make sense to have in-house experts for everything the company touches, but companies don't have to get all that large before it starts making sense to have in-house expertise in their operating system, language runtime, and other components that people often think of as being fairly specialized. Looking back at Twitter's history, Yao Yue has noted that when she was working on cache in Twitter's early days (when we had ~100 engineers), she would regularly go to the kernel team for help debugging production incidents and that, in some cases, debugging could've easily taken 10x longer without help from the kernel team. 
Social media companies tend to have relatively high scale on a per-user and per-dollar basis, so not every company is going to need the same kind of expertise when they have 100 engineers, but there are going to be other areas that aren't obviously core business needs where expertise will pay off even for a startup of that size.</p> <p>Thanks to Ben Kuhn, Yao Yue, Pam Wolf, John Hergenroeder, Julien Kirch, Tom Brearley, and Kevin Burke for comments/corrections/discussion.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:L"><p>Some other large examples of this are Korean chaebols, like Hyundai. Looking at how Hyundai Group's companies are connected to Hyundai Motor Company isn't really the right lens with which to examine Hyundai, but I'm going to use that lens anyway since most readers of this blog are probably already familiar with Hyundai Motor and will not be familiar with how Korean chaebols operate.</p> <p>Speaking very roughly, with many exceptions, American companies have tended to take the advice to specialize and concentrate on their competencies, at least since the 80s. This is the opposite of the direction that Korean chaebols have gone. Hyundai not only makes cars, they make the steel their cars use, the robots they use to automate production, the cement used for their factories, the construction equipment used to build their factories, the containers and ships used to ship cars (which they also operate), the transmissions for their cars, etc.</p> <p>If we look at a particular component, say, their 8-speed transmission vs. the widely used and lauded ZF 8HP transmission, reviewers typically slightly prefer the ZF transmission. But even so, having good-enough in-house transmissions, as well as many other in-house components that companies would typically buy, doesn't exactly seem to be a disadvantage for Hyundai.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> </ol> </div> Measurement, benchmarking, and data analysis are underrated why-benchmark/ Fri, 27 Aug 2021 00:00:00 +0000 why-benchmark/ <p>A question I get asked with some frequency is: why bother measuring X, why not build something instead? More bluntly, in a recent conversation with a newsletter author, his comment on some future measurement projects I wanted to do (in the same vein as other projects like <a href="keyboard-v-mouse/">keyboard vs. mouse</a>, <a href="keyboard-latency/">keyboard</a>, <a href="term-latency/">terminal</a> and <a href="input-lag/">end-to-end</a> latency measurements), delivered with a smug look and a bit of contempt in the tone, was &quot;so you just want to get to the top of Hacker News?&quot;</p> <p>The implication for the former is that measuring is less valuable than building and for the latter that measuring isn't valuable at all (perhaps other than for fame), but I don't see measuring as lesser, let alone worthless. If anything, because measurement is, <a href="https://twitter.com/danluu/status/1082321431109795840">like writing</a>, not generally valued, it's much easier to find high ROI measurement projects than high ROI building projects.</p> <p>Let's start by looking at a few examples of high impact measurement projects. My go-to example for this is Kyle Kingsbury's work with <a href="https://jepsen.io">Jepsen</a>. Before Jepsen, a handful of huge companies (the now $1T+ companies that people are calling &quot;hyperscalers&quot;) had decently tested distributed systems. 
They mostly didn't talk about testing methods in a way that really caused the knowledge to spread to the broader industry. Outside of those companies, most distributed systems were, <a href="testing/">by my standards</a>, not particularly well tested.</p> <p>At the time, a common pattern in online discussions of distributed correctness was:</p> <p><strong>Person A</strong>: Database X corrupted my data.<br> <strong>Person B</strong>: It works for me. It's never corrupted my data.<br> <strong>A</strong>: How do you know? Do you ever check for data corruption?<br> <strong>B</strong>: What do you mean? I'd know if we had data corruption (alternate answer: <a href="https://mobile.twitter.com/danluu/status/918845240240410624">sure, we sometimes have data corruption, but it's probably a hardware problem and therefore not our fault</a>)</p> <p>Kyle's early work found critical flaws in nearly everything he tested, despite Jepsen being much less sophisticated then than it is now:</p> <ul> <li><a href="https://aphyr.com/posts/283-call-me-maybe-redis">Redis Cluster / Redis Sentinel</a>: &quot;we demonstrate Redis losing 56% of writes during a partition&quot;</li> <li><a href="https://aphyr.com/posts/284-call-me-maybe-mongodb">MongoDB</a>: &quot;In this post, we’ll see MongoDB drop a phenomenal amount of data&quot;</li> <li><a href="https://aphyr.com/posts/285-call-me-maybe-riak">Riak</a>: &quot;we’ll see how last-write-wins in Riak can lead to unbounded data loss&quot;</li> <li><a href="https://aphyr.com/posts/292-call-me-maybe-nuodb">NuoDB</a>: &quot;If you are considering using NuoDB, be advised that the project’s marketing and documentation may exceed its present capabilities&quot;</li> <li><a href="https://aphyr.com/posts/291-call-me-maybe-zookeeper">Zookeeper</a>: the one early Jepsen test of a distributed system that didn't find a catastrophic bug</li> <li><a href="https://aphyr.com/posts/315-call-me-maybe-rabbitmq">RabbitMQ clustering</a>: &quot;RabbitMQ lost ~35% of acknowledged writes ... This is not a theoretical problem. I know of at least two RabbitMQ deployments which have hit this in production.&quot;</li> <li><a href="https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul">etcd &amp; Consul</a>: &quot;etcd’s registers are not linearizable . . . 'consistent' reads in Consul return the local state of any node that considers itself a leader, allowing stale reads.&quot;</li> <li><a href="https://aphyr.com/posts/317-call-me-maybe-elasticsearch">ElasticSearch</a>: &quot;the health endpoint will lie. It’s happy to report a green cluster during split-brain scenarios . . . 645 out of 1961 writes acknowledged then lost.&quot;</li> </ul> <p>Many of these problems had existed for quite a while:</p> <blockquote> <p>What’s really surprising about this problem is that it’s gone unaddressed for so long. The original issue was reported in July 2012; almost two full years ago. There’s no discussion on the website, nothing in the documentation, and users going through Elasticsearch training have told me these problems weren’t mentioned in their classes.</p> </blockquote> <p>Kyle then quotes a number of users who ran into issues in production and then dryly notes:</p> <blockquote> <p>Some people actually advocate using Elasticsearch as a primary data store; I think this is somewhat less than advisable at present</p> </blockquote> <p>Although we don't have an A/B test of universes where Kyle exists vs. 
not and can't say how long it would've taken for distributed systems to get serious about correctness in a universe where Kyle didn't exist, from having spent many years looking at how developers treat correctness bugs, I would bet on distributed systems having rampant correctness problems until someone like Kyle came along. The typical response that I've seen when a catastrophic bug is reported is that the project maintainers will assume that the bug report is incorrect (and you can see many examples of this if you look at responses from the first few years of Kyle's work). When the reporter doesn't have a repro for the bug, which is quite common when it comes to distributed systems, the bug will be written off as non-existent.</p> <p>When the reporter does have a repro, the next line of defense is to argue that the behavior is fine (you can also see many examples of this by looking at responses to Kyle's work). Once the bug is acknowledged as real, the next defense is to argue that the bug doesn't need to be fixed because it's so uncommon (e.g., &quot;<a href="https://news.ycombinator.com/item?id=5913610">It can be tempting to stand on an ivory tower and proclaim theory, but what is the real world cost/benefit? Are you building a NASA Shuttle Crawler-transporter to get groceries?</a>&quot;). And then, after it's acknowledged that the bug should be fixed, the final line of defense is to argue that the project takes correctness very seriously and there's really nothing more that could have been done; development and test methodology doesn't need to change because it was just a fluke that the bug occurred, and analogous bugs won't occur in the future even without changes in methodology.</p> <p>Kyle's work blew through these defenses and, without something like it, my opinion is that we'd still see these as the main defenses against reports of distributed systems bugs (as opposed to test methodologies that can actually produce pretty reliable systems).</p> <p>That's one particular example, but I find that it's generally true that, in areas where no one is publishing measurements/benchmarks of products, the products are generally sub-optimal, often in ways that are relatively straightforward to fix once measured. Here are a few examples:</p> <ul> <li>Keyboards: after I published <a href="keyboard-latency/">this post on keyboard latency</a>, at least one major manufacturer that advertises high-speed gaming devices actually started optimizing input device latency. At the time, so few people measured keyboard latency that I could only find one other person who'd done a measurement (I wanted to look for other measurements because my measured results seemed so high as to be implausible, and the one measurement I could find online was in the same range as my measurements). 
Now, every major manufacturer of gaming keyboards and mice has fairly low latency devices available whereas, before, companies making gaming devices were focused on buzzword optimizations that had little to no impact (like higher speed USB polling)</li> <li>Computers: after I published some other posts on <a href="input-lag/">computer</a> <a href="term-latency/">latency</a>, an engineer at a major software company that wasn't previously doing serious UI latency work told me that some engineers had started measuring and optimizing UI latency; also, the author of alacritty <a href="https://github.com/alacritty/alacritty/issues/673">filed this ticket</a> on how to reduce alacritty latency</li> <li>Vehicle headlights: Jennifer Stockburger has noted that, when Consumer Reports started testing headlights, engineers at auto manufacturers thanked CR for giving them the ammunition they needed to make headlights more effective; previously, they would lose the argument to designers who wanted nicer looking but less effective headlights since making cars safer by designing better headlights is a hard sell because there's no business case, but making cars score higher on Consumer Reports reviews allowed them to sometimes win the argument. Without third-party measurements, a business oriented car exec has no reason to listen to engineers because almost no new car buyers will do anything resembling decent testing of how well their headlights illuminate the road and even fewer buyers will test how much the headlights blind oncoming drivers, so <a href="https://mastodon.social/@danluu/111802159638869338">designers are left unchecked to create the product they think looks best regardless of effectiveness</a></li> <li>Vehicle <a href="https://en.wikipedia.org/wiki/Anti-lock_braking_system">ABS</a>: after Consumer Reports and Car and Driver found that the Tesla Model 3 had extremely long braking distances (152 ft. from 60mph and 196 ft. from 70mph), Tesla updated the algorithms used to modulate the brakes, which improved braking distances enough that Tesla went from worst in class to better than average</li> <li>Vehicle impact safety: Other than Volvo, car manufacturers generally design their cars to get the highest possible score on published crash tests; <a href="car-safety/">they'll add safety as necessary to score well on new tests when they're published, but not before</a></li> </ul> <p>Anyone could've done the projects above (while Consumer Reports buys the cars they test, some nascent car reviewers rent cars on Turo)!</p> <p>This post has explained why measuring things is valuable but, to be honest, the impetus for my measurements is curiosity. I just want to know the answer to a question. I did this long before I had a blog and I often don't write up my results even now that I have a blog. 
But even if you have no curiosity about what's actually happening when you interact with the world and you're &quot;just&quot; looking for something useful to do, the lack of measurements of almost everything means that it's easy to find high ROI measurement projects, at least in terms of impact on the world — if you want to make money, building something is probably easier to monetize.</p> <h3 id="appendix-so-you-just-want-to-get-to-the-top-of-hacker-news">Appendix: &quot;so you just want to get to the top of Hacker News?&quot;</h3> <p>When I look at posts that I enjoy reading that make it to the top of HN, like <a href="https://www.chrisfenton.com/">Chris Fenton's projects</a> or <a href="https://www.windytan.com/">Oona Raisanen's projects</a>, I think it's pretty clear that they're not motivated by HN or other fame since they were doing these interesting projects long before their blogs were a hit on HN or other social media. I don't know them, but if I had to guess why they do their projects, it's primarily because they find it fun to work on the kinds of projects they work on.</p> <p>I obviously can't say that no one works on personal projects with the primary goal of hitting the top of HN but, as a motivation, it's so inconsistent with the most obvious explanations for the personal project content I read on HN (that someone is having fun, is curious, etc.) that I find it a bit mind boggling that someone would think this is a plausible imputed motivation.</p> <h3 id="appendix-the-motivation-for-my-measurement-posts">Appendix: the motivation for my measurement posts</h3> <p><a href="https://en.wikipedia.org/wiki/The_Death_of_the_Author">There's a sense in which it doesn't really matter why I decided to write these posts</a>, but if I were reading someone else's post on this topic, I'd still be curious what got them writing, so here's what prompted me to write my measurement posts (which, for the purposes of this list, include posts where I collate data and don't do any direct measurement).</p> <ul> <li><a href="car-safety/">danluu.com/car-safety</a>: I was thinking about buying a car and wanted to know if I should expect significant differences in safety between manufacturers given that cars mostly get top marks on tests done in the U.S. 
<ul> <li>This wasn't included in the post because I thought it was too trivial to include (because the order of magnitude is obvious even without carrying out the computation), but I also computed the probability of dying in a car accident as well as the expected change in life expectancy between an old used car and a new-ish used car</li> </ul></li> <li><a href="cli-complexity/">danluu.com/cli-complexity</a>: I had this idea when I saw something by Gary Bernhardt where he showed off how to count the number of single-letter command line options that <code>ls</code> has, which made me wonder if that was a recent change or not</li> <li><a href="overwatch-gender/">danluu.com/overwatch-gender</a>: I had just seen two gigantic reddit threads debating whether or not there's a gender bias in how women are treated in online games and figured that I could get data on the matter in less time than was spent by people writing comments in those threads</li> <li><a href="input-lag/">danluu.com/input-lag</a>: I wanted to know if I could trust my feeling that modern computers that I use are much higher latency than older devices that I'd used</li> <li><a href="keyboard-latency/">danluu.com/keyboard-latency</a>: I wanted to know how much latency came from keyboards (display latency is already well tested by <a href="https://blurbusters.com">https://blurbusters.com</a>)</li> <li><a href="bad-decisions/">danluu.com/bad-decisions</a>: I saw a comment by someone in the rationality community defending bad baseball coaching decisions, saying that they're not a big deal because they only cost you maybe four games a year, and I wanted to know how big a deal bad coaching decisions really were</li> <li><a href="android-updates/">danluu.com/android-updates</a>: I was curious how many insecure Android devices are out there due to most Android phones not being updatable</li> <li><a href="filesystem-errors/">danluu.com/filesystem-errors</a>: I was curious how much filesystems had improved with respect to data corruption errors found by a 2005 paper</li> <li><a href="term-latency/">danluu.com/term-latency</a>: I felt like terminal benchmarks were all benchmarking something that's basically irrelevant to user experience (throughput) and wanted to know what it would look like if someone benchmarked something that might matter more; I also wanted to know if my feeling that iTerm2 was slow was real or my imagination</li> <li><a href="keyboard-v-mouse/">danluu.com/keyboard-v-mouse</a>: the most widely cited sources for keyboard vs. 
mousing productivity were pretty obviously bogus as well as being stated with extremely high confidence; I wanted to see if non-bogus tests would turn up the same results or different results</li> <li><a href="web-bloat/">danluu.com/web-bloat</a>: I took a road trip across the U.S., where the web was basically unusable, and wanted to quantify the unusability of the web without access to very fast internet</li> <li><a href="bimodal-compensation/">danluu.com/bimodal-compensation</a>: I was curious if we were seeing a hollowing out of mid-tier jobs in programming like we saw with law jobs</li> <li><a href="yegge-predictions/">danluu.com/yegge-predictions</a>: I had the impression that Steve Yegge made unusually good predictions about the future of tech and wanted to see if my impression was correct</li> <li><a href="postmortem-lessons/">danluu.com/postmortem-lessons</a>: I wanted to see what data was out there on postmortem causes to see if I could change how I operate and become more effective</li> <li><a href="boring-languages/">danluu.com/boring-languages</a>: I was curious how much of the software I use was written in boring, old, languages</li> <li><a href="blog-ads/">danluu.com/blog-ads</a>: I was curious how much money I could make if I wanted to monetize the blog</li> <li><a href="everything-is-broken/">danluu.com/everything-is-broken</a>: I wanted to see if my impression of how many bugs I run into was correct. Many people told me that the idea that people run into a lot of software bugs on a regular basis was an illusion caused by selective memory and I wanted to know if that was the case for me or not</li> <li><a href="integer-overflow/">danluu.com/integer-overflow</a>: I had a discussion with a language designer who was convinced that integer overflow checking was too expensive to do for an obviously bogus reason (because it's expensive if you do a benchmark that's 100% integer operations) and I wanted to see if my quick mental-math estimate of overhead was the right order of magnitude</li> <li><a href="octopress-speedup/">danluu.com/octopress-speedup</a>: after watching a talk by Dan Espeset, I wanted to know if there were easy optimizations I could do to my then-Octopress site</li> <li><a href="broken-builds/">danluu.com/broken-builds</a>: I had a series of discussions with someone who claimed that their project had very good build uptime despite it being broken regularly; I wanted to know if their claim was correct with respect to other, similar, projects</li> <li><a href="empirical-pl/">danluu.com/empirical-pl</a>: I wanted to know what studies backed up claims from people who said that there was solid empirical proof of the superiority of &quot;fancy&quot; type systems</li> <li><a href="2choices-eviction/">danluu.com/2choices-eviction</a>: I was curious what would happen if &quot;two random choices&quot; was applied to cache eviction</li> <li><a href="gender-gap/">danluu.com/gender-gap</a>: I wanted to verify the claims in an article that claimed that there is no gender gap in tech salaries</li> <li><a href="3c-conflict/">danluu.com/3c-conflict</a>: I wanted to create a simple example illustrating the impact of alignment on memory latency</li> </ul> <p>BTW, writing up this list made me realize that a narrative I had in my head about how and when I started really looking at data seriously must be wrong. 
I thought that this was something that came out of my current job, but that clearly cannot be the case since a decent fraction of my posts from before my current job are about looking at data and/or measuring things (and I didn't even list some of the data-driven posts where I just read some papers and look at what data they present). After seeing the list above, I realized that I did projects like the above not only long before I had the job, but long before I had this blog.</p> <h3 id="appendix-why-you-can-t-trust-some-reviews">Appendix: why you can't trust some reviews</h3> <p>One thing that both increases and decreases the impact of doing good measurements is that most measurements that are published aren't very good. This increases the personal value of understanding how to do good measurements and of doing good measurements, but it blunts the impact on other people, since people generally don't understand what makes measurements invalid and don't have a good algorithm for deciding which measurements to trust.</p> <p>There are a variety of reasons that published measurements/reviews are often problematic. A major issue with reviews is that, in some industries, reviewers are highly dependent on manufacturers for review copies.</p> <p>Car reviews are one of the most extreme examples of this. Consumer Reports is the only major reviewer that independently sources their cars, which often causes them to disagree with other reviewers since they'll try to buy the trim level of the car that most people buy, which is often quite different from the trim level reviewers are given by manufacturers. Consumer Reports also generally manages to avoid reviewing cars that are unrepresentatively picked or tuned. There have been a couple of cases where Consumer Reports reviewers (who also buy the cars) suspected that the dealer had realized they worked for Consumer Reports and were then told that the dealer needed to keep the car overnight before handing over the car they'd just bought; when that's happened, the reviewer has walked away from the purchase.</p> <p>There's pretty significant copy-to-copy variation between cars and the cars reviewers get tend to be ones that were picked to avoid cosmetic issues (paint problems, panel gaps, etc.) as well as checked for more serious issues. Additionally, cars can have their software and firmware tweaked (e.g., it's common knowledge that review copies of BMWs have an engine &quot;tune&quot; that would void your warranty if you modified your car similarly).</p> <p>Also, because Consumer Reports isn't getting review copies from manufacturers, they don't have to pull their punches and can write reviews that are highly negative, something you rarely see from car magazines and don't often see from car youtubers, where you generally have to read between the lines to get an honest review since a review that explicitly mentions negative things about a car can mean losing access (the youtuber who goes by &quot;savagegeese&quot; has mentioned having trouble getting access to cars from some companies after giving honest reviews).</p> <p>Camera lenses are another area where it's been documented that reviewers get unusually good copies of the item. There's tremendous copy-to-copy variation between lenses so vendors pick out good copies and let reviewers borrow those. 
In many cases (e.g., any of the FE mount ZA Zeiss lenses or the Zeiss lens on the RX-1), based on how many copies of a lens people need to try and return to get a good copy, it appears that the median copy of the lens has noticeable manufacturing defects and that, in expectation, perhaps one in ten lenses has no obvious defect (this could also occur if only a few copies were bad and those were serially returned, but very few photographers really check to see if their lens has issues due to manufacturing variation). Because it's so expensive to obtain a large number of lenses, the amount of copy-to-copy variation was unquantified until <a href="https://www.lensrentals.com/">lensrentals</a> started measuring it; they've found that different manufacturers can have very different levels of copy-to-copy variation, which I hope will apply pressure to lens makers that are currently selling a lot of bad lenses while selecting good ones to hand to reviewers.</p> <p>Hard drives are yet another area where it's been documented that reviewers get copies of the item that aren't representative. Extreme Tech has reported, multiple times, that Adata, Crucial, and Western Digital have handed out review copies of SSDs that are not what you get as a consumer. One thing I find interesting about that case is that Extreme Tech says</p> <blockquote> <p>Agreeing to review a manufacturer’s product is an extension of trust on all sides. The manufacturer providing the sample is trusting that the review will be of good quality, thorough, and objective. The reviewer is trusting the manufacturer to provide a sample that accurately reflects the performance, power consumption, and overall design of the final product. When readers arrive to read a review, they are trusting that the reviewer in question has actually tested the hardware and that any benchmarks published were fairly run.</p> </blockquote> <p>This makes it sound like the reviewer's job is to take a sample handed to them by the vendor on trust and then run good benchmarks, absolving the reviewer of the responsibility of obtaining representative devices and ensuring that they're representative. I'm reminded of the SRE motto, &quot;hope is not a strategy&quot;. Trusting vendors is not a strategy. We know that vendors will lie and cheat to look better at benchmarks. Saying that it's a vendor's fault for lying or cheating can shift the blame, but it won't result in reviews being accurate or useful to consumers.</p> <p>We've only discussed a few specific areas where there's published evidence that reviews cannot be trusted because they're compromised by companies, but this isn't anything specific to those industries. As consumers, we should expect that any review that isn't performed by a trusted, independent agency that purchases its own review copies has been compromised and is not representative of the median consumer experience.</p> <p>Another issue with reviews is that most online reviews that are highly ranked in search are really just SEO affiliate farms.</p> <p>A more general issue is that reviews are also affected by the exact same problem as items that are not reviewed: people generally can't tell which reviews are actually good and which are not, so review sites are selected on things other than the quality of the review. A prime example of this is Wirecutter, which is so popular among tech folks that noting that so many tech apartments in SF have identical Wirecutter recommended items is a tired joke. 
For people who haven't lived in SF, you can get a peek into the mindset by reading the comments <a href="https://www.reddit.com/r/fatFIRE/comments/iioq01/impossible_to_avoid_lifestyle_inflation_with/">on this post about how it's &quot;impossible&quot; to not buy the wirecutter recommendation for anything</a>, which is full of comments from people who reassure the poster that, due to the high value of the poster's time, it would be irresponsible to do anything else.</p> <p>The thing I find funny about this is that if you take benchmarking seriously (in any field) and just read the methodology for the median Wirecutter review, without even trying out the items reviewed you can see that the methodology is poor and that they'll generally select items that are mediocre and sometimes even worst in class. A thorough exploration of this really deserves its own post, but I'll cite one example of poorly reviewed items here: in <a href="https://benkuhn.net/vc">https://benkuhn.net/vc</a>, Ben Kuhn looked into how to create a nice video call experience, which included trying out a variety of microphones and webcams. Naturally, Ben tried Wirecutter's recommended microphone and webcam. The webcam was quite poor, no better than using the camera from an ancient 2014 iMac or his 2020 Macbook (and, to my eye, actually much worse; more on this later). And the microphone was roughly comparable to using the built-in microphone on his laptop.</p> <p>I have a lot of experience with Wirecutter's recommended webcam because so many people have it and it is shockingly bad in a distinctive way. Ben noted that, if you look at a still image, the white balance is terrible when used in the house he was in, and if you talk to other people who've used the camera, that is a common problem. But the issue I find to be worse is that, if you look at the video, under many conditions (and I think most, given how often I see this), the webcam will refocus regularly, making the entire video flash out of and then back into focus (another issue is that it often focuses on the wrong thing, but that's less common and I don't see that one with everybody who I talk to who uses Wirecutter's recommended webcam). I actually just had a call yesterday with a friend of mine who was using a different setup than the one I'd normally seen him with (the mediocre but perfectly acceptable Macbook webcam). His video was going in and out of focus every 10-30 seconds, so I asked him if he was using Wirecutter's recommended webcam and of course he was; what other webcam that someone in tech would buy has that problem?</p> <p>This level of review quality is pretty typical for Wirecutter reviews and Wirecutter appears to generally be the most respected and widely used review site among people in tech.</p> <h3 id="appendix-capitalism">Appendix: capitalism</h3> <p>When I was in high school, there was a clique of proto-edgelords who did things like read The Bell Curve and argue its talking points to anyone who would listen.</p> <p>One of their favorite topics was how the free market would naturally cause companies that make good products to rise to the top and companies that make poor products to disappear, resulting in things generally being safe, a good value, and so on and so forth. I still commonly see this opinion espoused by people working in tech, including people who fill their condos with Wirecutter recommended items. 
I find the juxtaposition of people arguing that the market will generally result in products being good while they themselves buy overpriced garbage to be deliciously ironic. To be fair, it's not all overpriced garbage. Some of it is overpriced mediocrity and some of it is actually good; it's just that it's not too different from what you'd get if you just naively bought random stuff off of Amazon without reading third-party reviews.</p> <p>For a related discussion, <a href="tech-discrimination/">see this post on people who argue that markets eliminate discrimination even as they discriminate</a>.</p> <h3 id="appendix-other-examples-of-the-impact-of-measurement-or-lack-thereof">Appendix: other examples of the impact of measurement (or lack thereof)</h3> <ul> <li>Electronic stability control <ul> <li>Toyota RAV4: <a href="https://www.youtube.com/watch?v=j3qrCNR4U9A">before</a> <a href="https://www.youtube.com/watch?v=VtQ24W_lamY">reviews</a> <a href="https://www.youtube.com/watch?v=xSRCJFCmvTk">and after reviews</a></li> <li>Toyota Hilux <a href="https://www.youtube.com/watch?v=xoHbn8-ROiQ">before reviews</a> <a href="https://www.youtube.com/watch?v=y2QSogJj3ec"> and after reviews</a></li> <li>Nissan Rogue: major improvements after Consumer Reports found issues with stability control.</li> <li>Jeep Grand Cherokee: <a href="https://www.youtube.com/watch?v=zaYFLb8WMGM">before reviews</a> <a href="https://www.youtube.com/watch?v=_xFPdfcNmVc">and after reviews</a></li> </ul></li> <li>Some boring stuff at work: a year ago, I wrote <a href="metrics-analytics/">this</a> <a href="tracing-analytics/">pair</a> of posts on observability infrastructure at work. At the time, that work had driven 8 figures of cost savings and that's now well into the 9 figure range. This probably deserves its own post at some point, but the majority of the work was straightforward once someone could actually observe what's going on. <ul> <li>Relatedly: after seeing a few issues impact production services, I wrote a little (5k LOC) parser to parse every line seen in various host-level logs as a check to see what issues were logged that we weren't catching in our metrics. This found major issues in clusters that weren't using an automated solution to catch and remediate host-level issues; for some clusters, over 90% of hosts were actively corrupting data or had a severe performance problem. 
This led to the creation of a new team to deal with issues like this</li> </ul></li> <li>Tires <ul> <li>Almost all manufacturers other than Michelin see severely reduced wet, snow, and ice performance as the tire wears <ul> <li>Jason Fenske says that a technical reason for this (among others) is that the <a href="https://en.wikipedia.org/wiki/Siping_(rubber)">sipes</a> that improve grip are generally not cut to the full depth because doing so significantly increases manufacturing cost because the device that cuts the sipes will need to be stronger as well as wear out faster</li> <li>A non-technical reason for this is that a lot of published tire tests are done on new tires, so tire manufacturers can get nearly the same marketing benchmark value by creating only partial-depth sipes</li> </ul></li> <li>As Tire Rack has increased in prominence, some tire manufacturers have made their siping more multi-directional to improve handling while cornering instead of having siping mostly or only perpendicular to the direction of travel, which mostly only helps with acceleration and braking (Consumer Reports snow and ice scores are based on accelerating in a straight line on snow and braking in a straight line on ice, respectively, whereas Tire Rack's winter test scores emphasize all-around snow handling)</li> <li>An example of how measurement impact is bounded: Farrell Scott, the Project Category Manager for Michelin winter tires, said that, when designing the successor to the Michelin X-ICE Xi3, one of the primary design criteria was to change how the tire looked because Michelin found that, despite the X-ICE Xi3 being up there with the Bridgestone Blizzak WS80 as the best all-around winter tire (slightly better at some things, slightly worse at others), potential customers often chose other tires because those tires looked more like the popular conception of a winter tire, with &quot;aggressive&quot; looking tread blocks (this is one thing the famous Nokian Hakkapeliitta tire line was much better at). They also changed the name; instead of incrementing the number, the new tire was called Michelin X-ICE SNOW, to emphasize that the tire is suitable for snow as well as ice.</li> <li>Although some consumers do read reviews, many (and probably most) don't!</li> </ul></li> <li>HDMI to USB converters for live video <ul> <li>If you read the docs for the <a href="https://amzn.to/3gBdiRE">Camlink 4k</a>, they note that the device should use bulk transfers on Windows and isochronous transfers on Mac (if you use their software, it will automatically make this adjustment) <ul> <li>Fabian Giesen informed me that this may be for the same reason that, when some colleagues of his tested a particular USB3 device on Windows, only 1 out of 5 chipsets tested supported isochronous properly (the rest would do things like bluescreen or hang the machine)</li> </ul></li> <li>I've tried miscellaneous cheap HDMI to USB converters as alternatives to the Camlink 4k, and I have yet to find a cheap one that generally works across a wide variety of computers. They will generally work with at least one computer I have access to with at least one piece of software I want to use, but will simply not work or provide very distorted video in some cases. Perhaps someone should publish benchmarks on HDMI to USB converter quality!</li> </ul></li> <li>HDMI to VGA converters <ul> <li>Many of these get very hot and then overheat and stop working in 15 minutes to 2 hours. Some aren't even warm to the touch. 
Good luck figuring out which ones work!</li> </ul></li> <li>Water filtration <ul> <li>Brita claims that their &quot;longlast&quot; filters remove lead. However, two different Amazon reviewers indicated that they measured lead levels in contaminated water before and after and found that lead levels weren't reduced</li> <li>It used to be the case that water flowed very slowly through &quot;longlast&quot; filters and this was a common complaint of users who bought the filters. Now some (or perhaps all) &quot;longlast&quot; filters filter water much more quickly but don't filter to Brita's claimed levels of filtration</li> </ul></li> <li>Sports refereeing <ul> <li><a href="https://twitter.com/umpireauditor/status/1515648000076312579">Baseball umpires are famously bad at making correct calls and we've had the technology to make nearly flawless calls for decades</a>, but many people argue that having humans make incorrect calls is &quot;part of the game&quot; and the game wouldn't be as real if computers were in the loop on calls</li> <li>Some sports have partially bowed to pressure to make correct rulings when possible, e.g., in football, NFL coaches started being allowed to challenge two calls per game based on video footage in 1999 (3 starting in 2004, if the first two challenges were successful), copying the system that the niche USFL created in 1985</li> </ul></li> <li>Storage containers <ul> <li>Rubbermaid storage containers (Roughneck &amp; Toughneck) used to be famous for their quality and durability. Of course, it was worth more in the short term to cut back on the materials used and strength of the containers, so another firm bought the brand and continues to use it, producing similar looking containers that are famous for buckling if you stack containers on top of each other, which is the entire point of the nestable / stackable containers. I haven't seen anyone really benchmark storage containers seriously for how well they handle load so, in general, you can't really tell if this is going to happen to you or not.</li> </ul></li> <li>Speaker vibration isolation solutions <ul> <li><a href="https://ethanwiner.com/speaker_isolation.htm">Ethan Winer concludes that these are audiophile placebo</a> <br /></li> </ul></li> </ul> <p><small>Thanks to Fabian Giesen, Ben Kuhn, Yuri Vishnevsky, @chordowl, Seth Newman, Justin Blank, Per Vognsen, John Hergenroeder, Pam Wolf, Ivan Echevarria, and Jamie Brandon for comments/corrections/discussion.</small></p> Against essential and accidental complexity essential-complexity/ Tue, 29 Dec 2020 00:00:00 +0000 essential-complexity/ <p>In the classic 1986 essay, <a href="http://worrydream.com/refs/Brooks-NoSilverBullet.pdf">No Silver Bullet</a>, Fred Brooks argued that there is, in some sense, not that much that can be done to improve programmer productivity. His line of reasoning is that programming tasks contain a core of essential/conceptual<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">1</a></sup> complexity that's fundamentally not amenable to attack by any potential advances in technology (such as languages or tooling). 
He then uses an <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's law</a> argument, saying that because 1/X of complexity is essential, it's impossible to ever get more than a factor of X improvement via technological improvements (e.g., if half of all complexity were essential, then even eliminating every bit of accidental complexity would yield at most a 2x speedup).</p> <p>Towards the end of the essay, Brooks claims that at least 1/2 (most) of complexity in programming is essential, bounding the potential improvement remaining for all technological programming innovations combined to, at most, a factor of 2<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">2</a></sup>:</p> <blockquote> <p>All of the technological attacks on the accidents of the software process are fundamentally limited by the productivity equation:</p> <p>Time of task = Sum over i { Frequency_i Time_i }</p> <p>If, as I believe, the conceptual components of the task are now taking most of the time, then no amount of activity on the task components that are merely the expression of the concepts can give large productivity gains.</p> </blockquote> <p>Brooks states a bound on how much programmer productivity can improve. But, in practice, to state this bound correctly, one would have to be able to conceive of problems that no one would reasonably attempt to solve due to the amount of friction involved in solving the problem with current technologies.</p> <p>Without being able to predict the future, this is impossible to estimate. If we knew the future, it might turn out that there's some practical limit on how much computational power or storage programmers can productively use, bounding the resources available to a programmer, but getting a bound on the amount of accidental complexity would still require one to correctly reason about how programmers are going to be able to use zillions times more resources than are available today, which is so difficult we might as well call it impossible.</p> <p>Moreover, for each class of tool that could exist, one would have to effectively anticipate all possible innovations. Brooks' strategy for this was to look at existing categories of tools and state, for each, that they would be ineffective or that they were effective but played out. This was wrong not only because it underestimated gains from classes of tools that didn't exist yet, <a href="butler-lampson-1999/">weren't yet effective</a>, or that he wasn't familiar with (e.g., he writes off formal methods, but it doesn't even occur to him to mention fuzzers, static analysis tools that don't fully formally verify code, tools like valgrind, etc.) but also because Brooks thought that every class of tool where there was major improvement was played out and it turns out that none of them were. For example, Brooks wrote off programming languages as basically done, just before the rise of &quot;scripting languages&quot; as well as just before GC languages took over the vast majority of programming<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">3</a></sup>. Although you will occasionally hear statements like this, not many people will volunteer to write a webapp in C on the theory that a modern language can't offer more than a 2x gain over C.</p> <p>Another one Brooks writes off is AI, saying &quot;The techniques used for speech recognition seem to have little in common with those used for image recognition, and both are different from those used in expert systems&quot;. But, of course, this is no longer true — neural nets are highly effective for both image recognition and speech recognition. 
Whether or not they'll be highly effective as a programming tool is to be determined, but a lynchpin of Brooks's argument against AI has been invalidated and it's not a stretch to think that a greatly improved GPT-2 could give significant productivity gains to programmers. Of course, it's not reasonable to expect that Brooks could've foreseen neural nets becoming effective for both speech and image recognition, but that's exactly what makes it unreasonable for Brooks to write off all future advances in AI as well as every other field of computer science.</p> <p>Brooks also underestimates gains from practices and tooling that enables practices. Just for example, looking at what old school programming gurus advocated, we have <a href="https://twitter.com/danluu/status/885214004649615360">Ken Thompson arguing that language safety is useless</a> and that bugs happen because people write fragile code, which they should not do if they don't want to have bugs, and <a href="https://mastodon.social/@danluu/110213164472670607">Jamie Zawinski arguing that, when on a tight deadline, automated testing is a waste of time and &quot;there’s a lot to be said for just getting it right the first time&quot; without testing</a>. Brooks acknowledges the importance of testing, but the only possible improvement to testing that he mentions is expert systems that could make testing easier for beginners. If you look at the complexity of moderately large scale modern software projects, they're well beyond any software project that had been seen in the 80s. If you really think about what it would mean to approach these projects using old school correctness practices, I think the speedup from those sorts of practices to modern practices is infinite for a typical team since most teams using those practices would fail to produce a working product at all if presented with a problem that many big companies have independently solved, e.g., produce a distributed database with some stated SLO. Someone could dispute the infinite speedup claim, but anyone who's worked on a complex project that's serious about correctness will have used <a href="testing/">tools and techniques that result in massive development speedups</a>, easily more than 2x compared to 80s practices, a possibility that didn't seem to occur to Brooks as it appears that Brooks thought that serious testing improvements were not possible due to the essential complexity involved in testing.</p> <p>Another basic tooling/practice example would be version control. A version control system that supports multi-file commits, branches, automatic merging that generally works as long as devs don't touch the same lines, etc., is a fairly modern invention. <a href="microsoft-culture/">During the 90s, Microsoft was at the cutting edge of software development and they didn't manage to get a version control system that supported the repo size they needed (30M LOC for Win2k development) and supported branches until after Win2k</a>. Branches were simulated by simply copying the entire source tree and then manually attempting to merge copies of the source tree. Special approval was required to change the source tree and, due to the pain of manual merging, the entire Win2k team (5000 people, including 1400 devs and 1700 testers) could only merge 100 changes per day on a good day (0 on a bad day when the build team got stalled due to time spent fixing build breaks). 
This was a decade after Brooks was writing and there was still easily an order of magnitude speedup available from better version control tooling, test tooling and practices, machine speedups allowing faster testing, etc. Note that, in addition to not realizing that version control and test tooling would later result in massive productivity gains, Brooks claimed that hardware speedups wouldn't make developers significantly more productive even though hardware speed was noted to be a major limiting factor in Win2k development velocity. Brooks couldn't conceive of anyone building a project as complex as Win2k, which could really utilize faster hardware. Of course, using the tools and practices of Brooks's time, it was practically impossible to build a project as complex as Win2k, but tools and practices advanced so quickly that it was possible only a decade later even if development velocity moved in slow motion compared to what we're used to today due to &quot;stone age&quot; tools and practices.</p> <p>To pick another sub-part of the above, Brooks didn't list CI/CD as a potential productivity improvement because Brooks couldn't even imagine ever having tools that could possibly enable modern build practices. Writing in 1995, Brooks mentions that someone from Microsoft told him that they build nightly. To that, Brooks says that it may be too much work to enable building (at least) once a day, noting that Bell Northern Research, quite reasonably, builds weekly. Shortly after Brooks wrote that, Google was founded and engineers at Google couldn't even imagine settling for a setup like Microsoft had, let alone building once a week. They had to build a lot of custom software to get a monorepo of Google's scale onto what would be considered modern practices today, but they were able to do it. A startup that I worked for that was founded in 1995 also built out its own CI infra that allowed for constant merging and building from HEAD because that's what anyone who was looking at what could be done, instead of assuming that everything that could be done had already been done, would do. For large projects, just having CI/CD and maintaining a clean build, compared to building weekly, should easily be a 2x productivity improvement, which is already as large as Brooks's claim that half of complexity is essential would allow for all improvements combined. It's good that engineers at Google, the startup I worked for, as well as many other places didn't believe that a 2x improvement was impossible and actually built tools that enabled massive productivity improvements.</p> <p>In some sense, looking at No Silver Bullet is quite similar to when <a href="cli-complexity/#maven">we looked at Unix and found the Unix mavens saying that we should write software like they did in the 70s</a> and that <a href="https://twitter.com/danluu/status/885214004649615360">the languages they invented are as safe as any language can be</a>. Long before computers were invented, elders have been telling the next generation that they've done everything that there is to be done and that the next generation won't be able to achieve more. In the computer age, we've seen countless similar predictions outside of programming as well, such as Cliff Stoll's now-infamous prediction that the internet wouldn't change anything:</p> <blockquote> <p>Visionaries see a future of telecommuting workers, interactive libraries and multimedia classrooms. They speak of electronic town meetings and virtual communities. 
Commerce and business will shift from offices and malls to networks and modems. And the freedom of digital networks will make government more democratic.</p> <p>Baloney. Do our computer pundits lack all common sense? The truth is no online database will replace your daily newspaper ... How about electronic publishing? Try reading a book on disc. At best, it's an unpleasant chore: the myopic glow of a clunky computer replaces the friendly pages of a book. And you can't tote that laptop to the beach. Yet Nicholas Negroponte, director of the MIT Media Lab, predicts that we'll soon buy books and newspapers straight over the Internet. Uh, sure. ... Then there's cyberbusiness. We're promised instant catalog shopping—just point and click for great deals. We'll order airline tickets over the network, make restaurant reservations and negotiate sales contracts. Stores will become obsolete. So how come my local mall does more business in an afternoon than the entire Internet handles in a month?</p> </blockquote> <p>If you do a little search and replace, Stoll is saying the same thing Brooks did. Sure, technologies changed things in the past, but I can't imagine how new technologies would change things, so they simply won't.</p> <p>Even without knowing any specifics about programming, we would be able to see that these kinds of arguments have not historically held up and have decent confidence that the elders are not, in fact, correct this time.</p> <p>Brooks kept writing about software for quite a while after he was a practitioner, but didn't bother to keep up with what was happening in industry after moving into academia in 1964, which is already obvious from the 1986 essay we looked at, but even more obvious if you look at his 2010 book, Design of Design, where he relies on <a href="https://www.patreon.com/posts/46629220">the same examples he relied on in earlier essays and books</a>, where the bulk of his new material comes from a house that he built. We've seen that <a href="cocktail-ideas/">programmers who try to generalize their knowledge to civil engineering generally make silly statements that any 2nd year civil engineering student can observe are false</a>, and it turns out that trying to glean deep insights about software engineering design techniques from house building techniques doesn't work any better, but since Brooks didn't keep up with the industry, that's what he had to offer. While there are timeless insights that transcend era and industry, Brooks has very specific suggestions, e.g., running software teams like <a href="cocktail-ideas/">cocktail party</a> surgical teams, which come from thinking about how one could improve on the development practices Brooks saw at IBM in the 50s. But it turns out the industry has moved well beyond IBM's 1950s software practices, and ideas that are improvements over what IBM did in the 1950s aren't particularly useful 70 years later.</p> <p>Going back to the main topic of this post and looking at the specifics of what he talks about with respect to accidental complexity with the benefit of hindsight, we can see that Brooks' 1986 claim that we've basically captured all the productivity gains high-level languages can provide isn't too different from an assembly language programmer saying the same thing in 1955, thinking that assembly is as good as any language can be<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">4</a></sup> and that his claims about other categories are similar. 
The main thing these claims demonstrate is a lack of imagination. When Brooks referred to conceptual complexity, he was referring to the complexity of using the conceptual building blocks that Brooks was familiar with in 1986 (on problems that Brooks would've thought of as programming problems). There's no reason anyone should think that Brooks' 1986 conception of programming is fundamental any more than they should think that how an assembly programmer from 1955 thought was fundamental. People often make fun of the apocryphal &quot;640k should be enough for anybody&quot; quote, but Brooks saying that, across all categories of potential productivity improvement, we've done most of what's possible to do, is analogous and not apocryphal!</p> <p>If we look at the future, the fraction of complexity that might be accidental is effectively unbounded. One might argue that, if we look at the present, these terms wouldn't be meaningless. But, while this will vary by domain, I've personally never worked on a non-trivial problem that isn't completely dominated by accidental complexity, making the concept of essential complexity meaningless on any problem I've worked on that's worth discussing.</p> <h3 id="appendix-concrete-problems">Appendix: concrete problems</h3> <p>Let's see how this essential complexity claim holds for a couple of things I did recently at work:</p> <ul> <li>scp from a bunch of hosts to read and download logs, and then parse the logs to understand the scope of a problem</li> <li>Query two years of metrics data from every instance of every piece of software my employer has, for some classes of software, and then generate a variety of plots that let me understand some questions I have about what our software is doing and how it's using computer resources</li> </ul> <h4 id="logs">Logs</h4> <p>If we break this task down, we have</p> <ul> <li>scp logs from a few hundred thousand machines to a local box <ul> <li>used a Python script for this to get parallelism with more robust error handling than you'd get out of pssh/parallel-scp (a rough sketch of this kind of script is below)</li> <li>~1 minute to write the script</li> </ul></li> <li>do other work while logs download</li> <li>parse downloaded logs (a few TB) <ul> <li>used a Rust script for this, a few minutes to write (used Rust instead of Python for performance reasons here — just opening the logs and scanning each line with idiomatic Python was already slower than I'd want if I didn't want to farm the task out to multiple machines)</li> </ul></li> </ul> <p>In 1986, perhaps I would have used telnet or ftp instead of scp. Modern scripting languages didn't exist yet (perl was created in 1987 and perl5, the first version that some argue is modern, was released in 1994), so writing code that would do this with parallelism and &quot;good enough&quot; error handling would have taken more than an order of magnitude more time than it takes today. In fact, I think just getting semi-decent error handling while managing a connection pool could have easily taken an order of magnitude longer than this entire task took me (not including time spent downloading logs in the background).</p> <p>Next up would be parsing the logs. It's not fair to compare an absolute number like &quot;1 TB&quot;, so let's just call this &quot;enough that we care about performance&quot; (we'll talk about scale in more detail in the metrics example). 
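</p> <p>As an aside, to make the first bullet in the breakdown above concrete, here's a minimal sketch of the kind of throwaway parallel download script I mean. This is not the script I actually used; the host list, log path, worker count, and retry policy are all made up for illustration. It's just a thread pool around <code>scp</code> with per-host retries and a report of which hosts never succeeded:</p> <pre><code>#!/usr/bin/env python3
# Hypothetical parallel log downloader: pull one log file from many hosts,
# retrying failures and reporting hosts that never succeeded.
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

HOSTS_FILE = 'hosts.txt'                 # one hostname per line (made up)
REMOTE_PATH = '/var/log/myservice.log'   # made-up log location
LOCAL_DIR = 'logs'
RETRIES = 3

def fetch(host):
    # Copy the remote log to logs/HOSTNAME.log, retrying a few times.
    dest = os.path.join(LOCAL_DIR, host + '.log')
    for _ in range(RETRIES):
        result = subprocess.run(
            ['scp', '-o', 'ConnectTimeout=10', f'{host}:{REMOTE_PATH}', dest],
            capture_output=True,
        )
        if result.returncode == 0:
            return host, True
    return host, False

def main():
    os.makedirs(LOCAL_DIR, exist_ok=True)
    hosts = [line.strip() for line in open(HOSTS_FILE) if line.strip()]
    failed = []
    # scp is network-bound rather than CPU-bound, so a couple hundred
    # threads is a reasonable amount of parallelism for this.
    with ThreadPoolExecutor(max_workers=200) as pool:
        futures = [pool.submit(fetch, host) for host in hosts]
        for future in as_completed(futures):
            host, ok = future.result()
            if not ok:
                failed.append(host)
    print(f'{len(failed)} of {len(hosts)} hosts failed', file=sys.stderr)
    for host in failed:
        print(host, file=sys.stderr)

if __name__ == '__main__':
    main()
</code></pre> <p>Nothing about this is clever, and that's the point: with a modern scripting language and its standard library, &quot;parallelism with decent error handling&quot; is a few minutes of work rather than a project.</p> <p>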
Today, we have our choice of high-performance languages where it's easy to write fast, safe code and harness the power of libraries (e.g., a regexp library<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">5</a></sup>) that make it easy to write a quick and dirty script to parse and classify logs, farming out the work to all of the cores on my computer (I think Zig would've also made this easy, but I used Rust because my team has a critical mass of Rust programmers).</p> <p>In 1986, there would have been no comparable language, but more importantly, I wouldn't have been able to trivially find, download, and compile the appropriate libraries and would've had to write all of the parsing code by hand, turning a task that took a few minutes into a task that I'd be lucky to get done in an hour. Also, today, if I didn't know how to use the library or didn't know that I could use a library, I could easily find out how I should solve the problem on StackOverflow, which would massively reduce accidental complexity. Needless to say, there was no real equivalent to Googling for StackOverflow solutions in 1986.</p> <p>Moreover, even today, this task, a pretty standard programmer devops/SRE task, after at least an order of magnitude speedup over the analogous task in 1986, is still nearly entirely accidental complexity.</p> <p>If the data were exported into our metrics stack or if our centralized logging worked a bit differently, the entire task would be trivial. And if neither of those were true, but the log format were more uniform, I wouldn't have had to write any code after getting the logs; <a href="https://github.com/BurntSushi/ripgrep">rg</a> or <a href="https://github.com/ggreer/the_silver_searcher">ag</a> would have been sufficient. If I look for how much time I spent on the essential conceptual core of the task, it's so small that it's hard to estimate.</p> <h4 id="query-metrics">Query metrics</h4> <p>We really only need one counter-example, but I think it's illustrative to look at a more complex task to see how Brooks' argument scales. 
If you'd like to skip this lengthy example, <a href="#summary">click here to skip to the next section</a>.</p> <p>We can view my metrics querying task as being made up of the following sub-tasks:</p> <ul> <li>Write a set of <a href="https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)">Presto SQL</a> queries that effectively scan on the order of 100 TB of data each, from a data set that would be on the order of 100 PB of data if I didn't <a href="metrics-analytics/">maintain tables that only contain a subset of data that's relevant</a> <ul> <li>Maybe 30 seconds to write the first query and a few minutes for queries to finish, using on the order of 1 CPU-year of CPU time</li> </ul></li> <li>Write some ggplot code to plot the various properties that I'm curious about <ul> <li>Not sure how long this took; less time than the queries took to complete, so this didn't add to the total time of this task</li> </ul></li> </ul> <p>The first of these tasks is so many orders of magnitude quicker to accomplish today that I'm not even able to hazard a guess as to how much quicker it is today, even to within one or two orders of magnitude, but let's break down the first task into component parts to get some idea about the ways in which the task has gotten easier.</p> <p>It's not fair to port absolute numbers like 100 PB into 1986, but just the idea of having a pipeline that collects and persists comprehensive data analogous to the data I was looking at for a consumer software company (various data on the resource usage and efficiency of our software) would have been considered absurd in 1986. Here we see one fatal flaw in the idea that the ratio of essential to accidental complexity provides an upper bound on productivity improvements: tasks with too much accidental complexity wouldn't have even been considered possible. The limit on how much accidental complexity Brooks sees is really a limit of his imagination, not something fundamental.</p> <p>Brooks explicitly dismisses increased computational power as something that will not improve productivity (&quot;Well, how many MIPS can one use fruitfully?&quot;, more on this later), but both storage and CPU power (not to mention network speed and RAM) were sources of accidental complexity so large that they bounded the space of problems Brooks was able to conceive of.</p> <p>In this example, let's say that we somehow had enough storage to keep the data we want to query in 1986. The next part would be to marshal on the order of 1 CPU-year's worth of resources and have the query complete in minutes. As with the storage problem, this would have also been absurd in 1986<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">6</a></sup>, so we've run into a second piece of non-essential complexity so large that it would stop a person from 1986 from thinking of this problem at all.</p> <p>Next up would be writing the query. If I were writing for the Cray-2 and wanted to be productive, I probably would have written the queries in Cray's dialect of Fortran 77. Could I do that in less than 300 seconds per query? Not a chance; I couldn't even come close with Scala/Scalding and I think it would be a near thing even with Python/PySpark. This is the aspect where I think we see the smallest gain and we're still well above one order of magnitude here.</p> <p>After we have the data processed, we have to generate the plots. Even with today's technology, I think not using ggplot would cost me at least 2x in terms of productivity. 
I've tried every major plotting library that's supposedly equivalent (in any language) and every library I've tried either has multiple show-stopping bugs rendering plots that I consider to be basic in ggplot or is so low-level that I lose more than 2x productivity by being forced to do stuff manually that would be trivial in ggplot. In 2020, the existence of a single library already saves me 2x on this one step. If we go back to 1986, before the concept of <a href="https://amzn.to/3r9Mvzw">the grammar of graphics</a> and any reasonable implementation, there's no way that I wouldn't lose at least two orders of magnitude of time on plotting even assuming some magical workstation hardware that was capable of doing the plotting operations I do in a reasonable amount of time (my machine is painfully slow at rendering the plots; a Cray-2 would not be able to do the rendering in anything resembling a reasonable timeframe).</p> <p>The number of orders of magnitude of accidental complexity reduction for this problem from 1986 to today is so large I can't even estimate it and yet this problem still contains such a large fraction of accidental complexity that it's once again difficult to even guess at what fraction of complexity is essential. To write down all of the accidental complexity I can think of would require at least 20k words, but just to provide a bit of the flavor of the complexity, let me write down a few things.</p> <ul> <li>SQL; this is one of those things that's superficially simple <a href="https://scattered-thoughts.net/writing/select-wat-from-sql/">but actually extremely complex</a> <ul> <li>Also, Presto SQL</li> </ul></li> <li>Arbitrary Presto limits, some of which are from Presto and some of which are from the specific ways we operate Presto and the version we're using <ul> <li>There's an internal Presto data structure assert fail that gets triggered when I use both <code>numeric_histogram</code> and <code>cross join unnest</code> in a particular way. Because it's a waste of time to write the bug-exposing query, wait for it to fail, and then re-write it, I have a mental heuristic I use to guess, for any query that uses both constructs, whether or not I'll hit the bug and I apply it to avoid having to write two queries. 
If the heuristic applies, I'll instead write a more verbose query that's slower to execute instead of the more straightforward query</li> <li>We partition data by date, but Presto throws this away when I join tables, resulting in very large and therefore expensive joins when I join data across a long period of time even though, in principle, this could be a series of cheap joins; if the join is large enough to cause my query to blow up, I'll write what's essentially a little query compiler to execute day-by-day queries and then post-process the data as necessary instead of writing the naive query (a rough sketch of what I mean appears below) <ul> <li>There are a bunch of cases where some kind of optimization in the query will make the query feasible without having to break the query across days (e.g., if I want to join host-level metrics data with the table that contains what cluster a host is in, that's a very slow join across years of data, but I also know what kinds of hosts are in which clusters, which, in some cases, lets me use attributes that are already in the host-level metrics data, like core count and total memory, to filter out hosts, which can make the larger input to this join small enough that the query can succeed without manually partitioning the query)</li> </ul></li> <li>We have a Presto cluster that's &quot;fast&quot; but has &quot;low&quot; memory limits and a cluster that's &quot;slow&quot; but has &quot;high&quot; memory limits, so I mentally estimate how much per-node memory a query will need so that I can schedule it to the right cluster</li> <li>etc.</li> </ul></li> <li>When, for performance reasons, I should compute the CDF or histogram in Presto vs. leaving it to the end for ggplot to compute</li> <li>How much I need to downsample the data, if at all, for ggplot to be able to handle it, and how that may impact analyses</li> <li>Arbitrary ggplot stuff <ul> <li>roughly how many points I need to put in a scatterplot before I should stop using <code>size = [number]</code> and should switch to single-pixel plotting because plotting points as circles is too slow</li> <li>what the minimum allowable opacity for points is</li> <li>If I exceed the maximum density where you can see a gradient in a scatterplot due to this limit, how large I need to make the image to reduce the density appropriately (when I would do this instead of using a heatmap deserves its own post)</li> <li>etc.</li> </ul></li> <li>All of the above is about tools that I use to write and examine queries, but there's also the mental model of all of the data issues that must be taken into account when writing the query in order to generate a valid result, which includes things like clock skew, Linux accounting bugs, issues with our metrics pipeline, issues with data due to problems in the underlying data sources, etc.</li> <li>etc.</li> </ul> <p>For each of Presto and ggplot I implicitly hold over a hundred things in my head to be able to get my queries and plots to work and I choose to use these because these are the lowest overhead tools that I know of that are available to me. If someone asked me to name the percentage of complexity I had to deal with that was essential, I'd say that it was so low that there's no way to even estimate it. For some queries, it's arguably zero — my work was necessary only because of some arbitrary quirk and there would be no work to do without the quirk. 
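</p> <p>To make the &quot;little query compiler&quot; workaround from the list above concrete, here's a minimal sketch of the idea in Python. The table names, columns, and date range are hypothetical, <code>run_query</code> is a stub for whatever Presto client you happen to use, and the real version has to do more post-processing; the point is just that working around the partitioning limitation means generating and merging many small per-day queries instead of writing the one query you'd naturally want to write:</p> <pre><code># Hypothetical day-by-day query splitter: instead of one join across the
# whole date range (which blows past per-node memory limits), issue one
# cheap query per day and combine the results afterwards.
from datetime import date, timedelta

# Made-up tables/columns; approx_percentile is a standard Presto function.
QUERY_TEMPLATE = '''
SELECT cluster, approx_percentile(cpu_util, 0.99) AS p99_cpu
FROM host_metrics
JOIN host_to_cluster USING (hostname)
WHERE ds = '{day}'
GROUP BY cluster
'''

def run_query(sql):
    # Stub: send sql to Presto with your client of choice and return rows
    # as (cluster, p99_cpu) tuples.
    raise NotImplementedError

def days_between(start, end):
    # Inclusive list of days from start to end.
    n_days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(n_days)]

def p99_cpu_by_cluster(start, end):
    # One small query per day, merged in Python; here the post-processing
    # step just keeps the worst day seen for each cluster.
    worst = {}
    for day in days_between(start, end):
        for cluster, p99 in run_query(QUERY_TEMPLATE.format(day=day.isoformat())):
            worst[cluster] = max(worst.get(cluster, 0.0), p99)
    return worst

if __name__ == '__main__':
    print(p99_cpu_by_cluster(date(2018, 1, 1), date(2019, 12, 31)))
</code></pre> <p>None of this is conceptually hard, but it's pure accidental complexity: if Presto kept the date partitioning through the join, the whole thing would collapse back into a single query.</p> <p>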
But even in cases where some kind of query seems necessary, I think it's unbelievable that essential complexity could have been more than 1% of the complexity I had to deal with.</p> <p>Revisiting Brooks on computer performance, even though I deal with complexity due to the limitations of hardware performance in 2020 and would love to have faster computers today, Brooks wrote off faster hardware as pretty much not improving developer productivity in 1986:</p> <blockquote> <p>What gains are to be expected for the software art from the certain and rapid increase in the power and memory capacity of the individual workstation? Well, how many MIPS can one use fruitfully? The composition and editing of programs and documents is fully supported by today’s speeds. Compiling could stand a boost, but a factor of 10 in machine speed would surely . . .</p> </blockquote> <p>But this is wrong on at least two levels. First, if I had access to faster computers, a huge amount of my accidental complexity would go away (if computers were powerful enough, I wouldn't need complex tools like Presto; I could just run a query on my local computer). We have much faster computers now, but it's still true that having faster computers would make many involved engineering tasks trivial. As James Hague notes, in the mid-80s, <a href="https://prog21.dadgum.com/29.html">writing a spellchecker was a serious engineering problem due to performance constraints</a>.</p> <p>Second, ggplot (just for example) only exists because computers are so fast. A common complaint from people who work on performance is that tool X has somewhere between two and ten orders of magnitude of inefficiency when you look at the fundamental operations it does vs. the speed of hardware today<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">7</a></sup>. But what fraction of programmers can realize even one half of the potential performance of a modern multi-socket machine? I would guess fewer than one in a thousand and I would say certainly fewer than one in a hundred. And performance knowledge isn't independent of other knowledge — controlling for age and experience, it's negatively correlated with knowledge of non-&quot;systems&quot; domains since time spent learning about the esoteric accidental complexity necessary to realize half of the potential of a computer is time spent not learning about &quot;directly&quot; applicable domain knowledge. When we look at software that requires a significant amount of domain knowledge (e.g., ggplot) or that's large enough that it requires a large team to implement (e.g., IntelliJ<sup class="footnote-ref" id="fnref:V"><a rel="footnote" href="#fn:V">8</a></sup>), the vast majority of it wouldn't exist if machines were orders of magnitude slower and writing usable software required wringing most of the performance out of the machine. Luckily for us, hardware has gotten much faster, allowing the vast majority of developers to ignore performance-related accidental complexity and instead focus on all of the other accidental complexity necessary to be productive today.</p> <p>Faster computers both reduce the amount of accidental complexity tool users run into as well as the amount of accidental complexity that tool creators need to deal with, allowing more productive tools to come into existence.</p> <h3 id="2022-update">2022 Update</h3> <p>A lot of people have said that this post is wrong because Brooks was obviously saying X and Brooks did not mean the things I quoted in this post. 
But people state all sorts of different Xs for what Brooks really meant, so, in aggregate, these counterarguments are self-refuting: each assumes that Brooks &quot;obviously&quot; meant one specific thing, but, if it were so obvious, people wouldn't have so many different ideas of what Brooks meant.</p> <p>This is, of course, inevitable when it comes to a Rorschach test essay like Brooks's essay, which states a wide variety of different and contradictory things.</p> <p>Thanks to Peter Bhat Harkins, Ben Kuhn, Yuri Vishnevsky, Chris Granger, Wesley Aptekar-Cassels, Sophia Wisdom, Lifan Zeng, Scott Wolchok, Martin Horenovsky, @realcmb, Kevin Burke, Aaron Brown, @up_lurk, and Saul Pwanson for comments/corrections/discussion.</p> <p><link rel="prefetch" href="cli-complexity/"> <link rel="prefetch" href="metrics-analytics/"> <link rel="prefetch" href=""> <link rel="prefetch" href="about/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:C"><blockquote> <p>The accidents I discuss in the next section. First let us consider the essence</p> <p>The essence of a software entity is a construct of interlocking concepts: data sets, relationships among data items, algorithms, and invocations of functions. This essence is abstract, in that the conceptual construct is the same under many different representations. It is nonetheless highly precise and richly detailed.</p> <p>I believe the hard part of building software to be the specification, design, and testing of this conceptual construct, not the labor of representing it and testing the fidelity of the representation. We still make syntax errors, to be sure; but they are fuzz compared to the conceptual errors in most systems.</p> </blockquote> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:T"><p>Curiously, he also claims, in the same essay, that no individual improvement can yield a 10x improvement within one decade. While this technically doesn't contradict his Amdahl's law argument plus the claim that &quot;most&quot; (i.e., at least half) of complexity is essential/conceptual, it's unclear why he would include this claim as well.</p> <p>When Brooks revisited his essay in 1995 in No Silver Bullet Refired, he claimed that he was correct by using the weakest form of the three claims he made in 1986, that within one decade, no single improvement would result in an order of magnitude improvement. However, he did then re-state the strongest form of the claim he made in 1986 and made it again in 1995, saying that this time, no set of technological improvements could improve productivity more than 2x, for real:</p> <blockquote> <p>It is my opinion, and that is all, that the accidental or representational part of the work is now down to about half or less of the total. Since this fraction is a question of fact, its value could in principle be settled by measurement. Failing that, my estimate of it can be corrected by better informed and more current estimates. Significantly, no one who has written publicly or privately has asserted that the accidental part is as large as 9/10.</p> </blockquote> <p>By the way, I find it interesting that he says that no one disputed this 9/10ths figure. Per the body of this post, I would put it at far above 9/10th for my day-to-day work and, if I were to try to solve the same problems in 1986, the fraction would have been so high that people wouldn't have even conceived of the problem. 
As a side effect of having worked in hardware for a decade, I've also done work that's not too different from what some people faced in 1986 (microcode, assembly &amp; C written for DOS) and I would put that work as easily above 9/10th as well.</p> <p>Another part of his follow-up that I find interesting is that he quotes Harel's &quot;Biting the Silver Bullet&quot; from 1992, which, among other things, argues that that decade deadline for an order of magnitude improvement is arbitrary. Brooks' response to this is</p> <blockquote> <p>There are other reasons for the decade limit: the claims made for candidate bullets all have had a certain immediacy about them . . . We will surely make substantial progress over the next 40 years; an order of magnitude over 40 years is hardly magical.</p> </blockquote> <p>But by Brooks' own words when he revisits the argument in 1995, if 9/10th of complexity is essential, it would be impossible to get more than an order of magnitude improvement from reducing it, with no caveat on the timespan:</p> <blockquote> <p>&quot;NSB&quot; argues, indisputably, that if the accidental part of the work is less than 9/10 of the total, shrinking it to zero (which would take magic) will not give an order of magnitude productivity improvement.</p> </blockquote> <p>Both his original essay and the 1995 follow-up are charismatically written and contain a sort of local logic, where each piece of the essay sounds somewhat reasonable if you don't think about it too hard and you forget everything else the essay says. As with the original, a pedant could argue that this is technically not incoherent — after all, Brooks could be saying:</p> <ul> <li>at most 9/10th of complexity is accidental (if we ignore the later 1/2 claim, which is the kind of suspension of memory/disbelief one must do to read the essay)</li> <li>it would not be surprising for us to eliminate 100% of accidental complexity after 40 years</li> </ul> <p>While this is technically consistent (again, if we ignore the part that's inconsistent) and is a set of claims one could make, this would imply that 40 years from 1986, i.e., in 2026, it wouldn't be implausible for there to be literally zero room for any sort of productivity improvement from tooling, languages, or any other potential source of improvement. But this is absurd. If we look at other sections of Brooks' essay and combine their reasoning, we see other inconsistencies and absurdities.</p> <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:L"><p>Another issue that we see here is Brooks' insistence on bright-line distinctions between categories. Essential vs. accidental complexity. &quot;Types&quot; of solutions, such as languages vs. &quot;build vs. buy&quot;, etc.</p> <p>Brooks admits that &quot;build vs. buy&quot; is one avenue of attack on essential complexity. Perhaps he would agree that buying a regexp package would reduce the essential complexity since that would allow me to avoid keeping all of the concepts associated with writing a parser in my head for simple tasks. But what if, instead of buying regexes, I used a language where they're bundled into the standard library or is otherwise distributed with the language? Or what if, instead of having to write my own concurrency primitives, those are bundled into the language? Or for that matter, what about <a href="https://golang.org/pkg/net/http/">an entire HTTP server</a>? 
There is no bright-line distinction between what's in a library one can &quot;buy&quot; (for free in many cases nowadays) and one that's bundled into the language, so there cannot be a bright-line distinction between what gains a language provides and what gains can be &quot;bought&quot;. But if there's no bright-line distinction here, then it's not possible to say that one of these can reduce essential complexity and the other can't and maintain a bright-line distinction between essential and accidental complexity (in a response to Brooks, Harel argued against there being a clear distinction, and Brooks' reply was to say that there is, in fact, a bright-line distinction, although he provided no new argument).</p> <p>Brooks' repeated insistence on these false distinctions means that the reasoning in the essay isn't composable. As we've already seen in another footnote, if you take reasoning from one part of the essay and apply it alongside reasoning from another part of the essay, it's easy to create absurd outcomes and sometimes outright contradictions.</p> <p>I suspect this is one reason discussions about essential vs. accidental complexity are so muddled. It's not just that <a href="https://twitter.com/hillelogram/status/1211433465956196352">Brooks is being vague and handwave-y</a>; he's actually not self-consistent, so there isn't and cannot be a coherent takeaway. Michael Feathers has noted <a href="https://twitter.com/mfeathers/status/1259295515532865543">that people are generally not able to correctly identify essential complexity</a>; as he says, <a href="https://twitter.com/mfeathers/status/1256995176959971329">One person’s essential complexity is another person’s accidental complexity.</a> This is exactly what we should expect from the essay, since people who have different parts of it in mind will end up with incompatible views.</p> <p>This is also a problem when criticizing Brooks. Inevitably, someone will say that what Brooks really meant was something completely different. And that will be true. But Brooks will have meant something completely different while also having meant the things he said that I mention. In defense of the view I'm presenting in the body of the text here, it's a coherent view that one could have had in 1986. Many of Brooks' statements don't make sense even when considered as standalone statements, let alone when cross-referenced with the rest of his essay. For example, the statement that no single development will result in an order of magnitude improvement in the next decade. This statement is meaningless, as Brooks does not define, and no one can definitively say, what a &quot;single improvement&quot; is. And, as mentioned above, Brooks' essay reads quite oddly and basically does not make sense if that's what he's trying to claim. Another issue with most other readings of Brooks is that those are positions that would also be meaningless even if Brooks had done the work to make them well defined. Why does it matter if one single improvement or two result in an order of magnitude improvement? If it's two improvements, we'll use them both.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:A"><p>And by the way, this didn't only happen in 1955. I've worked with people who, this century, told me that assembly is basically as productive as any high level language.
This probably sounds ridiculous to almost every reader of this blog, but if you talk to people who spend all day writing microcode or assembly, you'll occasionally meet somebody who believes this.</p> <p>Thinking that the tools you personally use are as good as it gets is an easy trap to fall into.</p> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:B">Another quirk is that, while Brooks acknowledges that code re-use and libraries can increase productivity, he claims that gains from languages and tools are pretty much over, but these claims can't both hold because there isn't a bright-line distinction between libraries and languages/tools. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:F"><p>Let's arbitrarily use a Motorola 68k processor with an FP co-processor that could do 200 kFLOPS as a reference for how much power we might have in a consumer CPU (FLOPS is a bad metric for multiple reasons, but this is just to get an idea of what it would take to get 1 CPU-year of computational resources, and Brooks himself uses MIPS as a term as if it's meaningful). By comparison, the Cray-2 could achieve 1.9 GFLOPS, or roughly 10000x the performance (I think actually less if we were to do a comparable comparison instead of using non-comparable GFLOPS numbers, but let's be generous here). There are 525600 / 5 = 105120 five minute periods in a year, so to get 1 CPU year's worth of computation in five minutes we'd need 105120 / 10000 =~ 10 Cray-2s per query, not including the overhead of aggregating results across Cray-2s.</p> <p>It's unreasonable to think that a consumer software company in 1986 would have enough Cray-2s lying around to allow for any random programmer to quickly run CPU years worth of queries whenever they wanted to do some data analysis. One source claims that only 27 Cray-2s were made over the production lifetime of the machine (1985 to 1990). Even if my employer owned all of them and they were all created by 1986, that still wouldn't be sufficient to allow the kind of ad hoc querying capacity that I have access to in 2020.</p> <p>Today, someone at a startup can even make an analogous argument when comparing to a decade ago. You used to have to operate a cluster that would be prohibitively annoying for a startup to operate unless the startup is very specialized, but you can now just use Snowflake and basically get Presto but only pay for the computational power you use (plus a healthy markup) instead of paying to own a cluster and for all of the employees necessary to make sure the cluster is operable.</p> <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:O">I actually run into one of these every time I publish a new post. I write my posts in Google docs and then copy them into emacs running inside tmux running inside Alacritty. My posts are small enough to fit inside L2 cache, so I could have 64B/3.5 cycle write bandwidth. And yet, the copy+paste operation can take ~1 minute and is so slow I can watch the text get pasted in. Since my chip is working super hard to make sure the copy+paste happens, it's running at its full non-turbo frequency of 4.2GHz, giving it 76.8GB/s of write bandwidth. For a 40kB post, 1 minute = 666B/s. 76.8G/666 =~ 8 orders of magnitude left on the table.
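<p><small>A minimal Python sketch of the back-of-the-envelope arithmetic above, assuming the same numbers used in this footnote (a 4.2GHz non-turbo clock, 64 bytes per 3.5 cycles of write bandwidth, and a 40kB post pasted over roughly a minute):</small></p> <pre><code>import math

# Assumed numbers from this footnote (not measured here).
clock_hz = 4.2e9                            # non-turbo clock frequency
bytes_per_cycle = 64 / 3.5                  # L2 write bandwidth per cycle
peak_write_bw = clock_hz * bytes_per_cycle  # ~7.68e10 B/s, i.e. ~76.8 GB/s

post_bytes = 40e3                           # ~40kB post
paste_seconds = 60                          # ~1 minute to paste
observed_bw = post_bytes / paste_seconds    # ~666 B/s

gap = peak_write_bw / observed_bw           # ~1.2e8
# prints peak B/s, observed B/s, and the gap in orders of magnitude (~8.1)
print(round(peak_write_bw), round(observed_bw), round(math.log10(gap), 1))
</code></pre>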
<a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:V">In this specific case, I'm sure somebody will argue that Visual Studio was quite nice in 2000 and ran on much slower computers (and the debugger was arguably better than it is in the current version). But there was no comparable tool on Linux, nor was there anything comparable to today's options in the VSCode-like space of easy-to-learn programming editor that provides programming-specific facilities (as opposed to being a souped up version of notepad) without being a full-fledged IDE. <a class="footnote-return" href="#fnref:V"><sup>[return]</sup></a></li> </ol> </div> How do cars do in out-of-sample crash testing? car-safety/ Tue, 30 Jun 2020 00:06:34 -0700 car-safety/ <p>Any time you have a benchmark that gets taken seriously, some people will start gaming the benchmark. Some famous examples in computing are the CPU benchmark <a href="https://spec.org/benchmarks.html">specfp</a> and video game benchmarks. With specfp, Sun managed to increase its score on <a href="https://www.spec.org/osg/cpu2000/CFP2000/179.art/docs/179.art.html">179.art</a> (a sub-benchmark of specfp) by 12x with a compiler tweak that essentially re-wrote the benchmark kernel, which increased the Sun <a href="https://en.wikipedia.org/wiki/UltraSPARC_III">UltraSPARC</a>’s overall specfp score by 20%. At times, GPU vendors have added specialized benchmark-detecting code to their drivers that lowers image quality during benchmarking to produce higher benchmark scores. Of course, gaming the benchmark isn't unique to computing and we see people do this <a href="discontinuities/">in other fields</a>. It’s not surprising that we see this kind of behavior since improving benchmark scores by cheating on benchmarks is much cheaper (and therefore higher ROI) than improving benchmark scores by actually improving the product.</p> <p>As a result, I'm generally suspicious when people take highly specific and well-known benchmarks too seriously. Without other data, you don't know what happens when conditions aren't identical to the conditions in the benchmark. With GPU and CPU benchmarks, it’s possible for most people to run the standard benchmarks with slightly tweaked conditions. If the results change dramatically for small changes to the conditions, that’s evidence that the vendor is, if not cheating, at least shading the truth.</p> <p>Benchmarks of physical devices can be more difficult to reproduce. Vehicle crash tests are a prime example of this -- they're highly specific and well-known benchmarks that use up a car for some test runs.</p> <p>While there are multiple organizations that do crash tests, they each have particular protocols that they follow. Car manufacturers, if so inclined, could optimize their cars for crash test scores instead of actual safety. Checking to see if crash tests are being gamed with hyper-specific optimizations isn't really feasible for someone who isn't a billionaire. The easiest way we can check is by looking at what happens when new tests are added since that lets us see a crash test result that manufacturers weren't optimizing for just to get a good score.</p> <p>While having car crash test results is obviously better than not having them, the results themselves don't tell us what happens when we get into an accident that doesn't exactly match a benchmark. 
Unfortunately, if we get into a car accident, we don't get to ask the driver of the vehicle we're colliding with to change their location, angle of impact, and speed, in order for the collision to comply with an <a href="https://en.wikipedia.org/wiki/Insurance_Institute_for_Highway_Safety">IIHS</a>, <a href="https://en.wikipedia.org/wiki/National_Highway_Traffic_Safety_Administration">NHTSA</a>, or <a href="https://en.wikipedia.org/wiki/New_Car_Assessment_Program">*NCAP</a> test protocol.</p> <p>For this post, we're going to look at <a href="https://en.wikipedia.org/wiki/Insurance_Institute_for_Highway_Safety">IIHS</a> test scores when they added the (driver side) small overlap and passenger side small overlap tests, which were added in 2012 and 2018, respectively. We'll start with a summary of the results and then discuss what those results mean and other factors to consider when evaluating car safety, followed by details of the methodology.</p> <h3 id="results">Results</h3> <p>The ranking below is mainly based on how well vehicles scored when the driver-side small overlap test was added in 2012 and how well models scored when they were modified to improve test results.</p> <ul> <li><strong>Tier 1</strong>: good without modifications <ul> <li>Volvo</li> </ul></li> <li><strong>Tier 2</strong>: mediocre without modifications; good with modifications <ul> <li>None</li> </ul></li> <li><strong>Tier 3</strong>: poor without modifications; good with modifications <ul> <li>Mercedes</li> <li>BMW</li> </ul></li> <li><strong>Tier 4</strong>: poor without modifications; mediocre with modifications <ul> <li>Honda</li> <li>Toyota</li> <li>Subaru</li> <li>Chevrolet</li> <li>Tesla</li> <li>Ford</li> </ul></li> <li><strong>Tier 5</strong>: poor with modifications or modifications not made <ul> <li>Hyundai</li> <li>Dodge</li> <li>Nissan</li> <li>Jeep</li> <li>Volkswagen</li> </ul></li> </ul> <p>These descriptions are approximations. Honda, Ford, and Tesla are the poorest fits for these descriptions: Ford arguably sits halfway between Tier 4 and Tier 5, but is also arguably better than Tier 4 and doesn't fit the classification at all, and Honda and Tesla don't properly fit into any category (the listed category is the closest fit). Some other placements are also imperfect. Details below.</p> <h3 id="general-commentary">General commentary</h3> <p>If we look at overall mortality in the U.S., there's a pretty large age range for which car accidents are the leading cause of death. Although the numbers will vary depending on what data set we look at, when the driver-side small overlap test was added, the IIHS estimated that 25% of vehicle fatalities came from small overlap crashes. It's also worth noting that small overlap crashes were thought to be implicated in a significant fraction of vehicle fatalities at least since the 90s; this was not a novel concept in 2012.</p> <p>Despite the importance of small overlap crashes, from looking at the results when the IIHS added the driver-side and passenger-side small overlap tests in 2012 and 2018, it looks like almost all car manufacturers were optimizing for the benchmarks and not overall safety. Except for Volvo, all carmakers examined produced cars that fared poorly on driver-side small overlap crashes until the driver-side small overlap test was added.</p> <p>When the driver-side small overlap test was added in 2012, most manufacturers modified their vehicles to improve driver-side small overlap test scores.
However, until the IIHS added a passenger-side small overlap test in 2018, most manufacturers skimped on the passenger side. When the new test was added, they beefed up passenger safety as well. To be fair to car manufacturers, some of them got the hint about small overlap crashes when the driver-side test was added in 2012 and did not need to make further modifications to score well on the passenger-side test, including Mercedes, BMW, and Tesla (and arguably a couple of others, but the data is thinner in the other cases; Volvo didn't need a hint).</p> <h3 id="other-benchmark-limitations">Other benchmark limitations</h3> <p>There are a number of other areas where we can observe that most car makers are optimizing for benchmarks at the expense of safety.</p> <h4 id="gender-weight-and-height">Gender, weight, and height</h4> <p>Another issue is crash test dummy overfitting. For a long time, adult NHTSA and IIHS tests used a 1970s 50%-ile male dummy, which is 5'9&quot; and 171lbs. Regulators called for a female dummy in 1980 but, due to budget cutbacks during the Reagan era, initial plans were shelved and the NHTSA didn't put one in a car until 2003. The female dummy is a scaled down version of the male dummy, scaled down to 5%-ile 1970s height and weight (4'11&quot;, 108lbs; another model is 4'11&quot;, 97lbs). In frontal crash tests, when a female dummy is used, it's always a passenger (a 5%-ile woman is in the driver's seat in one NHTSA side crash test and the IIHS side crash test). For reference, in 2019, the average weight of a U.S. adult male was 198 lbs and the average weight of a U.S. adult female was 171 lbs.</p> <p>Using a 1970s U.S. adult male crash test dummy causes a degree of overfitting for 1970s 50%-ile men. For example, starting in the 90s, manufacturers started adding systems to protect against whiplash. Volvo and Toyota use a kind of system that reduces whiplash in men and women and appears to have slightly more benefit for women. Most car makers use a kind of system that reduces whiplash in men but, on average, has little impact on whiplash injuries in women.</p> <p>It appears that we also see a similar kind of optimization for crashes in general and not just whiplash. We don't have crash test data on this, and looking at real-world safety data is beyond the scope of this post, but I'll note that, until around the time the NHTSA put the 5%-ile female dummy into some crash tests, most car manufacturers not named Volvo had a significant fatality rate differential in side crashes based on gender (with men dying at a lower rate and women dying at a higher rate).</p> <p>Volvo claims to have been using computer models for decades to simulate what would happen if women (including pregnant women) are involved in a car accident.</p> <h4 id="other-crashes">Other crashes</h4> <p>Volvo is said to have a crash test facility where they do a number of other crash tests that aren't done by testing agencies.
A reason that they scored well on the small overlap tests when they were added is that they were already doing small overlap crash tests before the IIHS started doing small overlap crash tests.</p> <p>Volvo also says that they test rollovers (the IIHS tests roof strength and the NHTSA computes how difficult a car is to roll based on properties of the car, but neither tests what happens in a real rollover accident), rear collisions (Volvo claims these are especially important to test if there are children in the 3rd row of a 3-row SUV), and driving off the road (Volvo has a &quot;standard&quot; ditch they use; they claim this test is important because running off the road is implicated in a large fraction of vehicle fatalities).</p> <p>If other car makers do similar tests, I couldn't find out much about the details. Based on crash test scores, it seems like they weren't doing or even considering small overlap crash tests before 2012. Based on how many car makers had poor scores when the passenger side small overlap test was added in 2018, I think it would be surprising if other car makers had a large suite of crash tests they ran that aren't being run by testing agencies, but it's theoretically possible that they do and just didn't include a passenger side small overlap test.</p> <h3 id="caveats">Caveats</h3> <p>We shouldn't overgeneralize from these test results. As we noted above, crash test results test very specific conditions. As a result, what we can conclude when a couple of new crash tests are added is also very specific. Additionally, there are a number of other things we should keep in mind when interpreting these results.</p> <h4 id="limited-sample-size">Limited sample size</h4> <p>One limitation of this data is that we don't have results for a large number of copies of the same model, so we're unable to observe intra-model variation, which could occur due to minor, effectively random, differences in test conditions as well as manufacturing variations between different copies of the same model. We can observe that these do matter since some cars will see different results when two copies of the same model are tested. For example, here's a quote from the IIHS report on the Dodge Dart:</p> <blockquote> <p>The Dodge Dart was introduced in the 2013 model year. Two tests of the Dart were conducted because electrical power to the onboard (car interior) cameras was interrupted during the first test. In the second Dart test, the driver door opened when the hinges tore away from the door frame. In the first test, the hinges were severely damaged and the lower one tore away, but the door stayed shut. In each test, the Dart’s safety belt and front and side curtain airbags appeared to adequately protect the dummy’s head and upper body, and measures from the dummy showed little risk of head and chest injuries.</p> </blockquote> <p>It looks like, had electrical power to the interior car cameras not been disconnected, there would have been only one test and it wouldn't have become known that there's a risk of the door coming off due to the hinges tearing away. In general, we have no direct information on what would happen if another copy of the same model were tested.</p> <p>Using IIHS data alone, one thing we might do here is to also consider results from different models made by the same manufacturer (or built on the same platform).
Although this isn't as good as having multiple tests for the same model, test results between different models from the same manufacturer are correlated, and knowing that, for example, a 2nd test of a model that happened by chance showed significantly worse results should probably reduce our confidence in other test scores from the same manufacturer. There are some things that complicate this, e.g., when looking at Toyota, the Yaris is actually a re-branded Mazda2, so perhaps that shouldn't be considered as part of a pooled test result, and doing this kind of statistical analysis is beyond the scope of this post.</p> <h4 id="actual-vehicle-tested-may-be-different">Actual vehicle tested may be different</h4> <p>Although I don't think this should impact the results in this post, another issue to consider when looking at crash test results is how results are shared between models. As we just saw, different copies of the same model can have different results. Vehicles that are somewhat similar are often considered the same for crash test purposes and will share the same score (only one of the models will be tested).</p> <p>For example, this is true of the Kia Stinger and the Genesis G70. The Kia Stinger is 6&quot; longer than the G70 and a fully loaded AWD Stinger is about 500 lbs heavier than a base-model G70. The G70 is the model that IIHS tested -- if you look up a Kia Stinger, you'll get scores for a Stinger with a note that a base model G70 was tested. That's a pretty big difference considering that cars that are nominally identical (such as the Dodge Darts mentioned above) can get different scores.</p> <h4 id="quality-may-change-over-time">Quality may change over time</h4> <p>We should also be careful not to overgeneralize temporally. If we look at crash test scores of recent Volvos (vehicles on the Volvo P3 and Volvo SPA platforms), crash test scores are outstanding. However, if we look at Volvo models based on the older Ford C1 platform<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">1</a></sup>, crash test scores for some of these aren't as good (in particular, while the S40 doesn't score poorly, it scores Acceptable in some categories instead of Good across the board). Although Volvo has had stellar crash test scores recently, this doesn't mean that they have always had or will always have stellar crash test scores.</p> <h4 id="models-may-vary-across-markets">Models may vary across markets</h4> <p>We also can't generalize across cars sold in different markets, even for vehicles that sound like they might be identical. For example, see <a href="https://www.youtube.com/watch?v=UL_2MdSTM7g">this crash test of a Nissan NP300 manufactured for sale in Europe vs. a Nissan NP300 manufactured for sale in Africa</a>. Since European cars undergo EuroNCAP testing (similar to how U.S. cars undergo NHTSA and IIHS testing), vehicles sold in Europe are optimized to score well on EuroNCAP tests. Crash testing cars sold in Africa has only been done relatively recently, so car manufacturers haven't had PR pressure to optimize their cars for benchmarks and they'll produce cheaper models or cheaper variants of what superficially appear to be the same model. This appears to be no different from what most car manufacturers do in the U.S. or Europe -- they're optimizing for cost as long as they can do that without scoring poorly on benchmarks.
It's just that, since there wasn't an African crash test benchmark, that meant they could go all-in on the cost side of the cost-safety tradeoff<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">2</a></sup>.</p> <p><a href="https://deepblue.lib.umich.edu/bitstream/handle/2027.42/112977/103199.pdf">This report</a> compared U.S. and European car models and found differences in safety due to differences in regulations. They found that European models had lower injury risk in frontal/side crashes and that driver-side mirrors were designed in a way that reduced the risk of lane-change crashes relative to U.S. designs and that U.S. vehicles were safer in rollovers and had headlamps that made pedestrians more visible.</p> <h4 id="non-crash-tests">Non-crash tests</h4> <p>Over time, more and more of the &quot;low hanging fruit&quot; from crash safety has been picked, making crash avoidance relatively more important. Tests of crash mitigation are relatively primitive compared to crash tests and we've seen that crash tests had and have major holes. One might expect, based on what we've seen with crash tests, that Volvo has a particularly good set of tests they use for their crash avoidance technology (traction control, stability control, automatic braking, etc.), but &quot;bar room&quot; discussion with folks who are familiar with what vehicle safety tests are being done on automated systems seems to indicate that's not the case. There was a relatively recent recall of quite a few Volvo vehicles due to the safety systems incorrectly not triggering. I'm not going to tell the story about that one here, but I'll say that it's fairly horrifying and indicative of serious systemic issues. From other backchannel discussions, it sounds like BMW is relatively serious about the software side of safety, for a car company, but the lack of rigor in this kind of testing would be horrifying to someone who's seen a release process for something like a mainstream CPU.</p> <p>Crash avoidance becoming more important might also favor companies that have more user-friendly driver assistance systems, e.g., in multiple generations of tests, Consumer Reports has given GM's Super Cruise system the highest rating while they've repeatedly noted that Tesla's Autopilot system facilitates unsafe behavior.</p> <h4 id="scores-of-vehicles-of-different-weights-aren-t-comparable">Scores of vehicles of different weights aren't comparable</h4> <p>A 2700lb subcompact vehicle that scores Good may fare worse than a 5000lb SUV that scores Acceptable. This is because the small overlap tests involve driving the vehicle into a fixed obstacle, as opposed to a reference vehicle or vehicle-like obstacle of a specific weight. This is, in some sense, equivalent to crashing the vehicle into a vehicle of the same weight, so it's as if the 2700lb subcompact was tested by running it into a 2700lb subcompact and the 5000lb SUV was tested by running it into another 5000 lb SUV.</p> <h4 id="how-to-increase-confidence">How to increase confidence</h4> <p>We've discussed some reasons we should reduce our confidence in crash test scores. If we wanted to increase our confidence in results, we could look at test results from other test agencies and aggregate them and also look at public crash fatality data (more on this later). 
I haven't looked at the terms and conditions of scores from other agencies, but one complication is that the IIHS does not allow you to display the result of any kind of aggregation if you use their API or data dumps (I, time consumingly, did not use their API for this post because of that).</p> <h4 id="using-real-life-crash-data">Using real life crash data</h4> <p>Public crash fatality data is complex and deserves its own post. In this post, I'll note that, if you look at the easiest relevant data for people in the U.S., this data does not show that Volvos are particularly safe (or unsafe). For example, if we look at <a href="https://www.iihs.org/api/datastoredocument/status-report/pdf/52/3">this report from 2017, which covers models from 2014</a>, two Volvo models made it into the report and both score roughly middle of the pack for their class. In the previous report, one Volvo model is included and it's among the best in its class; in the next, one Volvo model is included and it's among the worst in its class. We can observe this kind of variance for other models, as well. For example, among 2014 models, the Volkswagen Golf had one of the highest fatality rates for all vehicles (not just in its class). But among 2017 vehicles, it had among the lowest fatality rates for all vehicles. It's unclear how much of that change is from random variation and how much is because of differences between a 2014 and 2017 Volkswagen Golf.</p> <p>Overall, it seems like noise is a pretty important factor in results. And if we look at the information that's provided, we can see a few things that are odd. First, there are a number of vehicles where the 95% confidence interval for the fatality rate runs from 0 to N. We should have pretty strong priors that there was no 2014 model vehicle that was so safe that the probability of being killed in a car accident was zero. If we were taking a Bayesian approach (though I believe the authors of the report are not), and someone told us that the uncertainty interval for the true fatality rate of a vehicle had a &gt;= 5% chance of including zero, we would say that either we should use a more informative prior or we should use a model that can incorporate more data (in this case, perhaps we could try to understand the variance between fatality rates of different models in the same class and then use the base rate of fatalities for the class as a prior, or we could incorporate information from other models under the same make if those are believed to be correlated).</p> <p>Some people object to using informative priors as a form of bias laundering, but we should note that the prior that's used for the IIHS analysis is not completely uninformative. All of the intervals reported stop at zero because they're using the fact that a vehicle cannot create life to bound the interval at zero. But we have information that's nearly as strong that no 2014 vehicle is so safe that the expected fatality rate is zero; using that information is not fundamentally different from capping the interval at zero and not reporting negative numbers for the uncertainty interval of the fatality rate.</p> <p>Also, the IIHS data only includes driver fatalities.
This is understandable since that's the easiest way to normalize for the number of passengers in the car, but it means that we can't possibly see the impact of car makers not improving passenger small-overlap safety until the passenger-side small overlap test was added in 2018, or the result of the lack of rear crash testing for the case Volvo considers important (kids in the back row of a 3-row SUV). This also means that we cannot observe the impact of a number of things Volvo has done, e.g., being very early on pedestrian and then cyclist detection in their automatic braking system, adding a crumple zone to reduce back injuries in run-off-road accidents, which they observed often cause life-changing spinal injuries due to the impact from the vehicle dropping, etc.</p> <p>We can also observe that, in the IIHS analysis, many factors that one might want to control for aren't (e.g., miles driven isn't controlled for, which will make trucks look relatively worse and luxury vehicles look relatively better; rural vs. urban miles driven also isn't controlled for, which will have the same directional impact). One way to see that the numbers are heavily influenced by confounding factors is by looking at AWD or 4WD vs. 2WD versions of cars. They often have wildly different fatality rates even though the safety differences are not very large (and the difference is often in favor of the 2WD vehicle). Some plausible causes of that are random noise, differences in who buys different versions of the same vehicle, and differences in how the vehicles are used.</p> <p>If we'd like to answer the question &quot;which car makes or models are more or less safe&quot;, I don't find any of the aggregations that are publicly available to be satisfying and I think we need to look at the source data and do our own analysis to see if the data are consistent with what we see in crash test results.</p> <h3 id="conclusion">Conclusion</h3> <p>We looked at 12 different car makes and how they fared when the IIHS added small overlap tests. We saw that only Volvo was taking this kind of accident seriously before companies were publicly shamed for having poor small overlap safety by the IIHS even though small overlap crashes were known to be a significant source of fatalities at least since the 90s.</p> <p>Although I don't have the budget to do other tests, such as a rear crash test in a fully occupied vehicle, it appears plausible and perhaps even likely that most car makers that aren't Volvo would have mediocre or poor test scores if a testing agency decided to add another kind of crash test.</p> <h3 id="bonus-real-engineering-vs-programming">Bonus: &quot;real engineering&quot; vs. programming</h3> <p>As Hillel Wayne has noted, although <a href="https://twitter.com/danluu/status/1162469763374673920">programmers often have an idealized view of what &quot;real engineers&quot; do</a>, when you <a href="https://youtu.be/3018ABlET1Y">compare what &quot;real engineers&quot; do with what programmers do, it's frequently not all that different</a>.
In particular, a common lament of programmers is that we're not held liable for our mistakes or poor designs, even in cases where that costs lives.</p> <p>Although automotive companies can, in some cases, be held liable for unsafe designs, just optimizing for a small set of benchmarks, which must've resulted in extra deaths compared to optimizing for safety instead of benchmark scores, isn't something that engineers or corporations were, in general, held liable for.</p> <h3 id="bonus-reputation">Bonus: reputation</h3> <p>If I look at what people in my extended social circles think about vehicle safety, Tesla has the best reputation by far. If you look at broad-based consumer polls, that's a different story, and Volvo usually wins there, with other manufacturers fighting for a distant second.</p> <p>I find the Tesla thing interesting since <a href="https://news.ycombinator.com/item?id=33512915">their responses are basically the opposite of what you'd expect from a company that was serious about safety</a>. When serious problems have occurred (with respect to safety or otherwise), they often have a very quick response that's basically &quot;everything is fine&quot;. <a href="https://twitter.com/danluu/status/760652055446827009">I would expect an organization that's serious about safety or improvement to respond with &quot;we're investigating&quot;, followed by a detailed postmortem explaining what went wrong, but that doesn't appear to be Tesla's style</a>.</p> <p>For example, on the driver-side small overlap test, Tesla had one model with a relevant score and it scored Acceptable (below Good, but above Poor and Marginal) even after modifications were made to improve the score. <a href="https://www.businessinsider.com/tesla-responds-to-model-s-crash-test-findings-iihs-2017-7">Tesla disputed the results, saying they make &quot;the safest cars in history&quot;</a> and implying that IIHS should be ignored because they have ulterior motives, in favor of crash test scores from an agency that is objective and doesn't have ulterior motives, i.e., the agency that gave Tesla a good score:</p> <blockquote> <p>While IIHS and dozens of other private industry groups around the world have methods and motivations that suit their own subjective purposes, the most objective and accurate independent testing of vehicle safety is currently done by the U.S. Government which found Model S and Model X to be the two cars with the lowest probability of injury of any cars that it has ever tested, making them the safest cars in history.</p> </blockquote> <p>As we've seen, Tesla isn't unusual for optimizing for a specific set of crash tests and achieving a mediocre score when an unexpected type of crash occurs, but their response is unusual. However, it makes sense from a cynical PR perspective. As we've seen over the past few years, loudly proclaiming something, regardless of whether or not it's true, even when there's incontrovertible evidence that it's untrue, not only seems to work, but that kind of bombastic rhetoric also appears to attract superfans who will aggressively defend the brand.
If you watch car reviewers on youtube, they'll sometimes mention that they get hate mail for reviewing Teslas just like they review any other car and that they don't see anything like it for any other make.</p> <p>Apple also used this playbook to good effect in the 90s and early '00s, when they were rapidly falling behind in performance and responded not by improving performance, but by running a series of ad campaigns saying that they had the best performance in the world and that they were shipping &quot;supercomputers&quot; on the desktop.</p> <p>Another reputational quirk is that I know a decent number of people who believe that the safest cars they can buy are &quot;American Cars from the 60's and 70's that aren't made of plastic&quot;. We don't have directly relevant small overlap crash test scores for old cars, but the test data we do have on old cars indicates that they fare extremely poorly in overall safety compared to modern cars. For a visually dramatic example, <a href="https://www.youtube.com/watch?v=fPF4fBGNK0U">see this crash test of a 1959 Chevrolet Bel Air vs. a 2009 Chevrolet Malibu</a>.</p> <h3 id="appendix-methodology-summary">Appendix: methodology summary</h3> <p>The top-line results section uses scores for the driver-side small overlap test both because it's the one where I think it's the most difficult to justify skimping on safety as measured by the test and because it's been around for long enough that we can see the impact of modifications to existing models and changes to subsequent models, which isn't true of the passenger side small overlap test (where many models are still untested).</p> <p>For the passenger side small overlap test, someone might argue that the driver side is more important because you virtually always have a driver in a car accident and may or may not have a front passenger. Also, for small overlap collisions (which simulate a head-on collision where the vehicles only overlap by 25%), driver's side collisions are more likely than passenger side collisions.</p> <p>Except to check Volvo's scores, I didn't look at roof crash test scores (which were added in 2009). I'm not going to describe the roof test in detail, but for the roof test, someone might argue that the roof test score should be used in conjunction with scoring the car for rollover probability since the roof test just tests roof strength, which is only relevant when a car has rolled over. I think, given what the data show, this objection doesn't hold in many cases (the vehicles with the worst roof test scores are often vehicles that have relatively high rollover rates), but it does in some cases, which would complicate the analysis.</p> <p>In most cases, we only get one reported test result for a model. However, there can be multiple versions of a model -- including before and after making safety changes intended to improve the test score. If changes were made to the model to improve safety, the test score is usually from after the changes were made and we usually don't get to see the score from before the model was changed. However, there are many exceptions to this, which are noted in the detailed results section.</p> <p>For this post, scores only count if the model was introduced before or near when the new test was introduced, since models introduced later could have design changes that optimize for the test.</p> <h3 id="appendix-detailed-results">Appendix: detailed results</h3> <p>On each test, IIHS gives an overall rating (from worst to best) of Poor, Marginal, Acceptable, or Good.
The tests have sub-scores, but we're not going to use those for this analysis. In each sub-section, we'll look at how many models got each score when the small overlap tests were added.</p> <h4 id="volvo">Volvo</h4> <p>All Volvo models examined scored Good (the highest possible score) on the new tests when they were added (roof, driver-side small overlap, and passenger-side small overlap). One model, the 2008-2017 XC60, had a change made in 2013 to trigger its side curtain airbag during a small overlap collision. Other models were tested without modifications.</p> <h4 id="mercedes">Mercedes</h4> <p>Of three pre-existing models with test results for driver-side small overlap, one scored Marginal without modifications and two scored Good after structural modifications. The model where we only have unmodified test scores (Mercedes C-Class) was fully re-designed after 2014, shortly after the driver-side small overlap test was introduced.</p> <p>As mentioned above, we often only get to see public results for models without modifications to improve results xor with modifications to improve results, so, for the models that scored Good, we don't actually know how they would've scored if you bought a vehicle before Mercedes updated the design, but the Marginal score from the one unmodified model we have is a negative signal.</p> <p>Also, when the passenger side small overlap test was added, the Mercedes vehicles generally scored Good, indicating that Mercedes didn't only increase protection on the driver's side in order to improve test scores.</p> <h4 id="bmw">BMW</h4> <p>Of the two models where we have relevant test scores, both scored Marginal before modifications. In one of the cases, there's also a score after structural changes were made in the 2017 model (recall that the driver-side small overlap test was introduced in 2012) and the model scored Good afterwards. The other model was fully redesigned after 2016.</p> <p>For the five models where we have relevant passenger-side small overlap scores, all scored Good, indicating that the changes made to improve driver-side small overlap test scores weren't only made on the driver's side.</p> <h4 id="honda">Honda</h4> <p>Of the five Honda models where we have relevant driver-side small overlap test scores, two scored Good, one scored Marginal, and two scored Poor. The model that scored Marginal had structural changes plus a seatbelt change in 2015 that changed its score to Good; other models weren't updated or don't have updated IIHS scores.</p> <p>Of the six Honda models where we have passenger-side small overlap test scores, two scored Good without modifications, two scored Acceptable without modifications, and one scored Good with modifications to the bumper.</p> <p>All of those models scored Good on the driver side small overlap test, indicating that when Honda increased the safety on the driver's side to score Good on the driver's side test, they didn't apply the same changes to the passenger side.</p> <h4 id="toyota">Toyota</h4> <p>Of the six Toyota models where we have relevant driver-side small overlap test scores for unmodified models, one scored Acceptable, four scored Marginal, and one scored Poor.</p> <p>The model that scored Acceptable had structural changes made to improve its score to Good, but on the driver's side only. The model was later tested in the passenger-side small overlap test and scored Acceptable.
Of the four models that scored Marginal, one had structural modifications made in 2017 that improved its score to Good and another had airbag and seatbelt changes that improved its score to Acceptable. The vehicle that scored Poor had structural changes made that improved its score to Acceptable in 2014, followed by later changes that improved its score to Good.</p> <p>There are four additional models where we only have scores from after modifications were made. Of those, one scored Good, one scored Acceptable, one scored Marginal, and one scored Poor.</p> <p>In general, changes appear to have been made to the driver's side only and, on introduction of the passenger side small overlap test, vehicles had passenger side small overlap scores that were the same as the driver's side score before modifications.</p> <h4 id="ford">Ford</h4> <p>Of the two models with relevant driver-side small overlap test scores for unmodified models, one scored Marginal and one scored Poor. Both of those models were produced into 2019 and neither has an updated test result. Of the three models where we have relevant results for modified vehicles, two scored Acceptable and one scored Marginal. Also, one model was released the year the small overlap test was introduced and one the year after; both of those scored Acceptable. It's unclear if those should be considered modified or not since the design may have had last-minute changes before release.</p> <p>We only have three relevant passenger-side small overlap tests. One is Good (for a model released in 2015) and the other two are Poor; these are the two models mentioned above as having scored Marginal and Poor, respectively, on the driver-side small overlap test. It appears that the models continued to be produced into 2019 without safety changes. Both of these unmodified models were trucks; this isn't very unusual for a truck and is one of a number of reasons that fatality rates are generally higher in trucks -- until recently, many of them were based on old platforms that hadn't been updated for a long time.</p> <h4 id="chevrolet">Chevrolet</h4> <p>Of the three Chevrolet models where we have relevant driver-side small overlap test scores before modifications, one scored Acceptable and two scored Marginal. One of the Marginal models had structural changes plus a change that caused side curtain airbags to deploy sooner in 2015, which improved its score to Good.</p> <p>Of the four Chevrolet models where we only have relevant driver-side small overlap test scores after the model was modified (all had structural modifications), two scored Good and two scored Acceptable.</p> <p>We only have one relevant score for the passenger-side small overlap test; that score is Marginal. That's on the model that was modified to improve its driver-side small overlap test score from Marginal to Good, indicating that the changes were made to improve the driver-side test score and not to improve passenger safety.</p> <h4 id="subaru">Subaru</h4> <p>We don't have any models where we have relevant passenger-side small overlap test scores for models before they were modified.</p> <p>One model had a change to cause its airbag to deploy during small overlap tests; it scored Acceptable.
Two models had some kind of structural changes, one of which scored Good and one of which scored Acceptable.</p> <p>The model that had airbag changes had structural changes made in 2015 that improved its score from Acceptable to Good.</p> <p>For the one model where we have relevant passenger-side small overlap test scores, the score was Marginal. Also, for one of the models with structural changes, it was indicated that the changes included changes to the left part of the firewall, indicating that changes were made to improve the driver's side test score without improving safety for a passenger in a passenger-side small overlap crash.</p> <h4 id="tesla">Tesla</h4> <p>There's only one model with relevant results for the driver-side small overlap test. That model scored Acceptable before and after modifications were made to improve test scores.</p> <h4 id="hyundai">Hyundai</h4> <p>Of the five vehicles where we have relevant driver-side small overlap test scores, one scored Acceptable, three scored Marginal, and one scored Poor. We don't have any indication that models were modified to improve their test scores.</p> <p>Of the two vehicles where we have relevant passenger-side small overlap test scores for unmodified models, one scored Good and one scored Acceptable.</p> <p>We also have one score for a model that had structural modifications to score Acceptable, which later had further modifications that allowed it to score Good. That model was introduced in 2017 and had a Good score on the driver-side small overlap test without modifications, indicating that it was designed to achieve a good test score on the driver's side test without similar consideration for a passenger-side impact.</p> <h4 id="dodge">Dodge</h4> <p>Of the five models where we have relevant driver-side small overlap test scores for unmodified models, two scored Acceptable, one scored Marginal, and two scored Poor. There are also two models where we have test scores after structural changes were made for safety in 2015; both of those models scored Marginal.</p> <p>We don't have relevant passenger-side small overlap test scores for any model, but even if we did, the dismal scores on the modified models mean that we might not be able to tell if similar changes were made to the passenger side.</p> <h4 id="nissan">Nissan</h4> <p>Of the seven models where we have relevant driver-side small overlap test scores for unmodified models, two scored Acceptable and five scored Poor.</p> <p>We have one model that only has test scores for a modified model; the frontal airbags and seatbelts were modified in 2013 and the side curtain airbags were modified in 2017. The score after modifications was Marginal.</p> <p>One of the models that scored Poor had structural changes made in 2015 that improved its score to Good.</p> <p>Of the four models where we have relevant passenger-side small overlap test scores, two scored Good, one scored Acceptable (that model scored Good on the driver-side test), and one scored Marginal (that model also scored Marginal on the driver-side test).</p> <h4 id="jeep">Jeep</h4> <p>Of the two models where we have relevant driver-side small overlap test scores for unmodified models, one scored Marginal and one scored Poor.</p> <p>There's one model where we only have test scores after modifications; that model had changes to its airbags and seatbelts and it scored Marginal after the changes.
This model was also later tested on the passenger-side small overlap test and scored Poor.</p> <p>One other model has a relevant passenger-side small overlap test score; it scored Good.</p> <h4 id="volkswagen">Volkswagen</h4> <p>The two models where we have relevant driver-side small overlap test scores for unmodified models both scored Marginal.</p> <p>Of the two models where we only have scores after modifications, one was modified in 2013 and scored Marginal after modifications. It was then modified again in 2015 and scored Good after modifications. That model was later tested on the passenger side small-overlap test, where it scored Acceptable, indicating that the modifications differentially favored the driver's side. The other scored Acceptable after changes made in 2015 and then scored Good after further changes made in 2016. The 2016 model was later tested on the passenger-side small overlap test and scored Marginal, once again indicating that changes differentially favored the driver's side.</p> <p>We have passenger-side small overlap test scores for two other models, both of which scored Acceptable. These were models introduced in 2015 (well after the introduction of the driver-side small overlap test) and scored Good on the driver-side small overlap test.</p> <h3 id="2021-update">2021 update</h3> <p><a href="https://www.iihs.org/news/detail/small-suvs-struggle-in-new-tougher-side-test">The IIHS has released the first set of results for their new &quot;upgraded&quot; side-impact tests</a>. They've been making noises about doing this for quite a while and have mentioned that in real-world data on (some) bad crashes, they've observed intrusion into the cabin that's significantly greater than is seen on their tests. They've mentioned that some vehicles do relatively well on the new tests and some less well but haven't released official scores until now.</p> <p>The results in the new side-impact tests are different from the results described in the post above. So far, only small SUVs have had their results released and only the Mazda CX-5 has a result of &quot;Good&quot;. Of the three manufacturers that did well on the tests described in this post, only Volvo has public results and they scored &quot;Acceptable&quot;. Some questions I have are:</p> <ul> <li>Will Volvo score better for their other vehicles (most of their vehicles are built on a different platform from the vehicle that has public results)?</li> <li>Will Volvo quickly update their vehicles to achieve the highest score on the test? Unlike a lot of other manufacturers, we don't have recent data from Volvo on how they responded to something like this because they didn't need to update their vehicles to achieve the highest score on the last two new tests.</li> <li>Will BMW and Mercedes either score well on the new test or quickly update their vehicles to score well once again?</li> <li>Will other Mazda vehicles also score well without updates?</li> </ul> <h3 id="appendix-miscellania">Appendix: miscellania</h3> <p>A number of name brand car makes weren't included. Some because their sales in the U.S. are low and/or declining rapidly (Mitsubishi, Fiat, Alfa Romeo, etc.), some because there's very high overlap in what vehicles are tested (Kia, Mazda, Audi), and some because there aren't relevant models with driver-side small overlap test scores (Lexus).
When a corporation owns an umbrella of makes, like FCA with Jeep, Dodge, Chrysler, Ram, etc., these weren't pooled since most people who aren't car nerds aren't going to recognize FCA, but may recognize Jeep, Dodge, and Chrysler.</p> <p>If the terms of service of the API allowed you to use IIHS data however you wanted, I would've included smaller makes, but since the API comes with very restrictive terms on how you can display or discuss the data, which aren't compatible with exploratory data analysis, and since I couldn't know how I would want to display or discuss the data before looking at the data, I pulled all of these results by hand (and didn't click through any EULAs, etc.), which was fairly time consuming, so there was a trade-off between more comprehensive coverage and the rest of my life.</p> <h3 id="appendix-what-car-should-i-buy">Appendix: what car should I buy?</h3> <p>That depends on what you're looking for; there's no way to make a blanket recommendation. For practical information about particular vehicles, <a href="https://www.youtube.com/user/TTACVideo">Alex on Autos</a> is the best source that I know of. I don't generally like videos as a source of practical information, but car magazines tend to be much less informative than youtube car reviewers. There are car reviewers that are much more popular, but their popularity appears to come from having witty banter between charismatic co-hosts or other things that not only aren't directly related to providing information, they actually detract from providing information. If you just want to know about how cars work, <a href="https://www.youtube.com/user/EngineeringExplained">Engineering Explained</a> is also quite good, but the information there is generally not practical.</p> <p>For reliability information, Consumer Reports is probably your best bet (you can also look at J.D. Power, but the way they aggregate information makes it much less useful to consumers).</p> <p><small>Thanks to Leah Hanson, Travis Downs, Prabin Paudel, Jeshua Smith, and Justin Blank for comments/corrections/discussion</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:F">this includes the 2004-2012 Volvo S40/V50, 2006-2013 Volvo C70, and 2007-2013 Volvo C30, which were designed during the period when Ford owned Volvo. Although the C1 platform was a joint venture between Ford, Volvo, and Mazda engineers, the work was done under a Ford VP at a Ford facility. <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:A"><p>to be fair, as we saw with the IIHS small overlap tests, not every manufacturer did terribly. In 2017 and 2018, 8 vehicles sold in Africa were crash tested. One got what we would consider a mediocre to bad score in the U.S. or Europe, five got what we would consider to be a bad score, and &quot;only&quot; three got what we would consider to be an atrocious score. The Nissan NP300, Datsun Go, and Chery QQ3 were the three vehicles that scored the worst. Datsun is a sub-brand of Nissan and Chery is a Chinese brand, also known as Qirui.</p> <p>We see the same thing if we look at cars sold in India. Recently, some tests have been run on cars sent to the Indian market and a number of vehicles from Datsun, Renault, Chevrolet, Tata, Honda, Hyundai, Suzuki, Mahindra, and Volkswagen came in with atrocious scores that would be considered impossibly bad in the U.S.
or Europe.</p> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> </ol> </div> Finding the Story voyager-story/ Tue, 02 Jun 2020 00:05:34 -0700 voyager-story/ <p><em>This is an archive of an old pseudonymously written post from the 90s from someone whose former pseudonym seems to have disappeared from the internet.</em></p> <p>I see that <cite>Star Trek: Voyager</cite> has added a new character, a Borg. (From the photos, I also see that they're still breeding women for breast size in the 24th century.) What ticked me off was the producer's comment (I'm paraphrasing), &quot;The addition of Seven of Nine will give us limitless story possibilities.&quot;</p> <p>Uh-huh. Riiiiiight.</p> <p>Look, they did't recognize the stories they <i>had</i>. I watched the first few episodes of <cite>Voyager</cite> and quit when my bullshit meter when off the scale. (Maybe that's not fair, to judge them by only a few episodes. But it's not fair to subject me to crap like the holographic lungs, either.)</p> <p>For those of you who don't watch <cite>Star Trek: Voyager</cite>, the premise is that the <i>Voyager</i>, sort of a space corvette, gets transported umpteen zillions of light years from where it should be. It will take over seventy years at top speed for them to get home to their loved ones. For reasons we needn't go into here, the crew consists of a mix of loyal Federation members and rebels.</p> <p>On paper, this looks good. There's an uneasy alliance in the crew, there's exploration as they try to get home, there's the whole &quot;island in space&quot; routine. And the <cite>Voyager</cite> is nowhere near as big as the Enterprise — it's not mentally healthy for people to stay aboard for that long.</p> <p>But can this idea actually sustain a whole series? Would it be interesting to watch five years of &quot;the crew bickers&quot; or &quot;they find a new clue to faster interstellar travel but it falls through&quot;? I don't think so.</p> <p>(And, in fact, the crew settled down <i>awfully</i> quickly.)</p> <p>The demands of series television subvert the premise. The basic demand of series television is that our regular characters are people we come to know and to care about — we want them to come into our living rooms every week. We must care about their changes, their needs, their desires. We must worry when they're put in jeopardy. But we know it's a series, so it's hard to make us worry. We know that the characters will be back next week.</p> <p>The demands of a <i>story</i> require someone to <i>change</i> of their own accord, to recognize some difference. The need to change can be imposed from without, but the actual change must be self-motivated. (This is the fundamental paradox of series television: the only character allowed to change is a guest, but the instrument of that change has to be a series regular, therefore depriving both characters of the chance to do something interesting.)</p> <p>Series with strict continuity of episodes (episode 2 must follow episode 1) allow change — but they're harder to sell in syndication after the show goes off the air. Economics favour unchanging regular characters.</p> <p>Some series — such as Hill Street Blues — get around the jeopardy problem by actually making characters disposable. Some characters show up for a few episodes and then die, reminding us that it could happen to the regulars, too. Sometimes it does happen to the regulars.</p> <p>(When the characters change in the pilot, there may be a problem. 
A writer who was approached to work on Mary Tyler Moore's last series saw from the premise that it would be brilliant for six episodes and then had noplace to go. The first Fox series starring Tea Leoni, <cite>Flying Blind</cite>, had a very funny pilot and set up an untenable situation.)</p> <p>I'm told the only interesting character on <cite>Voyager</cite> has been the doctor, who can change. He's the only character allowed to grow.</p> <p>The first problem with Voyager, then, is that characters aren't allowed to change — or the change is imposed from outside. (By the way, an imposed change is a great way to <i>start</i> a story. The character then fights it, and that's interesting. It's a terrible way to <i>end</i> a story.)</p> <p>The second problem is that they don't make use of the elements they have. Let's go back to the first season. There was an episode in which there's a traitor on board who is as smart as Janeway herself. (How psychiatric testing missed this, I don't know, but the Trek universe has never had really good luck with psychiatry.) After leading Janeway by the nose for fifty minutes, she figures out who it is, and confronts him. He says yes — <i>and beams off the ship</i>, having conveniently made a deal with the locals.</p> <p>Perfect for series television. We've got a supposedly intelligent villain out there who could come back and Janeway's been given a run for her money — except that I felt cheated. Where's the story? Where's the resolution?</p> <p>Here's what I think they should have done. It's not traditional series television, but I think it would have been better stories.</p> <p>First of all, the episode ends when Janeway confronts the bad guy and arrests him. He's put in the brig — <i>and stays there</i>. The viewer gets some sense of victory here.</p> <p>But now there's someone as smart as Janeway in the brig. Suddenly we've set up <cite>Silence of the Lambs</cite>. (I don't mind stealing if I steal from good sources.) Whenever a problem is <i>big enough</i>, Janeway has this option: she can go to the brig and try and make a deal with the bad guy. &quot;The ship dies, you die.&quot; Not only that, here's someone on board ship with whom she has a unique relationship — one not formally bounded by rank. What does the bad guy really want?</p> <p>And whenever Janeway's feeling low, he can taunt her. &quot;By the way, I thought of a way to get everyone home in one-tenth the time. Have you, Captain?&quot;</p> <p>You wouldn't put him in every episode. But any time you need that extra push, he's there. Remember, we can have him escape any time we want, through the same sleight used in the original episode.</p> <p>Furthermore, it's one thing to catch him; it's another thing to keep him there. You can generate another entire episode out of an escape attempt by the prisoner. But that would be an intermediate thing. Let's talk about the finish I would have liked to have seen.</p> <p>Let's invent a crisis. The balonium generator explodes; we're deep in warp space; our crack engineering crew has jury-rigged a repair to the sensors and found a Class M planet that might do for the repairs. Except it's just too far away. The margin is tight — but can't be done. There are two too many people on board ship. Each requires a certain amount of food, air, water, etc. Under pressure, Neelix admits that his people can go into suspended animation, so he does. The doctor tries heroically but the engineer who was tending the balonium generator dies. (Hmmm. Power's low. 
The doctor can only be revived at certain critical moments.) Looks good — but they were using air until they died; one more crew member <i>must</i> die for the rest to live.</p> <p>And somebody remembers the guy in the brig. &quot;The question of his guilt,&quot; says Tuvok, &quot;is resolved. The authority of the Captain is absolute. You are within your rights to hold a summary court martial and sentence him to death.&quot;</p> <p>And Janeway says no. &quot;The Federation doesn't do that.&quot;</p> <p>Except that everyone will die if she doesn't. The pressure is on Janeway, now. Janeway being Janeway, she's looking for a technological fix. &quot;Find an answer, dammit!&quot; And the deadline is coming up. After a certain point, the prisoner has to die, along with someone else.</p> <p>A crewmember volunteers to die (a regular). Before Janeway can accept, yet another (regular) crewmember volunteers, and Janeway is forced to decide. — And Tuvok points out that while morally it's defensible if that member volunteered to die, the ship cannot continue without either of those crewmembers. It <i>can</i> continue without the prisoner. Clearly the prisoner is not worth as much as those crewmembers, but she is the captain. She <i>must</i> make this decision.</p> <p>Our fearless engineering crew thinks they might have a solution, but it will use nearly everything they've got, and they need another six hours to work on the feasibility. Someone in the crew tries to resolve the problem for her by offing the prisoner — the failure uses up more valuable power. Now the deadline moves up closer, past the six hours deadline. The engineering crew's idea is no longer feasible.</p> <p>For his part, the prisoner is now bargaining. He says he's got ideas to help. Does he? He's tried to destroy the ship before. And he won't reveal them until he gets a full pardon.</p> <p>(This is all basic plotting: keep piling on difficulties. Put a carrot in front of the characters, keep jerking it away.)</p> <p>The tricky part is the ending. It's a requirement that the ending derive logically from what has gone before. If you're going to invoke a technological fix, you have to set the groundwork for it in the first half of the show. Otherwise it's technobabble. It's deus ex machina. (Any time someone says just after the last commercial break, &quot;Of course! If we vorpalize the antibogon flow, we're okay!&quot; I want to smack a writer in the head.)</p> <p>Given the situation set up here, we have three possible endings:</p> <ul> <li>Some member of the crew tries to solve the problem by sacrificing themselves. (Remember, McCoy and Spock did this.) This is a weak solution (unless Janeway does it) because it takes the focus off Janeway's decision.</li> <li>Janeway strikes a deal with the prisoner, and together they come up with a solution (which doesn't involve the antibogon flow). This has the interesting repercussions of granting the prisoner his freedom — while everyone else on ship hates his guts. Grist for another episode, anyway.</li> <li>Janeway kills the prisoner but refuses to hold the court martial. She may luck out — the prisoner might survive; that million-to-one-shot they've been praying for but couldn't rely on comes through — but she has decided to kill the prisoner rather than her crew.</li> </ul> <p>My preferred ending is the third one, even though the prisoner need not die. The decision we've set up is a difficult one, and it is meaningful. It is a command decision. 
Whether she ends up killing the prisoner is not relevant; what is relevant is that she <i>decides</i> to do it.</p> <p>John Gallishaw once categorized all stories as either stories of <i>achievement</i> or of <i>decision</i>. A decision story is much harder to write, because both choices have to matter.</p> A simple way to get more value from tracing tracing-analytics/ Sun, 31 May 2020 00:06:34 -0700 tracing-analytics/ <p>A lot of people seem to think that distributed tracing isn't useful, or at least not without extreme effort that isn't worth it for companies smaller than FB. For example, <a href="https://twitter.com/theatrus/status/1220205677672452097">here are</a> <a href="https://twitter.com/mattklein123/status/1049813546077323264">a couple of</a> public conversations that sound like a number of private conversations I've had. Sure, there's value somewhere, <a href="https://twitter.com/mattklein123/status/1220150248741326855">but it costs too much to unlock</a>.</p> <p>I think this overestimates how much work it is to get a lot of value from tracing. At Twitter, Rebecca Isaacs was able to lay out a vision for how to get value from tracing and executed on it (with help from a number other folks, including Jonathan Simms, Yuri Vishnevsky, Ruben Oanta, Dave Rusek, Hamdi Allam, and many others<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">1</a></sup>) such that the work easily paid for itself. This post is going to describe the tracing &quot;infrastructure&quot; we've built and describe some use cases where we've found it to be valuable. Before we get to that, let's start with some background about the situation before Rebecca's vision came to fruition.</p> <p>At a high level, we could say that we had a trace-view oriented system and ran into all of the issues that one might expect from that. Those issues are discussed in more detail in <a href="https://medium.com/@copyconstruct/distributed-tracing-weve-been-doing-it-wrong-39fc92a857df">this article by Cindy Sridharan</a>. However, I'd like to discuss the particular issues we had in more detail since I think it's useful to look at what specific things were causing problems.</p> <p>Taken together, the issues were problematic enough that tracing was underowned and arguably unowned for years. Some individuals did work in their spare time to keep the lights on or improve things, but the lack of obvious value from tracing led to a vicious cycle where the high barrier to getting value out of tracing made it hard to fund organizationally, which made it hard to make tracing more usable.</p> <p>Some of the issues that made tracing low ROI included:</p> <ul> <li>Schema made it impossible to run simple queries &quot;in place&quot;</li> <li>No real way to aggregate info <ul> <li>No way to find interesting or representative traces</li> </ul></li> <li>Impossible to know actual sampling rate, sampling highly non-representative</li> <li>Time</li> </ul> <h3 id="schema">Schema</h3> <p>The schema was effectively a set of traces, where each trace was a set of spans and each span was a set of annotations. Each span that wasn't a root span had a pointer to its parent, so that the graph structure of a trace could be determined.</p> <p>For the purposes of this post, we can think of each trace as either an external request including all sub-RPCs or a subset of a request, rooted downstream instead of at the top of the request. 
We also trace some things that aren't requests, like builds and git operations, but for simplicity we're going to ignore those for this post even though the techniques we'll discuss also apply to those.</p> <p>Each span corresponds to an RPC and each annotation is data that a developer chose to record on a span (e.g., the size of the RPC payload, queue depth of various queues in the system at the time of the span, or GC pause time for GC pauses that interrupted the RPC).</p> <p>Some issues that came out of having a schema that was a set of sets (of bags) included:</p> <ul> <li>Executing any query that used information about the graph structure inherent in a trace required reading every span in the trace and reconstructing the graph</li> <li>Because there was no index or summary information of per-trace information, any query on a trace required reading every span in a trace</li> <li>Practically speaking, because the two items above are too expensive to do at query time in an ad hoc fashion, the only query people ran was some variant of &quot;give me a few spans matching a simple filter&quot;</li> </ul> <h3 id="aggregation">Aggregation</h3> <p>Until about a year and a half ago, the only supported way to look at traces was to go to the UI, filter by a service name from a combination search box + dropdown, and then look at a list of recent traces, where you could click on any trace to get a &quot;trace view&quot;. Each search returned the N most recent results, which wouldn't necessarily be representative of all recent results (for reasons mentioned below in the Sampling section), let alone representative of all results over any other time span.</p> <p>Per the problems discussed above in the schema section, since it was too expensive to run queries across a non-trivial number of traces, it was impossible to ask questions like &quot;are any of the traces I'm looking at representative of common traces or am I looking at weird edge cases?&quot; or &quot;show me traces of specific tail events, e.g., when a request from service A to service B times out or when write amplification from service A to some backing database is &gt; 3x&quot;, or even &quot;only show me complete traces, i.e., traces where we haven't dropped spans from the trace&quot;.</p> <p>Also, if you clicked on a trace that was &quot;too large&quot;, the query would time out and you wouldn't be able to view the trace -- this was another common side effect of the lack of any kind of rate limiting logic plus the schema.</p> <h3 id="sampling">Sampling</h3> <p>There were multiple places where a decision was made to sample or not. There was no document that listed all of these places, making it impossible to even guess at the sampling rate without auditing all code to figure out where sampling decisions were being made.</p> <p>Moreover, there were multiple places where an unintentional sampling decision would be made due to the implementation. Spans were sent from services that had tracing enabled to a local agent, then to a &quot;collector&quot; service, and then from the collector service to our backing DB. 
Spans could be dropped at any of these points: in the local agent; in the collector, which would have nodes fall over and lose all of their data regularly; and at the backing DB, which would reject writes due to hot keys or high load in general.</p> <p>This design where the trace id is the database key, with no intervening logic to pace out writes, meant that a 1M span trace (which we have) would cause 1M writes to the same key over a period of a few seconds. Another problem would be requests with a fanout of thousands (which exists at every tech company I've worked for), which could cause thousands of writes with the same key over a period of a few milliseconds.</p> <p>Another sampling quirk was that, in order to avoid missing traces that didn't start at our internal front end, there was logic that caused an independent sampling decision in every RPC. If you do the math on this, if you have a service-oriented architecture like ours and you sample at what naively might sound like a moderately low rate, you'll end up with the vast majority of your spans starting at a leaf RPC, resulting in a single span trace. Of the non-leaf RPCs, the vast majority will start at the 2nd level from the leaf, and so on. The vast majority of our load and our storage costs were from these virtually useless traces that started at or near a leaf, and if you wanted to do any kind of analysis across spans to understand the behavior of the entire system, you'd have to account for this sampling bias on top of accounting for all of the other independent sampling decisions.</p> <h3 id="time">Time</h3> <p>There wasn't really any kind of adjustment for clock skew (there was something, but it attempted to do a local pairwise adjustment, which didn't really improve things and actually made it more difficult to reasonably account for clock skew).</p> <p>If you just naively computed how long a span took, even using timestamps from a single host, which removes many sources of possible clock skew, you'd get a lot of negative duration spans, which is of course impossible because a result can't get returned before the request for the result is created. And if you compared times across different hosts, the results were even worse.</p> <h3 id="solutions">Solutions</h3> <p>The solutions to these problems fall into what I think of as two buckets. For problems like dropped spans due to collector nodes falling over or the backing DB dropping requests, there's some straightforward engineering solution using well understood and widely used techniques. For that particular pair of problems, the short term bandaid was to do some GC tuning that reduced the rate of collector nodes falling over by about a factor of 100. That took all of two minutes, and then we replaced the collector nodes with a real queue that could absorb larger bursts in traffic and pace out writes to the DB. 
For the issue where we oversampled leaf-level spans due to rolling the sampling dice on every RPC, that's <a href="algorithms-interviews/">one of these little questions that most people would get right in an interview but that can sometimes get lost as part of a larger system</a>, and it has a number of solutions, e.g., since each span has a parent pointer, we must be able to know whether an RPC has a parent in a relevant place, so we can make a sampling decision and create a trace id iff a span has no parent pointer, which results in a uniform probability of each span being sampled, with each sampled trace being a complete trace.</p> <p>The other bucket is building up datasets and tools (and adding annotations) that allow users to answer questions they might have. This isn't a new idea; <a href="https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36356.pdf">section 5 of the Dapper paper</a> discussed this, and it was published in 2010.</p> <p>Of course, one major difference is that Google has probably put at least two orders of magnitude more effort into building tools on top of Dapper than we've put into building tools on top of our tracing infra, so a lot of our tooling is much rougher, e.g., figure 6 from the Dapper paper shows a trace view that displays a set of relevant histograms, which makes it easy to understand the context of a trace. We haven't done the UI work for that yet, so the analogous view requires running a simple SQL query. While that's not hard, presenting the user with the data would be a better user experience than making the user query for the data.</p> <p>Of the work that's been done, the simplest obviously high ROI thing we've done is build a set of tables that contain information people might want to query, structured such that common queries that don't inherently have to do a lot of work don't have to do a lot of work.</p> <p>We have, partitioned by day, the following tables:</p> <ul> <li>trace_index <ul> <li>high-level trace-level information, e.g., does the trace have a root; what is the root; if relevant, what request endpoint was hit, etc.</li> </ul></li> <li>span_index <ul> <li>information on the client and server</li> </ul></li> <li>anno_index <ul> <li>&quot;standard&quot; annotations that people often want to query, e.g., request and response payload sizes, client/server send/recv timestamps, etc.</li> </ul></li> <li>span_metrics <ul> <li>computed metrics, e.g., span durations</li> </ul></li> <li>flat_annotation <ul> <li>All annotations, in case you want to query something not in anno_index</li> </ul></li> <li>trace_graph <ul> <li>For each trace, contains a graph representation of the trace, for use with queries that need the graph structure</li> </ul></li> </ul> <p>Just having this set of tables, queryable with SQL queries (or a Scalding or Spark job in cases where Presto SQL isn't ideal, like when doing some graph queries), is enough for tracing to pay for itself, to go from being difficult to justify to being something that's obviously high value.</p> <p>Some of the questions we've been able to answer with this set of tables include:</p> <ul> <li>For this service that's having problems, give me a representative set of traces</li> <li>For this service that has elevated load, show me which upstream service is causing the load</li> <li>Give me the list of all services that have unusual write amplification to downstream service X <ul> <li>Is traffic from a particular service or for a particular endpoint causing unusual write amplification? 
For example, in some cases, we see nothing unusual about the total write amplification from B -&gt; C, but we see very high amplification from B -&gt; C when B is called by A.</li> </ul></li> <li>Show me how much time we spend on <a href="https://en.wikipedia.org/wiki/Serialization">serdes</a> vs. &quot;actual work&quot; for various requests</li> <li>Show me how much different kinds of requests cost in terms of backend work</li> <li>For requests that have high latency, as determined by mobile client instrumentation, show me what happened on the backend</li> <li>Show me the set of latency critical paths for this request endpoint (with the annotations we currently have, this has a number of issues that probably deserve their own post)</li> <li>Show me the CDF of services that this service depends on <ul> <li>This is a distribution because whether or not a particular service calls another service is data dependent; it's not uncommon to have a service that will only call another one every 1000 calls (on average)</li> </ul></li> </ul> <p>We have built and are building other tooling, but just being able to run queries and aggregations against trace data, both recent and historical, easily pays for all of the other work we'd like to do. <a href="metrics-analytics/">This is analogous to what we saw when we looked at metrics data: taking data we already had and exposing it in a way that lets people run arbitrary queries immediately paid dividends</a>. Doing that for tracing is less straightforward than doing that for metrics because the data is richer, but it's not a fundamentally different idea.</p> <p>I think that having something to look at other than the raw data is also more important for tracing than it is for metrics since the metrics equivalent of a raw &quot;trace view&quot; of traces, a &quot;dashboard view&quot; of metrics where you just look at graphs, is obviously and intuitively useful. If that's all you have for metrics, people aren't going to say that it's not worth funding your metrics infra because dashboards are really useful! However, it's a lot harder to see how to get value out of a raw view of traces, which is where a lot of the comments about tracing not being valuable come from. This difference between the complexity of metrics data and tracing data makes the value add for higher-level views of tracing larger than it is for metrics.</p> <p>Having our data in a format that's not just blobs in a NoSQL DB has also allowed us to more easily build tooling on top of trace data that lets users who don't want to run SQL queries get value out of our trace data. An example of this is the Service Dependency Explorer (SDE), which was primarily built by Yuri Vishnevsky, Rebecca Isaacs, and Jonathan Simms, with help from Yihong Chen. If we try to look at the RPC call graph for a single request, we get something that's pretty large. In some cases, the call tree can be hundreds of levels deep and it's also not uncommon to see a fanout of 20 or more at some levels, which makes a naive visualization difficult to interpret.</p> <p>In order to see how SDE works, let's look at a smaller example where it's relatively easy to understand what's going on. 
Imagine we have 8 services, <code>A</code> through <code>H</code>, and they call each other as shown in the tree below, where we have service <code>A</code> called 10 times, which calls service <code>B</code> a total of 10 times, which calls <code>D</code>, <code>D</code>, and <code>E</code> 50, 20, and 10 times respectively, where the two <code>D</code>s are distinguished by being different RPC endpoints (calls) even though they're the same service, and so on, shown below:</p> <p><img src="images/tracing-analytics/rpc-tree.png" alt="Diagram of RPC call graph; this will be implicitly described in the relevant sections, although the entire SDE section is about showing off a visual tool and will probably be unsatisfying if you're just reading the alt text; the tables described in the previous section are more likely to be what you want if you want a non-visual interpretation of the data, since the SDE is a kind of visualization" width="610" height="488"></p> <p>If we look at SDE from the standpoint of node E, we'll see the following: <img src="images/tracing-analytics/e.png" alt="SDE centered on service E, showing callers and callees, direct and indirect" width="1266" height="576"></p> <p>We can see the direct callers and callees: 100% of calls of E are from C, 100% of calls of E also call C, and we have 20x load amplification when calling C (200/10 = 20), the same as we see if we look at the RPC tree above. If we look at indirect callees, we can see that D has a 4x load amplification (40 / 10 = 4).</p> <p>If we want to see what's directly called by C downstream of E, we can select it and we'll get arrows to the direct descendants of C, which in this case is every indirect callee of E.</p> <p><img src="images/tracing-analytics/e.png" alt="SDE centered on service E, with callee C highlighted" width="1263" height="239"></p> <p>For a more complicated example, we can look at service D, which shows up in orange in our original tree, above.</p> <p>In this case, our summary box reads:</p> <ul> <li>On May 28, 2020 there were... <ul> <li>10 total <a href="https://courses.cs.washington.edu/courses/cse551/15sp/notes/talk-tsa.pdf">TFE</a>-rooted traces</li> <li>110 total traced RPCs to D</li> <li>2.1 thousand total traced RPCs caused by D</li> <li>3 unique call paths from TFE endpoints to D endpoints</li> </ul></li> </ul> <p>The fact that we see D three times in the tree is indicated in the summary box, where it says we have 3 unique call paths from our front end, <a href="https://courses.cs.washington.edu/courses/cse551/15sp/notes/talk-tsa.pdf">TFE</a>, to D.</p> <p>We can expand out the calls to D and, in this case, see both of the calls and what fraction of traffic is to each call.</p> <p><img src="images/tracing-analytics/d.png" alt="SDE centered on service D, with different calls to D expanded by having clicked on D" width="1262" height="364"></p> <p>If we click on one of the calls, we can see which nodes are upstream and downstream dependencies of a particular call; <code>call4</code> is shown below, and we can see that it never hits services <code>C</code>, <code>H</code>, and <code>G</code> downstream even though service <code>D</code> does for <code>call3</code>. 
Similarly, we can see that its upstream dependencies consist of being called directly by C, and indirectly by B and E but not A and C:</p> <p><img src="images/tracing-analytics/d-4.png" alt="SDE centered on service D, with call4 of D highlighted by clicking on call 4; shows only upstream and downstream load that are relevant to call4" width="1263" height="343"></p> <p>Some things we can easily see from SDE are:</p> <ul> <li>What load a service or RPC call causes <ul> <li>Where we have unusual load amplification, whether that's generally true for a service or if it only occurs on some call paths</li> </ul></li> <li>What causes load to a service or RPC call</li> <li>Where and why we get cycles (very common for <a href="https://www.thestrangeloop.com/2018/leverage-vs-autonomy-in-a-large-software-system.html">Strato</a>, among other things)</li> <li>What's causing weird super deep traces</li> </ul> <p>These are all things a user could get out of queries to the data we store, but having a tool with a UI that lets you click around in real time to explore things lowers the barrier to finding these things out.</p> <p>In the example shown above, there are a small number of services, so you could get similar information out of the more commonly used sea of nodes view, where each node is a service, with some annotations on the visualization, but when we've looked at real traces, a global view showing thousands of services makes it very difficult to see what's going on. Some of Rebecca's early analyses used a view like that, but we've found that you need to have a lot of implicit knowledge to make good use of a view like that; a view that discards a lot more information and highlights a few things makes it easier for users who don't happen to have the right implicit knowledge to get value out of looking at traces.</p> <p>Although we've demo'd a view of RPC count / load here, we could also display other things, like latency, errors, payload sizes, etc.</p> <h3 id="conclusion">Conclusion</h3> <p>More generally, this is just a brief description of a few of the things we've built on top of the data you get if you have basic distributed tracing set up. You probably don't want to do exactly what we've done since you probably have somewhat different problems and you're very unlikely to encounter the exact set of problems that our tracing infra had. From backchannel chatter with folks at other companies, I don't think the level of problems we had was unique; if anything, our tracing infra was in a better state than at many or most peer companies (which excludes behemoths like FB/Google/Amazon) since it basically worked and people could and did use the trace view we had to debug real production issues. But, as they say, unhappy systems are unhappy in their own way.</p> <p>Like our previous look at <a href="metrics-analytics/">metrics analytics</a>, this work was done incrementally. 
Since trace data is much richer than metrics data, a lot more time was spent doing ad hoc analyses of the data before writing the Scalding (MapReduce) jobs that produce the tables mentioned in this post, but the individual analyses were valuable enough that there wasn't really a time when this set of projects didn't pay for itself after the first few weeks it took to clean up some of the worst data quality issues and run an (extremely painful) ad hoc analysis with the existing infra.</p> <p>Looking back at discussions on whether or not it makes sense to work on tracing infra, people often point to the numerous failures at various companies to justify a buy (instead of build) decision. I don't think that's exactly unreasonable; the base rate of failure of similar projects shouldn't be ignored. But, on the other hand, most of the work described wasn't super tricky, beyond getting organizational buy-in and having a clear picture of the value that tracing can bring.</p> <p>One thing that's a bit beyond the scope of this post that probably deserves its own post is that tracing and metrics, while not fully orthogonal, are complementary and having only one or the other leaves you blind to a lot of problems. You're going to pay a high cost for that in a variety of ways: unnecessary incidents, extra time spent debugging incidents, generally higher monetary costs due to running infra inefficiently, etc. Also, while having metrics and tracing individually gives you much better visibility than having either alone, some problems require looking at both together; some of the most interesting analyses I've done involve joining (often with a literal SQL join) trace data and <a href="metrics-analytics/">metrics data</a>.</p> <p>To make it concrete, an example of something that's easy to see with tracing but annoying to see with logging unless you add logging to try to find this in particular (which you can do for any individual case, but probably don't want to do for the thousands of things tracing makes visible), is something we looked at above: &quot;show me cases where a specific call path from the load balancer to <code>A</code> causes high load amplification on some service <code>B</code>, which may be multiple hops away from <code>A</code> in the call graph&quot;. In some cases, this will be apparent because <code>A</code> generally causes high load amplification on <code>B</code>, but if it only happens in some cases, that's still easy to handle with tracing but it's very annoying if you're just looking at metrics.</p> <p>An example of something where you want to join tracing and metrics data is when looking at the performance impact of something like a bad host on latency. You will, in general, not be able to annotate the appropriate spans that pass through the host as bad because, if you knew the host was bad at the time of the span, the host wouldn't be in production. But you can sometimes find, with historical data, a set of hosts that are bad, and then look up latency critical paths that pass through the host to determine the end-to-end impact of the bad host (a rough sketch of what this kind of join can look like is shown below).</p> <p>Everyone has their own biases; with respect to tracing, mine come from generally working on things that try to directly improve cost, reliability, and latency, so the examples are focused on that, but there are also a lot of other uses for tracing. 
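To sketch roughly what the bad-host join above could look like once trace data is sitting in tables like the span_index and span_metrics tables described earlier, here's a hypothetical query. The column names (serverHost, durationMs, ds, and so on) are made-up stand-ins rather than our actual schema, the bad_hosts list is a placeholder for whatever host-level health data you'd join against, and a real analysis would go on to look at which of these spans sit on latency critical paths rather than stopping at per-span durations:</p> <pre><code>-- hypothetical sketch: latency of spans that landed on hosts we later
-- determined were bad vs. spans that didn't
with bad_hosts as (
  -- placeholder; in practice this would come from host-level health metrics,
  -- e.g., hosts that were thermal throttling during the time period
  select 'host1234' as hostId
), span_latency as (
  select
    m.durationMs,
    case when b.hostId is not null then 'bad_host' else 'ok_host' end as host_class
  from span_index s
  join span_metrics m
    on s.traceId = m.traceId and s.spanId = m.spanId
  left join bad_hosts b
    on s.serverHost = b.hostId
  where s.ds &gt;= '2020-05-01' and s.ds &lt;= '2020-05-28'
)
select
  host_class,
  approx_percentile(durationMs, 0.5) as p50_ms,
  approx_percentile(durationMs, 0.99) as p99_ms,
  count(*) as cnt
from span_latency
group by host_class
</code></pre> <p>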
You can check out Distributed Tracing in Practice or Mastering Distributed Tracing for some other perspectives.</p> <h4 id="acknowledgements">Acknowledgements</h4> <p><small>Thanks to Rebecca Isaacs, Leah Hanson, Yao Yue, and Yuri Vishnevsky for comments/corrections/discussion.</small></p> <p><link rel="prefetch" href=""> <link rel="prefetch" href="metrics-analytics/"> <link rel="prefetch" href="algorithms-interviews/"> <link rel="prefetch" href="about/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:O"><p>this will almost certainly be an incomplete list, but some other people who've pitched in include Moses, Tiina, Rich, Rahul, Ben, Mike, Mary, Arash, Feng, Jenny, Andy, Yao, Yihong, Vinu, and myself.</p> <p>Note that this relatively long list of contributors doesn't contradict this work being high ROI. I'd estimate that there's been less than 2 person-years worth of work on everything discussed in this post. Just for example, while I spend a fair amount of time doing analyses that use the tracing infra, I think I've only spent on the order of one week on the infra itself.</p> <p>In case it's not obvious from the above, even though I'm writing this up, I was a pretty minor contributor to this. I'm just writing it up because I sat next to Rebecca as this work was being done and was super impressed by both her process and the outcome.</p> <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> </ol> </div> A simple way to get more value from metrics metrics-analytics/ Sat, 30 May 2020 00:06:34 -0700 metrics-analytics/ <p>We spent one day<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">1</a></sup> building a system that immediately found a mid 7 figure optimization (which ended up shipping). In the first year, we shipped mid 8 figures per year worth of cost savings as a result. The key feature this system introduces is the ability to query metrics data across all hosts and all services and over any period of time (since inception), so we've called it LongTermMetrics (LTM) internally since I like boring, descriptive, names.</p> <p>This got started when I was looking for a starter project that would both help me understand the Twitter infra stack and also have some easily quantifiable value. Andy Wilcox suggested looking at <a href="https://stackoverflow.com/a/37102630/334816">JVM survivor space</a> utilization for some large services. If you're not familiar with what survivor space is, you can think of it as a configurable, fixed-size buffer, in the JVM (at least if you use the GC algorithm that's default at Twitter). At the time, if you looked at a random large services, you'd usually find that either:</p> <ol> <li>The buffer was too small, resulting in poor performance, sometimes catastrophically poor when under high load.</li> <li>The buffer was too large, resulting in wasted memory, i.e., wasted money.</li> </ol> <p>But instead of looking at random services, there's no fundamental reason that we shouldn't be able to query all services and get a list of which services have room for improvement in their configuration, sorted by performance degradation or cost savings. And if we write that query for JVM survivor space, this also goes for other configuration parameters (e.g., other JVM parameters, CPU quota, memory quota, etc.). Writing a query that worked for all the services turned out to be a little more difficult than I was hoping due to a combination of data consistency and performance issues. 
Data consistency issues included things like:</p> <ul> <li>Any given metric can have ~100 names, e.g., I found 94 different names for JVM survivor space <ul> <li>I suspect there are more, these were just the ones I could find via a simple search</li> </ul></li> <li>The same metric name might have a different meaning for different services <ul> <li>Could be a counter or a gauge</li> <li>Could have different units, e.g., bytes vs. MB or microseconds vs. milliseconds</li> </ul></li> <li>Metrics are sometimes tagged with an incorrect service name</li> <li>Zombie shards can continue to operate and report metrics even though the cluster manager has started up a new instance of the shard, resulting in duplicate and inconsistent metrics for a particular shard name</li> </ul> <p>Our metrics database, <a href="https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/metricsdb.html" rel="noreferrer">MetricsDB</a>, was specialized to handle monitoring, dashboards, alerts, etc. and didn't support general queries. That's totally reasonable, since monitoring and dashboards are lower on Maslow's hierarchy of observability needs than general metrics analytics. In backchannel discussions from folks at other companies, the entire set of systems around MetricsDB seems to have solved a lot of the problems that plauge people at other companies with similar scale, but the specialization meant that we couldn't run arbitrary SQL queries against metrics in MetricsDB.</p> <p>Another way to query the data is to use the copy that gets written to <a href="https://en.wikipedia.org/wiki/Apache_Hadoop#HDFS">HDFS</a> in <a href="https://en.wikipedia.org/wiki/Apache_Parquet">Parquet</a> format, which allows people to run arbitrary SQL queries (as well as write <a href="https://en.wikipedia.org/wiki/Cascading_(software)#cite_ref-27">Scalding</a> (MapReduce) jobs that consume the data).</p> <p>Unfortunately, due to the number of metric names, the data on HDFS can't be stored in a columnar format with one column per name -- <a href="https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)">Presto</a> gets unhappy if you feed it too many columns and we have enough different metrics that we're well beyond that limit. If you don't use a columnar format (and don't apply any other tricks), you end up reading a lot of data for any non-trivial query. The result was that you couldn't run any non-trivial query (or even many trivial queries) across all services or all hosts without having it time out. We don't have similar timeouts for Scalding, but Scalding performance is much worse and a simple Scalding query against a day's worth of metrics will usually take between three and twenty hours, depending on cluster load, making it unreasonable to use Scalding for any kind of exploratory data analysis.</p> <p>Given the data infrastructure that already existed, an easy way to solve both of these problems was to write a Scalding job to store the 0.1% to 0.01% of metrics data that we care about for performance or capacity related queries and re-write it into a columnar format. I would guess that at least 90% of metrics are things that almost no one will want to look at in almost any circumstance, and of the metrics anyone really cares about, the vast majority aren't performance related. A happy side effect of this is that since such a small fraction of the data is relevant, it's cheap to store it indefinitely. 
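To give a flavor of the kind of query this enables, here's a hypothetical sketch of a long-range, capacity-planning style question; the ltm_service table and the ds partitioning convention show up in the real queries later in this post, while cpuUsage is a made-up column name used only for illustration:</p> <pre><code>-- hypothetical sketch: month-by-month p99 CPU utilization per service,
-- the kind of long-range query that's only cheap because the filtered
-- subset of metrics is kept indefinitely in a columnar format
select
  substr(ds, 1, 7) as month,
  servicename,
  approx_percentile(cpuUsage, 0.99) as p99_cpu
from ltm_service
where ds &gt;= '2019-01-01' and ds &lt;= '2019-12-31'
group by substr(ds, 1, 7), servicename
</code></pre> <p>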
The standard metrics data dump is deleted after a few weeks because it's large enough that it would be prohibitively expensive to store it indefinitely; a longer metrics memory will be useful for capacity planning or other analyses that prefer to have historical data.</p> <p>The data we're saving includes (but isn't limited to) the following things for each shard of each service:</p> <ul> <li>utilizations and sizes of various buffers</li> <li>CPU, memory, and other utilization</li> <li>number of threads, context switches, core migrations</li> <li>various queue depths and network stats</li> <li>JVM version, feature flags, etc.</li> <li><a href="https://en.wikipedia.org/wiki/Garbage_collection_(computer_science)">GC</a> stats</li> <li><a href="https://github.com/twitter/finagle">Finagle</a> metrics</li> </ul> <p>And for each host:</p> <ul> <li>various things from <a href="https://en.wikipedia.org/wiki/Procfs">procfs</a>, like <code>iowait</code> time, <code>idle</code>, etc.</li> <li>what cluster the machine is a part of</li> <li>host-level info like NIC speed, number of cores on the host, memory,</li> <li>host-level stats for &quot;health&quot; issues like thermal throttling, machine checks, etc.</li> <li>OS version, host-level software versions, host-level feature flags, etc.</li> <li><a href="https://github.com/twitter/rezolus">Rezolus</a> metrics</li> </ul> <p>For things that we know change very infrequently (like host NIC speed), we store these daily, but most of these are stored at the same frequency and granularity that our other metrics is stored for. In some cases, this is obviously wasteful (e.g., for JVM tenuring threshold, which is typically identical across every shard of a service and rarely changes), but this was the easiest way to handle this given the infra we have around metrics.</p> <p>Although the impetus for this project was figuring out which services were under or over configured for JVM survivor space, it started with GC and container metrics since those were very obvious things to look at and we've been incrementally adding other metrics since then. To get an idea of the kinds of things we can query for and how simple queries are if you know a bit of SQL, here are some examples:</p> <h4 id="very-high-p90-jvm-survivor-space">Very High p90 JVM Survivor Space</h4> <p>This is part of the original goal of finding under/over-provisioned services. Any service with a very high p90 JVM survivor space utilization is probably under-provisioned on survivor space. 
Similarly, anything with a very low p99 or p999 JVM survivor space utilization when under peak load is probably overprovisioned (query not displayed here, but we can scope the query to times of high load).</p> <p>A Presto query for very high p90 survivor space across all services is:</p> <pre><code>with results as (
  select
    servicename,
    approx_distinct(source, 0.1) as approx_sources, -- number of shards for the service
    -- real query uses [coalesce and nullif](https://prestodb.io/docs/current/functions/conditional.html) to handle edge cases, omitted for brevity
    approx_percentile(jvmSurvivorUsed / jvmSurvivorMax, 0.90) as p90_used,
    approx_percentile(jvmSurvivorUsed / jvmSurvivorMax, 0.50) as p50_used
  from ltm_service
  where ds &gt;= '2020-02-01'
    and ds &lt;= '2020-02-28'
  group by servicename)
select * from results
where approx_sources &gt; 100
order by p90_used desc
</code></pre> <p>Rather than having to look through a bunch of dashboards, we can just get a list and then send diffs with config changes to the appropriate teams or write a script that takes the output of the query and automatically writes the diff. The above query provides a pattern for any basic utilization numbers or rates; you could look at memory usage, new or old gen GC frequency, etc., with similar queries. In one case, we found a service that was wasting enough RAM to pay my salary for a decade.</p> <p>I've been moving away from using thresholds against simple percentiles to find issues, but I'm presenting this query because this is a thing people commonly want to do that's useful and I can write this without having to spend a lot of space explaining why it's a reasonable thing to do; what I prefer to do instead is out of scope of this post and probably deserves its own post.</p> <h4 id="network-utilization">Network utilization</h4> <p>The above query was over all services, but we can also query across hosts. In addition, we can do queries that join against properties of the host, feature flags, etc.</p> <p>Using one set of queries, we were able to determine that we had a significant number of services running up against network limits even though host-level network utilization was low. 
The compute platform team then did a gradual rollout of a change to network caps, which we monitored with queries like the one below to determine that we weren't seeing any performance degradation (theoretically possible if increasing network caps caused hosts or switches to hit network limits).</p> <p>With the network change, we were able to observe smaller queue depths, smaller queue sizes (in bytes), fewer packet drops, etc.</p> <p>The query below only shows queue depths for brevity; adding all of the quantities mentioned is just a matter of typing more names in.</p> <p>The general thing we can do is, for any particular rollout of a platform or service-level feature, see the impact on real services.</p> <pre><code>with rolled as (
  select
    -- rollout was fixed for all hosts during the time period,
    -- can pick an arbitrary element from the time period
    arbitrary(element_at(misc, 'egress_rate_limit_increase')) as rollout,
    hostId
  from ltm_deploys
  where ds = '2019-10-10'
    and zone = 'foo'
  group by hostId
), host_info as (
  select
    arbitrary(nicSpeed) as nicSpeed,
    hostId
  from ltm_host
  where ds = '2019-10-10'
    and zone = 'foo'
  group by hostId
), host_rolled as (
  select rollout, nicSpeed, rolled.hostId
  from rolled
  join host_info on rolled.hostId = host_info.hostId
), container_metrics as (
  select service, netTxQlen, hostId
  from ltm_container
  where ds &gt;= '2019-10-10'
    and ds &lt;= '2019-10-14'
    and zone = 'foo'
)
select
  service,
  nicSpeed,
  approx_percentile(netTxQlen, 1, 0.999, 0.0001) as p999_qlen,
  approx_percentile(netTxQlen, 1, 0.99, 0.001) as p99_qlen,
  approx_percentile(netTxQlen, 0.9) as p90_qlen,
  approx_percentile(netTxQlen, 0.68) as p68_qlen,
  rollout,
  count(*) as cnt
from container_metrics
join host_rolled on host_rolled.hostId = container_metrics.hostId
group by service, nicSpeed, rollout
</code></pre> <h4 id="other-questions-that-became-easy-to-answer">Other questions that became easy to answer</h4> <ul> <li>What's the latency, CPU usage, CPI, or other performance impact of X? <ul> <li>Increasing or decreasing the number of performance counters we monitor per container</li> <li>Tweaking kernel parameters</li> <li>OS or other releases</li> <li>Increasing or decreasing host-level oversubscription</li> <li>General host-level load</li> <li>Retry budget exhaustion</li> </ul></li> <li>For relevant items above, what's the distribution of X, in general or under certain circumstances?</li> <li>What hosts have unusually poor service-level performance for every service on the host, after controlling for load, etc.? <ul> <li>This has usually turned out to be due to a hardware misconfiguration or fault</li> </ul></li> <li>Which services don't play nicely with other services aside from the general impact on host-level load?</li> <li>What's the latency impact of failover, or other high-load events? 
<ul> <li>What level of load should we expect in the future given a future high-load event plus current growth?</li> <li>Which services see more load during failover, which services see unchanged load, and which fall somewhere in between?</li> </ul></li> <li>What config changes can we make for any fixed-size buffer or allocation that will improve performance without increasing cost or reduce cost without degrading performance?</li> <li>For some particular host-level health problem, what's the probability it recurs if we see it N times?</li> <li>etc., there are a lot of questions that become easy to answer if you can write arbitrary queries against historical metrics data</li> </ul> <h4 id="design-decisions">Design decisions</h4> <p>LTM is about as boring a system as is possible. Every design decision falls out of taking the path of least resistance.</p> <ul> <li>Why use Scalding? <ul> <li>It's standard at Twitter and the integration made everything trivial. I tried Spark, which has some advantages. However, at the time, I would have had to do manual integration work that I got for free with Scalding.</li> </ul></li> <li>Why use Presto and not something that allows for live slice &amp; dice queries like Druid? <ul> <li><a href="tracing-analytics/">Rebecca Isaacs and Jonathan Simms were doing related work on tracing</a> and we knew that we'd want to do joins between LTM and whatever they created. That's trivial with Presto but would have required more planning and work with something like Druid, at least at the time.</li> <li>George Sirois imported a subset of the data into Druid so we could play with it and the facilities it offers are very nice; it's probably worth re-visiting at some point</li> </ul></li> <li>Why not use Postgres or something similar? <ul> <li>The amount of data we want to store makes this infeasible without a massive amount of effort; even though the cost of data storage is quite low, it's still a &quot;big data&quot; problem</li> </ul></li> <li>Why Parquet instead of a more efficient format? <ul> <li>It was the most suitable of the standard supported formats (the other major supported format is raw thrift); introducing a new format would be a much larger project than this project</li> </ul></li> <li>Why is the system not real-time (with delays of at least one hour)? <ul> <li>Twitter's batch job pipeline is easy to build on; all that was necessary was to read some tutorial on how it works and then write something similar, but with different business logic.</li> <li>There was a nicely written proposal to build a real-time analytics pipeline for metrics data written a couple years before I joined Twitter, but that never got built because (I estimate) it would have been one to four quarters of work to produce an MVP and it wasn't clear what team had the right mandate to work on that and also had 4 quarters of headcount available. 
But adding a batch job took one day; you don't need to have roadmap and planning meetings for a day of work, you can just do it and then do follow-on work incrementally.</li> <li>If we're looking for misconfigurations or optimization opportunities, these rarely go away within an hour (and if they did, they must've had small total impact) and, in fact, they often persist for months to years, so we don't lose much by giving up on real-time (we do lose the ability to use the output of this for some monitoring use cases)</li> <li>The real-time version would've been a system with significant operational cost that can't be operated by one person without undue burden. This system has more operational/maintenance burden than I'd like, probably 1-2 days of my time per month on average, which at this point makes that a pretty large fraction of the total cost of the system, but it never pages, and the amount of work can easily be handled by one person.</li> </ul></li> </ul> <h4 id="boring-technology">Boring technology</h4> <p>I think writing about systems like this, which are just boring work, is really underrated. A disproportionate number of posts and talks I read are about systems using hot technologies. I don't have anything against hot new technologies, but a lot of useful work comes from plugging boring technologies together and doing the obvious thing. Since posts and talks about boring work are relatively rare, I think writing up something like this is more useful than it has any right to be.</p> <p>For example, a couple years ago, at a local meetup that Matt Singer organizes for companies in our size class to discuss infrastructure (basically, companies that are smaller than FB/Amazon/Google), I asked if anyone was doing something similar to what we'd just done. No one who was there was (or no one who'd admit to it, anyway), and engineers from two different companies expressed shock that we could store so much data, and not just the average per time period, but some histogram information as well. This work is too straightforward and obvious to be novel; I'm sure people have built analogous systems in many places. It's literally just storing metrics data on HDFS (or, if you prefer a more general term, a data lake) indefinitely in a format that allows interactive queries.</p> <p>If you do the math on the cost of metrics data storage for a project like this in a company in our size class, the storage cost is basically a rounding error. We've shipped individual diffs that easily pay for the storage cost for decades. I don't think there's any reason storing a few years or even a decade worth of metrics should be shocking when people deploy analytics and observability tools that cost much more all the time. But it turns out this was surprising, in part because people don't write up work this boring.</p> <p>An unrelated example is that, a while back, I ran into someone at a similarly sized company who wanted to get similar insights out of their metrics data. Instead of starting with something that would take a day, like this project, they started with deep learning. While I think there's value in applying ML and/or stats to infra metrics, they turned a project that could return significant value to the company after a couple of person-days into a project that took person-years. And if you're only going to <em>either</em> apply simple heuristics guided by someone with infra experience and simple statistical models <em>or</em> naively apply deep learning, I think the former has much higher ROI. 
Applying both sophisticated stats/ML <em>and</em> practitioner guided heuristics together can get you better results than either alone, but I think it makes a lot more sense to start with the simple project that takes a day to build out and maybe another day or two to start to apply than to start with a project that takes months or years to build out and start to apply. But there are a lot of biases towards doing the larger project: it makes a better resume item (deep learning!), in many places, it makes a better promo case, and people are more likely to give a talk or write up a blog post on the cool system that uses deep learning.</p> <p>The above discusses why writing up work is valuable for the industry in general. <a href="corp-eng-blogs/">We covered why writing up work is valuable to the company doing the write-up in a previous post</a>, so I'm not going to re-hash that here.</p> <h4 id="appendix-stuff-i-screwed-up">Appendix: stuff I screwed up</h4> <p><a href="https://twitter.com/danluu/status/1220228489522974721">I think it's unfortunate that you don't get to hear about the downsides of systems without backchannel chatter</a>, so here are things I did that are pretty obvious mistakes in retrospect. I'll add to this when something else becomes obvious in retrospect.</p> <ul> <li>Not using a double for almost everything <ul> <li>In an ideal world, some things aren't doubles, but everything in our metrics stack goes through a stage where basically every metric is converted to a double</li> <li>I stored most things that &quot;should&quot; be an integral type as an integral type, but doing the conversion from <code>long -&gt; double -&gt; long</code> is never going to be more precise than just doing the <code>long -&gt; double</code> conversion and it opens the door to other problems (there's a short illustration of the precision loss just after this list)</li> <li>I stored some things that shouldn't be an integral type as an integral type, which causes small values to unnecessarily lose precision <ul> <li>Luckily this hasn't caused serious errors for any actionable analysis I've done, but there are analyses where it could cause problems</li> </ul></li> </ul></li> <li>Using asserts instead of writing bad entries out to some kind of &quot;bad entries&quot; table <ul> <li>For reasons that are out of scope of this post, there isn't really a reasonable way to log errors or warnings in Scalding jobs, so I used asserts to catch things that shouldn't happen, which causes the entire job to die every time something unexpected happens; a better solution would be to write bad input entries out into a table and then have that table emailed out as a soft alert if the table isn't empty <ul> <li>An example of a case where this would've saved some operational overhead is where we had an unusual amount of clock skew (3600 years), which caused a timestamp overflow. If I had a table that was a log of bad entries, the bad entry would've been omitted from the output, which is the correct behavior, and it would've saved an interruption plus having to push a fix and re-deploy the job.</li> </ul></li> </ul></li> <li>Longterm vs. LongTerm in the code <ul> <li>I wasn't sure which way this should be capitalized when I was first writing this and, when I made a decision, I failed to grep for and squash everything that was written the wrong way, so now this pointless inconsistency exists in various places</li> </ul></li> </ul> <p>These are the kinds of things you expect when you crank out something quickly and don't think it through enough.
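To make the first item concrete, here's a minimal illustration of why the <code>long -&gt; double</code> round trip loses information once values exceed 2^53 (the illustration uses awk, which stores numbers as doubles, purely because it's easy to run; the actual jobs are Scalding code):</p> <pre><code># 2^53 + 1 = 9007199254740993 is not representable as an IEEE 754 double,
# so it silently rounds to the nearest representable integer.
awk 'BEGIN { printf &quot;%.0f\n&quot;, 9007199254740993 }'   # prints 9007199254740992
</code></pre> <p>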
The last item is trivial to fix and not much of a problem since the ubiquitous use of IDEs at Twitter means that basically anyone who would be impacted will have their IDE supply the correct capitalization for them.</p> <p>The first item is more problematic, both in that it could actually cause incorrect analyses and in that fixing it will require doing a migration of all the data we have. My guess is that, at this point, this will be half a week to a week of work, which I could've easily avoided by spending thirty more seconds thinking through what I was doing.</p> <p>The second item is somewhere in between. Between the first and second items, I think I've probably signed up for roughly double the amount of direct work on this system (so, not including time spent on data analysis on data in the system, just the time spent to build the system) for essentially no benefit.</p> <p><small>Thanks to Leah Hanson, Andy Wilcox, Lifan Zeng, and Matej Stuchlik for comments/corrections/discussion</small></p> <p><link rel="prefetch" href=""> <link rel="prefetch" href="corp-eng-blogs/"> <link rel="prefetch" href="tracing-analytics/"> <link rel="prefetch" href="about/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:D"><p>The actual work involved was about a day's work, but it was done over a week since I had to learn Scala as well as Scalding and the general Twitter stack, the metrics stack, etc.</p> <p>One day is also just an estimate for the work for the initial data sets. Since then, I've done probably a couple more weeks of work and Wesley Aptekar-Cassels and Kunal Trivedi have probably put in another week or two of time. The operational cost is probably something like 1-2 days of my time per month (on average), bringing the total cost to on the order of a month or two.</p> <p>I'm also not counting time spent using the dataset, or time spent debugging issues, which will include a lot of time that I can only roughly guess at, e.g., when the compute platform team changed the network egress limits as a result of some data analysis that took about an hour, which exposed a latent mesos bug that probably cost a day of Ilya Pronin's time; David Mackey has spent a fair amount of time tracking down weird issues where the data shows something odd is going on, but we don't know what it is; etc. If you wanted to fully account for time spent on work that came out of some data analysis on the data sets discussed in the post, I suspect, between service-level teams, plus platform-level teams like our JVM, OS, and HW teams, we're probably at roughly 1 person-year of time.</p> <p>But, because the initial work it took to create a working and useful system was a day plus time spent working on orientation material and the system returned seven figures, it's been very easy to justify all of this additional time spent, which probably wouldn't have been the case if a year of up-front work was required.
Most of the rest of the time isn't the kind of thing that's usually &quot;charged&quot; in roadmap reviews when creating a system (time spent by users, operational overhead), but perhaps the ongoing operational cost should be &quot;charged&quot; when creating the system (I don't think it makes sense to &quot;charge&quot; time spent by users to the system since, the more useful a system is, the more time users will spend using it; that doesn't really seem like a cost).</p> <p>There's also been work to build tools on top of this: Kunal Trivedi has spent a fair amount of time building a layer on top of this to make the presentation more user friendly than SQL queries, which could arguably be charged to this project.</p> <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> </ol> </div> How (some) good corporate engineering blogs are written corp-eng-blogs/ Wed, 11 Mar 2020 00:00:00 +0000 corp-eng-blogs/ <p>I've been comparing notes with people who run corporate engineering blogs and one thing that I think is curious is that it's pretty common for my personal blog to get more traffic than the entire corp eng blog for a company with a nine to ten figure valuation and it's not uncommon for my blog to get an order of magnitude more traffic.</p> <p>I think this is odd because tech companies in that class often have hundreds to thousands of employees. They're overwhelmingly likely to be better equipped to write a compelling blog than I am and companies get a lot more value from having a compelling blog than I do.</p> <p>With respect to the former, employees of the company will have done more interesting engineering work, have more fun stories, and have more in-depth knowledge than any one person who has a personal blog. On the latter, my blog helps me with job searching and it helps companies hire. But I only need one job, so more exposure, at best, gets me a slightly better job, whereas all but one tech company I've worked for is desperate to hire and loses candidates to other companies all the time. Moreover, I'm not really competing against other candidates when I interview (even if we interview for the same job, if the company likes more than one of us, it will usually just make more jobs). The high-order bit on this blog with respect to job searching is whether or not the process can take significant non-interview feedback or if <a href="algorithms-interviews/">I'll fail the interview because they do a conventional interview</a> and the marginal value of an additional post is probably very low with respect to that. On the other hand, companies compete relatively directly when recruiting, so being more compelling relative to another company has value to them; replicating the playbook Cloudflare or Segment has used with their engineering &quot;brands&quot; would be a significant recruiting advantage. The playbook isn't secret: these companies broadcast their output to the world and are generally happy to talk about their blogging process.</p> <p>Despite the seemingly obvious benefits of having a &quot;good&quot; corp eng blog, most corp eng blogs are full of stuff engineers don't want to read.
Vague, high-level fluff about how amazing everything is, content marketing, handwave-y posts about the new hotness (today, that might be using deep learning for inappropriate applications; ten years ago, that might have been using &quot;big data&quot; for inappropriate applications), etc.</p> <p>To try to understand what companies with good corporate engineering blogs have in common, I interviewed folks at three different companies that have compelling corporate engineering blogs (Cloudflare, Heap, and Segment) as well as folks at three different companies that have lame corporate engineering blogs (which I'm not going to name).</p> <p>At a high level, the compelling engineering blogs had processes that shared the following properties:</p> <ul> <li>Easy approval process, not many approvals necessary</li> <li>Few or no non-engineering approvals required</li> <li>Implicit or explicit fast <a href="https://en.wikipedia.org/wiki/Service-level_objective">SLO</a> on approvals</li> <li>Approval/editing process mainly makes posts more compelling to engineers</li> <li>Direct, high-level (co-founder, C-level, or VP-level) support for keeping blog process lightweight</li> </ul> <p>The less compelling engineering blogs had processes that shared the following properties:</p> <ul> <li>Slow approval process</li> <li>Many approvals necessary</li> <li>Significant non-engineering approvals necessary <ul> <li>Non-engineering approvals suggest changes authors find frustrating</li> <li>Back-and-forth can go on for months</li> </ul></li> <li>Approval/editing process mainly de-risks posts, removes references to specifics, makes posts vaguer and less interesting to engineers</li> <li>Effectively no high-level support for blogging <ul> <li>Leadership may agree that blogging is good in the abstract, but it's not a high enough priority to take concrete action</li> <li>Reforming process to make blogging easier very difficult; previous efforts have failed</li> <li>Changing process to reduce overhead requires all &quot;stakeholders&quot; to sign off (14 in one case) <ul> <li>Any single stakeholder can block</li> <li>No single stakeholder can approve</li> </ul></li> <li>Stakeholders wary of approving anything that reduces overhead <ul> <li>Approving involves taking on perceived risk (what if something bad happens) with no perceived benefit to them</li> </ul></li> </ul></li> </ul> <p>One person at a company with a compelling blog noted that a downside of having only one approver and/or one primary approver is that if that person is busy, it can take weeks to get posts approved. That's fair; that's a downside of having centralized approval.
However, when we compare to the alternative processes, at one company, people noted that it's typical for approvals to take three to six months and tail cases can take a year.</p> <p>While a few weeks can seem like a long time for someone used to a fast moving company, people at slower moving companies would be ecstatic to have an approval process that only takes twice that long.</p> <p>Here are the processes, as described to me, for the three companies I interviewed (presented in <code>sha512sum</code> order, which is coincidentally ordered by increasing size of company, from a couple hundred employees to nearly one thousand employees):</p> <h4 id="heap">Heap</h4> <ul> <li>Someone has an idea to write a post</li> <li>Writer (who is an engineer) is paired with a &quot;buddy&quot;, who edits and then approves the post <ul> <li>Buddy is an engineer who has a track record of producing reasonable writing</li> <li>This may take a few rounds, may change thrust of the post</li> </ul></li> <li>CTO reads and approves <ul> <li>Usually only minor feedback</li> <li>May make suggestions like &quot;a designer could make this graph look better&quot;</li> </ul></li> <li>Publish post</li> </ul> <p>The first editing phase used to involve posting a draft to a slack channel where &quot;everyone&quot; would comment on the post. This was an unpleasant experience since &quot;everyone&quot; would make comments and a lot of revision would be required. The current process was designed to avoid getting &quot;too much&quot; feedback.</p> <h4 id="segment">Segment</h4> <ul> <li>Someone has an idea to write a post <ul> <li>Often comes from: internal docs, external talk, shipped project, open source tooling (built by Segment)</li> </ul></li> <li>Writer (who is an engineer) writes a draft <ul> <li>May have a senior eng work with them to write the draft</li> </ul></li> <li>Until recently, no one really owned the feedback process <ul> <li>Calvin French-Owen (co-founder) and Rick (engineering manager) would usually give most feedback</li> <li>Maybe also get feedback from manager and eng leadership</li> <li>Typically, 3rd draft is considered finished</li> <li>Now, have a full-time editor who owns editing posts</li> </ul></li> <li>Also socialize among eng team, get feedback from 15-20 people</li> <li>PR and legal will take a look, lightweight approval</li> </ul> <p>Some changes that have been made include:</p> <ul> <li>At one point, when trying to establish an &quot;engineering brand&quot;, making in-depth technical posts a top-level priority</li> <li>Had a &quot;blogging retreat&quot;, one week spent on writing a post</li> <li>Added writing and speaking as explicit criteria to be rewarded in performance reviews and career ladders</li> </ul> <p>Although there's legal and PR approval, Calvin noted &quot;In general we try to keep it fairly lightweight.
I see the bigger problem with blogging being a lack of posts or vague, high level content which isn't interesting rather than revealing too much.&quot;</p> <h4 id="cloudflare">Cloudflare</h4> <ul> <li>Someone has an idea to write a post <ul> <li>Internal blogging is part of the culture, some posts come from the internal blog</li> </ul></li> <li>John Graham-Cumming (CTO) reads every post, other folks will read and comment <ul> <li>John is approver for posts</li> </ul></li> <li>Matthew Prince (CEO) also generally supportive of blogging</li> <li>&quot;Very quick&quot; legal approval process, SLO of 1 hour <ul> <li>This process is so lightweight that one person didn't really think of it as an approval, another person didn't mention it at all (a third person did mention this step)</li> <li>Comms generally not involved</li> </ul></li> </ul> <p>One thing to note is that this only applies to technical blog posts. Product announcements have a heavier process because they're tied to sales material, press releases, etc.</p> <p>One thing I find interesting is that Marek interviewed at Cloudflare because of their blog (<a href="https://blog.cloudflare.com/a-tour-inside-cloudflares-latest-generation-servers/">this 2013 blog post on their 4th generation servers caught his eye</a>) and he's now both a key engineer for them as well as one of the main sources of compelling Cloudflare blog posts. At this point, the Cloudflare blog has generated at least a few more generations of folks who interviewed because they saw a blog post and now write compelling posts for the blog.</p> <h4 id="negative-example-1">Negative example #1</h4> <ul> <li>Many people suggested I use this company as a positive example because, in the early days, they had a semi-lightweight process like the above</li> <li>The one thing that made the process non-lightweight was that a founder insisted on signing off on posts and would often heavily rewrite them, but the blog was a success and a big driver of recruiting</li> <li>As the company scaled up, founder approval took longer and longer, causing lengthy delays in the blog process</li> <li>At some point, an outsider was hired to take over the blog publishing process because it was considered important to leadership</li> <li>Afterwards, the process became filled with typical anti-patterns, taking months for approval, with many iterations of changes that engineers found frustrating and that made their blog posts less compelling <ul> <li>Multiple people told me that they vowed to never write another blog post for the company after doing one because the process was so painful</li> <li>The good news is that, long after the era of the blog having a reasonable process ended, the memory of the blog having good output still gave many outsiders a positive impression about the company and its engineering</li> </ul></li> </ul> <h4 id="negative-example-2">Negative example #2</h4> <ul> <li>A friend of mine tried to publish a blog post and it took six months for &quot;comms&quot; to approve</li> <li>About a year after the above, due to the reputation of &quot;negative example #1&quot;, &quot;negative example #2&quot; hired the person who ran the process at &quot;negative example #1&quot; to a senior position in PR/comms and to run the blogging process at this company.
At &quot;negative example #1&quot;, this person took over when the blog was something engineers wanted to write for and was the primary driver of the blog process becoming so onerous that engineers vowed to never write a blog post again after writing one post</li> <li>Hiring the person who presided over the decline of &quot;negative example #1&quot; to improve the process at &quot;negative example #2&quot; did not streamline the process or result in more or better output at &quot;negative example #2&quot;</li> </ul> <h3 id="general-comments">General comments</h3> <p>My opinion is that the natural state of a corp eng blog where people <a href="p95-skill/">get a bit of feedback</a> is a pretty interesting blog. There's <a href="https://twitter.com/rakyll/status/1043952902157459456">a dearth of real, in-depth, technical writing</a>, which makes any half decent, honest, public writing about technical work interesting.</p> <p>In order to have a boring blog, the corporation has to actively stop engineers from putting interesting content out there. Unfortunately, it appears that the natural state of large corporations tends towards risk aversion and blocking people from writing, just in case it causes a legal or PR or other problem. Individual contributors (ICs) might have the opinion that it's ridiculous to block engineers from writing low-risk technical posts while, simultaneously, C-level execs and VPs regularly make public comments that turn into PR disasters, but ICs in large companies don't have the authority or don't feel like they have the authority to do something just because it makes sense. And none of the fourteen stakeholders who'd have to sign off on approving a streamlined process care about streamlining the process since that would be good for the company in a way that doesn't really impact them, not when that would mean seemingly taking responsibility for the risk a streamlined process would add, however small. An exec or a senior VP willing to take a risk can take responsibility for the fallout and, if they're interested in engineering recruiting or morale, they may see a reason to do so.</p> <p>One comment I've often heard from people at more bureaucratic companies is something like &quot;every company our size is like this&quot;, but that's not true. Cloudflare, a $6B company approaching 1k employees, is in the same size class as many other companies with a much more onerous blogging process. The corp eng blog situation seems similar to the situation with giving real interview feedback. <a href="http://blog.interviewing.io/no-engineer-has-ever-sued-a-company-because-of-constructive-post-interview-feedback-so-why-dont-employers-do-it/">interviewing.io claims that there's significant upside and very little downside to doing so</a>. Some companies actually do give real feedback and the ones that do generally find that it gives them an easy advantage in recruiting with little downside, but the vast majority of companies don't do this and people at those companies will claim that it's impossible to give feedback since you'll get sued or the company will be &quot;cancelled&quot; even though this generally doesn't happen to companies that give feedback and there are even entire industries where it's common to give interview feedback.
It's easy to handwave that some risk exists and very few people have the authority to dismiss vague handwaving about risk when it's coming from multiple orgs.</p> <p>Although this is a small sample size and it's dangerous to generalize too much from small samples, the idea that you need high-level support to blast through bureaucracy is consistent with what I've seen in other areas where most large companies have a hard time doing something easy that has obvious but diffuse value. While this post happens to be about blogging, I've heard stories that are the same shape on a wide variety of topics.</p> <h3 id="appendix-examples-of-compelling-blog-posts">Appendix: examples of compelling blog posts</h3> <p>Here are some blog posts from the blogs mentioned with a short comment on why I thought the post was compelling. This time, in reverse sha512 hash order.</p> <h4 id="cloudflare-1">Cloudflare</h4> <ul> <li><a href="https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/">https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/</a> <ul> <li>Talks about a real technical problem that impacted a lot of people, reasonably in depth</li> <li>Timely, released only eight hours after the outage, when people were still really interested in hearing about what happened; most companies can't turn around a compelling blog post this quickly or can only do it on a special-case basis, Cloudflare is able to crank out timely posts semi-regularly</li> </ul></li> <li><a href="https://blog.cloudflare.com/the-relative-cost-of-bandwidth-around-the-world/">https://blog.cloudflare.com/the-relative-cost-of-bandwidth-around-the-world/</a> <ul> <li>Exploration of some data</li> </ul></li> <li><a href="https://blog.cloudflare.com/the-story-of-one-latency-spike/">https://blog.cloudflare.com/the-story-of-one-latency-spike/</a> <ul> <li>A debugging story</li> </ul></li> <li><a href="https://blog.cloudflare.com/when-bloom-filters-dont-bloom/">https://blog.cloudflare.com/when-bloom-filters-dont-bloom/</a> <ul> <li>A debugging story, this time in the context of developing a data structure</li> </ul></li> </ul> <h4 id="segment-1">Segment</h4> <ul> <li><a href="https://segment.com/blog/when-aws-autoscale-doesn-t/">https://segment.com/blog/when-aws-autoscale-doesn-t/</a> <ul> <li>Concrete explanation of a gotcha in a widely used service</li> </ul></li> <li><a href="https://segment.com/blog/gotchas-from-two-years-of-node/">https://segment.com/blog/gotchas-from-two-years-of-node/</a> <ul> <li>Concrete example and explanation of a gotcha in a widely used tool</li> </ul></li> <li><a href="https://segment.com/blog/automating-our-infrastructure/">https://segment.com/blog/automating-our-infrastructure/</a> <ul> <li>Post with specific details about how a company operates; in theory, any company could write this, but few do</li> </ul></li> </ul> <h4 id="heap-1">Heap</h4> <ul> <li><a href="https://heap.io/blog/engineering/basic-performance-analysis-saved-us-millions">https://heap.io/blog/engineering/basic-performance-analysis-saved-us-millions</a> <ul> <li>Talks about a real problem and solution</li> </ul></li> <li><a href="https://heap.io/blog/engineering/clocksource-aws-ec2-vdso">https://heap.io/blog/engineering/clocksource-aws-ec2-vdso</a> <ul> <li>Talks about a real problem and solution</li> <li>In HN comments, engineers (malisper, kalmar) have technical responses with real reasons in them and not just the usual dissembling that 
you see in most cases</li> </ul></li> <li><a href="https://heap.io/blog/analysis/migrating-to-typescript">https://heap.io/blog/analysis/migrating-to-typescript</a> <ul> <li>Real talk about how the first attempt at driving a company-wide technical change failed</li> </ul></li> </ul> <p>One thing to note is that these blogs all have different styles. Personally, I prefer the style of Cloudflare's blog, which has a higher proportion of &quot;deep dive&quot; technical posts, but different people will prefer different styles. There are a lot of styles that can work.</p> <p><small>Thanks to Marek Majkowski, Kamal Marhubi, Calvin French-Owen, John Graham-Cumming, Michael Malis, Matthew Prince, Yuri Vishnevsky, Julia Evans, Wesley Aptekar-Cassels, Nathan Reed, Jake Seliger, an anonymous commenter, plus sources from the companies I didn't name for comments/corrections/discussion; none of the people explicitly mentioned in the acknowledgements were sources for information on the less compelling blogs</small></p> <p><link rel="prefetch" href="algorithms-interviews/"> <link rel="prefetch" href="p95-skill/"> <link rel="prefetch" href=""> <link rel="prefetch" href="about/"></p> The growth of command line options, 1979-Present cli-complexity/ Tue, 03 Mar 2020 00:00:00 +0000 cli-complexity/ <p><a href="https://www.xkcd.com/1795/">My hobby</a>: opening up <a href="mcilroy-unix/">McIlroy’s UNIX philosophy</a> on one monitor while reading manpages on the other.</p> <p>The first of McIlroy's dicta is often paraphrased as &quot;do one thing and do it well&quot;, which is <a href="mcilroy-unix/">shortened from</a> &quot;Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new 'features.'&quot;</p> <p>McIlroy's example of this dictum is:</p> <blockquote> <p>Surprising to outsiders is the fact that UNIX compilers produce no listings: printing can be done better and more flexibly by a separate program.</p> </blockquote> <p>If you open up a manpage for <code>ls</code> on mac, you’ll see that it starts with</p> <pre><code>ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1] [file ...] </code></pre> <p>That is, the one-letter flags to <code>ls</code> include every lowercase letter except for <code>{jvyz}</code>, 14 uppercase letters, plus <code>@</code> and <code>1</code>.
That’s 22 + 14 + 2 = 38 single-character options alone.</p> <p>On ubuntu 17, if you read the manpage for coreutils <code>ls</code>, you don’t get a nice summary of options, but you’ll see that <code>ls</code> has 58 options (including <code>--help</code> and <code>--version</code>).</p> <p>To see if <code>ls</code> is an aberration or if it's normal to have commands that do this much stuff, we can look at some common commands, sorted by frequency of use.</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <table> <tr> <th>command</th><th>1979</th><th>1996</th><th>2015</th><th>2017</th></tr> <tr> <td>ls</td><td bgcolor=#6baed6><font color=black>11</font></td><td bgcolor=#2171b5><font color=white>42</font></td><td bgcolor=#08519c><font color=white>58</font></td><td bgcolor=#08519c><font color=white>58</font></td></tr> <tr> <td>rm</td><td bgcolor=#deebf7><font color=black>3</font></td><td bgcolor=#9ecae1><font color=black>7</font></td><td bgcolor=#6baed6><font color=black>11</font></td><td bgcolor=#6baed6><font color=black>12</font></td></tr> <tr> <td>mkdir</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#c6dbef><font color=black>4</font></td><td bgcolor=#9ecae1><font color=black>6</font></td><td bgcolor=#9ecae1><font color=black>7</font></td></tr> <tr> <td>mv</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#9ecae1><font color=black>9</font></td><td bgcolor=#6baed6><font color=black>13</font></td><td bgcolor=#6baed6><font color=black>14</font></td></tr> <tr> <td>cp</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#4292c6><font color=black>18</font></td><td bgcolor=#2171b5><font color=white>30</font></td><td bgcolor=#2171b5><font color=white>32</font></td></tr> <tr> <td>cat</td><td bgcolor=#f7fbff><font color=black>1</font></td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#6baed6><font color=black>12</font></td></tr> <tr> <td>pwd</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#deebf7><font color=black>2</font></td><td bgcolor=#c6dbef><font color=black>4</font></td><td bgcolor=#c6dbef><font color=black>4</font></td></tr> <tr> <td>chmod</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#9ecae1><font color=black>6</font></td><td bgcolor=#9ecae1><font color=black>9</font></td><td bgcolor=#9ecae1><font color=black>9</font></td></tr> <tr> <td>echo</td><td bgcolor=#f7fbff><font color=black>1</font></td><td bgcolor=#c6dbef><font color=black>4</font></td><td bgcolor=#c6dbef><font color=black>5</font></td><td bgcolor=#c6dbef><font color=black>5</font></td></tr> <tr> <td>man</td><td bgcolor=#c6dbef><font color=black>5</font></td><td bgcolor=#6baed6><font color=black>16</font></td><td bgcolor=#2171b5><font color=white>39</font></td><td bgcolor=#2171b5><font color=white>40</font></td></tr> <tr> <td>which</td><td bgcolor=silver><font color=black></font></td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#f7fbff><font color=black>1</font></td><td bgcolor=#f7fbff><font color=black>1</font></td></tr> <tr> <td>sudo</td><td bgcolor=silver><font color=black></font></td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#4292c6><font color=black>23</font></td><td bgcolor=#4292c6><font color=black>25</font></td></tr> <tr> <td>tar</td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#08519c><font color=white>53</font></td><td bgcolor=black><font 
color=white>134</font></td><td bgcolor=black><font color=white>139</font></td></tr> <tr> <td>touch</td><td bgcolor=#f7fbff><font color=black>1</font></td><td bgcolor=#9ecae1><font color=black>9</font></td><td bgcolor=#6baed6><font color=black>11</font></td><td bgcolor=#6baed6><font color=black>11</font></td></tr> <tr> <td>clear</td><td bgcolor=silver><font color=black></font></td><td bgcolor=white><font color=black>0</font></td><td bgcolor=white><font color=black>0</font></td><td bgcolor=white><font color=black>0</font></td></tr> <tr> <td>find</td><td bgcolor=#6baed6><font color=black>14</font></td><td bgcolor=#08519c><font color=white>57</font></td><td bgcolor=#08519c><font color=white>82</font></td><td bgcolor=#08519c><font color=white>82</font></td></tr> <tr> <td>ln</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#6baed6><font color=black>11</font></td><td bgcolor=#6baed6><font color=black>15</font></td><td bgcolor=#6baed6><font color=black>16</font></td></tr> <tr> <td>ps</td><td bgcolor=#c6dbef><font color=black>4</font></td><td bgcolor=#4292c6><font color=black>22</font></td><td bgcolor=#08306b><font color=white>85</font></td><td bgcolor=#08306b><font color=white>85</font></td></tr> <tr> <td>ping</td><td bgcolor=silver><font color=black></font></td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#2171b5><font color=white>29</font></td></tr> <tr> <td>kill</td><td bgcolor=#f7fbff><font color=black>1</font></td><td bgcolor=#deebf7><font color=black>3</font></td><td bgcolor=#deebf7><font color=black>3</font></td><td bgcolor=#deebf7><font color=black>3</font></td></tr> <tr> <td>ifconfig</td><td bgcolor=silver><font color=black></font></td><td bgcolor=#6baed6><font color=black>16</font></td><td bgcolor=#4292c6><font color=black>25</font></td><td bgcolor=#4292c6><font color=black>25</font></td></tr> <tr> <td>chown</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#9ecae1><font color=black>6</font></td><td bgcolor=#6baed6><font color=black>15</font></td><td bgcolor=#6baed6><font color=black>15</font></td></tr> <tr> <td>grep</td><td bgcolor=#6baed6><font color=black>11</font></td><td bgcolor=#4292c6><font color=black>22</font></td><td bgcolor=#2171b5><font color=white>45</font></td><td bgcolor=#2171b5><font color=white>45</font></td></tr> <tr> <td>tail</td><td bgcolor=#f7fbff><font color=black>1</font></td><td bgcolor=#9ecae1><font color=black>7</font></td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#6baed6><font color=black>13</font></td></tr> <tr> <td>df</td><td bgcolor=white><font color=black>0</font></td><td bgcolor=#6baed6><font color=black>10</font></td><td bgcolor=#4292c6><font color=black>17</font></td><td bgcolor=#4292c6><font color=black>18</font></td></tr> <tr> <td>top</td><td bgcolor=silver><font color=black></font></td><td bgcolor=#9ecae1><font color=black>6</font></td><td bgcolor=#6baed6><font color=black>12</font></td><td bgcolor=#6baed6><font color=black>14</font></td></tr> </table></p> <p>This table has the number of command line options for various commands for v7 Unix (1979), slackware 3.1 (1996), ubuntu 12 (2015), and ubuntu 17 (2017). 
Cells are darker and blue-er when they have more options (log scale) and are greyed out if no command was found.</p> <p>We can see that the number of command line options has dramatically increased over time; entries tend to get darker going to the right (more options) and there are no cases where entries get lighter (fewer options). </p> <p><a href="https://archive.org/details/DougMcIlroy_AncestryOfLinux_DLSLUG">McIlroy has long decried the increase in the number of options, size, and general functionality of commands</a><sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">1</a></sup>:</p> <blockquote> <p>Everything was small and my heart sinks for Linux when I see the size [inaudible]. The same utilities that used to fit in eight k[ilobytes] are a meg now. And the manual page, which used to really fit on, which used to really be a manual <em>page</em>, is now a small volume with a thousand options... We used to sit around in the UNIX room saying &quot;what can we throw out? Why is there this option?&quot; It's usually, it's often because there's some deficiency in the basic design — you didn't really hit the right design point. Instead of putting in an option, figure out why, what was forcing you to add that option. This viewpoint, which was imposed partly because there was very small hardware ... has been lost and we're not better off for it.</p> </blockquote> <p>Ironically, one of the reasons for the rise in the number of command line options is another McIlroy dictum, &quot;Write programs to handle text streams, because that is a universal interface&quot; (see <code>ls</code> for one example of this).</p> <p>If structured data or objects were passed around, formatting could be left to a final formatting pass. But, with plain text, the formatting and the content are intermingled; because formatting can only be done by parsing the content out, it's common for commands to add formatting options for convenience. Alternately, formatting can be done when the user leverages their knowledge of the structure of the data and encodes that knowledge into arguments to <code>cut</code>, <code>awk</code>, <code>sed</code>, etc. (also using their knowledge of how those programs handle formatting; it's different for different programs and the user is expected to, for example, <a href="https://unix.stackexchange.com/a/132322/261842">know how <code>cut -f4</code> is different from <code>awk '{ print $4 }'</code></a><sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">2</a></sup>). That's a lot more hassle than passing in one or two arguments to the last command in a sequence and it pushes the complexity from the tool to the user.</p> <p>People sometimes say that they don't want to support structured data because they'd have to support multiple formats to make a universal tool, but they already have to support multiple formats to make a universal tool. Some standard commands can't read output from other commands because they use different formats, <code>wc -w</code> doesn't handle Unicode correctly, etc. Saying that &quot;text&quot; is a universal format is like saying that &quot;binary&quot; is a universal format.</p> <p>I've heard people say that there isn't really any alternative to this kind of complexity for command line tools, but people who say that have never really tried the alternative, something like PowerShell. 
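Before getting to PowerShell, it's worth making the earlier point about <code>cut</code> and <code>awk</code> concrete; the two-line example below is made up, but the behavior is standard:</p> <pre><code># Two spaces between the fields: cut treats every space as a delimiter,
# so field 2 is the empty field between the two spaces.
printf 'alice  42\n' | cut -d ' ' -f 2     # prints an empty line
# awk collapses runs of whitespace into a single separator.
printf 'alice  42\n' | awk '{ print $2 }'  # prints 42
</code></pre> <p>The data is the same in both cases; what differs is the formatting knowledge the user has to carry around to get at it.</p> <p>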
I have plenty of complaints about PowerShell, but passing structured data around and easily being able to operate on structured data without having to hold metadata information in my head so that I can pass the appropriate metadata to the right command line tools at the right places in the pipeline isn't among my complaints<sup class="footnote-ref" id="fnref:W"><a rel="footnote" href="#fn:W">3</a></sup>.</p> <p>The sleight of hand that's happening when someone says that we can keep software simple and compatible by making everything handle text is the pretense that text data doesn't have a structure that needs to be parsed<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">4</a></sup>. In some cases, we can just think of everything as a single space separated line, or maybe a table with some row and column separators that we specify (<a href="https://unix.stackexchange.com/a/132322/261842">with some behavior that isn't consistent across tools, of course</a>). That adds some hassle when it works, and then there are the cases where serializing data to a flat text format adds considerable complexity since the structure of data means that simple flattening requires significant parsing work to re-ingest the data in a meaningful way.</p> <p>Another reason commands now have more options is that people have added convenience flags for functionality that could have been done by cobbling together a series of commands. These go all the way back to v7 unix, where <code>ls</code> has an option to reverse the sort order (which could have been done by passing the output to something like <code>tac</code> had they written <code>tac</code> instead of adding a special-case reverse option).</p> <p>Over time, more convenience options have been added. For example, to pick a command that originally had zero options, <code>mv</code> can move <em>and</em> create a backup (three options; two are different ways to specify a backup, one of which takes an argument and the other of which takes zero explicit arguments and reads an implicit argument from the <code>VERSION_CONTROL</code> environment variable; one option allows overriding the default backup suffix). <code>mv</code> now also has options to never overwrite and to only overwrite if the file is newer.</p> <p><code>mkdir</code> is another program that used to have no options where, excluding security things for SELinux or SMACK as well as help and version options, the added options are convenience flags: setting the permissions of the new directory and making parent directories if they don't exist.</p> <p>If we look at <code>tail</code>, which originally had one option (<code>-number</code>, telling <code>tail</code> where to start), it's added both formatting and convenience options. For formatting, it has <code>-z</code>, which makes the line delimiter <code>null</code> instead of a newline. Some examples of convenience options are <code>-f</code> to print when there are new changes, <code>-s</code> to set the sleep interval between checking for <code>-f</code> changes, and <code>--retry</code> to retry if the file isn't accessible.</p> <p>McIlroy says &quot;we're not better off&quot; for having added all of these options but I'm better off. I've never used some of the options we've discussed and only rarely use others, but that's the beauty of command line options — unlike with a GUI, adding these options doesn't clutter up the interface.
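For a sense of what the convenience flags described above look like in use, here are two short examples in GNU syntax (the file names are made up):</p> <pre><code># Follow a log by name, retrying until it exists, polling every 5 seconds.
tail --follow=name --retry -s 5 /var/log/app.log

# Move a config file into place, keeping a numbered backup of anything overwritten;
# plain -b instead would read the backup style from the VERSION_CONTROL environment variable.
mv --backup=numbered new.conf /etc/app.conf
</code></pre> <p>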
The manpage can get cluttered, but in the age of google and stackoverflow, I suspect many people just search for a solution to what they're trying to do without reading the manpage anyway.</p> <p>This isn't to say there's no cost to adding options — more options means more maintenance burden, but that's a cost that maintainers pay to benefit users, which isn't obviously unreasonable considering the ratio of maintainers to users. This is analogous to Gary Bernhardt's comment that it's reasonable to practice a talk fifty times since, if there's a three hundred person audience, the ratio of time spent watching the talk to time spent practicing will still only be 1:6. In general, this ratio will be even more extreme with commonly used command line tools.</p> <p>Someone might argue that all these extra options create a burden for users. That's not exactly wrong, but that complexity burden was always going to be there, it's just a question of where the burden was going to lie. If you think of the set of command line tools along with a shell as forming a language, a language where anyone can write a new method and it effectively gets added to the standard library if it becomes popular, where standards are defined by dicta like &quot;write programs to handle text streams, because that is a universal interface&quot;, the language was always going to turn into a write-only incoherent mess when taken as a whole. At least with tools that bundle up more functionality and options than is UNIX-y, users can replace a gigantic set of wildly inconsistent tools with a merely large set of tools that, while inconsistent with each other, may have some internal consistency.</p> <p>McIlroy implies that the problem is that people didn't think hard enough, that the old school UNIX mavens would have sat down in the same room and thought longer and harder until they came up with a set of consistent tools that has &quot;unusual simplicity&quot;. But that was never going to scale; the philosophy made the mess we're in inevitable. It's not a matter of not thinking longer or harder; it's a matter of having a philosophy that cannot scale unless you have a relatively small team with a shared cultural understanding, able to sit down in the same room.</p> <p>Many of the main long-term UNIX anti-features and anti-patterns that we're still stuck with today, fifty years later, come from the &quot;we should all act like we're in the same room&quot; design philosophy, which is the opposite of the approach you want if you want to create nice, usable, general interfaces that can adapt to problems that the original designers didn't think of. For example, it's a common complaint that modern shells and terminals lack a bunch of obvious features that anyone designing a modern interface would want. When you talk to people who've written a new shell and a new terminal with modern principles in mind, like Jesse Luehrs, they'll note that a major problem is that the UNIX model doesn't have a good separation of interface and implementation, which works ok if you're going to write a terminal that acts in the same way that a terminal that was created fifty years ago acts, but is immediately and obviously problematic if you want to build a modern terminal.
That design philosophy works fine if everyone's in the same room and the system doesn't need to scale up the number of contributors or over time, but that's simply not the world we live in today.</p> <p>If anyone can write a tool and the main instruction comes from &quot;the unix philosophy&quot;, people will have different opinions about what &quot;<a href="https://twitter.com/hillelogram/status/1174714902151421952">simplicity</a>&quot; or &quot;doing one thing&quot;<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">5</a></sup> means, what the right way to do things is, and inconsistency will bloom, resulting in the kind of complexity you get when dealing with a wildly inconsistent language, like PHP. People make fun of PHP and javascript for having all sorts of warts and weird inconsistencies, but as a language and a standard library, any commonly used shell plus the collection of widely used *nix tools taken together is much worse and contains much more accidental complexity due to inconsistency even within a single Linux distro and there's no other way it could have turned out. If you compare across Linux distros, BSDs, Solaris, AIX, etc., the amount of accidental complexity that users have to hold in their heads when switching systems dwarfs PHP or javascript's incoherence. The most widely mocked programming languages are paragons of great design by comparison.</p> <p><a id="maven"></a>To be clear, I'm not saying that I or anyone else could have done better with the knowledge available in the 70s in terms of making a system that was practically useful at the time that would be elegant today. It's easy to look back and find issues with the benefit of hindsight. What I disagree with are comments from Unix mavens speaking today; comments like McIlroy's, which imply that we just forgot or don't understand the value of simplicity, or <a href="https://twitter.com/danluu/status/885214004649615360">Ken Thompson saying that C is as safe a language as any and if we don't want bugs we should just write bug-free code</a>. These kinds of comments imply that there's not much to learn from hindsight; in the 70s, we were building systems as effectively as anyone can today; five decades of collective experience, tens of millions of person-years, have taught us nothing; if we just go back to building systems like the original Unix mavens did, all will be well. I respectfully disagree.</p> <h3 id="appendix-memory">Appendix: memory</h3> <p>Although addressing McIlroy's complaints about binary size bloat is a bit out of scope for this, I will note that, in 2017, I bought a Chromebook that had 16GB of RAM for $300. A 1 meg binary might have been a serious problem in 1979, when a standard Apple II had 4KB. An Apple II cost $1298 in 1979 dollars, or $4612 in 2020 dollars. You can get a low end Chromebook that costs less than 1/15th as much which has four million times more memory. Complaining that memory usage grew by a factor of one thousand when a (portable!) machine that's more than an order of magnitude cheaper has four million times more memory seems a bit ridiculous.</p> <p>I prefer slimmer software, which is why I optimized my home page down to two packets (it would be a single packet if my CDN served high-level brotli), but that's purely an aesthetic preference, something I do for fun. The bottleneck for command line tools isn't memory usage and spending time optimizing the memory footprint of a tool that takes one meg is like getting a homepage down to a single packet. 
Perhaps a fun hobby, but not something that anyone should prescribe.</p> <h3 id="methodology-for-table">Methodology for table</h3> <p>Command frequencies were sourced from public command history files on github, not necessarily representative of your personal usage. Only &quot;simple&quot; commands were kept, which ruled out things like curl, git, gcc (which has &gt; 1000 options), and wget. What's considered simple is arbitrary. <a href="https://en.wikipedia.org/wiki/Shell_builtin">Shell builtins</a>, like <code>cd</code>, weren't included.</p> <p>Repeated options aren't counted as separate options. For example, <code>git blame -C</code>, <code>git blame -C -C</code>, and <code>git blame -C -C -C</code> have different behavior, but these would all be counted as a single argument even though <code>-C -C</code> is effectively a different argument from <code>-C</code>.</p> <p>The table counts sub-options as a single option. For example, <code>ls</code> has the following:</p> <blockquote> <p>--format=WORD across -x, commas -m,  horizontal  -x,  long  -l,  single-column  -1,  verbose  -l, vertical -C</p> </blockquote> <p>Even though there are seven format options, this is considered to be only one option.</p> <p>Options that are explicitly listed as not doing anything are still counted as options, e.g., <code>ls -g</code>, which reads <code>Ignored; for Unix compatibility.</code> is counted as an option.</p> <p>Multiple versions of the same option are also considered to be one option. For example, with <code>ls</code>, <code>-A</code> and <code>--almost-all</code> are counted as a single option.</p> <p>In cases where the manpage says an option is supposed to exist, but doesn't, the option isn't counted. For example, the v7 <code>mv</code> manpage says</p> <blockquote> <p>BUGS</p> <p>If file1 and file2 lie on different file systems, mv must copy the file and delete the original. In this case the owner name becomes that of the copying process and any linking relationship with other files is lost.</p> <p>Mv should take <strong>-f</strong> flag, like rm, to suppress the question if the target exists and is not writable.</p> </blockquote> <p><code>-f</code> isn't counted as a flag in the table because the option doesn't actually exist.</p> <p>The latest year in the table is 2017 because I wrote the first draft for this post in 2017 and didn't get around to cleaning it up until 2020.</p> <h3 id="related">Related</h3> <p><a href="https://blog.plover.com/Unix/tools.html">mjd on the Unix philosophy, with an aside into the mess of /usr/bin/time vs. built-in time</a>.</p> <p><a href="https://groups.google.com/forum/m/#!topic/rec.humor.funny/Q-HG4LpW564">mjd making a joke about the proliferation of command line options in 1991</a>.</p> <p>On HN:</p> <blockquote> <blockquote> <p>p1mrx:</p> <p><a href="https://unix.stackexchange.com/q/112125/261842">It's strange that ls has grown to 58 options, but still can't output \0-terminated filenames</a></p> <p>As an exercise, try to sort a directory by size or date, and pass the result to xargs, while supporting any valid filename. I eventually just gave up and made my script ignore any filenames containing \n.</p> </blockquote> <p>whelming_wave:</p> <p>Here you go: sort all files in the current directory by modification time, whitespace-in-filenames-safe. The <code>printf (od -&gt; sed)</code> construction converts back out of null-separated characters into newline-separated, though feel free to replace that with anything accepting null-separated input. 
Granted, <code>sort --zero-terminated</code> is a GNU extension and kinda cheating, but it's even available on macOS so it's probably fine.</p> </blockquote> <pre><code> printf '%b' $( find . -maxdepth 1 -exec sh -c ' printf '\''%s %s\0'\'' &quot;$(stat -f '\''%m'\'' &quot;$1&quot;)&quot; &quot;$1&quot; ' sh {} \; | \ sort --zero-terminated | \ od -v -b | \ sed 's/^[^ ]*// s/ *$// s/ */ \\/g s/\\000/\\012/g') </code></pre> <blockquote> <p>If you're running this under zsh, you'll need to prefix it with <code>command</code> to use the system executable: zsh's builtin printf doesn't support printing octal escape codes for normally printable characters, and you may have to assign the output to a variable and explicitly word-split it.</p> <p>This is all POSIX as far as I know, except for the sort.</p> </blockquote> <p><a href="https://en.wikipedia.org/wiki/The_Unix-Haters_Handbook">The Unix haters handbook</a>.</p> <p><a href="http://www.oilshell.org/blog/2018/01/28.html">Why create a new shell</a>?</p> <p><small> Thanks to Leah Hanson, Jesse Luehrs, Hillel Wayne, Wesley Aptekar-Cassels, Mark Jason Dominus, Travis Downs, and Yuri Vishnevsky for comments/corrections/discussion. </small></p> <p><link rel="prefetch" href="mcilroy-unix/"> <link rel="prefetch" href=""> <link rel="prefetch" href="discontinuities/"> <link rel="prefetch" href="about/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:M">This quote is slightly different than the version I've seen everywhere because I watched <a href="https://archive.org/details/DougMcIlroy_AncestryOfLinux_DLSLUG">the source video</a>. AFAICT, every copy of this quote that's on the internet (indexed by Bing, DuckDuckGo, or Google) is a copy of one person's transcription of the quote. There's some ambiguity because the audio is low quality and I hear something a bit different than whoever transcribed that quote heard. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:T">Another example of something where the user absorbs the complexity because different commands handle formatting differently is <a href="https://blog.plover.com/Unix/tools.html">time formatting</a> — the shell builtin <code>time</code> is, of course, inconsistent with <code>/usr/bin/time</code> and the user is expected to know this and know how to handle it. <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:W"><p>Just for example, you can use <code>ConvertTo-Json</code> or <code>ConvertTo-CSV</code> on any object, you <a href="https://docs.microsoft.com/en-us/powershell/scripting/samples/using-format-commands-to-change-output-view">can use cmdlets to change how properties are displayed for objects</a>, and you can write formatting configuration files that define how you prefer things to be formatted.</p> <p>Another way to look at this is through the lens of <a href="https://en.wikipedia.org/wiki/Conway's_law">Conway's law</a>. If we have a set of command line tools that are built by different people, often not organizationally connected, the tools are going to be wildly inconsistent unless someone can define a standard and get people to adopt it.
This actually works relatively well on Windows, and not just in PowerShell.</p> <p>A common complaint about Microsoft is that they've created massive API churn, often for non-technical organizational reasons (e.g., a Sinofsky power play, like the one described in the replies to the now-deleted Tweet at <a href="https://twitter.com/stevesi/status/733654590034300929">https://twitter.com/stevesi/status/733654590034300929</a>). It's true. Even so, from the standpoint of a naive user, off-the-shelf Windows software is generally a lot better at passing non-textual data around than *nix. One thing this falls out of is Windows's embracing of non-textual data, which goes back at least to <a href="https://en.wikipedia.org/wiki/Component_Object_Model">COM</a> in 1999 (and arguably OLE and DDE, released in 1990 and 1987, respectively).</p> <p>For example, if you copy from Foo, which supports binary formats <code>A</code> and <code>B</code>, into Bar, which supports formats <code>B</code> and <code>C</code> and you then copy from Bar into Baz, which supports <code>C</code> and <code>D</code>, this will work even though Foo and Baz have no commonly supported formats. </p> <p>When you cut/copy something, the application basically &quot;tells&quot; the clipboard what formats it could provide data in. When you paste into the application, the destination application can request the data in any of the formats in which it's available. If the data is already in the clipboard, &quot;Windows&quot; provides it. If it isn't, Windows gets the data from the source application and then gives to the destination application and a copy is saved for some length of time in Windows. If you &quot;cut&quot; from Excel it will tell &quot;you&quot; that it has the data available in many tens of formats. This kind of system is pretty good for compatibility, although it definitely isn't simple or minimal.</p> <p>In addition to nicely supporting many different formats and doing so for long enough that a lot of software plays nicely with this, Windows also generally has nicer clipboard support out of the box.</p> <p>Let's say you copy and then paste a small amount of text. Most of the time, this will work like you'd expect on both Windows and Linux. But now let's say you copy some text, close the program you copied from, and then paste it. A mental model that a lot of people have is that when they copy, the data is stored in the clipboard, not in the program being copied from. On Windows, software is typically written to conform to this expectation (although, technically, users of the clipboard API don't have to do this). This is less common on Linux with X, where the correct mental model for most software is that copying stores a pointer to the data, which is still owned by the program the data was copied from, which means that paste won't work if the program is closed. When I've (informally) surveyed programmers, they're usually surprised by this if they haven't actually done copy+paste related work for an application. When I've surveyed non-programmers, they tend to find the behavior to be confusing as well as surprising.</p> <p>The downside of having the OS effectively own the contents of the clipboard is that it's expensive to copy large amounts of data. Let's say you copy a really large amount of text, many gigabytes, or some complex object and then never paste it. You don't really want to copy that data from your program into the OS so that it can be available. 
Windows also handles this reasonably: applications can <a href="https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-rdpeclip/fa309d1b-8034-44bf-b927-adfc753e69c1">provide data only on request</a> when that's deemed advantageous. In the case mentioned above, when someone closes the program, the program can decide whether or not it should push that data into the clipboard or discard it. In that circumstance, a lot of software (e.g., Excel) will prompt to &quot;keep&quot; the data in the clipboard or discard it, which is pretty reasonable.</p> <p>It's not impossible to support some of this on Linux. For example, <a href="https://freedesktop.org/wiki/ClipboardManager/">the ClipboardManager spec</a> describes a persistence mechanism and GNOME applications generally kind of sort of support it (although <a href="https://bugzilla.gnome.org/show_bug.cgi?id=510204#c8">there are some bugs</a>) but the situation on *nix is really different from the more pervasive support Windows applications tend to have for nice clipboard behavior.</p> <a class="footnote-return" href="#fnref:W"><sup>[return]</sup></a></li> <li id="fn:C"><p>Another example of this is the set of tools that are available on top of modern compilers. If we go back and look at McIlroy's canonical example, how proper UNIX compilers are so specialized that listings are a separate tool, we can see that this has changed even if there's still a separate tool you can use for listings. Some commonly used Linux compilers have literally thousands of options and do many things. For example, one of the many things <code>clang</code> now does is static analysis. As of this writing, <a href="https://clang.llvm.org/docs/analyzer/checkers.html#default-checkers">there are 79 normal static analysis checks and 44 experimental checks</a>. If these were separate commands (perhaps individual commands or perhaps a <code>static_analysis</code> command), they'd still rely on the same underlying compiler infrastructure and impose the same maintenance burden — it's not really reasonable to have these static analysis tools operate on plain text and reimplement the entire compiler toolchain necessary to get to the point where they can do static analysis. They could be separate commands instead of bundled into <code>clang</code>, but they'd still take a dependency on the same machinery that's used for the compiler and either impose a maintenance and complexity burden on the compiler (which has to support non-breaking interfaces for the tools built on top) or they'd break all the time.</p> <p>&quot;Just make everything text so that it's simple&quot; makes for a nice soundbite, but in reality the textual representation of the data is often not what you want if you want to do actually useful work.</p> <p>And on clang in particular, whether you make it a monolithic command or thousands of smaller commands, clang simply does more than any compiler that existed in 1979 or even all compilers that existed in 1979 combined. It's easy to say that things were simpler in 1979 and that us modern programmers have lost our way. It's harder to propose a design that's actually much simpler and could really get adopted. It's impossible that such a design could maintain all of the existing functionality and configurability and be as simple as something from 1979.</p> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:P">Since its inception, curl has gone from supporting 3 protocols to 40. 
Does that mean it does 40 things and it would be more &quot;UNIX-y&quot; to split it up into 40 separate commands? Depends on who you ask. If each protocol were its own command, created and maintained by a different person, we'd be in the same situation we are with other commands. Inconsistent command line options, inconsistent output formats despite it all being text streams, etc. Would that be closer to the simplicity McIlroy advocates for? Depends on who you ask. <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> </ol> </div> Suspicious discontinuities discontinuities/ Tue, 18 Feb 2020 00:00:00 +0000 discontinuities/ <p>If you read any personal finance forums late last year, there's a decent chance you ran across a question from someone who was desperately trying to lose money before the end of the year. There are a number of ways someone could do this; one commonly suggested scheme was to buy <a href="https://en.wikipedia.org/wiki/Put_option">put options that were expected to expire worthless</a>, allowing the buyer to (probably) take a loss.</p> <p>One reason people were looking for ways to lose money was that, in the U.S., there's <a href="https://en.wikipedia.org/wiki/Patient_Protection_and_Affordable_Care_Act#Subsidy_Cliff_at_400%_FPL">a hard income cutoff for a health insurance subsidy</a> at $48,560 for individuals (higher for larger households; $100,400 for a family of four). There are a number of factors that can cause the details to vary (age, location, household size, type of plan), but across all circumstances, it wouldn't have been uncommon for an individual going from one side of the cut-off to the other to have their health insurance cost increase by roughly $7200/yr. That means if an individual buying ACA insurance was going to earn $55k, they'd be better off reducing their income by $6440 and getting under the $48,560 subsidy ceiling than they are earning $55k.</p> <p>Although that's an unusually severe example, <a href="http://www.cbo.gov/sites/default/files/cbofiles/attachments/11-15-2012-MarginalTaxRates.pdf">U.S. tax policy is full of discontinuities that disincentivize increasing earnings and, in some cases, actually incentivize decreasing earnings</a>. Some other discontinuities are the <a href="https://en.wikipedia.org/wiki/Temporary_Assistance_for_Needy_Families">TANF</a> income limit, the <a href="https://en.wikipedia.org/wiki/Medicaid">Medicaid</a> income limit, the <a href="https://en.wikipedia.org/wiki/Children%27s_Health_Insurance_Program">CHIP</a> income limit for free coverage, and the CHIP income limit for reduced-cost coverage. These vary by location and circumstance; the TANF and Medicaid income limits fall into ranges generally considered to be &quot;low income&quot; and the CHIP limits fall into ranges generally considered to be &quot;middle class&quot;. These subsidy discontinuities have the same impact as the ACA subsidy discontinuity -- at certain income levels, people are incentivized to lose money.</p> <blockquote> <p>Anyone may arrange his affairs so that his taxes shall be as low as possible; he is not bound to choose that pattern which best pays the treasury. There is not even a patriotic duty to increase one's taxes. Over and over again the Courts have said that there is nothing sinister in so arranging affairs as to keep taxes as low as possible. 
Everyone does it, rich and poor alike and all do right, for nobody owes any public duty to pay more than the law demands.</p> </blockquote> <p>If you agree with the famous <a href="https://en.wikipedia.org/wiki/Learned_Hand">Learned Hand</a> quote then losing money in order to reduce effective tax rate, increasing disposable income, is completely legitimate behavior at the individual level. However, a tax system that encourages people to lose money, perhaps by funneling it to (on average) much wealthier options traders by buying put options, seems sub-optimal.</p> <p>A simple fix for the problems mentioned above would be to have slow phase-outs instead of sharp thresholds. Slow phase-outs are actually done for some subsidies and, while that can also have problems, they are typically less problematic than introducing a sharp discontinuity in tax/subsidy policy.</p> <p>In this post, we'll look at a variety of discontinuities.</p> <h3 id="hardware-or-software-queues">Hardware or software queues</h3> <p>A naive queue has discontinuous behavior. If the queue is full, new entries are dropped. If the queue isn't full, new entries are not dropped. Depending on your goals, this can often have impacts that are non-ideal. For example, in networking, a naive queue might be considered &quot;unfair&quot; to bursty workloads that have low overall bandwidth utilization because workloads that have low bandwidth utilization &quot;shouldn't&quot; suffer more drops than workloads that are less bursty but use more bandwidth (this is also arguably not unfair, depending on what your goals are).</p> <p>A class of solutions to this problem are <a href="https://en.wikipedia.org/wiki/Random_early_detection">random early drop</a> and its variants, which gives incoming items a probability of being dropped which can be determined by queue fullness (and possibly other factors), smoothing out the discontinuity and mitigating issues caused by having a discontinuous probability of queue drops.</p> <p><a href="//danluu.com/randomize-hn/">This post on voting in link aggregators</a> is fundamentally the same idea although, in some sense, the polarity is reversed. There's a very sharp discontinuity in how much traffic something gets based on whether or not it's on the front page. You could view this as a link getting dropped from a queue if it only receives N-1 votes and not getting dropped if it receives N votes.</p> <h3 id="college-admissions-and-pell-grant-recipients-https-www-insidehighered-com-admissions-article-2019-01-28-study-pressure-enroll-more-pell-eligible-students-has-skewed-colleges"><a href="https://www.insidehighered.com/admissions/article/2019/01/28/study-pressure-enroll-more-pell-eligible-students-has-skewed-colleges">College admissions and Pell Grant recipients</a></h3> <p><a href="https://en.wikipedia.org/wiki/Pell_Grant">Pell Grants</a> started getting used as a proxy for how serious schools are about helping/admitting low-income students. The first order impact is that students above the Pell Grant threshold had a significantly reduced probability of being admitted while students below the Pell Grant threshold had a significantly higher chance of being admitted. Phrased that way, it sounds like things are working as intended.</p> <p>However, when we look at what happens within each group, we see outcomes that are the opposite of what we'd want if the goal is to benefit students from low income families. 
Among people who don't qualify for a Pell Grant, it's those with the lowest income who are the most severely impacted and have the most severely reduced probability of admission. Among people who do qualify, it's those with the highest income who are most likely to benefit, again the opposite of what you'd probably want if your goal is to benefit students from low income families.</p> <p>We can see this in the graphs below, which are histograms of parental income among students at two universities in 2008 (first graph) and 2016 (second graph), where the red line indicates the Pell Grant threshold.</p> <p><img src="images/discontinuities/pell-2008.jpg" alt="Histogram of income distribution of students at two universities in 2008; high incomes are highly overrepresented relative to the general population, but the distribution is smooth" height="816" width="1056"></p> <p><img src="images/discontinuities/pell-2016.jpg" alt="Histogram of income distribution of students at two universities in 2016; high incomes are still highly overrepresented, there's also a sharp discontinuity at the Pell grant threshold; plot looks roughly two upwards sloping piecewise linear functions, with a drop back to nearly 0 at the discontinuity at the Pell grant threshold" height="816" width="1056"></p> <p>A second order effect of universities optimizing for Pell Grant recipients is that savvy parents can do the same thing that some people do to cut their taxable income at the last minute. Someone might put money into a traditional IRA instead of a Roth IRA and, if they're at their IRA contribution limit, they can try to lose money on options, effectively transferring money to options traders who are likely to be wealthier than them, in order to bring their income below the Pell Grant threshold, increasing the probability that their children will be admitted to a selective school.</p> <h3 id="election-statistics-https-arxiv-org-pdf-1410-6059-pdf"><a href="https://arxiv.org/pdf/1410.6059.pdf">Election statistics</a></h3> <p>The following histograms of Russian elections across polling stations show curious spikes in turnout and results at nice, round numbers (e.g., 95%) starting around 2004. This appears to indicate that there's election fraud via fabricated results and that at least some of the people fabricating results don't bother with fabricating results that have a smooth distribution.</p> <p><img src="images/discontinuities/russian-elections.png" height="1418" width="1822"></p> <p>For finding fraudulent numbers, also see <a href="https://en.wikipedia.org/wiki/Benford%27s_law">Benford's law</a>.</p> <h3 id="used-car-sale-prices-https-www-ftc-gov-sites-default-files-documents-public-events-3rd-annual-microeconomics-conference-lacetera-slide-pdf"><a href="https://www.ftc.gov/sites/default/files/documents/public_events/3rd-annual-microeconomics-conference/lacetera_slide.pdf">Used car sale prices</a></h3> <p><a href="https://twitter.com/ainsworld/status/1436418752409706519">Mark Ainsworth points out that there are discontinuities</a> at $10k boundaries in U.S. auto auction sales prices as well as volume of vehicles offered at auction. 
The price graph below adjusts for a number of factors such as model year, but we can see the same discontinuities in the raw unadjusted data.</p> <p><img src="images/discontinuities/car-prices.png" alt="Graph of car sales prices at auction, showing discontinuities described above" height="1670" width="942"></p> <p><img src="images/discontinuities/car-volumes.png" alt="Graph of car volumes at auction, showing discontinuities described above for dealer sales to auction but not fleet sales to auction" height="1710" width="992"></p> <h3 id="p-values-https-en-wikipedia-org-wiki-p-value"><a href="https://en.wikipedia.org/wiki/P-value">p-values</a></h3> <p>Authors of psychology papers are incentivized to produce papers with <a href="https://en.wikipedia.org/wiki/P-value">p values</a> below some threshold, usually 0.05, but sometimes 0.1 or 0.01. <a href="https://www.ncbi.nlm.nih.gov/pubmed/22853650">Masicampo et al. plotted p values from papers published in three psychology journals</a> and found a curiously high number of papers with p values just below 0.05.</p> <p><img src="images/discontinuities/p-value.png" alt="Histogram of published p-values; spike at p=0.05" height="1000" width="1874"></p> <p>The spike at p = 0.05 is consistent with a number of hypotheses that aren't great, such as:</p> <ul> <li>Authors are fudging results to get p = 0.05</li> <li>Journals are much more likely to accept a paper with p = 0.05 than if p = 0.055</li> <li>Authors are much less likely to submit results if p = 0.055 than if p = 0.05</li> </ul> <p><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4359000/">Head et al. (2015)</a> surveys the evidence across a number of fields.</p> <p>Andrew Gelman and others have been campaigning to get rid of the idea of statistical significance and p-value thresholds for years, <a href="https://stat.columbia.edu/~gelman/research/unpublished/abandon.pdf">see this paper for a short summary of why</a>. Not only would this reduce the incentive for authors to cheat on p values, there are other reasons to not want a bright-line rule to determine if something is &quot;significant&quot; or not.</p> <h3 id="drug-charges-http-econweb-umd-edu-tuttle-files-tuttle-mandatory-minimums-pdf"><a href="http://econweb.umd.edu/~tuttle/files/tuttle_mandatory_minimums.pdf">Drug charges</a></h3> <p>The top two graphs in this set of four show histograms of the amount of cocaine people were charged with possessing before and after the passing of the Fair Sentencing Act in 2010, which raised the amount of cocaine necessary to trigger the 10-year mandatory minimum prison sentence for possession from 50g to 280g. There's a relatively smooth distribution before 2010 and a sharp discontinuity after 2010.</p> <p>The bottom-left graph shows the sharp spike in prosecutions at 280 grams followed by what might be a drop in 2013 after evidentiary standards were changed<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup>.</p> <p><img src="images/discontinuities/cocaine-280.png" height="1118" width="1408"></p> <h3 id="high-school-exit-exam-scores-danluu-com-matura-2013-pdf"><a href="//danluu.com/matura-2013.pdf">High school exit exam scores</a></h3> <p>This is a histogram of high school exit exam scores from the Polish language exam. We can see that a curiously high number of students score 30 or just above 30, while a curiously low number of students score from 23-29. 
This is from 2013; other years I've looked at (2010-2012) show a similar discontinuity.</p> <p>Math exit exam scores don't exhibit any unusual discontinuities in the years I've examined (2010-2013).</p> <p><img src="images/discontinuities/matura-polish-2013.png" height="772" width="1416"></p> <p><a href="https://www.reddit.com/r/dataisbeautiful/comments/1bqf9r/unusual_distributions_of_scores_on_final/c994zxt/">An anonymous reddit commenter explains this</a>:</p> <blockquote> <p>When a teacher is grading matura (final HS exam), he/she doesn't know whose test it is. The only things that are known are: the number (code) of the student and the district which matura comes from (it is usually from completely different part of Poland). The system is made to prevent any kind of manipulation, for example from time to time teachers supervisor will come to check if test are graded correctly. I don't wanna talk much about system flaws (and advantages), it is well known in every education system in the world where final tests are made, but you have to keep in mind that there is a key, which teachers follow very strictly when grading.</p> <p>So, when a score of the test is below 30%, exam is failed. However, before making final statement in protocol, a commision of 3 (I don't remember exact number) is checking test again. This is the moment, where difference between humanities and math is shown: teachers often try to find a one (or a few) missing points, so the test won't be failed, because it's a tragedy to this person, his school and somewhat fuss for the grading team. Finding a &quot;missing&quot; point is not that hard when you are grading writing or open questions, which is a case in polish language, but nearly impossible in math. So that's the reason why distribution of scores is so different.</p> </blockquote> <p>As with p values, having a bright-line threshold, causes curious behavior. In this case, scoring below 30 on any subject (a 30 or above is required in every subject) and failing the exam has arbitrary negative effects for people, so teachers usually try to prevent people from failing if there's an easy way to do it, but a deeper root of the problem is the idea that it's necessary to produce a certification that's the discretization of a continuous score.</p> <h3 id="birth-month-and-sports">Birth month and sports</h3> <p>These are scatterplots of football (soccer) players in the <a href="https://en.wikipedia.org/wiki/UEFA_Youth_League">UEFA Youth League</a>. The x-axis on both of these plots is how old players are modulo the year, i.e., their birth month normalized from 0 to 1.</p> <p>The graph on the left is a histogram, which shows that there is a very strong relationship between where a person's birth falls within the year and their odds of making a club at the UEFA Youth League (U19) level. The graph on the right purports to show that birth time is only weakly correlated with actual value provided on the field. The authors use playing time as a proxy for value, presumably because it's easy to measure. 
That's not a great measure, but the result they find (younger-within-the-year players have higher value, conditional on making the U19 league) is consistent with other studies on sports and discrimination, which find (for example) that <a href="tech-discrimination/">black baseball players were significantly better than white baseball players for decades after desegregation in baseball, French-Canadian defensemen are also better than average (French-Canadians are stereotypically afraid to fight, don't work hard enough, and are too focused on offense)</a>.</p> <p>The discontinuity isn't directly shown in the graphs above because the graphs only show birth date for one year. If we were to plot birth date by cohort across multiple years, we'd expect to see a sawtooth pattern in the probability that a player makes it into the UEFA youth league with a 10x difference between someone born one day before vs. after the threshold.</p> <p><img src="images/discontinuities/u19-age.png" height="1030" width="1758"></p> <p>This phenomenon, that birth day or month is a good predictor of participation in higher-level youth sports as well as pro sports, has been studied across a variety of sports.</p> <p>It's generally believed that this is caused by a discontinuity in youth sports:</p> <ol> <li>Kids are bucketed into groups by age in years and compete against people in the same year</li> <li>Within a given year, older kids are stronger, faster, etc., and perform better</li> <li>This causes older-within-year kids to outcompete younger kids, which later results in older-within-year kids having higher levels of participation for a variety of reasons</li> </ol> <p>This is arguably a &quot;bug&quot; in how youth sports works. But <a href="//danluu.com/bad-decisions/">as we've seen in baseball</a> <a href="//danluu.com/tech-discrimination/">as well as a survey of multiple sports</a>, obviously bad decision making that costs individual teams tens or even hundreds of millions of dollars can persist for decades in the face of people publicly discussing how bad the decisions are. In this case, the youth sports teams aren't feeder teams to pro teams, so they don't have a financial incentive to select players who are skilled for their age (as opposed to just taller and faster because they're slightly older), which makes this system-wide non-optimality even more difficult to fix than pro sports teams making locally non-optimal decisions that are completely under their control.</p> <h3 id="procurement-auctions-http-www-keikawai-com-full-0804-pdf"><a href="http://www.keikawai.com/Full_0804.pdf">Procurement auctions</a></h3> <p>Kawai et al. looked at Japanese government procurement in order to find suspicious patterns of bids like the ones described in <a href="https://www.nber.org/papers/w4013">Porter et al. (1993)</a>, which looked at collusion in procurement auctions on Long Island (in New York in the United States). One example that's given is:</p> <blockquote> <p>In February 1983, the New York State Department of Transportation (DoT) held a procurement auction for resurfacing 0.8 miles of road. The lowest bid in the auction was $4 million, and the DoT decided not to award the contract because the bid was deemed too high relative to its own cost estimates. The project was put up for a reauction in May 1983 in which all the bidders from the initial auction participated. The lowest bid in the reauction was 20% higher than in the initial auction, submitted by the previous low bidder. Again, the contract was not awarded. 
The DoT held a third auction in February 1984, with the same set of bidders as in the initial auction. The lowest bid in the third auction was 10% higher than the second time, again submitted by the same bidder. The DoT apparently thought this was suspicious: “It is notable that the same firm submitted the low bid in each of the auctions. Because of the unusual bidding patterns, the contract was not awarded through 1987.”</p> </blockquote> <p>It could be argued that this is expected because different firms have different cost structures, so the lowest bidder in an auction for one particular project should be expected to be the lowest bidder in subsequent auctions for the same project. In order to distinguish between collusion and real structural cost differences between firms, Kawai et al. (2015) looked at auctions where the difference in bid between the first and second place firms was very small, making the winner effectively random.</p> <p>In the auction structure studied, bidders submit a secret bid. If the secret bid is above a secret minimum, then the lowest bidder wins the auction and gets the contract. If not, the lowest bid is revealed to all bidders and another round of bidding is done. Kawai et al. found that, in about 97% of auctions, the bidder who submitted the lowest bid in the first round also submitted the lowest bid in the second round (the probability that the second lowest bidder remains second lowest was 26%).</p> <p>Below is a histogram of the difference in first and second round bids between the first-lowest and second-lowest bidders (left column) and the second-lowest and third-lowest bidders (right column). Each row has a different filtering criterion for how close the auction has to be in order to be included. In the top row, all auctions that reached the third round were included; in the second and third rows, the normalized delta between the first and second bidders was less than 0.05 and 0.01, respectively; in the last row, the normalized delta between the first and the third bidder was less than 0.03. All numbers are normalized because the absolute size of auctions can vary.</p> <p><img src="images/discontinuities/jp-procurement-bidding.png" height="1446" width="1682"></p> <p>We can see that the distributions of deltas between the first and second round are roughly symmetrical when comparing second and third lowest bidders. But when comparing first and second lowest bidders, there's a sharp discontinuity at zero, indicating that the second-lowest bidder almost never lowers their bid by more than the first-lowest bidder did. If you read the paper, you can see that the same structure persists into auctions that go into a third round.</p> <p>I don't mean to pick on Japanese procurement auctions in particular. 
There's an extensive literature on procurement auctions that's found collusion in many cases, often much more blatant than the case presented above (e.g., there are a few firms and they round-robin who wins across auctions, or there are a handful of firms and every firm except for the winner puts in the same losing bid).</p> <h3 id="restaurant-inspection-https-iquantny-tumblr-com-post-76928412519-think-nyc-restaurant-grading-is-flawed-heres-scores-http-datafra-me-blog-calling-out-nyc-restaurant-violations"><a href="https://iquantny.tumblr.com/post/76928412519/think-nyc-restaurant-grading-is-flawed-heres">Restaurant inspection</a> <a href="http://datafra.me/blog/Calling-out-NYC-restaurant-violations">scores</a></h3> <p>The histograms below show a sharp discontinuity between 13 and 14, which is the difference between an A grade and a B grade. It appears that some regions also have a discontinuity between 27 and 28, which is the difference between a B and a C, and <a href="https://iquantny.tumblr.com/post/76928412519/think-nyc-restaurant-grading-is-flawed-heres">this older analysis from 2014</a> found what appears to be a similar discontinuity between B and C grades.</p> <p><img src="images/discontinuities/nyc-restaurant-inspections.png" height="900" width="1260"></p> <p>Inspectors have discretion in what violations are tallied and it appears that there are cases where restaurants are nudged up to the next higher grade.</p> <h3 id="marathon-finishing-times-https-faculty-chicagobooth-edu-devin-pope-research-pdf-website-marathons-pdf"><a href="https://faculty.chicagobooth.edu/devin.pope/research/pdf/Website_Marathons.pdf">Marathon finishing times</a></h3> <p>A histogram of marathon finishing times (finish times on the x-axis, count on the y-axis) across 9,789,093 finishes shows noticeable discontinuities at every half hour, as well as at &quot;round&quot; times like :10, :15, and :20.</p> <p><img src="images/discontinuities/marathon-times.png" height="1094" width="1530"></p> <p>An analysis of times within each race (<a href="faculty.chicagobooth.edu/devin.pope/research/pdf/Website_Marathons.pdf">see section 4.4, figures 7-9</a>) indicates that this is at least partially because people speed up (or slow down less than usual) towards the end of races if they're close to a &quot;round&quot; time<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">2</a></sup>.</p> <h3 id="notes">Notes</h3> <p>This post doesn't really have a goal or a point; it's just a collection of discontinuities that I find fun.</p> <p>One thing that's maybe worth noting is that I've gotten a lot of mileage in my career both out of being suspicious of discontinuities and figuring out where they come from and also out of applying standard techniques to smooth out discontinuities.</p> <p>For finding discontinuities, basic tools like &quot;drawing a scatterplot&quot;, &quot;<a href="//danluu.com/perf-tracing/#histogram">drawing a histogram</a>&quot;, &quot;drawing the <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function">CDF</a>&quot; often come in handy. Other kinds of visualizations that add temporality, like <a href="https://github.com/Netflix/flamescope">flamescope</a>, can also be useful.</p> <p>We <a href="#hardware-or-software-queues">noted above that queues create a kind of discontinuity that, in some circumstances, should be smoothed out</a>. 
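</p> <p>As a concrete illustration of what smoothing that queue discontinuity looks like, here's a minimal sketch comparing a hard drop threshold with a probabilistic one. This is illustrative Python, not any particular RED implementation; real variants typically work on a smoothed average queue length with more carefully chosen parameters, and the linear ramp and thresholds below are arbitrary.</p> <pre><code>import random

# Naive queue: discontinuous behavior. An arrival is kept or dropped based
# only on whether the queue is full.
def naive_enqueue(queue, item, capacity):
    if len(queue) &gt;= capacity:
        return False  # drop
    queue.append(item)
    return True

# RED-style queue: the drop probability ramps up smoothly with fullness,
# so arrivals start seeing occasional drops before the queue is completely full.
def red_enqueue(queue, item, capacity, min_fill=0.5):
    fill = len(queue) / capacity
    if fill &gt;= 1.0:
        return False  # still have to drop when actually full
    if fill &gt; min_fill:
        # linear ramp from 0 at min_fill to 1 at completely full
        drop_probability = (fill - min_fill) / (1.0 - min_fill)
        if random.random() &lt; drop_probability:
            return False
    queue.append(item)
    return True
</code></pre> <p>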
We also noted that we see similar behavior for other kinds of thresholds and that randomization can be a useful tool to smooth out discontinuities in thresholds as well. Randomization can also be used to allow for reducing quantization error when reducing precision with ML and in other applications.</p> <p><small>Thanks to Leah Hanson, Omar Rizwan, Dmitry Belenko, Kamal Marhubi, Danny Vilea, Nick Roberts, Lifan Zeng, Mark Ainsworth, Wesley Aptekar-Cassels, Thomas Hauk, @BaudDev, and Michael Sullivan for comments/corrections/discussion.</p> <p>Also, please feel free to <a href="https://twitter.com/danluu/status/1230595642390564866">send me other interesting discontinuities</a>! </small></p> <p><link rel="prefetch" href="//danluu.com/randomize-hn/"> <link rel="prefetch" href="//danluu.com/bad-decisions/"> <link rel="prefetch" href="//danluu.com/tech-discrimination/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R"><p>Most online commentary I've seen about this paper is incorrect. I've seen this paper used as evidence of police malfeasance because the amount of cocaine seized jumped to 280g. This is the opposite of what's described in the paper, where the author notes that, based on drug seizure records, amounts seized do not appear to be the cause of this change. After noting that drug seizures are not the cause, the author notes that prosecutors can charge people for amounts that are not the same as the amount seized and then notes:</p> <blockquote> <p>I do find bunching at 280g after 2010 in case management data from the Executive Office of the US Attorney (EOUSA). I also find that approximately 30% of prosecutors are responsible for the rise in cases with 280g after 2010, and that there is variation in prosecutor-level bunching both within and between districts. Prosecutors who bunch cases at 280g also have a high share of cases right above 28g after 2010 (the 5-year threshold post-2010) and a high share of cases above 50g prior to 2010 (the 10-year threshold pre-2010). Also, bunching above a mandatory minimum threshold persists across districts for prosecutors who switch districts. Moreover, when a “bunching” prosecutor switches into a new district, all other attorneys in that district increase their own bunching at mandatory minimums. These results suggest that the observed bunching at sentencing is specifically due to prosecutorial discretion</p> </blockquote> <p>This is mentioned in the abstract and then expounded on in the introduction (the quoted passage is from the introduction), so I think that most people commenting on this paper can't have read it. I've done a few surveys of comments on papers on blog posts and I generally find that, in cases where it's possible to identify this (e.g., when the post is mistitled), the vast majority of commenters can't have read the paper or post they're commenting on, but that's a topic for another post.</p> <p>There is some evidence that something fishy may be going on in seizures (e.g., see Fig. A8.(c)), but if the analysis in the paper is correct, that impact of that is much smaller than the impact of prosecutorial discretion.</p> <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:M">One of the most common comments I've seen online about this graph and/or this paper is that this is due to pace runners provided by the marathon. Section 4.4 of the paper gives multiple explanations for why this cannot be the case, once again indicating that people tend to comment without reading the paper. 
<a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> </ol> </div> 95%-ile isn't that good p95-skill/ Fri, 07 Feb 2020 00:00:00 +0000 p95-skill/ <p>Reaching 95%-ile isn't very impressive because it's not that hard to do. I think this is one of my most ridiculable ideas. It doesn't help that, when stated nakedly, that sounds elitist. But I think it's just the opposite: most people can become (relatively) good at most things.</p> <p>Note that when I say 95%-ile, I mean 95%-ile among people who participate, not all people (for many activities, just doing it at all makes you 99%-ile or above across all people). I'm also not referring to 95%-ile among people who practice regularly. The &quot;<a href="https://en.wikipedia.org/wiki/One_weird_trick_advertisements">one weird trick</a>&quot; is that, for a lot of activities, being something like 10%-ile among people who practice can make you something like 90%-ile or 99%-ile among people who participate.</p> <p>This post is going to refer to specifics since <a href="https://twitter.com/danluu/status/926493613097439232">the discussions I've seen about this are all in the abstract</a>, which turns them into Rorschach tests. For example, Scott Adams has a widely cited post claiming that it's better to be a generalist than a specialist because, to become &quot;extraordinary&quot;, you have to either be &quot;the best&quot; at one thing or 75%-ile at two things. If that were strictly true, it would surely be better to be a generalist, but that's of course exaggeration and it's possible to get a lot of value out of a specialized skill without being &quot;the best&quot;; since the precise claim, as written, is obviously silly and the rest of the post is vague handwaving, discussions will inevitably devolve into people stating their prior beliefs and basically ignoring the content of the post.</p> <p>Personally, in every activity I've participated in where it's possible to get a rough percentile ranking, people who are 95%-ile constantly make mistakes that seem like they should be easy to observe and correct. &quot;Real world&quot; activities typically can't be reduced to a percentile rating, but achieving what appears to be a similar level of proficiency seems similarly easy.</p> <p>We'll start by looking at Overwatch (a video game) in detail because it's an activity I'm familiar with where it's easy to get ranking information and observe what's happening, and then we'll look at some &quot;real world&quot; examples where we can observe the same phenomena, although we won't be able to get ranking information for real world examples<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <h3 id="overwatch">Overwatch</h3> <p>At 90%-ile and 95%-ile ranks in Overwatch, the vast majority of players will pretty much constantly make basic game losing mistakes. These are simple mistakes like standing next to the objective instead of on top of the objective while the match timer runs out, turning a probable victory into a certain defeat. 
See the attached footnote if you want enough detail about specific mistakes that you can decide for yourself if a mistake is &quot;basic&quot; or not<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">2</a></sup>.</p> <p>Some reasons we might expect this to happen are:</p> <ol> <li>People don't want to win or don't care about winning</li> <li>People understand their mistakes but haven't put in enough time to fix them</li> <li>People are untalented</li> <li>People don't understand how to spot their mistakes and fix them</li> </ol> <p>In Overwatch, you may see a lot of (1), people who don’t seem to care about winning, at lower ranks, but by the time you get to 30%-ile, it's common to see people indicate their desire to win in various ways, such as yelling at players who are perceived as uncaring about victory or unskilled, complaining about people who they perceive to make mistakes that prevented their team from winning, etc.<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">3</a></sup>. Other than the occasional troll, it's not unreasonable to think that people are generally trying to win when they're severely angered by losing.</p> <p>(2), not having put in time enough to fix their mistakes will, at some point, apply to all players who are improving, but if you look at the median time played at 50%-ile, people who are stably ranked there have put in hundreds of hours (and the median time played at higher ranks is higher). Given how simple the mistakes we're discussing are, not having put in enough time cannot be the case for most players.</p> <p>A common complaint among low-ranked Overwatch players in Overwatch forums is that they're just not talented and can never get better. Most people probably don't have the talent to play in a professional league regardless of their practice regimen, but when you can get to 95%-ile by fixing mistakes like &quot;not realizing that you should stand on the objective&quot;, you don't really need a lot of talent to get to 95%-ile.</p> <p>While (4), people not understanding how to spot and fix their mistakes, isn't the only other possible explanation<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">4</a></sup>, I believe it's the most likely explanation for most players. Most players who express frustration that they're stuck at a rank up to maybe 95%-ile or 99%-ile don't seem to realize that they could drastically improve by observing their own gameplay or having someone else look at their gameplay.</p> <p>One thing that's curious about this is that Overwatch makes it easy to spot basic mistakes (compared to most other activities). After you're killed, the game shows you how you died from the viewpoint of the player who killed you, allowing you to see what led to your death. Overwatch also records the entire game and lets you watch a replay of the game, allowing you to figure out what happened and why the game was won or lost. In many other games, you'd have to set up recording software to be able to view a replay.</p> <p>If you read Overwatch forums, you'll see a regular stream of posts that are basically &quot;I'm SOOOOOO FRUSTRATED! I've played this game for 1200 hours and I'm still ranked 10%-ile, [some Overwatch specific stuff that will vary from player to player]&quot;. Another user will inevitably respond with something like &quot;we can't tell what's wrong from your text, please post a video of your gameplay&quot;. 
In the cases where the original poster responds with a recording of their play, people will post helpful feedback that will immediately make the player much better if they take it seriously. If you follow these people who ask for help, you'll often see them ask for feedback at a much higher rank (e.g., moving from 10%-ile to 40%-ile) shortly afterwards. It's nice to see that the advice works, but it's unfortunate that so many players don't realize that watching their own recordings or posting recordings for feedback could have saved 1198 hours of frustration.</p> <p>It appears to be common for Overwatch players (well into 95%-ile and above) to:</p> <ul> <li>Want to improve</li> <li>Not get feedback</li> <li>Improve slowly when getting feedback would make improving quickly easy</li> </ul> <p>Overwatch provides the tools to make it relatively easy to get feedback, but people who very strongly express a desire to improve don't avail themselves of these tools.</p> <h3 id="real-life">Real life</h3> <p>My experience is that other games are similar and I think that &quot;real life&quot; activities aren't so different, although there are some complications.</p> <p>One complication is that real life activities tend not to have a single, one-dimensional, objective to optimize for. Another is that what makes someone good at a real life activity tends to be poorly understood (by comparison to games and sports) even in relation to a specific, well defined, goal.</p> <p>Games with rating systems are easy to optimize for: your meta-goal can be to get a high rating, which can typically be achieved by increasing your win rate by fixing the kinds of mistakes described above, like not realizing that you should step onto the objective. For any particular mistake, you can even make a reasonable guess at the impact on your win rate and therefore the impact on your rating.</p> <p>In real life, if you want to be (for example) &quot;a good speaker&quot;, that might mean that you want to give informative talks that help people learn or that you want to give entertaining talks that people enjoy or that you want to give keynotes at prestigious conferences or that you want to be asked to give talks for $50k an appearance. Those are all different objectives, with different strategies for achieving them and for some particular mistake (e.g., spending 8 minutes on introducing yourself during a 20 minute talk), it's unclear what that means with respect to your goal.</p> <p>Another thing that makes games, at least mainstream ones, easy to optimize for is that they tend to have a lot of aficionados who have obsessively tried to figure out what's effective. This means that if you want to improve, unless you're trying to be among the top in the world, you can simply figure out what resources have worked for other people, pick one up, read/watch it, and then practice. For example, if you want to be 99%-ile in a trick-taking card game like bridge or spades (among all players, not subgroups like &quot;ACBL players with masterpoints&quot; or &quot;people who regularly attend North American Bridge Championships&quot;), you can do this by:</p> <ul> <li>learning the basics of the game</li> <li>reading <a href="https://amzn.to/2sUtmsk">a beginner book on cardplay</a></li> <li>practicing applying the material</li> </ul> <p>If you want to become a good speaker and you have a specific definition of “a good speaker” in mind, there still isn't an obvious path forward. 
Great speakers will give directly contradictory advice (e.g., avoid focusing on presentation skills vs. practice presentation skills). Relatively few people obsessively try to improve and figure out what works, which results in a lack of rigorous curricula for improving. However, this also means that it's easy to improve in percentile terms since <a href="https://twitter.com/danluu/status/1442945072144678914">relatively few people are trying to improve at all</a>.</p> <p>Despite all of the caveats above, my belief is that it's easier to become relatively good at real life activities than at games or sports because there's so little deliberate practice put into most real life activities. Just for example, if you're a local table tennis hotshot who can beat every rando at a local bar, when you challenge someone to a game and they say &quot;sure, what's your rating?&quot; you know you're in for a shellacking by someone who can probably beat you while playing with a shoe brush (something that actually happened to a friend of mine, BTW). You're probably 99%-ile, but someone with no talent who's put in the time to practice the basics is going to have a serve that you can't return as well as be able to kill any shot a local bar expert is able to consistently hit. In most real life activities, there's almost no one who puts in a level of deliberate practice equivalent to someone who goes down to their local table tennis club and practices two hours a week, let alone someone like a top pro, who might seriously train for four hours a day.</p> <p>To give a couple of concrete examples, I helped <a href="http://blog.leahhanson.us/speaking.html">Leah</a> prepare for talks from 2013 to 2017. The first couple practice talks she gave were about the same as you'd expect if you walked into a random talk at a large tech conference. For the first couple years she was speaking, she did something like 30 or so practice runs for each public talk, of which I watched and gave feedback on half. Her first public talk was (IMO) well above average for a talk at a large, well regarded tech conference and her talks got better from there until she stopped speaking in 2017.</p> <p>As we discussed above, this is more subjective than game ratings and there's no way to really determine a percentile, but if you look at how most people prepare for talks, it's not too surprising that Leah was above average. At one of the first conferences she spoke at, the night before the conference, we talked to another speaker who mentioned that they hadn't finished their talk yet and only had fifteen minutes of material (for a forty minute talk). They were trying to figure out how to fill the rest of the time. That kind of preparation isn't unusual and the vast majority of talks prepared like that aren't great.</p> <p>Most people consider doing 30 practice runs for a talk to be absurd, a totally obsessive amount of practice, but I think Gary Bernhardt has it right when he says that, if you're giving a 30-minute talk to a 300 person audience, that's 150 person-hours watching your talk, so it's not obviously unreasonable to spend 15 hours practicing (and 30 practice runs will probably be less than 15 hours since you can cut a number of the runs short and/or repeatedly practice problem sections). 
One thing to note is that this level of practice, considered obsessive when giving a talk, still pales in comparison to the amount of time a middling table tennis club player will spend practicing.</p> <p>If you've studied pedagogy, you might say that the help I gave to Leah was incredibly lame. It's known that having laypeople try to figure out how to improve among themselves is among the worst possible ways to learn something; good instruction is more effective, and having a skilled coach or teacher give one-on-one instruction is more effective still<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">5</a></sup>. That's 100% true: my help was incredibly lame. However, most people aren't going to practice a talk more than a couple times and many won't even practice a single time (I don't have great data proving this; it's from informally polling speakers at conferences I've attended). This makes Leah's 30 practice runs an extraordinary amount of practice compared to most speakers, which resulted in a relatively good outcome even though we were using one of the worst possible techniques for improvement.</p> <p>My writing is another example. I'm not going to compare myself to anyone else, but my writing improved dramatically the first couple of years I wrote this blog just because I spent a little bit of effort on getting and taking feedback.</p> <p>Leah read one or two drafts of almost every post and gave me feedback. On the first posts, since neither one of us knew anything about writing, we had a hard time identifying what was wrong. If I had some awkward prose or confusing narrative structure, we'd be able to point at it and say &quot;that looks wrong&quot; without being able to describe what was wrong or suggest a fix. It was like, in the era before spellcheck, when you misspelled a word and could tell that something was wrong, but every permutation you came up with was just as wrong.</p> <p>My fix for that was to hire a professional editor whose writing I respected with the instructions &quot;I don't care about spelling and grammar fixes, there are fundamental problems with my writing that I don't understand, please explain to me what they are&quot;<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">6</a></sup>. I think this was more effective than my helping Leah with talks because we got someone who's basically a professional coach involved. An example of something my editor helped us with was giving us a vocabulary we could use to discuss structural problems, the way <a href="https://amzn.to/2GPPmYE">design patterns</a> gave people a vocabulary to talk about OO design.</p> <h3 id="back-to-this-blog-s-regularly-scheduled-topic-programming">Back to this blog's regularly scheduled topic: programming</h3> <p>Programming is similar to the real life examples above in that it's impossible to assign a rating or calculate percentiles or anything like that, but it is still possible to make significant improvements relative to your former self without too much effort by getting feedback on what you're doing.</p> <p>For example, <a href="https://twitter.com/danluu/status/926492239081197569">here's one thing Michael Malis did</a>:</p> <blockquote> <p>One incredibly useful exercise I’ve found is to watch myself program. Throughout the week, I have a program running in the background that records my screen. At the end of the week, I’ll watch a few segments from the previous week. 
Usually I will watch the times that felt like it took a lot longer to complete some task than it should have. While watching them, I’ll pay attention to specifically where the time went and figure out what I could have done better. When I first did this, I was really surprised at where all of my time was going.</p> <p>For example, previously when writing code, I would write all my code for a new feature up front and then test all of the code collectively. When testing code this way, I would have to isolate which function the bug was in and then debug that individual function. After watching a recording of myself writing code, I realized I was spending about a quarter of the total time implementing the feature tracking down which functions the bugs were in! This was completely non-obvious to me and I wouldn’t have found it out without recording myself. Now that I’m aware that I spent so much time isolating which function the bugs are in, I now test each function as I write it to make sure they work. This allows me to write code a lot faster as it dramatically reduces the amount of time it takes to debug my code.</p> </blockquote> <p>In the past, I've spent time figuring out where time is going when I code and basically saw the same thing as in Overwatch, except instead of constantly making game-losing mistakes, I was constantly doing pointlessly time-losing things. Just getting rid of some of those bad habits has probably been at least a 2x productivity increase for me, pretty easy to measure since fixing these is basically just clawing back wasted time. For example, I noticed how I'd get distracted for N minutes if I read something on the internet when I needed to wait for two minutes, so I made sure to keep a queue of useful work to fill dead time (and if I was working on something very latency sensitive where I didn't want to task switch, I'd do nothing until I was done waiting).</p> <p>One thing to note here is that it's important to actually track what you're doing and not just guess at what you're doing. When I've recorded what people do and compared it to what they think they're doing, these are often quite different. It would generally be considered absurd to operate a complex software system without metrics or tracing, but it's normal to operate yourself without metrics or tracing, even though you're much more complex and harder to understand than the software you work on.</p> <p><a href="//danluu.com/hn-comments/#what-makes-engineers-productive-https-news-ycombinator-com-item-id-5496914">Jonathan Tang has noted that choosing the right problem dominates execution speed</a>. I don't disagree with that, <a href="productivity-velocity/">but doubling execution speed is still a decent win that's independent of selecting the right problem to work on</a>. I don't think that how to choose the right problem can be effectively described in the abstract, and the context necessary to give examples would be much longer than the already too long Overwatch examples in this post, so maybe I'll write another post that's just about that.</p> <p>Anyway, this is sort of an odd post for me to write since I think that culturally, we care a bit too much about productivity in the U.S., especially in places I've lived recently (NYC &amp; SF). 
But at a personal level, higher productivity doing work or chores doesn't have to be converted into more work or chores, it can also be converted into more vacation time or more time doing whatever you value.</p> <p>And for games like Overwatch, I don't think improving is a moral imperative; there's nothing wrong with having fun at 50%-ile or 10%-ile or any rank. But in every game I've played with a rating and/or league/tournament system, a lot of people get really upset and unhappy when they lose even when they haven't put much effort into improving. If that's the case, why not put a little bit of effort into improving and spend a little bit less time being upset?</p> <h3 id="some-meta-techniques-for-improving">Some meta-techniques for improving</h3> <ul> <li>Get feedback and practice <ul> <li>Ideally from an expert coach but, if not, this can be from a layperson or even yourself (if you have some way of recording/tracing what you're doing)</li> </ul></li> <li>Guided exercises or exercises with solutions <ul> <li>This is very easy to find in books for &quot;old&quot; games, like chess or Bridge.</li> <li>For particular areas, you can often find series of books that have these, e.g., in math, books in the Springer Undergraduate Mathematics Series (SUMS) tend to have problems with solutions</li> </ul></li> </ul> <p>Of course, these aren't novel ideas, e.g., Kotov's series of books from the 70s, Think like a Grandmaster, Play Like a Grandmaster, Train Like a Grandmaster covered these same ideas because these are some of the most obvious ways to improve.</p> <h3 id="appendix-other-most-ridiculable-ideas">Appendix: other most ridiculable ideas</h3> <p>Here are the ideas I've posted about that were the most widely ridiculed at the time of the post:</p> <ul> <li><a href="//danluu.com/startup-tradeoffs/">It's not uncommon for programmers at trendy tech companies to make $350k/yr or more</a> (2015, stated number was $250k/yr at the time)</li> <li><a href="//danluu.com/monorepo/">Monorepos can be reasonable</a> (2015)</li> <li><a href="//danluu.com/cpu-bugs/">We should expect to see a lot more CPU bugs</a> (2016)</li> <li><a href="//danluu.com/tech-discrimination/">Markets are not incompatible with discrimination</a> (2014)</li> <li><a href="//danluu.com/input-lag/">Computers are getting slower in some ways</a> (2017)</li> <li><a href="//danluu.com/empirical-pl/">Empirical evidence on the benefit of types is almost non-existent</a> (2014)</li> <li>It's reasonable to write technical posts on a subject that avoid domain-specific terminology</li> </ul> <p>My posts on compensation have the dubious distinction of being the posts most frequently called out both for being so obvious that they're pointless as well as for being laughably wrong. I suspect they're also the posts that have had the largest aggregate impact on people -- I've had a double digit number of people tell me one of the compensation posts changed their life and they now make $x00,000/yr more than they used to because they know it's possible to get a much higher paying job and I doubt that I even hear from 10% of the people who make a big change as a result of learning that it's possible to make a lot more money.</p> <p>When I wrote my first post on compensation, in 2015, I got ridiculed more for writing something obviously wrong than for writing something obvious, but the last few years things have flipped around. 
I still get the occasional bit of ridicule for being wrong when some corner of Twitter or a web forum that's well outside the HN/reddit bubble runs across my post, but the ratio of “obviously wrong” to “obvious” has probably gone from 20:1 to 1:5.</p> <p>Opinions on monorepos have also seen a similar change since 2015. Outside of some folks at big companies, monorepos used to be considered obviously stupid among people who keep up with trends, but this has really changed. Not as much as opinions on compensation, but enough that I'm now a little surprised when I meet a hardline anti-monorepo-er.</p> <p>Although it's taken longer for opinions to come around on CPU bugs, that's probably the post that now gets the least ridicule from the list above.</p> <p>That markets don't eliminate all discrimination is the one where opinions have come around the least. Hardline &quot;all markets are efficient&quot; folks aren't really convinced by academic work like <a href="https://amzn.to/2Or7Z9k">Becker's The Economics of Discrimination</a> or summaries like <a href="//danluu.com/tech-discrimination/">the evidence laid out in the post</a>.</p> <p>The posts on computers having higher latency and the lack of empirical evidence of the benefit of types are the posts I've seen pointed to the most often to defend a ridiculable opinion. I didn't know what I'd find when I started doing the work for either post, and they both happen to have turned up evidence that's the opposite of the most common loud claims (that there's very good evidence that advanced type systems improve safety in practice, and that computers are of course faster in every way and people who think they're slower are just indulging in nostalgia). I don't know if this has changed many opinions. However, I haven't gotten much direct ridicule for either post even though both posts directly state a position I see commonly ridiculed online. I suspect that's partially because both posts are empirical, so there's not much to dispute (though the post on discrimination is also empirical and it still gets its share of ridicule).</p> <p>The last idea in the list is more meta: no one directly tells me that I should use more obscure terminology. Instead, I get comments that I must not know much about X because I'm not using terms of art. Using terms of art is a common way to establish credibility or authority, but that's something I don't really believe in. Arguing from authority doesn't tell you anything; adding needless terminology just makes things more difficult for readers who aren't in the field and are reading because they're interested in the topic but don't want to actually get into the field.</p> <p>This is a pretty fundamental disagreement that I have with a lot of people. Just for example, I recently got into a discussion with an authority who insisted that it wasn't possible for me to reasonably disagree with them (I suggested we agree to disagree) because they're an authority on the topic and I'm not. It happens that I worked on the formal verification of a system very similar to the system we were discussing, but I didn't mention that because I don't believe that my status as an authority on the topic matters. If someone has such a weak argument that they have to fall back on an infallible authority, that's usually a sign that they don't have a well-reasoned defense of their position. 
This goes double when they point to themselves as the infallible authority.</p> <p>I have about 20 other posts on stupid sounding ideas queued up in my head, but I mostly try to avoid writing things that are controversial, so I don't know that I'll write many of those up. If I were to write one post a month (much more frequently than my recent rate) and limit myself to 10% posts on ridiculable ideas, it would take 16 years to write up all of the ridiculable ideas I currently have.</p> <h3 id="appendix-commentary-on-improvement">Appendix: commentary on improvement</h3> <ul> <li><a href="https://www.youtube.com/watch?v=h1a-lhjmfPw">Skyline</a>: 99%-ile isn't that good</li> <li><a href="https://twitter.com/JamesClear/status/1292574538912456707">James Clear</a>: 90%-ile isn't that good</li> <li><a href="https://web.archive.org/web/20120428193015/http://jinfiesto.posterous.com/how-to-seem-good-at-everything-stop-doing-stu">Josh Infiesto</a>: 80%-ile isn't that good</li> <li><a href="https://www.newyorker.com/magazine/2011/10/03/personal-best">Atul Gawande</a>: coaching / feedback is powerful and underrated</li> </ul> <p><small>Thanks to Leah Hanson, Hillel Wayne, Robert Schuessler, Michael Malis, Kevin Burke, Jeremie Jost, Pierre-Yves Baccou, Veit Heller, Jeff Fowler, Malte Skarupe, David Turner, Akiva Leffert, Lifan Zeng, John Hergenroder, Wesley Aptekar-Cassels, Chris Lample, Julia Evans, Anja Boskovic, Vaibhav Sagar, Sean Talts, Emil Sit, Ben Kuhn, Valentin Hartmann, Sean Barrett, Kevin Shannon, Enzo Ferey, Andrew McCollum, Yuri Vishnevsky, and an anonymous commenter for comments/corrections/discussion.</small></p> <p><link rel=prefetch href=//danluu.com/hn-comments/> <link rel=prefetch href=//danluu.com/startup-tradeoffs/> <link rel=prefetch href=//danluu.com/monorepo/> <link rel=prefetch href=//danluu.com/cpu-bugs/> <link rel=prefetch href=//danluu.com/tech-discrimination/> <link rel=prefetch href=//danluu.com/input-lag/> <link rel=prefetch href=//danluu.com/empirical-pl/></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S"><p>The choice of Overwatch is arbitrary among activities I'm familiar with where:</p> <ul> <li>I know enough about the activity to comment on it</li> <li>I've observed enough people trying to learn it that I can say if it's &quot;easy&quot; or not to fix some mistake or class of mistake</li> <li>There's a large enough set of rated players to support the argument</li> <li>Many readers will also be familiar with the activity</li> </ul> <p>99% of my gaming background comes from 90s video games, but I'm not going to use those as examples because relatively few readers will be familiar with those games. I could also use &quot;modern&quot; board games like Puerto Rico, Dominion, Terra Mystica, <a href="https://en.wikipedia.org/wiki/Advanced_Squad_Leader">ASL</a> etc., but the set of people who played in rated games is very low, which makes the argument less convincing (perhaps people who play in rated games are much worse than people who don't — unlikely, but difficult to justify without comparing gameplay between rated and unrated games, which is pretty deep into weeds for this post).</p> <p>There are numerous activities that would be better to use than Overwatch, but I'm not familiar enough with them to use them as examples. 
For example, on reading a draft of this post, Kevin Burke noted that he's observed the same thing while coaching youth basketball and multiple readers noted that they've observed the same thing in chess, but I'm not familiar enough with youth basketball or chess to confidently say much about either activity even they'd be better examples because it's likely that more readers are familiar with basketball or chess than with Overwatch.</p> <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:O"><p>When I first started playing Overwatch (which is when I did that experiment), I ended up getting rated slightly above 50%-ile (for Overwatch players, that was in Plat -- this post is going to use percentiles and not ranks to avoid making non-Overwatch players have to learn what the ranks mean). It's generally believed and probably true that people who play the main ranked game mode in Overwatch are, on average, better than people who only play unranked modes, so it's likely that my actual percentile was somewhat higher than 50%-ile and that all &quot;true&quot; percentiles listed in this post are higher than the nominal percentiles.</p> <p>Some things you'll regularly see at slightly above 50%-ile are:</p> <ul> <li>Supports (healers) will heal someone who's at full health (which does nothing) while a teammate who's next to them is dying and then dies</li> <li>Players will not notice someone who walks directly behind the team and kills people one at a time until the entire team is killed</li> <li>Players will shoot an enemy until only one more shot is required to kill the enemy and then switch to a different target, letting the 1-health enemy heal back to full health before shooting at that enemy again</li> <li>After dying, players will not wait for their team to respawn and will, instead, run directly into the enemy team to fight them 1v6. This will repeat for the entire game (the game is designed to be 6v6, but in ranks below 95%-ile, it's rare to see a 6v6 engagement after one person on one team dies)</li> <li>Players will clearly have no idea what character abilities do, including for the character they're playing</li> <li>Players go for very high risk but low reward plays (for Overwatch players, a classic example of this is Rein going for a meme pin when the game opens on 2CP defense, very common at 50%-ile, rare at 95%-ile since players who think this move is a good idea tend to have generally poor decision making).</li> <li>People will have terrible aim and will miss four or five shots in a row when all they need to do is hit someone once to kill them</li> <li>If a single flanking enemy threatens a healer who can't escape plus a non-healer with an escape ability, the non-healer will probably use their ability to run away, leaving the healer to die, even though they could easily kill the flanker and save their healer if they just attacked while being healed.</li> </ul> <p>Having just one aspect of your gameplay be merely bad instead of atrocious is enough to get to 50%-ile. For me, that was my teamwork, for others, it's other parts of their gameplay. The reason I'd say that my teamwork was bad and not good or even mediocre was that I basically didn't know how to play the game, didn't know what any of the characters’ strengths, weaknesses, and abilities are, so I couldn't possibly coordinate effectively with my team. I also didn't know how the game modes actually worked (e.g., under what circumstances the game will end in a tie vs. 
going into another round), so I was basically wandering around randomly with a preference towards staying near the largest group of teammates I could find. That's above average.</p> <p>You could say that someone is pretty good at the game since they're above average. But in a non-relative sense, being slightly above average is quite bad -- it's hard to argue that someone who doesn't notice their entire team being killed from behind while two teammates are yelling &quot;[enemy] behind us!&quot; over voice comms isn't bad.</p> <p>After playing a bit more, I ended up with what looks like a &quot;true&quot; rank of about 90%-ile when I'm using a character I know how to use. Due to volatility in ranking as well as matchmaking, I played in games as high as 98%-ile. My aim and dodging were still atrocious. Relative to my rank, my aim was actually worse than when I was playing in 50%-ile games since my opponents were much better and I was only a little bit better. In 90%-ile, two copies of myself would probably lose fighting against most people 2v1 in the open. I would also usually lose a fight if the opponent was in the open and I was behind cover such that only 10% of my character was exposed, so my aim was arguably more than 10x worse than median at my rank.</p> <p>My &quot;trick&quot; for getting to 90%-ile despite being a 1/10th aimer was learning how the game worked and playing in a way that maximized the probability of winning (to the best of my ability), as opposed to playing the game like it's an <a href="https://en.wikipedia.org/wiki/Deathmatch">FFA</a> game where your goal is to get kills as quickly as possible. It takes a bit more context to describe what this means in 90%-ile, so I'll only provide a couple examples, but these are representative of mistakes the vast majority of 90%-ile players are making all of the time (with the exception of a few players who have grossly defective aim, like myself, who make up for their poor aim by playing better than average for the rank in other ways).</p> <p>Within the game, the goal of the game is to win. There are different game modes, but for the mainline ranked game, they all will involve some kind of objective that you have to be on or near. It's very common to get into a situation where the round timer is ticking down to zero and your team is guaranteed to lose if no one on your team touches the objective but your team may win if someone can touch the objective and not die instantly (which will cause the game to go into overtime until shortly after both teams stop touching the objective). A concrete example of this that happens somewhat regularly is, the enemy team has four players on the objective while your team has two players near the objective, one tank and one support/healer. The other four players on your team died and are coming back from spawn. They're close enough that if you can touch the objective and not instantly die, they'll arrive and probably take the objective for the win, but they won't get there in time if you die immediately after taking the objective, in which case you'll lose.</p> <p>If you're playing the support/healer at 90%-ile to 95%-ile, this game will almost always end as follows: the tank will move towards the objective, get shot, decide they don't want to take damage, and then back off from the objective. As a support, you have a small health pool and will die instantly if you touch the objective because the other team will shoot you. 
Since your team is guaranteed to lose if you don't move up to the objective, you're forced to do so to have any chance of winning. After you're killed, the tank will either move onto the objective and die or walk towards the objective but not get there before time runs out. Either way, you'll probably lose.</p> <p>If the tank did their job and moved onto the objective before you died, you could heal the tank for long enough that the rest of your team will arrive and you'll probably win. The enemy team, if they were coordinated, could walk around or through the tank to kill you, but they won't do that -- anyone who knows that doing so will win them the game and can aim well enough to successfully follow through can't help but end up in a higher rank. And the hypothetical tank on your team who knows that it's their job to absorb damage for their support in that situation and not vice versa won't stay at 95%-ile very long because they'll win too many games and move up to a higher rank.</p> <p>Another basic situation that the vast majority of 90%-ile to 95%-ile players will get wrong is when you're on offense, waiting for your team to respawn so you can attack as a group. Even at 90%-ile, maybe 1/4 to 1/3 of players won't do this and will just run directly at the enemy team, but enough players will realize that 1v6 isn't a good idea that you'll often see 5v6 or 6v6 fights instead of the constant 1v6 and 2v6 fights you see at 50%-ile. Anyway, while waiting for the team to respawn in order to get a 5v6, it's very likely one player who realizes that they shouldn't just run into the middle of the enemy team 1v6 will decide they should try to hit the enemy team with long-ranged attacks 1v6. People will do this instead of hiding in safety behind a wall even when the enemy has multiple snipers with instant-kill long range attacks. People will even do this against multiple snipers when they're playing a character that isn't a sniper and needs to hit the enemy 2-3 times to get a kill, making it overwhelmingly likely that they won't get a kill while taking a significant risk of dying themselves. For Overwatch players, people will also do this when they have full ult charge and the other team doesn't, turning a situation that should be to your advantage (your team has ults ready and the other team has used ults) into a neutral situation (both teams have ults) <em>at best</em>, and instantly losing the fight at worst.</p> <p>If you ever read an Overwatch forum, whether that's one of the reddit forums or the official Blizzard forums, a common complaint is &quot;why are my teammates so bad? I'm at [90%-ile to 95%-ile rank], but all my teammates are doing obviously stupid game-losing things all the time, like [an example above]&quot;. The answer is, of course, that the person asking the question is also doing obviously stupid game-losing things all the time because anyone who doesn't constantly make major blunders wins too much to stay at 95%-ile. This also applies to me.</p> <p>People will argue that players at this rank <em>should</em> be good because they're better than 95% of other players, which makes them relatively good. But non-relatively, it's hard to argue that someone who doesn't realize that you should step on the objective to probably win the game instead of not touching the objective for a sure loss is good. 
One of the most basic things about Overwatch is that it's an objective-based game, but the majority of players at 90%-ile to 95%-ile don't play that way.</p> <p>For anyone who isn't well into the 99%-ile, reviewing recorded games will reveal game-losing mistakes all the time. For myself, usually ranked 90%-ile or so, watching a recorded game will reveal tens of game losing mistakes in a close game (which is maybe 30% of losses, the other 70% are blowouts where there isn't a single simple mistake that decides the game).</p> <p>It's generally not too hard to fix these since the mistakes are like the example above: simple enough that once you see that you're making the mistake, the fix is straightforward because the mistake is straightforward.</p> <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:A"><p>There are probably some people who just want to be angry at their teammates. Due to how infrequently you get matched with the same players, it's hard to see this in the main rated game mode, but I think you can sometimes see this when Overwatch sometimes runs mini-rated modes.</p> <p>Mini-rated modes have a much smaller playerbase than the main rated mode, which has two notable side effects: players with a much wider variety of skill levels will be thrown into the same game and you'll see the same players over and over again if you play multiple games.</p> <p>Since you ended up matched with the same players repeatedly, you'll see players make the same mistakes and cause themselves to lose in the same way and then have the same tantrum and blame their teammates in the same way game after game.</p> <p>You'll also see tantrums and teammate blaming in the normal rated game mode, but when you see it, you generally can't tell if the person who's having a tantrum is just having a bad day or if it's some other one-off occurrence since, unless you're ranked very high or very low (where there's a smaller pool of closely rated players), you don't run into the same players all that frequently. But when you see a set of players in 15-20 games over the course of a few weeks and you see them lose the game for the same reason a double digit number of times followed by the exact same tantrum, you might start to suspect that some fraction of those people really want to be angry and that the main thing they're getting out of playing the game is a source of anger. You might also wonder about this from how some people use social media, but that's a topic for another post.</p> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:D"><p>For example, there will also be players who have some kind of disability that prevents them from improving, but at the levels we're talking about, 99%-ile or below, that will be relatively rare (certainly well under 50%, and I think it's not unreasonable to guess that it's well under 10% of people who choose to play the game). IIRC, there's at least one player who's in the top 500 who's deaf (this is severely disadvantageous since sound cues give a lot of fine-grained positional information that cannot be obtained in any other way), at least one legally blind player who's 99%-ile, and multiple players with physical impairments that prevent them from having fine-grained control of a mouse, i.e., who are basically incapable of aiming, who are 99%-ile.</p> <p>There are also other kinds of reasons people might not improve. 
For example, Kevin Burke has noted that when he coaches youth basketball, some children don't want to do drills that they think make them look foolish (e.g., avoiding learning to dribble with their off hand even during drills where everyone is dribbling poorly because they're using their off hand). When I spent a lot of time in a climbing gym with a world class coach who would regularly send a bunch of kids to nationals and some to worlds, I'd observe the same thing in his classes -- kids, even ones who are nationally or internationally competitive, would sometimes avoid doing things because they were afraid it would make them look foolish to their peers. The coach's solution in those cases was to deliberately make the kid look extremely foolish and tell them that it's better to look stupid now than at nationals.</p> <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:C">note that, here, a skilled coach is someone who is skilled at coaching, not necessarily someone who is skilled at the activity. People who are skilled at the activity but who haven't explicitly been taught how to teach or spent a lot of time working on teaching are generally poor coaches. <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:F">If you read the acknowledgements section of any of my posts, you can see that I get feedback from more than just two people on most posts (and I really appreciate the feedback), but I think that, by volume, well over 90% of the feedback I've gotten has come from Leah and a professional editor. <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> </ol> </div> Algorithms interviews: theory vs. practice algorithms-interviews/ Sun, 05 Jan 2020 00:00:00 +0000 algorithms-interviews/ <p>When I ask people at trendy big tech companies why algorithms quizzes are mandatory, the most common answer I get is something like &quot;we have so much scale, we can't afford to have someone accidentally write an <code>O(n^2)</code> algorithm and bring the site down&quot;<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">1</a></sup>. One thing I find funny about this is, even though a decent fraction of the value I've provided for companies has been solving phone-screen level algorithms problems on the job, I can't pass algorithms interviews! When I say that, people often think I mean that I fail half my interviews or something. It's more than half.</p> <p>When I wrote a draft blog post of my interview experiences, draft readers panned it as too boring and repetitive because I'd failed too many interviews. I should summarize my failures as a table because no one's going to want to read a 10k word blog post that's just a series of failures, they said (which is good advice; I'm working on a version with a table). I’ve done maybe 40-ish &quot;real&quot; software interviews and passed maybe one or two of them (arguably zero)<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">2</a></sup>.</p> <p>Let's look at a few examples to make it clear what I mean by &quot;phone-screen level algorithms problem&quot;, above.</p> <p>At one big company I worked for, a team wrote a core library that implemented a resizable array for its own purposes. On each resize that overflowed the array's backing store, the implementation added a constant number of elements and then copied the old array to the newly allocated, slightly larger, array. 
This is a classic example of how not to <a href="https://en.wikipedia.org/wiki/Dynamic_array">implement a resizable array</a> since it results in linear time resizing instead of <a href="https://en.wikipedia.org/wiki/Amortized_analysis">amortized constant time</a> resizing. It's such a classic example that it's often used as the canonical example when demonstrating amortized analysis.</p> <p>For people who aren't used to big tech company phone screens, typical phone screens that I've received are one of:</p> <ul> <li>an &quot;easy&quot; coding/algorithms question, maybe with a &quot;very easy&quot; warm-up question in front.</li> <li>a series of &quot;very easy&quot; coding/algorithms questions,</li> <li>a bunch of trivia (rare for generalist roles, but not uncommon for low-level or performance-related roles)</li> </ul> <p>This array implementation problem is considered to be so easy that it falls into the &quot;very easy&quot; category and is either a warm-up for the &quot;real&quot; phone screen question or is bundled up with a bunch of similarly easy questions. And yet, this resizable array was responsible for roughly 1% of all GC pressure across all JVM code at the company (it was the second largest source of allocations across all code) as well as a significant fraction of CPU. Luckily, the resizable array implementation wasn't used as a generic resizable array and it was only instantiated by a semi-special-purpose wrapper, which is what allowed this to &quot;only&quot; be responsible for 1% of all GC pressure at the company. If asked as an interview question, it's overwhelmingly likely that most members of the team would've implemented this correctly in an interview. My fixing this made my employer more money annually than I've made in my life.</p> <p>That was the second largest source of allocations, the number one largest source was converting a pair of <code>long</code> values to byte arrays in the same core library. It appears that this was done because someone wrote or copy pasted a hash function that took a byte array as input, then modified it to take two inputs by taking two byte arrays and operating on them in sequence, which left the hash function interface as <code>(byte[], byte[])</code>. In order to call this function on two longs, they used a handy <code>long</code> to <code>byte[]</code> conversion function in a widely used utility library. That function, in addition to allocating a <code>byte[]</code> and stuffing a <code>long</code> into it, also reverses the endianness of the long (the function appears to have been intended to convert <code>long</code> values to network byte order).</p> <p>Unfortunately, switching to a more appropriate hash function would've been a major change, so my fix for this was to change the hash function interface to take a pair of longs instead of a pair of byte arrays and have the hash function do the endianness reversal instead of doing it as a separate step (since the hash function was already shuffling bytes around, this didn't create additional work). Removing these unnecessary allocations made my employer more money annually than I've made in my life.</p> <p>Finding a constant factor speedup isn't technically an algorithms question, but it's also something you see in algorithms interviews. 
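<p>(As an aside, to make the resizable array example above concrete, here's a minimal sketch of the two growth policies in C. This is purely illustrative and isn't the actual code, which was JVM library code; error handling is omitted.)</p> <pre><code>/* Illustrative sketch, not the real implementation: grow_by_constant
   is the problematic growth policy described above and grow_by_doubling
   is the standard fix. realloc copies the old contents on grow. */
#include &lt;stdlib.h&gt;

#define GROWTH 16

typedef struct { long *data; size_t len, cap; } vec;

static void grow_by_constant(vec *v) {
    v-&gt;cap += GROWTH;                      /* add a constant number of slots */
    v-&gt;data = realloc(v-&gt;data, v-&gt;cap * sizeof *v-&gt;data);
}

static void grow_by_doubling(vec *v) {
    v-&gt;cap = v-&gt;cap ? v-&gt;cap * 2 : GROWTH; /* grow geometrically instead */
    v-&gt;data = realloc(v-&gt;data, v-&gt;cap * sizeof *v-&gt;data);
}

static void append(vec *v, long x, void (*grow)(vec *)) {
    if (v-&gt;len == v-&gt;cap)
        grow(v);
    v-&gt;data[v-&gt;len++] = x;
}
</code></pre> <p>With the constant-increment policy, appending n elements copies roughly n^2 / (2 * GROWTH) elements in total, while doubling copies a total of about n elements, which is where the amortized constant time per append comes from.</p>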
As a follow-up to an algorithms question, I commonly get asked &quot;can you make this faster?&quot; The answer to these often involves doing a simple optimization that will result in a constant factor improvement.</p> <p>A concrete example that I've been asked twice in interviews is: you're storing IDs as ints, but you already have some context in the question that lets you know that the IDs are densely packed, so you can store them as a bitfield instead (there's a sketch of this below). The difference between the bitfield interview question and the real-world superfluous array is that the real-world existing solution is so far afield from the expected answer that you probably wouldn’t be asked to find a constant factor speedup. More likely, you would've failed the interview at that point.</p> <p>To pick an example from another company, the configuration for <a href="bitfunnel-sigir.pdf">BitFunnel</a>, a search index used in Bing, is another example of an interview-level algorithms question<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">3</a></sup>.</p> <p>The full context necessary to describe the solution is a bit much for this blog post, but basically, there's a set of bloom filters that needs to be configured. One way to do this (which I'm told was being done) is to write a black-box optimization function that uses gradient descent to try to find an optimal solution. I'm told this always resulted in some strange properties and the output configuration always resulted in non-idealities which were worked around by making the backing bloom filters less dense, i.e. throwing more resources (and therefore money) at the problem.</p> <p>To create a more optimized solution, you can observe that the fundamental operation in BitFunnel is equivalent to multiplying probabilities together, so, for any particular configuration, you can just multiply some probabilities together to determine how a configuration will perform. Since the configuration space isn't all that large, you can then put this inside a few for loops and iterate over the space of possible configurations and then pick out the best set of configurations. This isn't quite right because multiplying probabilities assumes a kind of independence that doesn't hold in reality, but that seems to work ok for the same reason that <a href="https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering">naive Bayesian spam filtering</a> worked pretty well when it was introduced even though it incorrectly assumes the probability of any two words appearing in an email are independent. And if you want the full solution, you can work out the non-independent details, although that's probably beyond the scope of an interview.</p> <p>Those are just three examples that came to mind; I run into this kind of thing all the time and could come up with tens of examples off the top of my head, perhaps more than a hundred if I sat down and tried to list every example I've worked on, certainly more than a hundred if I list examples I know of that someone else (or no one) has worked on. 
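<p>(Here's the bitfield version of the ID-storage question mentioned above, as a minimal sketch. The interview versions I was asked had more context around them; the names and the fixed ID range here are made up for illustration.)</p> <pre><code>/* Sketch: instead of storing each ID in an int (32 bits per stored ID),
   keep one bit per possible ID. MAX_ID and the function names are made
   up for illustration. */
#include &lt;stdint.h&gt;

#define MAX_ID 1000000

static uint64_t present[(MAX_ID + 63) / 64];

static void id_add(uint32_t id)      { present[id / 64] |= (uint64_t)1 &lt;&lt; (id % 64); }
static void id_remove(uint32_t id)   { present[id / 64] &amp;= ~((uint64_t)1 &lt;&lt; (id % 64)); }
static int  id_contains(uint32_t id) { return (present[id / 64] &gt;&gt; (id % 64)) &amp; 1; }
</code></pre> <p>When most possible IDs are actually present, that's roughly a 32x reduction in memory, which is the kind of constant factor improvement these follow-up questions are usually looking for.</p>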
Both the examples in this post as well as the ones I haven’t included have these properties:</p> <ul> <li>The example could be phrased as an interview question</li> <li>If phrased as an interview question, you'd expect most (and probably all) people on the relevant team to get the right answer in the timeframe of an interview</li> <li>The cost savings from fixing the example is worth more annually than my lifetime earnings to date</li> <li>The example persisted for long enough that it's reasonable to assume that it wouldn't have been discovered otherwise</li> </ul> <p>At the start of this post, we noted that people at big tech companies commonly claim that they have to do algorithms interviews since it's so costly to have inefficiencies at scale. My experience is that these examples are legion at every company I've worked for that does algorithms interviews. Trying to get people to solve algorithms problems on the job by asking algorithms questions in interviews doesn't work.</p> <p>One reason is that even though big companies try to make sure that the people they hire can solve algorithms puzzles, they also incentivize many or most developers to avoid deploying that kind of reasoning to make money.</p> <p>Of the three solutions for the examples above, two are in production and one isn't. That's about my normal hit rate if I go to a random team with a diff and don't persistently follow up (as opposed to a team that I have reason to believe will be receptive, or a team that's asked for help, or if I keep pestering a team until the fix gets taken).</p> <p>If you're very cynical, you could argue that it's surprising the success rate is that high. If I go to a random team, it's overwhelmingly likely that efficiency is in neither the team's objectives nor their org's objectives. The company is likely to have spent a decent amount of effort incentivizing teams to hit their objectives -- what's the point of having objectives otherwise? Accepting my diff will require them to test, integrate, and deploy the change and will create risk (because all deployments have non-zero risk). Basically, I'm asking teams to do some work and take on some risk to do something that's worthless to them. Despite incentives, people will usually take the diff, but they're not very likely to spend a lot of their own spare time trying to find efficiency improvements (and their normal work time will be spent on things that are aligned with the team's objectives)<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">4</a></sup>.</p> <p>Hypothetically, let's say a company didn't try to ensure that its developers could pass algorithms quizzes but did incentivize developers to use relatively efficient algorithms. I don't think any of the three examples above could have survived, undiscovered, for years nor could they have remained unfixed. Some hypothetical developer working at a company where people profile their code would likely have looked at the hottest items in the profile for the most computationally intensive library at the company. The &quot;trick&quot; for the first two examples isn't any kind of algorithms wizardry, it's just looking at the profile at all, which is something incentives can fix. The third example is less inevitable since there isn't a standard tool that will tell you to look at the problem. 
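<p>(To give a sense of why the third example is still only an interview-level problem once you see the idea, here's a heavily simplified sketch of the brute-force search described above. The probability model, the parameter ranges, and the cost model are all made up for illustration; they aren't BitFunnel's actual configuration parameters.)</p> <pre><code>/* Sketch of &quot;multiply probabilities inside a few for loops&quot;: treat each
   row as matching a non-matching document with probability equal to its
   bit density, assume rows are independent, and brute-force the
   configuration space for the cheapest config that meets a false
   positive target. All numbers here are invented for illustration. */
#include &lt;stdio.h&gt;

int main(void) {
    const double fp_target = 1e-4;   /* invented false positive budget */
    double best_cost = 1e30, best_density = 0;
    int best_rows = 0;

    for (int rows = 1; rows &lt;= 16; rows++) {
        for (int d = 1; d &lt;= 50; d++) {
            double density = d / 100.0;

            /* multiply the per-row match probabilities together */
            double fp = 1.0;
            for (int r = 0; r &lt; rows; r++)
                fp *= density;
            if (fp &gt; fp_target)
                continue;            /* config doesn't meet the target */

            /* toy cost model: lower density and more rows cost more memory */
            double cost = rows / density;
            if (cost &lt; best_cost) {
                best_cost = cost;
                best_rows = rows;
                best_density = density;
            }
        }
    }
    printf(&quot;rows=%d density=%.2f cost=%.1f\n&quot;, best_rows, best_density, best_cost);
    return 0;
}
</code></pre> <p>The point isn't the specific numbers; it's that once you have a closed-form estimate for how a configuration performs, exhaustively checking the whole space is a few nested loops rather than a black-box optimizer.</p>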
It would also be easy to try to spin the result as some kind of wizardry -- that example formed the core part of a paper that won &quot;best paper award&quot; at the top conference in its field (IR), but the reality is that the &quot;trick&quot; was applying high school math, which means the real trick was having enough time to look at places where high school math might be applicable to find one.</p> <p>I actually worked at a company that used the strategy of &quot;don't ask algorithms questions in interviews, but do incentivize things that are globally good for the company&quot;. During my time there, I only found one single fix that nearly meets the criteria for the examples above (if the company had more scale, it would've met all of the criteria, but due to the company's size, increases in efficiency were worth much less than at big companies -- much more than I was making at the time, but the annual return was still less than my total lifetime earnings to date).</p> <p>I think the main reason that I only found one near-example is that enough people viewed making the company better as their job, so straightforward high-value fixes tended not exist because systems were usually designed such that they didn't really have easy to spot improvements in the first place. In the rare instances where that wasn't the case, there were enough people who were trying to do the right thing for the company (instead of being forced into obeying local incentives that are quite different from what's globally beneficial to the company) that someone else was probably going to fix the issue before I ever ran into it.</p> <p>The algorithms/coding part of that company's interview (initial screen plus onsite combined) was easier than the phone screen at major tech companies and we basically didn't do a system design interview.</p> <p>For a while, we tried an algorithmic onsite interview question that was on the hard side but in the normal range of what you might see in a BigCo phone screen (but still easier than you'd expect to see at an onsite interview). We stopped asking the question because every new grad we interviewed failed the question (we didn't give experienced candidates that kind of question). We simply weren't prestigious enough to get candidates who can easily answer those questions, so it was impossible to hire using <a href="//danluu.com/programmer-moneyball/">the same trendy hiring filters that everybody else had</a>. In contemporary discussions on interviews, what we did is often called &quot;lowering the bar&quot;, but it's unclear to me why we should care how high of a bar someone can jump over when little (and in some cases none) of the job they're being hired to do involves jumping over bars. And, in the cases where you do want them to jump over bars, they're maybe 2&quot; high and can easily be walked over.</p> <p>When measured on actual productivity, that was the most productive company I've worked for. I believe the reasons for that are cultural and too complex to fully explore in this post, but I think it helped that we didn't filter out perfectly good candidates with algorithms quizzes and assumed people could pick that stuff up on the job if we had a culture of people generally doing the right thing instead of focusing on local objectives.</p> <p>If other companies want people to solve interview-level algorithms problems on the job perhaps they could try incentivizing people to solve algorithms problems (when relevant). 
That could be done in addition to or even instead of filtering for people who can whiteboard algorithms problems.</p> <h3 id="appendix-how-did-we-get-here">Appendix: how did we get here?</h3> <p>Way back in the day, interviews often involved &quot;trivia&quot; questions. Modern versions of these might look like the following:</p> <ul> <li>What's MSI? MESI? MOESI? MESIF? What's the advantage of MESIF over MOESI?</li> <li>What happens when you throw in a destructor? What if it's C++11? What if a sub-object's destructor that's being called by a top-level destructor throws, which other sub-object destructors will execute? What if you throw during stack unwinding? Under what circumstances would that not cause <code>std::terminate</code> to get called?</li> </ul> <p>I heard about this practice back when I was in school and even saw it with some &quot;old school&quot; companies. This was back when Microsoft was the biggest game in town and people who wanted to copy a successful company were likely to copy Microsoft. The most widely read programming blogger at the time (Joel Spolsky) was telling people they need to adopt software practice X because Microsoft was doing it and they couldn't compete without adopting the same practices. For example, in one of the most influential programming blog posts of the era, Joel Spolsky advocates for what he called the Joel test in part by saying that you have to do these things to keep up with companies like Microsoft:</p> <blockquote> <p>A score of 12 is perfect, 11 is tolerable, but 10 or lower and you’ve got serious problems. The truth is that most software organizations are running with a score of 2 or 3, and they need <em>serious</em> help, because companies like Microsoft run at 12 full-time.</p> </blockquote> <p>At the time, popular lore was that Microsoft asked people questions like the following (and I was actually asked one of these brainteasers during my own interview with Microsoft around 2001, along with precisely zero algorithms or coding questions):</p> <ul> <li>how would you escape from a blender if you were half an inch tall?</li> <li>why are manhole covers round?</li> <li>a windowless room has 3 lights, each of which is controlled by a switch outside of the room. You are outside the room. You can only enter the room once. How can you determine which switch controls which lightbulb?</li> </ul> <p>Since I was interviewing during the era when this change was happening, I got asked plenty of trivia questions as well as plenty of brainteasers (including all of the above brainteasers). Some other questions that aren't technically brainteasers that were popular at the time were <a href="https://en.wikipedia.org/wiki/Fermi_problem">Fermi problems</a>. Another trend at the time was for behavioral interviews and a number of companies I interviewed with had 100% behavioral interviews with zero technical interviews.</p> <p>Anyway, back then, people needed a rationalization for copying Microsoft-style interviews. When I asked people why they thought brainteasers or Fermi questions were good, the convenient rationalization people told me was usually that they tell you if a candidate can really think, unlike those silly trivia questions, which only tell you if people have memorized some trivia. 
What we really need to hire are candidates who can really think!</p> <p>Looking back, people now realize that this wasn't effective and cargo culting Microsoft's every decision won't make you as successful as Microsoft because Microsoft's success came down to a few key things plus network effects, so copying how they interview can't possibly turn you into Microsoft. Instead, it's going to turn you into a company that interviews like Microsoft but isn't in a position to take advantage of the network effects that Microsoft was able to take advantage of.</p> <p>For interviewees, the process with brainteasers was basically as it is now with algorithms questions, except that you'd review <a href="https://www.amazon.com/gp/product/0316778494/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0316778494&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=bcc3345d4905164e75b79f2cd2a0f5fb">How Would You Move Mount Fuji</a> before interviews instead of <a href="https://www.amazon.com/gp/product/0984782850/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0984782850&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=381539624231e0e40b66f2695c925536">Cracking the Coding Interview</a> to pick up a bunch of brainteaser knowledge that you'll never use on the job instead of algorithms knowledge you'll never use on the job.</p> <p>Back then, interviewers would learn about questions specifically from interview prep books like &quot;How Would You Move Mount Fuji?&quot; and then ask them to candidates who learned the answers from books like &quot;How Would You Move Mount Fuji?&quot;. When I talk to people who are ten years younger than me, they think this is ridiculous -- those questions obviously have nothing to do with the job and being able to answer them well is much more strongly correlated with having done some interview prep than being competent at the job. Hillel Wayne has discussed how people come up with interview questions today (and I've also seen it firsthand at a few different companies) and, outside of groups that are testing for knowledge that's considered specialized, it doesn't seem all that different today.</p> <p>At this point, we've gone through a few decades of programming interview fads, each one of which looks ridiculous in retrospect. Either we've finally found the real secret to interviewing effectively and have reasoned our way past whatever roadblocks were causing everybody in the past to use obviously bogus fad interview techniques, or we're in the middle of another fad, one which will seem equally ridiculous to people looking back a decade or two from now.</p> <p>Without knowing anything about the effectiveness of interviews, at a meta level, since the way people get interview techniques is the same (crib the high-level technique from the most prestigious company around), I think it would be pretty surprising if this wasn't a fad. I would be less surprised to discover that current techniques were not a fad if people were doing or referring to empirical research or had independently discovered what works.</p> <p>Inspired by <a href="https://twitter.com/danluu/status/925728187375595521">a comment by Wesley Aptekar-Cassels</a>, the last time I was looking for work, I asked some people how they checked the effectiveness of their interview process and how they tried to reduce bias in their process. The answers I got (grouped together when similar, in decreasing order of frequency) were:</p> <ul> <li>Huh? 
We don't do that and/or why would we do that?</li> <li>We don't really know if our process is effective</li> <li>I/we just know that it works</li> <li>I/we aren't biased</li> <li>I/we would notice bias if it existed, which it doesn't</li> <li>Someone looked into it and/or did a study, but no one who tells me this can ever tell me anything concrete about how it was looked into or what the study's methodology was</li> </ul> <h3 id="appendix-training">Appendix: training</h3> <p>As with most real world problems, when trying to figure out why seven, eight, or even nine figure per year interview-level algorithms bugs are lying around waiting to be fixed, there isn't a single &quot;root cause&quot; you can point to. Instead, there's a kind of <a href="https://en.wikipedia.org/wiki/Hedgehog_defence">hedgehog defense</a> of misaligned incentives. Another part of this is that <a href="//danluu.com/programmer-moneyball/#training-mentorship">training is woefully underappreciated</a>.</p> <p>We've discussed that, at all but one company I've worked for, there are incentive systems in place that cause developers to feel like they shouldn't spend time looking at efficiency gains even when a simple calculation shows that there are tens or hundreds of millions of dollars in waste that could easily be fixed. And then because this isn't incentivized, developers tend to not have experience doing this kind of thing, making it unfamiliar, which makes it feel harder than it is. So even when a day of work could return $1m/yr in savings or profit (quite common at large companies, in my experience), people don't realize that it's only a day of work and could be done with only a small compromise to velocity. One way to solve this latter problem is with training, but that's even harder to get credit for than efficiency gains that aren't in your objectives!</p> <p>Just for example, I once wrote a moderate length tutorial (4500 words, shorter than this post by word count, though probably longer if you add images) on how to find various inefficiencies (how to use an allocation or CPU time profiler, how to do service-specific GC tuning for the GCs we use, how to use some tooling I built that will automatically find inefficiencies in your JVM or container configs, etc., basically things that are simple and often high impact that it's easy to write a runbook for; if you're at Twitter, you can read this at <a href="http://go/easy-perf">http://go/easy-perf</a>). I've had a couple people who would've previously come to me for help with an issue tell me that they were able to debug and fix an issue on their own and, secondhand, I heard that a couple other people who I don't know were able to go off and increase the efficiency of their service. I'd be surprised if I’ve heard about even 10% of cases where this tutorial helped someone, so I'd guess that this has helped tens of engineers, and possibly quite a few more.</p> <p>If I'd spent a week doing &quot;real&quot; work instead of writing a tutorial, I'd have something concrete, with quantifiable value, that I could easily put into a promo packet or performance review. Instead, I have this nebulous thing that, at best, counts as a bit of &quot;extra credit&quot;. I'm not complaining about this in particular -- this is exactly the outcome I expected. But, on average, companies get what they incentivize. 
If they expect training to come from developers (as opposed to hiring people to produce training materials, which tends to be very poorly funded compared to engineering) but don't value it as much as they value dev work, then there's going to be a shortage of training.</p> <p>I believe you can also see training under-incentivized in public educational materials due to the relative difficulty of monetizing education and training. If you want to monetize explaining things, there are a few techniques that seem to work very well. If it's something that's directly obviously valuable, selling a video course that's priced &quot;very high&quot; (hundreds or thousands of dollars for a short course) seems to work. Doing corporate training, where companies fly you in to talk to a room of 30 people and you charge $3k per head, also works pretty well.</p> <p>If you want to reach (and potentially help) a lot of people, putting text on the internet and giving it away works pretty well, but monetization for that works poorly. For technical topics, I'm not sure the non-ad-blocking audience is really large enough to monetize via ads (as opposed to a pay wall).</p> <p>Just for example, Julia Evans can support herself from her <a href="https://jvns.ca/blog/2019/10/01/zine-revenue-2019/">zine income</a>, which she's said has brought in roughly $100k/yr for the past two years. Someone who does very well in corporate training can pull that in with a one or two day training course and, from what I've heard of corporate speaking rates, some highly paid tech speakers can pull that in with two engagements. Those are significantly above average rates, especially for speaking engagements, but since we're comparing to Julia Evans, I don't think it's unfair to use an above average rate.</p> <h3 id="appendix-misaligned-incentive-hedgehog-defense-part-3">Appendix: misaligned incentive hedgehog defense, part 3</h3> <p>Of the three examples above, I found one on a team where it was clearly worth zero to me to do anything that was actually valuable to the company and the other two on a team where it was valuable to me to do things that were good for the company, regardless of what they were. In my experience, that's very unusual for a team at a big company, but even on that team, incentive alignment was still quite poor. At one point, after getting a promotion and a raise, I computed the ratio of the amount of money my changes made the company vs. my raise and found that my raise was 0.03% of the money that I made the company, only counting easily quantifiable and totally indisputable impact to the bottom line. The vast majority of my work was related to tooling that had a difficult to quantify value that I suspect was actually larger than the value of the quantifiable impact, so I probably received well under 0.01% of the marginal value I was producing. And that's really an overestimate of how incentivized I was to do the work -- at the margin, I strongly suspect that anything I did was worth zero to me. After the first $10m/yr or maybe $20m/yr, there's basically no difference in terms of performance reviews, promotions, raises, etc. 
Because there was no upside to doing work and there's some downside (could get into a political fight, could bring the site down, etc.), the marginal return to me of doing more than &quot;enough&quot; work was probably negative.</p> <p>Some companies will give very large out-of-band bonuses to people regularly, but that work wasn't at a company that does a lot of that, so there's nothing the company could do to indicate that it valued additional work once someone did &quot;enough&quot; work to get the best possible rating on a performance review. From a <a href="https://en.wikipedia.org/wiki/Mechanism_design">mechanism design</a> point of view, the company was basically asking employees to stop working once they did &quot;enough&quot; work for the year.</p> <p>So even on this team, which was relatively well aligned with the company's success compared to most teams, the company's compensation system imposed a low ceiling on how well the team could be aligned.</p> <p>This also happened in another way. As is common at a lot of companies, managers were given a team-wide budget for raises that was mainly a function of headcount, which was then doled out to team members in a zero-sum way. Unfortunately for each team member (at least in terms of compensation), the team pretty much only had productive engineers, meaning that no one was going to do particularly well in the zero-sum raise game. The team had very low turnover because people like working with good co-workers, but the company was applying one of the biggest levers it has, compensation, to try to get people to leave the team and join less effective teams.</p> <p>Because this is such a common setup, I've heard of managers at multiple companies who try to retain people who are harmless but ineffective to try to work around this problem. If you were to ask someone, abstractly, if the company wants to hire and retain people who are ineffective, I suspect they'd tell you no. But insofar as a company can be said to want anything, it wants what it incentivizes.</p> <h3 id="related">Related</h3> <ul> <li><a href="//danluu.com/programmer-moneyball/">Downsides of cargo-culting trendy hiring practices</a></li> <li><a href="//danluu.com/wat/">Normalization of deviance</a></li> <li><a href="https://thezvi.wordpress.com/2019/05/30/quotes-from-moral-mazes/">Zvi Mowshowitz on Moral Mazes</a>, a book about how corporations have systemic issues that cause misaligned incentives at every level</li> <li><a href="https://www.jefftk.com/p/programmers-should-plan-for-lower-pay#lw-ftocr9cDfYtyQj9x2">&quot;randomsong&quot; on how it's possible to teach almost anybody to program</a>, 
thematically related, the idea being that programming isn't as hard as a lot of programmers would like to believe</li> <li><a href="https://noidea.dog/glue">Tanya Reilly on how &quot;glue work&quot; is poorly incentivized</a>, training being poorly incentivized is arguably a special case of this</li> <li><a href="https://sockpuppet.org/blog/2015/03/06/the-hiring-post/">Thomas Ptacek on using hiring filters that are decently correlated with job performance</a></li> <li><a href="https://mtlynch.io/why-i-quit-google/">Michael Lynch on his personal experience of big company incentives</a></li> <li><a href="https://news.ycombinator.com/item?id=21961560">An anonymous HN commenter on doing almost no work at Google, they say about 10% capacity, for six years and getting promoted</a></li> </ul> <p><small>Thanks to Leah Hanson, Heath Borders, Lifan Zeng, Justin Findlay, Kevin Burke, @chordowl, Peter Alexander, Niels Olson, Kris Shamloo, Chip Thien, Yuri Vishnevsky, and Solomon Boulos for comments/corrections/discussion</small></p> <p><link rel="prefetch" href="//danluu.com"> <link rel="prefetch" href="//danluu.com/programmer-moneyball/"> <link rel="prefetch" href="//danluu.com/wat/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:A">For one thing, most companies that copy the Google interview don't have that much scale. But even for companies that do, most people don't have jobs where they're designing high-scale algorithms (maybe they did at Google circa 2003, but from what I've seen at three different big tech companies, most people's jobs are pretty light on algorithms work). <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:S"><p>Real is in quotes because I've passed a number of interviews for reasons outside of the interview process. Maybe I had a very strong internal recommendation that could override my interview performance, maybe someone read my blog and assumed that I can do reasonable work based on my writing, maybe someone got a backchannel reference from a former co-worker of mine, or maybe someone read some of my open source code and judged me on that instead of a whiteboard coding question (and as far as I know, that last one has only happened once or twice). I'll usually ask why I got a job offer in cases where I pretty clearly failed the technical interview, so I have a collection of these reasons from folks.</p> <p>The reason it's arguably zero is that the only software interview where I inarguably got a &quot;real&quot; interview and was coming in cold was at Google, but that only happened because the interviewers that were assigned interviewed me for the wrong ladder -- I was interviewing for a hardware position, but I was being interviewed by software folks, so I got what was basically a standard software interview except that one interviewer asked me some questions about state machine and cache coherence (or something like that). After they realized that they'd interviewed me for the wrong ladder, I had a follow-up phone interview from a hardware engineer to make sure I wasn't totally faking having worked at a hardware startup from 2005 to 2013. It's possible that I failed the software part of the interview and was basically hired on the strength of the follow-up phone screen.</p> <p>Note that this refers only to software -- I'm actually pretty good at hardware interviews. 
At this point, I'm pretty out of practice at hardware and would probably need a fair amount of time to ramp up on an actual hardware job, but the interviews are a piece of cake for me. One person who knows me pretty well thinks this is because I &quot;talk like a hardware engineer&quot; and both say things that make hardware folks think I'm legit and say things that sound incredibly stupid to most programmers in a way that's more about shibboleths than actual knowledge or skills.</p> <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:P"><p>This one is a bit harder than you'd expect to get in a phone screen, but it wouldn't be out of line in an onsite interview (although a friend of mine once got <a href="https://code.google.com/codejam/contest/2437491/dashboard#s=p2">a Google Code Jam World Finals question</a> in a phone interview with Google, so you might get something this hard or harder, depending on who you draw as an interviewer).</p> <p>BTW, if you're wondering what my friend did when they got that question, it turns out they actually knew the answer because they'd seen and attempted the problem during Google Code Jam. They didn't get the right answer at the time, but they figured it out later just for fun. However, my friend didn't think it was reasonable to give that as a phone screen question and asked the interviewer for another question. The interviewer refused, so my friend failed the phone screen. At the time, I doubt there were more than a few hundred people in the world who would've gotten the right answer to the question in a phone screen and almost all of them probably would've realized that it was an absurd phone screen question. After failing the interview, my friend ended up looking for work for almost six months before passing an interview for a startup where he ended up building a number of core systems (in terms of both business impact and engineering difficulty). My friend is still there after the mid 10-figure IPO -- the company understands how hard it would be to replace this person and treats them very well. None of the other companies that interviewed this person even wanted to hire them at all and they actually had a hard time getting a job.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:E">Outside of egregious architectural issues that will simply cause a service to fall over, the most common way I see teams fix efficiency issues is to ask for more capacity. Some companies try to counterbalance this in some way (e.g., I've heard that at FB, a lot of the teams that work on efficiency improvements report into the capacity org, which gives them the ability to block capacity requests if they observe that a team has extreme inefficiencies that they refuse to fix), but I haven't personally worked in an environment where there's an effective system fix to this. Google had a system that was intended to address this problem that, among other things, involved making headcount fungible with compute resources, but I've heard that was rolled back in favor of a more traditional system for <a href="https://en.wikipedia.org/wiki/Incentive_compatibility">reasons</a>. <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> </ol> </div> Files are fraught with peril deconstruct-files/ Fri, 12 Jul 2019 00:00:00 +0000 deconstruct-files/ <p><small><em>This is a pseudo-transcript for a talk given at Deconstruct 2019. 
<a href="//danluu.com/web-bloat/">To make this accessible for people on slow connections</a> as well as people using screen readers, the slides have been replaced by in-line text (the talk has ~120 slides; at an average of 20 kB per slide, that's 2.4 MB. If you think that's trivial, consider that <a href="https://blogs.microsoft.com/on-the-issues/2019/04/08/its-time-for-a-new-approach-for-mapping-broadband-data-to-better-serve-americans/">half of Americans still aren't on broadband</a> and the situation is much worse in developing countries.</em></small></p> <p>Let's talk about files! Most developers seem to think that files are easy. Just for example, let's take a look at the top reddit r/programming comments from when Dropbox announced that they were only going to support ext4 on Linux (the most widely used Linux filesystem). For people not familiar with reddit r/programming, I suspect r/programming is the most widely read English language programming forum in the world.</p> <p>The top comment reads:</p> <blockquote> <p>I'm a bit confused, why do these applications have to support these file systems directly? Doesn't the kernel itself abstract away from having to know the lower level details of how the files themselves are stored?</p> <p>The only differences I could possibly see between different file systems are file size limitations and permissions, but aren't most modern file systems about on par with each other?</p> </blockquote> <p>The #2 comment (and the top replies going two levels down) are:</p> <blockquote> <p>#2: Why does an application care what the filesystem is?</p> <p>#2: Shouldn't that be abstracted as far as &quot;normal apps&quot; are concerned by the OS?</p> <p>Reply: It's a leaky abstraction. I'm willing to bet each different FS has its own bugs and its own FS specific fixes in the dropbox codebase. More FS's means more testing to make sure everything works right . . .</p> <p>2nd level reply: What are you talking about? This is a dropbox, what the hell does it need from the FS? There are dozenz of fssync tools, data transfer tools, distributed storage software, and everything works fine with inotify. What the hell does not work for dropbox exactly?</p> <p>another 2nd level reply: Sure, but any bugs resulting from should be fixed in the respective abstraction layer, not by re-implementing the whole stack yourself. You shouldn't re-implement unless you don't get the data you need from the abstraction. . . . DropBox implementing FS-specific workarounds and quirks is way overkill. That's like vim providing keyboard-specific workarounds to avoid faulty keypresses. All abstractions are leaky - but if no one those abstractions, nothing will ever get done (and we'd have billions of &quot;operating systems&quot;).</p> </blockquote> <p>In this talk, we're going to look at how file systems differ from each other and other issues we might encounter when writing to files. 
We're going to look at the file &quot;stack&quot; starting at the top with the file API, which we'll see is nearly impossible to use correctly and that supporting multiple filesystems without corrupting data is much harder than supporting a single filesystem; move down to the filesystem, which we'll see has serious bugs that cause data loss and data corruption; and then we'll look at disks and see that disks can easily corrupt data at a rate five million times greater than claimed in vendor datasheets.</p> <h3 id="file-api">File API</h3> <h4 id="writing-one-file">Writing one file</h4> <p>Let's say we want to write a file safely, so that we don't get data corruption. For the purposes of this talk, this means we'd like our write to be &quot;atomic&quot; -- our write should either fully complete, or we should be able to undo the write and end up back where we started. Let's look at an example from Pillai et al., OSDI’14.</p> <p>We have a file that contains the text <code>a foo</code> and we want to overwrite <code>foo</code> with <code>bar</code> so we end up with <code>a bar</code>. We're going to make a number of simplifications. For example, you should probably think of each character we're writing as a sector on disk (or, if you prefer, you can imagine we're using a hypothetical advanced NVM drive). Don't worry if you don't know what that means; I'm just pointing this out to note that this talk is going to contain many simplifications, which I'm not going to call out because we only have twenty-five minutes and the unsimplified version of this talk would probably take about three hours.</p> <p>To write, we might use the <code>pwrite</code> syscall. This is a function provided by the operating system to let us interact with the filesystem. Our invocation of this syscall looks like:</p> <pre><code>pwrite([file],
       &quot;bar&quot;, // data to write
       3,     // write 3 bytes
       2)     // at offset 2
</code></pre> <p><code>pwrite</code> takes the file we're going to write, the data we want to write, <code>bar</code>, the number of bytes we want to write, <code>3</code>, and the offset where we're going to start writing, <code>2</code>. If you're used to using a high-level language, like Python, you might be used to an interface that looks different, but underneath the hood, when you write to a file, it's eventually going to result in a syscall like this one, which is what will actually write the data into a file.</p> <p>If we just call <code>pwrite</code> like this, we might succeed and get <code>a bar</code> in the output, or we might end up doing nothing and getting <code>a foo</code>, or we might end up with something in between, like <code>a boo</code>, <code>a bor</code>, etc.</p> <p>What's happening here is that we might crash or lose power when we write. Since <code>pwrite</code> isn't guaranteed to be atomic, if we crash, we can end up with some fraction of the write completing, causing data corruption. One way to avoid this problem is to store an &quot;undo log&quot; that will let us restore corrupted data. Before we modify the file, we'll make a copy of the data that's going to be modified (into the undo log), then we'll modify the file as normal, and if nothing goes wrong, we'll delete the undo log.</p> <p>If we crash while we're writing the undo log, that's fine -- we'll see that the undo log isn't complete and we know that we won't have to restore because we won't have started modifying the file yet. If we crash while we're modifying the file, that's also ok. 
When we try to restore from the crash, we'll see that the undo log is complete and we can use it to recover from data corruption:</p> <pre><code>creat(/d/log)                 // Create undo log
write(/d/log, &quot;2,3,foo&quot;, 7)   // To undo, at offset 2, write 3 bytes, &quot;foo&quot;
pwrite(/d/orig, &quot;bar&quot;, 3, 2)  // Modify original file as before
unlink(/d/log)                // Delete log file
</code></pre> <p>If we're using <code>ext3</code> or <code>ext4</code>, widely used Linux filesystems, and we're using the mode <code>data=journal</code> (we'll talk about what these modes mean later), here are some possible outcomes we could get:</p> <pre><code>d/log: &quot;2,3,f&quot;
d/orig: &quot;a foo&quot;

d/log: &quot;&quot;
d/orig: &quot;a foo&quot;
</code></pre> <p>It's possible we'll crash while the log file write is in progress and we'll have an incomplete log file. In the first case above, we know that the log file isn't complete because the file says we should start at offset <code>2</code> and write <code>3</code> bytes, but only one byte, <code>f</code>, is specified, so the log file must be incomplete. In the second case above, we can tell the log file is incomplete because the undo log format should start with an offset and a length, but we have neither. Either way, since we know that the log file isn't complete, we know that we don't need to restore.</p> <p>Another possible outcome is something like:</p> <pre><code>d/log: &quot;2,3,foo&quot;
d/orig: &quot;a boo&quot;

d/log: &quot;2,3,foo&quot;
d/orig: &quot;a bar&quot;
</code></pre> <p>In the first case, the log file is complete and we crashed while writing the file. This is fine, since the log file tells us how to restore to a known good state. In the second case, the write completed, but since the log file hasn't been deleted yet, we'll restore from the log file.</p> <p>If we're using <code>ext3</code> or <code>ext4</code> with <code>data=ordered</code>, we might see something like:</p> <pre><code>d/log: &quot;2,3,fo&quot;
d/orig: &quot;a boo&quot;

d/log: &quot;&quot;
d/orig: &quot;a bor&quot;
</code></pre> <p>With <code>data=ordered</code>, there's no guarantee that the <code>write</code> to the log file and the <code>pwrite</code> that modifies the original file will execute in program order. Instead, we could get</p> <pre><code>creat(/d/log)                 // Create undo log
pwrite(/d/orig, &quot;bar&quot;, 3, 2)  // Modify file before writing undo log!
write(/d/log, &quot;2,3,foo&quot;, 7)   // Write undo log
unlink(/d/log)                // Delete log file
</code></pre> <p>To prevent this re-ordering, we can use another syscall, <code>fsync</code>. <code>fsync</code> is a barrier (prevents re-ordering) and it flushes caches (which we'll talk about later).</p> <pre><code>creat(/d/log)
write(/d/log, &quot;2,3,foo&quot;, 7)
fsync(/d/log)                 // Add fsync to prevent re-ordering
pwrite(/d/orig, &quot;bar&quot;, 3, 2)
fsync(/d/orig)                // Add fsync to prevent re-ordering
unlink(/d/log)
</code></pre> <p>This works with <code>ext3</code> or <code>ext4</code> in <code>data=ordered</code> mode, but if we use <code>data=writeback</code>, we might see something like:</p> <pre><code>d/log: &quot;2,3,WAT&quot;
d/orig: &quot;a boo&quot;
</code></pre> <p>Unfortunately, with <code>data=writeback</code>, the <code>write</code> to the log file isn't guaranteed to be atomic and the filesystem metadata that tracks the file length can get updated before we've finished writing the log file, which will make it look like the log file contains whatever bits happened to be on disk where the log file was created. 
Since the log file exists, when we try to restore after a crash, we may end up &quot;restoring&quot; random garbage into the original file. To prevent this, we can add a checksum (a way of making sure the file is actually valid) to the log file.</p> <pre><code>creat(/d/log)
write(/d/log, &quot;…[✓∑],foo&quot;, 7)  // Add checksum to log file to detect incomplete log file
fsync(/d/log)
pwrite(/d/orig, &quot;bar&quot;, 3, 2)
fsync(/d/orig)
unlink(/d/log)
</code></pre> <p>This should work with <code>data=writeback</code>, but we could still see the following:</p> <pre><code>d/orig: &quot;a boo&quot;
</code></pre> <p>There's no log file, even though we created the file, wrote to it, and then fsync'd it! Unfortunately, there's no guarantee that the directory will actually store the location of the file if we crash. In order to make sure we can easily find the file when we restore from a crash, we need to fsync the parent of the newly created log.</p> <pre><code>creat(/d/log)
write(/d/log, &quot;…[✓∑],foo&quot;, 7)
fsync(/d/log)
fsync(/d)                      // fsync parent directory
pwrite(/d/orig, &quot;bar&quot;, 3, 2)
fsync(/d/orig)
unlink(/d/log)
</code></pre> <p>There are a couple more things we should do. We should also fsync after we're done (not shown), and we also need to check for errors. These syscalls can return errors and those errors need to be handled appropriately. There's at least one filesystem issue that makes this very difficult, but since that's not an API usage thing per se, we'll look at this again in the <strong>Filesystems</strong> section.</p>
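<p>To put the whole sequence in one place, here's a rough sketch of the same steps in Python. This mirrors the example above (undo log with a checksum, fsync on the log, on the log's parent directory, and on the original file), but the paths, the log format, and the helper name are just illustrative, error handling and the final fsync after the unlink are omitted, and it shouldn't be read as a guarantee of safety on any particular filesystem:</p> <pre><code>import os, zlib

def safe_pwrite(orig_path, data, offset, log_path='/d/log'):
    # Save the bytes we're about to overwrite, plus a checksum, to an undo log
    fd = os.open(orig_path, os.O_RDWR)
    old = os.pread(fd, len(data), offset)
    record = b'%d,%d,%s' % (offset, len(old), old)
    record += b'|%d' % zlib.crc32(record)

    log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(log_fd, record)
    os.fsync(log_fd)                      # make the log contents durable
    os.close(log_fd)

    dir_fd = os.open(os.path.dirname(log_path), os.O_RDONLY)
    os.fsync(dir_fd)                      # make the log's directory entry durable
    os.close(dir_fd)

    os.pwrite(fd, data, offset)           # modify the original file
    os.fsync(fd)                          # flush the modification
    os.close(fd)

    os.unlink(log_path)                   # delete the undo log
</code></pre>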
<p>We've now seen what we have to do to write a file safely. It might be more complicated than we like, but it seems doable -- if someone asks you to write a file in a self-contained way, like an interview question, and you know the appropriate rules, you can probably do it correctly. But what happens if we have to do this as a day-to-day part of our job, where we'd like to write to files safely every time we write to files in a large codebase?</p> <h4 id="api-in-practice">API in practice</h4> <p>Pillai et al., OSDI’14 looked at a bunch of software that writes to files, including things we'd hope write to files safely, like databases and version control systems: Leveldb, LMDB, GDBM, HSQLDB, Sqlite, PostgreSQL, Git, Mercurial, HDFS, Zookeeper. They then wrote a static analysis tool that can find incorrect usage of the file API, things like incorrectly assuming that operations that aren't atomic are actually atomic, incorrectly assuming that operations that can be re-ordered will execute in program order, etc.</p> <p>When they did this, they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug. This isn't a knock on the developers of this software or the software -- the programmers who work on things like Leveldb, LMDB, etc., know more about filesystems than the vast majority of programmers and the software has more rigorous tests than most software. But they still can't use files safely every time! A natural follow-up to this is the question: why is the file API so hard to use that even experts make mistakes?</p> <h5 id="concurrent-programming-is-hard">Concurrent programming is hard</h5> <p>There are a number of reasons for this. If you ask people &quot;what are hard problems in programming?&quot;, you'll get answers like distributed systems, concurrent programming, security, aligning things with CSS, dates, etc.</p> <p>And if we look at what mistakes cause bugs when people do concurrent programming, we see bugs come from things like &quot;incorrectly assuming operations are atomic&quot; and &quot;incorrectly assuming operations will execute in program order&quot;. These things that make concurrent programming hard also make writing files safely hard -- we saw examples of both of these kinds of bugs in our first example. More generally, many of the same things that make concurrent programming hard are the same things that make writing to files safely hard, so of course we should expect that writing to files is hard!</p> <p>Another property writing to files safely shares with concurrent programming is that it's easy to write code that has infrequent, non-deterministic failures. With respect to files, people will sometimes say this makes things easier (&quot;I've never noticed data corruption&quot;, &quot;your data is still mostly there most of the time&quot;, etc.), but if you want to write files safely because you're working on software that shouldn't corrupt data, this makes things more difficult by making it more difficult to tell if your code is really correct.</p> <h5 id="api-inconsistent">API inconsistent</h5> <p>As we saw in our first example, even when using one filesystem, different modes may have significantly different behavior. Large parts of the file API look like this, where behavior varies across filesystems or across different modes of the same filesystem. For example, if we look at mainstream filesystems, appends are atomic, except when using <code>ext3</code> or <code>ext4</code> with <code>data=writeback</code>, or <code>ext2</code> in any mode; and directory operations can't be re-ordered w.r.t. any other operations, except on <code>btrfs</code>. In theory, we should all read the POSIX spec carefully and make sure all our code is valid according to POSIX, but, if they check filesystem behavior at all, people tend to code to what their filesystem does and not some abstract spec.</p> <p>If we look at one particular mode of one filesystem (<code>ext4</code> with <code>data=journal</code>), that seems relatively possible to handle safely, but when writing for a variety of filesystems, especially when handling filesystems that are very different from <code>ext3</code> and <code>ext4</code>, like <code>btrfs</code>, it becomes very difficult for people to write correct code.</p> <h4 id="docs-unclear">Docs unclear</h4> <p>In our first example, we saw that we can get different behavior from using different <code>data=</code> modes. If we look at the manpage (manual) on what these modes mean in <code>ext3</code> or <code>ext4</code>, we get:</p> <blockquote> <p>journal: All data is committed into the journal prior to being written into the main filesystem.</p> <p>ordered: This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.</p> <p>writeback: Data ordering is not preserved – data may be written into the main filesystem after its metadata has been committed to the journal. <strong>This is rumoured to be</strong> the highest-throughput option. 
It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.</p> </blockquote> <p>If you want to know how to use your filesystem safely, and you don't already know what a journaling filesystem is, this definitely isn't going to help you. If you know what a journaling filesystem is, this will give you some hints but it's still not sufficient. It's theoretically possible to figure everything out from reading the source code, but this is pretty impractical for most people who don't already know how the filesystem works.</p> <p>For English-language documentation, there's lwn.net and the Linux kernel mailing list (LKML). LWN is great, but they can't keep up with everything, so LKML is the place to go if you want something comprehensive. Here's an example of an exchange on LKML about filesystems:</p> <p><strong>Dev 1</strong>: Personally, I care about metadata consistency, and ext3 documentation suggests that journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there.<br> <strong>Dev 2</strong>: as the ext3 authors have stated many times over the years, you still need to run fsck periodically anyway.<br> <strong>Dev 1</strong>: Where is that documented?<br> <strong>Dev 2</strong>: linux-kernel mailing list archives.<br> <strong>FS dev</strong>: Probably from some 6-8 years ago, in e-mail postings that I made.<br></p> <p>While the filesystem developers tend to be helpful and they write up informative responses, most people probably don't keep up with the past 6-8 years of LKML.</p> <h4 id="performance-correctness-conflict">Performance / correctness conflict</h4> <p>Another issue is that the file API has an inherent conflict between performance and correctness. We noted before that <code>fsync</code> is a barrier (which we can use to enforce ordering) and that it flushes caches. If you've ever worked on the design of a high-performance cache, like a microprocessor cache, you'll probably find the bundling of these two things into a single primitive to be unusual. A reason this is unusual is that flushing caches has a significant performance cost and there are many cases where we want to enforce ordering without paying this performance cost. Bundling these two things into a single primitive forces us to pay the cache flush cost when we only care about ordering.</p> <p>Chidambaram et al., SOSP’13 looked at the performance cost of this by modifying <code>ext4</code> to add a barrier mechanism that doesn't flush caches. They found that, if they modified software appropriately and used their barrier operation where a full <code>fsync</code> wasn't necessary, they were able to achieve performance roughly equivalent to <code>ext4</code> with cache flushing entirely disabled (which is unsafe and can lead to data corruption) without sacrificing safety. However, making your own filesystem and getting it adopted is impractical for most people writing user-level software. Some databases will bypass the filesystem entirely or almost entirely, but this is also impractical for most software.</p> <p>That's the file API. Now that we've seen that it's extraordinarily difficult to use, let's look at filesystems.</p> <h3 id="filesystem">Filesystem</h3> <p>If we want to make sure that filesystems work, one of the most basic tests we could do is to inject errors at the layer below the filesystem to see if the filesystem handles them properly. 
For example, on a write, we could have the disk fail to write the data and return the appropriate error. If the filesystem drops this error or doesn't handle this properly, that means we have data loss or data corruption. This is analogous to the kinds of distributed systems faults Kyle Kingsbury talked about in his distributed systems testing talk yesterday (although these kinds of errors are much more straightforward to test).</p> <p>Prabhakaran et al., SOSP’05 did this and found that, for most filesystems tested, almost all write errors were dropped. The major exception to this was on ReiserFS, which did a pretty good job with all types of errors tested, but ReiserFS isn't really used today for reasons beyond the scope of this talk.</p> <p><a href="//danluu.com/filesystem-errors/">We (Wesley Aptekar-Cassels and I) looked at this again in 2017</a> and found that things had improved significantly. Most filesystems (other than JFS) could pass these very basic tests on error handling.</p> <p>Another way to look for errors is to look at filesystems code to see if it handles internal errors correctly. Gunawi et al., FAST’08 did this and found that internal errors were dropped a significant percentage of the time. The technique they used made it difficult to tell if functions that could return many different errors were correctly handling each error, so they also looked at calls to functions that can only return a single error. In those cases, errors were dropped roughly 2/3 to 3/4 of the time, depending on the function.</p> <p>Wesley and I also looked at this again in 2017 and found significant improvement -- errors for the same functions Gunawi et al. looked at were &quot;only&quot; ignored 1/3 to 2/3 of the time, depending on the function.</p> <p>Gunawi et al. also looked at comments near these dropped errors and found comments like &quot;Just ignore errors at this point. There is nothing we can do except to try to keep going.&quot; (XFS) and &quot;Error, skip block and hope for the best.&quot; (ext3).</p> <p>Now we've seen that while filesystems used to drop even the most basic errors, they now handle them correctly, but there are some code paths where errors can get dropped. For a concrete example of a case where this happens, let's look back at our first example. If we get an <a href="https://lwn.net/Articles/752063/">error on <code>fsync</code></a>, unless we have a pretty recent Linux kernel (Q2 2018-ish), there's a pretty good chance that the error will be dropped and it may even get reported to the wrong process!</p>
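<p>Noticing the failure at all requires actually checking what <code>fsync</code> returns. In Python, that looks roughly like the sketch below; the handler at the bottom is a made-up placeholder, because, as discussed next, there often isn't much that can be done:</p> <pre><code>import os

fd = os.open('/d/orig', os.O_WRONLY)
os.pwrite(fd, b'bar', 2)
try:
    os.fsync(fd)                  # may raise OSError if the write-back failed
except OSError as err:
    # By the time we see this, the kernel may have already dropped the dirty
    # data, so retrying the fsync isn't enough to make the write safe.
    handle_unrecoverable_write_error(err)   # placeholder, not a real function
</code></pre>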
<p>On recent Linux kernels, there's a good chance the error will be reported (to the correct process, even). Wilcox, PGCon’18 notes that an error on <code>fsync</code> is basically unrecoverable. The details depend on the filesystem -- on <code>XFS</code> and <code>btrfs</code>, modified data that's in the filesystem will get thrown away and there's no way to recover. On <code>ext4</code>, the data isn't thrown away, but it's marked as unmodified, so the filesystem won't try to write it back to disk later, and if there's memory pressure, the data can be thrown out at any time. If you're feeling adventurous, you can try to recover the data before it gets thrown out with various tricks (e.g., by forcing the filesystem to mark it as modified again, or by writing it out to another device, which will force the filesystem to write the data out even though it's marked as unmodified), but there's no guarantee you'll be able to recover the data before it's thrown out. On Linux <code>ZFS</code>, it appears that there's a code path designed to do the right thing, but CPU usage spikes and the system may hang or become unusable.</p> <p>In general, there isn't a good way to recover from this on Linux. Postgres, MySQL, and MongoDB (widely used databases) will crash themselves and the user is expected to restore from the last checkpoint. Most software will probably just silently lose or corrupt data. And <code>fsync</code> is a relatively good case -- for example, <code>syncfs</code> simply doesn't return errors on Linux at all, leading to silent data loss and data corruption.</p> <p>BTW, when Craig Ringer first proposed that Postgres should crash on <code>fsync</code> error, the <a href="//danluu.com/fsyncgate/">first response on the Postgres dev mailing list</a> was:</p> <blockquote> <p>Surely you jest . . . If [current behavior of fsync] is actually the case, we need to push back on this kernel brain damage</p> </blockquote> <p>But after talking through the details, everyone agreed that crashing was the only good option. One of the many unfortunate things is that most disk errors are transient. Since the filesystem discards critical information that's necessary to proceed without data corruption on any error, transient errors that could be retried instead force software to take drastic measures.</p> <p>And while we've talked about Linux, this isn't unique to Linux. Fsync error handling (and error handling in general) is broken on many different operating systems. At the time Postgres &quot;discovered&quot; the behavior of fsync on Linux, FreeBSD had arguably correct behavior, but OpenBSD and NetBSD behaved the same as Linux (true error status dropped, retrying causes success response, data lost). This has been fixed on OpenBSD and probably some other BSDs, but Linux still basically has the same behavior and you don't have good guarantees that this will work on any random UNIX-like OS.</p> <p>Now that we've seen that, for many years, filesystems failed to handle errors in some of the most straightforward and simple cases and that there are cases that still aren't handled correctly today, let's look at disks.</p> <h3 id="disk">Disk</h3> <h4 id="flushing">Flushing</h4> <p>We've seen that it's easy to not realize we have to call <code>fsync</code> when we have to call <code>fsync</code>, and that even if we call <code>fsync</code> appropriately, bugs may prevent <code>fsync</code> from actually working. Rajimwale et al., DSN’11 looked into whether or not disks actually flush when you ask them to flush, assuming everything above the disk works correctly (their paper is actually mostly about something else; they just discuss this briefly at the beginning). Someone from Microsoft anonymously told them &quot;[Some disks] do not allow the file system to force writes to disk properly&quot; and someone from Seagate, a disk manufacturer, told them &quot;[Some disks (though none from us)] do not allow the file system to force writes to disk properly&quot;. 
Bairavasundaram et al., FAST’07 also found the same thing when they looked into disk reliability.</p> <h4 id="error-rates">Error rates</h4> <p>We've seen that filesystems sometimes don't handle disk errors correctly. If we want to know how serious this issue is, we should look at the rate at which disks emit errors. Disk datasheets will usually list an uncorrectable bit error rate of 1e-14 for consumer HDDs (often called spinning metal or spinning rust disks), 1e-15 for enterprise HDDs, 1e-15 for consumer SSDs, and 1e-16 for enterprise SSDs. This means that, on average, we expect to see one unrecoverable data error every 1e14 bits we read on an HDD.</p> <p>To get an intuition for what this means in practice, 1TB is now a pretty normal disk size. If we read a full drive once, that's 1e12 bytes, or almost 1e13 bits (technically 8e12 bits), which means we should see, in expectation, one unrecoverable error if we buy a 1TB HDD and read the entire disk ten-ish times. Nowadays, we can buy 10TB HDDs, in which case we'd expect to see an error (technically, 8/10ths of an error) on every read of an entire consumer HDD.</p> <p>In practice, observed error rates are significantly higher. Narayanan et al., SYSTOR’16 (Microsoft) observed SSD error rates from 1e-11 to 6e-14, depending on the drive model. Meza et al., SIGMETRICS’15 (FB) observed even worse SSD error rates, 2e-9 to 6e-11 depending on the model of drive. An error rate of 2e-9 is one error every 5e8 bits, or roughly one error per 60 MB read -- depending on the class of drive, these observed rates are 500 thousand to 5 million times worse than stated on datasheets.</p> <p>Bit error rate is arguably a bad metric for disk drives, but this is the metric disk vendors claim, so that's what we have to compare against if we want an apples-to-apples comparison. See Bairavasundaram et al., SIGMETRICS'07, Schroeder et al., FAST'16, and others for other kinds of error rates.</p> <p>One thing to note is that it's often claimed that SSDs don't have problems with corruption because they use error correcting codes (ECC), which can fix data corruption issues. &quot;Flash banishes the specter of the unrecoverable data error&quot;, etc. The thing this misses is that modern high-density flash devices are very unreliable and need ECC to be usable at all. Grupp et al., FAST’12 looked at error rates of the kind of flash that underlies SSDs and found error rates from 1e-1 to 1e-8. 1e-1 is one error every ten bits, 1e-8 is one error every 100 megabits.</p>
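<p>To make the gap between datasheet and observed rates concrete, here's the back-of-the-envelope arithmetic from this section in code form (the rates are the ones quoted above; which rate applies to a given drive depends on the drive):</p> <pre><code>DATASHEET_HDD_BER = 1e-14   # claimed uncorrectable errors per bit read (consumer HDD)
OBSERVED_SSD_BER  = 2e-9    # worst observed rate quoted above

bits_per_full_read = 1e12 * 8           # reading a 1TB drive once

print(bits_per_full_read * DATASHEET_HDD_BER)  # ~0.08 expected errors per full read
print(bits_per_full_read * OBSERVED_SSD_BER)   # ~16,000 expected errors per full read
print(1 / OBSERVED_SSD_BER / 8 / 1e6)          # ~62 MB read, on average, between errors
</code></pre>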
<h4 id="power-loss">Power loss</h4> <p>Another claim you'll hear is that SSDs are safe against power loss and some types of crashes because they now have &quot;power loss protection&quot; -- there's some mechanism in the SSDs that can hold power for long enough during an outage that the internal SSD cache can be written out safely.</p> <p><a href="http://lkcl.net/reports/ssd_analysis.html">Luke Leighton tested this</a> by buying 6 SSDs that claim to have power loss protection and found that four out of the six models of drive he tested failed (every drive that wasn't an Intel drive). If we look at the details of the tests, when drives fail, it appears to be because they were used in a way that the implementor of power loss protection didn't expect (writing &quot;too fast&quot;, although well under the rate at which the drive is capable of writing, or writing &quot;too many&quot; files in parallel). When a drive advertises that it has power loss protection, this appears to mean that someone spent some amount of effort implementing something that will, under some circumstances, prevent data loss or data corruption under power loss. But, as we saw in Kyle's talk yesterday on distributed systems, if you want to make sure that the mechanism actually works, you can't rely on the vendor to do rigorous or perhaps even any semi-serious testing and you have to test it yourself.</p> <h4 id="retention">Retention</h4> <p>If we look at SSD datasheets, a young-ish drive (one with 90% of its write cycles remaining) will usually be specced to hold data for about ten years after a write. If we look at a worn out drive, one very close to end-of-life, it's specced to retain data for one year to three months, depending on the class of drive. I think people are often surprised to find that it's within spec for a drive to lose data three months after the data is written.</p> <p>These numbers all come from datasheets and specs and, as we've seen, datasheets can be a bit optimistic. On many early SSDs, using up most or all of a drive's write cycles would cause the drive to brick itself, so you wouldn't even get the spec'd three month data retention.</p> <h3 id="corollaries">Corollaries</h3> <p>Now that we've seen that there are significant problems at every level of the file stack, let's look at a couple things that follow from this.</p> <h4 id="what-to-do">What to do?</h4> <p>What we should do about this is a big topic. In the time we have left, one thing we can do instead of writing to files is to use databases. If you want something lightweight and simple that you can use in most places you'd use a file, SQLite is pretty good. I'm not saying you should never use files. There is a tradeoff here. But if you have an application where you'd like to reduce the rate of data corruption, consider using a database to store data instead of using files.</p> <h4 id="fs-support">FS support</h4> <p>At the start of this talk, we looked at this Dropbox example, where most people thought that there was no reason to remove support for most Linux filesystems because filesystems are all the same. I believe their hand was forced by the way they want to store/use data, which they can only do with <code>ext</code> given how they're doing things (which is arguably a mis-feature), but even if that wasn't the case, perhaps you can see why software that's attempting to sync data to disk reliably and with decent performance might not want to support every single filesystem in the universe for an OS that, for their product, is relatively niche. Maybe it's worth supporting every filesystem for PR reasons and then going through the contortions necessary to avoid data corruption on a per-filesystem basis (you can try coding straight to your reading of the POSIX spec, but as we've seen, that won't save you on Linux), but the PR problem is caused by a misunderstanding.</p> <p>The other comment we looked at on reddit, and also a common sentiment, is that it's not a program's job to work around bugs in libraries or the OS. But user data gets corrupted regardless of whose &quot;fault&quot; the bug is, and as we've seen, bugs can persist in the filesystem layer for many years. 
In the case of <code>Linux</code>, most filesystems other than <code>ZFS</code> seem to have decided it's correct behavior to throw away data on fsync error and also not report that the data can't be written (as opposed to <code>FreeBSD</code> or <code>OpenBSD</code>, where most filesystems will at least report an error on subsequent <code>fsync</code>s if the error isn't resolved). This is arguably a bug and also arguably correct behavior, but either way, if your software doesn't take this into account, you're going to lose or corrupt data. If you want to take the stance that it's not your fault that the filesystem is corrupting data, your users are going to pay the cost for that.</p> <h3 id="faq">FAQ</h3> <p>While putting this talk together, I read a bunch of different online discussions about how to write to files safely. For discussions outside of specialized communities (e.g., LKML, the Postgres mailing list, etc.), many people will drop by to say something like &quot;why is everyone making this so complicated? You can do this very easily and completely safely with this one weird trick&quot;. Let's look at the most common &quot;one weird trick&quot;s from two thousand internet comments on how to write to disk safely.</p> <h4 id="rename">Rename</h4> <p>The most frequently mentioned trick is to rename instead of overwriting. If you remember our single-file write example, we made a copy of the data that we wanted to overwrite before modifying the file. The trick here is to do the opposite:</p> <ol> <li>Make a copy of the entire file</li> <li>Modify the copy</li> <li>Rename the copy on top of the original file</li> </ol> <p>This trick doesn't work. People seem to think that this is safe because the POSIX spec says that <code>rename</code> is atomic, but that only means <code>rename</code> is atomic with respect to normal operation; that doesn't mean it's atomic on crash. This isn't just a theoretical problem; if we look at mainstream Linux filesystems, most have at least one mode where rename isn't atomic on crash. Rename also isn't guaranteed to execute in program order, as people sometimes expect.</p> <p>The most mainstream exception where rename is atomic on crash is probably <code>btrfs</code>, but even there, it's a bit subtle -- as noted in Bornholt et al., ASPLOS’16, <code>rename</code> is only atomic on crash when renaming to replace an existing file, not when renaming to create a new file. Also, Mohan et al., OSDI’18 found numerous rename atomicity bugs on <code>btrfs</code>, some quite old and some introduced the same year as the paper, so you may not want to rely on this without extensive testing, even if you're writing <code>btrfs</code>-specific code.</p> <p>And even if this worked, the performance of this technique is quite poor.</p> <h4 id="append">Append</h4> <p>The second most frequently mentioned trick is to only ever append (instead of sometimes overwriting). This also doesn't work. As noted in Pillai et al., OSDI’14 and Bornholt et al., ASPLOS’16, appends don't guarantee ordering or atomicity and believing that appends are safe is the cause of some bugs.</p> <h4 id="one-weird-tricks">One weird tricks</h4> <p>We've seen that the most commonly cited simple tricks don't work. 
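In contrast, the suggestion from the &quot;What to do?&quot; section above -- handing the problem to a database -- does mostly work for lightweight use, because the database takes on the journaling and fsync discipline for you. A minimal sketch of what that can look like with SQLite from Python (the database filename, table, and values here are made up for illustration):</p> <pre><code>import sqlite3

conn = sqlite3.connect('app.db')   # 'app.db' is an example path
conn.execute('CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)')
with conn:                         # commits on success, rolls back on error
    conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', ('greeting', 'a bar'))
conn.close()
</code></pre> <p>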
Something I find interesting is that, in these discussions, people will drop into a discussion where it's already been explained, often in great detail, why writing to files is harder than someone might naively think, ignore all of the warnings and explanations, and still proceed with their explanation of why it's, in fact, really easy. Even when warned that files are harder than people think, people still think they're easy!</p> <h3 id="conclusion">Conclusion</h3> <p>In conclusion, computers don't work (but you probably already know this if you're here at Gary-conf). This talk happened to be about files, but there are many areas we could've looked into where we would've seen similar things.</p> <p>One thing I'd like to note before we finish is that, IMO, the underlying problem isn't technical. If you look at what huge tech companies do (companies like FB, Amazon, MS, Google, etc.), they often handle writes to disk pretty safely. They'll make sure that they have disks where power loss protection actually works, they'll have patches into the OS and/or other instrumentation to make sure that errors get reported correctly, there will be large distributed storage groups to make sure data is replicated safely, etc. We know how to make this stuff pretty reliable. It's hard, and it takes a lot of time and effort, i.e., a lot of money, but it can be done.</p> <p>If you ask someone who works on that kind of thing why they spend mind-boggling sums of money to ensure (or really, increase the probability of) correctness, you'll often get an answer like &quot;we have a zillion machines and if you do the math on the rate of data corruption, if we didn't do all of this, we'd have data corruption every minute of every day. It would be totally untenable&quot;. A huge tech company might have, what, order of ten million machines? The funny thing is, if you do the math for how many consumer machines there are out there and how much consumer software runs on unreliable disks, the math is similar. There are many more consumer machines; they're typically operated at much lighter load, but there are enough of them that, if you own a widely used piece of desktop/laptop/workstation software, the math on data corruption is pretty similar. Without &quot;extreme&quot; protections, we should expect to see data corruption all the time.</p> <p>But if we look at how consumer software works, it's usually quite unsafe with respect to handling data. IMO, the key difference here is that when a huge tech company loses data, whether that's data on who's likely to click on which ads or user emails, the company pays the cost, directly or indirectly, and the cost is large enough that it's obviously correct to spend a lot of effort to avoid data loss. But when consumers have data corruption on their own machines, they're mostly not sophisticated enough to know who's at fault, so the company can avoid taking the brunt of the blame. If we have a global optimization function, the math is the same -- of course we should put more effort into protecting data on consumer machines. But if we're a company that's locally optimizing for our own benefit, the math works out differently and maybe it's not worth it to spend a lot of effort on avoiding data corruption.</p> <p>Yesterday, Ramsey Nasser gave a talk where he made a very compelling case that something was a serious problem, which was followed up by a comment that his proposed solution will have a hard time getting adoption. 
I agree with both parts -- he discussed an important problem, and it's not clear how solving that problem will make anyone a lot of money, so the problem is likely to go unsolved.</p> <p>With GDPR, we've seen that regulation can force tech companies to protect people's privacy in a way they're not naturally inclined to do, but regulation is a very big hammer and the unintended consequences can often negate or more than negate the benefits of regulation. When we look at the history of regulations that are designed to force companies to do the right thing, we can see that it's often many years, sometimes decades, before the full impact of the regulation is understood. Designing good regulations is hard, much harder than any of the technical problems we've discussed today.</p> <h3 id="acknowledgements">Acknowledgements</h3> <p>Thanks to Leah Hanson, Gary Bernhardt, Kamal Marhubi, Rebecca Isaacs, Jesse Luehrs, Tom Crayford, Wesley Aptekar-Cassels, Rose Ames, chozu@fedi.absturztau.be, and Benjamin Gilbert for their help with this talk!</p> <p>Sorry we went so fast. If there's anything you missed, you can catch it in the pseudo-transcript at danluu.com/deconstruct-files.</p> <p><small><em>This &quot;transcript&quot; is pretty rough since I wrote it up very quickly this morning before the talk. I'll try to clean it up within a few weeks, which will include adding material that was missed, inserting links, fixing typos, adding references that were missed, etc.</em></p> <p><em>Thanks to Anatole Shaw, Jernej Simoncic, @junh1024, Yuri Vishnevsky, and Josh Duff for comments/corrections/discussion on this transcript.</em></small></p> <p><link rel="prefetch" href="//danluu.com"> <link rel="prefetch" href="//danluu.com/fsyncgate/"></p> Randomized trial on gender in Overwatch overwatch-gender/ Tue, 19 Feb 2019 00:00:00 +0000 overwatch-gender/ <p>A recurring discussion in Overwatch (as well as other online games) is whether or not women are treated differently from men. If you do a quick search, you can find hundreds of discussions about this, some of which have well over a thousand comments. These discussions tend to go the same way and involve the same debate every time, with the same points being made on both sides. Just for example, take <a href="https://www.reddit.com/r/Overwatch/comments/8hvmih/the_girl_problem_an_open_letter_to_the_overwatch/">these</a> <a href="https://www.reddit.com/r/Overwatch/comments/8i4wvs/a_response_to_the_girl_problem_post_moral/">three</a> <a href="https://www.reddit.com/r/Overwatch/comments/8i7cbx/when_we_call_talking_about_sexism_in_overwatch/">threads</a> on reddit that spun out of a single post, which have a total of 10.4k comments. On one side, you have people saying &quot;sure, women get trash talked, but I'm a dude and I get trash talked, everyone gets trash talked there's no difference&quot;, &quot;I've never seen this, it can't be real&quot;, etc., and on the other side you have people saying things like &quot;when I play with my boyfriend, I get accused of being carried by him all the time but the reverse never happens&quot;, &quot;people regularly tell me I should play mercy[, a character that's a female healer]&quot;, and so on and so forth. 
In less time than has been spent on a single large discussion, we could just run the experiment, so here it is.</p> <p>This is the result of playing 339 games in the two main game modes, quick play (QP) and competitive (comp), where roughly half the games were played with a masculine name (where the username was a generic term for a man) and half were played with a feminine name (where the username was a woman's name). I recorded all of the comments made in each of the games and then classified the comments by type. Classes of comments were &quot;sexual/gendered comments&quot;, &quot;being told how to play&quot;, &quot;insults&quot;, and &quot;compliments&quot;.</p> <p>I decided whether or not to include each game in the experiment before the character selection screen loaded. In games that were included, I used the same character selection algorithm, I wouldn't mute anyone for spamming chat or being a jerk, I didn't speak on voice chat (although I had it enabled), I never sent friend requests, and I was playing outside of a group in order to get matched with 5 random players. When playing normally, I might choose a character I don't know how to use well and I'll mute people who pollute chat with bad comments. There are a lot of games that weren't included in the experiment because I wasn't in a mood to listen to someone rage at their team for fifteen minutes and the procedure I used involved pre-committing to not muting people who do that.</p> <h3 id="sexual-or-sexually-charged-comments">Sexual or sexually charged comments</h3> <p>I thought I'd see more sexual comments when using the feminine name as opposed to the masculine name, but that turned out to not be the case. There was some mention of sex, genitals, etc., in both cases and the rate wasn't obviously different and was actually higher in the masculine condition.</p> <p>Zero games featured comments directed specifically at me in the masculine condition, and two (out of 184) games in the feminine condition featured comments that were directed at me. Most comments were either directed at other players or were just general comments to team or game chat.</p> <p>Examples of typical undirected comments that would occur in either condition include &quot;my girlfriend keeps sexting me how do I get her to stop?&quot;, &quot;going in balls deep&quot;, &quot;what a surprise. *strokes dick* [during the post-game highlight]&quot;, and &quot;support your local boobies&quot;.</p> <p>The two games that featured sexual comments directed at me had the following comments:</p> <ul> <li>&quot;please mam can i have some coochie&quot;, &quot;yes mam please&quot; [from two different people], &quot;:boicootie:&quot;</li> <li>&quot;my dicc hard&quot; [believed to be directed at me from context]</li> </ul> <p>During games not included in the experiment (I generally didn't pay attention to which username I was on when not in the experiment), I also got comments like &quot;send nudes&quot;. 
Anecdotally, there appears to be a difference in the rate of these kinds of comments directed at the player, but the rate observed in the experiment is so low that <a href="https://statmodeling.stat.columbia.edu/2010/12/21/lets_say_uncert/">uncertainty intervals</a> around any estimates of the true rate will be similar in both conditions unless we use a <a href="https://en.wikipedia.org/wiki/Strong_prior">strong prior</a>.</p> <p>The fact that this difference couldn't be observed in 339 games was surprising to me, although it's not inconsistent with <a href="https://aquila.usm.edu/cgi/viewcontent.cgi?article=1429&amp;context=honors_theses">McDaniel's thesis, a survey of women who play video games</a>. 339 games probably sounds like a small number to serious gamers, but the only other randomized experiment I know of on this topic (besides this experiment) is <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0131613">Kasumovic et al.</a>, which notes that &quot;[w]e stopped at 163 [games] as this is a substantial time effort&quot;.</p> <p>All of the analysis uses the number of games in which a type of comment occurred rather than tone, to avoid having to code comments as having a certain tone and possibly injecting bias into the process. Sentiment analysis models, even state-of-the-art ones, often <a href="https://twitter.com/danluu/status/896176897675153409">return nonsensical results</a>, so this basically has to be done by hand, at least today. With much more data, some kind of sentiment analysis, done with liberal spot checking and re-training of the model, could work, but the total number of comments is so small in this case that it would amount to coding each comment by hand.</p> <p>Coding comments manually in an unbiased fashion can also be done with a level of blinding, but doing that would probably require getting more people involved (since I see and hear comments while I'm playing) and relying on unpaid or poorly paid labor.</p> <h3 id="being-told-how-to-play">Being told how to play</h3> <p>The most striking, easy-to-quantify difference was the rate at which I played games in which people told me how I should play. Since it's unclear how much confidence we should have in the difference if we just look at the raw rates, we'll use a simple statistical model to get the <a href="https://statmodeling.stat.columbia.edu/2010/12/21/lets_say_uncert/">uncertainty interval</a> around the estimates. Since I'm not sure what my belief about this should be, this uses <a href="https://en.wikipedia.org/wiki/Prior_probability#Uninformative_priors">an uninformative prior</a>, so the estimate is close to the actual rate. Anyway, here are the uncertainty intervals a simple model puts on the percent of games where at least one person told me I was playing wrong, that I should change how I'm playing, or that I should switch characters:</p> <p><style>table {border-collapse: collapse;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <div style="overflow-x:auto;"></p> <table> <thead> <tr> <th>Cond</th> <th>Est</th> <th>P25</th> <th>P75</th> </tr> </thead> <tbody> <tr> <td>F comp</td> <td>19</td> <td>13</td> <td>25</td> </tr> <tr> <td>M comp</td> <td>6</td> <td>2</td> <td>10</td> </tr> <tr> <td>F QP</td> <td>4</td> <td>3</td> <td>6</td> </tr> <tr> <td>M QP</td> <td>1</td> <td>0</td> <td>2</td> </tr> </tbody> </table> <p>The experimental conditions in this table are masculine vs. feminine name (M/F) and competitive mode vs. quick play (comp/QP). The numbers are percents. <code>Est</code> is the estimate, <code>P25</code> is the 25%-ile estimate, and <code>P75</code> is the 75%-ile estimate. Competitive mode and using a feminine name are both correlated with being told how to play. See <a href="http://andrewgelman.com/2016/11/05/why-i-prefer-50-to-95-intervals/">this post by Andrew Gelman for why you might want to look at the 50% interval instead of the 95% interval</a>.</p>
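<p>For the curious, here's roughly how intervals like the ones in the table can be computed under a simple Beta-Binomial model with a uniform (uninformative) prior. This is an illustrative sketch rather than the exact code used for the table, and the counts passed in at the bottom are made up, not the real data:</p> <pre><code>from scipy.stats import beta

def interval(games_with_comment, games_total):
    # Posterior over the per-game rate with a uniform Beta(1, 1) prior
    posterior = beta(1 + games_with_comment,
                     1 + games_total - games_with_comment)
    est, p25, p75 = posterior.ppf([0.5, 0.25, 0.75])
    return 100 * est, 100 * p25, 100 * p75   # as percents of games

print(interval(4, 20))   # e.g., told how to play in 4 of 20 games (hypothetical counts)
</code></pre>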
<p>For people not familiar with Overwatch, in competitive mode, you're explicitly told what your ELO-like rating is and you get a badge that reflects your rating. In quick play, you have a rating that's tracked, but it's never directly surfaced to the user and you don't get a badge.</p> <p>It's generally believed that people are more on edge during competitive play and are more likely to lash out (and, for example, tell you how you should play). The data is consistent with this common belief.</p> <p>Per above, I didn't want to code tone of messages to avoid bias, so this table only indicates the rate at which people told me I was playing incorrectly or asked that I switch to a different character. The qualitative difference in experience is understated by this table. For example, the one time someone asked me to switch characters in the masculine condition, the request was a one-sentence, polite request (&quot;hey, we're dying too quickly, could we switch [from the standard one primary healer / one off healer setup] to double primary healer or switch our tank to [a tank that can block more damage]?&quot;). When using the feminine name, a typical case would involve 1-4 people calling me human garbage for most of the game and consoling themselves with the idea that the entire reason our team is losing is that I won't change characters.</p> <p>The simple model we're using indicates that there's probably a difference both between competitive and QP and between playing with a masculine vs. a feminine name. However, most published results are pretty bogus, so let's look at reasons this result might be bogus and then you can decide for yourself.</p> <h3 id="threats-to-validity">Threats to validity</h3> <p>The biggest issue is that this wasn't a <a href="https://en.wikipedia.org/wiki/Trial_registration">pre-registered trial</a>. I'm obviously not going to go and officially register a trial like this, but I also didn't informally &quot;register&quot; this by having this comparison in mind when I started the experiment. A problem with non-pre-registered trials is that there are a lot of degrees of freedom, both in terms of what we could look at, and in terms of the methodology we used to look at things, so it's unclear if the result is &quot;real&quot; or an artifact of fishing for something that looks interesting. A standard example of this is that, if you look for 100 possible effects, you're likely to find 1 that appears to be statistically significant with p = 0.01.</p> <p>There are standard techniques to correct for this problem (e.g., <a href="https://en.wikipedia.org/wiki/Bonferroni_correction">Bonferroni correction</a>), but I don't find these convincing because they usually don't capture all of the degrees of freedom that go into a statistical model. An example is that it's common to take a variable and discretize it into a few buckets. There are many ways to do this and you generally won't see papers talk about the impact of this or correct for this in any way, although changing how these buckets are arranged can drastically change the results of a study. 
Another common knob people can use to manipulate results is curve fitting to an inappropriate curve (often a 2nd or 3rd degree polynomial when a scatterplot shows that's clearly incorrect). Another way to handle this would be to use <a href="http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf">a more complex model</a>, but I wanted to keep this as simple as possible.</p> <p>If I wanted to really be convinced on this, I'd want to, at a minimum, re-run this experiment with this exact comparison in mind. As a result, this experiment would need to be replicated to provide more than a preliminary result that is, at best, weak evidence.</p> <p>One other large class of problems with randomized controlled trials (RCTs) is that, despite randomization, the two arms of the experiment might be different in some way that wasn't randomized. Since Overwatch doesn't allow you to keep changing your name, this experiment was done with two different accounts and these accounts had different ratings in competitive mode. On average, the masculine account had a higher rating due to starting with a higher rating, which meant that I was playing against stronger players and having worse games on the masculine account. In the long run, this will even out, but since most games in this experiment were in QP, this didn't have time to even out in comp. As a result, I had a higher win rate as well as just generally much better games with the feminine account in comp.</p> <p>With no other information, we might expect that people who are playing worse get told how to play more frequently and people who are playing better should get told how to play less frequently, which would mean that the table above understates the actual difference.</p> <p>However, <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0131613">Kasumovic et al., in a gender-based randomized trial of Halo 3</a>, found that players who were playing poorly were more negative towards women, especially women who were playing well (there's enough statistical manipulation of the data that a statement this concise can only be roughly correct; see the study for details). If that result holds, it's possible that I would've gotten fewer people telling me that I'm human garbage and need to switch characters if I was average instead of dominating most of my games in the feminine condition.</p> <p>If that result generalizes to OW, that would explain something I thought was odd, which was that a lot of demands to switch and general vitriol came during my best performances with the feminine account. A typical example of this would be a game where we have a 2-2-2 team composition (2 players playing each of the three roles in the game) where my counterpart in the same role ran into the enemy team and died at the beginning of the fight in almost every engagement. I happened to be having a good day and dominated the other team (37-2 in a ten minute comp game, while focusing on protecting our team's healers) while only dying twice, once on purpose as a sacrifice and a second time after a stupid blunder. 
Immediately after I died, someone asked me to switch roles so they could take over for me, but at no point did someone ask the other player in my role to switch despite their total uselessness all game (for OW players this was a Rein who immediately charged into the middle of the enemy team at every opportunity, from a range where our team could not possibly support them; this was Hanamura 2CP, where it's very easy for Rein to set up situations where their team cannot help them). This kind of performance was typical of games where my team jumped on me for playing incorrectly. This isn't to say I didn't have bad games; I had plenty of bad games, but a disproportionate number of the most toxic experiences came when I was having a great game.</p> <p>I tracked how well I did in games, but this sample doesn't have enough ranty games to do a meaningful statistical analysis of my performance vs. probability of getting thrown under the bus.</p> <p>Games at different ratings are probably also generally different environments and get different comments, but it's not clear if there are more negative comments at 2000 than 2500 or vice versa. There are a lot of online debates about this; for any rating level other than the very lowest or the very highest ratings, you can find a lot of people who say that the rating band they're in has the highest volume of toxic comments.</p> <h3 id="other-differences">Other differences</h3> <p>Here are some things that happened while playing with the feminine name that didn't happen with the masculine name during this experiment or in any game outside of this experiment:</p> <ul> <li>unsolicited &quot;friend&quot; requests from people I had no textual or verbal interaction with (happened 7 times total, didn't track which cases were in the experiment and which weren't)</li> <li>someone on the other team deciding that my team wasn't doing a good enough job of protecting me while I was playing healer, berating my team, and then throwing the game so that we won (happened once during the experiment)</li> <li>someone on my team flirting with me and then flipping out when I didn't respond, then spending the rest of the game calling me autistic or toxic (this happened once during the experiment, and once while playing in a game not included in the experiment)</li> </ul> <p>The rate of all these was low enough that I'd have to play many more games to observe something without a huge uncertainty interval.</p> <p>I didn't accept any friend requests from people I had no interaction with. Anecdotally, some people report that people will send sexual comments or berate them after an unsolicited friend request. It's possible that the effect shown in the table would be larger if I accepted these friend requests, and it couldn't be smaller.</p> <p>I didn't attempt to classify comments as flirty or not because, unlike the kinds of comments I did classify, this is often somewhat subtle and you could make a good case that any particular comment is or isn't flirting. Without responding (which I didn't do), many of these kinds of comments are ambiguous.</p> <p>Another difference was in the tone of the compliments. 
The rate of games where I was complimented wasn't too different, but compliments under the masculine condition tended to be short and factual (e.g., someone from the other team saying &quot;no answer for [name of character I was playing]&quot; after a dominant game) and compliments under the feminine condition tended to be more effusive and multiple people would sometimes chime in about how great I was.</p> <h3 id="non-differences">Non differences</h3> <p>The rate of compliments and the rate of insults in games that didn't include explanations of how I'm playing wrong or how I need to switch characters were similar in both conditions.</p> <h3 id="other-factors">Other factors</h3> <p>Some other factors that would be interesting to look at would be time of day, server, playing solo or in a group, specific character choice, being more or less communicative, etc., but it would take a lot more data to be able to get good estimates when adding in more variables. Blizzard should have the data necessary to do analyses like this in aggregate, but they're notoriously private with their data, so someone at Blizzard would have to do the work and then publish it publicly, and they're not really in the habit of doing that kind of thing. If you work at Blizzard and are interested in letting a third party do some analysis on an anonymized data set, let me know and I'd be happy to dig in.</p> <h3 id="experimental-minutiae">Experimental minutiae</h3> <p>Under both conditions, I avoided ever using voice chat and would call things out in text chat when time permitted. Also under both conditions, I mostly filled in with whatever character class the team needed most, although I'd sometimes pick DPS even when it was unnecessary (in general, DPS are heavily oversubscribed, so you'll rarely play DPS if you don't pick one).</p> <p>For quickplay, backfill games weren't counted (backfill games are games where you join after the game started to fill in for a player who left; comp doesn't allow backfills). 6% of QP games were backfills.</p> <p>These games are from before the &quot;endorsements&quot; patch; most games were played around May 2018. All games were played in &quot;solo q&quot; (with 5 random teammates). In order to avoid correlations between games depending on how long playing sessions were, I quit between games and waited for enough time (since you're otherwise likely to end up in a game with some or many of the same players as before).</p> <p>The model used the probability of a comment happening in a game to avoid the problem that Kasumovic et al. ran into, where a person who's ranting can skew the total number of comments. Kasumovic et al. addressed this by removing outliers, but I really don't like manually reaching in and removing data to adjust results. This could also be addressed by using a more sophisticated model, but a more sophisticated model means more knobs, which means more ways for bias to sneak in. Using the number of players who made comments instead would be one way to mitigate this problem, but I think this still isn't ideal because these aren't independent -- when one player starts being negative, this greatly increases the odds that another player in that game will be negative, but just using the number of players makes four games with one negative person the same as one game with four negative people. 
This can also be accounted for with a slightly more sophisticated model, but that also involves adding more knobs to the model.</p> <h3 id="update-98-ile">UPDATE: 98%-ile</h3> <p>One of the more common comments I got after I wrote this post was that it's only valid at &quot;low&quot; ratings, like Plat, which is 50%-ile. If someone is going to concede that a game's community is toxic at 50%-ile and that you have to be significantly better than that to avoid toxic players, that seems to be conceding that the game's community is toxic.</p> <p>However, to check whether that's accurate, I played a bit more, in games as high as 98%-ile, to see if things improved. While there was a minor improvement, it's not fundamentally different at 98%-ile, so people who are saying that things are much better at higher ranks either have very different experiences than I did or are referring to 99%-ile or above. If it's the latter, then I'd say that the previous comment about conceding that the game has a toxic community holds. If it's the former, perhaps I just got unlucky, but based on other people's comments about their experiences with the game, I don't think I got particularly unlucky.</p> <h3 id="appendix-comments-advice-to-overwatch-players">Appendix: comments / advice to Overwatch players</h3> <p>A common complaint, perhaps the most common complaint by people below 2000 SR (roughly <a href="https://us.forums.blizzard.com/en/overwatch/t/competitive-mode-tier-distribution/972">30%-ile</a>) or perhaps 1500 SR (roughly 10%-ile), is that they're in &quot;ELO hell&quot; and are kept down because their teammates are too bad. Based on my experience, I find this to be extremely unlikely.</p> <p>People often split skill up into &quot;mechanics&quot; and &quot;gamesense&quot;. My mechanics are pretty much as bad as it's possible to get. The last game I played seriously was a 90s video game that's basically <a href="https://en.wikipedia.org/wiki/SubSpace_(video_game)">online asteroids</a> and the last game before that I put any time into was the original SNES <a href="https://en.wikipedia.org/wiki/Super_Mario_Kart">super mario kart</a>. As you'd expect from someone who hasn't put significant time into a post-90s video game or any kind of FPS game, my aim and dodging are both atrocious. On top of that, I'm an old dude with slow reflexes, and I was able to get to 2500 SR (<a href="https://us.forums.blizzard.com/en/overwatch/t/competitive-mode-tier-distribution/972">roughly 60%-ile</a> among players who play &quot;competitive&quot;, likely higher among all players) by avoiding a few basic fallacies and blunders despite having approximately zero mechanical skill. If you're also an old dude with basically no FPS experience, you can do the same thing; if you have good reflexes or enough FPS experience to actually aim or dodge, you basically can't be worse mechanically than I am and you can do much better by avoiding a few basic mistakes.</p> <p>The most common fallacy I see repeated is that you have to play DPS to move out of bronze or gold. The evidence people give for this is that, when a GM streamer plays flex, tank, or healer, they sometimes lose in bronze. I guess the idea is that, because the only way to ensure a 99.9% win rate in bronze is to be a GM-level DPS player and play DPS, the best way to maintain a 55% or a 60% win rate is to play DPS, but this doesn't follow.</p> <p>Healers and tanks are both very powerful in low ranks. 
Because low ranks feature both poor coordination and relatively poor aim (players with good coordination or aim tend to move up quickly), time-to-kill is very long compared to higher ranks. As a result, an off healer can tilt the result of a 1v1 (and sometimes even a 2v1) matchup, and a primary healer can often determine the result of a 2v1 matchup. Because coordination is poor, most matchups end up being 2v1 or 1v1. The flip side of the lack of coordination is that you'll almost never get help from teammates. It's common to see an enemy player walk into the middle of my team, attack someone, and then walk out while literally no one else notices. If the person being attacked is you, the other healer typically won't notice and will continue healing someone at full health, and none of the classic &quot;peel&quot; characters will help or even notice what's happening. That means it's on you to pay attention to your surroundings and watch flank routes to avoid getting murdered.</p> <p>If you can avoid getting murdered constantly and actually try to heal (as opposed to many healers at low ranks, who will try to kill people or stick to a single character and continue healing them all the time even if they're at full health), you'll outheal a primary healer half the time when playing an off healer and, as a primary healer, you'll usually be able to get 10k-12k healing per 10 minutes compared to 6k to 8k for most people in Silver (sometimes less if they're playing DPS Moira). The extra 4k or so of healing is roughly half of what a typical healer at that rank provides, so it's like having an extra half a healer on your team, which basically makes the game 6.5v6 instead of 6v6. You can still lose a 6.5v6 game, and you'll lose plenty of games, but if you're consistently healing 50% more than a normal healer at your rank, you'll tend to move up even if you get a lot of major things wrong (heal order, healing when that only feeds the other team, etc.).</p> <p>A corollary to having to watch out for yourself 95% of the time when playing a healer is that, as a character who can peel, you can actually watch out for your teammates and put your team at a significant advantage in 95% of games. As Zarya or Hog, if you just boringly play towards the front of your team, you can basically always save at least one teammate from death in a team fight, and you can often do this 2 or 3 times. Meanwhile, your counterpart on the other team is walking around looking for 1v1 matchups. If they find a good one, they'll probably kill someone, and if they don't (if they run into someone with a mobility skill or a counter like Brig or Reaper), they won't. Even in the case where they kill someone and you don't do a lot, you still provide as much value as they do and, on average, you'll provide more value. A similar thing is true of many DPS characters, although it depends on the character (e.g., McCree is effective as a peeler, at least at the low ranks that I've played in). If you play a non-sniper DPS that isn't suited for peeling, you can find a DPS on your team who's looking for 1v1 fights and turn those fights into 2v1 fights (at low ranks, there's no shortage of these folks on both teams, so there are plenty of 1v1 fights you can control by making them 2v1).</p> <p>All of the things I've mentioned amount to actually trying to help your team instead of going for flashy PotG setups or trying to dominate the entire team by yourself. If you say this in the abstract, it seems obvious, but most people think they're better than their rating. 
It doesn't help that OW is designed to make people think they're doing well when they're not, and that the best way to get &quot;medals&quot; or &quot;play of the game&quot; is to play in a way that severely reduces your odds of actually winning each game.</p> <p>Outside of obvious gameplay mistakes, the other big thing that loses games is when someone tilts and either starts playing terribly or flips out and says something to enrage someone else on the team, who then starts playing terribly. I don't think you can do much about this in others directly, but you can make sure you never do it yourself, which means only 5/6 of your team will do this at some base rate, whereas all 6/6 of the other team will. Like all of the above, this won't cause you to win all of your games, but everything you do that increases your win rate makes a difference.</p> <p>Poker players have the right attitude when they talk about leaks. The goal isn't to win every hand, it's to increase your EV by avoiding bad blunders (at high levels, it's about more than avoiding bad blunders, but we're talking about getting out of below-median ranks here, not becoming GM). You're going to have terrible games where you get 5 people instalocking DPS. Your odds of winning such a game are low, say 10%. If you get mad and pick DPS too, reducing your odds even further (say, to 2%), all that does is create a leak in your win rate during games when your teammates are being silly.</p> <p>If you gain/lose 25 rating per game for a win or a loss, your average rating change from a game is <code>25 * (W_rate - L_rate) = 25 * (2 * W_rate - 1)</code>. Let's say 1/40 games are these silly games where your team decides to go all DPS. The per-game SR difference of trying to win these vs. soft throwing is maybe something like <code>1/40 * 25 * (2 * 0.08) = 0.1</code>. That doesn't sound like much, and these numbers are just guesses, but everyone outside of very high-level games is full of leaks like these, and they add up. And if you look at a 60% win rate, which is pretty good considering that your influence is limited because you're only one person on a 6-person team, that only translates to an average of 5 SR per game, so it doesn't actually take that many small leaks to really move your average SR gain or loss.</p>
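<p>As a concrete (if rough) illustration of that arithmetic, here's a small R sketch using the same guesses as above (25 SR per game, 1-in-40 &quot;silly&quot; games, and 10% vs. 2% win odds); the numbers are illustrative, not measurements:</p> <pre><code>sr_per_game &lt;- function(win_rate, k = 25) k * (2 * win_rate - 1)

sr_per_game(0.6)  # a 60% win rate only averages +5 SR per game

# per-game leak from soft throwing the 1-in-40 silly games
(1 / 40) * (sr_per_game(0.10) - sr_per_game(0.02))  # = 0.1 SR per game
</code></pre>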
<h3 id="appendix-general-comments-on-online-gaming-20-years-ago-vs-today">Appendix: general comments on online gaming, 20 years ago vs today</h3> <p>Since I'm unlikely to write another blog post on gaming any time soon, here are some other random thoughts that won't fit with any other post. My last serious experience with online games was with a game from the 90s. Even though I'd heard that things were a lot worse now, I was still surprised by it. IRL, the only time I encounter the same level and rate of pointless nastiness in a recreational activity is down at the bridge club (casual bridge games tend to be very nice). When I say pointless nastiness, I mean things like getting angry and then making nasty comments to a teammate mid-game. Even if your &quot;criticism&quot; is correct (and, if you review OW games or bridge hands, you'll see that these kinds of angry comments are almost never correct), this has virtually no chance of getting your partner to change their behavior and it has a pretty good chance of tilting them and making them play worse. If you're trying to win, there's no reason to do this and good reason to avoid it.</p> <p>If you look at the online commentary for this, it's common to see <a href="https://www.reddit.com/r/OverwatchUniversity/comments/arx09u/people_throwing_because_of_my_age/egq89fm/">people blaming kids</a>, but this doesn't match my experience at all. For one thing, when I was playing video games in the 90s, a huge fraction of the online gaming population was made up of kids, and online game communities were nicer than they are today. Saying that &quot;kids nowadays&quot; are worse than kids used to be is a pastime that goes back thousands of years, but it's generally not true and there doesn't seem to be any reason to think that it's true here.</p> <p>Additionally, this simply doesn't match what I saw. If I just look at comments over audio chat, there were a couple of times when some kids were nasty, but almost all of the comments were from people who sounded like adults. Moreover, if I look at when I played games that were bad, a disproportionately large number of those games were played late (after 2am eastern time, on the central/east server), when the relative population of adults is larger.</p> <p>And if we look at bridge, the median age of an <a href="https://www.acbl.org/">ACBL</a> member is in the 70s, with an increase in age of a whopping <a href="https://www.lajollabridge.com/Articles/EveningClubBridgeDecline.htm">0.4 years per year</a>.</p> <p>Sure, maybe people tend to get more mature as they age, but in any particular activity, that effect seems to be dominated by other factors. I don't have enough data at hand to make a good guess as to what happened, but I'm entertained by the idea that <a href="http://jadagul.tumblr.com/post/175863849718/xhxhxhx">this</a> might have something to do with it:</p> <blockquote> <p>I’ve said this before, but one of the single biggest culture shocks I’ve ever received was when I was talking to someone about five years younger than I was, and she said “Wait, you play video games? I’m surprised. You seem like way too much of a nerd to play video games. Isn’t that like a fratboy jock thing?”</p> </blockquote> <h3 id="appendix-faq">Appendix: FAQ</h3> <p>Here are some responses to the most common online comments.</p> <blockquote> <p>Plat? You suck at Overwatch</p> </blockquote> <p>Yep. But I sucked roughly equally on both accounts (actually somewhat more on the masculine account because it was rated higher and I was playing a bit out of my depth). Also, that's not a question.</p> <blockquote> <p>This is just a blog post, it's not an academic study, the results are crap.</p> </blockquote> <p>There's nothing magic about academic papers. I have my name on a few publications, including one that won the best paper award at the top conference in its field. My median blog post is more rigorous than my median paper or, for that matter, the median paper that I read.</p> <p>When I write a paper, I have to deal with co-authors who push for putting in false or misleading material that makes the paper look good, and my ability to push back against this has been fairly limited. On my blog, I don't have to deal with that and I can write up results that are accurate (to the best of my ability) even if it makes the result look less interesting or less likely to win an award.</p> <blockquote> <p>Gamers have always been toxic, that's just nostalgia talking.</p> </blockquote> <p>If I pull game logs for subspace, this seems to be false. YMMV depending on what games you played, I suppose. 
FWIW, airmash seems to be the modern version of subspace, and (until the game died) it was much more toxic than subspace even if you just compare on a per-game basis, despite having much smaller games (25 people for a good-sized game in airmash, vs. 95 for subspace).</p> <blockquote> <p>This is totally invalid because you didn't talk on voice chat.</p> </blockquote> <p>At the ranks I played, not talking on voice was the norm. It would be nice to have talking or not talking on voice chat be an independent variable, but that would require playing even more games to get data for another set of conditions, and if I wasn't going to do that, choosing the condition that's most common doesn't make the entire experiment invalid, IMO.</p> <p>Some people report that, post &quot;endorsements&quot; patch, talking on voice chat is much more common. I tested this out by playing 20 (non-comp) games just after the &quot;Paris&quot; patch. Three had comments on voice chat. One was someone playing random music clips, one had someone screaming at someone else for playing incorrectly, and one had useful callouts on voice chat. It's possible I'd see something different with more games or in comp, but I don't think it's obvious that voice chat is common for most people after the &quot;endorsements&quot; patch.</p> <h3 id="appendix-code-and-data">Appendix: code and data</h3> <p>If you want to play with this data and model yourself, experiment with different priors, run a posterior predictive check, etc., here's a snippet of R code that embeds the data:</p> <pre><code>library(brms)
library(modelr)
library(tidybayes)
library(tidyverse)

d &lt;- tribble(
  ~game_type, ~gender, ~xplain, ~games,
  &quot;comp&quot;, &quot;female&quot;, 7, 35,
  &quot;comp&quot;, &quot;male&quot;, 1, 23,
  &quot;qp&quot;, &quot;female&quot;, 6, 149,
  &quot;qp&quot;, &quot;male&quot;, 2, 132
)

d &lt;- d %&gt;% mutate(female = ifelse(gender == &quot;female&quot;, 1, 0),
                  comp = ifelse(game_type == &quot;comp&quot;, 1, 0))

result &lt;- brm(data = d, family = binomial,
              xplain | trials(games) ~ female + comp,
              prior = c(set_prior(&quot;normal(0,10)&quot;, class = &quot;b&quot;)),
              iter = 25000, warmup = 500, cores = 4, chains = 4)
</code></pre> <p>The model here is simple enough that I wouldn't expect the version of software used to significantly affect results, but in case you're curious, this was done with <code>brms 2.7.0</code>, <code>rstan 2.18.2</code>, on <code>R 3.5.1</code>.</p>
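<p>If you fit the model and want a quick look at the results, the standard <code>brms</code> accessors should be enough; this is just a sketch of the posterior predictive check mentioned above, not part of the original analysis:</p> <pre><code>summary(result)   # posterior estimates for the female and comp coefficients
pp_check(result)  # graphical posterior predictive check of the fitted model
</code></pre>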
<p><i>Thanks to Leah Hanson, Sean Talts and Sean's math/stats reading group, Annie Cherkaev, Robert Schuessler, Wesley Aptekar-Cassels, Julia Evans, Paul Gowder, Jonathan Dahan, Bradley Boccuzzi, Akiva Leffert, and one or more anonymous commenters for comments/corrections/discussion.</i></p> <p><link rel="prefetch" href="//danluu.com"> <link rel="prefetch" href="//danluu.com/about/"> <link rel="prefetch" href="//danluu.com/input-lag/"></p> Fsyncgate: errors on fsync are unrecoverable fsyncgate/ Wed, 28 Mar 2018 00:00:00 +0000 fsyncgate/ <p><small><em>This is an archive of the original &quot;fsyncgate&quot; email thread. This is posted here because I wanted to have a link that would fit on a slide for <a href="//danluu.com/deconstruct-files/">a talk on file safety</a> with <a href="//danluu.com/web-bloat/">a mobile-friendly non-bloated format</a>.</em></small></p> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Subject:Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS Date:2018-03-28 02:23:46 </code></pre> <p>Hi all</p> <p>Some time ago I ran into an issue where a user encountered data corruption after a storage error. PostgreSQL played a part in that corruption by allowing checkpointing to continue past what should've been a fatal error.</p> <p>TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means &quot;all writes since the last fsync have hit disk&quot; but we assume it means &quot;all writes since the last SUCCESSFUL fsync have hit disk&quot;.</p> <p>Pg wrote some blocks, which went to OS dirty buffers for writeback. Writeback failed due to an underlying storage error. The block I/O layer and XFS marked the writeback page as failed (AS_EIO), but had no way to tell the app about the failure. When Pg called fsync() on the FD during the next checkpoint, fsync() returned EIO because of the flagged page, to tell Pg that a previous async write failed. Pg treated the checkpoint as failed and didn't advance the redo start position in the control file.</p> <p>All good so far.</p> <p>But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() <em>cleared the AS_EIO bad page flag</em>.</p> <p>The write never made it to disk, but we completed the checkpoint, and merrily carried on our way. Whoops, data loss.</p> <p>The clear-error-and-continue behaviour of fsync is not documented as far as I can tell. Nor is fsync() returning EIO unless you have a very new linux man-pages with the patch I wrote to add it. But from what I can see in the POSIX standard we are not given any guarantees about what happens on fsync() failure at all, so we're probably wrong to assume that retrying fsync() is safe.</p> <p>If the server had been using ext3 or ext4 with errors=remount-ro, the problem wouldn't have occurred because the first I/O error would've remounted the FS and stopped Pg from continuing. But XFS doesn't have that option. There may be other situations where this can occur too, involving LVM and/or multipath, but I haven't comprehensively dug out the details yet.</p> <p>It proved possible to recover the system by faking up a backup label from before the first incorrectly-successful checkpoint, forcing redo to repeat and write the lost blocks. But ... what a mess.</p> <p>I posted about the underlying fsync issue here some time ago:</p> <p><a href="https://stackoverflow.com/q/42434872/398670">https://stackoverflow.com/q/42434872/398670</a></p> <p>but haven't had a chance to follow up about the Pg specifics.</p> <p>I've been looking at the problem on and off and haven't come up with a good answer. I think we should just PANIC and let redo sort it out by repeating the failed write when it repeats work since the last checkpoint.</p> <p>The API offered by async buffered writes and fsync offers us no way to find out which page failed, so we can't just selectively redo that write. I think we do know the relfilenode associated with the fd that failed to fsync, but not much more. 
So the alternative seems to be some sort of potentially complex online-redo scheme where we replay WAL only the relation on which we had the fsync() error, while otherwise servicing queries normally. That's likely to be extremely error-prone and hard to test, and it's trying to solve a case where on other filesystems the whole DB would grind to a halt anyway.</p> <p>I looked into whether we can solve it with use of the AIO API instead, but the mess is even worse there - from what I can tell you can't even reliably guarantee fsync at all on all Linux kernel versions.</p> <p>We already PANIC on fsync() failure for WAL segments. We just need to do the same for data forks at least for EIO. This isn't as bad as it seems because AFAICS fsync only returns EIO in cases where we should be stopping the world anyway, and many FSes will do that for us.</p> <p>There are rather a lot of pg_fsync() callers. While we could handle this case-by-case for each one, I'm tempted to just make pg_fsync() itself intercept EIO and PANIC. Thoughts?</p> <hr> <pre><code>From:Tom Lane &lt;tgl(at)sss(dot)pgh(dot)pa(dot)us&gt; Date:2018-03-28 03:53:08 </code></pre> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> <blockquote> <p>Retrying fsync() is not OK at least on Linux. When fsync() returns success it means &quot;all writes since the last fsync have hit disk&quot; but we assume it means &quot;all writes since the last SUCCESSFUL fsync have hit disk&quot;.</p> </blockquote> <p>If that's actually the case, we need to push back on this kernel brain damage, because as you're describing it fsync would be completely useless.</p> <p>Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued.</p> <hr> <pre><code>From:Michael Paquier &lt;michael(at)paquier(dot)xyz&gt; Date:2018-03-29 02:30:59 </code></pre> <p>On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-03-29 02:48:27 </code></pre> <p>On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.</p> </blockquote> <p>Craig, is the phenomenon you described the same as the second issue &quot;Reporting writeback errors&quot; discussed in this article?</p> <p><a href="https://lwn.net/Articles/724307/">https://lwn.net/Articles/724307/</a></p> <p>&quot;Current kernels might report a writeback error on an fsync() call, but there are a number of ways in which that can fail to happen.&quot;</p> <p>That's... 
I'm speechless.</p> <hr> <pre><code>From:Justin Pryzby &lt;pryzby(at)telsasoft(dot)com&gt; Date:2018-03-29 05:00:31 </code></pre> <p>On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote:</p> <blockquote> <p>On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.</p> </blockquote> <p>The retries are the source of the problem ; the first fsync() can return EIO, and also <em>clears the error</em> causing a 2nd fsync (of the same data) to return success.</p> <p>(Note, I can see that it might be useful to PANIC on EIO but retry for ENOSPC).</p> <p>On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote:</p> <blockquote> <p>Craig, is the phenomenon you described the same as the second issue &quot;Reporting writeback errors&quot; discussed in this article? <a href="https://lwn.net/Articles/724307/">https://lwn.net/Articles/724307/</a></p> </blockquote> <p>Worse, the article acknowledges the behavior without apparently suggesting to change it:</p> <p>&quot;Storing that value in the file structure has an important benefit: it makes it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT CALLS FSYNC() .... In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION.&quot;</p> <p>I believe I reproduced the problem behavior using dmsetup &quot;error&quot; target, see attached.</p> <p>strace looks like this:</p> <p>kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux</p> <pre><code>1open(&quot;/dev/mapper/eio&quot;, O_RDWR|O_CREAT, 0600) = 3 2write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 8192 3write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 8192 4write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 8192 5write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 8192 6write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 8192 7write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 8192 8write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = 2560 9write(3, &quot;\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0&quot;..., 8192) = -1 ENOSPC (No space left on device) 10dup(2) = 4 11fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE) 12brk(NULL) = 0x1299000 13brk(0x12ba000) = 0x12ba000 14fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0 15write(4, &quot;write(1): No space left on devic&quot;..., 34write(1): No space left on device 16) = 34 17close(4) = 0 18fsync(3) = -1 EIO (Input/output error) 19dup(2) = 4 20fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE) 21fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0 22write(4, &quot;fsync(1): Input/output error\n&quot;, 29fsync(1): Input/output error 23) = 29 24close(4) = 0 25close(3) = 0 26open(&quot;/dev/mapper/eio&quot;, O_RDWR|O_CREAT, 0600) = 3 27fsync(3) = 0 
28write(3, &quot;\0&quot;, 1) = 1 29fsync(3) = 0 30exit_group(0) = ? </code></pre> <p>2: EIO isn't seen initially due to writeback page cache;</p> <p>9: ENOSPC due to small device</p> <p>18: original IO error reported by fsync, good</p> <p>25: the original FD is closed</p> <p>26: ..and file reopened</p> <p>27: fsync on file with still-dirty data+EIO returns success BAD</p> <p>10, 19: I'm not sure why there's dup(2), I guess glibc thinks that perror should write to a separate FD (?)</p> <p>Also note, close() ALSO returned success..which you might think exonerates the 2nd fsync(), but I think may itself be problematic, no? In any case, the 2nd byte certainly never got written to DM error, and the failure status was lost following fsync().</p> <p>I get the exact same behavior if I break after one write() loop, such as to avoid ENOSPC.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-03-29 05:06:22 </code></pre> <p>On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:</p> <blockquote> <p>The retries are the source of the problem ; the first fsync() can return EIO, and also <em>clears the error</em> causing a 2nd fsync (of the same data) to return success.</p> </blockquote> <p>What I'm failing to grok here is how that error flag even matters, whether it's a single bit or a counter as described in that patch. If write back failed, <em>the page is still dirty</em>. So all future calls to fsync() need to try to try to flush it again, and (presumably) fail again (unless it happens to succeed this time around).</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-03-29 05:25:51 </code></pre> <p>On 29 March 2018 at 13:06, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:</p> <blockquote> <p>The retries are the source of the problem ; the first fsync() can return EIO, and also <em>clears the error</em> causing a 2nd fsync (of the same data) to return success.</p> </blockquote> <p>What I'm failing to grok here is how that error flag even matters, whether it's a single bit or a counter as described in that patch. If write back failed, <em>the page is still dirty</em>. So all future calls to fsync() need to try to try to flush it again, and (presumably) fail again (unless it happens to succeed this time around). <a href="http://www.enterprisedb.com">http://www.enterprisedb.com</a></p> </blockquote> <p>You'd think so. But it doesn't appear to work that way. You can see yourself with the error device-mapper destination mapped over part of a volume.</p> <p>I wrote a test case here.</p> <p><a href="https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c">https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c</a></p> <p>I don't pretend the kernel behaviour is sane. And it's possible I've made an error in my analysis. But since I've observed this in the wild, and seen it in a test case, I strongly suspect that's what I've described is just what's happening, brain-dead or no.</p> <p>Presumably the kernel marks the page clean when it dispatches it to the I/O subsystem and doesn't dirty it again on I/O error? I haven't dug that deep on the kernel side. 
See the stackoverflow post for details on what I found in kernel code analysis.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-03-29 05:32:43 </code></pre> <p>On 29 March 2018 at 10:48, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.</p> </blockquote> <p>Craig, is the phenomenon you described the same as the second issue &quot;Reporting writeback errors&quot; discussed in this article?</p> <p><a href="https://lwn.net/Articles/724307/">https://lwn.net/Articles/724307/</a></p> </blockquote> <p>A variant of it, by the looks.</p> <p>The problem in our case is that the kernel only tells us about the error once. It then forgets about it. So yes, that seems like a variant of the statement:</p> <blockquote> <p>&quot;Current kernels might report a writeback error on an fsync() call, but there are a number of ways in which that can fail to happen.&quot;</p> <p>That's... I'm speechless.</p> </blockquote> <p>Yeah.</p> <p>It's a bit nuts.</p> <p>I was astonished when I saw the behaviour, and that it appears undocumented.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-03-29 05:35:47 </code></pre> <p>On 29 March 2018 at 10:30, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.</p> </blockquote> <p>I covered this in my original post.</p> <p>Yes, we check the return value. But what do we do about it? For fsyncs of heap files, we ERROR, aborting the checkpoint. We'll retry the checkpoint later, which will retry the fsync(). <strong>Which will now appear to succeed</strong> because the kernel forgot that it lost our writes after telling us the first time. So we do check the error code, which returns success, and we complete the checkpoint and move on.</p> <p>But we only retried the fsync, not the writes before the fsync.</p> <p>So we lost data. 
Or rather, failed to detect that the kernel did so, so our checkpoint was bad and could not be completed.</p> <p>The problem is that we keep retrying checkpoints <em>without</em> repeating the writes leading up to the checkpoint, and retrying fsync.</p> <p>I don't pretend the kernel behaviour is sane, but we'd better deal with it anyway.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-03-29 05:58:45 </code></pre> <p>On 28 March 2018 at 11:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.</p> <p>It's not necessary on ext3/ext4 with errors=remount-ro, but that's only because the FS stops us dead in our tracks.</p> <p>I don't pretend it's sane. The kernel behaviour is IMO crazy. If it's going to lose a write, it should at minimum mark the FD as broken so no further fsync() or anything else can succeed on the FD, and an app that cares about durability must repeat the whole set of work since the prior succesful fsync(). Just reporting it once and forgetting it is madness.</p> <p>But even if we convince the kernel folks of that, how do other platforms behave? And how long before these kernels are out of use? We'd better deal with it, crazy or no.</p> <p>Please see my StackOverflow post for the kernel-level explanation. Note also the test case link there. <a href="https://stackoverflow.com/a/42436054/398670">https://stackoverflow.com/a/42436054/398670</a></p> <blockquote> <blockquote> <p>Retrying fsync() is not OK at least on Linux. When fsync() returns success it means &quot;all writes since the last fsync have hit disk&quot; but we assume it means &quot;all writes since the last SUCCESSFUL fsync have hit disk&quot;.</p> </blockquote> <p>If that's actually the case, we need to push back on this kernel brain damage, because as you're describing it fsync would be completely useless.</p> </blockquote> <p>It's not useless, it's just telling us something other than what we think it means. The promise it seems to give us is that if it reports an error once, everything <em>after</em> that is useless, so we should throw our toys, close and reopen everything, and redo from the last known-good state.</p> <p>Though as Tomas posted below, it provides rather weaker guarantees than I thought in some other areas too. See that lwn.net article he linked.</p> <blockquote> <p>Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued.</p> </blockquote> <p>I can't find anything that says so to me. Please quote relevant spec.</p> <p>I'm working from <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html">http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html</a> which states that</p> <p>&quot;The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. 
The fsync() function shall not return until the system has completed that action or until an error is detected.&quot;</p> <p>My reading is that POSIX does not specify what happens AFTER an error is detected. It doesn't say that error has to be persistent and that subsequent calls must also report the error. It also says:</p> <p>&quot;If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed.&quot;</p> <p>but that doesn't clarify matters much either, because it can be read to mean that once there's been an error reported for some IO operations there's no guarantee those operations are ever completed even after a subsequent fsync returns success.</p> <p>I'm not seeking to defend what the kernel seems to be doing. Rather, saying that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-03-29 12:07:56 </code></pre> <p>On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 28 March 2018 at 11:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.</p> </blockquote> <p>I found your discussion with kernel hacker Jeff Layton at <a href="https://lwn.net/Articles/718734/">https://lwn.net/Articles/718734/</a> in which he said: &quot;The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior.&quot;</p> <p>The article above that says the same thing a couple of different ways, ie that writeback failure leaves you with pages that are neither written to disk successfully nor marked dirty.</p> <p>If I'm reading various articles correctly, the situation was even worse before his errseq_t stuff landed. That fixed cases of completely unreported writeback failures due to sharing of PG_error for both writeback and read errors with certain filesystems, but it doesn't address the clean pages problem.</p> <p>Yeah, I see why you want to PANIC.</p> <blockquote> <blockquote> <p>Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued.</p> </blockquote> <p>I can't find anything that says so to me. Please quote relevant spec.</p> <p>I'm working from <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html">http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html</a> which states that</p> <p>&quot;The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected.&quot;</p> <p>My reading is that POSIX does not specify what happens AFTER an error is detected. 
It doesn't say that error has to be persistent and that subsequent calls must also report the error. It also says:</p> </blockquote> <p>FWIW my reading is the same as Tom's. It says &quot;all data for the open file descriptor&quot; without qualification or special treatment after errors. Not &quot;some&quot;.</p> <blockquote> <p>I'm not seeking to defend what the kernel seems to be doing. Rather, saying that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.</p> </blockquote> <p>I see no reason to think that any other operating system would behave that way without strong evidence... This is openly acknowledged to be &quot;a mess&quot; and &quot;a surprise&quot; in the Filesystem Summit article. I am not really qualified to comment, but from a cursory glance at FreeBSD's vfs_bio.c I think it's doing what you'd hope for... see the code near the comment &quot;Failed write, redirty.&quot;</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-03-29 13:15:10 </code></pre> <p>On 29 March 2018 at 20:07, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 28 March 2018 at 11:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return.</p> </blockquote> <p>Surely you jest.</p> </blockquote> <p>No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.</p> </blockquote> <p>I found your discussion with kernel hacker Jeff Layton at <a href="https://lwn.net/Articles/718734/">https://lwn.net/Articles/718734/</a> in which he said: &quot;The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior.&quot;</p> <p>The article above that says the same thing a couple of different ways, ie that writeback failure leaves you with pages that are neither written to disk successfully nor marked dirty.</p> <p>If I'm reading various articles correctly, the situation was even worse before his errseq_t stuff landed. That fixed cases of completely unreported writeback failures due to sharing of PG_error for both writeback and read errors with certain filesystems, but it doesn't address the clean pages problem.</p> <p>Yeah, I see why you want to PANIC.</p> </blockquote> <p>In more ways than one ;)</p> <blockquote> <p>I'm not seeking to defend what the kernel seems to be doing. Rather, saying</p> <blockquote> <p>that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.</p> </blockquote> <p>I see no reason to think that any other operating system would behave that way without strong evidence... This is openly acknowledged to be &quot;a mess&quot; and &quot;a surprise&quot; in the Filesystem Summit article. I am not really qualified to comment, but from a cursory glance at FreeBSD's vfs_bio.c I think it's doing what you'd hope for... 
see the code near the comment &quot;Failed write, redirty.&quot;</p> </blockquote> <p>Ok, that's reassuring, but doesn't help us on the platform the great majority of users deploy on :(</p> <p>&quot;If on Linux, PANIC&quot;</p> <p>Hrm.</p> <hr> <pre><code>From:Catalin Iacob &lt;iacobcatalin(at)gmail(dot)com&gt; Date:2018-03-29 16:20:00 </code></pre> <p>On Thu, Mar 29, 2018 at 2:07 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>I found your discussion with kernel hacker Jeff Layton at <a href="https://lwn.net/Articles/718734/">https://lwn.net/Articles/718734/</a> in which he said: &quot;The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior.&quot;</p> </blockquote> <p>And a bit below in the same comments, to this question about PG: &quot;So, what are the options at this point? The assumption was that we can repeat the fsync (which as you point out is not the case), or shut down the database and perform recovery from WAL&quot;, the same Jeff Layton seems to agree PANIC is the appropriate response: &quot;Replaying the WAL synchronously sounds like the simplest approach when you get an error on fsync. These are uncommon occurrences for the most part, so having to fall back to slow, synchronous error recovery modes when this occurs is probably what you want to do.&quot;. And right after, he confirms the errseq_t patches are about always detecting this, not more: &quot;The main thing I working on is to better guarantee is that you actually get an error when this occurs rather than silently corrupting your data. The circumstances where that can occur require some corner-cases, but I think we need to make sure that it doesn't occur.&quot;</p> <p>Jeff's comments in the pull request that merged errseq_t are worth reading as well: <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750">https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750</a></p> <blockquote> <p>The article above that says the same thing a couple of different ways, ie that writeback failure leaves you with pages that are neither written to disk successfully nor marked dirty.</p> <p>If I'm reading various articles correctly, the situation was even worse before his errseq_t stuff landed. That fixed cases of completely unreported writeback failures due to sharing of PG_error for both writeback and read errors with certain filesystems, but it doesn't address the clean pages problem.</p> </blockquote> <p>Indeed, that's exactly how I read it as well (opinion formed independently before reading your sentence above). The errseq_t patches landed in v4.13 by the way, so very recently.</p> <blockquote> <p>Yeah, I see why you want to PANIC.</p> </blockquote> <p>Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. 
This is messy.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-03-29 21:18:14 </code></pre> <p>On Fri, Mar 30, 2018 at 5:20 AM, Catalin Iacob <iacobcatalin(at)gmail(dot)com> wrote:</p> <blockquote> <p>Jeff's comments in the pull request that merged errseq_t are worth reading as well: <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750">https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750</a></p> </blockquote> <p>Wow. It looks like there may be a separate question of when each filesystem adopted this new infrastructure?</p> <blockquote> <blockquote> <p>Yeah, I see why you want to PANIC.</p> </blockquote> <p>Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.</p> </blockquote> <p>The pre-errseq_t problems are beyond our control. There's nothing we can do about that in userspace (except perhaps abandon OS-buffered IO, a big project). We just need to be aware that this problem exists in certain kernel versions and be grateful to Layton for fixing it.</p> <p>The dropped dirty flag problem is something we can and in my view should do something about, whatever we might think about that design choice. As Andrew Gierth pointed out to me in an off-list chat about this, by the time you've reached this state, both PostgreSQL's buffer and the kernel's buffer are clean and might be reused for another block at any time, so your data might be gone from the known universe -- we don't even have the option to rewrite our buffers in general. Recovery is the only option.</p> <p>Thank you to Craig for chasing this down and +1 for his proposal, on Linux only.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-03-31 13:24:28 </code></pre> <p>On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:</p> <blockquote> <blockquote> <blockquote> <p>Yeah, I see why you want to PANIC.</p> </blockquote> <p>Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.</p> </blockquote> </blockquote> <p>There may still be a way to reliably detect this on older kernel versions from userspace, but it will be messy whatsoever. On EIO errors, the kernel will not restore the dirty page flags, but it will flip the error flags on the failed pages. One could mmap() the file in question, obtain the PFNs (via /proc/pid/pagemap) and enumerate those to match the ones with the error flag switched on (via /proc/kpageflags). This could serve at least as a detection mechanism, but one could also further use this info to logically map the pages that failed IO back to the original file offsets, and potentially retry IO just for those file ranges that cover the failed pages. Just an idea, not tested.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-03-31 16:13:09 </code></pre> <p>On 31 March 2018 at 21:24, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:</p> <blockquote> <blockquote> <blockquote> <p>Yeah, I see why you want to PANIC.</p> </blockquote> <p>Indeed. 
Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.</p> </blockquote> </blockquote> <p>There may still be a way to reliably detect this on older kernel versions from userspace, but it will be messy whatsoever. On EIO errors, the kernel will not restore the dirty page flags, but it will flip the error flags on the failed pages. One could mmap() the file in question, obtain the PFNs (via /proc/pid/pagemap) and enumerate those to match the ones with the error flag switched on (via /proc/kpageflags). This could serve at least as a detection mechanism, but one could also further use this info to logically map the pages that failed IO back to the original file offsets, and potentially retry IO just for those file ranges that cover the failed pages. Just an idea, not tested.</p> </blockquote> <p>That sounds like a huge amount of complexity, with uncertainty as to how it'll behave kernel-to-kernel, for negligble benefit.</p> <p>I was exploring the idea of doing selective recovery of one relfilenode, based on the assumption that we know the filenode related to the fd that failed to fsync(). We could redo only WAL on that relation. But it fails the same test: it's too complex for a niche case that shouldn't happen in the first place, so it'll probably have bugs, or grow bugs in bitrot over time.</p> <p>Remember, if you're on ext4 with errors=remount-ro, you get shut down even harder than a PANIC. So we should just use the big hammer here.</p> <p>I'll send a patch this week.</p> <hr> <pre><code>From:Tom Lane &lt;tgl(at)sss(dot)pgh(dot)pa(dot)us&gt; Date:2018-03-31 16:38:12 </code></pre> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>So we should just use the big hammer here.</p> </blockquote> <p>And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.</p> <hr> <pre><code>From:Michael Paquier &lt;michael(at)paquier(dot)xyz&gt; Date:2018-04-01 00:20:38 </code></pre> <p>On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>So we should just use the big hammer here.</p> </blockquote> <p>And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.</p> </blockquote> <p>That won't fix anything released already, so as per the information gathered something has to be done anyway. The discussion of this thread is spreading quite a lot actually.</p> <p>Handling things at a low-level looks like a better plan for the backend. Tools like pg_basebackup and pg_dump also issue fsync's on the data created, we should do an equivalent for them, with some exit() calls in file_utils.c. As of now failures are logged to stderr but not considered fatal.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-01 00:58:22 </code></pre> <p>On Sun, Apr 01, 2018 at 12:13:09AM +0800, Craig Ringer wrote:</p> <blockquote> <p>On 31 March 2018 at 21:24, Anthony Iliopoulos &lt;[1]ailiop(at)altatus(dot)com&gt; wrote:</p> <pre><code> On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote: &gt; &gt;&gt; Yeah, I see why you want to PANIC. &gt; &gt; &gt; &gt; Indeed. 
Even doing that leaves question marks about all the kernel &gt; &gt; versions before v4.13, which at this point is pretty much everything &gt; &gt; out there, not even detecting this reliably. This is messy. </code></pre> <blockquote> <pre><code> There may still be a way to reliably detect this on older kernel versions from userspace, but it will be messy whatsoever. On EIO errors, the kernel will not restore the dirty page flags, but it will flip the error flags on the failed pages. One could mmap() the file in question, obtain the PFNs (via /proc/pid/pagemap) and enumerate those to match the ones with the error flag switched on (via /proc/kpageflags). This could serve at least as a detection mechanism, but one could also further use this info to logically map the pages that failed IO back to the original file offsets, and potentially retry IO just for those file ranges that cover the failed pages. Just an idea, not tested. </code></pre> </blockquote> <p>That sounds like a huge amount of complexity, with uncertainty as to how it'll behave kernel-to-kernel, for negligble benefit.</p> </blockquote> <p>Those interfaces have been around since the kernel 2.6 times and are rather stable, but I was merely responding to your original post comment regarding having a way of finding out which page(s) failed. I assume that indeed there would be no benefit, especially since those errors are usually not transient (typically they come from hard medium faults), and although a filesystem could theoretically mask the error by allocating a different logical block, I am not aware of any implementation that currently does that.</p> <blockquote> <p>I was exploring the idea of doing selective recovery of one relfilenode, based on the assumption that we know the filenode related to the fd that failed to fsync(). We could redo only WAL on that relation. But it fails the same test: it's too complex for a niche case that shouldn't happen in the first place, so it'll probably have bugs, or grow bugs in bitrot over time.</p> </blockquote> <p>Fully agree, those cases should be sufficiently rare that a complex and possibly non-maintainable solution is not really warranted.</p> <blockquote> <p>Remember, if you're on ext4 with errors=remount-ro, you get shut down even harder than a PANIC. So we should just use the big hammer here.</p> </blockquote> <p>I am not entirely sure what you mean here, does Pg really treat write() errors as fatal? Also, the kind of errors that ext4 detects with this option is at the superblock level and govern metadata rather than actual data writes (recall that those are buffered anyway, no actual device IO has to take place at the time of write()).</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-01 01:14:46 </code></pre> <p>On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>So we should just use the big hammer here.</p> </blockquote> <p>And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.</p> </blockquote> <p>It is not likely to be fixed (beyond what has been done already with the manpage patches and errseq_t fixes on the reporting level). 
The issue is, the kernel needs to deal with hard IO errors at that level somehow, and since those errors typically persist, re-dirtying the pages would not really solve the problem (unless some filesystem remaps the request to a different block, assuming the device is alive). Keeping around dirty pages that cannot possibly be written out is essentially a memory leak, as those pages would stay around even after the application has exited.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-01 18:24:51 </code></pre> <p>On Fri, Mar 30, 2018 at 10:18 AM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>... on Linux only.</p> </blockquote> <p>Apparently I was too optimistic. I had looked only at FreeBSD, which keeps the page around and dirties it so we can retry, but the other BSDs apparently don't (FreeBSD changed that in 1999). From what I can tell from the sources below, we have:</p> <pre><code>Linux, OpenBSD, NetBSD: retrying fsync() after EIO lies FreeBSD, Illumos: retrying fsync() after EIO tells the truth </code></pre> <p>Maybe my drive-by assessment of those kernel routines is wrong and someone will correct me, but I'm starting to think you might be better to assume the worst on all systems. Perhaps a GUC that defaults to panicking, so that users on those rare OSes could turn that off? Even then I'm not sure if the failure mode will be that great anyway or if it's worth having two behaviours. Thoughts?</p> <p><a href="http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html">http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html</a> <a href="https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059">https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059</a> <a href="https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867">https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867</a> <a href="https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2631">https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2631</a> <a href="https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266">https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266</a> <a href="https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441">https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441</a></p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-02 15:03:42 </code></pre> <p>On 2 April 2018 at 02:24, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>Maybe my drive-by assessment of those kernel routines is wrong and someone will correct me, but I'm starting to think you might be better to assume the worst on all systems. Perhaps a GUC that defaults to panicking, so that users on those rare OSes could turn that off? Even then I'm not sure if the failure mode will be that great anyway or if it's worth having two behaviours. Thoughts?</p> </blockquote> <p>I see little benefit to not just PANICing unconditionally on EIO, really. It shouldn't happen, and if it does, we want to be pretty conservative and adopt a data-protective approach.</p> <p>I'm rather more worried by doing it on ENOSPC. Which looks like it might be necessary from what I recall finding in my test case + kernel code reading. 
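<p>For concreteness, the &quot;big hammer&quot; approach under discussion amounts to roughly the following sketch (illustrative only, not PostgreSQL source; checked_fsync() is a made-up helper):</p> <pre><code>/*
 * Sketch of the "big hammer": never retry a failed fsync() against the same
 * descriptor; treat the failure as fatal so the server restarts and replays
 * WAL, the only copy of the data still known to be good.
 */
#include &lt;errno.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

static void checked_fsync(int fd, const char *path)
{
    if (fsync(fd) == 0)
        return;

    /*
     * Retrying here is exactly the trap discussed in this thread: on Linux
     * the second call will likely report success even though the dirty
     * pages were already dropped.  Crash instead and rely on WAL replay.
     * Whether a possibly-transient ENOSPC should be handled more gently
     * than EIO is the open question.
     */
    fprintf(stderr, "PANIC: fsync of %s failed: %s\n", path, strerror(errno));
    abort();
}
</code></pre>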
I really don't want to respond to a possibly-transient ENOSPC by PANICing the whole server unnecessarily.</p> <p>BTW, the support team at 2ndQ is presently working on two separate issues where ENOSPC resulted in DB corruption, though neither of them involve logs of lost page writes. I'm planning on taking some time tomorrow to write a torture tester for Pg's ENOSPC handling and to verify ENOSPC handling in the test case I linked to in my original StackOverflow post.</p> <p>If this is just an EIO issue then I see no point doing anything other than PANICing unconditionally.</p> <p>If it's a concern for ENOSPC too, we should try harder to fail more nicely whenever we possibly can.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-02 18:13:46 </code></pre> <p>Hi,</p> <p>On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>So we should just use the big hammer here.</p> </blockquote> <p>And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.</p> </blockquote> <p>It is not likely to be fixed (beyond what has been done already with the manpage patches and errseq_t fixes on the reporting level). The issue is, the kernel needs to deal with hard IO errors at that level somehow, and since those errors typically persist, re-dirtying the pages would not really solve the problem (unless some filesystem remaps the request to a different block, assuming the device is alive).</p> </blockquote> <p>Throwing away the dirty pages <em>and</em> persisting the error seems a lot more reasonable. Then provide a fcntl (or whatever) extension that can clear the error status in the few cases that the application that wants to gracefully deal with the case.</p> <blockquote> <p>Keeping around dirty pages that cannot possibly be written out is essentially a memory leak, as those pages would stay around even after the application has exited.</p> </blockquote> <p>Why do dirty pages need to be kept around in the case of persistent errors? I don't think the lack of automatic recovery in that case is what anybody is complaining about. It's that the error goes away and there's no reasonable way to separate out such an error from some potential transient errors.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-02 18:53:20 </code></pre> <p>On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:</p> <blockquote> <p>Craig Ringer <craig(at)2ndquadrant(dot)com> writes:</p> <blockquote> <p>So we should just use the big hammer here.</p> </blockquote> <p>And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.</p> </blockquote> <p>It is not likely to be fixed (beyond what has been done already with the manpage patches and errseq_t fixes on the reporting level). 
The issue is, the kernel needs to deal with hard IO errors at that level somehow, and since those errors typically persist, re-dirtying the pages would not really solve the problem (unless some filesystem remaps the request to a different block, assuming the device is alive).</p> </blockquote> <p>Throwing away the dirty pages <em>and</em> persisting the error seems a lot more reasonable. Then provide a fcntl (or whatever) extension that can clear the error status in the few cases that the application that wants to gracefully deal with the case.</p> </blockquote> <p>Given precisely that the dirty pages which cannot been written-out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).</p> <p>The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-02 19:32:45 </code></pre> <p>On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:</p> <blockquote> <p>Throwing away the dirty pages <em>and</em> persisting the error seems a lot more reasonable. Then provide a fcntl (or whatever) extension that can clear the error status in the few cases that the application that wants to gracefully deal with the case.</p> </blockquote> <p>Given precisely that the dirty pages which cannot been written-out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).</p> </blockquote> <p>Meh^2.</p> <p>&quot;no reason&quot; - except that there's absolutely no way to know what state the data is in. And that your application needs explicit handling of such failures. And that one FD might be used in a lots of different parts of the application, that fsyncs in one part of the application might be an ok failure, and in another not. Requiring explicit actions to acknowledge &quot;we've thrown away your data for unknown reason&quot; seems entirely reasonable.</p> <blockquote> <p>The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). 
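<p>Seen from userspace, the semantics being described look roughly like this minimal sketch (it assumes the underlying device starts failing writes, for example via a dm-error target or a pulled disk; details vary by kernel version):</p> <pre><code>#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd = open("datafile", O_WRONLY | O_CREAT, 0600);
    write(fd, "important", 9);

    /* ... the device fails the writeback of that page ... */

    if (fsync(fd) != 0)
        fprintf(stderr, "first fsync: %s\n", strerror(errno));  /* EIO */

    if (fsync(fd) == 0)
        fprintf(stderr, "second fsync: reports success\n");
    /*
     * On post-4.13 Linux the second call can succeed because the first one
     * consumed the error and the failed pages were marked clean: "important"
     * may never reach the disk even though the last fsync() returned 0.
     */
    return 0;
}
</code></pre>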
Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that.</p> </blockquote> <p>Which isn't what I've suggested.</p> <blockquote> <p>Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.</p> </blockquote> <p>Meh.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-02 20:38:06 </code></pre> <p>On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:</p> <blockquote> <p>On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:</p> <blockquote> <p>Throwing away the dirty pages <em>and</em> persisting the error seems a lot more reasonable. Then provide a fcntl (or whatever) extension that can clear the error status in the few cases that the application that wants to gracefully deal with the case.</p> </blockquote> <p>Given precisely that the dirty pages which cannot been written-out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).</p> </blockquote> <p>Meh^2.</p> <p>&quot;no reason&quot; - except that there's absolutely no way to know what state the data is in. And that your application needs explicit handling of such failures. And that one FD might be used in a lots of different parts of the application, that fsyncs in one part of the application might be an ok failure, and in another not. Requiring explicit actions to acknowledge &quot;we've thrown away your data for unknown reason&quot; seems entirely reasonable.</p> </blockquote> <p>As long as fsync() indicates error on first invocation, the application is fully aware that between this point of time and the last call to fsync() data has been lost. Persisting this error any further does not change this or add any new info - on the contrary it adds confusion as subsequent write()s and fsync()s on other pages can succeed, but will be reported as failures.</p> <p>The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.</p> <p>Callers that are not affected by the potential outcome of fsync() and do not react on errors, have no reason for calling it in the first place (and thus masking failure from subsequent callers that may indeed care).</p> <hr> <pre><code>From:Stephen Frost &lt;sfrost(at)snowman(dot)net&gt; Date:2018-04-02 20:58:08 </code></pre> <p>Greetings,</p> <p>Anthony Iliopoulos (ailiop(at)altatus(dot)com) wrote:</p> <blockquote> <p>On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:</p> <blockquote> <p>On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:</p> <blockquote> <p>Throwing away the dirty pages <em>and</em> persisting the error seems a lot more reasonable. 
Then provide a fcntl (or whatever) extension that can clear the error status in the few cases that the application that wants to gracefully deal with the case.</p> </blockquote> <p>Given precisely that the dirty pages which cannot been written-out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).</p> </blockquote> <p>Meh^2.</p> <p>&quot;no reason&quot; - except that there's absolutely no way to know what state the data is in. And that your application needs explicit handling of such failures. And that one FD might be used in a lots of different parts of the application, that fsyncs in one part of the application might be an ok failure, and in another not. Requiring explicit actions to acknowledge &quot;we've thrown away your data for unknown reason&quot; seems entirely reasonable.</p> </blockquote> <p>As long as fsync() indicates error on first invocation, the application is fully aware that between this point of time and the last call to fsync() data has been lost. Persisting this error any further does not change this or add any new info - on the contrary it adds confusion as subsequent write()s and fsync()s on other pages can succeed, but will be reported as failures.</p> </blockquote> <p>fsync() doesn't reflect the status of given pages, however, it reflects the status of the file descriptor, and as such the file, on which it's called. This notion that fsync() is actually only responsible for the changes which were made to a file since the last fsync() call is pure foolishness. If we were able to pass a list of pages or data ranges to fsync() for it to verify they're on disk then perhaps things would be different, but we can't, all we can do is ask to &quot;please flush all the dirty pages associated with this file descriptor, which represents this file we opened, to disk, and let us know if you were successful.&quot;</p> <p>Give us a way to ask &quot;are these specific pages written out to persistant storage?&quot; and we would certainly be happy to use it, and to repeatedly try to flush out pages which weren't synced to disk due to some transient error, and to track those cases and make sure that we don't incorrectly assume that they've been transferred to persistent storage.</p> <blockquote> <p>The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.</p> </blockquote> <p>We do deal with that error- by realizing that it failed and later <em>retrying</em> the fsync(), which is when we get back an &quot;all good! everything with this file descriptor you've opened is sync'd!&quot; and happily expect that to be truth, when, in reality, it's an unfortunate lie and there are still pages associated with that file descriptor which are, in reality, dirty and not sync'd to disk.</p> <p>Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. 
Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?</p> <blockquote> <p>Callers that are not affected by the potential outcome of fsync() and do not react on errors, have no reason for calling it in the first place (and thus masking failure from subsequent callers that may indeed care).</p> </blockquote> <p>Reacting on an error from an fsync() call could, based on how it's documented and actually implemented in other OS's, mean &quot;run another fsync() to see if the error has resolved itself.&quot; Requiring that to mean &quot;you have to go dirty all of the pages you previously dirtied to actually get a subsequent fsync() to do anything&quot; is really just not reasonable- a given program may have no idea what was written to previously nor any particular reason to need to know, on the expectation that the fsync() call will flush any dirty pages, as it's documented to do.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-02 23:05:44 </code></pre> <p>Hi Stephen,</p> <p>On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:</p> <blockquote> <p>fsync() doesn't reflect the status of given pages, however, it reflects the status of the file descriptor, and as such the file, on which it's called. This notion that fsync() is actually only responsible for the changes which were made to a file since the last fsync() call is pure foolishness. If we were able to pass a list of pages or data ranges to fsync() for it to verify they're on disk then perhaps things would be different, but we can't, all we can do is ask to &quot;please flush all the dirty pages associated with this file descriptor, which represents this file we opened, to disk, and let us know if you were successful.&quot;</p> <p>Give us a way to ask &quot;are these specific pages written out to persistant storage?&quot; and we would certainly be happy to use it, and to repeatedly try to flush out pages which weren't synced to disk due to some transient error, and to track those cases and make sure that we don't incorrectly assume that they've been transferred to persistent storage.</p> </blockquote> <p>Indeed fsync() is simply a rather blunt instrument and a narrow legacy interface but further changing its established semantics (no matter how unreasonable they may be) is probably not the way to go.</p> <p>Would using sync_file_range() be helpful? Potential errors would only apply to pages that cover the requested file ranges. There are a few caveats though:</p> <p>(a) it still messes with the top-level error reporting so mixing it with callers that use fsync() and do care about errors will produce the same issue (clearing the error status).</p> <p>(b) the error-reporting granularity is coarse (failure reporting applies to the entire requested range so you still don't know which particular pages/file sub-ranges failed writeback)</p> <p>(c) the same &quot;report and forget&quot; semantics apply to repeated invocations of the sync_file_range() call, so again action will need to be taken upon first error encountered for the particular ranges.</p> <blockquote> <blockquote> <p>The application will need to deal with that first error irrespective of subsequent return codes from fsync(). 
Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.</p> </blockquote> <p>We do deal with that error- by realizing that it failed and later <em>retrying</em> the fsync(), which is when we get back an &quot;all good! everything with this file descriptor you've opened is sync'd!&quot; and happily expect that to be truth, when, in reality, it's an unfortunate lie and there are still pages associated with that file descriptor which are, in reality, dirty and not sync'd to disk.</p> </blockquote> <p>It really turns out that this is not how the fsync() semantics work though, exactly because the nature of the errors: even if the kernel retained the dirty bits on the failed pages, retrying persisting them on the same disk location would simply fail. Instead the kernel opts for marking those pages clean (since there is no other recovery strategy), and reporting once to the caller who can potentially deal with it in some manner. It is sadly a bad and undocumented convention.</p> <blockquote> <p>Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?</p> </blockquote> <p>I think what you have in mind are the semantics of sync() rather than fsync(), but as long as an application needs to ensure data are persisted to storage, it needs to retain those data in its heap until fsync() is successful instead of discarding them and relying on the kernel after write(). The pattern should be roughly like: write() -&gt; fsync() -&gt; free(), rather than write() -&gt; free() -&gt; fsync(). For example, if a partition gets full upon fsync(), then the application has a chance to persist the data in a different location, while the kernel cannot possibly make this decision and recover.</p> <blockquote> <blockquote> <p>Callers that are not affected by the potential outcome of fsync() and do not react on errors, have no reason for calling it in the first place (and thus masking failure from subsequent callers that may indeed care).</p> </blockquote> <p>Reacting on an error from an fsync() call could, based on how it's documented and actually implemented in other OS's, mean &quot;run another fsync() to see if the error has resolved itself.&quot; Requiring that to mean &quot;you have to go dirty all of the pages you previously dirtied to actually get a subsequent fsync() to do anything&quot; is really just not reasonable- a given program may have no idea what was written to previously nor any particular reason to need to know, on the expectation that the fsync() call will flush any dirty pages, as it's documented to do.</p> </blockquote> <p>I think we are conflating a few issues here: having the OS kernel being responsible for error recovery (so that subsequent fsync() would fix the problems) is one. This clearly is a design which most kernels have not really adopted for reasons outlined above (although having the FS layer recovering from hard errors transparently is open for discussion from what it seems [1]). Now, there is the issue of granularity of error reporting: userspace could benefit from a fine-grained indication of failed pages (or file ranges). 
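<p>A minimal sketch of that write() -&gt; fsync() -&gt; free() pattern, with made-up helper names write_all() and durable_write(), could look like the following; the only point is that the buffer is not discarded until fsync() has succeeded:</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;unistd.h&gt;

static int write_all(int fd, const char *buf, size_t len)
{
    while (len &gt; 0) {
        ssize_t n = write(fd, buf, len);
        if (n &lt; 0)
            return -1;
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/* Returns 0 and frees buf only once the data is known to be on disk. */
int durable_write(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd &lt; 0)
        return -1;

    if (write_all(fd, buf, len) != 0 || fsync(fd) != 0) {
        close(fd);
        return -1;      /* caller still owns buf and can retry elsewhere */
    }

    close(fd);
    free(buf);          /* safe to discard only after a successful fsync() */
    return 0;
}
</code></pre>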
Another issue is that of reporting semantics (report and clear), which is also a design choice made to avoid having higher-resolution error tracking and the corresponding memory overheads [1].</p> <p>[1] <a href="https://lwn.net/Articles/718734/">https://lwn.net/Articles/718734/</a></p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-02 23:23:24 </code></pre> <p>On 2018-04-03 01:05:44 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>Would using sync_file_range() be helpful? Potential errors would only apply to pages that cover the requested file ranges. There are a few caveats though:</p> </blockquote> <p>To quote sync_file_range(2):</p> <pre><code> Warning This system call is extremely dangerous and should not be used in portable programs. None of these operations writes out the file's metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there are no guarantees that the data will be available after a crash. There is no user interface to know if a write is purely an over‐ write. On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible. When writing into preallocated space, many filesystems also require calls into the block allocator, which this system call does not sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrity on systems with volatile disk write caches. </code></pre> <p>Given the lack of metadata safety that seems entirely a no go. We use sfr(2), but only to force the kernel's hand around writing back earlier without throwing away cache contents.</p> <blockquote> <blockquote> <blockquote> <p>The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.</p> </blockquote> <p>We do deal with that error- by realizing that it failed and later <em>retrying</em> the fsync(), which is when we get back an &quot;all good! everything with this file descriptor you've opened is sync'd!&quot; and happily expect that to be truth, when, in reality, it's an unfortunate lie and there are still pages associated with that file descriptor which are, in reality, dirty and not sync'd to disk.</p> </blockquote> <p>It really turns out that this is not how the fsync() semantics work though</p> </blockquote> <p>Except on freebsd and solaris, and perhaps others.</p> <blockquote> <p>, exactly because the nature of the errors: even if the kernel retained the dirty bits on the failed pages, retrying persisting them on the same disk location would simply fail.</p> </blockquote> <p>That's not guaranteed at all, think NFS.</p> <blockquote> <p>Instead the kernel opts for marking those pages clean (since there is no other recovery strategy), and reporting once to the caller who can potentially deal with it in some manner. It is sadly a bad and undocumented convention.</p> </blockquote> <p>It's broken behaviour justified post facto with the only rational that was available, which explains why it's so unconvincing. You could just say &quot;this ship has sailed, and it's to onerous to change because xxx&quot; and this'd be a done deal. 
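<p>For reference, the narrow way sfr(2) gets used as described above is roughly this sketch (flush_hint_then_sync() is a made-up name; Linux-specific, see sync_file_range(2)):</p> <pre><code>/*
 * Sketch: use sync_file_range() only to start writeback early so the later
 * fsync() has less work to do; durability of data and metadata still comes
 * from the fsync().
 */
#define _GNU_SOURCE
#include &lt;fcntl.h&gt;
#include &lt;unistd.h&gt;

int flush_hint_then_sync(int fd, off_t offset, off_t nbytes)
{
    /* Scheduling hint only; the return value is deliberately ignored. */
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);

    /* ... more writes happen elsewhere ... */

    return fsync(fd);
}
</code></pre>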
But claiming this is reasonable behaviour is ridiculous.</p> <p>Again, you could just continue to error for this fd and still throw away the data.</p> <blockquote> <blockquote> <p>Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?</p> </blockquote> <p>I think what you have in mind are the semantics of sync() rather than fsync()</p> </blockquote> <p>If you open the same file with two fds, and write with one, and fsync with another that's definitely supposed to work. And sync() isn't a realistic replacement in any sort of way because it's obviously systemwide, and thus entirely and completely unsuitable. Nor does it have any sort of better error reporting behaviour, does it?</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-02 23:27:35 </code></pre> <p>On 3 April 2018 at 07:05, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>Hi Stephen,</p> <p>On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:</p> <blockquote> <p>fsync() doesn't reflect the status of given pages, however, it reflects the status of the file descriptor, and as such the file, on which it's called. This notion that fsync() is actually only responsible for the changes which were made to a file since the last fsync() call is pure foolishness. If we were able to pass a list of pages or data ranges to fsync() for it to verify they're on disk then perhaps things would be different, but we can't, all we can do is ask to &quot;please flush all the dirty pages associated with this file descriptor, which represents this file we opened, to disk, and let us know if you were successful.&quot;</p> <p>Give us a way to ask &quot;are these specific pages written out to persistant storage?&quot; and we would certainly be happy to use it, and to repeatedly try to flush out pages which weren't synced to disk due to some transient error, and to track those cases and make sure that we don't incorrectly assume that they've been transferred to persistent storage.</p> </blockquote> <p>Indeed fsync() is simply a rather blunt instrument and a narrow legacy interface but further changing its established semantics (no matter how unreasonable they may be) is probably not the way to go.</p> </blockquote> <p>They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.</p> <p>So I don't buy this argument.</p> <blockquote> <p>It really turns out that this is not how the fsync() semantics work though, exactly because the nature of the errors: even if the kernel retained the dirty bits on the failed pages, retrying persisting them on the same disk location would simply fail.</p> </blockquote> <p><em>might</em> simply fail.</p> <p>It depends on why the error ocurred.</p> <p>I originally identified this behaviour on a multipath system. Multipath defaults to &quot;throw the writes away, nobody really cares anyway&quot; on error. It seems to figure a higher level will retry, or the application will receive the error and retry.</p> <p>(See no_path_retry in multipath config. 
AFAICS the default is insanely dangerous and only suitable for specialist apps that understand the quirks; you should use no_path_retry=queue).</p> <blockquote> <p>Instead the kernel opts for marking those pages clean (since there is no other recovery strategy),</p> <p>and reporting once to the caller who can potentially deal with it in some manner. It is sadly a bad and undocumented convention.</p> </blockquote> <p>It could mark the FD.</p> <p>It's not just undocumented, it's a slightly creative interpretation of the POSIX spec for fsync.</p> <blockquote> <blockquote> <p>Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?</p> </blockquote> <p>I think what you have in mind are the semantics of sync() rather than fsync(), but as long as an application needs to ensure data are persisted to storage, it needs to retain those data in its heap until fsync() is successful instead of discarding them and relying on the kernel after write().</p> </blockquote> <p>This is almost exactly what we tell application authors using PostgreSQL: the data isn't written until you receive a successful commit confirmation, so you'd better not forget it.</p> <p>We provide applications with <em>clear boundaries</em> so they can know <em>exactly</em> what was, and was not, written. I guess the argument from the kernel is the same is true: whatever was written since the last <em>successful</em> fsync is potentially lost and must be redone.</p> <p>But the fsync behaviour is utterly undocumented and dubiously standard.</p> <blockquote> <p>I think we are conflating a few issues here: having the OS kernel being responsible for error recovery (so that subsequent fsync() would fix the problems) is one. This clearly is a design which most kernels have not really adopted for reasons outlined above</p> </blockquote> <p>[citation needed]</p> <p>What do other major platforms do here? The post above suggests it's a bit of a mix of behaviours.</p> <blockquote> <p>Now, there is the issue of granularity of error reporting: userspace could benefit from a fine-grained indication of failed pages (or file ranges).</p> </blockquote> <p>Yep. I looked at AIO in the hopes that, if we used AIO, we'd be able to map a sync failure back to an individual AIO write.</p> <p>But it seems AIO just adds more problems and fixes none. Flush behaviour with AIO from what I can tell is inconsistent version to version and generally unhelpful. The kernel should really report such sync failures back to the app on its AIO write mapping, but it seems nothing of the sort happens.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-03 00:03:39 </code></pre> <blockquote> <p>On Apr 2, 2018, at 16:27, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.</p> </blockquote> <p>Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare &quot;What fools we've been!&quot; and push a change... 
and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-03 00:05:09 </code></pre> <p>On April 2, 2018 5:03:39 PM PDT, Christophe Pettus <xof(at)thebuild(dot)com> wrote:</p> <blockquote> <blockquote> <p>On Apr 2, 2018, at 16:27, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.</p> </blockquote> <p>Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare &quot;What fools we've been!&quot; and push a change... and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is.</p> </blockquote> <p>Don't we pretty much already have agreement in that? And Craig is the main proponent of it?</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-03 00:07:41 </code></pre> <blockquote> <p>On Apr 2, 2018, at 17:05, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <p>Don't we pretty much already have agreement in that? And Craig is the main proponent of it?</p> </blockquote> <p>For sure on the second sentence; the first was not clear to me.</p> <hr> <pre><code>From:Peter Geoghegan &lt;pg(at)bowt(dot)ie&gt; Date:2018-04-03 00:48:00 </code></pre> <p>On Mon, Apr 2, 2018 at 5:05 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <blockquote> <p>Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare &quot;What fools we've been!&quot; and push a change... and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is.</p> </blockquote> <p>Don't we pretty much already have agreement in that? And Craig is the main proponent of it?</p> </blockquote> <p>I wonder how bad it will be in practice if we PANIC. Craig said &quot;This isn't as bad as it seems because AFAICS fsync only returns EIO in cases where we should be stopping the world anyway, and many FSes will do that for us&quot;. It would be nice to get more information on that.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-03 01:29:28 </code></pre> <p>On Tue, Apr 3, 2018 at 3:03 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>I see little benefit to not just PANICing unconditionally on EIO, really. It shouldn't happen, and if it does, we want to be pretty conservative and adopt a data-protective approach.</p> <p>I'm rather more worried by doing it on ENOSPC. Which looks like it might be necessary from what I recall finding in my test case + kernel code reading. I really don't want to respond to a possibly-transient ENOSPC by PANICing the whole server unnecessarily.</p> </blockquote> <p>Yeah, it'd be nice to give an administrator the chance to free up some disk space after ENOSPC is reported, and stay up. 
Running out of space really shouldn't take down the database without warning! The question is whether the data remains in cache and marked dirty, so that retrying is a safe option (since it's potentially gone from our own buffers, so if the OS doesn't have it the only place your committed data can definitely still be found is the WAL... recovery time). Who can tell us? Do we need a per-filesystem answer? Delayed allocation is a somewhat filesystem-specific thing, so maybe. Interestingly, there don't seem to be many operating systems that can report ENOSPC from fsync(), based on a quick scan through some documentation:</p> <pre><code>POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no
Illumos/Solaris, Linux, macOS: yes </code></pre> <p>I don't know if macOS really means it or not; it just tells you to see the errors for read(2) and write(2). By the way, speaking of macOS, I was curious to see if the common BSD heritage would show here. Yeah, somewhat. It doesn't appear to keep buffers on writeback error, if this is the right code [1] (though it could be handling it somewhere else for all I know).</p> <p>[1] <a href="https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695">https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695</a></p> <hr> <pre><code>From:Robert Haas &lt;robertmhaas(at)gmail(dot)com&gt; Date:2018-04-03 02:54:26 </code></pre> <p>On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>Given precisely that the dirty pages which cannot been written-out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).</p> </blockquote> <p>Like other people here, I think this is 100% unreasonable, starting with &quot;the dirty pages which cannot been written out are practically thrown away&quot;. Who decided that was OK, and on the basis of what wording in what specification? I think it's always unreasonable to throw away the user's data. If the writes are going to fail, then let them keep on failing every time. <em>That</em> wouldn't cause any data loss, because we'd never be able to checkpoint, and eventually the user would have to kill the server uncleanly, and that would trigger recovery.</p> <p>Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.</p> <p>Even leaving that aside, a PANIC means a prolonged outage on a production system - it could easily take tens of minutes or longer to run recovery. So saying &quot;oh, just do that&quot; is not really an answer. Sure, we can do it, but it's like trying to lose weight by intentionally eating a tapeworm. Now, it's possible to shorten the checkpoint_timeout so that recovery runs faster, but then performance drops because data has to be fsync()'d more often instead of getting buffered in the OS cache for the maximum possible time.
We could also dodge this issue in another way: suppose that when we write a page out, we don't consider it really written until fsync() succeeds. Then we wouldn't need to PANIC if an fsync() fails; we could just re-write the page. Unfortunately, this would also be terrible for performance, for pretty much the same reasons: letting the OS cache absorb lots of dirty blocks and do write-combining is <em>necessary</em> for good performance.</p> <blockquote> <p>The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.</p> </blockquote> <p>I might accept this argument if I accepted that it was OK to decide that an fsync() failure means you can forget that the write() ever happened in the first place, but it's hard to imagine an application that wants that behavior. If the application didn't care about whether the bytes really got to disk or not, it would not have called fsync() in the first place. If it does care, reporting the error only once is never an improvement.</p> <hr> <pre><code>From:Peter Geoghegan &lt;pg(at)bowt(dot)ie&gt; Date:2018-04-03 03:45:30 </code></pre> <p>On Mon, Apr 2, 2018 at 7:54 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:</p> <blockquote> <p>Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.</p> </blockquote> <p>I fear that the conventional wisdom from the Kernel people is now &quot;you should be using O_DIRECT for granular control&quot;. The LWN article Thomas linked (<a href="https://lwn.net/Articles/718734">https://lwn.net/Articles/718734</a>) cites Ted Ts'o:</p> <p>&quot;Monakhov asked why a counter was needed; Layton said it was to handle multiple overlapping writebacks. Effectively, the counter would record whether a writeback had failed since the file was opened or since the last fsync(). Ts'o said that should be fine; applications that want more information should use O_DIRECT. 
For most applications, knowledge that an error occurred somewhere in the file is all that is necessary; applications that require better granularity already use O_DIRECT.&quot;</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-03 10:35:39 </code></pre> <p>Hi Robert,</p> <p>On Mon, Apr 02, 2018 at 10:54:26PM -0400, Robert Haas wrote:</p> <blockquote> <p>On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>Given precisely that the dirty pages which cannot been written-out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).</p> </blockquote> <p>Like other people here, I think this is 100% unreasonable, starting with &quot;the dirty pages which cannot been written out are practically thrown away&quot;. Who decided that was OK, and on the basis of what wording in what specification? I think it's always unreasonable to</p> </blockquote> <p>If you insist on strict conformance to POSIX, indeed the linux glibc configuration and associated manpage are probably wrong in stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation matches that of the flexibility allowed by not supporting SIO. There's a long history of brokenness between linux and posix, and I think there was never an intention of conforming to the standard.</p> <blockquote> <p>throw away the user's data. If the writes are going to fail, then let them keep on failing every time. <em>That</em> wouldn't cause any data loss, because we'd never be able to checkpoint, and eventually the user would have to kill the server uncleanly, and that would trigger recovery.</p> </blockquote> <p>I believe (as tried to explain earlier) there is a certain assumption being made that the writer and original owner of data is responsible for dealing with potential errors in order to avoid data loss (which should be only of interest to the original writer anyway). It would be very questionable for the interface to persist the error while subsequent writes and fsyncs to different offsets may as well go through. Another process may need to write into the file and fsync, while being unaware of those newly introduced semantics is now faced with EIO because some unrelated previous process failed some earlier writes and did not bother to clear the error for those writes. In a similar scenario where the second process is aware of the new semantics, it would naturally go ahead and clear the global error in order to proceed with its own write()+fsync(), which would essentially amount to the same problematic semantics you have now.</p> <blockquote> <p>Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. 
This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.</p> </blockquote> <p>Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that <em>happen to be open at the time of error</em>. But I think one would have a hard time defending a modification to the kernel where this is further extended to cover cases where:</p> <p>process A does write() on some file offset which fails writeback, fsync() gets EIO and exit()s.</p> <p>process B does write() on some other offset which succeeds writeback, but fsync() gets EIO due to (uncleared) failures of earlier process.</p> <p>This would be a highly user-visible change of semantics from edge- triggered to level-triggered behavior.</p> <blockquote> <p>dodge this issue in another way: suppose that when we write a page out, we don't consider it really written until fsync() succeeds. Then</p> </blockquote> <p>That's the only way to think about fsync() guarantees unless you are on a kernel that keeps retrying to persist dirty pages. Assuming such a model, after repeated and unrecoverable hard failures the process would have to explicitly inform the kernel to drop the dirty pages. All the process could do at that point is read back to userspace the dirty/failed pages and attempt to rewrite them at a different place (which is current possible too). Most applications would not bother though to inform the kernel and drop the permanently failed pages; and thus someone eventually would hit the case that a large amount of failed writeback pages are running his server out of memory, at which point people will complain that those semantics are completely unreasonable.</p> <blockquote> <p>we wouldn't need to PANIC if an fsync() fails; we could just re-write the page. Unfortunately, this would also be terrible for performance, for pretty much the same reasons: letting the OS cache absorb lots of dirty blocks and do write-combining is <em>necessary</em> for good performance.</p> </blockquote> <p>Not sure I understand this case. The application may indeed re-write a bunch of pages that have failed and proceed with fsync(). The kernel will deal with combining the writeback of all the re-written pages. But further the necessity of combining for performance really depends on the exact storage medium. At the point you start caring about write-combining, the kernel community will naturally redirect you to use DIRECT_IO.</p> <blockquote> <blockquote> <p>The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.</p> </blockquote> <p>I might accept this argument if I accepted that it was OK to decide that an fsync() failure means you can forget that the write() ever happened in the first place, but it's hard to imagine an application that wants that behavior. If the application didn't care about whether the bytes really got to disk or not, it would not have called fsync() in the first place. If it does care, reporting the error only once is never an improvement.</p> </blockquote> <p>Again, conflating two separate issues, that of buffering and retrying failed pages and that of error reporting. 
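<p>A minimal sketch of the errseq_t reporting behaviour described above (assuming a 4.13+ kernel and some way to make writeback fail, such as a dm-error target): every descriptor that was open when the error happened sees EIO once, and only once.</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd1 = open("datafile", O_WRONLY | O_CREAT, 0600);
    int fd2 = open("datafile", O_WRONLY);

    write(fd1, "x", 1);
    /* ... writeback of that page fails in the background ... */

    printf("fsync(fd1) = %d\n", fsync(fd1));  /* -1, errno = EIO               */
    printf("fsync(fd2) = %d\n", fsync(fd2));  /* -1, EIO reported here as well */
    printf("fsync(fd1) = %d\n", fsync(fd1));  /* 0: this fd already saw it     */
    return 0;
}
</code></pre>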
Yes it would be convenient for applications not to have to care at all about recovery of failed write-backs, but at some point they would have to face this issue one way or another (I am assuming we are always talking about hard failures, other kinds of failures are probably already being dealt with transparently at the kernel level).</p> <p>As for the reporting, it is also unreasonable to effectively signal and persist an error on a file-wide granularity while it pertains to subsets of that file and other writes can go through, but I am repeating myself.</p> <p>I suppose that if the check-and-clear semantics are problematic for Pg, one could suggest a kernel patch that opts-in to a level-triggered reporting of fsync() on a per-descriptor basis, which seems to be non-intrusive and probably sufficient to cover your expected use-case.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-03 11:26:05 </code></pre> <p>On 3 April 2018 at 11:35, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>Hi Robert,</p> <p>Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that <em>happen to be open at the time of error</em>. But I think one would have a hard time defending a modification to the kernel where this is further extended to cover cases where:</p> <p>process A does write() on some file offset which fails writeback, fsync() gets EIO and exit()s.</p> <p>process B does write() on some other offset which succeeds writeback, but fsync() gets EIO due to (uncleared) failures of earlier process.</p> </blockquote> <p>Surely that's exactly what process B would want? If it calls fsync and gets a success and later finds out that the file is corrupt and didn't match what was in memory it's not going to be happy.</p> <p>This seems like an attempt to co-opt fsync for a new and different purpose for which it's poorly designed. It's not an async error reporting mechanism for writes. It would be useless as that as any process could come along and open your file and eat the errors for writes you performed. An async error reporting mechanism would have to document which writes it was giving errors for and give you ways to control that.</p> <p>The semantics described here are useless for everyone. For a program needing to know the error status of the writes it executed, it doesn't know which writes are included in which fsync call. For a program using fsync for its original intended purpose of guaranteeing that the all writes are synced to disk it no longer has any guarantee at all.</p> <blockquote> <p>This would be a highly user-visible change of semantics from edge- triggered to level-triggered behavior.</p> </blockquote> <p>It was always documented as level-triggered. This edge-triggered concept is a completely surprise to application writers.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-03 13:36:47 </code></pre> <p>On Tue, Apr 03, 2018 at 12:26:05PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 3 April 2018 at 11:35, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>Hi Robert,</p> <p>Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that <em>happen to be open at the time of error</em>. 
But I think one would have a hard time defending a modification to the kernel where this is further extended to cover cases where:</p> <p>process A does write() on some file offset which fails writeback, fsync() gets EIO and exit()s.</p> <p>process B does write() on some other offset which succeeds writeback, but fsync() gets EIO due to (uncleared) failures of earlier process.</p> </blockquote> <p>Surely that's exactly what process B would want? If it calls fsync and gets a success and later finds out that the file is corrupt and didn't match what was in memory it's not going to be happy.</p> </blockquote> <p>You can't possibly make this assumption. Process B may be reading and writing to completely disjoint regions from those of process A, and as such not really caring about earlier failures, only wanting to ensure its own writes go all the way through. But even if it did care, the file interfaces make no transactional guarantees. Even without fsync() there is nothing preventing process B from reading dirty pages from process A, and based on their content proceed to to its own business and write/persist new data, while process A further modifies the not-yet-flushed pages in-memory before flushing. In this case you'd need explicit synchronization/locking between the processes anyway, so why would fsync() be an exception?</p> <blockquote> <p>This seems like an attempt to co-opt fsync for a new and different purpose for which it's poorly designed. It's not an async error reporting mechanism for writes. It would be useless as that as any process could come along and open your file and eat the errors for writes you performed. An async error reporting mechanism would have to document which writes it was giving errors for and give you ways to control that.</p> </blockquote> <p>The errseq_t fixes deal with that; errors will be reported to any process that has an open fd, irrespective to who is the actual caller of the fsync() that may have induced errors. This is anyway required as the kernel may evict dirty pages on its own by doing writeback and as such there needs to be a way to report errors on all open fds.</p> <blockquote> <p>The semantics described here are useless for everyone. For a program needing to know the error status of the writes it executed, it doesn't know which writes are included in which fsync call. For a program</p> </blockquote> <p>If EIO persists between invocations until explicitly cleared, a process cannot possibly make any decision as to if it should clear the error and proceed or some other process will need to leverage that without coordination, or which writes actually failed for that matter. 
We would be back to the case of requiring explicit synchronization between processes that care about this, in which case the processes may as well synchronize over calling fsync() in the first place.</p> <p>Having an opt-in persisting EIO per-fd would practically be a form of &quot;contract&quot; between &quot;cooperating&quot; processes anyway.</p> <p>But instead of deconstructing and debating the semantics of the current mechanism, why not come up with the ideal desired form of error reporting/tracking granularity etc., and see how this may be fitted into kernels as a new interface.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-03 14:29:10 </code></pre> <p>On 3 April 2018 at 10:54, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:</p> <blockquote> <p>I think it's always unreasonable to throw away the user's data.</p> </blockquote> <p>Well, we do that. If a txn aborts, all writes in the txn are discarded.</p> <p>I think that's perfectly reasonable. Though we also promise an all or nothing effect, we make exceptions even there.</p> <p>The FS doesn't offer transactional semantics, but the fsync behaviour can be interpreted kind of similarly.</p> <p>I don't <em>agree</em> with it, but I don't think it's as wholly unreasonable as all that. I think leaving it undocumented is absolutely gobsmacking, and it's dubious at best, but it's not totally insane.</p> <blockquote> <p>If the writes are going to fail, then let them keep on failing every time.</p> </blockquote> <p>Like we do, where we require an explicit rollback.</p> <p>But POSIX may pose issues there, it doesn't really define any interface for that AFAIK. Unless you expect the app to close() and re-open() the file. Replacing one nonstandard issue with another may not be a win.</p> <blockquote> <p><em>That</em> wouldn't cause any data loss, because we'd never be able to checkpoint, and eventually the user would have to kill the server uncleanly, and that would trigger recovery.</p> </blockquote> <p>Yep. That's what I expected to happen on unrecoverable I/O errors. Because, y'know, unrecoverable.</p> <p>I was stunned to learn it's not so. And I'm even more amazed to learn that ext4's errors=remount-ro apparently doesn't concern its self with mere user data, and may exhibit the same behaviour - I need to rerun my test case on it tomorrow.</p> <blockquote> <p>Also, this really does make it impossible to write reliable programs.</p> </blockquote> <p>In the presence of multiple apps interacting on the same file, yes. I think that's a little bit of a stretch though.</p> <p>For a single app, you can recover by remembering and redoing all the writes you did.</p> <p>Sucks if your app wants to have multiple processes working together on a file without some kind of journal or WAL, relying on fsync() alone, mind you. But at least we have WAL.</p> <p>Hrm. I wonder how this interacts with wal_level=minimal.</p> <blockquote> <p>Even leaving that aside, a PANIC means a prolonged outage on a prolonged system - it could easily take tens of minutes or longer to run recovery. So saying &quot;oh, just do that&quot; is not really an answer. Sure, we can do it, but it's like trying to lose weight by intentionally eating a tapeworm. Now, it's possible to shorten the checkpoint_timeout so that recovery runs faster, but then performance drops because data has to be fsync()'d more often instead of getting buffered in the OS cache for the maximum possible time.</p> </blockquote> <p>It's also spikier. 
Users have more issues with latency with short, frequent checkpoints.</p> <blockquote> <p>We could also dodge this issue in another way: suppose that when we write a page out, we don't consider it really written until fsync() succeeds. Then we wouldn't need to PANIC if an fsync() fails; we could just re-write the page. Unfortunately, this would also be terrible for performance, for pretty much the same reasons: letting the OS cache absorb lots of dirty blocks and do write-combining is <em>necessary</em> for good performance.</p> </blockquote> <p>Our double-caching is already plenty bad enough anyway, as well.</p> <p>(Ideally I want to be able to swap buffers between shared_buffers and the OS buffer-cache. Almost like a 2nd level of buffer pinning. When we write out a block, we <em>transfer</em> ownership to the OS. Yeah, I'm dreaming. But we'd sure need to be able to trust the OS not to just forget the block then!)</p> <blockquote> <blockquote> <p>The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.</p> </blockquote> <p>I might accept this argument if I accepted that it was OK to decide that an fsync() failure means you can forget that the write() ever happened in the first place, but it's hard to imagine an application that wants that behavior. If the application didn't care about whether the bytes really got to disk or not, it would not have called fsync() in the first place. If it does care, reporting the error only once is never an improvement.</p> </blockquote> <p>Many RDBMSes do just that. It's hardly behaviour unique to the kernel. They report an ERROR on a statement in a txn then go on with life, merrily forgetting that anything was ever wrong.</p> <p>I agree with PostgreSQL's stance that this is wrong. We require an explicit rollback (or ROLLBACK TO SAVEPOINT) to restore the session to a usable state. This is good.</p> <p>But we're the odd one out there. Almost everyone else does much like what fsync() does on Linux, report the error and forget it.</p> <p>In any case, we're not going to get anyone to backpatch a fix for this into all kernels, so we're stuck working around it.</p> <p>I'll do some testing with ENOSPC tomorrow, propose a patch, report back.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-03 14:37:30 </code></pre> <p>On 3 April 2018 at 14:36, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>If EIO persists between invocations until explicitly cleared, a process cannot possibly make any decision as to if it should clear the error</p> </blockquote> <p>I still don't understand what &quot;clear the error&quot; means here. The writes still haven't been written out. We don't care about tracking errors, we just care whether all the writes to the file have been flushed to disk. By &quot;clear the error&quot; you mean throw away the dirty pages and revert part of the file to some old data? 
Why would anyone ever want that?</p> <blockquote> <p>But instead of deconstructing and debating the semantics of the current mechanism, why not come up with the ideal desired form of error reporting/tracking granularity etc., and see how this may be fitted into kernels as a new interface.</p> </blockquote> <p>Because Postgres is portable software that won't be able to use some Linux-specific interface. And doesn't really need any granular error reporting system anyways. It just needs to know when all writes have been synced to disk.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-03 16:52:07 </code></pre> <p>On Tue, Apr 03, 2018 at 03:37:30PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 3 April 2018 at 14:36, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>If EIO persists between invocations until explicitly cleared, a process cannot possibly make any decision as to if it should clear the error</p> </blockquote> <p>I still don't understand what &quot;clear the error&quot; means here. The writes still haven't been written out. We don't care about tracking errors, we just care whether all the writes to the file have been flushed to disk. By &quot;clear the error&quot; you mean throw away the dirty pages and revert part of the file to some old data? Why would anyone ever want that?</p> </blockquote> <p>It means that the responsibility of recovering the data is passed back to the application. The writes may never be able to be written out. How would a kernel deal with that? Either discard the data (and have the writer acknowledge) or buffer the data until reboot and simply risk going OOM. It's not what someone would want, but rather <em>need</em> to deal with, one way or the other. At least on the application-level there's a fighting chance for restoring to a consistent state. The kernel does not have that opportunity.</p> <blockquote> <blockquote> <p>But instead of deconstructing and debating the semantics of the current mechanism, why not come up with the ideal desired form of error reporting/tracking granularity etc., and see how this may be fitted into kernels as a new interface.</p> </blockquote> <p>Because Postgres is portable software that won't be able to use some Linux-specific interface. And doesn't really need any granular error</p> </blockquote> <p>I don't really follow this argument, Pg is admittedly using non-portable interfaces (e.g the sync_file_range()). While it's nice to avoid platform specific hacks, expecting that the POSIX semantics will be consistent across systems is simply a 90's pipe dream. While it would be lovely to have really consistent interfaces for application writers, this is simply not going to happen any time soon.</p> <p>And since those problematic semantics of fsync() appear to be prevalent in other systems as well that are not likely to be changed, you cannot rely on preconception that once buffers are handed over to kernel you have a guarantee that they will be eventually persisted no matter what. (Why even bother having fsync() in that case? The kernel would eventually evict and writeback dirty pages anyway. The point of reporting the error back to the application is to give it a chance to recover - the kernel could repeat &quot;fsync()&quot; itself internally if this would solve anything).</p> <blockquote> <p>reporting system anyways. 
It just needs to know when all writes have been synced to disk.</p> </blockquote> <p>Well, it does know when <em>some</em> writes have <em>not</em> been synced to disk, exactly because the responsibility is passed back to the application. I do realize this puts more burden back to the application, but what would a viable alternative be? Would you rather have a kernel that risks periodically going OOM due to this design decision?</p> <hr> <pre><code>From:Robert Haas &lt;robertmhaas(at)gmail(dot)com&gt; Date:2018-04-03 21:47:01 </code></pre> <p>On Tue, Apr 3, 2018 at 6:35 AM, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <blockquote> <p>Like other people here, I think this is 100% unreasonable, starting with &quot;the dirty pages which cannot been written out are practically thrown away&quot;. Who decided that was OK, and on the basis of what wording in what specification? I think it's always unreasonable to</p> </blockquote> <p>If you insist on strict conformance to POSIX, indeed the linux glibc configuration and associated manpage are probably wrong in stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation matches that of the flexibility allowed by not supporting SIO. There's a long history of brokenness between linux and posix, and I think there was never an intention of conforming to the standard.</p> </blockquote> <p>Well, then the man page probably shouldn't say CONFORMING TO 4.3BSD, POSIX.1-2001, which on the first system I tested, it did. Also, the summary should be changed from the current &quot;fsync, fdatasync - synchronize a file's in-core state with storage device&quot; by adding &quot;, possibly by randomly undoing some of the changes you think you made to the file&quot;.</p> <blockquote> <p>I believe (as tried to explain earlier) there is a certain assumption being made that the writer and original owner of data is responsible for dealing with potential errors in order to avoid data loss (which should be only of interest to the original writer anyway). It would be very questionable for the interface to persist the error while subsequent writes and fsyncs to different offsets may as well go through.</p> </blockquote> <p>No, that's not questionable at all. fsync() doesn't take any argument saying which part of the file you care about, so the kernel is entirely not entitled to assume it knows to which writes a given fsync() call was intended to apply.</p> <blockquote> <p>Another process may need to write into the file and fsync, while being unaware of those newly introduced semantics is now faced with EIO because some unrelated previous process failed some earlier writes and did not bother to clear the error for those writes. In a similar scenario where the second process is aware of the new semantics, it would naturally go ahead and clear the global error in order to proceed with its own write()+fsync(), which would essentially amount to the same problematic semantics you have now.</p> </blockquote> <p>I don't deny that it's possible that somebody could have an application which is utterly indifferent to the fact that earlier modifications to a file failed due to I/O errors, but is A-OK with that as long as later modifications can be flushed to disk, but I don't think that's a normal thing to want.</p> <blockquote> <blockquote> <p>Also, this really does make it impossible to write reliable programs. 
Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.</p> </blockquote> <p>Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that <em>happen to be open at the time of error</em>.</p> </blockquote> <p>Well, in PostgreSQL, we have a background process called the checkpointer which is the process that normally does all of the fsync() calls but only a subset of the write() calls. The checkpointer does not, however, necessarily have every file open all the time, so these fixes aren't sufficient to make sure that the checkpointer ever sees an fsync() failure. What you have (or someone has) basically done here is made an undocumented assumption about which file descriptors might care about a particular error, but it just so happens that PostgreSQL has never conformed to that assumption. You can keep on saying the problem is with our assumptions, but it doesn't seem like a very good guess to me to suppose that we're the only program that has ever made them. The documentation for fsync() gives zero indication that it's edge-triggered, and so complaining that people wouldn't like it if it became level-triggered seems like an ex post facto justification for a poorly-chosen behavior: they probably think (as we did prior to a week ago) that it already is.</p> <blockquote> <p>Not sure I understand this case. The application may indeed re-write a bunch of pages that have failed and proceed with fsync(). The kernel will deal with combining the writeback of all the re-written pages. But further the necessity of combining for performance really depends on the exact storage medium. At the point you start caring about write-combining, the kernel community will naturally redirect you to use DIRECT_IO.</p> </blockquote> <p>Well, the way PostgreSQL works today, we typically run with say 8GB of shared_buffers even if the system memory is, say, 200GB. As pages are evicted from our relatively small cache to the operating system, we track which files need to be fsync()'d at checkpoint time, but we don't hold onto the blocks. Until checkpoint time, the operating system is left to decide whether it's better to keep caching the dirty blocks (thus leaving less memory for other things, but possibly allowing write-combining if the blocks are written again) or whether it should clean them to make room for other things. This means that only a small portion of the operating system memory is directly managed by PostgreSQL, while allowing the effective size of our cache to balloon to some very large number if the system isn't under heavy memory pressure.</p> <p>Now, I hear the DIRECT_IO thing and I assume we're eventually going to have to go that way: Linux kernel developers seem to think that &quot;real men use O_DIRECT&quot; and so if other forms of I/O don't provide useful guarantees, well that's our fault for not using O_DIRECT. 
That's a political reason, not a technical reason, but it's a reason all the same.</p> <p>Unfortunately, that is going to add a huge amount of complexity, because if we ran with shared_buffers set to a large percentage of system memory, we couldn't allocate large chunks of memory for sorts and hash tables from the operating system any more. We'd have to allocate it from our own shared_buffers because that's basically all the memory there is and using substantially more might run the system out entirely. So it's a huge, huge architectural change. And even once it's done it is in some ways inferior to what we are doing today -- true, it gives us superior control over writeback timing, but it also makes PostgreSQL play less nicely with other things running on the same machine, because now PostgreSQL has a dedicated chunk of whatever size it has, rather than using some portion of the OS buffer cache that can grow and shrink according to memory needs both of other parts of PostgreSQL and other applications on the system.</p> <blockquote> <p>I suppose that if the check-and-clear semantics are problematic for Pg, one could suggest a kernel patch that opts-in to a level-triggered reporting of fsync() on a per-descriptor basis, which seems to be non-intrusive and probably sufficient to cover your expected use-case.</p> </blockquote> <p>That would certainly be better than nothing.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-03 23:59:27 </code></pre> <p>On Tue, Apr 3, 2018 at 1:29 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>Interestingly, there don't seem to be many operating systems that can report ENOSPC from fsync(), based on a quick scan through some documentation:</p> <p>POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no Illumos/Solaris, Linux, macOS: yes</p> </blockquote> <p>Oops, reading comprehension fail. POSIX yes (since issue 5), via the note that read() and write()'s error conditions can also be returned.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-04 00:56:37 </code></pre> <p>On Tue, Apr 3, 2018 at 05:47:01PM -0400, Robert Haas wrote:</p> <blockquote> <p>Well, in PostgreSQL, we have a background process called the checkpointer which is the process that normally does all of the fsync() calls but only a subset of the write() calls. The checkpointer does not, however, necessarily have every file open all the time, so these fixes aren't sufficient to make sure that the checkpointer ever sees an fsync() failure.</p> </blockquote> <p>There has been a lot of focus in this thread on the workflow:</p> <pre><code>write() -&gt; blocks remain in kernel memory -&gt; fsync() -&gt; panic? </code></pre> <p>But what happens in this workflow:</p> <pre><code>write() -&gt; kernel syncs blocks to storage -&gt; fsync() </code></pre> <p>Is fsync() going to see a &quot;kernel syncs blocks to storage&quot; failure?</p> <p>There was already discussion that if the fsync() causes the &quot;syncs blocks to storage&quot;, fsync() will only report the failure once, but will it see any failure in the second workflow? 
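</p>

<p>(Editorial aside: a minimal sketch of Bruce's second workflow, with a hypothetical path and no error injection. The Linux-specific sync_file_range() call is used here only to force writeback to happen before the fsync(), which is the timing being asked about; whether a failure in that earlier writeback is still visible to the later fsync() is exactly the open question.)</p>

<pre><code>/* Sketch of: write() -&gt; kernel writes blocks to storage -&gt; fsync().
 * Assumes a hypothetical path; nothing here injects a failure. */
#define _GNU_SOURCE
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd = open(&quot;/tmp/example-datafile&quot;, O_WRONLY | O_CREAT, 0644);
    if (fd &lt; 0) { perror(&quot;open&quot;); return 1; }

    write(fd, &quot;x&quot;, 1);                     /* 1. dirty a page in the cache */

    /* 2. push the dirty page to storage without calling fsync() yet,
     *    standing in for the kernel's own background writeback.        */
    if (sync_file_range(fd, 0, 0,
                        SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        perror(&quot;sync_file_range&quot;);

    if (fsync(fd) != 0)                    /* 3. is a failure from step 2 */
        perror(&quot;fsync&quot;);                   /*    reported here?           */

    close(fd);
    return 0;
}
</code></pre>

<p>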
There is indication that a failed write to storage reports back an error once and clears the dirty flag, but do we know it keeps things around long enough to report an error to a future fsync()?</p> <p>You would think it does, but I have to ask since our fsync() assumptions have been wrong for so long.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-04 01:54:50 </code></pre> <p>On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>There has been a lot of focus in this thread on the workflow:</p> <pre><code> write() -&gt; blocks remain in kernel memory -&gt; fsync() -&gt; panic? </code></pre> <p>But what happens in this workflow:</p> <pre><code> write() -&gt; kernel syncs blocks to storage -&gt; fsync() </code></pre> <p>Is fsync() going to see a &quot;kernel syncs blocks to storage&quot; failure?</p> <p>There was already discussion that if the fsync() causes the &quot;syncs blocks to storage&quot;, fsync() will only report the failure once, but will it see any failure in the second workflow? There is indication that a failed write to storage reports back an error once and clears the dirty flag, but do we know it keeps things around long enough to report an error to a future fsync()?</p> <p>You would think it does, but I have to ask since our fsync() assumptions have been wrong for so long.</p> </blockquote> <p>I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. Call that &quot;bug #1&quot;.</p> <p>The second issues is that the pages are marked clean after the error is reported, so further attempts to fsync() the data (in our case for a new attempt to checkpoint) will be futile but appear successful. Call that &quot;bug #2&quot;, with the proviso that some people apparently think it's reasonable behaviour and not a bug. At least there is a plausible workaround for that: namely the nuclear option proposed by Craig.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-04 02:05:19 </code></pre> <p>On Wed, Apr 4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>There has been a lot of focus in this thread on the workflow:</p> <pre><code> write() -&gt; blocks remain in kernel memory -&gt; fsync() -&gt; panic? </code></pre> <p>But what happens in this workflow:</p> <pre><code> write() -&gt; kernel syncs blocks to storage -&gt; fsync() </code></pre> <p>Is fsync() going to see a &quot;kernel syncs blocks to storage&quot; failure?</p> <p>There was already discussion that if the fsync() causes the &quot;syncs blocks to storage&quot;, fsync() will only report the failure once, but will it see any failure in the second workflow? There is indication that a failed write to storage reports back an error once and clears the dirty flag, but do we know it keeps things around long enough to report an error to a future fsync()?</p> <p>You would think it does, but I have to ask since our fsync() assumptions have been wrong for so long.</p> </blockquote> <p>I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. 
Call that &quot;bug #1&quot;.</p> </blockquote> <p>So all our non-cutting-edge Linux systems are vulnerable and there is no workaround Postgres can implement? Wow.</p> <blockquote> <p>The second issues is that the pages are marked clean after the error is reported, so further attempts to fsync() the data (in our case for a new attempt to checkpoint) will be futile but appear successful. Call that &quot;bug #2&quot;, with the proviso that some people apparently think it's reasonable behaviour and not a bug. At least there is a plausible workaround for that: namely the nuclear option proposed by Craig.</p> </blockquote> <p>Yes, that one I understood.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-04 02:14:28 </code></pre> <p>On Tue, Apr 3, 2018 at 10:05:19PM -0400, Bruce Momjian wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:</p> <blockquote> <p>I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. Call that &quot;bug #1&quot;.</p> </blockquote> <p>So all our non-cutting-edge Linux systems are vulnerable and there is no workaround Postgres can implement? Wow.</p> </blockquote> <p>Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file <em>after</em> the failure and try to fsync a write that happened <em>before</em> the failure.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 02:40:16 </code></pre> <p>On 4 April 2018 at 05:47, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:</p> <blockquote> <p>Now, I hear the DIRECT_IO thing and I assume we're eventually going to have to go that way: Linux kernel developers seem to think that &quot;real men use O_DIRECT&quot; and so if other forms of I/O don't provide useful guarantees, well that's our fault for not using O_DIRECT. That's a political reason, not a technical reason, but it's a reason all the same.</p> </blockquote> <p>I looked into buffered AIO a while ago, by the way, and just ... hell no. Run, run as fast as you can.</p> <p>The trouble with direct I/O is that it pushes a <em>lot</em> of work back on PostgreSQL regarding knowledge of the storage subsystem, I/O scheduling, etc. It's absurd to have the kernel do this, unless you want it reliable, in which case you bypass it and drive the hardware directly.</p> <p>We'd need pools of writer threads to deal with all the blocking I/O. It'd be such a nightmare. Hey, why bother having a kernel at all, except for drivers?</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-04 02:44:22 </code></pre> <p>On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Tue, Apr 3, 2018 at 10:05:19PM -0400, Bruce Momjian wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:</p> <blockquote> <p>I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. 
Call that &quot;bug #1&quot;.</p> </blockquote> <p>So all our non-cutting-edge Linux systems are vulnerable and there is no workaround Postgres can implement? Wow.</p> </blockquote> <p>Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file <em>after</em> the failure and try to fsync a write that happened <em>before</em> the failure.</p> </blockquote> <p>I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the <em>file</em> (presumably via any fd), which sounds like the desired behaviour:</p> <p><a href="https://github.com/torvalds/linux/blob/master/mm/filemap.c#L682">https://github.com/torvalds/linux/blob/master/mm/filemap.c#L682</a></p> <blockquote> <p>When userland calls fsync (or something like nfsd does the equivalent), we want to report any writeback errors that occurred since the last fsync (or since the file was opened if there haven't been any).</p> </blockquote> <p>But I'm not sure what the lifetime of the passed-in &quot;file&quot; and more importantly &quot;file-&gt;f_wb_err&quot; is. Specifically, what happens to it if no one has the file open at all, between operations? It is reference counted, see fs/file_table.c. I don't know enough about it to comment.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-04 05:29:28 </code></pre> <p>On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file <em>after</em> the failure and try to fsync a write that happened <em>before</em> the failure.</p> </blockquote> <p>I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the <em>file</em> (presumably via any fd), which sounds like the desired behaviour:</p> <p>[..]</p> </blockquote> <p>Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:</p> <pre><code>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping); </code></pre> <p><a href="https://github.com/torvalds/linux/blob/master/fs/open.c#L752">https://github.com/torvalds/linux/blob/master/fs/open.c#L752</a></p> <p>Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. 
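</p>

<p>(Editorial aside: a minimal sketch of the open/write/close-then-reopen/fsync pattern being described, with a hypothetical path and the actual error injection, e.g. via device-mapper, left out; the comments restate the concern raised here rather than guaranteed behaviour.)</p>

<pre><code>/* One process writes and closes; a separate &quot;checkpointer&quot; opens the
 * file later and fsync()s it to make the earlier writes durable.  Per
 * the open() snippet quoted below, a writeback error that occurs
 * before the second open() is not reported to that fd's fsync(). */
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/wait.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    if (fork() == 0) {                     /* &quot;backend&quot;: write, no fsync */
        int wfd = open(&quot;/tmp/example-datafile&quot;, O_WRONLY | O_CREAT, 0644);
        write(wfd, &quot;dirty page&quot;, 10);
        close(wfd);
        _exit(0);
    }
    wait(NULL);

    /* ... time passes; the kernel may attempt, and fail, writeback ... */

    int cfd = open(&quot;/tmp/example-datafile&quot;, O_WRONLY); /* &quot;checkpointer&quot; */
    if (fsync(cfd) == 0)
        printf(&quot;fsync says success - but did the earlier write make it?\n&quot;);
    close(cfd);
    return 0;
}
</code></pre>

<p>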
But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?</p> <p>If so I'm not sure how that can possibly be considered to be an implementation of _POSIX_SYNCHRONIZED_IO: &quot;the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state.&quot; Note &quot;the file&quot;, not &quot;this file descriptor + copies&quot;, and without reference to when you opened it.</p> <blockquote> <p>But I'm not sure what the lifetime of the passed-in &quot;file&quot; and more importantly &quot;file-&gt;f_wb_err&quot; is.</p> </blockquote> <p>It's really inode-&gt;i_mapping-&gt;wb_err's lifetime that I should have been asking about there, not file-&gt;f_wb_err, but I see now that that question is irrelevant due to the above.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 06:00:21 </code></pre> <p>On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file <em>after</em> the failure and try to fsync a write that happened <em>before</em> the failure.</p> </blockquote> <p>I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the <em>file</em> (presumably via any fd), which sounds like the desired behaviour:</p> <p>[..]</p> </blockquote> <p>Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:</p> <p>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping);</p> <p><a href="https://github.com/torvalds/linux/blob/master/fs/open.c#L752">https://github.com/torvalds/linux/blob/master/fs/open.c#L752</a></p> <p>Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?</p> </blockquote> <p>Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?</p> <p>I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. 
Docker in privileged mode doesn't seem to play nice with device-mapper.</p> <p>Does that mean that the ONLY ways to do reliable I/O are:</p> <ul> <li>single-process, single-file-descriptor write() then fsync(); on failure, retry all work since last successful fsync()</li> <li>direct I/O</li> </ul> <p>?</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-04 07:32:04 </code></pre> <p>On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping);</p> <p>[...]</p> </blockquote> <p>Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?</p> </blockquote> <p>Predates the opening of the file by the process that calls fsync(). Yeah, it sure looks that way based on the above code fragment. Does anyone know better?</p> <blockquote> <p>Does that mean that the ONLY ways to do reliable I/O are:</p> <ul> <li>single-process, single-file-descriptor write() then fsync(); on failure, retry all work since last successful fsync()</li> </ul> </blockquote> <p>I suppose you could some up with some crazy complicated IPC scheme to make sure that the checkpointer always has an fd older than any writes to be flushed, with some fallback strategy for when it can't take any more fds.</p> <p>I haven't got any good ideas right now.</p> <blockquote> <ul> <li>direct I/O</li> </ul> </blockquote> <p>As a bit of an aside, I gather that when you resize files (think truncating/extending relation files) you still need to call fsync() even if you read/write all data with O_DIRECT, to make it flush the filesystem meta-data. I have no idea if that could also be affected by eaten writeback errors.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 07:51:53 </code></pre> <p>On 4 April 2018 at 14:00, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file <em>after</em> the failure and try to fsync a write that happened <em>before</em> the failure.</p> </blockquote> <p>I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the <em>file</em> (presumably via any fd), which sounds like the desired behaviour:</p> <p>[..]</p> </blockquote> <p>Scratch that. 
Whenever you open a file descriptor you can't see any preceding errors at all, because:</p> <p>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping);</p> <p><a href="https://github.com/torvalds/linux/blob/master/fs/open.c#L752">https://github.com/torvalds/linux/blob/master/fs/open.c#L752</a></p> <p>Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?</p> </blockquote> <p>Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?</p> <p>I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.</p> </blockquote> <p>Done, you can find it in <a href="https://github.com/ringerc/scrapcode/tree/master/testcases/fsync-error-clear">https://github.com/ringerc/scrapcode/tree/master/testcases/fsync-error-clear</a> now.</p> <p>Warning, this runs a Docker container in privileged mode on your system, and it uses devicemapper. Read it before you run it, and while I've tried to keep it safe, beware that it might eat your system.</p> <p>For now it tests only xfs and EIO. Other FSs should be easy enough.</p> <p>I haven't added coverage for multi-processing yet, but given what you found above, I should. I'll probably just system() a copy of the same proc with instructions to only fsync(). I'll do that next.</p> <p>I haven't worked out a reliable way to trigger ENOSPC on fsync() yet, when mapping without the error hole. It happens sometimes but I don't know why, it almost always happens on write() instead. I know it can happen on nfs, but I'm hoping for a saner example than that to test with. ext4 and xfs do delayed allocation but eager reservation so it shouldn't happen to them.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-04 13:49:38 </code></pre> <p>On Wed, Apr 4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping);</p> <p>[...]</p> </blockquote> <p>Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?</p> </blockquote> <p>Predates the opening of the file by the process that calls fsync(). Yeah, it sure looks that way based on the above code fragment. Does anyone know better?</p> </blockquote> <p>Uh, just to clarify, what is new here is that it is ignoring any <em>errors</em> that happened before the open(). 
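</p>

<p>(Editorial aside: a toy userspace model of the sample-at-open / check-and-advance-at-fsync scheme, written to reproduce both halves of the clarification here; it illustrates the described behaviour and is not the kernel's code.)</p>

<pre><code>/* Toy model: the &quot;inode&quot; keeps a sequence-stamped last writeback error,
 * each &quot;fd&quot; samples that stamp when opened, and &quot;fsync&quot; reports only
 * errors stamped after the fd's sample - and only once. */
#include &lt;stdio.h&gt;

struct inode_state { int err; unsigned seq; }; /* like mapping-&gt;wb_err */
struct fd_state    { unsigned sampled_seq; };  /* like file-&gt;f_wb_err  */

static void writeback_failed(struct inode_state *in, int err)
{
    in-&gt;err = err;
    in-&gt;seq++;                                 /* a new, unseen error  */
}

static struct fd_state fd_open(const struct inode_state *in)
{
    /* &quot;skip any errors that predate opening of the file&quot; */
    struct fd_state f = { in-&gt;seq };
    return f;
}

static int fd_fsync(struct fd_state *f, const struct inode_state *in)
{
    if (in-&gt;seq == f-&gt;sampled_seq)
        return 0;                              /* nothing new: success */
    f-&gt;sampled_seq = in-&gt;seq;                  /* report once, advance */
    return in-&gt;err;
}

int main(void)
{
    struct inode_state in = { 0, 0 };

    writeback_failed(&amp;in, 5);                  /* EIO before any open() */
    struct fd_state late = fd_open(&amp;in);
    printf(&quot;fsync on fd opened after the error: %d\n&quot;, fd_fsync(&amp;late, &amp;in));

    struct fd_state early = fd_open(&amp;in);
    writeback_failed(&amp;in, 5);                  /* EIO while fd is open  */
    printf(&quot;first fsync:  %d\n&quot;, fd_fsync(&amp;early, &amp;in));
    printf(&quot;second fsync: %d\n&quot;, fd_fsync(&amp;early, &amp;in));
    return 0;
}
</code></pre>

<p>The dirty data itself is a separate question: writes that are still sitting dirty in the cache when the new fd is opened will be pushed out by that fd's fsync() as usual, which is the distinction being drawn here.</p>

<p>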
It is not ignoring write()'s that happened but have not been written to storage before the open().</p> <p>FYI, pg_test_fsync has always tested the ability to fsync() writes() from other processes:</p> <pre><code>Test if fsync on non-write file descriptor is honored: (If the times are similar, fsync() can sync data written on a different descriptor.) write, fsync, close 5360.341 ops/sec 187 usecs/op write, close, fsync 4785.240 ops/sec 209 usecs/op </code></pre> <p>Those two numbers should be similar. I added this as a check to make sure the behavior we were relying on was working. I never tested sync errors though.</p> <p>I think the fundamental issue is that we always assumed that writes to the kernel that could not be written to storage would remain in the kernel until they succeeded, and that fsync() would report their existence.</p> <p>I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure? To the first fsync that happens after the failure? How long should it continue to record the failure? What if no fsync() ever happens, which is likely for non-Postgres workloads? I think once they decided to discard failed syncs and not retry them, the fsync behavior we are complaining about was almost required.</p> <p>Our only option might be to tell administrators to closely watch for kernel write failure messages, and then restore or failover. :-(</p> <p>The last time I remember being this surprised about storage was in the early Postgres years when we learned that just because the BSD file system uses 8k pages doesn't mean those are atomically written to storage. We knew the operating system wrote the data in 8k chunks to storage but:</p> <ul> <li>the 8k pages are written as separate 512-byte sectors</li> <li>the 8k might be contiguous logically on the drive but not physically</li> <li>even 512-byte sectors are not written atomically</li> </ul> <p>This is why we added pre-page images to WAL, which is what full_page_writes controls.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-04 13:53:01 </code></pre> <p>On Wed, Apr 4, 2018 at 10:40:16AM +0800, Craig Ringer wrote:</p> <blockquote> <p>The trouble with direct I/O is that it pushes a <em>lot</em> of work back on PostgreSQL regarding knowledge of the storage subsystem, I/O scheduling, etc. It's absurd to have the kernel do this, unless you want it reliable, in which case you bypass it and drive the hardware directly.</p> <p>We'd need pools of writer threads to deal with all the blocking I/O. It'd be such a nightmare.
Hey, why bother having a kernel at all, except for drivers?</p> </blockquote> <p>I believe this is how Oracle views the kernel, so there is precedent for this approach, though I am not advocating it.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 14:00:15 </code></pre> <p>On 4 April 2018 at 15:51, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 14:00, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file <em>after</em> the failure and try to fsync a write that happened <em>before</em> the failure.</p> </blockquote> <p>I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the <em>file</em> (presumably via any fd), which sounds like the desired behaviour:</p> <p>[..]</p> </blockquote> <p>Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:</p> <p>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping);</p> <p><a href="https://github.com/torvalds/linux/blob/master/fs/open.c#L752">https://github.com/torvalds/linux/blob/master/fs/open.c#L752</a></p> <p>Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?</p> </blockquote> <p>Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?</p> <p>I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.</p> </blockquote> <p>Done, you can find it in <a href="https://github.com/ringerc/scrapcode/tree/master/">https://github.com/ringerc/scrapcode/tree/master/</a> testcases/fsync-error-clear now.</p> </blockquote> <p>Update. Now supports multiple FSes.</p> <p>I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO. Didn't try zfs-on-linux or other platforms yet.</p> <p>Still working on getting ENOSPC on fsync() rather than write(). Kernel code reading suggests this is possible, but all the above FSes reserve space eagerly on write( ) even if they do delayed allocation of the actual storage, so it doesn't seem to happen at least in my simple single-process test.</p> <p>I'm not overly inclined to complain about a fsync() succeeding after a write() error. That seems reasonable enough, the kernel told the app at the time of the failure. What else is it going to do? 
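</p>

<p>(Editorial aside: a minimal sketch of what taking the error &quot;at the time of the failure&quot; means for write() itself: handle short writes, EINTR and an immediate error return; it deliberately says nothing about failures that only surface later, at writeback or fsync() time.)</p>

<pre><code>#include &lt;errno.h&gt;
#include &lt;unistd.h&gt;

/* Returns 0 on success, -1 with errno set if the kernel rejected part
 * of the buffer at write() time (e.g. an up-front ENOSPC or EIO). */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len &gt; 0) {
        ssize_t n = write(fd, buf, len);
        if (n &lt; 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing: retry   */
            return -1;             /* reported now, while we still hold buf */
        }
        buf += n;                  /* short write: continue with the rest */
        len -= (size_t) n;
    }
    return 0;
}
</code></pre>

<p>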
I don't personally even object hugely to the current fsync() behaviour if it were, say, DOCUMENTED and conformant to the relevant standards, though not giving us any sane way to find out the affected file ranges makes it drastically harder to recover sensibly.</p> <p>But what's come out since on this thread is that we cannot even rely on fsync() giving us an EIO <em>once</em> when it loses our data, because:</p> <ul> <li>all currently widely deployed kernels can fail to deliver info due to a recently fixed limitation; and</li> <li>the kernel deliberately hides errors from us if they relate to writes that occurred before we opened the FD (?)</li> </ul> <p>... that's really troubling. I thought we could at least fix this by PANICing on EIO, and was mostly worried about ENOSPC. But now it seems we can't even do that and expect reliability. So what the @#$ are we meant to do?</p> <p>It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 14:09:09 </code></pre> <p>On 4 April 2018 at 22:00, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.</p> </blockquote> <p>Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.</p> <p>Nor is ext3/ext4's errors=remount-ro protective.
data_err=abort doesn't help either (so what does it do?).</p> </blockquote> <p>Anthony Iliopoulos reported in this thread that errors=remount-ro is only affected by metadata writes.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 14:42:18 </code></pre> <p>On 4 April 2018 at 22:25, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:</p> <blockquote> <p>On 4 April 2018 at 22:00, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <pre><code>It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly. </code></pre> <p>Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.</p> <p>Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help either (so what does it do?).</p> </blockquote> <p>Anthony Iliopoulos reported in this thread that errors=remount-ro is only affected by metadata writes.</p> </blockquote> <p>Yep, I gathered. I was referring to data_err.</p> <hr> <pre><code>From:Antonis Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-04 15:23:31 </code></pre> <p>On Wed, Apr 4, 2018 at 4:42 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 22:25, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:</p> <blockquote> <p>On 4 April 2018 at 22:00, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <p>It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.</p> <p>Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.</p> <p>Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help either (so what does it do?).</p> </blockquote> <p>Anthony Iliopoulos reported in this thread that errors=remount-ro is only affected by metadata writes.</p> </blockquote> <p>Yep, I gathered. I was referring to data_err.</p> </blockquote> <p>As far as I recall data_err=abort pertains to the jbd2 handling of potential writeback errors. Jbd2 will inetrnally attempt to drain the data upon txn commit (and it's even kind enough to restore the EIO at the address space level, that otherwise would get eaten).</p> <p>When data_err=abort is set, then jbd2 forcibly shuts down the entire journal, with the error being propagated upwards to ext4. 
I am not sure at which point this would be manifested to userspace and how, but in principle any subsequent fs operations would get some filesystem error due to the journal being down (I would assume similar to remounting the fs read-only).</p> <p>Since you are using data=journal, I would indeed expect to see something more than what you saw in dmesg.</p> <p>I can have a look later, I plan to also respond to some of the other interesting issues that you guys raised in the thread.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-04 15:23:51 </code></pre> <p>On 4 April 2018 at 21:49, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:</p> <blockquote> <p>On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>/* Ensure that we skip any errors that predate opening of the file */ f-&gt;f_wb_err = filemap_sample_wb_err(f-&gt;f_mapping);</p> <p>[...]</p> </blockquote> <p>Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?</p> </blockquote> <p>Predates the opening of the file by the process that calls fsync(). Yeah, it sure looks that way based on the above code fragment. Does anyone know better?</p> </blockquote> <p>Uh, just to clarify, what is new here is that it is ignoring any <em>errors</em> that happened before the open(). It is not ignoring write()'s that happened but have not been written to storage before the open().</p> <p>FYI, pg_test_fsync has always tested the ability to fsync() writes() from other processes:</p> <pre><code>Test if fsync on non-write file descriptor is honored: (If the times are similar, fsync() can sync data written on a different descriptor.) write, fsync, close 5360.341 ops/sec 187 usecs/op write, close, fsync 4785.240 ops/sec 209 usecs/op </code></pre> <p>Those two numbers should be similar. I added this as a check to make sure the behavior we were relying on was working. I never tested sync errors though.</p> <p>I think the fundamental issue is that we always assumed that writes to the kernel that could not be written to storage would remain in the kernel until they succeeded, and that fsync() would report their existence.</p> <p>I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure?</p> </blockquote> <p>Ideally until the app tells it not to.</p> <p>But there's no standard API for that.</p> <p>The obvious answer seems to be &quot;until the FD is closed&quot;. But we just discussed how Pg relies on being able to open and close files freely. That may not be as reasonable a thing to do as we thought it was when you consider error reporting. What's the kernel meant to do? How long should it remember &quot;I had an error while doing writeback on this file&quot;? Should it flag the file metadata and remember across reboots? Obviously not, but where does it stop? Tell the next program that does an fsync() and forget? How could it associate a dirty buffer on a file with no open FDs with any particular program at all?
And what if the app did a write then closed the file and went away, never to bother to check the file again, like most apps do?</p> <p>Some I/O errors are transient (network issue, etc). Some are recoverable with some sort of action, like disk space issues, but may take a long time before an admin steps in. Some are entirely unrecoverable (disk 1 in striped array is on fire) and there's no possible recovery. Currently we kind of hope the kernel will deal with figuring out which is which and retrying. Turns out it doesn't do that so much, and I don't think the reasons for that are wholly unreasonable. We may have been asking too much.</p> <p>That does leave us in a pickle when it comes to the checkpointer and opening/closing FDs. I don't know what the &quot;right&quot; thing for the kernel to do from our perspective even is here, but the best I can come up with is actually pretty close to what it does now. Report the fsync() error to the first process that does an fsync() since the writeback error if one has occurred, then forget about it. Ideally I'd have liked it to mark all FDs pointing to the file with a flag to report EIO on next fsync too, but it turns out that won't even help us due to our opening and closing behaviour, so we're going to have to take responsibility for handling and communicating that ourselves, preventing checkpoint completion if any backend gets an fsync error. Probably by PANICing. Some extra work may be needed to ensure reliable ordering and stop checkpoints completing if their fsync() succeeds due to a recent failed fsync() on a normal backend that hasn't PANICed or where the postmaster hasn't noticed yet.</p> <blockquote> <p>Our only option might be to tell administrators to closely watch for kernel write failure messages, and then restore or failover. :-(</p> </blockquote> <p>Speaking of, there's not necessarily any lost page write error in the logs AFAICS. My tests often just show &quot;Buffer I/O error on device dm-0, logical block 59393&quot; or the like.</p> <hr> <pre><code>From:Gasper Zejn &lt;zejn(at)owca(dot)info&gt; Date:2018-04-04 17:23:58 </code></pre> <p>On 04. 04. 2018 15:49, Bruce Momjian wrote:</p> <blockquote> <p>I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure? To the first fsync that happens after the failure? How long should it continue to record the failure? What if no fsync() ever happens, which is likely for non-Postgres workloads? I think once they decided to discard failed syncs and not retry them, the fsync behavior we are complaining about was almost required.</p> </blockquote> <p>Ideally the kernel would keep its data for as little time as possible. With fsync, it doesn't really know which process is interested in knowing about a write error, it just assumes the caller will know how to deal with it. The most unfortunate issue is that there's no way to get information about a write error.</p> <p>Thinking aloud - couldn't/shouldn't a write error also be a file system event reported by inotify?
Admittedly that's only a thing on Linux, but still.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-04 17:51:03 </code></pre> <p>On Wed, Apr 4, 2018 at 11:23:51PM +0800, Craig Ringer wrote:</p> <blockquote> <p>On 4 April 2018 at 21:49, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure?</p> </blockquote> <p>Ideally until the app tells it not to.</p> <p>But there's no standard API for that.</p> </blockquote> <p>You would almost need an API that registers <em>before</em> the failure that you care about sync failures, and that you plan to call fsync() to gather such information. I am not sure how you would allow more than the first fsync() to see the failure unless you added <em>another</em> API to clear the fsync failure, but I don't see the point since the first fsync() might call that clear function. How many applications are going to know there is <em>another</em> application that cares about the failure? Not many.</p> <blockquote> <p>Currently we kind of hope the kernel will deal with figuring out which is which and retrying. Turns out it doesn't do that so much, and I don't think the reasons for that are wholly unreasonable. We may have been asking too much.</p> </blockquote> <p>Agreed.</p> <blockquote> <blockquote> <p>Our only option might be to tell administrators to closely watch for kernel write failure messages, and then restore or failover. :-(</p> </blockquote> <p>Speaking of, there's not necessarily any lost page write error in the logs AFAICS. My tests often just show &quot;Buffer I/O error on device dm-0, logical block 59393&quot; or the like.</p> </blockquote> <p>I assume that is the kernel logs. I am thinking the kernel logs have to be monitored, but how many administrators do that? The other issue I think you are pointing out is how is the administrator going to know this is a Postgres file? I guess any sync error to a device that contains Postgres has to assume Postgres is corrupted. :-(</p> <hr> <p>see explicit treatment of retrying, though I'm not entirely sure if the retry flag is set just for async write-back), and apparently unlike every other kernel I've tried to grok so far (things descended from ancestral BSD but not descended from FreeBSD, with macOS/Darwin apparently in the first category for this purpose).</p> <p>Here's a new ticket in the NetBSD bug database for this stuff:</p> <p><a href="http://gnats.netbsd.org/53152">http://gnats.netbsd.org/53152</a></p> <p>As mentioned in that ticket and by Andres earlier in this thread, keeping the page dirty isn't the only strategy that would work and may be problematic in different ways (it tells the truth but floods your cache with unflushable stuff until eventually you force unmount it and your buffers are eventually invalidated after ENXIO errors? I don't know.). I have no qualified opinion on that. I just know that we need a way for fsync() to tell the truth about all preceding writes or our checkpoints are busted.</p> <p>*We mmap() + msync() in pg_flush_data() if you don't have sync_file_range(), and I see now that that is probably not a great idea on ZFS because you'll finish up double-buffering (or is that triple-buffering?), flooding your page cache with transient data. Oops. 
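</p>

<p>(Editorial aside: a minimal sketch of the mmap()+msync() flush hint referred to above, assuming fd is open read-write, offset is page-aligned and the range was already written with write(); it only encourages early writeback and says nothing about durability or error reporting.)</p>

<pre><code>#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;sys/types.h&gt;

/* Hint the kernel to start writing back an already-written range. */
static void flush_hint(int fd, off_t offset, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    if (p == MAP_FAILED) { perror(&quot;mmap&quot;); return; }

    if (msync(p, len, MS_ASYNC) != 0)  /* schedule writeback, don't wait */
        perror(&quot;msync&quot;);

    munmap(p, len);
}
</code></pre>

<p>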
That is off-topic and not relevant for the checkpoint correctness topic of this thread though, since pg_flush_data() is advisory only.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-04 22:14:24 </code></pre> <p>On Thu, Apr 5, 2018 at 9:28 AM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Thu, Apr 5, 2018 at 2:00 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO. Didn't try zfs-on-linux or other platforms yet.</p> </blockquote> <p>While contemplating what exactly it would do (not sure),</p> </blockquote> <p>See manual for failmode=wait | continue | panic. Even &quot;continue&quot; returns EIO to all new write requests, so they apparently didn't bother to supply an 'eat-my-data-but-tell-me-everything-is-fine' mode. Figures.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-05 07:09:57 </code></pre> <p>Summary to date:</p> <p>It's worse than I thought originally, because:</p> <ul> <li>Most widely deployed kernels have cases where they don't tell you about losing your writes at all; and</li> <li>Information about loss of writes can be masked by closing and re-opening a file</li> </ul> <p>So the checkpointer cannot trust that a successful fsync() means ... a successful fsync().</p> <p>Also, it's been reported to me off-list that anyone on the system calling sync(2) or the sync shell command will also generally consume the write error, causing us not to see it when we fsync(). The same is true for /proc/sys/vm/drop_caches. I have not tested these yet.</p> <p>There's some level of agreement that we should PANIC on fsync() errors, at least on Linux, but likely everywhere. But we also now know it's insufficient to be fully protective.</p> <p>I previously thought that errors=remount-ro was a sufficient safeguard. It isn't. There doesn't seem to be anything that is, for ext3, ext4, btrfs or xfs.</p> <p>It's not clear to me yet why data_err=abort isn't sufficient in data=ordered or data=writeback mode on ext3 or ext4, needs more digging. (In my test tools that's: make FSTYPE=ext4 MKFSOPTS=&quot;&quot; MOUNTOPTS=&quot;errors=remount-ro, data_err=abort,data=journal&quot; as of the current version d7fe802ec). AFAICS that's because data_error=abort only affects data=ordered, not data=journal. If you use data=ordered, you at least get retries of the same write failing. This post <a href="https://lkml.org/lkml/2008/10/10/80">https://lkml.org/lkml/2008/10/10/80</a> added the option and has some explanation, but doesn't explain why it doesn't affect data=journal.</p> <p>zfs is probably not affected by the issues, per Thomas Munro. I haven't run my test scripts on it yet because my kernel doesn't have zfs support and I'm prioritising the multi-process / open-and-close issues.</p> <p>So far none of the FSes and options I've tried exhibit the behaviour I actually want, which is to make the fs readonly or inaccessible on I/O error.</p> <p>ENOSPC doesn't seem to be a concern during normal operation of major file systems (ext3, ext4, btrfs, xfs) because they reserve space before returning from write(). But if a buffered write does manage to fail due to ENOSPC we'll definitely see the same problems.
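<p>(A minimal sketch, not from the thread, of what explicit space reservation looks like from the application side: posix_fallocate() asks the filesystem to allocate the blocks up front and reports ENOSPC immediately and synchronously, instead of leaving it to surface later during writeback. How far the reservation is honoured is filesystem-dependent, and NFS in particular may not guarantee it end-to-end, so treat this as a mitigation sketch rather than a fix. The file name is illustrative.)</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd = open(&quot;segment&quot;, O_RDWR | O_CREAT, 0600);
    if (fd &lt; 0) { perror(&quot;open&quot;); return 1; }

    /* Ask for a real on-disk reservation for the first 16 MB.  Unlike a
     * buffered write(), this fails up front if the space is not there. */
    int rc = posix_fallocate(fd, 0, 16 * 1024 * 1024);
    if (rc != 0)  /* returns an error number rather than -1/errno */
        fprintf(stderr, &quot;preallocation failed: %s\n&quot;, strerror(rc));

    close(fd);
    return 0;
}
</code></pre>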
This makes ENOSPC on NFS a potentially data corrupting condition since NFS doesn't preallocate space before returning from write().</p> <p>I think what we really need is a block-layer fix, where an I/O error flips the block device into read-only mode, as if blockdev --setro had been used. Though I'd settle for a kernel panic, frankly. I don't think anybody really wants this, but I'd rather either of those to silent data loss.</p> <p>I'm currently tweaking my test to do some close and reopen the file between each write() and fsync(), and to support running with nfs.</p> <p>I've also just found the device-mapper &quot;flakey&quot; driver, which looks fantastic for simulating unreliable I/O with intermittent faults. I've been using the &quot;error&quot; target in a mapping, which lets me remap some of the device to always error, but &quot;flakey&quot; looks very handy for actual PostgreSQL testing.</p> <p>For the sake of Google, these are errors known to be associated with the problem:</p> <p>ext4, and ext3 mounted with ext4 driver:</p> <pre><code>[42084.327345] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error 10 writing to inode 12 (offset 0 size 0 starting block 59393) [42084.327352] Buffer I/O error on device dm-0, logical block 59393 </code></pre> <p>xfs:</p> <pre><code>[42193.771367] XFS (dm-0): writeback error on sector 118784 [42193.784477] XFS (dm-0): writeback error on sector 118784 </code></pre> <p>jfs: (nil, silence in the kernel logs)</p> <p>You should also beware of &quot;lost page write&quot; or &quot;lost write&quot; errors.</p> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-05 08:46:08 </code></pre> <p>On 5 April 2018 at 15:09, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>Also, it's been reported to me off-list that anyone on the system calling sync(2) or the sync shell command will also generally consume the write error, causing us not to see it when we fsync(). The same is true for /proc/sys/vm/drop_caches. I have not tested these yet.</p> </blockquote> <p>I just confirmed this with a tweak to the test that</p> <pre><code>records the file position close()s the fd sync()s open(s) the file lseek()s back to the recorded position </code></pre> <p>This causes the test to completely ignore the I/O error, which is not reported to it at any time.</p> <p>Fair enough, really, when you look at it from the kernel's point of view. What else can it do? Nobody has the file open. It'd have to mark the file its self as bad somehow. But that's pretty bad for our robustness AFAICS.</p> <blockquote> <p>There's some level of agreement that we should PANIC on fsync() errors, at least on Linux, but likely everywhere. But we also now know it's insufficient to be fully protective.</p> </blockquote> <p>If dirty writeback fails between our close() and re-open() I see the same behaviour as with sync(). To test that I set dirty_writeback_centisecs and dirty_expire_centisecs to 1 and added a usleep(3*100*1000) between close() and open(). (It's still plenty slow). So sync() is a convenient way to simulate something other than our own fsync() writing out the dirty buffer.</p> <p>If I omit the sync() then we get the error reported by fsync() once when we re open() the file and fsync() it, because the buffers weren't written out yet, so the error wasn't generated until we re-open()ed the file. 
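<p>(For readers following along, here is a stripped-down sketch of the failing sequence described above. It is not the author's actual test program; it assumes the file sits on a device that has been set up to fail writeback, for example with the dm &quot;error&quot; target, and all error handling is omitted.)</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    char page[8192];
    memset(page, 'x', sizeof page);

    /* Backend-style write: dirty the page cache, then close without fsync. */
    int fd = open(&quot;datafile&quot;, O_WRONLY | O_CREAT, 0600);
    write(fd, page, sizeof page);
    close(fd);

    /* Someone else (kernel writeback, another process, the sync command)
     * flushes the dirty page while nobody has the file open.  The write
     * fails, and the error is reported to no one. */
    sync();

    /* Checkpointer-style fsync through a freshly opened descriptor.  On the
     * kernels discussed in this thread it returns success even though the
     * data never reached storage. */
    fd = open(&quot;datafile&quot;, O_WRONLY);
    if (fsync(fd) == 0)
        printf(&quot;fsync reports success; the lost write is invisible\n&quot;);
    close(fd);
    return 0;
}
</code></pre>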
But I doubt that'll happen much in practice because dirty writeback will get to it first so the error will be seen and discarded before we reopen the file in the checkpointer.</p> <p>In other words, it looks like <em>even with a new kernel with the error reporting bug fixes</em>, if I understand how the backends and checkpointer interact when it comes to file descriptors, we're unlikely to notice I/O errors and fail a checkpoint. We may notice I/O errors if a backend does its own eager writeback for large I/O operations, or if the checkpointer fsync()s a file before the kernel's dirty writeback gets around to trying to flush the pages that will fail.</p> <p>I haven't tested anything with multiple processes / multiple FDs yet, where we keep one fd open while writing on another.</p> <p>But at this point I don't see any way to make Pg reliably detect I/O errors and fail a checkpoint then redo and retry. To even fix this by PANICing like I proposed originally, we need to know we have to PANIC.</p> <p>AFAICS it's completely unsafe to write(), close(), open() and fsync() and expect that the fsync() makes any promises about the write(). Which if I read Pg's low level storage code right, makes it completely unable to reliably detect I/O errors.</p> <p>When put it that way, it sounds fair enough too. How long is the kernel meant to remember that there was a write error on the file triggered by a write initiated by some seemingly unrelated process, some unbounded time ago, on a since-closed file?</p> <p>But it seems to put Pg on the fast track to O_DIRECT.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-05 19:33:14 </code></pre> <p>On Thu, Apr 5, 2018 at 03:09:57PM +0800, Craig Ringer wrote:</p> <blockquote> <p>ENOSPC doesn't seem to be a concern during normal operation of major file systems (ext3, ext4, btrfs, xfs) because they reserve space before returning from write(). But if a buffered write does manage to fail due to ENOSPC we'll definitely see the same problems. This makes ENOSPC on NFS a potentially data corrupting condition since NFS doesn't preallocate space before returning from write().</p> </blockquote> <p>This does explain why NFS has a reputation for unreliability for Postgres.</p> <hr> <pre><code>From:Andrew Gierth &lt;andrew(at)tao11(dot)riddles(dot)org(dot)uk&gt; Date:2018-04-05 23:37:42 </code></pre> <p>Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS <em>does</em> do the right thing (discovered by testing on FreeBSD).</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-06 01:27:05 </code></pre> <p>On 6 April 2018 at 07:37, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> wrote:</p> <blockquote> <p>Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS <em>does</em> do the right thing (discovered by testing on FreeBSD).</p> </blockquote> <p>Yikes. 
For other readers, the related thread for this is</p> <p>Meanwhile, I've extended my test to run postgres on a deliberately faulty volume and confirmed my results there.</p> <pre><code>2018-04-06 01:11:40.555 UTC [58] LOG: checkpoint starting: immediate force wait 2018-04-06 01:11:40.567 UTC [58] ERROR: could not fsync file &quot;base/12992/16386&quot;: Input/output error 2018-04-06 01:11:40.655 UTC [66] ERROR: checkpoint request failed 2018-04-06 01:11:40.655 UTC [66] HINT: Consult recent messages in the server log for details. 2018-04-06 01:11:40.655 UTC [66] STATEMENT: CHECKPOINT Checkpoint failed with checkpoint request failed HINT: Consult recent messages in the server log for details. Retrying 2018-04-06 01:11:41.568 UTC [58] LOG: checkpoint starting: immediate force wait 2018-04-06 01:11:41.614 UTC [58] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.046 s; sync files=3, longest=0.000 s, average=0.000 s; distance=2727 kB, estimate=2779 kB </code></pre> <p>Given your report, now I have to wonder if we even reissued the fsync() at all this time. 'perf' time. OK, with</p> <pre><code>sudo perf record -e syscalls:sys_enter_fsync,syscalls:sys_exit_fsync -a sudo perf script </code></pre> <p>I see the failed fync, then the same fd being fsync()d without error on the next checkpoint, which succeeds.</p> <pre><code> postgres 9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd: 0x00000005 postgres 9602 [003] 72380.325931: syscalls:sys_exit_fsync: 0xfffffffffffffffb ... postgres 9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd: 0x00000005 postgres 9602 [000] 72381.336840: syscalls:sys_exit_fsync: 0x0 </code></pre> <p>... and Pg continues merrily on its way without realising it lost data:</p> <pre><code>[72379.834872] XFS (dm-0): writeback error on sector 118752 [72380.324707] XFS (dm-0): writeback error on sector 118688 </code></pre> <p>In this test I set things up so the checkpointer would see the first fsync() error. But if I make checkpoints less frequent, the bgwriter aggressive, and kernel dirty writeback aggressive, it should be possible to have the failure go completely unobserved too. I'll try that next, because we've already largely concluded that the solution to the issue above is to PANIC on fsync() error. But if we don't see the error at all we're in trouble.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-06 02:53:56 </code></pre> <p>On Fri, Apr 6, 2018 at 1:27 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 6 April 2018 at 07:37, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> wrote:</p> <blockquote> <p>Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS <em>does</em> do the right thing (discovered by testing on FreeBSD).</p> </blockquote> <p>Yikes. For other readers, the related thread for this is</p> </blockquote> <p>Yeah. That's really embarrassing, especially after beating up on various operating systems all week. It's also an independent issue -- let's keep that on the other thread and get it fixed.</p> <blockquote> <p>I see the failed fync, then the same fd being fsync()d without error on the next checkpoint, which succeeds.</p> <pre><code> postgres 9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd: </code></pre> <p>0x00000005 postgres 9602 [003] 72380.325931: syscalls:sys_exit_fsync: 0xfffffffffffffffb ... 
postgres 9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd: 0x00000005 postgres 9602 [000] 72381.336840: syscalls:sys_exit_fsync: 0x0</p> <p>... and Pg continues merrily on its way without realising it lost data:</p> <p>[72379.834872] XFS (dm-0): writeback error on sector 118752 [72380.324707] XFS (dm-0): writeback error on sector 118688</p> <p>In this test I set things up so the checkpointer would see the first fsync() error. But if I make checkpoints less frequent, the bgwriter aggressive, and kernel dirty writeback aggressive, it should be possible to have the failure go completely unobserved too. I'll try that next, because we've already largely concluded that the solution to the issue above is to PANIC on fsync() error. But if we don't see the error at all we're in trouble.</p> </blockquote> <p>I suppose you only see errors because the file descriptors linger open in the virtual file descriptor cache, which is a matter of luck depending on how many relation segment files you touched. One thing you could try to confirm our understand of the Linux 4.13+ policy would be to hack PostgreSQL so that it reopens the file descriptor every time in mdsync(). See attached.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-06 03:20:22 </code></pre> <p>On 6 April 2018 at 10:53, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>On Fri, Apr 6, 2018 at 1:27 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 6 April 2018 at 07:37, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> wrote:</p> <blockquote> <p>Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS <em>does</em> do the right thing (discovered by testing on FreeBSD).</p> </blockquote> <p>Yikes. For other readers, the related thread for this is news-spur.riddles.org.uk</p> </blockquote> <p>Yeah. That's really embarrassing, especially after beating up on various operating systems all week. It's also an independent issue -- let's keep that on the other thread and get it fixed.</p> <blockquote> <p>I see the failed fync, then the same fd being fsync()d without error on the next checkpoint, which succeeds.</p> <pre><code> postgres 9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd: </code></pre> <p>0x00000005 postgres 9602 [003] 72380.325931: syscalls:sys_exit_fsync: 0xfffffffffffffffb ... postgres 9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd: 0x00000005 postgres 9602 [000] 72381.336840: syscalls:sys_exit_fsync: 0x0</p> <p>... and Pg continues merrily on its way without realising it lost data:</p> <p>[72379.834872] XFS (dm-0): writeback error on sector 118752 [72380.324707] XFS (dm-0): writeback error on sector 118688</p> <p>In this test I set things up so the checkpointer would see the first fsync() error. But if I make checkpoints less frequent, the bgwriter aggressive, and kernel dirty writeback aggressive, it should be possible to have the failure go completely unobserved too. I'll try that next, because we've already largely concluded that the solution to the issue above is to PANIC on fsync() error. 
But if we don't see the error at all we're in trouble.</p> </blockquote> <p>I suppose you only see errors because the file descriptors linger open in the virtual file descriptor cache, which is a matter of luck depending on how many relation segment files you touched.</p> </blockquote> <p>In this case I think it's because the kernel didn't get around to doing the writeback before the eagerly forced checkpoint fsync()'d it. Or we didn't even queue it for writeback from our own shared_buffers until just before we fsync()'d it. After all, it's a contrived test case that tries to reproduce the issue rapidly with big writes and frequent checkpoints.</p> <p>So the checkpointer had the relation open to fsync() it, and it was the checkpointer's fsync() that did writeback on the dirty page and noticed the error.</p> <p>If we the kernel had done the writeback before the checkpointer opened the relation to fsync() it, we might not have seen the error at all - though as you note this depends on the file descriptor cache. You can see the silent-error behaviour in my standalone test case where I confirmed the post-4.13 behaviour. (I'm on 4.14 here).</p> <p>I can try to reproduce it with postgres too, but it not only requires closing and reopening the FDs, it also requires forcing writeback before opening the fd. To make it occur in a practical timeframe I have to make my kernel writeback settings insanely aggressive and/or call sync() before re-open()ing. I don't really think it's worth it, since I've confirmed the behaviour already with the simpler test in standalone/ in the rest repo. To try it yourself, clone</p> <p><a href="https://github.com/ringerc/scrapcode">https://github.com/ringerc/scrapcode</a></p> <p>and in the master branch</p> <pre><code>cd testcases/fsync-error-clear less README make REOPEN=reopen standalone-run </code></pre> <p>See <a href="https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear/standalone/fsync-error-clear.c#L118">https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear/standalone/fsync-error-clear.c#L118</a> .</p> <p>I've pushed the postgres test to that repo too; &quot;make postgres-run&quot;.</p> <p>You'll need docker, and be warned, it's using privileged docker containers and messing with dmsetup.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-08 02:16:07 </code></pre> <p>So, what can we actually do about this new Linux behaviour?</p> <p>Idea 1:</p> <ul> <li>whenever you open a file, either tell the checkpointer so it can open it too (and wait for it to tell you that it has done so, because it's not safe to write() until then), or send it a copy of the file descriptor via IPC (since duplicated file descriptors share the same f_wb_err)</li> <li>if the checkpointer can't take any more file descriptors (how would that limit even work in the IPC case?), then it somehow needs to tell you that so that you know that you're responsible for fsyncing that file yourself, both on close (due to fd cache recycling) and also when the checkpointer tells you to</li> </ul> <p>Maybe it could be made to work, but sheesh, that seems horrible. 
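<p>(As background for the &quot;send it a copy of the file descriptor via IPC&quot; part of idea 1, here is a minimal, illustrative sketch of descriptor passing over a Unix-domain socket with SCM_RIGHTS; it is not PostgreSQL code. The descriptor the peer receives is a kernel-level duplicate, so it refers to the same open file description and therefore sees the same writeback-error state.)</p> <pre><code>#include &lt;string.h&gt;
#include &lt;sys/socket.h&gt;
#include &lt;sys/uio.h&gt;

/* Send an already-open file descriptor to a peer over a connected
 * Unix-domain socket (e.g. one end of a socketpair()).  The kernel
 * installs a duplicate of the descriptor in the receiving process,
 * which extracts it from the ancillary data with recvmsg(). */
int send_fd(int sock, int fd)
{
    char byte = 0;
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct iovec iov;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    iov.iov_base = &amp;byte;          /* the message must carry at least one byte */
    iov.iov_len = 1;

    memset(&amp;msg, 0, sizeof msg);
    memset(cbuf, 0, sizeof cbuf);
    msg.msg_iov = &amp;iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;        /* ancillary data buffer carrying the fd */
    msg.msg_controllen = sizeof cbuf;

    cmsg = CMSG_FIRSTHDR(&amp;msg);
    cmsg-&gt;cmsg_level = SOL_SOCKET;
    cmsg-&gt;cmsg_type = SCM_RIGHTS;
    cmsg-&gt;cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &amp;fd, sizeof(int));

    return sendmsg(sock, &amp;msg, 0) == 1 ? 0 : -1;
}
</code></pre>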
Is there some simpler idea along these lines that could make sure that fsync() is only ever called on file descriptors that were opened before all unflushed writes, or file descriptors cloned from such file descriptors?</p> <p>Idea 2:</p> <p>Give up, complain that this implementation is defective and unworkable, both on POSIX-compliance grounds and on POLA grounds, and campaign to get it fixed more fundamentally (actual details left to the experts, no point in speculating here, but we've seen a few approaches that work on other operating systems including keeping buffers dirty and marking the whole filesystem broken/read-only).</p> <p>Idea 3:</p> <p>Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.</p> <p>Any other ideas?</p> <p>For a while I considered suggesting an idea which I now think doesn't work. I thought we could try asking for a new fcntl interface that spits out wb_err counter. Call it an opaque error token or something. Then we could store it in our fsync queue and safely close the file. Check again before fsync()ing, and if we ever see a different value, PANIC because it means a writeback error happened while we weren't looking. Sadly I think it doesn't work because AIUI inodes are not pinned in kernel memory when no one has the file open and there are no dirty buffers, so I think the counters could go away and be reset. Perhaps you could keep inodes pinned by keeping the associated buffers dirty after an error (like FreeBSD), but if you did that you'd have solved the problem already and wouldn't really need the wb_err system at all. Is there some other idea long these lines that could work?</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-08 02:33:37 </code></pre> <p>On Sun, Apr 8, 2018 at 02:16:07PM +1200, Thomas Munro wrote:</p> <blockquote> <p>So, what can we actually do about this new Linux behaviour?</p> <p>Idea 1:</p> <ul> <li><p>whenever you open a file, either tell the checkpointer so it can open it too (and wait for it to tell you that it has done so, because it's not safe to write() until then), or send it a copy of the file descriptor via IPC (since duplicated file descriptors share the same f_wb_err)</p></li> <li><p>if the checkpointer can't take any more file descriptors (how would that limit even work in the IPC case?), then it somehow needs to tell you that so that you know that you're responsible for fsyncing that file yourself, both on close (due to fd cache recycling) and also when the checkpointer tells you to</p></li> </ul> <p>Maybe it could be made to work, but sheesh, that seems horrible. 
Is there some simpler idea along these lines that could make sure that fsync() is only ever called on file descriptors that were opened before all unflushed writes, or file descriptors cloned from such file descriptors?</p> <p>Idea 2:</p> <p>Give up, complain that this implementation is defective and unworkable, both on POSIX-compliance grounds and on POLA grounds, and campaign to get it fixed more fundamentally (actual details left to the experts, no point in speculating here, but we've seen a few approaches that work on other operating systems including keeping buffers dirty and marking the whole filesystem broken/read-only).</p> <p>Idea 3:</p> <p>Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.</p> </blockquote> <p>Idea 4 would be for people to assume their database is corrupt if their server logs report any I/O error on the file systems Postgres uses.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-08 02:37:47 </code></pre> <blockquote> <p>On Apr 7, 2018, at 19:33, Bruce Momjian <bruce(at)momjian(dot)us> wrote: Idea 4 would be for people to assume their database is corrupt if their server logs report any I/O error on the file systems Postgres uses.</p> </blockquote> <p>Pragmatically, that's where we are right now. The best answer in this bad situation is (a) fix the error, then (b) replay from a checkpoint before the error occurred, but it appears we can't even guarantee that a PostgreSQL process will be the one to see the error.</p> <p>-- -- Christophe Pettus xof(at)thebuild(dot)com</p> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-08 03:27:45 </code></pre> <p>On 8 April 2018 at 10:16, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>So, what can we actually do about this new Linux behaviour?</p> </blockquote> <p>Yeah, I've been cooking over that myself.</p> <p>More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.</p> <p>We have a storage abstraction that makes this way, way less painful than it should be.</p> <p>We can virtualize relfilenodes into storage extents in relatively few big files. We could use sparse regions to make the addressing more convenient, but that makes copying and backup painful, so I'd rather not.</p> <p>Even one file per tablespace for persistent relation heaps, another for indexes, another for each fork type.</p> <p>That way we can use something like your #1 (which is what I was also thinking about then rejecting previously), but reduce the pain by reducing the FD count drastically so exhausting FDs stops being a problem.</p> <p>Previously I was leaning toward what you've described here:</p> <blockquote> <ul> <li><p>whenever you open a file, either tell the checkpointer so it can open it too (and wait for it to tell you that it has done so, because it's not safe to write() until then), or send it a copy of the file descriptor via IPC (since duplicated file descriptors share the same f_wb_err)</p></li> <li><p>if the checkpointer can't take any more file descriptors (how would that limit even work in the IPC case?), then it somehow needs to tell you that so that you know that you're responsible for fsyncing that file yourself, both on close (due to fd cache recycling) and also when the checkpointer tells you to</p></li> </ul> <p>Maybe it could be made to work, but sheesh, that seems horrible. 
Is there some simpler idea along these lines that could make sure that fsync() is only ever called on file descriptors that were opened before all unflushed writes, or file descriptors cloned from such file descriptors?</p> </blockquote> <p>... and got stuck on &quot;yuck, that's awful&quot;.</p> <p>I was assuming we'd force early checkpoints if the checkpointer hit its fd limit, but that's even worse.</p> <p>We'd need to urgently do away with segmented relations, and partitions would start to become a hinderance.</p> <p>Even then it's going to be an unworkable nightmare with heavily partitioned systems, systems that use schema-sharding, etc. And it'll mean we need to play with process limits and, often, system wide limits on FDs. I imagine the performance implications won't be pretty.</p> <p>Idea 2:</p> <blockquote> <p>Give up, complain that this implementation is defective and unworkable, both on POSIX-compliance grounds and on POLA grounds, and campaign to get it fixed more fundamentally (actual details left to the experts, no point in speculating here, but we've seen a few approaches that work on other operating systems including keeping buffers dirty and marking the whole filesystem broken/read-only).</p> </blockquote> <p>This appears to be what SQLite does AFAICS.</p> <p><a href="https://www.sqlite.org/atomiccommit.html">https://www.sqlite.org/atomiccommit.html</a></p> <p>though it has the huge luxury of a single writer, so it's probably only subject to the original issue not the multiprocess / checkpointer issues we face.</p> <blockquote> <p>Idea 3:</p> <p>Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.</p> </blockquote> <p>That seems to be what the kernel folks will expect. But that's going to KILL performance. We'll need writer threads to have any hope of it not <em>totally</em> sucking, because otherwise simple things like updating a heap tuple and two related indexes will incur enormous disk latencies.</p> <p>But I suspect it's the path forward.</p> <p>Goody.</p> <blockquote> <p>Any other ideas?</p> <p>For a while I considered suggesting an idea which I now think doesn't work. I thought we could try asking for a new fcntl interface that spits out wb_err counter. Call it an opaque error token or something. Then we could store it in our fsync queue and safely close the file. Check again before fsync()ing, and if we ever see a different value, PANIC because it means a writeback error happened while we weren't looking. Sadly I think it doesn't work because AIUI inodes are not pinned in kernel memory when no one has the file open and there are no dirty buffers, so I think the counters could go away and be reset. Perhaps you could keep inodes pinned by keeping the associated buffers dirty after an error (like FreeBSD), but if you did that you'd have solved the problem already and wouldn't really need the wb_err system at all. Is there some other idea long these lines that could work?</p> </blockquote> <p>I think our underlying data syncing concept is fundamentally broken, and it's not really the kernel's fault.</p> <p>We assume that we can safely:</p> <pre><code>procA: open() procA: write() procA: close() </code></pre> <p>... 
some long time later, unbounded as far as the kernel is concerned ...</p> <pre><code>procB: open() procB: fsync() procB: close() </code></pre> <p>If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?</p> <p>Should it just remember &quot;this file had an error&quot; forever, and tell every caller? In that case how could we recover? We'd need some new API to say &quot;yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now&quot;. Otherwise it'd keep reporting an error after we did redo to recover, too.</p> <p>I never really clicked to the fact that we closed relations with pending buffered writes, left them closed, then reopened them to fsync. That's .... well, the kernel isn't the only thing doing crazy things here.</p> <p>Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.</p> <p>Fun times.</p> <p>This also means AFAICS that running Pg on NFS is extremely unsafe, you MUST make sure you don't run out of disk. Because the usual safeguard of space reservation against ENOSPC in fsync doesn't apply to NFS. (I haven't tested this with nfsv3 in sync,hard,nointr mode yet, <em>maybe</em> that's safe, but I doubt it). The same applies to thin-provisioned storage. Just. Don't.</p> <p>This helps explain various reports of corruption in Docker and various other tools that use various sorts of thin provisioning. If you hit ENOSPC in fsync(), bye bye data.</p> <hr> <pre><code>From:Peter Geoghegan &lt;pg(at)bowt(dot)ie&gt; Date:2018-04-08 03:37:06 </code></pre> <p>On Sat, Apr 7, 2018 at 8:27 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.</p> <p>We have a storage abstraction that makes this way, way less painful than it should be.</p> <p>We can virtualize relfilenodes into storage extents in relatively few big files. We could use sparse regions to make the addressing more convenient, but that makes copying and backup painful, so I'd rather not.</p> <p>Even one file per tablespace for persistent relation heaps, another for indexes, another for each fork type.</p> </blockquote> <p>I'm not sure that we can do that now, since it would break the new &quot;Optimize btree insertions for common case of increasing values&quot; optimization. (I did mention this before it went in.)</p> <p>I've asked Pavan to at least add a note to the nbtree README that explains the high level theory behind the optimization, as part of post-commit clean-up. I'll ask him to say something about how it might affect extent-based storage, too.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-08 03:46:17 </code></pre> <blockquote> <p>On Apr 7, 2018, at 20:27, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). 
If it did a checkpoint since the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.</p> </blockquote> <p>Before we spiral down into despair and excessive alcohol consumption, this is basically the same situation as a checksum failure or some other kind of uncorrected media-level error. The bad part is that we have to find out from the kernel logs rather than from PostgreSQL directly. But this does not strike me as otherwise significantly different from, say, an infrequently-accessed disk block reporting an uncorrectable error when we finally get around to reading it.</p> <hr> <pre><code>From:Andreas Karlsson &lt;andreas(at)proxel(dot)se&gt; Date:2018-04-08 09:41:06 </code></pre> <p>On 04/08/2018 05:27 AM, Craig Ringer wrote:&gt;</p> <blockquote> <p>More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.</p> </blockquote> <p>FYI: MySQL has by default one file per table these days. The old approach with one massive file was a maintenance headache so they change the default some releases ago.</p> <p><a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html">https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html</a></p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-08 10:30:31 </code></pre> <p>On 8 April 2018 at 11:46, Christophe Pettus <xof(at)thebuild(dot)com> wrote:</p> <blockquote> <p>On Apr 7, 2018, at 20:27, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.</p> <p>Before we spiral down into despair and excessive alcohol consumption, this is basically the same situation as a checksum failure or some other kind of uncorrected media-level error. The bad part is that we have to find out from the kernel logs rather than from PostgreSQL directly. But this does not strike me as otherwise significantly different from, say, an infrequently-accessed disk block reporting an uncorrectable error when we finally get around to reading it.</p> </blockquote> <p>I don't entirely agree - because it affects ENOSPC, I/O errors on thin provisioned storage, I/O errors on multipath storage, etc. (I identified the original issue on a thin provisioned system that ran out of backing space, mangling PostgreSQL in a way that made no sense at the time).</p> <p>These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-08 10:31:24 </code></pre> <p>On 8 April 2018 at 17:41, Andreas Karlsson <andreas(at)proxel(dot)se> wrote:</p> <blockquote> <p>On 04/08/2018 05:27 AM, Craig Ringer wrote:&gt; More below, but here's an idea #5: decide InnoDB has the right idea, and</p> <blockquote> <p>go to using a single massive blob file, or a few giant blobs.</p> </blockquote> <p>FYI: MySQL has by default one file per table these days. 
The old approach with one massive file was a maintenance headache so they change the default some releases ago.</p> <p><a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html">https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html</a></p> </blockquote> <p>Huh, thanks for the update.</p> <p>We should see how they handle reliable flushing and see if they've looked into it. If they haven't, we should give them a heads-up and if they have, lets learn from them.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-08 16:38:03 </code></pre> <blockquote> <p>On Apr 8, 2018, at 03:30, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.</p> </blockquote> <p>This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen. (And great work tracking it down!) I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity. I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this. While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-08 21:23:21 </code></pre> <p>On 8 April 2018 at 04:27, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 8 April 2018 at 10:16, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <p>If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?</p> <p>Should it just remember &quot;this file had an error&quot; forever, and tell every caller? In that case how could we recover? We'd need some new API to say &quot;yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now&quot;. Otherwise it'd keep reporting an error after we did redo to recover, too.</p> </blockquote> <p>There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep track of any errors. We just need fsync to accurately report whether all the buffers in the file have been written out. When you call fsync again the kernel needs to initiate i/o on all the dirty buffers and block until they complete successfully. If they complete successfully then nobody cares whether they had some failure in the past when i/o was initiated at some point in the past.</p> <p>The problem is not that errors aren't been tracked correctly. The problem is that dirty buffers are being marked clean when they haven't been written out. They consider dirty filesystem buffers when there's hardware failure preventing them from being written &quot;a memory leak&quot;.</p> <p>As long as any error means the kernel has discarded writes then there's no real hope of any reliable operation through that interface.</p> <p>Going to DIRECTIO is basically recognizing this. That the kernel filesystem buffer provides no reliable interface so we need to reimplement it ourselves in user space.</p> <p>It's rather disheartening. Aside from having to do all that work we have the added barrier that we don't have as much information about the hardware as the kernel has. 
We don't know where raid stripes begin and end, how big the memory controller buffers are or how to tell when they're full or empty or how to flush them. etc etc. We also don't know what else is going on on the machine.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-08 21:28:43 </code></pre> <blockquote> <p>On Apr 8, 2018, at 14:23, Greg Stark <stark(at)mit(dot)edu> wrote:</p> <p>They consider dirty filesystem buffers when there's hardware failure preventing them from being written &quot;a memory leak&quot;.</p> </blockquote> <p>That's not an irrational position. File system buffers are <em>not</em> dedicated memory for file system caching; they're being used for that because no one has a better use for them at that moment. If an inability to flush them to disk meant that they suddenly became pinned memory, a large copy operation to a yanked USB drive could result in the system having no more allocatable memory. I guess in theory that they could swap them, but swapping out a file system buffer in hopes that sometime in the future it could be properly written doesn't seem very architecturally sound to me.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-08 21:47:04 </code></pre> <p>On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 8 April 2018 at 04:27, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 8 April 2018 at 10:16, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <p>If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?</p> <p>Should it just remember &quot;this file had an error&quot; forever, and tell every caller? In that case how could we recover? We'd need some new API to say &quot;yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now&quot;. Otherwise it'd keep reporting an error after we did redo to recover, too.</p> </blockquote> <p>There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep track of any errors. We just need fsync to accurately report whether all the buffers in the file have been written out. When you call fsync</p> </blockquote> <p>Instead, fsync() reports when some of the buffers have not been written out, due to reasons outlined before. As such it may make some sense to maintain some tracking regarding errors even after marking failed dirty pages as clean (in fact it has been proposed, but this introduces memory overhead).</p> <blockquote> <p>again the kernel needs to initiate i/o on all the dirty buffers and block until they complete successfully. If they complete successfully then nobody cares whether they had some failure in the past when i/o was initiated at some point in the past.</p> </blockquote> <p>The question is, what should the kernel and application do in cases where this is simply not possible (according to freebsd that keeps dirty pages around after failure, for example, -EIO from the block layer is a contract for unrecoverable errors so it is pointless to keep them dirty). You'd need a specialized interface to clear-out the errors (and drop the dirty pages), or potentially just remount the filesystem.</p> <blockquote> <p>The problem is not that errors aren't been tracked correctly. The problem is that dirty buffers are being marked clean when they haven't been written out. 
They consider dirty filesystem buffers when there's hardware failure preventing them from being written &quot;a memory leak&quot;.</p> <p>As long as any error means the kernel has discarded writes then there's no real hope of any reliable operation through that interface.</p> </blockquote> <p>This does not necessarily follow. Whether the kernel discards writes or not would not really help (see above). It is more a matter of proper &quot;reporting contract&quot; between userspace and kernel, and tracking would be a way for facilitating this vs. having a more complex userspace scheme (as described by others in this thread) where synchronization for fsync() is required in a multi-process application.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-08 22:29:16 </code></pre> <p>On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:</p> <blockquote> <blockquote> <p>On Apr 8, 2018, at 03:30, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.</p> </blockquote> <p>This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen. (And great work tracking it down!) I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity. I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this. While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures.</p> </blockquote> <p>I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost. If we could stop Postgres when such errors happen, at least the administrator could fix the problem of fail-over to a standby.</p> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-08 23:10:24 </code></pre> <blockquote> <p>On Apr 8, 2018, at 15:29, Bruce Momjian <bruce(at)momjian(dot)us> wrote: I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost.</p> </blockquote> <p>Yeah, it's bad. In the short term, the best advice to installations is to monitor their kernel logs for errors (which very few do right now), and make sure they have a backup strategy which can encompass restoring from an error like this. Even Craig's smart fix of patching the backup label to recover from a previous checkpoint doesn't do much good if we don't have WAL records back that far (or one of the required WAL records also took a hit).</p> <p>In the longer term... O_DIRECT seems like the most plausible way out of this, but that might be popular with people running on file systems or OSes that don't have this issue. 
(Setting aside the daunting prospect of implementing that.)</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-08 23:16:25 </code></pre> <p>On 2018-04-08 18:29:16 -0400, Bruce Momjian wrote:</p> <blockquote> <p>On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:</p> <blockquote> <blockquote> <p>On Apr 8, 2018, at 03:30, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:</p> <p>These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.</p> </blockquote> <p>This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen. (And great work tracking it down!) I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity. I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this. While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures.</p> </blockquote> <p>I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost. If we could stop Postgres when such errors happen, at least the administrator could fix the problem of fail-over to a standby.</p> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> </blockquote> <p>I think the danger presented here is far smaller than some of the statements in this thread might make one think. In all likelihood, once you've got an IO error that kernel level retries don't fix, your database is screwed. Whether fsync reports that or not is really somewhat besides the point. We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).</p> <p>There's a lot of not so great things here, but I don't think there's any need to panic.</p> <p>We should fix things so that reported errors are treated with crash recovery, and for the rest I think there's very fair arguments to be made that that's far outside postgres's remit.</p> <p>I think there's pretty good reasons to go to direct IO where supported, but error handling doesn't strike me as a particularly good reason for the move.</p> <hr> <pre><code>From:Christophe Pettus &lt;xof(at)thebuild(dot)com&gt; Date:2018-04-08 23:27:57 </code></pre> <blockquote> <p>On Apr 8, 2018, at 16:16, Andres Freund <andres(at)anarazel(dot)de> wrote: We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).</p> </blockquote> <p>There is a distinction to be drawn there, though, because we immediately pass an error back to the client on a read, but a write problem in this situation can be masked for an extended period of time.</p> <p>That being said...</p> <blockquote> <p>There's a lot of not so great things here, but I don't think there's any need to panic.</p> </blockquote> <p>No reason to panic, yes. We can assume that if this was a very big persistent problem, it would be much more widely reported. 
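<p>(To make &quot;reported errors are treated with crash recovery&quot; concrete, here is a hedged sketch of that policy, not actual PostgreSQL code: a failed fsync() is treated as fatal rather than retried, because by the time it returns the kernel may already have marked the pages clean, and a later fsync() can then report success without the data ever having reached storage.)</p> <pre><code>#include &lt;errno.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

/* Illustrative policy: never downgrade an fsync() failure to a retryable
 * ERROR; crash and run redo from the last good checkpoint instead. */
void durable_fsync_or_die(int fd, const char *path)
{
    if (fsync(fd) != 0)
    {
        fprintf(stderr, &quot;fsync of %s failed: %s\n&quot;, path, strerror(errno));
        abort();   /* stand-in for PostgreSQL's PANIC */
    }
}
</code></pre>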
It would, however, be good to find a way to get the error surfaced back up to the client in a way that is not just monitoring the kernel logs.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-09 01:31:56 </code></pre> <p>On 9 April 2018 at 05:28, Christophe Pettus <xof(at)thebuild(dot)com> wrote:</p> <blockquote> <blockquote> <p>On Apr 8, 2018, at 14:23, Greg Stark <stark(at)mit(dot)edu> wrote:</p> <p>They consider dirty filesystem buffers when there's hardware failure preventing them from being written &quot;a memory leak&quot;.</p> </blockquote> <p>That's not an irrational position. File system buffers are <em>not</em> dedicated memory for file system caching; they're being used for that because no one has a better use for them at that moment. If an inability to flush them to disk meant that they suddenly became pinned memory, a large copy operation to a yanked USB drive could result in the system having no more allocatable memory. I guess in theory that they could swap them, but swapping out a file system buffer in hopes that sometime in the future it could be properly written doesn't seem very architecturally sound to me.</p> </blockquote> <p>Yep.</p> <p>Another example is a write to an NFS or iSCSI volume that goes away forever. What if the app keeps write()ing in the hopes it'll come back, and by the time the kernel starts reporting EIO for write(), it's already saddled with a huge volume of dirty writeback buffers it can't get rid of because someone, one day, might want to know about them?</p> <p>You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok? What if it's remounted again? That'd be really bad too, for someone expecting write reliability.</p> <p>You can coarsen from dirty buffer tracking to marking the FD(s) bad, but what if there's no FD to mark because the file isn't open at the moment?</p> <p>You can mark the inode cache entry and pin it, I guess. But what if your app triggered I/O errors over vast numbers of small files? Again, the kernel's left holding the ball.</p> <p>It doesn't know if/when an app will return to check. It doesn't know how long to remember the failure for. It doesn't know when all interested clients have been informed and it can treat the fault as cleared/repaired, either, so it'd have to <em>keep on reporting EIO for PostgreSQL's own writes and fsyncs() indefinitely</em>, even once we do recovery.</p> <p>The only way it could avoid that would be to keep the dirty writeback pages around and flagged bad, then clear the flag when a new write() replaces the same file range. I can't imagine that being practical.</p> <p>Blaming the kernel for this sure is the easy way out.</p> <p>But IMO we cannot rationally expect the kernel to remember error state forever for us, then forget it when we expect, all without actually telling it anything about our activities or even that we still exist and are still interested in the files/writes. 
We've closed the files and gone away.</p> <p>Whatever we do, it's likely going to have to involve not doing that anymore.</p> <p>Even if we can somehow convince the kernel folks to add a new interface for us that reports I/O errors to some listener, like an inotify/fnotify/dnotify/whatever-it-is-today-notify extension reporting errors in buffered async writes, we won't be able to rely on having it for 5-10 years, and only on Linux.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-09 01:35:06 </code></pre> <p>On 9 April 2018 at 06:29, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost.</p> </blockquote> <p>Right.</p> <p>Specifically, we need a way to ask the kernel at checkpoint time &quot;was everything written to [this set of files] flushed successfully since the last time I asked, no matter who did the writing and no matter how the writes were flushed?&quot;</p> <p>If the result is &quot;no&quot; we PANIC and redo. If the hardware/volume is screwed, the user can fail over to a standby, do PITR, etc.</p> <p>But we don't have any way to ask that reliably at present.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 01:55:10 </code></pre> <p>Hi,</p> <p>On 2018-04-08 16:27:57 -0700, Christophe Pettus wrote:</p> <blockquote> <blockquote> <p>On Apr 8, 2018, at 16:16, Andres Freund <andres(at)anarazel(dot)de> wrote: We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).</p> </blockquote> <p>There is a distinction to be drawn there, though, because we immediately pass an error back to the client on a read, but a write problem in this situation can be masked for an extended period of time.</p> </blockquote> <p>Only if you're &quot;lucky&quot; enough that your clients actually read that data, and then you're somehow able to figure out across the whole stack that these 0.001% of transactions that fail are due to IO errors. Or you also need to do log analysis.</p> <p>If you want to solve things like that you need regular reads of all your data, including verifications etc.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-09 02:00:41 </code></pre> <p>On 9 April 2018 at 07:16, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>I think the danger presented here is far smaller than some of the statements in this thread might make one think.</p> </blockquote> <p>Clearly it's not happening a huge amount or we'd have a lot of noise about Pg eating people's data, people shouting about how unreliable it is, etc. We don't. So it's not some earth shattering imminent threat to everyone's data. It's gone unnoticed, or the root cause unidentified, for a long time.</p> <p>I suspect we've written off a fair few issues in the past as &quot;it'd bad hardware&quot; when actually, the hardware fault was the trigger for a Pg/kernel interaction bug. And blamed containers for things that weren't really the container's fault. 
But even so, if it were happening tons, we'd hear more noise.</p> <p>I've already been very surprised there when I learned that PostgreSQL completely ignores wholly absent relfilenodes. Specifically, if you unlink() a relation's backing relfilenode while Pg is down and that file has writes pending in the WAL. We merrily re-create it with uninitalized pages and go on our way. As Andres pointed out in an offlist discussion, redo isn't a consistency check, and it's not obliged to fail in such cases. We can say &quot;well, don't do that then&quot; and define away file losses from FS corruption etc as not our problem, the lower levels we expect to take care of this have failed.</p> <p>We have to look at what checkpoints are and are not supposed to promise, and whether this is a problem we just define away as &quot;not our problem, the lower level failed, we're not obliged to detect this and fail gracefully.&quot;</p> <p>We can choose to say that checkpoints are required to guarantee crash/power loss safety ONLY and do not attempt to protect against I/O errors of any sort. In fact, I think we should likely amend the documentation for release versions to say just that.</p> <blockquote> <p>In all likelihood, once you've got an IO error that kernel level retries don't fix, your database is screwed.</p> </blockquote> <p>Your database is going to be down or have interrupted service. It's possible you may have some unreadable data. This could result in localised damage to one or more relations. That could affect FK relationships, indexes, all sorts. If you're really unlucky you might lose something critical like pg_clog/ contents.</p> <p>But in general your DB should be repairable/recoverable even in those cases.</p> <p>And in many failure modes there's no reason to expect any data loss at all, like:</p> <ul> <li>Local disk fills up (seems to be safe already due to space reservation at write() time)</li> <li>Thin-provisioned storage backing local volume iSCSI or paravirt block device fills up</li> <li>NFS volume fills up</li> <li>Multipath I/O error</li> <li>Interruption of connectivity to network block device</li> <li>Disk develops localized bad sector where we haven't previously written data</li> </ul> <p>Except for the ENOSPC on NFS, all the rest of the cases can be handled by expecting the kernel to retry forever and not return until the block is written or we reach the heat death of the universe. And NFS, well...</p> <p>Part of the trouble is that the kernel <em>won't</em> retry forever in all these cases, and doesn't seem to have a way to ask it to in all cases.</p> <p>And if the user hasn't configured it for the right behaviour in terms of I/O error resilience, we don't find out about it.</p> <p>So it's not the end of the world, but it'd sure be nice to fix.</p> <blockquote> <p>Whether fsync reports that or not is really somewhat besides the point. We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).</p> </blockquote> <p>That's because reads don't make promises about what's committed and synced. 
I think that's quite different.</p> <blockquote> <p>We should fix things so that reported errors are treated with crash recovery, and for the rest I think there's very fair arguments to be made that that's far outside postgres's remit.</p> </blockquote> <p>Certainly for current versions.</p> <p>I think we need to think about a more robust path in future. But it's certainly not &quot;stop the world&quot; territory.</p> <p>The docs need an update to indicate that we explicitly disclaim responsibility for I/O errors on async writes, and that the kernel and I/O stack must be configured never to give up on buffered writes. If it does, that's not our problem anymore.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 02:06:12 </code></pre> <p>On 2018-04-09 10:00:41 +0800, Craig Ringer wrote:</p> <blockquote> <p>I suspect we've written off a fair few issues in the past as &quot;it'd bad hardware&quot; when actually, the hardware fault was the trigger for a Pg/kernel interaction bug. And blamed containers for things that weren't really the container's fault. But even so, if it were happening tons, we'd hear more noise.</p> </blockquote> <p>Agreed on that, but I think that's FAR more likely to be things like multixacts, index structure corruption due to logic bugs etc.</p> <blockquote> <p>I've already been very surprised there when I learned that PostgreSQL completely ignores wholly absent relfilenodes. Specifically, if you unlink() a relation's backing relfilenode while Pg is down and that file has writes pending in the WAL. We merrily re-create it with uninitalized pages and go on our way. As Andres pointed out in an offlist discussion, redo isn't a consistency check, and it's not obliged to fail in such cases. We can say &quot;well, don't do that then&quot; and define away file losses from FS corruption etc as not our problem, the lower levels we expect to take care of this have failed.</p> </blockquote> <p>And it'd be a realy bad idea to behave differently.</p> <blockquote> <p>And in many failure modes there's no reason to expect any data loss at all, like:</p> <ul> <li>Local disk fills up (seems to be safe already due to space reservation at write() time)</li> </ul> </blockquote> <p>That definitely should be treated separately.</p> <blockquote> <ul> <li>Thin-provisioned storage backing local volume iSCSI or paravirt block device fills up</li> <li>NFS volume fills up</li> </ul> </blockquote> <p>Those should be the same as the above.</p> <blockquote> <p>I think we need to think about a more robust path in future. But it's certainly not &quot;stop the world&quot; territory.</p> </blockquote> <p>I think you're underestimating the complexity of doing that by at least two orders of magnitude.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-09 03:15:01 </code></pre> <p>On 9 April 2018 at 10:06, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <blockquote> <p>And in many failure modes there's no reason to expect any data loss at all, like:</p> <ul> <li>Local disk fills up (seems to be safe already due to space reservation at write() time)</li> </ul> </blockquote> <p>That definitely should be treated separately.</p> </blockquote> <p>It is, because all the FSes I looked at reserve space before returning from write(), even if they do delayed allocation. So they won't fail with ENOSPC at fsync() time or silently due to lost errors on background writeback. 
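</p> <p>(Illustrative aside, not part of the original email: a minimal C sketch of the write()-time error path described here. On a local filesystem that reserves space at write() time, a full disk surfaces as ENOSPC from write() itself, where the caller can handle it immediately; the file name is arbitrary.)</p> <pre><code>#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

/* Write the whole buffer, retrying short writes; 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len &gt; 0)
    {
        ssize_t n = write(fd, buf, len);
        if (n &lt; 0)
        {
            if (errno == EINTR)
                continue;
            return -1;          /* e.g. ENOSPC, reported synchronously here */
        }
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

int main(void)
{
    int fd = open(&quot;datafile&quot;, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd &lt; 0 || write_all(fd, &quot;hello\n&quot;, 6) != 0)
    {
        perror(&quot;open/write&quot;);   /* a full local filesystem shows up here... */
        return 1;
    }
    return 0;                   /* ...not at some later fsync() or writeback */
}
</code></pre> <p>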
Otherwise we'd be hearing a LOT more noise about this.</p> <blockquote> <blockquote> <ul> <li>Thin-provisioned storage backing local volume iSCSI or paravirt block device fills up</li> <li>NFS volume fills up</li> </ul> </blockquote> <p>Those should be the same as the above.</p> </blockquote> <p>Unfortunately, they aren't.</p> <p>AFAICS NFS doesn't reserve space with the other end before returning from write(), even if mounted with the sync option. So we can get ENOSPC lazily when the buffer writeback fails due to a full backing file system. This then travels the same paths as EIO: we fsync(), ERROR, retry, appear to succeed, and carry on with life losing the data. Or we never hear about the error in the first place.</p> <p>(There's a proposed extension that'd allow this, see <a href="https://tools.ietf.org/html/draft-iyer-nfsv4-space-reservation-ops-02#page-5">https://tools.ietf.org/html/draft-iyer-nfsv4-space-reservation-ops-02#page-5</a>, but I see no mention of it in fs/nfs. All the reserve_space / xdr_reserve_space stuff seems to be related to space in protocol messages at a quick read.)</p> <p>Thin provisioned storage could vary a fair bit depending on the implementation. But the specific failure case I saw, prompting this thread, was on a volume using the stack:</p> <pre><code>xfs -&gt; lvm2 -&gt; multipath -&gt; ??? -&gt; SAN </code></pre> <p>(the HBA/iSCSI/whatever was not recorded by the looks, but IIRC it was iSCSI. I'm checking.)</p> <p>The SAN ran out of space. Due to use of thin provisioning, Linux <em>thought</em> there was plenty of space on the volume; LVM thought it had plenty of physical extents free and unallocated, XFS thought there was tons of free space, etc. The space exhaustion manifested as I/O errors on flushes of writeback buffers.</p> <p>The logs were like this:</p> <pre><code>kernel: sd 2:0:0:1: [sdd] Unhandled sense code
kernel: sd 2:0:0:1: [sdd]
kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 2:0:0:1: [sdd]
kernel: Sense Key : Data Protect [current]
kernel: sd 2:0:0:1: [sdd]
kernel: Add. Sense: Space allocation failed write protect
kernel: sd 2:0:0:1: [sdd] CDB:
kernel: Write(16): **HEX-DATA-CUT-OUT**
kernel: Buffer I/O error on device dm-0, logical block 3098338786
kernel: lost page write due to I/O error on dm-0
kernel: Buffer I/O error on device dm-0, logical block 3098338787
</code></pre> <p>The immediate cause was that Linux's multipath driver didn't seem to recognise the sense code as retryable, so it gave up and reported it to the next layer up (LVM). LVM and XFS both seem to think that the lower layer is responsible for retries, so they toss the write away, and tell any interested writers if they feel like it, per discussion upthread.</p> <p>In this case Pg did get the news and reported fsync() errors on checkpoints, but it only reported an error once per relfilenode. Once it ran out of failed relfilenodes to cause the checkpoint to ERROR, it &quot;completed&quot; a &quot;successful&quot; checkpoint and kept on running until the resulting corruption started to manifest itself and it segfaulted some time later. 
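</p> <p>(Aside, not from the original message: the ERROR-retry-&quot;succeed&quot; sequence described above, reduced to a bare syscall pattern as a minimal C sketch. On a healthy disk both fsync() calls return 0; on the kernels discussed in this thread, the first call can return EIO for a lost writeback and the retry can then return 0 even though the data never reached stable storage.)</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd = open(&quot;datafile&quot;, O_RDWR | O_CREAT, 0600);
    if (fd &lt; 0) { perror(&quot;open&quot;); return 1; }

    if (write(fd, &quot;x&quot;, 1) != 1)     /* buffered write: no error yet       */
        perror(&quot;write&quot;);

    if (fsync(fd) != 0)             /* checkpoint attempt #1              */
        perror(&quot;fsync&quot;);            /* may report EIO exactly once ...    */

    if (fsync(fd) == 0)             /* checkpoint attempt #2 (the retry)  */
        puts(&quot;fsync reports success, but the failed pages may be gone&quot;);

    close(fd);
    return 0;
}
</code></pre> <p>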
As we've now learned, there's no guarantee we'd even get the news about the I/O errors at all.</p> <p>WAL was on a separate volume that didn't run out of room immediately, so we didn't PANIC on WAL write failure and prevent the issue.</p> <p>In this case if Pg had PANIC'd (and been able to guarantee to get the news of write failures reliably), there'd have been no corruption and no data loss despite the underlying storage issue.</p> <p>If, prior to seeing this, you'd asked me &quot;will my PostgreSQL database be corrupted if my thin-provisioned volume runs out of space&quot; I'd have said &quot;Surely not. PostgreSQL won't be corrupted by running out of disk space, it orders writes carefully and forces flushes so that it will recover gracefully from write failures.&quot;</p> <p>Except not. I was very surprised.</p> <p>BTW, it also turns out that the <em>default</em> for multipath is to give up on errors anyway; see the queue_if_no_path option and no_path_retries options. (Hint: run PostgreSQL with no_path_retries=queue). That's a sane default if you use O_DIRECT|O_SYNC, and otherwise pretty much a data-eating setup.</p> <p>I regularly see rather a lot of multipath systems, iSCSI systems, SAN backed systems, etc. I think we need to be pretty clear that we expect them to retry indefinitely, and if they report an I/O error we cannot reliably handle it. We need to patch Pg to PANIC on any fsync() failure and document that Pg won't notice some storage failure modes that might otherwise be considered nonfatal or transient, so very specific storage configuration and testing is required. (Not that anyone will do it). Also warn against running on NFS even with &quot;hard,sync,nointr&quot;.</p> <p>It'd be interesting to have a tool that tested error handling, allowing people to do iSCSI plug-pull tests, that sort of thing. But as far as I can tell nobody ever tests their storage stack anyway, so I don't plan on writing something that'll never get used.</p> <blockquote> <blockquote> <p>I think we need to think about a more robust path in future. But it's certainly not &quot;stop the world&quot; territory.</p> </blockquote> <p>I think you're underestimating the complexity of doing that by at least two orders of magnitude.</p> </blockquote> <p>Oh, it's just a minor total rewrite of half Pg, no big deal ;)</p> <p>I'm sure that no matter how big I think it is, I'm still underestimating it.</p> <p>The most workable option IMO would be some sort of fnotify/dnotify/whatever that reports all I/O errors on a volume. Some kind of error reporting handle we can keep open on a volume level that we can check for each volume/tablespace after we fsync() everything to see if it all really worked. 
If we PANIC if that gives us a bad answer, and PANIC on fsync errors, we guard against the great majority of these sorts of should-be-transient-if-the-kernel-didn't-give-up-and-throw-away-our-data errors.</p> <p>Even then, good luck getting those events from an NFS volume in which the backing volume experiences an issue.</p> <p>And it's kind of moot because AFAICS no such interface exists.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-09 08:45:40 </code></pre> <p>On 8 April 2018 at 22:47, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 8 April 2018 at 04:27, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 8 April 2018 at 10:16, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com></p> </blockquote> </blockquote> <p>The question is, what should the kernel and application do in cases where this is simply not possible (according to freebsd that keeps dirty pages around after failure, for example, -EIO from the block layer is a contract for unrecoverable errors so it is pointless to keep them dirty). You'd need a specialized interface to clear-out the errors (and drop the dirty pages), or potentially just remount the filesystem.</p> </blockquote> <p>Well firstly that's not necessarily the question. ENOSPC is not an unrecoverable error. And even unrecoverable errors for a single write doesn't mean the write will never be able to succeed in the future. But secondly doesn't such an interface already exist? When the device is dropped any dirty pages already get dropped with it. What's the point in dropping them but keeping the failing device?</p> <p>But just to underline the point. &quot;pointless to keep them dirty&quot; is exactly backwards from the application's point of view. If the error writing to persistent media really is unrecoverable then it's all the more critical that the pages be kept so the data can be copied to some other device. The last thing user space expects to happen is if the data can't be written to persistent storage then also immediately delete it from RAM. (And the <em>really</em> last thing user space expects is for this to happen and return no error.)</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 10:50:41 </code></pre> <p>On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:</p> <blockquote> <p>On 8 April 2018 at 22:47, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 8 April 2018 at 04:27, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 8 April 2018 at 10:16, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com></p> </blockquote> </blockquote> <p>The question is, what should the kernel and application do in cases where this is simply not possible (according to freebsd that keeps dirty pages around after failure, for example, -EIO from the block layer is a contract for unrecoverable errors so it is pointless to keep them dirty). You'd need a specialized interface to clear-out the errors (and drop the dirty pages), or potentially just remount the filesystem.</p> </blockquote> <p>Well firstly that's not necessarily the question. ENOSPC is not an unrecoverable error. 
And even unrecoverable errors for a single write doesn't mean the write will never be able to succeed in the future.</p> </blockquote> <p>To make things a bit simpler, let us focus on EIO for the moment. The contract between the block layer and the filesystem layer is assumed to be that of, when an EIO is propagated up to the fs, then you may assume that all possibilities for recovering have been exhausted in lower layers of the stack. Mind you, I am not claiming that this contract is either documented or necessarily respected (in fact there have been studies on the error propagation and handling of the block layer, see [1]). Let us assume that this is the design contract though (which appears to be the case across a number of open-source kernels), and if not - it's a bug. In this case, indeed the specific write()s will never be able to succeed in the future, at least not as long as the BIOs are allocated to the specific failing LBAs.</p> <blockquote> <p>But secondly doesn't such an interface already exist? When the device is dropped any dirty pages already get dropped with it. What's the point in dropping them but keeping the failing device?</p> </blockquote> <p>I think there are degrees of failure. There are certainly cases where one may encounter localized unrecoverable medium errors (specific to certain LBAs) that are non-maskable from the block layer and below. That does not mean that the device is dropped at all, so it does make sense to continue all other operations to all other regions of the device that are functional. In cases of total device failure, then the filesystem will prevent you from proceeding anyway.</p> <blockquote> <p>But just to underline the point. &quot;pointless to keep them dirty&quot; is exactly backwards from the application's point of view. If the error writing to persistent media really is unrecoverable then it's all the more critical that the pages be kept so the data can be copied to some other device. The last thing user space expects to happen is if the data can't be written to persistent storage then also immediately delete it from RAM. (And the <em>really</em> last thing user space expects is for this to happen and return no error.)</p> </blockquote> <p>Right. This implies though that apart from the kernel having to keep around the dirtied-but-unrecoverable pages for an unbounded time, that there's further an interface for obtaining the exact failed pages so that you can read them back. This in turn means that there needs to be an association between the fsync() caller and the specific dirtied pages that the caller intents to drain (for which we'd need an fsync_range(), among other things). BTW, currently the failed writebacks are not dropped from memory, but rather marked clean. They could be lost though due to memory pressure or due to explicit request (e.g. proc drop_caches), unless mlocked.</p> <p>There is a clear responsibility of the application to keep its buffers around until a successful fsync(). The kernels do report the error (albeit with all the complexities of dealing with the interface), at which point the application may not assume that the write()s where ever even buffered in the kernel page cache in the first place.</p> <p>What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and idemnifying the application from any further responsibility, i.e. 
a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.</p> <p>[1] <a href="https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf">https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf</a></p> <hr> <pre><code>From:Geoff Winkless &lt;pgsqladmin(at)geoff(dot)dj&gt; Date:2018-04-09 12:03:28 </code></pre> <p>On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and idemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.</p> </blockquote> <p>That seems like a perfectly reasonable position to take, frankly.</p> <p>The whole <em>point</em> of an Operating System should be that you can do exactly that. As a developer I should be able to call write() and fsync() and know that if both calls have succeeded then the result is on disk, no matter what another application has done in the meantime. If that's a &quot;difficult&quot; problem then that's the OS's problem, not mine. If the OS doesn't do that, it's _not_doing_its<em>job</em>.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-09 12:16:38 </code></pre> <p>On 9 April 2018 at 18:50, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>There is a clear responsibility of the application to keep its buffers around until a successful fsync(). The kernels do report the error (albeit with all the complexities of dealing with the interface), at which point the application may not assume that the write()s where ever even buffered in the kernel page cache in the first place.</p> <p>What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and idemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.</p> </blockquote> <p>That's what Pg appears to assume now, yes.</p> <p>Whether that's reasonable is a whole different topic.</p> <p>I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. &quot;Tell us about anything that happens under pgdata/&quot; or an inotify-style per-directory-registration option. I'd even say that's ideal.</p> <p>In the mean time, I propose that we fsync() on close() before we age FDs out of the LRU on backends. Yes, that will hurt throughput and cause stalls, but we don't seem to have many better options. At least it'll only flush what we actually wrote to the OS buffers not what we may have in shared_buffers. 
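</p> <p>(Editorial aside, not in the original message: a hypothetical sketch of the close-time behaviour proposed here. The helper name is made up and this is not PostgreSQL's actual fd.c code; the point is only that the flush happens before the descriptor is evicted, and that a flush failure is treated as fatal rather than being lost along with the fd.)</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;unistd.h&gt;

/* Hypothetical helper: flush before an fd ages out of the descriptor cache. */
void evict_cached_fd(int fd)
{
    if (fsync(fd) != 0)
    {
        perror(&quot;fsync before close&quot;);
        abort();                /* i.e. PANIC and rely on WAL crash recovery */
    }
    close(fd);
}

int main(void)
{
    int fd = open(&quot;datafile&quot;, O_RDWR | O_CREAT, 0600);
    if (fd &lt; 0) { perror(&quot;open&quot;); return 1; }
    if (write(fd, &quot;x&quot;, 1) != 1)
        perror(&quot;write&quot;);
    evict_cached_fd(fd);        /* instead of a bare close(fd) */
    return 0;
}
</code></pre> <p>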
If the bgwriter does the same thing, we should be 100% safe from this problem on 4.13+, and it'd be trivial to make it a GUC much like the fsync or full_page_writes options that people can turn off if they know the risks / know their storage is safe / don't care.</p> <p>Some keen person who wants to later could optimise it by adding a fsync worker thread pool in backends, so we don't block the main thread. Frankly that might be a nice thing to have in the checkpointer anyway. But it's out of scope for fixing this in durability terms.</p> <p>I'm partway through a patch that makes fsync panic on errors now. Once that's done, the next step will be to force fsync on close() in md and see how we go with that.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 12:31:27 </code></pre> <p>On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:</p> <blockquote> <p>On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and idemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.</p> </blockquote> <p>That seems like a perfectly reasonable position to take, frankly.</p> </blockquote> <p>Indeed, as long as you are willing to ignore the consequences of this design decision: mainly, how you would recover memory when no application is interested in clearing the error. At which point other applications with different priorities will find this position rather unreasonable since there can be no way out of it for them. Good luck convincing any OS kernel upstream to go with this design.</p> <blockquote> <p>The whole <em>point</em> of an Operating System should be that you can do exactly that. As a developer I should be able to call write() and fsync() and know that if both calls have succeeded then the result is on disk, no matter what another application has done in the meantime. If that's a &quot;difficult&quot; problem then that's the OS's problem, not mine. If the OS doesn't do that, it's _not_doing_its<em>job</em>.</p> </blockquote> <p>No OS kernel that I know of provides any promises for atomicity of a write()+fsync() sequence, unless one is using O_SYNC. It doesn't provide you with isolation either, as this is delegated to userspace, where processes that share a file should coordinate accordingly.</p> <p>It's not a difficult problem, but rather the kernels provide a common denominator of possible interfaces and designs that could accommodate a wider range of potential application scenarios for which the kernel cannot possibly anticipate requirements. There have been plenty of experimental works for providing a transactional (ACID) filesystem interface to applications. On the opposite end, there have been quite a few commercial databases that completely bypass the kernel storage stack. 
But I would assume it is reasonable to figure out something between those two extremes that can work in a &quot;portable&quot; fashion.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 12:54:16 </code></pre> <p>On Mon, Apr 09, 2018 at 08:16:38PM +0800, Craig Ringer wrote:</p> <blockquote> <p>I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. &quot;Tell us about anything that happens under pgdata/&quot; or an inotify-style per-directory-registration option. I'd even say that's ideal.</p> </blockquote> <p>I see what you are saying. So basically you'd always maintain the notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to). The kernel wouldn't even have to maintain per-page bits to trace the errors, since they will be consumed by the process that reads the events (or discarded, when the notification fd is closed).</p> <p>Assuming this would be possible, wouldn't Pg still need to deal with synchronizing writers and related issues (since this would be merely a notification mechanism - not prevent any process from continuing), which I understand would be rather intrusive for the current Pg multi-process design.</p> <p>But other than that, similarly this interface could in principle be similarly implemented in the BSDs via kqueue(), I suppose, to provide what you need.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 13:33:18 </code></pre> <p>On 04/09/2018 02:31 PM, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:</p> <blockquote> <p>On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and idemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.</p> </blockquote> <p>That seems like a perfectly reasonable position to take, frankly.</p> </blockquote> <p>Indeed, as long as you are willing to ignore the consequences of this design decision: mainly, how you would recover memory when no application is interested in clearing the error. At which point other applications with different priorities will find this position rather unreasonable since there can be no way out of it for them.</p> </blockquote> <p>Sure, but the question is whether the system can reasonably operate after some of the writes failed and the data got lost. Because if it can't, then recovering the memory is rather useless. It might be better to stop the system in that case, forcing the system administrator to resolve the issue somehow (fail-over to a replica, perform recovery from the last checkpoint, ...).</p> <p>We already have dirty_bytes and dirty_background_bytes, for example. I don't see why there couldn't be another limit defining how much dirty data to allow before blocking writes altogether. 
I'm sure it's not that simple, but you get the general idea - do not allow using all available memory because of writeback issues, but don't throw the data away in case it's just a temporary issue.</p> <blockquote> <p>Good luck convincing any OS kernel upstream to go with this design.</p> </blockquote> <p>Well, there seem to be kernels that seem to do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.</p> <p>The question is whether the current design makes it any easier for user-space developers to build reliable systems. We have tried using it, and unfortunately the answers seems to be &quot;no&quot; and &quot;Use direct I/O and manage everything on your own!&quot;</p> <blockquote> <blockquote> <p>The whole <em>point</em> of an Operating System should be that you can do exactly that. As a developer I should be able to call write() and fsync() and know that if both calls have succeeded then the result is on disk, no matter what another application has done in the meantime. If that's a &quot;difficult&quot; problem then that's the OS's problem, not mine. If the OS doesn't do that, it's _not_doing_its<em>job</em>.</p> </blockquote> <p>No OS kernel that I know of provides any promises for atomicity of a write()+fsync() sequence, unless one is using O_SYNC. It doesn't provide you with isolation either, as this is delegated to userspace, where processes that share a file should coordinate accordingly.</p> </blockquote> <p>We can (and do) take care of the atomicity and isolation. Implementation of those parts is obviously very application-specific, and we have WAL and locks for that purpose. I/O on the other hand seems to be a generic service provided by the OS - at least that's how we saw it until now.</p> <blockquote> <p>It's not a difficult problem, but rather the kernels provide a common denominator of possible interfaces and designs that could accommodate a wider range of potential application scenarios for which the kernel cannot possibly anticipate requirements. There have been plenty of experimental works for providing a transactional (ACID) filesystem interface to applications. On the opposite end, there have been quite a few commercial databases that completely bypass the kernel storage stack. But I would assume it is reasonable to figure out something between those two extremes that can work in a &quot;portable&quot; fashion.</p> </blockquote> <p>Users ask us about this quite often, actually. The question is usually about &quot;RAW devices&quot; and performance, but ultimately it boils down to buffered vs. direct I/O. So far our answer was we rely on kernel to do this reliably, because they know how to do that correctly and we simply don't have the manpower to implement it (portable, reliable, handling different types of storage, ...).</p> <p>One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 13:42:35 </code></pre> <p>On 04/09/2018 12:29 AM, Bruce Momjian wrote:</p> <blockquote> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> </blockquote> <p>That doesn't seem like a very practical way. 
It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.</p> <hr> <pre><code>From:Abhijit Menon-Sen &lt;ams(at)2ndQuadrant(dot)com&gt; Date:2018-04-09 13:47:03 </code></pre> <p>At 2018-04-09 15:42:35 +0200, tomas(dot)vondra(at)2ndquadrant(dot)com wrote:</p> <blockquote> <p>On 04/09/2018 12:29 AM, Bruce Momjian wrote:</p> <blockquote> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> </blockquote> <p>That doesn't seem like a very practical way.</p> </blockquote> <p>Not least because Craig's tests showed that you can't rely on <em>always</em> getting an error message in the logs.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 13:54:19 </code></pre> <p>On 04/09/2018 04:00 AM, Craig Ringer wrote:</p> <blockquote> <p>On 9 April 2018 at 07:16, Andres Freund &lt;andres(at)anarazel(dot)de</p> <blockquote> <p>I think the danger presented here is far smaller than some of the statements in this thread might make one think.</p> </blockquote> <p>Clearly it's not happening a huge amount or we'd have a lot of noise about Pg eating people's data, people shouting about how unreliable it is, etc. We don't. So it's not some earth shattering imminent threat to everyone's data. It's gone unnoticed, or the root cause unidentified, for a long time.</p> </blockquote> <p>Yeah, it clearly isn't the case that everything we do suddenly got pointless. It's fairly annoying, though.</p> <blockquote> <p>I suspect we've written off a fair few issues in the past as &quot;it'd bad hardware&quot; when actually, the hardware fault was the trigger for a Pg/kernel interaction bug. And blamed containers for things that weren't really the container's fault. But even so, if it were happening tons, we'd hear more noise.</p> </blockquote> <p>Right. Write errors are fairly rare, and we've probably ignored a fair number of cases demonstrating this issue. It kinda reminds me the wisdom that not seeing planes with bullet holes in the engine does not mean engines don't need armor [1].</p> <p>[1] <a href="https://medium.com/@penguinpress/an-excerpt-from-how-not-to-be-wrong-by-jordan-ellenberg-664e708cfc3d">https://medium.com/@penguinpress/an-excerpt-from-how-not-to-be-wrong-by-jordan-ellenberg-664e708cfc3d</a></p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 14:22:06 </code></pre> <p>On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>We already have dirty_bytes and dirty_background_bytes, for example. I don't see why there couldn't be another limit defining how much dirty data to allow before blocking writes altogether. I'm sure it's not that simple, but you get the general idea - do not allow using all available memory because of writeback issues, but don't throw the data away in case it's just a temporary issue.</p> </blockquote> <p>Sure, there could be knobs for limiting how much memory such &quot;zombie&quot; pages may occupy. 
Not sure how helpful it would be in the long run since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning this accordingly in a system-wide manner. This has the potential to leave other applications running in the same system with very little memory, in cases where for example original application crashes and never clears the error. Apart from that, further interfaces would need to be provided for actually dealing with the error (again assuming non-transient issues that may not be fixed transparently and that temporary issues are taken care of by lower layers of the stack).</p> <blockquote> <p>Well, there seem to be kernels that seem to do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.</p> </blockquote> <p>It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of relevant pages. It does provide though indeed a persistent error indication that would allow Pg to simply reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes succeed to be persisted.</p> <p>Ideally, you'd want a (potentially persistent) indication of error localized to a file region (mapping the corresponding failed writeback pages). NetBSD is already implementing fsync_ranges(), which could be a step in the right direction.</p> <blockquote> <p>One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.</p> </blockquote> <p>I would expect it would be very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO). I think that durability is a rather complex cross-layer issue which has been grossly misunderstood similarly in the past (e.g. see [1]). It seems that both the OS and DB communities greatly benefit from a periodic reality check, and I see this as an opportunity for strengthening the IO stack in an end-to-end manner.</p> <p>[1] <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf">https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf</a></p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-09 15:29:36 </code></pre> <p>On 9 April 2018 at 15:22, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:</p> <p>Sure, there could be knobs for limiting how much memory such &quot;zombie&quot; pages may occupy. Not sure how helpful it would be in the long run since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning this accordingly in a system-wide manner.</p> </blockquote> <p>Surely this is exactly what the kernel is there to manage. 
It has to control how much memory is allowed to be full of dirty buffers in the first place to ensure that the system won't get memory starved if it can't clean them fast enough. That isn't even about persistent hardware errors. Even when the hardware is working perfectly it can only flush buffers so fast. The whole point of the kernel is to abstract away shared resources. It's not like user space has any better view of the situation here. If Postgres implemented all this in DIRECT_IO it would have exactly the same problem only with less visibility into what the rest of the system is doing. If every application implemented its own buffer cache we would be back in the same boat only with a fragmented memory allocation.</p> <blockquote> <p>This has the potential to leave other applications running in the same system with very little memory, in cases where for example original application crashes and never clears the error.</p> </blockquote> <p>I still think we're speaking two different languages. There's no application anywhere that's going to &quot;clear the error&quot;. The application has done the writes and if it's calling fsync it wants to wait until the filesystem can arrange for the write to be persisted. If the application could manage without the persistence then it wouldn't have called fsync.</p> <p>The only way to &quot;clear out&quot; the error would be by having the writes succeed. There's no reason to think that wouldn't be possible sometime. The filesystem could remap blocks or an administrator could replace degraded raid device components. The only thing Postgres could do to recover would be create a new file and move the data (reading from the dirty buffer in memory!) to a new file anyways so we would &quot;clear the error&quot; by just no longer calling fsync on the old file.</p> <p>We always read fsync as a simple write barrier. That's what the documentation promised and it's what Postgres always expected. It sounds like the kernel implementors looked at it as some kind of communication channel to communicate status report for specific writes back to user-space. That's a much more complex problem and would have entirely different interface. I think this is why we're having so much difficulty communicating.</p> <blockquote> <p>It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of relevant pages. It does provide though indeed a persistent error indication that would allow Pg to simply reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes succeed to be persisted.</p> </blockquote> <p>Well if they're writing to the same file that had a previous error I doubt there are many applications that would be happy to consider their writes &quot;persisted&quot; when the file was corrupt. 
Ironically the earlier discussion quoted talked about how applications that wanted more granular communication would be using O_DIRECT -- but what we have is fsync trying to be <em>too</em> granular such that it's impossible to get any strong guarantees about anything with it.</p> <blockquote> <blockquote> <p>One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.</p> </blockquote> <p>I would expect it would be very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO).</p> </blockquote> <p>Honestly I don't think there's <em>any</em> way to use the current interface to implement reliable operation. Even that embedded database using a single process and keeping every file open all the time (which means file descriptor limits limit its scalability) can be having silent corruption whenever some other process like a backup program comes along and calls fsync (or even sync?).</p> <hr> <pre><code>From:Robert Haas &lt;robertmhaas(at)gmail(dot)com&gt; Date:2018-04-09 16:45:00 </code></pre> <p>On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>In the mean time, I propose that we fsync() on close() before we age FDs out of the LRU on backends. Yes, that will hurt throughput and cause stalls, but we don't seem to have many better options. At least it'll only flush what we actually wrote to the OS buffers not what we may have in shared_buffers. If the bgwriter does the same thing, we should be 100% safe from this problem on 4.13+, and it'd be trivial to make it a GUC much like the fsync or full_page_writes options that people can turn off if they know the risks / know their storage is safe / don't care.</p> </blockquote> <p>Ouch. If a process exits -- say, because the user typed \q into psql -- then you're talking about potentially calling fsync() on a really large number of file descriptor flushing many gigabytes of data to disk. And it may well be that you never actually wrote any data to any of those file descriptors -- those writes could have come from other backends. Or you may have written a little bit of data through those FDs, but there could be lots of other data that you end up flushing incidentally. Perfectly innocuous things like starting up a backend, running a few short queries, and then having that backend exit suddenly turn into something that could have a massive system-wide performance impact.</p> <p>Also, if a backend ever manages to exit without running through this code, or writes any dirty blocks afterward, then this still fails to fix the problem completely. I guess that's probably avoidable -- we can put this late in the shutdown sequence and PANIC if it fails.</p> <p>I have a really tough time believing this is the right way to solve the problem. We suffered for years because of ext3's desire to flush the entire page cache whenever any single file was fsync()'d, which was terrible. Eventually ext4 became the norm, and the problem went away. Now we're going to deliberately insert logic to do a very similar kind of terrible thing because the kernel developers have decided that fsync() doesn't have to do what it says on the tin? 
I grant that there doesn't seem to be a better option, but I bet we're going to have a lot of really unhappy users if we do this.</p> <hr> <pre><code>From:&quot;Joshua D(dot) Drake&quot; &lt;jd(at)commandprompt(dot)com&gt; Date:2018-04-09 17:26:24 </code></pre> <p>On 04/09/2018 09:45 AM, Robert Haas wrote:</p> <blockquote> <p>On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>In the mean time, I propose that we fsync() on close() before we age FDs out of the LRU on backends. Yes, that will hurt throughput and cause stalls, but we don't seem to have many better options. At least it'll only flush what we actually wrote to the OS buffers not what we may have in shared_buffers. If the bgwriter does the same thing, we should be 100% safe from this problem on 4.13+, and it'd be trivial to make it a GUC much like the fsync or full_page_writes options that people can turn off if they know the risks / know their storage is safe / don't care.</p> </blockquote> <p>I have a really tough time believing this is the right way to solve the problem. We suffered for years because of ext3's desire to flush the entire page cache whenever any single file was fsync()'d, which was terrible. Eventually ext4 became the norm, and the problem went away. Now we're going to deliberately insert logic to do a very similar kind of terrible thing because the kernel developers have decided that fsync() doesn't have to do what it says on the tin? I grant that there doesn't seem to be a better option, but I bet we're going to have a lot of really unhappy users if we do this.</p> </blockquote> <p>I don't have a better option but whatever we do, it should be an optional (GUC) change. We have plenty of YEARS of people not noticing this issue and Robert's correct, if we go back to an era of things like stalls it is going to look bad on us no matter how we describe the problem.</p> <hr> <pre><code>From:Gasper Zejn &lt;zejn(at)owca(dot)info&gt; Date:2018-04-09 18:02:21 </code></pre> <p>On 09. 04. 2018 15:42, Tomas Vondra wrote:</p> <blockquote> <p>On 04/09/2018 12:29 AM, Bruce Momjian wrote:</p> <blockquote> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> </blockquote> <p>That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.</p> <p>regards</p> </blockquote> <p>For a bit less (or more) crazy idea, I'd imagine creating a Linux kernel module with kprobe/kretprobe capturing the file passed to fsync or even byte range within file and corresponding return value shouldn't be that hard. 
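</p> <p>(Illustrative aside, not part of the original email: a hypothetical outline of the kretprobe idea in C, which only logs fsync() return values rather than the file or byte range. The probed symbol name differs across kernel versions and architectures, so treat this as a sketch rather than a drop-in module.)</p> <pre><code>#include &lt;linux/module.h&gt;
#include &lt;linux/kprobes.h&gt;
#include &lt;linux/ptrace.h&gt;
#include &lt;linux/sched.h&gt;

/* Log any non-zero fsync() return value, with the calling pid. */
static int fsync_ret(struct kretprobe_instance *ri, struct pt_regs *regs)
{
    long rc = regs_return_value(regs);
    if (rc != 0)
        pr_warn(&quot;fsync returned %ld for pid %d\n&quot;, rc, current-&gt;pid);
    return 0;
}

static struct kretprobe krp = {
    .kp.symbol_name = &quot;ksys_fsync&quot;,   /* name varies by kernel version */
    .handler        = fsync_ret,
};

static int __init fsync_watch_init(void) { return register_kretprobe(&amp;krp); }
static void __exit fsync_watch_exit(void) { unregister_kretprobe(&amp;krp); }

module_init(fsync_watch_init);
module_exit(fsync_watch_exit);
MODULE_LICENSE(&quot;GPL&quot;);
</code></pre> <p>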
Kprobe has been a part of Linux kernel for a really long time, and from first glance it seems like it could be backported to 2.6 too.</p> <p>Then you could have stable log messages or implement some kind of &quot;fsync error log notification&quot; via whatever is the most sane way to get this out of kernel.</p> <p>If the kernel is new enough and has eBPF support (seems like &gt;=4.4), using bcc-tools[1] should enable you to write a quick script to get exactly that info via perf events[2].</p> <p>Obviously, that's a stopgap solution ...</p> <p>[1] <a href="https://github.com/iovisor/bcc">https://github.com/iovisor/bcc</a> [2] <a href="https://blog.yadutaf.fr/2016/03/30/turn-any-syscall-into-event-introducing-ebpf-kernel-probes/">https://blog.yadutaf.fr/2016/03/30/turn-any-syscall-into-event-introducing-ebpf-kernel-probes/</a></p> <hr> <pre><code>From:Mark Dilger &lt;hornschnorter(at)gmail(dot)com&gt; Date:2018-04-09 18:29:42 </code></pre> <blockquote> <p>On Apr 9, 2018, at 10:26 AM, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:</p> <p>We have plenty of YEARS of people not noticing this issue</p> </blockquote> <p>I disagree. I have noticed this problem, but blamed it on other things. For over five years now, I have had to tell customers not to use thin provisioning, and I have had to add code to postgres to refuse to perform inserts or updates if the disk volume is more than 80% full. I have lost count of the number of customers who are running an older version of the product (because they refuse to upgrade) and come back with complaints that they ran out of disk and now their database is corrupt. All this time, I have been blaming this on virtualization and thin provisioning.</p> <hr> <pre><code>From:Robert Haas &lt;robertmhaas(at)gmail(dot)com&gt; Date:2018-04-09 19:02:11 </code></pre> <p>On Mon, Apr 9, 2018 at 12:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:</p> <blockquote> <p>Ouch. If a process exits -- say, because the user typed \q into psql -- then you're talking about potentially calling fsync() on a really large number of file descriptor flushing many gigabytes of data to disk. And it may well be that you never actually wrote any data to any of those file descriptors -- those writes could have come from other backends. Or you may have written a little bit of data through those FDs, but there could be lots of other data that you end up flushing incidentally. Perfectly innocuous things like starting up a backend, running a few short queries, and then having that backend exit suddenly turn into something that could have a massive system-wide performance impact.</p> <p>Also, if a backend ever manages to exit without running through this code, or writes any dirty blocks afterward, then this still fails to fix the problem completely. I guess that's probably avoidable -- we can put this late in the shutdown sequence and PANIC if it fails.</p> <p>I have a really tough time believing this is the right way to solve the problem. We suffered for years because of ext3's desire to flush the entire page cache whenever any single file was fsync()'d, which was terrible. Eventually ext4 became the norm, and the problem went away. Now we're going to deliberately insert logic to do a very similar kind of terrible thing because the kernel developers have decided that fsync() doesn't have to do what it says on the tin? 
I grant that there doesn't seem to be a better option, but I bet we're going to have a lot of really unhappy users if we do this.</p> </blockquote> <p>What about the bug we fixed in <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ce439f3379aed857517c8ce207485655000fc8e">https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ce439f3379aed857517c8ce207485655000fc8e</a> ? Say somebody does something along the lines of:</p> <pre><code>ps uxww | grep postgres | grep -v grep | awk '{print $2}' | xargs kill -9 </code></pre> <p>...and then restarts postgres. Craig's proposal wouldn't cover this case, because there was no opportunity to run fsync() after the first crash, and there's now no way to go back and fsync() any stuff we didn't fsync() before, because the kernel may have already thrown away the error state, or may lie to us and tell us everything is fine (because our new fd wasn't opened early enough). I can't find the original discussion that led to that commit right now, so I'm not exactly sure what scenarios we were thinking about. But I think it would at least be a problem if full_page_writes=off or if you had previously started the server with fsync=off and now wish to switch to fsync=on after completing a bulk load or similar. Recovery can read a page, see that it looks OK, and continue, and then a later fsync() failure can revert that page to an earlier state and now your database is corrupted -- and there's absolutely no way to detect this because write() gives you the new page contents later, fsync() doesn't feel obliged to tell you about the error because your fd wasn't opened early enough, and eventually the write can be discarded and you'll revert back to the old page version with no errors ever being reported anywhere.</p> <p>Another consequence of this behavior is that initdb -S is never reliable, so pg_rewind's use of it doesn't actually fix the problem it was intended to solve. It also means that initdb itself isn't crash-safe, since the data file changes are made by the backend but initdb itself is doing the fsyncs, and initdb has no way of knowing what files the backend is going to create and therefore can't -- even theoretically -- open them first.</p> <p>What's being presented to us as the API contract that we should expect from buffered I/O is that if you open a file and read() from it, call fsync(), and get no error, the kernel may nevertheless decide that some previous write that it never managed to flush can't be flushed, and then revert the page to the contents it had at some point in the past. That's more or less equivalent to letting a malicious adversary randomly overwrite database pages with plausible-looking but incorrect contents without notice and hoping you can still build a reliable system. You can avoid the problem if you can always open an fd for every file you want to modify before it's written and hold on to it until after it's fsync'd, but that's pretty hard to guarantee in the face of kill -9.</p> <p>I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. 
Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 19:13:14 </code></pre> <p>Hi,</p> <p>On 2018-04-09 15:02:11 -0400, Robert Haas wrote:</p> <blockquote> <p>I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.</p> </blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 19:22:58 </code></pre> <p>On 04/09/2018 08:29 PM, Mark Dilger wrote:</p> <blockquote> <blockquote> <p>On Apr 9, 2018, at 10:26 AM, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote: We have plenty of YEARS of people not noticing this issue</p> </blockquote> <p>I disagree. I have noticed this problem, but blamed it on other things. For over five years now, I have had to tell customers not to use thin provisioning, and I have had to add code to postgres to refuse to perform inserts or updates if the disk volume is more than 80% full. I have lost count of the number of customers who are running an older version of the product (because they refuse to upgrade) and come back with complaints that they ran out of disk and now their database is corrupt. All this time, I have been blaming this on virtualization and thin provisioning.</p> </blockquote> <p>Yeah. There's a big difference between not noticing an issue because it does not happen very often vs. attributing it to something else. 
If we had the ability to revisit past data corruption cases, we would probably discover a fair number of cases caused by this.</p> <p>The other thing we probably need to acknowledge is that the environment changes significantly - things like thin provisioning are likely to get even more common, increasing the incidence of these issues.</p> <hr> <pre><code>From:Peter Geoghegan &lt;pg(at)bowt(dot)ie&gt; Date:2018-04-09 19:25:33 </code></pre> <p>On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> </blockquote> <p>+1</p> <blockquote> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> </blockquote> <p>Right. We seem to be implicitly assuming that there is a big difference between a problem in the storage layer that we could in principle detect, but don't, and any other problem in the storage layer. I've read articles claiming that technologies like SMART are not really reliable in a practical sense [1], so it seems to me that there is reason to doubt that this gap is all that big.</p> <p>That said, I suspect that the problems with running out of disk space are serious practical problems. I have personally scoffed at stories involving Postgres databases corruption that gets attributed to running out of disk space. Looks like I was dead wrong.</p> <p>[1] <a href="file-consistency/">file-consistency/</a> -- &quot;Filesystem correctness&quot;</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 19:26:21 </code></pre> <p>On Mon, Apr 09, 2018 at 04:29:36PM +0100, Greg Stark wrote:</p> <blockquote> <p>Honestly I don't think there's <em>any</em> way to use the current interface to implement reliable operation. Even that embedded database using a single process and keeping every file open all the time (which means file descriptor limits limit its scalability) can be having silent corruption whenever some other process like a backup program comes along and calls fsync (or even sync?).</p> </blockquote> <p>That is indeed true (sync would induce fsync on open inodes and clear the error), and that's a nasty bug that apparently went unnoticed for a very long time. Hopefully the errseq_t linux 4.13 fixes deal with at least this issue, but similar fixes need to be adopted by many other kernels (all those that mark failed pages as clean).</p> <p>I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().</p> <p>What about having buffered IO with implied fsync() atomicity via O_SYNC? 
This would probably necessitate some helper threads that mask the latency and present an async interface to the rest of PG, but sounds less intrusive than going for DIO.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 19:29:16 </code></pre> <p>On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>What about having buffered IO with implied fsync() atomicity via O_SYNC?</p> </blockquote> <p>You're kidding, right? We could also just add sleep(30)'s all over the tree, and hope that that'll solve the problem. There's a reason we don't permanently fsync everything. Namely that it'll be way too slow.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 19:37:03 </code></pre> <p>On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().</p> </blockquote> <p>Why is that required? You could very well just keep per inode information about fatal failures that occurred around. Report errors until that bit is explicitly cleared. Yes, that keeps some memory around until unmount if nobody clears it. But it's orders of magnitude less, and results in usable semantics.</p> <hr> <pre><code>From:Justin Pryzby &lt;pryzby(at)telsasoft(dot)com&gt; Date:2018-04-09 19:41:19 </code></pre> <p>On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:</p> <blockquote> <p>You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok?</p> </blockquote> <p>I was going to say that it'd be okay to clear error flag on umount, since any opened files would prevent unmounting; but, then I realized we need to consider the case of close()ing all FDs then opening them later..in another process.</p> <p>I was going to say that's fine for postgres, since it chdir()s into its basedir, but actually not fine for nondefault tablespaces..</p> <p>On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to).</p> </blockquote> <p>For postgres that'd require backend processes to open() an file such that, following its close(), any writeback errors are &quot;signalled&quot; to the checkpointer process...</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 19:44:31 </code></pre> <p>On Mon, Apr 09, 2018 at 12:29:16PM -0700, Andres Freund wrote:</p> <blockquote> <p>On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>What about having buffered IO with implied fsync() atomicity via O_SYNC?</p> </blockquote> <p>You're kidding, right? We could also just add sleep(30)'s all over the tree, and hope that that'll solve the problem. There's a reason we don't permanently fsync everything. Namely that it'll be way too slow.</p> </blockquote> <p>I am assuming you can apply the same principle of selectively using O_SYNC at times and places that you'd currently actually call fsync().</p> <p>Also assuming that you'd want to have a backwards-compatible solution for all those kernels that don't keep the pages around, irrespective of future fixes. 
Short of loading a kernel module and dealing with the problem directly, the only other available options seem to be either O_SYNC, O_DIRECT or ignoring the issue.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 19:47:44 </code></pre> <p>On 04/09/2018 04:22 PM, Anthony Iliopoulos wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>We already have dirty_bytes and dirty_background_bytes, for example. I don't see why there couldn't be another limit defining how much dirty data to allow before blocking writes altogether. I'm sure it's not that simple, but you get the general idea - do not allow using all available memory because of writeback issues, but don't throw the data away in case it's just a temporary issue.</p> </blockquote> <p>Sure, there could be knobs for limiting how much memory such &quot;zombie&quot; pages may occupy. Not sure how helpful it would be in the long run since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning this accordingly in a system-wide manner. This has the potential to leave other applications running in the same system with very little memory, in cases where for example original application crashes and never clears the error. Apart from that, further interfaces would need to be provided for actually dealing with the error (again assuming non-transient issues that may not be fixed transparently and that temporary issues are taken care of by lower layers of the stack).</p> </blockquote> <p>I don't quite see how this is any different from other possible issues when running multiple applications on the same system. One application can generate a lot of dirty data, reaching dirty_bytes and forcing the other applications on the same host to do synchronous writes.</p> <p>Of course, you might argue that is a temporary condition - it will resolve itself once the dirty pages get written to storage. In case of an I/O issue, it is a permanent impact - it will not resolve itself unless the I/O problem gets fixed.</p> <p>Not sure what interfaces would need to be written? Possibly something that says &quot;drop dirty pages for these files&quot; after the application gets killed or something. That makes sense, of course.</p> <blockquote> <blockquote> <p>Well, there seem to be kernels that seem to do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.</p> </blockquote> <p>It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of relevant pages. It does provide though indeed a persistent error indication that would allow Pg to simply reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes succeed to be persisted.</p> </blockquote> <p>In my experience when you have a persistent I/O error on a device, it likely affects all applications using that device. 
So unmounting the fs to clear the dirty pages seems like an acceptable solution to me.</p> <p>I don't see what else the application should do? In a way I'm suggesting applications don't really want to be responsible for recovering (cleanup or dirty pages etc.). We're more than happy to hand that over to kernel, e.g. because each kernel will do that differently. What we however do want is reliable information about fsync outcome, which we need to properly manage WAL, checkpoints etc.</p> <blockquote> <p>Ideally, you'd want a (potentially persistent) indication of error localized to a file region (mapping the corresponding failed writeback pages). NetBSD is already implementing fsync_ranges(), which could be a step in the right direction.</p> <blockquote> <p>One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.</p> </blockquote> <p>I would expect it would be very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO). I think that durability is a rather complex cross-layer issue which has been grossly misunderstood similarly in the past (e.g. see [1]). It seems that both the OS and DB communities greatly benefit from a periodic reality check, and I see this as an opportunity for strengthening the IO stack in an end-to-end manner.</p> </blockquote> <p>Right. What I was getting to is that perhaps the current fsync() behavior is not very practical for building actual applications.</p> <blockquote> <p>Best regards, Anthony</p> <p>[1] <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf">https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf</a></p> </blockquote> <p>Thanks. The paper looks interesting.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-09 19:51:12 </code></pre> <p>On Mon, Apr 09, 2018 at 12:37:03PM -0700, Andres Freund wrote:</p> <blockquote> <p>On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().</p> </blockquote> <p>Why is that required? You could very well just keep per inode information about fatal failures that occurred around. Report errors until that bit is explicitly cleared. Yes, that keeps some memory around until unmount if nobody clears it. 
But it's orders of magnitude less, and results in usable semantics.</p> </blockquote> <p>As discussed before, I think this could be acceptable, especially if you pair it with an opt-in mechanism (only applications that care to deal with this will have to), and would give it a shot.</p> <p>Still need a way to deal with all other systems and prior kernel releases that are eating fsync() writeback errors even over sync().</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 19:54:05 </code></pre> <p>On 04/09/2018 09:37 PM, Andres Freund wrote:</p> <blockquote> <p>On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().</p> </blockquote> <p>Why is that required? You could very well just keep per inode information about fatal failures that occurred around. Report errors until that bit is explicitly cleared. Yes, that keeps some memory around until unmount if nobody clears it. But it's orders of magnitude less, and results in usable semantics.</p> </blockquote> <p>Isn't the expectation that when a fsync call fails, the next one will retry writing the pages in the hope that it succeeds?</p> <p>Of course, it's also possible to do what you suggested, and simply mark the inode as failed. In which case the next fsync can't possibly retry the writes (e.g. after freeing some space on thin-provisioned system), but we'd get reliable failure mode.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 19:59:34 </code></pre> <p>On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:</p> <blockquote> <p>You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok?</p> </blockquote> <p>I was going to say that it'd be okay to clear error flag on umount, since any opened files would prevent unmounting; but, then I realized we need to consider the case of close()ing all FDs then opening them later..in another process.</p> <p>On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to).</p> </blockquote> <p>For postgres that'd require backend processes to open() an file such that, following its close(), any writeback errors are &quot;signalled&quot; to the checkpointer process...</p> </blockquote> <p>I don't think that's as hard as some people argued in this thread. We could very well open a pipe in postmaster with the write end open in each subprocess, and the read end open only in checkpointer (and postmaster, but unused there). Whenever closing a file descriptor that was dirtied in the current process, send it over the pipe to the checkpointer. The checkpointer then can receive all those file descriptors (making sure it's not above the limit, fsync(), close() ing to make room if necessary). The biggest complication would presumably be to deduplicate the received filedescriptors for the same file, without loosing track of any errors.</p> <p>Even better, we could do so via a dedicated worker. 
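</p> <p>(Mechanically, handing an open file descriptor from one process to another is typically done on Linux with a UNIX-domain socketpair and an SCM_RIGHTS control message rather than an ordinary pipe. The sketch below illustrates the sending side of the hand-off idea; it is not PostgreSQL code, and the function name is made up.)</p> <pre><code>#include &lt;string.h&gt;
#include &lt;sys/socket.h&gt;
#include &lt;sys/uio.h&gt;

/*
 * Sketch of the fd hand-off idea described above (not actual PostgreSQL
 * code): a backend passes a file descriptor it has dirtied to the
 * checkpointer over a UNIX-domain socketpair using SCM_RIGHTS.  The
 * checkpointer recvmsg()s it, fsync()s it, and close()s it.
 */
int send_dirty_fd(int sock, int fd_to_pass)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cmsg;
    char payload = 'F';                     /* one dummy data byte is required */
    char cmsgbuf[CMSG_SPACE(sizeof(int))];

    memset(&amp;msg, 0, sizeof(msg));
    memset(cmsgbuf, 0, sizeof(cmsgbuf));

    iov.iov_base = &amp;payload;
    iov.iov_len = 1;
    msg.msg_iov = &amp;iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsgbuf;
    msg.msg_controllen = sizeof(cmsgbuf);

    cmsg = CMSG_FIRSTHDR(&amp;msg);
    cmsg-&gt;cmsg_level = SOL_SOCKET;
    cmsg-&gt;cmsg_type = SCM_RIGHTS;
    cmsg-&gt;cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &amp;fd_to_pass, sizeof(int));

    return sendmsg(sock, &amp;msg, 0);          /* -1 on error */
}
</code></pre> <p>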
That'd quite possibly end up as a performance benefit.</p> <blockquote> <p>I was going to say that's fine for postgres, since it chdir()s into its basedir, but actually not fine for nondefault tablespaces..</p> </blockquote> <p>I think it'd be fair to open PG_VERSION of all created tablespaces. Would require some hangups to signal checkpointer (or whichever process) to do so when creating one, but it shouldn't be too hard. Some people would complain because they can't do some nasty hacks anymore, but it'd also save peoples butts by preventing them from accidentally unmounting.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 20:04:20 </code></pre> <p>Hi,</p> <p>On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:</p> <blockquote> <p>Isn't the expectation that when a fsync call fails, the next one will retry writing the pages in the hope that it succeeds?</p> </blockquote> <p>Some people expect that, I personally don't think it's a useful expectation.</p> <p>We should just deal with this by crash-recovery. The big problem I see is that you always need to keep an file descriptor open for pretty much any file written to inside and outside of postgres, to be guaranteed to see errors. And that'd solve that. Even if retrying would work, I'd advocate for that (I've done so in the past, and I've written code in pg that panics on fsync failure...).</p> <p>What we'd need to do however is to clear that bit during crash recovery... Which is interesting from a policy perspective. Could be that other apps wouldn't want that.</p> <p>I also wonder if we couldn't just somewhere read each relevant mounted filesystem's errseq value. Whenever checkpointer notices before finishing a checkpoint that it has changed, do a crash restart.</p> <hr> <pre><code>From:Mark Dilger &lt;hornschnorter(at)gmail(dot)com&gt; Date:2018-04-09 20:25:54 </code></pre> <blockquote> <p>On Apr 9, 2018, at 12:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <p>Hi,</p> <p>On 2018-04-09 15:02:11 -0400, Robert Haas wrote:</p> <blockquote> <p>I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.</p> </blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. 
But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> </blockquote> <p>I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted. That seems a much bigger problem than merely having the master become corrupted in some unrecoverable way. It is a long standing expectation that serious hardware problems on the master can result in the master needing to be replaced. But there has not been an expectation that the one or more standby servers would be taken down along with the master, leaving all copies of the database unusable. If this bug corrupts the standby servers, too, then it is a whole different class of problem than the one folks have come to expect.</p> <p>Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?</p> <p>Can anybody clarify this for non-core-hacker folks following along at home?</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 20:30:00 </code></pre> <p>On 04/09/2018 10:04 PM, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:</p> <blockquote> <p>Isn't the expectation that when a fsync call fails, the next one will retry writing the pages in the hope that it succeeds?</p> </blockquote> <p>Some people expect that, I personally don't think it's a useful expectation.</p> </blockquote> <p>Maybe. I'd certainly prefer automated recovery from an temporary I/O issues (like full disk on thin-provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.</p> <p>And most importantly, it's rather delusional to think the kernel developers are going to be enthusiastic about that approach ...</p> <blockquote> <p>We should just deal with this by crash-recovery. The big problem I see is that you always need to keep an file descriptor open for pretty much any file written to inside and outside of postgres, to be guaranteed to see errors. And that'd solve that. Even if retrying would work, I'd advocate for that (I've done so in the past, and I've written code in pg that panics on fsync failure...).</p> </blockquote> <p>Sure. And it's likely way less invasive from kernel perspective.</p> <blockquote> <p>What we'd need to do however is to clear that bit during crash recovery... Which is interesting from a policy perspective. Could be that other apps wouldn't want that.</p> </blockquote> <p>IMHO it'd be enough if a remount clears it.</p> <blockquote> <p>I also wonder if we couldn't just somewhere read each relevant mounted filesystem's errseq value. Whenever checkpointer notices before finishing a checkpoint that it has changed, do a crash restart.</p> </blockquote> <p>Hmmmm, that's an interesting idea, and it's about the only thing that would help us on older kernels. There's a wb_err in adress_space, but that's at inode level. Not sure if there's something at fs level.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 20:34:15 </code></pre> <p>Hi,</p> <p>On 2018-04-09 13:25:54 -0700, Mark Dilger wrote:</p> <blockquote> <p>I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted.</p> </blockquote> <p>I don't see that as a real problem here. 
For one the problematic scenarios shouldn't readily apply, for another WAL is checksummed.</p> <p>There's the problem that a new basebackup would potentially become corrupted however. And similarly pg_rewind.</p> <p>Note that I'm not saying that we and/or linux shouldn't change anything. Just that the apocalypse isn't here.</p> <blockquote> <p>Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?</p> </blockquote> <p>I think that's basically right. There's cases where corruption could get propagated, but they're not straightforward.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 20:37:31 </code></pre> <p>Hi,</p> <p>On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:</p> <blockquote> <p>Maybe. I'd certainly prefer automated recovery from an temporary I/O issues (like full disk on thin-provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.</p> </blockquote> <p>Oh, I agree on that one. But that's more a question of how we force the kernel's hand on allocating disk space. In most cases the kernel allocates the disk space immediately, even if delayed allocation is in effect. For the cases where that's not the case (if there are current ones, rather than just past bugs), we should be able to make sure that's not an issue by pre-zeroing the data and/or using fallocate.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 20:43:03 </code></pre> <p>On 04/09/2018 10:25 PM, Mark Dilger wrote:</p> <blockquote> <blockquote> <p>On Apr 9, 2018, at 12:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <p>Hi,</p> <p>On 2018-04-09 15:02:11 -0400, Robert Haas wrote:</p> <blockquote> <p>I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.</p> </blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> </blockquote> <p>I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted. 
That seems a much bigger problem than merely having the master become corrupted in some unrecoverable way. It is a long standing expectation that serious hardware problems on the master can result in the master needing to be replaced. But there has not been an expectation that the one or more standby servers would be taken down along with the master, leaving all copies of the database unusable. If this bug corrupts the standby servers, too, then it is a whole different class of problem than the one folks have come to expect.</p> <p>Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?</p> <p>Can anybody clarify this for non-core-hacker folks following along at home?</p> </blockquote> <p>That's a good question. I don't see any guarantee it'd be isolated to the master node. Consider this example:</p> <p>(0) checkpoint happens on the primary</p> <p>(1) a page gets modified, a full-page gets written to WAL</p> <p>(2) the page is written out to page cache</p> <p>(3) writeback of that page fails (and gets discarded)</p> <p>(4) we attempt to modify the page again, but we read the stale version</p> <p>(5) we modify the stale version, writing the change to WAL</p> <p>The standby will get the full-page, and then a WAL from the stale page version. That doesn't seem like a story with a happy end, I guess. But I might be easily missing some protection built into the WAL ...</p> <hr> <pre><code>From:Mark Dilger &lt;hornschnorter(at)gmail(dot)com&gt; Date:2018-04-09 20:55:29 </code></pre> <blockquote> <p>On Apr 9, 2018, at 1:43 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:</p> <p>On 04/09/2018 10:25 PM, Mark Dilger wrote:</p> <blockquote> <blockquote> <p>On Apr 9, 2018, at 12:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <p>Hi,</p> <p>On 2018-04-09 15:02:11 -0400, Robert Haas wrote:</p> <blockquote> <p>I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.</p> </blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. 
But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> </blockquote> <p>I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted. That seems a much bigger problem than merely having the master become corrupted in some unrecoverable way. It is a long standing expectation that serious hardware problems on the master can result in the master needing to be replaced. But there has not been an expectation that the one or more standby servers would be taken down along with the master, leaving all copies of the database unusable. If this bug corrupts the standby servers, too, then it is a whole different class of problem than the one folks have come to expect.</p> <p>Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?</p> <p>Can anybody clarify this for non-core-hacker folks following along at home?</p> </blockquote> <p>That's a good question. I don't see any guarantee it'd be isolated to the master node. Consider this example:</p> <p>(0) checkpoint happens on the primary</p> <p>(1) a page gets modified, a full-page gets written to WAL</p> <p>(2) the page is written out to page cache</p> <p>(3) writeback of that page fails (and gets discarded)</p> <p>(4) we attempt to modify the page again, but we read the stale version</p> <p>(5) we modify the stale version, writing the change to WAL</p> <p>The standby will get the full-page, and then a WAL from the stale page version. That doesn't seem like a story with a happy end, I guess. But I might be easily missing some protection built into the WAL ...</p> </blockquote> <p>I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption. When choosing to have one standby, or two standbys, or ten standbys, one needs to be able to assume a certain amount of statistical independence between failures on one server and failures on another. If they are tightly correlated dependent variables, then the conclusion that the probability of all nodes failing simultaneously is vanishingly small becomes invalid.</p> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-09 21:08:29 </code></pre> <p>Hi,</p> <p>On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:</p> <blockquote> <p>I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption.</p> </blockquote> <p>I think it's a grave mistake conflating ENOSPC issues (which we should solve by making sure there's always enough space pre-allocated), with EIO type errors. 
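</p> <p>(A sketch of what pre-allocating space could look like: reserve the blocks with posix_fallocate() when a file is extended, so that ENOSPC is reported synchronously at extension time instead of surfacing later as a writeback failure. The path and size are illustrative.)</p> <pre><code>#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

/*
 * Sketch of the pre-allocation approach for the ENOSPC case: reserve
 * the blocks when the file is extended, so a full volume is reported
 * here, synchronously, rather than as a later writeback failure.
 * Path and size are illustrative.
 */
int main(void)
{
    int fd = open("relation.segment", O_RDWR | O_CREAT, 0600);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* Returns an errno value directly; glibc falls back to writing
     * zeroes on filesystems without native fallocate support. */
    int rc = posix_fallocate(fd, 0, 1024L * 1024 * 1024);
    if (rc != 0) {
        errno = rc;
        perror("posix_fallocate");   /* e.g. ENOSPC on a full volume */
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}
</code></pre> <p>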
The problem is different, the solution is different.</p> <hr> <pre><code>From:Tomas Vondra &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt; Date:2018-04-09 21:25:52 </code></pre> <p>On 04/09/2018 11:08 PM, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:</p> <blockquote> <p>I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption.</p> </blockquote> <p>I think it's a grave mistake conflating ENOSPC issues (which we should solve by making sure there's always enough space pre-allocated), with EIO type errors. The problem is different, the solution is different.</p> </blockquote> <p>In any case, that certainly does not count as data corruption spreading from the master to standby.</p> <hr> <pre><code>From:Mark Dilger &lt;hornschnorter(at)gmail(dot)com&gt; Date:2018-04-09 21:33:29 </code></pre> <blockquote> <p>On Apr 9, 2018, at 2:25 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:</p> <p>On 04/09/2018 11:08 PM, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:</p> <blockquote> <p>I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption.</p> </blockquote> <p>I think it's a grave mistake conflating ENOSPC issues (which we should solve by making sure there's always enough space pre-allocated), with EIO type errors. The problem is different, the solution is different.</p> </blockquote> </blockquote> <p>I'm happy to take your word for that.</p> <blockquote> <p>In any case, that certainly does not count as data corruption spreading from the master to standby.</p> </blockquote> <p>Maybe not from the point of view of somebody looking at the code. But a user might see it differently. If the data being loaded into the master and getting replicated to the standby &quot;causes&quot; both to get corrupt, then it seems like corruption spreading. I put &quot;causes&quot; in quotes because there is some argument to be made about &quot;correlation does not prove cause&quot; and so forth, but it still feels like causation from an arms length perspective. If there is a pattern of standby servers tending to fail more often right around the time that the master fails, you'll have a hard time comforting users, &quot;hey, it's not technically causation.&quot; If loading data into the master causes the master to hit ENOSPC, and replicating that data to the standby causes the standby to hit ENOSPC, and if the bug abound ENOSPC has not been fixed, then this looks like corruption spreading.</p> <p>I'm certainly planning on taking a hard look at the disk allocation on my standby servers right soon now.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-09 22:33:16 </code></pre> <p>On Tue, Apr 10, 2018 at 2:22 AM, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>Well, there seem to be kernels that seem to do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. 
So it's not an entirely insane design, apparently.</p> </blockquote> <p>It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of relevant pages. It does provide though indeed a persistent error indication that would allow Pg to simply reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes succeed to be persisted.</p> </blockquote> <p>Right. For anyone interested, here is the change you mentioned, and an interesting one that came a bit earlier last year:</p> <ul> <li><a href="https://reviews.freebsd.org/rS316941">https://reviews.freebsd.org/rS316941</a> -- drop buffers after device goes away</li> <li><a href="https://reviews.freebsd.org/rS326029">https://reviews.freebsd.org/rS326029</a> -- update comment about EIO contract</li> </ul> <p>Retrying may well be futile, but at least future fsync() calls won't report success bogusly. There may of course be more space-efficient ways to represent that state as the comment implies, while never lying to the user -- perhaps involving filesystem level or (pinned) inode level errors that stop all writes until unmounted. Something tells me they won't resort to flakey fsync() error reporting.</p> <p>I wonder if anyone can tell us what Windows, AIX and HPUX do here.</p> <blockquote> <p>[1] <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf">https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf</a></p> </blockquote> <p>Very interesting, thanks.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-10 00:32:20 </code></pre> <p>On Tue, Apr 10, 2018 at 10:33 AM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:</p> <blockquote> <p>I wonder if anyone can tell us what Windows, AIX and HPUX do here.</p> </blockquote> <p>I created a wiki page to track what we know (or think we know) about fsync() on various operating systems:</p> <p><a href="https://wiki.postgresql.org/wiki/Fsync_Errors">https://wiki.postgresql.org/wiki/Fsync_Errors</a></p> <p>If anyone has more information or sees mistakes, please go ahead and edit it.</p> <hr> <pre><code>From:Andreas Karlsson &lt;andreas(at)proxel(dot)se&gt; Date:2018-04-10 00:41:10 </code></pre> <p>On 04/09/2018 02:16 PM, Craig Ringer wrote:</p> <blockquote> <p>I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. &quot;Tell us about anything that happens under pgdata/&quot; or an inotify-style per-directory-registration option. 
I'd even say that's ideal.</p> </blockquote> <p>Could there be a risk of a race condition here where fsync incorrectly returns success before we get the notification of that something went wrong?</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-10 01:44:59 </code></pre> <p>On 10 April 2018 at 03:59, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:</p> <blockquote> <p>You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok?</p> </blockquote> <p>I was going to say that it'd be okay to clear error flag on umount, since any opened files would prevent unmounting; but, then I realized we need to consider the case of close()ing all FDs then opening them later..in another process.</p> <p>On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:</p> <blockquote> <p>notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to).</p> </blockquote> <p>For postgres that'd require backend processes to open() an file such that, following its close(), any writeback errors are &quot;signalled&quot; to the checkpointer process...</p> </blockquote> <p>I don't think that's as hard as some people argued in this thread. We could very well open a pipe in postmaster with the write end open in each subprocess, and the read end open only in checkpointer (and postmaster, but unused there). Whenever closing a file descriptor that was dirtied in the current process, send it over the pipe to the checkpointer. The checkpointer then can receive all those file descriptors (making sure it's not above the limit, fsync(), close() ing to make room if necessary). The biggest complication would presumably be to deduplicate the received filedescriptors for the same file, without loosing track of any errors.</p> </blockquote> <p>Yep. That'd be a cheaper way to do it, though it wouldn't work on Windows. Though we don't know how Windows behaves here at all yet.</p> <p>Prior discussion upthread had the checkpointer open()ing a file at the same time as a backend, before the backend writes to it. But passing the fd when the backend is done with it would be better.</p> <p>We'd need a way to dup() the fd and pass it back to a backend when it needed to reopen it sometimes, or just make sure to keep the oldest copy of the fd when a backend reopens multiple times, but that's no biggie.</p> <p>We'd still have to fsync() out early in the checkpointer if we ran out of space in our FD list, and initscripts would need to change our ulimit or we'd have to do it ourselves in the checkpointer. But neither seems insurmountable.</p> <p>FWIW, I agree that this is a corner case, but it's getting to be a pretty big corner with the spread of overcommitted, dedupliating SANs, cloud storage, etc. Not all I/O errors indicate permanent hardware faults, disk failures, etc, as I outlined earlier. I'm very curious to know what AWS EBS's error semantics are, and other cloud network block stores. 
(I posted on Amazon forums <a href="https://forums.aws.amazon.com/thread.jspa?threadID=279274&amp;tstart=0">https://forums.aws.amazon.com/thread.jspa?threadID=279274&amp;tstart=0</a> but nothing so far).</p> <p>I'm also not particularly inclined to trust that all file systems will always reliably reserve space without having some cases where they'll fail writeback on space exhaustion.</p> <p>So we don't need to panic and freak out, but it's worth looking at the direction the storage world is moving in, and whether this will become a bigger issue over time.</p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-10 01:52:21 </code></pre> <p>On Tue, Apr 10, 2018 at 1:44 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 10 April 2018 at 03:59, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>I don't think that's as hard as some people argued in this thread. We could very well open a pipe in postmaster with the write end open in each subprocess, and the read end open only in checkpointer (and postmaster, but unused there). Whenever closing a file descriptor that was dirtied in the current process, send it over the pipe to the checkpointer. The checkpointer then can receive all those file descriptors (making sure it's not above the limit, fsync(), close() ing to make room if necessary). The biggest complication would presumably be to deduplicate the received filedescriptors for the same file, without loosing track of any errors.</p> </blockquote> <p>Yep. That'd be a cheaper way to do it, though it wouldn't work on Windows. Though we don't know how Windows behaves here at all yet.</p> <p>Prior discussion upthread had the checkpointer open()ing a file at the same time as a backend, before the backend writes to it. But passing the fd when the backend is done with it would be better.</p> </blockquote> <p>How would that interlock with concurrent checkpoints?</p> <p>I can see how to make that work if the share-fd-or-fsync-now logic happens in smgrwrite() when called by FlushBuffer() while you hold io_in_progress, but not if you defer it to some random time later.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-10 01:54:30 </code></pre> <p>On 10 April 2018 at 04:25, Mark Dilger <hornschnorter(at)gmail(dot)com> wrote:</p> <blockquote> <p>I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted.</p> </blockquote> <p>Yes, it can, but not directly through the first error.</p> <p>What can happen is that we think a block got written when it didn't.</p> <p>If our in memory state diverges from our on disk state, we can make subsequent WAL writes based on that wrong information. But that's actually OK, since the standby will have replayed the original WAL correctly.</p> <p>I think the only time we'd run into trouble is if we evict the good (but not written out) data from s_b and the fs buffer cache, then later read in the old version of a block we failed to overwrite. Data checksums (if enabled) might catch it unless the write left the whole block stale. In that case we might generate a full page write with the stale block and propagate that over WAL to the standby.</p> <p>So I'd say standbys are relatively safe - very safe if the issue is caught promptly, and less so over time. 
But AFAICS WAL-based replication (physical or logical) is not a perfect defense for this.</p> <p>However, remember, if your storage system is free of any sort of overprovisioning, is on a non-network file system, and doesn't use multipath (or sets it up right) this issue <em>is exceptionally unlikely to affect you</em>.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-10 01:59:03 </code></pre> <p>On 10 April 2018 at 04:37, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:</p> <blockquote> <p>Maybe. I'd certainly prefer automated recovery from an temporary I/O issues (like full disk on thin-provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.</p> </blockquote> <p>Oh, I agree on that one. But that's more a question of how we force the kernel's hand on allocating disk space. In most cases the kernel allocates the disk space immediately, even if delayed allocation is in effect. For the cases where that's not the case (if there are current ones, rather than just past bugs), we should be able to make sure that's not an issue by pre-zeroing the data and/or using fallocate.</p> </blockquote> <p>Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.</p> <p>EXT4 and XFS don't allocate until later, it by performing actual writes to FS metadata, initializing disk blocks, etc. So we won't notice errors that are only detectable at actual time of allocation, like thin provisioning problems, until after write() returns and we face the same writeback issues.</p> <p>So I reckon you're safe from space-related issues if you're not on NFS (and whyyy would you do that?) and not thinly provisioned. I'm sure there are other corner cases, but I don't see any reason to expect space-exhaustion-related corruption problems on a sensible FS backed by a sensible block device. I haven't tested things like quotas, verified how reliable space reservation is under concurrency, etc as yet.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-10 02:00:59 </code></pre> <p>On April 9, 2018 6:59:03 PM PDT, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 10 April 2018 at 04:37, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:</p> <blockquote> <p>Maybe. I'd certainly prefer automated recovery from an temporary I/O issues (like full disk on thin-provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.</p> </blockquote> <p>Oh, I agree on that one. But that's more a question of how we force the kernel's hand on allocating disk space. In most cases the kernel allocates the disk space immediately, even if delayed allocation is in effect. For the cases where that's not the case (if there are current ones, rather than just past bugs), we should be able to make sure that's not an issue by pre-zeroing the data and/or using fallocate.</p> </blockquote> <p>Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.</p> <p>EXT4 and XFS don't allocate until later, it by performing actual writes to FS metadata, initializing disk blocks, etc. 
So we won't notice errors that are only detectable at actual time of allocation, like thin provisioning problems, until after write() returns and we face the same writeback issues.</p> <p>So I reckon you're safe from space-related issues if you're not on NFS (and whyyy would you do that?) and not thinly provisioned. I'm sure there are other corner cases, but I don't see any reason to expect space-exhaustion-related corruption problems on a sensible FS backed by a sensible block device. I haven't tested things like quotas, verified how reliable space reservation is under concurrency, etc as yet.</p> </blockquote> <p>How's that not solved by pre zeroing and/or fallocate as I suggested above?</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-10 02:02:48 </code></pre> <p>On 10 April 2018 at 08:41, Andreas Karlsson <andreas(at)proxel(dot)se> wrote:</p> <blockquote> <p>On 04/09/2018 02:16 PM, Craig Ringer wrote:</p> <blockquote> <p>I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. &quot;Tell us about anything that happens under pgdata/&quot; or an inotify-style per-directory-registration option. I'd even say that's ideal.</p> </blockquote> <p>Could there be a risk of a race condition here where fsync incorrectly returns success before we get the notification of that something went wrong?</p> </blockquote> <p>We'd examine the notification queue only once all our checkpoint fsync()s had succeeded, and before we updated the control file to advance the redo position.</p> <p>I'm intrigued by the suggestion upthread of using a kprobe or similar to achieve this. It's a horrifying unportable hack that'd make kernel people cry, and I don't know if we have any way to flush buffered probe data to be sure we really get the news in time, but it's a cool idea too.</p> <hr> <pre><code>From:Michael Paquier &lt;michael(at)paquier(dot)xyz&gt; Date:2018-04-10 05:04:13 </code></pre> <p>On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:</p> <blockquote> <p>Another consequence of this behavior that initdb -S is never reliable, so pg_rewind's use of it doesn't actually fix the problem it was intended to solve. It also means that initdb itself isn't crash-safe, since the data file changes are made by the backend but initdb itself is doing the fsyncs, and initdb has no way of knowing what files the backend is going to create and therefore can't -- even theoretically -- open them first.</p> </blockquote> <p>And pg_basebackup. And pg_dump. And pg_dumpall. Anything using initdb -S or fsync_pgdata would enter in those waters.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-10 05:37:19 </code></pre> <p>On 10 April 2018 at 13:04, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:</p> <blockquote> <p>Another consequence of this behavior that initdb -S is never reliable, so pg_rewind's use of it doesn't actually fix the problem it was intended to solve. It also means that initdb itself isn't crash-safe, since the data file changes are made by the backend but initdb itself is doing the fsyncs, and initdb has no way of knowing what files the backend is going to create and therefore can't -- even theoretically -- open them first.</p> </blockquote> <p>And pg_basebackup. And pg_dump. And pg_dumpall. 
Anything using initdb -S or fsync_pgdata would enter in those waters.</p> </blockquote> <p>... but <em>only if they hit an I/O error</em> or they're on a FS that doesn't reserve space and hit ENOSPC.</p> <p>It still does 99% of the job. It still flushes all buffers to persistent storage and maintains write ordering. It may not detect and report failures to the user how we'd expect it to, yes, and that's not great. But it's hardly throw up our hands and give up territory either. Also, at least for initdb, we can make initdb fsync() its own files before close(). Annoying but hardly the end of the world.</p> <hr> <pre><code>From:Michael Paquier &lt;michael(at)paquier(dot)xyz&gt; Date:2018-04-10 06:10:21 </code></pre> <p>On Tue, Apr 10, 2018 at 01:37:19PM +0800, Craig Ringer wrote:</p> <blockquote> <p>On 10 April 2018 at 13:04, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>And pg_basebackup. And pg_dump. And pg_dumpall. Anything using initdb -S or fsync_pgdata would enter in those waters.</p> </blockquote> <p>... but <em>only if they hit an I/O error</em> or they're on a FS that doesn't reserve space and hit ENOSPC.</p> </blockquote> <p>Sure.</p> <blockquote> <p>It still does 99% of the job. It still flushes all buffers to persistent storage and maintains write ordering. It may not detect and report failures to the user how we'd expect it to, yes, and that's not great. But it's hardly throw up our hands and give up territory either. Also, at least for initdb, we can make initdb fsync() its own files before close(). Annoying but hardly the end of the world.</p> </blockquote> <p>Well, I think that there is place for improving reporting of failure in file_utils.c for frontends, or at worst have an exit() for any kind of critical failures equivalent to a PANIC.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-10 12:15:15 </code></pre> <p>On 10 April 2018 at 14:10, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>Well, I think that there is place for improving reporting of failure in file_utils.c for frontends, or at worst have an exit() for any kind of critical failures equivalent to a PANIC.</p> </blockquote> <p>Yup.</p> <p>In the mean time, speaking of PANIC, here's the first cut patch to make Pg panic on fsync() failures. I need to do some closer review and testing, but it's presented here for anyone interested.</p> <p>I intentionally left some failures as ERROR not PANIC, where the entire operation is done as a unit, and an ERROR will cause us to retry the whole thing.</p> <p>For example, when we fsync() a temp file before we move it into place, there's no point panicing on failure, because we'll discard the temp file on ERROR and retry the whole thing.</p> <p>I've verified that it works as expected with some modifications to the test tool I've been using (pushed).</p> <p>The main downside is that if we panic in redo, we don't try again. We throw our toys and shut down. But arguably if we get the same I/O error again in redo, that's the right thing to do anyway, and quite likely safer than continuing to ERROR on checkpoints indefinitely.</p> <p>Patch attached.</p> <p>To be clear, this patch only deals with the issue of us retrying fsyncs when it turns out to be unsafe. 
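</p> <p>(Schematically, the change is from &quot;ERROR and retry the fsync() at the next checkpoint&quot; to &quot;treat a failed fsync() as fatal and go through crash recovery&quot;, since a retried fsync() can falsely report success once the kernel has dropped the dirty pages. The sketch below only illustrates the idea and is not the patch itself.)</p> <pre><code>#include &lt;errno.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

/*
 * Sketch of the policy change, not the actual patch: if fsync() fails,
 * don't ERROR and retry at the next checkpoint, because a retried
 * fsync() may report success even though the kernel already dropped
 * the dirty pages.  Treat the failure as fatal and rely on WAL replay
 * after the restart instead.
 */
static void fsync_or_panic(int fd, const char *path)
{
    if (fsync(fd) != 0) {
        /* roughly corresponds to ereport(PANIC, ...) in PostgreSQL */
        fprintf(stderr, "PANIC: could not fsync file \"%s\": %s\n",
                path, strerror(errno));
        abort();
    }
}
</code></pre> <p>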
This does NOT address any of the issues where we won't find out about writeback errors at all.</p> <p>AttachmentContent-TypeSize v1-0001-PANIC-when-we-detect-a-possible-fsync-I-O-error-i.patchtext/x-patch10.3 KB</p> <hr> <pre><code>From:Robert Haas &lt;robertmhaas(at)gmail(dot)com&gt; Date:2018-04-10 15:15:46 </code></pre> <p>On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> </blockquote> <p>Well, I admit that I wasn't entirely serious about that email, but I wasn't entirely not-serious either. If you can't find reliably find out whether the contents of the file on disk are the same as the contents that the kernel is giving you when you call read(), then you are going to have a heck of a time building a reliable system. If the kernel developers are determined to insist on these semantics (and, admittedly, I don't know whether that's the case - I've only read Anthony's remarks), then I don't really see what we can do except give up on buffered I/O (or on Linux).</p> <blockquote> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> </blockquote> <p>I think that reliable error reporting is more than &quot;nice&quot; -- I think it's essential. The only argument for the current Linux behavior that has been so far advanced on this thread, at least as far as I can see, is that if it kept retrying the buffers forever, it would be pointless and might run the machine out of memory, so we might as well discard them. But previous comments have already illustrated that the kernel is not really up against a wall there -- it could put individual inodes into a permanent failure state when it discards their dirty data, as you suggested, or it could do what others have suggested, and what I think is better, which is to put the whole filesystem into a permanent failure state that can be cleared by remounting the FS. That could be done on an as-needed basis -- if the number of dirty buffers you're holding onto for some filesystem becomes too large, put the filesystem into infinite-fail mode and discard them all. That behavior would be pretty easy for administrators to understand and would resolve the entire problem here provided that no PostgreSQL processes survived the eventual remount.</p> <p>I also don't really know what we mean by an &quot;unresolvable&quot; error. If the drive is beyond all hope, then it doesn't really make sense to talk about whether the database stored on it is corrupt. In general we can't be sure that we'll even get an error - e.g. the system could be idle and the drive could be on fire. Maybe this is the case you meant by &quot;it'd be nice if we could report it reliably&quot;. But at least in my experience, that's typically not what's going on. You get some I/O errors and so you remount the filesystem, or reboot, or rebuild the array, or ... something. And then the errors go away and, at that point, you want to run recovery and continue using your database. 
In this scenario, it matters <em>quite a bit</em> what the error reporting was like during the period when failures were occurring. In particular, if the database was allowed to think that it had successfully checkpointed when it didn't, you're going to start recovery from the wrong place.</p> <p>I'm going to shut up now because I'm telling you things that you obviously already know, but this doesn't sound like a &quot;near irresolvable corner case&quot;. When the storage goes bonkers, either PostgreSQL and the kernel can interact in such a way that a checkpoint can succeed without all of the relevant data getting persisted, or they don't. It sounds like right now they do, and I'm not really clear that we have a reasonable idea how to fix that. It does not sound like a PANIC is sufficient.</p> <hr> <pre><code>From:Robert Haas &lt;robertmhaas(at)gmail(dot)com&gt; Date:2018-04-10 15:28:07 </code></pre> <p>On Tue, Apr 10, 2018 at 1:37 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>... but <em>only if they hit an I/O error</em> or they're on a FS that doesn't reserve space and hit ENOSPC.</p> <p>It still does 99% of the job. It still flushes all buffers to persistent storage and maintains write ordering. It may not detect and report failures to the user how we'd expect it to, yes, and that's not great. But it's hardly throw up our hands and give up territory either. Also, at least for initdb, we can make initdb fsync() its own files before close(). Annoying but hardly the end of the world.</p> </blockquote> <p>I think we'd need every child postgres process started by initdb to do that individually, which I suspect would slow down initdb quite a lot. Now admittedly for anybody other than a PostgreSQL developer that's only a minor issue, and our regression tests mostly run with fsync=off anyway. But I have a strong suspicion that our assumptions about how fsync() reports errors are baked into an awful lot of parts of the system, and by the time we get unbaking them I think it's going to be really surprising if we haven't done real harm to overall system performance.</p> <p>BTW, I took a look at the MariaDB source code to see whether they've got this problem too and it sure looks like they do. os_file_fsync_posix() retries the fsync in a loop with an 0.2 second sleep after each retry. It warns after 100 failures and fails an assertion after 1000 failures. It is hard to understand why they would have written the code this way unless they expect errors reported by fsync() to continue being reported until the underlying condition is corrected. But, it looks like they wouldn't have the problem that we do with trying to reopen files to fsync() them later -- I spot checked a few places where this code is invoked and in all of those it looks like the file is already expected to be open.</p> <hr> <pre><code>From:Anthony Iliopoulos &lt;ailiop(at)altatus(dot)com&gt; Date:2018-04-10 15:40:05 </code></pre> <p>Hi Robert,</p> <p>On Tue, Apr 10, 2018 at 11:15:46AM -0400, Robert Haas wrote:</p> <blockquote> <p>On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> </blockquote> <p>Well, I admit that I wasn't entirely serious about that email, but I wasn't entirely not-serious either. 
If you can't find reliably find out whether the contents of the file on disk are the same as the contents that the kernel is giving you when you call read(), then you are going to have a heck of a time building a reliable system. If the kernel developers are determined to insist on these semantics (and, admittedly, I don't know whether that's the case - I've only read Anthony's remarks), then I don't really see what we can do except give up on buffered I/O (or on Linux).</p> </blockquote> <p>I think it would be interesting to get in touch with some of the respective linux kernel maintainers and open up this topic for more detailed discussions. LSF/MM'18 is upcoming and it would have been the perfect opportunity but it's past the CFP deadline. It may still worth contacting the organizers to bring forward the issue, and see if there is a chance to have someone from Pg invited for further discussions.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-10 16:38:27 </code></pre> <p>On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:</p> <blockquote> <p>On 8 April 2018 at 22:47, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> </blockquote> <p>To make things a bit simpler, let us focus on EIO for the moment. The contract between the block layer and the filesystem layer is assumed to be that of, when an EIO is propagated up to the fs, then you may assume that all possibilities for recovering have been exhausted in lower layers of the stack.</p> </blockquote> <p>Well Postgres is using the filesystem. The interface between the block layer and the filesystem may indeed need to be more complex, I wouldn't know.</p> <p>But I don't think &quot;all possibilities&quot; is a very useful concept. Neither layer here is going to be perfect. They can only promise that all possibilities that have actually been implemented have been exhausted. And even among those only to the degree they can be done automatically within the engineering tradeoffs and constraints. There will always be cases like thin provisioned devices that an operator can expand, or degraded raid arrays that can be repaired after a long operation and so on. A network device can't be sure whether a remote server may eventually come back or not and have to be reconfigured by a human or system automation tool to point to the new server or new network configuration.</p> <blockquote> <p>Right. This implies though that apart from the kernel having to keep around the dirtied-but-unrecoverable pages for an unbounded time, that there's further an interface for obtaining the exact failed pages so that you can read them back.</p> </blockquote> <p>No, the interface we have is fsync which gives us that information with the granularity of a single file. The database could in theory recognize that fsync is not completing on a file and read that file back and write it to a new file. More likely we would implement a feature Oracle has of writing key files to multiple devices. But currently in practice that's not what would happen, what would happen would be a human would recognize that the database has stopped being able to commit and there are hardware errors in the log and would stop the database, take a backup, and restore onto a new working device. 
The current interface is that there's one error and then Postgres would pretty much have to say, &quot;sorry, your database is corrupt and the data is gone, restore from your backups&quot;. Which is pretty dismal.</p> <blockquote> <p>There is a clear responsibility of the application to keep its buffers around until a successful fsync(). The kernels do report the error (albeit with all the complexities of dealing with the interface), at which point the application may not assume that the write()s where ever even buffered in the kernel page cache in the first place.</p> </blockquote> <p>Postgres cannot just store the entire database in RAM. It writes things to the filesystem all the time. It calls fsync only when it needs a write barrier to ensure consistency. That's only frequent on the transaction log to ensure it's flushed before data modifications and then periodically to checkpoint the data files. The amount of data written between checkpoints can be arbitrarily large and Postgres has no idea how much memory is available as filesystem buffers or how much i/o bandwidth is available or other memory pressure there is. What you're suggesting is that the application should have to babysit the filesystem buffer cache and reimplement all of it in user-space because the filesystem is free to throw away any data any time it chooses?</p> <p>The current interface to throw away filesystem buffer cache is unmount. It sounds like the kernel would like a more granular way to discard just part of a device which makes a lot of sense in the age of large network block devices. But I don't think just saying that the filesystem buffer cache is now something every application needs to re-implement in user-space really helps with that, they're going to have the same problems to solve.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-10 16:54:40 </code></pre> <p>On 10 April 2018 at 02:59, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.</p> </blockquote> <p>I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it'spreallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).</p> <hr> <pre><code>From:&quot;Joshua D(dot) Drake&quot; &lt;jd(at)commandprompt(dot)com&gt; Date:2018-04-10 18:58:37 </code></pre> <p>-hackers,</p> <p>I reached out to the Linux ext4 devs, here is tytso(at)mit(dot)edu response:</p> <p>&quot;&quot;&quot; Hi Joshua,</p> <p>This isn't actually an ext4 issue, but a long-standing VFS/MM issue.</p> <p>There are going to be multiple opinions about what the right thing to do. I'll try to give as unbiased a description as possible, but certainly some of this is going to be filtered by my own biases no matter how careful I can be.</p> <p>First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. 
The LBA could potentially be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the time since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.</p> <p>Which is why after a while, one can get quite paranoid and assume that the only way you can guarantee data robustness is to store multiple copies and/or use erasure encoding, with some of the copies or shards written to geographically diverse data centers.</p> <p>Secondly, I think it's fair to say that the vast majority of the companies who require data robustness, and are either willing to pay $$$ to an enterprise distro company like Red Hat, or command a large enough paying customer base that they can afford to dictate terms to an enterprise distro, or hire a consultant such as Christoph, or have their own staffed Linux kernel teams, have tended to use O_DIRECT. So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.</p> <p>Next, the reason why fsync() has the behaviour that it does is that one of the most common cases of I/O storage errors in buffered use cases, certainly as seen by the community distros, is the user who pulls out a USB stick while it is in use. In that case, if there are dirtied pages in the page cache, the question is what can you do? Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak. You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off. And if you don't clear the dirty bit on an I/O error, then they can never be cleaned. You can't even re-insert the USB stick; the re-inserted USB stick will get a new block device. Worse, when the USB stick was pulled, it will have suffered a power drop, and see above about what could happen after a power drop for non-power fail certified flash devices --- it goes double for the cheap sh*t USB sticks found in the checkout aisle of Micro Center.</p> <p>So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is not eagerness to solve the problem simply by &quot;don't clear the dirty bit&quot;. For every one Postgres installation that might have a better recovery after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.</p> <p>I can think of things that could be done --- for example, it could be switchable on a per-block device basis (or maybe a per-mount basis) whether or not the dirty bit gets cleared after the error is reported to userspace. And perhaps there could be a new unmount flag that causes all dirty pages to be wiped out, which could be used to recover after a permanent loss of the block device. But the question is who is going to invest the time to make these changes? If there is a company who is willing to pay to commission this work, it's almost certainly soluble. Or if a company which has a kernel team on staff is willing to direct an engineer to work on it, it certainly could be solved. 
But again, of the companies who have client code where we care about robustness and proper handling of failed disk drives, and which have a kernel team on staff, pretty much all of the ones I can think of (e.g., Oracle, Google, etc.) use O_DIRECT and they don't try to make buffered writes and error reporting via fsync(2) work well.</p> <p>In general these companies want low-level control over buffer cache eviction algorithms, which drives them towards the design decision of effectively implementing the page cache in userspace, and using O_DIRECT reads/writes.</p> <p>If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work. Let me know off-line if that's the case...</p> <pre><code>- Ted </code></pre> <p>&quot;&quot;&quot;</p> <hr> <pre><code>From:&quot;Joshua D(dot) Drake&quot; &lt;jd(at)commandprompt(dot)com&gt; Date:2018-04-10 19:51:01 </code></pre> <p>-hackers,</p> <p>The thread is picking up over on the ext4 list. They don't update their archives as often as we do, so I can't link to the discussion. What would be the preferred method of sharing the info?</p> <p>Thanks,</p> <hr> <pre><code>From:&quot;Joshua D(dot) Drake&quot; &lt;jd(at)commandprompt(dot)com&gt; Date:2018-04-10 20:57:34 </code></pre> <p>On 04/10/2018 12:51 PM, Joshua D. Drake wrote:</p> <blockquote> <p>-hackers,</p> <p>The thread is picking up over on the ext4 list. They don't update their archives as often as we do, so I can't link to the discussion. What would be the preferred method of sharing the info?</p> </blockquote> <p>Thanks to Anthony for this link:</p> <p><a href="http://lists.openwall.net/linux-ext4/2018/04/10/33">http://lists.openwall.net/linux-ext4/2018/04/10/33</a></p> <p>It isn't quite real time but it keeps things close enough.</p> <hr> <pre><code>From:Jonathan Corbet &lt;corbet(at)lwn(dot)net&gt; Date:2018-04-11 12:05:27 </code></pre> <p>On Tue, 10 Apr 2018 17:40:05 +0200 Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:</p> <blockquote> <p>LSF/MM'18 is upcoming and it would have been the perfect opportunity but it's past the CFP deadline. It may still worth contacting the organizers to bring forward the issue, and see if there is a chance to have someone from Pg invited for further discussions.</p> </blockquote> <p>FWIW, it is my current intention to be sure that the development community is at least aware of the issue by the time LSFMM starts.</p> <p>The event is April 23-25 in Park City, Utah. I bet that room could be found for somebody from the postgresql community, should there be somebody who would like to represent the group on this issue. Let me know if an introduction or advocacy from my direction would be helpful.</p> <hr> <pre><code>From:Greg Stark &lt;stark(at)mit(dot)edu&gt; Date:2018-04-11 12:23:49 </code></pre> <p>On 10 April 2018 at 19:58, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:</p> <blockquote> <p>You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off.</p> </blockquote> <p>I always wondered why Linux didn't implement umount -f. It's been in BSD since forever and it's a major annoyance that it's missing in Linux. 
Even without leaking memory it still leaks other resources, causes confusion and awkward workarounds in UI and automation software.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-11 14:29:09 </code></pre> <p>Hi,</p> <p>On 2018-04-11 06:05:27 -0600, Jonathan Corbet wrote:</p> <blockquote> <p>The event is April 23-25 in Park City, Utah. I bet that room could be found for somebody from the postgresql community, should there be somebody who would like to represent the group on this issue. Let me know if an introduction or advocacy from my direction would be helpful.</p> </blockquote> <p>If that room can be found, I might be able to make it. Being in SF, I'm probably the physically closest PG dev involved in the discussion.</p> <p>Thanks for chiming in,</p> <hr> <pre><code>From:Jonathan Corbet &lt;corbet(at)lwn(dot)net&gt; Date:2018-04-11 14:40:31 </code></pre> <p>On Wed, 11 Apr 2018 07:29:09 -0700 Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>If that room can be found, I might be able to make it. Being in SF, I'm probably the physically closest PG dev involved in the discussion.</p> </blockquote> <p>OK, I've dropped the PC a note; hopefully you'll be hearing from them.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-17 21:19:53 </code></pre> <p>On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 10 April 2018 at 02:59, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.</p> </blockquote> <p>I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it's preallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).</p> </blockquote> <p>I don't think the write is <em>sent</em> to the NFS at the time of the write, so while the NFS side would reserve the space, it might not get the write request until after we return write success to the process.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-17 21:29:17 </code></pre> <p>On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>On 04/09/2018 12:29 AM, Bruce Momjian wrote:</p> <blockquote> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> </blockquote> <p>That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.</p> </blockquote> <p>My more-considered crazy idea is to have a postgresql.conf setting like archive_command that allows the administrator to specify a command that will be run <em>after</em> fsync but before the checkpoint is marked as complete. 
While we can have write flush errors before fsync and never see the errors during fsync, we will not have write flush errors <em>after</em> fsync that are associated with previous writes.</p> <p>The script should check for I/O or space-exhaustion errors and return false in that case, in which case we can stop and maybe stop and crash recover. We could have an exit of 1 do the former, and an exit of 2 do the later.</p> <p>Also, if we are relying on WAL, we have to make sure WAL is actually safe with fsync, and I am betting only the O_DIRECT methods actually are safe:</p> <pre><code> #wal_sync_method = fsync # the default is the first option # supported by the operating system: # open_datasync --&gt; # fdatasync (default on Linux) --&gt; # fsync --&gt; # fsync_writethrough # open_sync </code></pre> <p>I am betting the marked wal_sync_method methods are not safe since there is time between the write and fsync.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-17 21:32:45 </code></pre> <p>On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>On 04/09/2018 12:29 AM, Bruce Momjian wrote:</p> <blockquote> <p>An crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something wrong.</p> </blockquote> <p>That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.</p> </blockquote> <p>Replying to your specific case, I am not sure how we would use a script to check for I/O errors/space-exhaustion if the postgres user doesn't have access to it. Does O_DIRECT work in such container cases?</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-17 21:34:53 </code></pre> <p>On 2018-04-17 17:29:17 -0400, Bruce Momjian wrote:</p> <blockquote> <p>Also, if we are relying on WAL, we have to make sure WAL is actually safe with fsync, and I am betting only the O_DIRECT methods actually are safe:</p> <pre><code> &gt; #wal_sync_method = fsync # the default is the first option &gt; # supported by the operating system: &gt; # open_datasync &gt; --&gt; # fdatasync (default on Linux) &gt; --&gt; # fsync &gt; --&gt; # fsync_writethrough &gt; # open_sync </code></pre> <p>I am betting the marked wal_sync_method methods are not safe since there is time between the write and fsync.</p> </blockquote> <p>Hm? That's not really the issue though? One issue is that retries are not necessarily safe in buffered IO, the other that fsync might not report an error if the fd was closed and opened.</p> <p>O_DIRECT is only used if wal archiving or streaming isn't used, which makes it pretty useless anyway.</p> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-17 21:41:42 </code></pre> <p>On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:</p> <blockquote> <p>On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). 
Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.</p> </blockquote> </blockquote> <p>You can certainly have access to the kernel log in containers. I'd assume such a script wouldn't check various system logs but instead tail /dev/kmsg or such. Otherwise the variance between installations would be too big.</p> <p>There's not <em>that</em> many different type of error messages and they don't change that often. If we'd just detect error for the most common FSs we'd probably be good. Detecting a few general storage layer message wouldn't be that hard either, most things have been unified over the last ~8-10 years.</p> <blockquote> <p>Replying to your specific case, I am not sure how we would use a script to check for I/O errors/space-exhaustion if the postgres user doesn't have access to it.</p> </blockquote> <p>Not sure what you mean?</p> <p>Space exhaustiion can be checked when allocating space, FWIW. We'd just need to use posix_fallocate et al.</p> <blockquote> <p>Does O_DIRECT work in such container cases?</p> </blockquote> <p>Yes.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-17 21:49:42 </code></pre> <p>On Mon, Apr 9, 2018 at 12:25:33PM -0700, Peter Geoghegan wrote:</p> <blockquote> <p>On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.</p> </blockquote> <p>+1</p> <blockquote> <p>We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.</p> </blockquote> <p>Right. We seem to be implicitly assuming that there is a big difference between a problem in the storage layer that we could in principle detect, but don't, and any other problem in the storage layer. I've read articles claiming that technologies like SMART are not really reliable in a practical sense [1], so it seems to me that there is reason to doubt that this gap is all that big.</p> <p>That said, I suspect that the problems with running out of disk space are serious practical problems. I have personally scoffed at stories involving Postgres databases corruption that gets attributed to running out of disk space. Looks like I was dead wrong.</p> </blockquote> <p>Yes, I think we need to look at user expectations here.</p> <p>If the device has a hardware write error, it is true that it is good to detect it, and it might be permanent or temporary, e.g. NAS/NFS. The longer the error persists, the more likely the user will expect corruption. However, right now, any length outage could cause corruption, and it will not be reported in all cases.</p> <p>Running out of disk space is also something you don't expect to corrupt your database --- you expect it to only prevent future writes. 
It seems NAS/NFS and any thin provisioned storage will have this problem, and again, not always reported.</p> <p>So, our initial action might just be to educate users that write errors can cause silent corruption, and out-of-space errors on NAS/NFS and any thin provisioned storage can cause corruption.</p> <p>Kernel logs (not just Postgres logs) should be monitored for these issues and fail-over/recovering might be necessary.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-18 09:52:22 </code></pre> <p>On Tue, Apr 17, 2018 at 02:34:53PM -0700, Andres Freund wrote:</p> <blockquote> <p>On 2018-04-17 17:29:17 -0400, Bruce Momjian wrote:</p> <blockquote> <p>Also, if we are relying on WAL, we have to make sure WAL is actually safe with fsync, and I am betting only the O_DIRECT methods actually are safe:</p> <pre><code>&gt; &gt; #wal_sync_method = fsync # the default is the first option &gt; &gt; # supported by the operating system: &gt; &gt; # open_datasync &gt; &gt; --&gt; # fdatasync (default on Linux) &gt; &gt; --&gt; # fsync &gt; &gt; --&gt; # fsync_writethrough &gt; &gt; # open_sync </code></pre> <p>I am betting the marked wal_sync_method methods are not safe since there is time between the write and fsync.</p> </blockquote> <p>Hm? That's not really the issue though? One issue is that retries are not necessarily safe in buffered IO, the other that fsync might not report an error if the fd was closed and opened.</p> </blockquote> <p>Well, we have been focusing on the delay between backend or checkpoint writes and checkpoint fsyncs. My point is that we have the same problem in doing a write, <em>then</em> fsync for the WAL. Yes, the delay is much shorter, but the issue still exists. I realize that newer Linux kernels will not have the problem since the file descriptor remains open, but the problem exists with older/common linux kernels.</p> <blockquote> <p>O_DIRECT is only used if wal archiving or streaming isn't used, which makes it pretty useless anyway.</p> </blockquote> <p>Uh, don't 'open_datasync' and 'open_sync' fsync as part of the write, meaning we can't lose the error report like we can with the others?</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-18 10:04:30 </code></pre> <p>On 18 April 2018 at 05:19, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 10 April 2018 at 02:59, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.</p> </blockquote> <p>I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it's preallocating as well? 
I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).</p> </blockquote> <p>I don't think the write is <em>sent</em> to the NFS at the time of the write, so while the NFS side would reserve the space, it might get the write request until after we return write success to the process.</p> </blockquote> <p>It should be sent if you're using sync mode.</p> <p>From my reading of the docs, if you're using async mode you're already open to so many potential corruptions you might as well not bother.</p> <p>I need to look into this more re NFS and expand the tests I have to cover that properly.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-18 10:19:28 </code></pre> <p>On 10 April 2018 at 20:15, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>On 10 April 2018 at 14:10, Michael Paquier <michael(at)paquier(dot)xyz> wrote:</p> <blockquote> <p>Well, I think that there is place for improving reporting of failure in file_utils.c for frontends, or at worst have an exit() for any kind of critical failures equivalent to a PANIC.</p> </blockquote> <p>Yup.</p> <p>In the mean time, speaking of PANIC, here's the first cut patch to make Pg panic on fsync() failures. I need to do some closer review and testing, but it's presented here for anyone interested.</p> <p>I intentionally left some failures as ERROR not PANIC, where the entire operation is done as a unit, and an ERROR will cause us to retry the whole thing.</p> <p>For example, when we fsync() a temp file before we move it into place, there's no point panicing on failure, because we'll discard the temp file on ERROR and retry the whole thing.</p> <p>I've verified that it works as expected with some modifications to the test tool I've been using (pushed).</p> <p>The main downside is that if we panic in redo, we don't try again. We throw our toys and shut down. But arguably if we get the same I/O error again in redo, that's the right thing to do anyway, and quite likely safer than continuing to ERROR on checkpoints indefinitely.</p> <p>Patch attached.</p> <p>To be clear, this patch only deals with the issue of us retrying fsyncs when it turns out to be unsafe. This does NOT address any of the issues where we won't find out about writeback errors at all.</p> </blockquote> <p>Thinking about this some more, it'll definitely need a GUC to force it to continue despite a potential hazard. Otherwise we go backwards from the status quo if we're in a position where uptime is vital and correctness problems can be tolerated or repaired later. Kind of like zero_damaged_pages, we'll need some sort of continue_after_fsync_errors .</p> <p>Without that, we'll panic once, enter redo, and if the problem persists we'll panic in redo and exit the startup process. That's not going to help users.</p> <p>I'll amend the patch accordingly as time permits.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-18 11:46:15 </code></pre> <p>On Wed, Apr 18, 2018 at 06:04:30PM +0800, Craig Ringer wrote:</p> <blockquote> <p>On 18 April 2018 at 05:19, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:</p> <blockquote> <p>On 10 April 2018 at 02:59, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:</p> <blockquote> <p>Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). 
NFS seems to be the main exception here.</p> </blockquote> <p>I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it'spreallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).</p> </blockquote> <p>I don't think the write is <em>sent</em> to the NFS at the time of the write, so while the NFS side would reserve the space, it might get the write request until after we return write success to the process.</p> </blockquote> <p>It should be sent if you're using sync mode.</p> <blockquote> <p>From my reading of the docs, if you're using async mode you're already open to so many potential corruptions you might as well not bother.</p> </blockquote> <p>I need to look into this more re NFS and expand the tests I have to cover that properly.</p> </blockquote> <p>So, if sync mode passes the write to NFS, and NFS pre-reserves write space, and throws an error on reservation failure, that means that NFS will not corrupt a cluster on out-of-space errors.</p> <p>So, what about thin provisioning? I can understand sharing <em>free</em> space among file systems, but once a write arrives I assume it reserves the space. Is the problem that many thin provisioning systems don't have a sync mode, so you can't force the write to appear on the device before an fsync?</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-18 11:56:57 </code></pre> <p>On Tue, Apr 17, 2018 at 02:41:42PM -0700, Andres Freund wrote:</p> <blockquote> <p>On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:</p> <blockquote> <p>On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:</p> <blockquote> <p>That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.</p> </blockquote> </blockquote> <p>You can certainly have access to the kernel log in containers. I'd assume such a script wouldn't check various system logs but instead tail /dev/kmsg or such. Otherwise the variance between installations would be too big.</p> </blockquote> <p>I was thinking 'dmesg', but the result is similar.</p> <blockquote> <p>There's not <em>that</em> many different type of error messages and they don't change that often. If we'd just detect error for the most common FSs we'd probably be good. Detecting a few general storage layer message wouldn't be that hard either, most things have been unified over the last ~8-10 years.</p> </blockquote> <p>It is hard to know exactly what the message format should be for each operating system because it is hard to generate them on demand, and we would need to filter based on Postgres devices.</p> <p>The other issue is that once you see a message during a checkpoint and exit, you don't want to see that message again after the problem has been fixed and the server restarted. The simplest solution is to save the output of the last check and look for only new entries. 
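<p>[Aside, not part of the original thread: a rough sketch of the kind of kernel-log check being discussed. The matched strings are only examples, since real messages vary by kernel version, driver, and filesystem, and a real check would also need to filter by device and remember which entries it had already seen, as described above.]</p>
<pre><code>/* Sketch of a kernel-log check: scan dmesg output for storage-layer errors
 * and exit non-zero if any are found. Illustration only; the error strings
 * and the policy around them are assumptions, not a tested recipe. */
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

int main(void)
{
    FILE *log = popen(&quot;dmesg&quot;, &quot;r&quot;);   /* or read /dev/kmsg directly */
    if (!log) { perror(&quot;popen&quot;); return 2; }

    char line[4096];
    int errors = 0;
    while (fgets(line, sizeof(line), log))
        if (strstr(line, &quot;I/O error&quot;) || strstr(line, &quot;EXT4-fs error&quot;))
            errors++;   /* a real check would only count entries newer than the last run */

    pclose(log);
    return errors ? 1 : 0;   /* non-zero exit: treat the checkpoint as failed */
}
</code></pre>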
I am attaching a script I run every 15 minutes from cron that emails me any unexpected kernel messages.</p> <p>I am thinking we would need a contrib module with sample scripts for various operating systems.</p> <blockquote> <blockquote> <p>Replying to your specific case, I am not sure how we would use a script to check for I/O errors/space-exhaustion if the postgres user doesn't have access to it.</p> </blockquote> <p>Not sure what you mean?</p> <p>Space exhaustiion can be checked when allocating space, FWIW. We'd just need to use posix_fallocate et al.</p> </blockquote> <p>I was asking about cases where permissions prevent viewing of kernel messages. I think you can view them in containers, but in virtual machines you might not have access to the host operating system's kernel messages, and that might be where they are.</p> <pre><code> AttachmentContent-TypeSize dmesg_checktext/plain574 bytes </code></pre> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-18 12:45:53 </code></pre> <p>wrOn 18 April 2018 at 19:46, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>So, if sync mode passes the write to NFS, and NFS pre-reserves write space, and throws an error on reservation failure, that means that NFS will not corrupt a cluster on out-of-space errors.</p> </blockquote> <p>Yeah. I need to verify in a concrete test case.</p> <p>The thing is that write() is allowed to be asynchronous anyway. Most file systems choose to implement eager reservation of space, but it's not mandated. AFAICS that's largely a historical accident to keep applications happy, because FSes used to <em>allocate</em> the space at write() time too, and when they moved to delayed allocations, apps tended to break too easily unless they at least reserved space. NFS would have to do a round-trip on write() to reserve space.</p> <p>The Linux man pages (<a href="http://man7.org/linux/man-pages/man2/write.2.html">http://man7.org/linux/man-pages/man2/write.2.html</a>) say:</p> <blockquote> <p>A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be delayed until a future write(2), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data.</p> </blockquote> <p>... and I'm inclined to believe it when it refuses to make guarantees. Especially lately.</p> <blockquote> <p>So, what about thin provisioning? I can understand sharing <em>free</em> space among file systems</p> </blockquote> <p>Most thin provisioning is done at the block level, not file system level. So the FS is usually unaware it's on a thin-provisioned volume. Usually the whole kernel is unaware, because the thin provisioning is done on the SAN end or by a hypervisor. But the same sort of thing may be done via LVM - see lvmthin. For example, you may make 100 different 1TB ext4 FSes, each on 1TB iSCSI volumes backed by SAN with a total of 50TB of concrete physical capacity. The SAN is doing block mapping and only allocating storage chunks to a given volume when the FS has written blocks to every previous free block in the previous storage chunk. 
It may also do things like block de-duplication, compression of storage chunks that aren't written to for a while, etc.</p> <p>The idea is that when the SAN's actual physically allocate storage gets to 40TB it starts telling you to go buy another rack of storage so you don't run out. You don't have to resize volumes, resize file systems, etc. All the storage space admin is centralized on the SAN and storage team, and your sysadmins, DBAs and app devs are none the wiser. You buy storage when you need it, not when the DBA demands they need a 200% free space margin just in case. Whether or not you agree with this philosophy or think it's sensible is kind of moot, because it's an extremely widespread model, and servers you work on may well be backed by thin provisioned storage <em>even if you don't know it</em>.</p> <p>Think of it as a bit like VM overcommit, for storage. You can malloc() as much memory as you like and everything's fine until you try to actually use it. Then you go to dirty a page, no free pages are available, and <em>boom</em>.</p> <p>The thing is, the SAN (or LVM) doesn't have any idea about the FS's internal in-memory free space counter and its space reservations. Nor does it understand any FS metadata. All it cares about is &quot;has this LBA ever been written to by the FS?&quot;. If so, it must make sure backing storage for it exists. If not, it won't bother.</p> <p>Most FSes only touch the blocks on dirty writeback, or sometimes lazily as part of delayed allocation. So if your SAN is running out of space and there's 100MB free, each of your 100 FSes may have decremented its freelist by 2MB and be happily promising more space to apps on write() because, well, as far as they know they're only 50% full. When they all do dirty writeback and flush to storage, kaboom, there's nowhere to put some of the data.</p> <p>I don't know if posix_fallocate is a sufficient safeguard either. You'd have to actually force writes to each page through to the backing storage to know for sure the space existed. Yes, the docs say</p> <blockquote> <p>After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space.</p> </blockquote> <p>... but they're speaking from the filesystem's perspective. If the FS doesn't dirty and flush the actual blocks, a thin provisioned storage system won't know.</p> <p>It's reasonable enough to throw up our hands in this case and say &quot;your setup is crazy, you're breaking the rules, don't do that&quot;. The truth is they AREN'T breaking the rules, but we can disclaim support for such configurations anyway.</p> <p>After all, we tell people not to use Linux's VM overcommit too. How's that working for you? I see it enabled on the great majority of systems I work with, and some people are very reluctant to turn it off because they don't want to have to add swap.</p> <p>If someone has a 50TB SAN and wants to allow for unpredictable space use expansion between various volumes, and we say &quot;you can't do that, go buy a 100TB SAN instead&quot; ... that's not going to go down too well either. Often we can actually say &quot;make sure the 5TB volume PostgreSQL is using is eagerly provisioned, and expand it at need using online resize if required. We don't care about the rest of the SAN.&quot;.</p> <p>I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. 
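<p>[Aside, not part of the original thread: a sketch of the distinction Craig is drawing. posix_fallocate() satisfies the filesystem's own space accounting, but on a thin-provisioned volume only dirtying the blocks and flushing them forces the backing store to materialize, and even that depends on the storage layer. The chunk size and the zero-filling below are arbitrary illustrative choices, not a tested recipe.]</p>
<pre><code>/* Two levels of &quot;reserving&quot; space for a file. Illustration only. */
#include &lt;fcntl.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;unistd.h&gt;

#define CHUNK (16 * 1024 * 1024)

int reserve_space(int fd, off_t len)
{
    /* 1. Filesystem-level reservation: protects against ENOSPC from the FS itself. */
    if (posix_fallocate(fd, 0, len) != 0)
        return -1;

    /* 2. To know the blocks exist on the backing device, actually write and flush
     *    them. Expensive, and still only as good as the storage layer underneath. */
    char *zeros = calloc(1, CHUNK);
    if (zeros == NULL)
        return -1;
    for (off_t off = 0; off &lt; len; off += CHUNK) {
        size_t n = (len - off &lt; CHUNK) ? (size_t)(len - off) : CHUNK;
        if (pwrite(fd, zeros, n, off) != (ssize_t)n) {
            free(zeros);
            return -1;
        }
    }
    free(zeros);
    return fsync(fd);
}
</code></pre>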
AWS are probably pretty good about not running out of backing store, though.</p> <p>There <em>are</em> file systems optimised for thin provisioning, etc, too. But that's more commonly done by having them do things like zero deallocated space so the thin provisioning system knows it can return it to the free pool, and now things like DISCARD provide much of that signalling in a standard way.</p> <hr> <pre><code>From:Mark Kirkwood &lt;mark(dot)kirkwood(at)catalyst(dot)net(dot)nz&gt; Date:2018-04-18 23:31:50 </code></pre> <p>On 19/04/18 00:45, Craig Ringer wrote:</p> <blockquote> <p>I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. AWS are probably pretty good about not running out of backing store, though.</p> </blockquote> <p>Some db folks (used to anyway) advise dd'ing to your freshly attached devices on AWS (for performance mainly IIRC), but that would help prevent some failure scenarios for any thin provisioned storage (but probably really annoy the admins' thereof).</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-19 00:44:33 </code></pre> <p>On 19 April 2018 at 07:31, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> wrote:</p> <blockquote> <p>On 19/04/18 00:45, Craig Ringer wrote:</p> <blockquote> <p>I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. AWS are probably pretty good about not running out of backing store, though.</p> </blockquote> <p>Some db folks (used to anyway) advise dd'ing to your freshly attached devices on AWS (for performance mainly IIRC), but that would help prevent some failure scenarios for any thin provisioned storage (but probably really annoy the admins' thereof).</p> </blockquote> <p>This still makes a lot of sense on AWS EBS, particularly when using a volume created from a non-empty snapshot. Performance of S3-snapshot based EBS volumes is spectacularly awful, since they're copy-on-read. Reading the whole volume helps a lot.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-20 20:49:08 </code></pre> <p>On Wed, Apr 18, 2018 at 08:45:53PM +0800, Craig Ringer wrote:</p> <blockquote> <p>wrOn 18 April 2018 at 19:46, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>So, if sync mode passes the write to NFS, and NFS pre-reserves write space, and throws an error on reservation failure, that means that NFS will not corrupt a cluster on out-of-space errors.</p> </blockquote> <p>Yeah. I need to verify in a concrete test case.</p> </blockquote> <p>Thanks.</p> <blockquote> <p>The thing is that write() is allowed to be asynchronous anyway. Most file systems choose to implement eager reservation of space, but it's not mandated. AFAICS that's largely a historical accident to keep applications happy, because FSes used to <em>allocate</em> the space at write() time too, and when they moved to delayed allocations, apps tended to break too easily unless they at least reserved space. NFS would have to do a round-trip on write() to reserve space.</p> <p>The Linux man pages (<a href="http://man7.org/linux/man-pages/man2/write.2.html">http://man7.org/linux/man-pages/man2/write.2.html</a>) say:</p> <p>&quot; A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. 
In this case, some errors might be delayed until a future write(2), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data. &quot;</p> <p>... and I'm inclined to believe it when it refuses to make guarantees. Especially lately.</p> </blockquote> <p>Uh, even calling fsync after write isn't 100% safe since the kernel could have flushed the dirty pages to storage, and failed, and the fsync would later succeed. I realize newer kernels have that fixed for files open during that operation, but that is the minority of installs.</p> <blockquote> <p>The idea is that when the SAN's actual physically allocate storage gets to 40TB it starts telling you to go buy another rack of storage so you don't run out. You don't have to resize volumes, resize file systems, etc. All the storage space admin is centralized on the SAN and storage team, and your sysadmins, DBAs and app devs are none the wiser. You buy storage when you need it, not when the DBA demands they need a 200% free space margin just in case. Whether or not you agree with this philosophy or think it's sensible is kind of moot, because it's an extremely widespread model, and servers you work on may well be backed by thin provisioned storage <em>even if you don't know it</em>.</p> <p>Most FSes only touch the blocks on dirty writeback, or sometimes lazily as part of delayed allocation. So if your SAN is running out of space and there's 100MB free, each of your 100 FSes may have decremented its freelist by 2MB and be happily promising more space to apps on write() because, well, as far as they know they're only 50% full. When they all do dirty writeback and flush to storage, kaboom, there's nowhere to put some of the data.</p> </blockquote> <p>I see what you are saying --- that the kernel is reserving the write space from its free space, but the free space doesn't all exist. I am not sure how we can tell people to make sure the file system free space is real.</p> <blockquote> <p>You'd have to actually force writes to each page through to the backing storage to know for sure the space existed. Yes, the docs say</p> <p>&quot; After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space. &quot;</p> <p>... but they're speaking from the filesystem's perspective. If the FS doesn't dirty and flush the actual blocks, a thin provisioned storage system won't know.</p> </blockquote> <p>Frankly, in what cases will a write fail <em>for</em> lack of free space? It could be a new WAL file (not recycled), or a pages added to the end of the table.</p> <p>Is that it? It doesn't sound too terrible. 
If we can eliminate the corruption due to free space exxhaustion, it would be a big step forward.</p> <p>The next most common failure would be temporary storage failure or storage communication failure.</p> <p>Permanent storage failure is &quot;game over&quot; so we don't need to worry about that.</p> <hr> <pre><code>From:Gasper Zejn &lt;zejn(at)owca(dot)info&gt; Date:2018-04-21 19:21:39 </code></pre> <p>Just for the record, I tried the test case with ZFS on Ubuntu 17.10 host with ZFS on Linux 0.6.5.11.</p> <p>ZFS does not swallow the fsync error, but the system does not handle the error nicely: the test case program hangs on fsync, the load jumps up and there's a bunch of z_wr_iss and z_null_int kernel threads belonging to zfs, eating up the CPU.</p> <p>Even then I managed to reboot the system, so it's not a complete and utter mess.</p> <p>The test case adjustments are here: <a href="https://github.com/zejn/scrapcode/commit/e7612536c346d59a4b69bedfbcafbe8c1079063c">https://github.com/zejn/scrapcode/commit/e7612536c346d59a4b69bedfbcafbe8c1079063c</a></p> <p>Kind regards,</p> <hr> <p>On 29. 03. 2018 07:25, Craig Ringer wrote:</p> <blockquote> <p>On 29 March 2018 at 13:06, Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com</p> <pre><code>On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby &gt; The retries are the source of the problem ; the first fsync() can return EIO, &gt; and also *clears the error* causing a 2nd fsync (of the same data) to return &gt; success. &gt; What I'm failing to grok here is how that error flag even matters, &gt; whether it's a single bit or a counter as described in that patch. If &gt; write back failed, *the page is still dirty*. So all future calls to &gt; fsync() need to try to try to flush it again, and (presumably) fail &gt; again (unless it happens to succeed this time around). </code></pre> <p>You'd think so. But it doesn't appear to work that way. You can see yourself with the error device-mapper destination mapped over part of a volume.</p> <p>I wrote a test case here.</p> <p><a href="https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c">https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c</a></p> <p>I don't pretend the kernel behaviour is sane. And it's possible I've made an error in my analysis. But since I've observed this in the wild, and seen it in a test case, I strongly suspect that's what I've described is just what's happening, brain-dead or no.</p> <p>Presumably the kernel marks the page clean when it dispatches it to the I/O subsystem and doesn't dirty it again on I/O error? I haven't dug that deep on the kernel side. See the stackoverflow post for details on what I found in kernel code analysis.</p> </blockquote> <hr> <pre><code>From:Andres Freund &lt;andres(at)anarazel(dot)de&gt; Date:2018-04-23 20:14:48 </code></pre> <p>Hi,</p> <p>On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means &quot;all writes since the last fsync have hit disk&quot; but we assume it means &quot;all writes since the last SUCCESSFUL fsync have hit disk&quot;.</p> <p>But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() <em>cleared the AS_EIO bad page flag</em>.</p> </blockquote> <p>Random other thing we should look at: Some filesystems (nfs yes, xfs ext4 no) flush writes at close(2). 
We check close() return code, just log it... So close() counts as an fsync for such filesystems.</p> <p>I'm at LSF/MM to discuss future behaviour of linux here, but that's how it is right now.</p> <hr> <pre><code>From:Bruce Momjian &lt;bruce(at)momjian(dot)us&gt; Date:2018-04-24 00:09:23 </code></pre> <p>On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means &quot;all writes since the last fsync have hit disk&quot; but we assume it means &quot;all writes since the last SUCCESSFUL fsync have hit disk&quot;.</p> <p>But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() <em>cleared the AS_EIO bad page flag</em>.</p> </blockquote> <p>Random other thing we should look at: Some filesystems (nfs yes, xfs ext4 no) flush writes at close(2). We check close() return code, just log it... So close() counts as an fsync for such filesystems.</p> </blockquote> <p>Well, that's interesting. You might remember that NFS does not reserve space for writes like local file systems like ext4/xfs do. For that reason, we might be able to capture the out-of-space error on close and exit sooner for NFS.</p> <hr> <pre><code>From:Craig Ringer &lt;craig(at)2ndquadrant(dot)com&gt; Date:2018-04-26 02:16:52 </code></pre> <p>On 24 April 2018 at 04:14, Andres Freund <andres(at)anarazel(dot)de> wrote:</p> <blockquote> <p>I'm at LSF/MM to discuss future behaviour of linux here, but that's how it is right now.</p> </blockquote> <p>Interim LWN.net coverage of that can be found here: <a href="https://lwn.net/Articles/752613/">https://lwn.net/Articles/752613/</a></p> <hr> <pre><code>From:Thomas Munro &lt;thomas(dot)munro(at)enterprisedb(dot)com&gt; Date:2018-04-27 01:18:55 </code></pre> <p>On Tue, Apr 24, 2018 at 12:09 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:</p> <blockquote> <p>On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:</p> <blockquote> <p>TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means &quot;all writes since the last fsync have hit disk&quot; but we assume it means &quot;all writes since the last SUCCESSFUL fsync have hit disk&quot;.</p> <p>But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() <em>cleared the AS_EIO bad page flag</em>.</p> </blockquote> <p>Random other thing we should look at: Some filesystems (nfs yes, xfs ext4 no) flush writes at close(2). We check close() return code, just log it... So close() counts as an fsync for such filesystems.</p> </blockquote> <p>Well, that's interesting. You might remember that NFS does not reserve space for writes like local file systems like ext4/xfs do. For that reason, we might be able to capture the out-of-space error on close and exit sooner for NFS.</p> </blockquote> <p>It seems like some implementations flush on close and therefore discover an ENOSPC problem at that point, unless they have NFSv4 (RFC 3050) &quot;write delegation&quot; with a promise from the server that a certain amount of space is available. 
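<p>[Aside, not part of the original thread: the practical upshot for a userspace program, sketched below. On filesystems that flush dirty data at close(), an ENOSPC or EIO can surface at write(), at fsync(), or only at close(), so all three return values have to be treated as possible write failures. This is an illustration of the point being made, not PostgreSQL's actual code path.]</p>
<pre><code>/* Illustration: treat a close() failure as a write failure. On NFS and
 * similar filesystems, close() may be where a deferred ENOSPC/EIO shows up. */
#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;unistd.h&gt;

int write_file(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd &lt; 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t) len ||  /* may only dirty the page cache  */
        fsync(fd) != 0) {                        /* may report the error here ...  */
        int save_errno = errno;
        close(fd);
        errno = save_errno;
        return -1;
    }

    if (close(fd) != 0)      /* ... or it may only be reported here (flush-on-close) */
        return -1;

    return 0;
}
</code></pre>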
It seems like you can't count on that in any way though, because it's the server that decides when to delegate and how much space to promise is preallocated, not the client. So in userspace you always need to be able to handle errors including ENOSPC returned by close(), and if you ignore that and you're using an operating system that immediately incinerates all evidence after telling you that (so that later fsync() doesn't fail), you're in trouble.</p> <p>Some relevant code:</p> <ul> <li><a href="https://github.com/torvalds/linux/commit/5445b1fbd123420bffed5e629a420aa2a16bf849">https://github.com/torvalds/linux/commit/5445b1fbd123420bffed5e629a420aa2a16bf849</a></li> <li><a href="https://github.com/freebsd/freebsd/blob/master/sys/fs/nfsclient/nfs_clvnops.c#L618">https://github.com/freebsd/freebsd/blob/master/sys/fs/nfsclient/nfs_clvnops.c#L618</a></li> </ul> <p>It looks like the bleeding edge of the NFS spec includes a new ALLOCATE operation that should be able to support posix_fallocate() (if we were to start using that for extending files):</p> <p><a href="https://tools.ietf.org/html/rfc7862#page-64">https://tools.ietf.org/html/rfc7862#page-64</a></p> <p>I'm not sure how reliable [posix_]fallocate is on NFS in general though, and it seems that there are fall-back implementations of posix_fallocate() that write zeros (or even just feign success?) which probably won't do anything useful here if not also flushed (that fallback strategy might only work on eager reservation filesystems that don't have direct fallocate support?) so there are several layers (libc, kernel, nfs client, nfs server) that'd need to be aligned for that to work, and it's not clear how a humble userspace program is supposed to know if they are.</p> <p>I guess if you could find a way to amortise the cost of extending (like Oracle et al do by extending big container datafiles 10MB at a time or whatever), then simply writing zeros and flushing when doing that might work out OK, so you wouldn't need such a thing? (Unless of course it's a COW filesystem, but that's a different can of worms.)</p> <hr> <p><em>This thread continues on the <code>ext4</code> mailing list:</em> <hr></p> <pre><code>From: &quot;Joshua D. Drake&quot; &lt;jd@...mandprompt.com&gt; Subject: fsync() errors is unsafe and risks data loss Date: Tue, 10 Apr 2018 09:28:15 -0700 </code></pre> <p>-ext4,</p> <p>If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:</p> <p><a href="https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz">https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz</a></p> <p>The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.</p> <hr> <pre><code>From: &quot;Darrick J. Wong&quot; &lt;darrick.wong@...cle.com&gt; Date: Tue, 10 Apr 2018 09:54:43 -0700 </code></pre> <p>On Tue, Apr 10, 2018 at 09:28:15AM -0700, Joshua D. Drake wrote:</p> <blockquote> <p>-ext4,</p> <p>If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). 
You can see the thread here:</p> <p><a href="https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz">https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz</a></p> <p>The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.</p> </blockquote> <p>You might try the XFS list (linux-xfs@...r.kernel.org) seeing as the initial complaint is against xfs behaviors...</p> <hr> <pre><code>From: &quot;Joshua D. Drake&quot; &lt;jd@...mandprompt.com&gt; Date: Tue, 10 Apr 2018 09:58:21 -0700 </code></pre> <p>On 04/10/2018 09:54 AM, Darrick J. Wong wrote:</p> <blockquote> <p>On Tue, Apr 10, 2018 at 09:28:15AM -0700, Joshua D. Drake wrote:</p> <blockquote> <p>-ext4,</p> <p>If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:</p> <p><a href="https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz">https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz</a></p> <p>The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.</p> </blockquote> <p>You might try the XFS list (linux-xfs@...r.kernel.org) seeing as the initial complaint is against xfs behaviors...</p> </blockquote> <p>Later in the thread it becomes apparent that it applies to ext4 (NFS too) as well. I picked ext4 because I assumed it is the most populated of the lists since its the default filesystem for most distributions.</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Tue, 10 Apr 2018 14:43:56 -0400 </code></pre> <p>Hi Joshua,</p> <p>This isn't actually an ext4 issue, but a long-standing VFS/MM issue.</p> <p>There are going to be multiple opinions about what the right thing to do. I'll try to give as unbiased a description as possible, but certainly some of this is going to be filtered by my own biases no matter how careful I can be.</p> <p>First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentialy be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the since the LBA in question was last written six months ago. 
No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.</p> <p>Which is why after a while, one can get quite paranoid and assume that the only way you can guarantee data robustness is to store multiple copies and/or use erasure encoding, with some of the copies or shards written to geographically diverse data centers.</p> <p>Secondly, I think it's fair to say that the vast majority of the companies who require data robustness, and are either willing to pay $$$ to an enterprise distro company like Red Hat, or command a large enough paying customer base that they can afford to dictate terms to an enterprise distro, or hire a consultant such as Christoph, or have their own staffed Linux kernel teams, have tended to use O_DIRECT. So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.</p> <p>Next, the reason why fsync() has the behaviour that it does is one ofhe the most common cases of I/O storage errors in buffered use cases, certainly as seen by the community distros, is the user who pulls out USB stick while it is in use. In that case, if there are dirtied pages in the page cache, the question is what can you do? Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak. You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off. And if you don't clear the dirty bit on an I/O error, then they can never be cleaned. You can't even re-insert the USB stick; the re-inserted USB stick will get a new block device. Worse, when the USB stick was pulled, it will have suffered a power drop, and see above about what could happen after a power drop for non-power fail certified flash devices --- it goes double for the cheap sh*t USB sticks found in the checkout aisle of Micro Center.</p> <p>So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is not eagerness to solve the problem simply by &quot;don't clear the dirty bit&quot;. For every one Postgres installation that might have a better recover after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.</p> <p>I can think of things that could be done --- for example, it could be switchable on a per-block device basis (or maybe a per-mount basis) whether or not the dirty bit gets cleared after the error is reported to userspace. And perhaps there could be a new unmount flag that causes all dirty pages to be wiped out, which could be used to recover after a permanent loss of the block device. But the question is who is going to invest the time to make these changes? If there is a company who is willing to pay to comission this work, it's almost certainly soluble. Or if a company which has a kernel on staff is willing to direct an engineer to work on it, it certainly could be solved. But again, of the companies who have client code where we care about robustness and proper handling of failed disk drives, and which have a kernel team on staff, pretty much all of the ones I can think of (e.g., Oracle, Google, etc.) 
use O_DIRECT and they don't try to make buffered writes and error reporting via fsync(2) work well.</p> <p>In general these companies want low-level control over buffer cache eviction algorithms, which drives them towards the design decision of effectively implementing the page cache in userspace, and using O_DIRECT reads/writes.</p> <p>If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work. Let me know off-line if that's the case...</p> <hr> <pre><code>From: Andreas Dilger &lt;adilger@...ger.ca&gt; Date: Tue, 10 Apr 2018 13:44:48 -0600 </code></pre> <p>On Apr 10, 2018, at 10:50 AM, Joshua D. Drake <a href="mailto:jd@...mandprompt.com">jd@...mandprompt.com</a> wrote:</p> <blockquote> <p>-ext4,</p> <p>If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:</p> <p><a href="https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz">https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz</a></p> <p>The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.</p> </blockquote> <p>Yes, this is a very long thread. The summary is Postgres is unhappy that fsync() on Linux (and also other OSes) returns an error once if there was a prior write() failure, instead of keeping dirty pages in memory forever and trying to rewrite them.</p> <p>This behaviour has existed on Linux forever, and (for better or worse) is the only reasonable behaviour that the kernel can take. I've argued for the opposite behaviour at times, and some subsystems already do limited retries before finally giving up on a failed write, though there are also times when retrying at lower levels is pointless if a higher level of code can handle the failure (e.g. mirrored block devices, filesystem data mirroring, userspace data mirroring, or cross-node replication).</p> <p>The confusion is whether fsync() is a &quot;level&quot; state (return error forever if there were pages that could not be written), or an &quot;edge&quot; state (return error only for any write failures since the previous fsync() call).</p> <p>I think Anthony Iliopoulos was pretty clear in his multiple descriptions in that thread of why the current behaviour is needed (OOM of the whole system if dirty pages are kept around forever), but many others were stuck on &quot;I can't believe this is happening??? This is totally unacceptable and every kernel needs to change to match my expectations!!!&quot; without looking at the larger picture of what is practical to change and where the issue should best be fixed.</p> <p>Regardless of why this is the case, the net is that PG needs to deal with all of the systems that currently exist that have this behaviour, even if some day in the future it may change (though that is unlikely). It seems ironic that &quot;keep dirty pages in userspace until fsync() returns success&quot; is totally unacceptable, but &quot;keep dirty pages in the kernel&quot; is fine. 
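<p><em>[Editorial aside, not part of the archived thread: the &quot;edge&quot; semantics Andreas describes, and the per-fd errseq_t reporting added in Linux 4.13, can be illustrated with a small sketch. On a healthy disk everything below succeeds; the comments describe what the thread says happens when write-back of the file fails in the background:]</em></p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd1 = open(&quot;data&quot;, O_WRONLY | O_CREAT, 0644);
    int fd2 = open(&quot;data&quot;, O_RDWR);   /* second descriptor, open before any failure */
    if (fd1 &lt; 0 || fd2 &lt; 0) { perror(&quot;open&quot;); return 1; }

    if (write(fd1, &quot;x&quot;, 1) != 1)
        perror(&quot;write&quot;);

    fsync(fd1);   /* edge-triggered: reports a write-back error once (if one occurred)... */
    fsync(fd1);   /* ...and then reports success, even though the data was dropped        */

    fsync(fd2);   /* with per-fd errseq_t reporting, this fd also sees the error once,
                   * because it was open before the failure was first reported            */

    int fd3 = open(&quot;data&quot;, O_RDWR);
    if (fd3 &gt;= 0)
        fsync(fd3);   /* opened only after the failure: whether it ever sees the
                       * error at all depends on the kernel version */

    close(fd1); close(fd2); close(fd3);
    return 0;
}
</code></pre>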
My (limited) understanding of databases was that they preferred to cache everything in userspace and use O_DIRECT to write to disk (which returns an error immediately if the write fails and does not double buffer data).</p> <hr> <p>From: Martin Steigerwald <a href="mailto:martin@...htvoll.de">martin@...htvoll.de</a> Date: Tue, 10 Apr 2018 21:47:21 +0200</p> <p>Hi Theodore, Darrick, Joshua.</p> <p>CC´d fsdevel as it does not appear to be Ext4 specific to me (and to you as well, Theodore).</p> <p>Theodore Y. Ts'o - 10.04.18, 20:43:</p> <blockquote> <p>This isn't actually an ext4 issue, but a long-standing VFS/MM issue. […] First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentialy be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.</p> </blockquote> <p>Guh. I was not aware of this. I knew consumer-grade SSDs often do not have power loss protection, but still thought they´d handle garble collection in an atomic way. Sometimes I am tempted to sing an &quot;all hardware is crap&quot; song (starting with Meltdown/Spectre, then probably heading over to storage devices and so on… including firmware crap like Intel ME).</p> <blockquote> <p>Next, the reason why fsync() has the behaviour that it does is one ofhe the most common cases of I/O storage errors in buffered use cases, certainly as seen by the community distros, is the user who pulls out USB stick while it is in use. In that case, if there are dirtied pages in the page cache, the question is what can you do? Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak. You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off. And if you don't clear the dirty bit on an I/O error, then they can never be cleaned. You can't even re-insert the USB stick; the re-inserted USB stick will get a new block device. Worse, when the USB stick was pulled, it will have suffered a power drop, and see above about what could happen after a power drop for non-power fail certified flash devices --- it goes double for the cheap sh*t USB sticks found in the checkout aisle of Micro Center.</p> <p>From the original PostgreSQL mailing list thread I did not get on how exactly FreeBSD differs in behavior, compared to Linux. I am aware of one operating system that from a user point of view handles this in almost the right way IMHO: AmigaOS.</p> </blockquote> <p>When you removed a floppy disk from the drive while the OS was writing to it it showed a &quot;You MUST insert volume somename into drive somedrive:&quot; and if you did, it just continued writing. (The part that did not work well was that with the original filesystem if you did not insert it back, the whole disk was corrupted, usually to the point beyond repair, so the &quot;MUST&quot; was no joke.)</p> <p>In my opinion from a user´s point of view this is the only sane way to handle the premature removal of removable media. 
I have read of a GSoC project to implement something like this for NetBSD but I did not check on the outcome of it. But in MS-DOS I think there has been something similar, however MS-DOS is not an multitasking operating system as AmigaOS is.</p> <p>Implementing something like this for Linux would be quite a feat, I think, cause in addition to the implementation in the kernel, the desktop environment or whatever other userspace you use would need to handle it as well, so you´d have to adapt udev / udisks / probably Systemd. And probably this behavior needs to be restricted to anything that is really removable and even then in order to prevent memory exhaustion in case processes continue to write to an removed and not yet re-inserted USB harddisk the kernel would need to halt I/O processes which dirty I/O to this device. (I believe this is what AmigaOS did. It just blocked all subsequent I/O to the device still it was re-inserted. But then the I/O handling in that OS at that time is quite different from what Linux does.)</p> <blockquote> <p>So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is not eagerness to solve the problem simply by &quot;don't clear the dirty bit&quot;. For every one Postgres installation that might have a better recover after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.</p> </blockquote> <p>I was not aware that flash based media may be as crappy as you hint at.</p> <blockquote> <p>From my tests with AmigaOS 4.something or AmigaOS 3.9 + 3rd Party Poseidon USB stack the above mechanism worked even with USB sticks. I however did not test this often and I did not check for data corruption after a test.</p> </blockquote> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Tue, 10 Apr 2018 15:07:26 -0700 </code></pre> <p>(Sorry if I screwed up the thread structure - I'd to reconstruct the reply-to and CC list from web archive as I've not found a way to properly download an mbox or such of old content. Was subscribed to fsdevel but not ext4 lists)</p> <p>Hi,</p> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>I'll try to give as unbiased a description as possible, but certainly some of this is going to be filtered by my own biases no matter how careful I can be.</p> </blockquote> <p>Same ;)</p> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.</p> </blockquote> <p>That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can proper responses back from fsync that imply things actually were synced.</p> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is not eagerness to solve the problem simply by &quot;don't clear the dirty bit&quot;. For every one Postgres installation that might have a better recover after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.</p> </blockquote> <p>I don't think these necessarily are as contradictory goals as you paint them. 
At least in postgres' case we can deal with the fact that an fsync retry isn't going to fix the problem by reentering crash recovery or just shutting down - therefore we don't need to keep all the dirty buffers around. A per-inode or per-superblock bit that causes further fsyncs to fail would be entirely sufficent for that.</p> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.</p> <p>Both in postgres, <em>and</em> a lot of other applications, it's not at all guaranteed to consistently have one FD open for every file writtten. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing.</p> <p>You'd not even need to have per inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says &quot;I died&quot; and set that instead of keeping per inode/whatever information.</p> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work.</p> </blockquote> <p>I find it a bit dissapointing response. I think it's fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable.</p> <p>2018-04-10 19:44:48 Andreas wrote:</p> <blockquote> <p>The confusion is whether fsync() is a &quot;level&quot; state (return error forever if there were pages that could not be written), or an &quot;edge&quot; state (return error only for any write failures since the previous fsync() call).</p> </blockquote> <p>I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, <em>THAT'S</em> the big issue.</p> <p>Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory fsync everything, and <em>then</em> assume you're safe. But unless I severaly misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.</p> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>I think Anthony Iliopoulos was pretty clear in his multiple descriptions in that thread of why the current behaviour is needed (OOM of the whole system if dirty pages are kept around forever), but many others were stuck on &quot;I can't believe this is happening??? 
This is totally unacceptable and every kernel needs to change to match my expectations!!!&quot; without looking at the larger picture of what is practical to change and where the issue should best be fixed.</p> </blockquote> <p>Everone can participate in discussions...</p> <hr> <pre><code>From: Andreas Dilger &lt;adilger@...ger.ca&gt; Date: Wed, 11 Apr 2018 15:52:44 -0600 </code></pre> <p>On Apr 10, 2018, at 4:07 PM, Andres Freund <a href="mailto:andres@...razel.de">andres@...razel.de</a> wrote:</p> <blockquote> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.</p> </blockquote> <p>That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can proper responses back from fsync that imply things actually were synced.</p> </blockquote> <p>Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error. If an editor tries to write a file, then calls fsync and gets an error, the user will enter a new pathname and retry the write. The package manager will assume the package installation failed, and uninstall the parts of the package that were already written.</p> <p>There is no way the filesystem can handle the package manager failure case, and keeping the pages dirty and retrying indefinitely may never work (e.g. disk is dead or disconnected, is a sparse volume without any free space, etc). This (IMHO) implies that the higher layer (which knows more about what the write failure implies) needs to deal with this.</p> <blockquote> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is not eagerness to solve the problem simply by &quot;don't clear the dirty bit&quot;. For every one Postgres installation that might have a better recover after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.</p> </blockquote> <p>I don't think these necessarily are as contradictory goals as you paint them. At least in postgres' case we can deal with the fact that an fsync retry isn't going to fix the problem by reentering crash recovery or just shutting down - therefore we don't need to keep all the dirty buffers around. A per-inode or per-superblock bit that causes further fsyncs to fail would be entirely sufficent for that.</p> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. 
That's when we're majorly screwed.</p> </blockquote> <p>I think there are two issues here - &quot;fsync() on an fd that was just opened&quot; and &quot;persistent error state (without keeping dirty pages in memory)&quot;.</p> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> <p>Consider if there was a per-inode &quot;there was once an error writing this inode&quot; flag. Then fsync() would return an error on the inode forever, since there is no way in POSIX to clear this state, since it would need to be kept in case some new fd is opened on the inode and does an fsync() and wants the error to be returned.</p> <p>IMHO, the only alternative would be to keep the dirty pages in memory until they are written to disk. If that was not possible, what then? It would need a reboot to clear the dirty pages, or truncate the file (discarding all data)?</p> <blockquote> <p>Both in postgres, <em>and</em> a lot of other applications, it's not at all guaranteed to consistently have one FD open for every file written. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing.</p> </blockquote> <p>... only if the application closes all fds for the file before calling fsync. If any fd is kept open from the time of the failure, it will return the original error on fsync() (and then no longer return it).</p> <p>It's not that you need to keep every fd open forever. You could put them into a shared pool, and re-use them if the file is &quot;re-opened&quot;, and call fsync on each fd before it is closed (because the pool is getting too big or because you want to flush the data for that file, or shut down the DB). That wouldn't require a huge re-architecture of PG, just a small library to handle the shared fd pool.</p> <p>That might even improve performance, because opening and closing files is itself not free, especially if you are working with remote filesystems.</p> <blockquote> <p>You'd not even need to have per inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says &quot;I died&quot; and set that instead of keeping per inode/whatever information.</p> </blockquote> <p>The filesystem will definitely return an error in this case, I don't think this needs any kind of changes:</p> <p>int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) { if (unlikely(ext4_forced_shutdown(EXT4_SB(inode-&gt;i_sb)))) return -EIO;</p> <blockquote> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work.</p> </blockquote> <p>I find it a bit dissapointing response. I think it's fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable.</p> </blockquote> <p>Linux (as PG) is run by people who develop it for their own needs, or are paid to develop it for the needs of others. Everyone already has too much work to do, so you need to find someone who has an interest in fixing this (IMHO very peculiar) use case. 
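<p><em>[Editorial aside, not part of the archived thread: the shared fd pool Andreas suggests above might look roughly like the sketch below. The structure and names are made up for illustration; the point is only that a descriptor is never closed without an fsync() whose result is checked, so the kernel's per-fd error reporting is never bypassed:]</em></p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

#define POOL_SIZE 64

/* Hypothetical shared fd pool: an fd is only closed after fsync(),
 * and both fsync() and close() errors are surfaced to the caller. */
struct pooled_fd { char path[256]; int fd; };
static struct pooled_fd pool[POOL_SIZE];
static int pool_used;

/* Return a descriptor for path, reusing one if the file was already
 * opened through the pool. */
int pool_get(const char *path)
{
    for (int i = 0; i &lt; pool_used; i++)
        if (strcmp(pool[i].path, path) == 0)
            return pool[i].fd;
    if (pool_used == POOL_SIZE)
        return -1;                    /* a real pool would evict an entry via pool_put() */
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd &lt; 0)
        return -1;
    snprintf(pool[pool_used].path, sizeof pool[pool_used].path, &quot;%s&quot;, path);
    pool[pool_used].fd = fd;
    pool_used++;
    return fd;
}

/* Flush and close entry idx; a non-zero return means the data cannot
 * be assumed durable and the caller has to recover (e.g. WAL replay). */
int pool_put(int idx)
{
    int rc = 0;
    if (fsync(pool[idx].fd) != 0) { perror(&quot;fsync&quot;); rc = -1; }
    if (close(pool[idx].fd) != 0) { perror(&quot;close&quot;); rc = -1; }
    pool[idx] = pool[--pool_used];    /* compact the table */
    return rc;
}
</code></pre> <p><em>[Andres's reply further down notes why this is harder than it sounds for a multi-process server like Postgres.]</em></p>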
If PG developers want to add a tunable &quot;keep dirty pages in RAM on IO failure&quot;, I don't think that it would be too hard for someone to do. It might be harder to convince some of the kernel maintainers to accept it, and I've been on the losing side of that battle more than once. However, like everything you don't pay for, you can't require someone else to do this for you. It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches, would be interested to work on something like this.</p> <p>That said, <em>even</em> if a fix was available for Linux tomorrow, it would be <em>years</em> before a majority of users would have it available on their system, that includes even the errseq mechanism that was landed a few months ago. That implies to me that you'd want something that fixes PG <em>now</em> so that it works around whatever (perceived) breakage exists in the Linux fsync() implementation. Since the thread indicates that non-Linux kernels have the same fsync() behaviour, it makes sense to do that even if the Linux fix was available.</p> <blockquote> <p>2018-04-10 19:44:48 Andreas wrote:</p> <blockquote> <p>The confusion is whether fsync() is a &quot;level&quot; state (return error forever if there were pages that could not be written), or an &quot;edge&quot; state (return error only for any write failures since the previous fsync() call).</p> </blockquote> <p>I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, <em>THAT'S</em> the big issue.</p> <p>Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory fsync everything, and <em>then</em> assume you're safe. But unless I severaly misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.</p> </blockquote> <p>I can't say how common or uncommon such a workload is, though PG is the only application that I've heard of doing it, and I've been working on filesystems for 20 years. I'm a bit surprised that anyone expects fsync() on a newly-opened fd to have any state from write() calls that predate the open. I can understand fsync() returning an error for any IO that happens within the context of that fsync(), but how far should it go back for reporting errors on that file? Forever? The only way to clear the error would be to reboot the system, since I'm not aware of any existing POSIX code to clear such an error</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Thu, 12 Apr 2018 10:09:16 +1000 </code></pre> <p>On Wed, Apr 11, 2018 at 03:52:44PM -0600, Andreas Dilger wrote: &gt; On Apr 10, 2018, at 4:07 PM, Andres Freund <a href="mailto:andres@...razel.de">andres@...razel.de</a> wrote: &gt; &gt; 2018-04-10 18:43:56 Ted wrote: &gt; &gt;&gt; So for better or for worse, there has not been as much investment in &gt; &gt;&gt; buffered I/O and data robustness in the face of exception handling of &gt; &gt;&gt; storage devices. &gt; &gt; &gt; &gt; That's a bit of a cop out. It's not just databases that care. Even more &gt; &gt; basic tools like SCM, package managers and editors care whether they can &gt; &gt; proper responses back from fsync that imply things actually were synced. 
&gt; &gt; Sure, but it is mostly PG that is doing (IMHO) crazy things like writing &gt; to thousands(?) of files, closing the file descriptors, then expecting &gt; fsync() on a newly-opened fd to return a historical error.</p> <p>Yeah, this seems like a recipe for disaster, especially on cross-platform code where every OS platform behaves differently and almost never to expectation.</p> <p>And speaking of &quot;behaving differently to expectations&quot;, nobody has mentioned that close() can also return write errors. Hence if you do write - close - open - fsync the the write error might get reported on close, not fsync. IOWs, the assumption that &quot;async writeback errors will persist across close to open&quot; is fundamentally broken to begin with. It's even documented as a slient data loss vector in the close(2) man page:</p> <pre><code>$ man 2 close ..... Dealing with error returns from close() A careful programmer will check the return value of close(), since it is quite possible that errors on a previous write(2) operation are reported only on the final close() that releases the open file description. Failing to check the return value when closing a file may lead to silent loss of data. This can especially be observed with NFS and with disk quota. </code></pre> <p>Yeah, ensuring data integrity in the face of IO errors is a really hard problem. :/</p> <p>To pound the broken record: there are many good reasons why Linux filesystem developers have said &quot;you should use direct IO&quot; to the PG devs each time we have this &quot;the kernel doesn't do [complex things PG needs]&quot; discussion.</p> <p>In this case, robust IO error reporting is easy with DIO. It's one of the reasons most of the high performance database engines are either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and use O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. This is also being driven by the availability of high performance, high IOPS solid state storage where buffering in RAM to optimise IO patterns and throughput provides no real performance benefit.</p> <p>Using the AIO+DIO infrastructure ensures errors are reported for the specific write that fails at failure time (i.e. in the aio completion event for the specific IO), yet high IO throughput can be maintained without the application needing it's own threading infrastructure to prevent blocking.</p> <p>This means the application doesn't have to guess where the write error occurred to retry/recover, have to handle async write errors on close(), have to use fsync() to gather write IO errors and then infer where the IO failure was, or require kernels on every supported platform to jump through hoops to try to do exactly the right thing in error conditions for everyone in all circumstances at all times....</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Wed, 11 Apr 2018 19:17:52 -0700 </code></pre> <p>On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote:</p> <blockquote> <p>On Apr 10, 2018, at 4:07 PM, Andres Freund <a href="mailto:andres@...razel.de">andres@...razel.de</a> wrote:</p> <blockquote> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.</p> </blockquote> <p>That's a bit of a cop out. It's not just databases that care. 
Even more basic tools like SCM, package managers and editors care whether they can proper responses back from fsync that imply things actually were synced.</p> </blockquote> <p>Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error.</p> </blockquote> <p>It's not just postgres. dpkg (underlying apt, on debian derived distros) to take an example I just randomly guessed, does too:</p> <pre><code> /* We want to guarantee the extracted files are on the disk, so that the * subsequent renames to the info database do not end up with old or zero * length files in case of a system crash. As neither dpkg-deb nor tar do * explicit fsync()s, we have to do them here. * XXX: This could be avoided by switching to an internal tar extractor. */ dir_sync_contents(cidir); </code></pre> <p>(a bunch of other places too)</p> <p>Especially on ext3 but also on newer filesystems it's performancewise entirely infeasible to fsync() every single file individually - the performance becomes entirely attrocious if you do that.</p> <p>I think there's some legitimate arguments that a database should use direct IO (more on that as a reply to David), but claiming that all sorts of random utilities need to use DIO with buffering etc is just insane.</p> <blockquote> <p>If an editor tries to write a file, then calls fsync and gets an error, the user will enter a new pathname and retry the write. The package manager will assume the package installation failed, and uninstall the parts of the package that were already written.</p> </blockquote> <p>Except that they won't notice that they got a failure, at least in the dpkg case. And happily continue installing corrupted data</p> <blockquote> <p>There is no way the filesystem can handle the package manager failure case, and keeping the pages dirty and retrying indefinitely may never work (e.g. disk is dead or disconnected, is a sparse volume without any free space, etc). This (IMHO) implies that the higher layer (which knows more about what the write failure implies) needs to deal with this.</p> </blockquote> <p>Yea, I agree that'd not be sane. As far as I understand the dpkg code (all of 10min reading it), that'd also be unnecessary. It can abort the installation, but only if it detects the error. Which isn't happening.</p> <blockquote> <blockquote> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.</p> </blockquote> <p>I think there are two issues here - &quot;fsync() on an fd that was just opened&quot; and &quot;persistent error state (without keeping dirty pages in memory)&quot;.</p> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> </blockquote> <p>And that's <em>horrible</em>. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.</p> <p>Or even more extreme, you untar/zip/git clone a directory. Then do a sync. 
And you don't know whether anything actually succeeded.</p> <blockquote> <p>Consider if there was a per-inode &quot;there was once an error writing this inode&quot; flag. Then fsync() would return an error on the inode forever, since there is no way in POSIX to clear this state, since it would need to be kept in case some new fd is opened on the inode and does an fsync() and wants the error to be returned.</p> </blockquote> <p>The data in the file also is corrupt. Having to unmount or delete the file to reset the fact that it can't safely be assumed to be on disk isn't insane.</p> <blockquote> <blockquote> <p>Both in postgres, <em>and</em> a lot of other applications, it's not at all guaranteed to consistently have one FD open for every file written. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing.</p> </blockquote> <p>... only if the application closes all fds for the file before calling fsync. If any fd is kept open from the time of the failure, it will return the original error on fsync() (and then no longer return it).</p> <p>It's not that you need to keep every fd open forever. You could put them into a shared pool, and re-use them if the file is &quot;re-opened&quot;, and call fsync on each fd before it is closed (because the pool is getting too big or because you want to flush the data for that file, or shut down the DB). That wouldn't require a huge re-architecture of PG, just a small library to handle the shared fd pool.</p> </blockquote> <p>Except that postgres uses multiple processes. And works on a lot of architectures. If we started to fsync all opened files on process exit our users would <em>lynch</em> us. We'd need a complicated scheme that sends processes across sockets between processes, then deduplicate them on the receiving side, somehow figuring out which is the oldest filedescriptors (handling clockdrift safely).</p> <p>Note that it'd be perfectly fine that we've &quot;thrown away&quot; the buffer contents if we'd get notified that the fsync failed. We could just do WAL replay, and restore the contents (just was we do after crashes and/or for replication).</p> <blockquote> <p>That might even improve performance, because opening and closing files is itself not free, especially if you are working with remote filesystems.</p> </blockquote> <p>There's already a per-process cache of open files.</p> <blockquote> <blockquote> <p>You'd not even need to have per inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says &quot;I died&quot; and set that instead of keeping per inode/whatever information.</p> </blockquote> <p>The filesystem will definitely return an error in this case, I don't think this needs any kind of changes:</p> <p>int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) { if (unlikely(ext4_forced_shutdown(EXT4_SB(inode-&gt;i_sb)))) return -EIO;</p> </blockquote> <p>Well, I'm making that argument because several people argued that throwing away buffer contents in this case is the only way to not cause OOMs, and that that's incompatible with reporting errors. 
It's clearly not...</p> <blockquote> <blockquote> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work.</p> </blockquote> <p>I find it a bit dissapointing response. I think it's fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable.</p> </blockquote> <p>Linux (as PG) is run by people who develop it for their own needs, or are paid to develop it for the needs of others.</p> </blockquote> <p>Sure.</p> <blockquote> <p>Everyone already has too much work to do, so you need to find someone who has an interest in fixing this (IMHO very peculiar) use case. If PG developers want to add a tunable &quot;keep dirty pages in RAM on IO failure&quot;, I don't think that it would be too hard for someone to do. It might be harder to convince some of the kernel maintainers to accept it, and I've been on the losing side of that battle more than once. However, like everything you don't pay for, you can't require someone else to do this for you. It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches, would be interested to work on something like this.</p> </blockquote> <p>I don't think this is that PG specific, as explained above.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Wed, 11 Apr 2018 19:32:21 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:</p> <blockquote> <p>To pound the broken record: there are many good reasons why Linux filesystem developers have said &quot;you should use direct IO&quot; to the PG devs each time we have this &quot;the kernel doesn't do [complex things PG needs]&quot; discussion.</p> </blockquote> <p>I personally am on board with doing that. But you also gotta recognize that an efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to do be able to do the same on windows (solaris, hpux, aix, ...) etc.</p> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).</p> <blockquote> <p>In this case, robust IO error reporting is easy with DIO. It's one of the reasons most of the high performance database engines are either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and use O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. 
This is also being driven by the availability of high performance, high IOPS solid state storage where buffering in RAM to optimise IO patterns and throughput provides no real performance benefit.</p> <p>Using the AIO+DIO infrastructure ensures errors are reported for the specific write that fails at failure time (i.e. in the aio completion event for the specific IO), yet high IO throughput can be maintained without the application needing it's own threading infrastructure to prevent blocking.</p> <p>This means the application doesn't have to guess where the write error occurred to retry/recover, have to handle async write errors on close(), have to use fsync() to gather write IO errors and then infer where the IO failure was, or require kernels on every supported platform to jump through hoops to try to do exactly the right thing in error conditions for everyone in all circumstances at all times....</p> </blockquote> <p>Most of that sounds like a good thing to do, but you got to recognize that that's a lot of linux specific code.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Wed, 11 Apr 2018 19:51:13 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-11 19:32:21 -0700, Andres Freund wrote:</p> <blockquote> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).</p> </blockquote> <p>And before somebody argues that that's a too small window to trigger the problem realistically: Restoring large databases happens pretty commonly (for new replicas, testcases, or actual fatal issues), takes time, and it's where a lot of storage is actually written to for the first time in a while, so it's far from unlikely to trigger bad block errors or such.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Wed, 11 Apr 2018 20:02:48 -0700 </code></pre> <p>On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:</p> <blockquote> <blockquote> <blockquote> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.</p> </blockquote> <p>I think there are two issues here - &quot;fsync() on an fd that was just opened&quot; and &quot;persistent error state (without keeping dirty pages in memory)&quot;.</p> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> </blockquote> <p>And that's <em>horrible</em>. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.</p> </blockquote> <p>At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. 
We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open(&quot;file&quot;)); would report the most recent error.</p> <p>That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's &quot;good enough&quot; to only support it for recent files.</p> <p>Jeff, what do you think?</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 12 Apr 2018 01:09:24 -0400 </code></pre> <p>On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:</p> <blockquote> <p>Most of that sounds like a good thing to do, but you got to recognize that that's a lot of linux specific code.</p> </blockquote> <p>I know it's not what PG has chosen, but realistically all of the other major databases and userspace based storage systems have used DIO precisely <em>because</em> it's the way to avoid OS-specific behavior or require OS-specific code. DIO is simple, and pretty much the same everywhere.</p> <p>In contrast, the exact details of how buffered I/O workrs can be quite different on different OS's. This is especially true if you take performance related details (e.g., the cleaning algorithm, how pages get chosen for eviction, etc.)</p> <p>As I read the PG-hackers thread, I thought I saw acknowledgement that some of the behaviors you don't like with Linux also show up on other Unix or Unix-like systems?</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 12 Apr 2018 01:34:45 -0400 </code></pre> <p>On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:</p> <blockquote> <blockquote> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> </blockquote> <p>And that's <em>horrible</em>. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.</p> </blockquote> <p>If there is no open file descriptor, and in many cases, no process (because it has already exited), it may be horrible, but what the h*ll else do you expect the OS to do?</p> <p>The solution we use at Google is that we watch for I/O errors using a completely different process that is responsible for monitoring machine health. It used to scrape dmesg, but we now arrange to have I/O errors get sent via a netlink channel to the machine health monitoring daemon. If it detects errors on a particular hard drive, it tells the cluster file system to stop using that disk, and to reconstruct from erasure code all of the data chunks on that disk onto other disks in the cluster. 
We then run a series of disk diagnostics to make sure we find all of the bad sectors (every often, where there is one bad sector, there are several more waiting to be found), and then afterwards, put the disk back into service.</p> <p>By making it be a separate health monitoring process, we can have HDD experts write much more sophisticated code that can ask the disk firmware for more information (e.g., SMART, the grown defect list), do much more careful scrubbing of the disk media, etc., before returning the disk back to service.</p> <blockquote> <blockquote> <p>Everyone already has too much work to do, so you need to find someone who has an interest in fixing this (IMHO very peculiar) use case. If PG developers want to add a tunable &quot;keep dirty pages in RAM on IO failure&quot;, I don't think that it would be too hard for someone to do. It might be harder to convince some of the kernel maintainers to accept it, and I've been on the losing side of that battle more than once. However, like everything you don't pay for, you can't require someone else to do this for you. It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches, would be interested to work on something like this.</p> </blockquote> <p>I don't think this is that PG specific, as explained above.</p> </blockquote> <p>The reality is that recovering from disk errors is tricky business, and I very much doubt most userspace applications, including distro package managers, are going to want to engineer for trying to detect and recover from disk errors. If that were true, then Red Hat and/or SuSE have kernel engineers, and they would have implemented everything everything on your wish list. They haven't, and that should tell you something.</p> <p>The other reality is that once a disk starts developing errors, in reality you will probably need to take the disk off-line, scrub it to find any other media errors, and there's a good chance you'll need to rewrite bad sectors (incluing some which are on top of file system metadata, so you probably will have to run fsck or reformat the whole file system). I certainly don't think it's realistic to assume adding lots of sophistication to each and every userspace program.</p> <p>If you have tens or hundreds of thousands of disk drives, then you will need to do tsomething automated, but I claim that you really don't want to smush all of that detailed exception handling and HDD repair technology into each database or cluster file system component. It really needs to be done in a separate health-monitor and machine-level management system.</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Thu, 12 Apr 2018 15:45:36 +1000 </code></pre> <p>On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:</p> <blockquote> <p>To pound the broken record: there are many good reasons why Linux filesystem developers have said &quot;you should use direct IO&quot; to the PG devs each time we have this &quot;the kernel doesn't do <complex things PG needs>&quot; discussion.</p> </blockquote> <p>I personally am on board with doing that. But you also gotta recognize that an efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. 
I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to do be able to do the same on windows (solaris, hpux, aix, ...) etc.</p> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea.</p> </blockquote> <p>Yes it is.</p> <p>This is what syncfs() is for - making sure a large amount of of data and metadata spread across many files and subdirectories in a single filesystem is pushed to stable storage in the most efficient manner possible.</p> <blockquote> <p>Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).</p> </blockquote> <p>No, Just saying fsyncing individual files and directories is about the most inefficient way you could possible go about doing this.</p> <hr> <pre><code>From: Lukas Czerner &lt;lczerner@...hat.com&gt; Date: Thu, 12 Apr 2018 12:19:26 +0200 </code></pre> <p>On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:</p> <blockquote> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).</p> </blockquote> <p>Does not seem like a problem to me, just checksum the thing if you really need to be extra safe. You should probably be doing it anyway if you backup / archive / timetravel / whatnot.</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Thu, 12 Apr 2018 07:09:14 -0400 </code></pre> <p>On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:</p> <blockquote> <blockquote> <blockquote> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.</p> </blockquote> <p>I think there are two issues here - &quot;fsync() on an fd that was just opened&quot; and &quot;persistent error state (without keeping dirty pages in memory)&quot;.</p> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> </blockquote> <p>And that's <em>horrible</em>. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.</p> </blockquote> </blockquote> <p>What are you expecting to happen in this case? Are you expecting a read error due to a writeback failure? 
Or are you just saying that we should be invalidating pages that failed to be written back, so that they can be re-read?</p> <blockquote> <p>At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open(&quot;file&quot;)); would report the most recent error.</p> <p>That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's &quot;good enough&quot; to only support it for recent files.</p> <p>Jeff, what do you think?</p> </blockquote> <p>I hate it :). We could do that, but....yecchhhh.</p> <p>Reporting errors only in the case where the inode happened to stick around in the cache seems too unreliable for real-world usage, and might be problematic for some use cases. I'm also not sure it would really be helpful.</p> <p>I think the crux of the matter here is not really about error reporting, per-se. I asked this at LSF last year, and got no real answer:</p> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> <p>Maybe that's ok in the face of a writeback error though? IDK.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Thu, 12 Apr 2018 04:19:48 -0700 </code></pre> <p>On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open(&quot;file&quot;)); would report the most recent error.</p> <p>That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's &quot;good enough&quot; to only support it for recent files.</p> <p>Jeff, what do you think?</p> </blockquote> <p>I hate it :). We could do that, but....yecchhhh.</p> <p>Reporting errors only in the case where the inode happened to stick around in the cache seems too unreliable for real-world usage, and might be problematic for some use cases. I'm also not sure it would really be helpful.</p> </blockquote> <p>Yeah, it's definitely half-arsed. We could make further changes to improve the situation, but they'd have wider impact. For example, we can tell if the error has been sampled by any existing fd, so we could bias our inode reaping to have inodes with unreported errors stick around in the cache for longer.</p> <blockquote> <p>I think the crux of the matter here is not really about error reporting, per-se. I asked this at LSF last year, and got no real answer:</p> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. 
Is that the right thing to do?</p> </blockquote> <p>I suspect it isn't. If there's a transient error then we should reattempt the write. OTOH if the error is permanent then reattempting the write isn't going to do any good and it's just going to cause the drive to go through the whole error handling dance again. And what do we do if we're low on memory and need these pages back to avoid going OOM? There's a lot of options here, all of them bad in one situation or another.</p> <blockquote> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> <p>Maybe that's ok in the face of a writeback error though? IDK.</p> </blockquote> <p>I don't know either. It'd force the application to face up to the fact that the data is gone immediately rather than only finding it out after a reboot. Again though that might cause more problems than it solves. It's hard to know what the right thing to do is.</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Thu, 12 Apr 2018 07:24:12 -0400 </code></pre> <p>On Thu, 2018-04-12 at 15:45 +1000, Dave Chinner wrote:</p> <blockquote> <p>On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:</p> <blockquote> <p>To pound the broken record: there are many good reasons why Linux filesystem developers have said &quot;you should use direct IO&quot; to the PG devs each time we have this &quot;the kernel doesn't do <complex things PG needs>&quot; discussion.</p> </blockquote> <p>I personally am on board with doing that. But you also gotta recognize that an efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to do be able to do the same on windows (solaris, hpux, aix, ...) etc.</p> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea.</p> </blockquote> <p>Yes it is.</p> <p>This is what syncfs() is for - making sure a large amount of of data and metadata spread across many files and subdirectories in a single filesystem is pushed to stable storage in the most efficient manner possible.</p> </blockquote> <p>Just note that the error return from syncfs is somewhat iffy. It doesn't necessarily return an error when one inode fails to be written back. I think it mainly returns errors when you get a metadata writeback error.</p> <blockquote> <blockquote> <p>Or even just cp -r ing it, and then starting up a copy of the database. 
What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).</p> </blockquote> <p>No, Just saying fsyncing individual files and directories is about the most inefficient way you could possible go about doing this.</p> </blockquote> <p>You can still use syncfs but what you'd probably have to do is call syncfs while you still hold all of the fd's open, and then fsync each one afterward to ensure that they all got written back properly. That should work as you'd expect.</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Thu, 12 Apr 2018 22:01:22 +1000 </code></pre> <p>On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:</p> <blockquote> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> </blockquote> <p>There isn't a right thing. Whatever we do will be wrong for someone.</p> <blockquote> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> </blockquote> <p>Not to mention a POSIX IO ordering violation. Seeing stale data after a &quot;successful&quot; write is simply not allowed.</p> <blockquote> <p>Maybe that's ok in the face of a writeback error though? IDK.</p> </blockquote> <p>No matter what we do for async writeback error handling, it will be slightly different from filesystem to filesystem, not to mention OS to OS. The is no magic bullet here, so I'm not sure we should worry too much. There's direct IO for anyone who cares that need to know about the completion status of every single write IO....</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 12 Apr 2018 11:16:46 -0400 </code></pre> <p>On Thu, Apr 12, 2018 at 10:01:22PM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:</p> <blockquote> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> </blockquote> <p>There isn't a right thing. Whatever we do will be wrong for someone.</p> </blockquote> <p>That's the problem. The best that could be done (and it's not enough) would be to have a mode which does with the PG folks want (or what they <em>think</em> they want). It seems what they want is to have an error result in the page being marked clean. When they discover the outcome (OOM-city and the unability to unmount a file system on a failed drive), then they will complain to us <em>again</em>, at which point we can tell them that want they really want is another variation on O_PONIES, and welcome to the real world and real life.</p> <p>Which is why, even if they were to pay someone to implement what they want, I'm not sure we would want to accept it upstream --- or distro's might consider it a support nightmare, and refuse to allow that mode to be enabled on enterprise distro's. 
But at least, it will have been some PG-based company who will have implemented it, so they're not wasting other people's time or other people's resources...</p> <p>We could try to get something like what Google is doing upstream, which is to have the I/O errors sent to userspace via a netlink channel (without changing anything else about how buffered writeback is handled in the face of errors). Then userspace applications could switch to Direct I/O like all of the other really serious userspace storage solutions I'm aware of, and then someone could try to write some kind of HDD health monitoring system that tries to do the right thing when a disk is discovered to have developed some media errors or something more serious (e.g., a head failure). That plus some kind of RAID solution is I think the only thing which is really realistic for a typical PG site.</p> <p>It's certainly that's what <em>I</em> would do if I didn't decide to use a hosted cloud solution, such as Cloud SQL for Postgres, and let someone else solve the really hard problems of dealing with real-world HDD failures. :-)</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Thu, 12 Apr 2018 11:08:50 -0400 </code></pre> <p>On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:</p> <blockquote> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> </blockquote> <p>There isn't a right thing. Whatever we do will be wrong for someone.</p> <blockquote> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> </blockquote> <p>Not to mention a POSIX IO ordering violation. Seeing stale data after a &quot;successful&quot; write is simply not allowed.</p> </blockquote> <p>I'm not so sure here, given that we're dealing with an error condition. Are we really obligated not to allow any changes to pages that we can't write back?</p> <p>Given that the pages are clean after these failures, we aren't doing this even today:</p> <p>Suppose we're unable to do writes but can do reads vs. the backing store. After a wb failure, the page has the dirty bit cleared. If it gets kicked out of the cache before the read occurs, it'll have to be faulted back in. Poof -- your write just disappeared.</p> <p>That can even happen before you get the chance to call fsync, so even a write()+read()+fsync() is not guaranteed to be safe in this regard today, given sufficient memory pressure.</p> <p>I think the current situation is fine from a &quot;let's not OOM at all costs&quot; standpoint, but not so good for application predictability. We should really consider ways to do better here.</p> <blockquote> <blockquote> <p>Maybe that's ok in the face of a writeback error though? IDK.</p> </blockquote> <p>No matter what we do for async writeback error handling, it will be slightly different from filesystem to filesystem, not to mention OS to OS. The is no magic bullet here, so I'm not sure we should worry too much. 
There's direct IO for anyone who cares that need to know about the completion status of every single write IO....</p> </blockquote> <p>I think we we have an opportunity here to come up with better defined and hopefully more useful behavior for buffered I/O in the face of writeback errors. The first step would be to hash out what we'd want it to look like.</p> <p>Maybe we need a plenary session at LSF/MM?</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 12:46:27 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-12 12:19:26 +0200, Lukas Czerner wrote:</p> <blockquote> <p>On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:</p> <blockquote> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).</p> </blockquote> <p>Does not seem like a problem to me, just checksum the thing if you really need to be extra safe. You should probably be doing it anyway if you backup / archive / timetravel / whatnot.</p> </blockquote> <p>That doesn't really help, unless you want to sync() and then re-read all the data to make sure it's the same. Rereading multi-TB backups just to know whether there was an error that the OS knew about isn't particularly fun. Without verifying after sync it's not going to improve the situation measurably, you're still only going to discover that $data isn't available when it's needed.</p> <p>What you're saying here is that there's no way to use standard linux tools to manipulate files and know whether it failed, without filtering kernel logs for IO errors. Or am I missing something?</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 12:55:36 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-12 01:34:45 -0400, Theodore Y. Ts'o wrote:</p> <blockquote> <p>The solution we use at Google is that we watch for I/O errors using a completely different process that is responsible for monitoring machine health. It used to scrape dmesg, but we now arrange to have I/O errors get sent via a netlink channel to the machine health monitoring daemon.</p> </blockquote> <p>Any pointers to that the underling netlink mechanism? If we can force postgres to kill itself when such an error is detected (via a dedicated monitoring process), I'd personally be happy enough. It'd be nicer if we could associate that knowledge with particular filesystems etc (which'd possibly hard through dm etc?), but this'd be much better than nothing.</p> <blockquote> <p>The reality is that recovering from disk errors is tricky business, and I very much doubt most userspace applications, including distro package managers, are going to want to engineer for trying to detect and recover from disk errors. If that were true, then Red Hat and/or SuSE have kernel engineers, and they would have implemented everything everything on your wish list. They haven't, and that should tell you something.</p> </blockquote> <p>The problem really isn't about <em>recovering</em> from disk errors. <em>Knowing</em> about them is the crucial part. 
We do not want to give back clients the information that an operation succeeded, when it actually didn't. There could be improvements above that, but as long as it's guaranteed that &quot;we&quot; get the error (rather than just some kernel log we don't have access to, which looks different due to config etc), it's ok. We can throw our hands up in the air and give up.</p> <blockquote> <p>The other reality is that once a disk starts developing errors, in reality you will probably need to take the disk off-line, scrub it to find any other media errors, and there's a good chance you'll need to rewrite bad sectors (incluing some which are on top of file system metadata, so you probably will have to run fsck or reformat the whole file system). I certainly don't think it's realistic to assume adding lots of sophistication to each and every userspace program.</p> <p>If you have tens or hundreds of thousands of disk drives, then you will need to do tsomething automated, but I claim that you really don't want to smush all of that detailed exception handling and HDD repair technology into each database or cluster file system component. It really needs to be done in a separate health-monitor and machine-level management system.</p> </blockquote> <p>Yea, agreed on all that. I don't think anybody actually involved in postgres wants to do anything like that. Seems far outside of postgres' remit.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 13:13:22 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-12 11:16:46 -0400, Theodore Y. Ts'o wrote:</p> <blockquote> <p>That's the problem. The best that could be done (and it's not enough) would be to have a mode which does with the PG folks want (or what they <em>think</em> they want). It seems what they want is to have an error result in the page being marked clean. When they discover the outcome (OOM-city and the unability to unmount a file system on a failed drive), then they will complain to us <em>again</em>, at which point we can tell them that want they really want is another variation on O_PONIES, and welcome to the real world and real life.</p> </blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient. I don't see that that'd realistically would trigger OOM or the inability to unmount a filesystem. If the drive is entirely gone there's obviously no point in keeping per-file information around, so per-blockdev/fs information suffices entirely to return an error on fsync (which at least on ext4 appears to happen if the underlying blockdev is gone).</p> <p>Have fun making up things we want, but I'm not sure it's particularly productive.</p> <blockquote> <p>Which is why, even if they were to pay someone to implement what they want, I'm not sure we would want to accept it upstream --- or distro's might consider it a support nightmare, and refuse to allow that mode to be enabled on enterprise distro's. 
But at least, it will have been some PG-based company who will have implemented it, so they're not wasting other people's time or other people's resources...</p> </blockquote> <p>Well, that's why I'm discussing here so we can figure out what's acceptable before considering wasting money and revew cycles doing or paying somebody to do some crazy useless shit.</p> <blockquote> <p>We could try to get something like what Google is doing upstream, which is to have the I/O errors sent to userspace via a netlink channel (without changing anything else about how buffered writeback is handled in the face of errors).</p> </blockquote> <p>Ah, darn. After you'd mentioned that in an earlier mail I'd hoped that'd be upstream. And yes, that'd be perfect.</p> <blockquote> <p>Then userspace applications could switch to Direct I/O like all of the other really serious userspace storage solutions I'm aware of, and then someone could try to write some kind of HDD health monitoring system that tries to do the right thing when a disk is discovered to have developed some media errors or something more serious (e.g., a head failure). That plus some kind of RAID solution is I think the only thing which is really realistic for a typical PG site.</p> </blockquote> <p>As I said earlier, I think there's good reason to move to DIO for postgres. But to keep that performant is going to need some serious work.</p> <p>But afaict such a solution wouldn't really depend on applications using DIO or not. Before finishing a checkpoint (logging it persistently and allowing to throw older data away), we could check if any errors have been reported and give up if there have been any. And after starting postgres on a directory restored from backup using $tool, we can fsync the directory recursively, check for such errors, and give up if there've been any.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 13:24:57 -0700 </code></pre> <p>On 2018-04-12 07:09:14 -0400, Jeff Layton wrote:</p> <blockquote> <p>On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:</p> <blockquote> <blockquote> <blockquote> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.</p> </blockquote> <p>I think there are two issues here - &quot;fsync() on an fd that was just opened&quot; and &quot;persistent error state (without keeping dirty pages in memory)&quot;.</p> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> </blockquote> <p>And that's <em>horrible</em>. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.</p> </blockquote> </blockquote> <p>What are you expecting to happen in this case? Are you expecting a read error due to a writeback failure? Or are you just saying that we should be invalidating pages that failed to be written back, so that they can be re-read?</p> </blockquote> <p>Yes, I'd hope for a read error after a writeback failure. 
I think that's sane behaviour. But I don't really care <em>that</em> much.</p> <p>At the very least <em>some</em> way to <em>know</em> that such a failure occurred from userland without having to parse the kernel log. As far as I understand, neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report an error if it was encountered by writeback in the background.</p> <p>If that's indeed true for syncfs(2), even if the fd has been opened before (which I can see how it could happen from an implementation POV, nothing would associate a random FD with failures on different files), it's really impossible to detect this stuff from userland without text parsing.</p> <p>Even if it'd were just a perf-fs /sys/$something file that'd return the current count of unreported errors in a filesystem independent way, it'd be better than what we have right now.</p> <pre><code>1) figure out /sys/$whatnot $directory belongs to
2) oldcount=$(cat /sys/$whatnot/unreported_errors)
3) filesystem operations in $directory
4) sync;sync;
5) newcount=$(cat /sys/$whatnot/unreported_errors)
6) test &quot;$oldcount&quot; -eq &quot;$newcount&quot; || die-with-horrible-message
</code></pre> <p>Isn't beautiful to script, but it's also not absolutely terrible.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Thu, 12 Apr 2018 13:28:30 -0700 </code></pre> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> <blockquote> <p>I don't see that that'd realistically would trigger OOM or the inability to unmount a filesystem.</p> </blockquote> <p>Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 14:11:45 -0700 </code></pre> <p>On 2018-04-12 07:24:12 -0400, Jeff Layton wrote:</p> <blockquote> <p>On Thu, 2018-04-12 at 15:45 +1000, Dave Chinner wrote:</p> <blockquote> <p>On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:</p> <blockquote> <p>To pound the broken record: there are many good reasons why Linux filesystem developers have said &quot;you should use direct IO&quot; to the PG devs each time we have this &quot;the kernel doesn't do <complex things PG needs>&quot; discussion.</p> </blockquote> <p>I personally am on board with doing that. But you also gotta recognize that an efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to do be able to do the same on windows (solaris, hpux, aix, ...) 
etc.</p> <p>And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea.</p> </blockquote> <p>Yes it is.</p> <p>This is what syncfs() is for - making sure a large amount of of data and metadata spread across many files and subdirectories in a single filesystem is pushed to stable storage in the most efficient manner possible.</p> </blockquote> </blockquote> <p>syncfs isn't standardized, it operates on an entire filesystem (thus writing out unnecessary stuff), it has no meaningful documentation of it's return codes. Yes, using syncfs() might better performancewise, but it doesn't seem like it actually solves anything, performance aside:</p> <blockquote> <p>Just note that the error return from syncfs is somewhat iffy. It doesn't necessarily return an error when one inode fails to be written back. I think it mainly returns errors when you get a metadata writeback error.</p> <p>You can still use syncfs but what you'd probably have to do is call syncfs while you still hold all of the fd's open, and then fsync each one afterward to ensure that they all got written back properly. That should work as you'd expect.</p> </blockquote> <p>Which again doesn't allow one to use any non-bespoke tooling (like tar or whatnot). And it means you'll have to call syncfs() every few hundred files, because you'll obviously run into filehandle limitations.</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Thu, 12 Apr 2018 17:14:54 -0400 </code></pre> <p>On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. That we add a per- superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> </blockquote> <p>Not a bad idea and shouldn't be too costly. mapping_set_error could flag the superblock one before or after the one in the mapping.</p> <p>We'd need to define what happens if you interleave fsync and syncfs calls on the same inode though. How do we handle file-&gt;f_wb_err in that case? Would we need a second field in struct file to act as the per-sb error cursor?</p> <blockquote> <blockquote> <p>I don't see that that'd realistically would trigger OOM or the inability to unmount a filesystem.</p> </blockquote> <p>Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.</p> </blockquote> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 12 Apr 2018 17:21:44 -0400 </code></pre> <p>On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. 
That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> </blockquote> <p>When or how would the per-superblock wb_err flag get cleared?</p> <p>Would all subsequent fsync() calls on that file system now return EIO? Or would only all subsequent syncfs() calls return EIO?</p> <blockquote> <blockquote> <p>I don't see that that'd realistically would trigger OOM or the inability to unmount a filesystem.</p> </blockquote> <p>Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.</p> </blockquote> <p>Actually, I was referring to the pg-hackers original ask, which was that after an error, all of the dirty pages that couldn't be written out would stay dirty.</p> <p>If it's only as single inode which is pinned in memory with the dirty flag, that's bad, but it's not as bad as pinning all of the memory pages for which there was a failed write. We would still need to invent some mechanism or define some semantic when it would be OK to clear the per-inode flag and let the memory associated with that pinned inode get released, though.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Thu, 12 Apr 2018 14:24:32 -0700 </code></pre> <p>On Thu, Apr 12, 2018 at 05:21:44PM -0400, Theodore Y. Ts'o wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> </blockquote> <p>When or how would the per-superblock wb_err flag get cleared?</p> </blockquote> <p>That's not how errseq works, Ted ;-)</p> <blockquote> <p>Would all subsequent fsync() calls on that file system now return EIO? Or would only all subsequent syncfs() calls return EIO?</p> </blockquote> <p>Only ones which occur after the last sampling get reported through this particular file descriptor.</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Thu, 12 Apr 2018 17:27:54 -0400 </code></pre> <p>On Thu, 2018-04-12 at 13:24 -0700, Andres Freund wrote:</p> <blockquote> <p>On 2018-04-12 07:09:14 -0400, Jeff Layton wrote:</p> <blockquote> <p>On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:</p> <blockquote> <blockquote> <blockquote> <p>While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. 
That's when we're majorly screwed.</p> </blockquote> <p>I think there are two issues here - &quot;fsync() on an fd that was just opened&quot; and &quot;persistent error state (without keeping dirty pages in memory)&quot;.</p> <p>If there is background data writeback <em>without an open file descriptor</em>, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.</p> </blockquote> <p>And that's <em>horrible</em>. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.</p> </blockquote> </blockquote> <p>What are you expecting to happen in this case? Are you expecting a read error due to a writeback failure? Or are you just saying that we should be invalidating pages that failed to be written back, so that they can be re-read?</p> </blockquote> <p>Yes, I'd hope for a read error after a writeback failure. I think that's sane behaviour. But I don't really care <em>that</em> much.</p> </blockquote> <p>I'll have to respectfully disagree. Why should I interpret an error on a read() syscall to mean that writeback failed? Note that the data is still potentially intact.</p> <p>What <em>might</em> make sense, IMO, is to just invalidate the pages that failed to be written back. Then you could potentially do a read to fault them in again (i.e. sync the pagecache and the backing store) and possibly redirty them for another try.</p> <p>Note that you can detect this situation by checking the return code from fsync. It should report the latest error once per file description.</p> <blockquote> <p>At the very least <em>some</em> way to <em>know</em> that such a failure occurred from userland without having to parse the kernel log. As far as I understand, neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report an error if it was encountered by writeback in the background.</p> <p>If that's indeed true for syncfs(2), even if the fd has been opened before (which I can see how it could happen from an implementation POV, nothing would associate a random FD with failures on different files), it's really impossible to detect this stuff from userland without text parsing.</p> </blockquote> <p>syncfs could use some work.</p> <p>I'm warming to willy's idea to add a per-sb errseq_t. I think that might be a simple way to get better semantics here. Not sure how we want to handle the reporting end yet though...</p> <p>We probably also need to consider how to better track metadata writeback errors (on e.g. ext2). 
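<p>(An aside to make the fsync()-checking pattern Jeff describes above concrete: below is a minimal sketch, not from the thread, of what a program relying on errseq-style reporting does. The file name is made up and error handling is abbreviated; the comments flag the assumptions.)</p> <pre><code>/* Sketch: durability checking against a long-lived descriptor.
 * &quot;datafile&quot; is a made-up path for illustration. */
#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd = open(&quot;datafile&quot;, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd &lt; 0) { perror(&quot;open&quot;); return 1; }

    const char buf[] = &quot;some record\n&quot;;
    if (write(fd, buf, sizeof buf - 1) &lt; 0) { perror(&quot;write&quot;); return 1; }

    /* With errseq-style reporting, this descriptor is told once about a
     * writeback error that happened since its last check.  A descriptor
     * opened only after the error may never see it, which is the gap
     * being discussed in the thread. */
    if (fsync(fd) &lt; 0) {
        /* EIO/ENOSPC here means anything written since the last
         * successful fsync() on this fd may not be on stable storage.
         * A retry will usually return 0, so a clean retry is not
         * evidence that the data made it out. */
        fprintf(stderr, &quot;fsync: %s -- treating recent writes as lost\n&quot;,
                strerror(errno));
        return 1;
    }
    close(fd);
    return 0;
}
</code></pre>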
We don't really do that properly at quite yet either.</p> <blockquote> <p>Even if it'd were just a perf-fs /sys/$something file that'd return the current count of unreported errors in a filesystem independent way, it'd be better than what we have right now.</p> <p>1) figure out /sys/$whatnot $directory belongs to 2) oldcount=$(cat /sys/$whatnot/unreported_errors) 3) filesystem operations in $directory 4) sync;sync; 5) newcount=$(cat /sys/$whatnot/unreported_errors) 6) test &quot;$oldcount&quot; -eq &quot;$newcount&quot; || die-with-horrible-message</p> <p>Isn't beautiful to script, but it's also not absolutely terrible.</p> </blockquote> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Thu, 12 Apr 2018 14:31:10 -0700 </code></pre> <p>On Thu, Apr 12, 2018 at 05:14:54PM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. That we add a per- superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> </blockquote> <p>Not a bad idea and shouldn't be too costly. mapping_set_error could flag the superblock one before or after the one in the mapping.</p> <p>We'd need to define what happens if you interleave fsync and syncfs calls on the same inode though. How do we handle file-&gt;f_wb_err in that case? Would we need a second field in struct file to act as the per-sb error cursor?</p> </blockquote> <p>Ooh. I hadn't thought that through. Bleh. I don't want to add a field to struct file for this uncommon case.</p> <p>Maybe O_PATH could be used for this? It gets you a file descriptor on a particular filesystem, so syncfs() is defined, but it can't report a writeback error. So if you open something O_PATH, you can use the file's f_wb_err for the mapping's error cursor.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 14:37:56 -0700 </code></pre> <p>On 2018-04-12 17:21:44 -0400, Theodore Y. Ts'o wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> </blockquote> <p>When or how would the per-superblock wb_err flag get cleared?</p> </blockquote> <p>I don't think unmount + resettable via /sys would be an insane approach. Requiring explicit action to acknowledge data loss isn't a crazy concept. But I think that's something reasonable minds could disagree with.</p> <blockquote> <p>Would all subsequent fsync() calls on that file system now return EIO? 
Or would only all subsequent syncfs() calls return EIO?</p> </blockquote> <p>If it were tied to syncfs, I wonder if there's a way to have some errseq type logic. Store a per superblock (or whatever equivalent thing) errseq value of errors. For each fd calling syncfs() report the error once, but then store the current value in a separate per-fd field. And if that's considered too weird, only report the errors to fds that have been opened from before the error occurred.</p> <p>I can see writing a tool 'pg_run_and_sync /directo /ries -- command' which opens an fd for each of the filesystems the directories reside on, and calls syncfs() after. That'd allow to use backup/restore tools at least semi safely.</p> <blockquote> <blockquote> <blockquote> <p>I don't see that that'd realistically would trigger OOM or the inability to unmount a filesystem.</p> </blockquote> <p>Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.</p> </blockquote> <p>Actually, I was referring to the pg-hackers original ask, which was that after an error, all of the dirty pages that couldn't be written out would stay dirty.</p> </blockquote> <p>Well, it's an open list, everyone can argue. And initially people at first didn't know the OOM explanation, and then it takes some time to revise ones priors :). I think it's a design question that reasonable people can disagree upon (if &quot;hot&quot; removed devices are handled by throwing data away regardless, at least). But as it's clearly not something viable, we can move on to something that can solve the problem.</p> <blockquote> <p>If it's only as single inode which is pinned in memory with the dirty flag, that's bad, but it's not as bad as pinning all of the memory pages for which there was a failed write. We would still need to invent some mechanism or define some semantic when it would be OK to clear the per-inode flag and let the memory associated with that pinned inode get released, though.</p> </blockquote> <p>Yea, I agree that that's not obvious. One way would be to say that it's only automatically cleared when you unlink the file. A bit heavyhanded, but not too crazy.</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 12 Apr 2018 17:52:52 -0400 </code></pre> <p>On Thu, Apr 12, 2018 at 12:55:36PM -0700, Andres Freund wrote:</p> <blockquote> <p>Any pointers to that the underling netlink mechanism? If we can force postgres to kill itself when such an error is detected (via a dedicated monitoring process), I'd personally be happy enough. It'd be nicer if we could associate that knowledge with particular filesystems etc (which'd possibly hard through dm etc?), but this'd be much better than nothing.</p> </blockquote> <p>Yeah, sorry, it never got upstreamed. It's not really all that complicated, it was just that there were some other folks who wanted to do something similar, and there was a round of bike-shedding several years ago, and nothing ever went upstream. Part of the problem was that our original scheme sent up information about file system-level corruption reports --- e.g, those stemming from calls to ext4_error() --- and lots of people had different ideas about how to get all of the possible information up in some structured format. 
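<p>(Andres's hypothetical pg_run_and_sync wrapper, sketched a few messages up, might look roughly like this. It is illustrative only: it assumes a single target directory, and, as Jeff notes elsewhere in the thread, whether syncfs() actually reports data writeback errors is iffy on the kernels being discussed, so a clean return here proves less than you'd like.)</p> <pre><code>/* Sketch of a &quot;run a command, then flush the whole filesystem&quot; wrapper.
 * Assumes one directory argument; real usage would handle several
 * filesystems and actually run the wrapped command. */
#define _GNU_SOURCE
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(int argc, char **argv)
{
    const char *dir = argc &gt; 1 ? argv[1] : &quot;.&quot;;  /* directory on the target fs */

    int fd = open(dir, O_RDONLY | O_DIRECTORY);
    if (fd &lt; 0) { perror(&quot;open&quot;); return 1; }

    /* ... run the restore / untar / cp -r step here ... */

    if (syncfs(fd) &lt; 0) {   /* push the whole filesystem to stable storage */
        perror(&quot;syncfs&quot;);
        return 1;           /* don't trust the restored data */
    }
    close(fd);
    return 0;
}
</code></pre>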
(Think something like uerf from Digtial's OSF/1.)</p> <p>We did something <em>really</em> simple/stupid. We just sent essentially an ascii test string out the netlink socket. That's because what we were doing before was essentially scraping the output of dmesg (e.g. /dev/kmssg).</p> <p>That's actually probably the simplest thing to do, and it has the advantage that it will work even on ancient enterprise kernels that PG users are likely to want to use. So you will need to implement the dmesg text scraper anyway, and that's probably good enough for most use cases.</p> <blockquote> <p>The problem really isn't about <em>recovering</em> from disk errors. <em>Knowing</em> about them is the crucial part. We do not want to give back clients the information that an operation succeeded, when it actually didn't. There could be improvements above that, but as long as it's guaranteed that &quot;we&quot; get the error (rather than just some kernel log we don't have access to, which looks different due to config etc), it's ok. We can throw our hands up in the air and give up.</p> </blockquote> <p>Right, it's a little challenging because the actual regexp's you would need to use do vary from device driver to device driver. Fortunately nearly everything is a SCSI/SATA device these days, so there isn't <em>that</em> much variability.</p> <blockquote> <p>Yea, agreed on all that. I don't think anybody actually involved in postgres wants to do anything like that. Seems far outside of postgres' remit.</p> </blockquote> <p>Some people on the pg-hackers list were talking about wanting to retry the fsync() and hoping that would cause the write to somehow suceed. It's <em>possible</em> that might help, but it's not likely to be helpful in my experience.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 14:53:19 -0700 </code></pre> <p>On 2018-04-12 17:27:54 -0400, Jeff Layton wrote:</p> <blockquote> <p>On Thu, 2018-04-12 at 13:24 -0700, Andres Freund wrote:</p> <blockquote> <p>At the very least <em>some</em> way to <em>know</em> that such a failure occurred from userland without having to parse the kernel log. As far as I understand, neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report an error if it was encountered by writeback in the background.</p> <p>If that's indeed true for syncfs(2), even if the fd has been opened before (which I can see how it could happen from an implementation POV, nothing would associate a random FD with failures on different files), it's really impossible to detect this stuff from userland without text parsing.</p> </blockquote> <p>syncfs could use some work.</p> </blockquote> <p>It's really too bad that it doesn't have a flags argument.</p> <blockquote> <p>We probably also need to consider how to better track metadata writeback errors (on e.g. ext2). 
We don't really do that properly at quite yet either.</p> <blockquote> <p>Even if it'd were just a perf-fs /sys/$something file that'd return the current count of unreported errors in a filesystem independent way, it'd be better than what we have right now.</p> <p>1) figure out /sys/$whatnot $directory belongs to 2) oldcount=$(cat /sys/$whatnot/unreported_errors) 3) filesystem operations in $directory 4) sync;sync; 5) newcount=$(cat /sys/$whatnot/unreported_errors) 6) test &quot;$oldcount&quot; -eq &quot;$newcount&quot; || die-with-horrible-message</p> <p>Isn't beautiful to script, but it's also not absolutely terrible.</p> </blockquote> <p>ext4 seems to have something roughly like that (/sys/fs/ext4/$dev/errors_count), and by my reading it already seems to be incremented from the necessary places. By my reading XFS doesn't seem to have something similar.</p> <p>Wouldn't be bad to standardize...</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 12 Apr 2018 17:57:56 -0400 </code></pre> <p>On Thu, Apr 12, 2018 at 02:53:19PM -0700, Andres Freund wrote:</p> <blockquote> <blockquote> <blockquote> <p>Isn't beautiful to script, but it's also not absolutely terrible.</p> </blockquote> </blockquote> <p>ext4 seems to have something roughly like that (/sys/fs/ext4/$dev/errors_count), and by my reading it already seems to be incremented from the necessary places.</p> </blockquote> <p>This is only for file system inconsistencies noticed by the kernel. We don't bump that count for data block I/O errors.</p> <p>The same idea could be used on a block device level. It would be pretty simple to maintain a counter for I/O errors, and when the last error was detected on a particular device. You could even break out and track read errors and write errors separately if that would be useful.</p> <p>If you don't care what block was bad, but just that <em>some</em> I/O error had happened, a counter is definitely the simplest approach, and less hair to implement and use than something like a netlink channel or scraping dmesg....</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Thu, 12 Apr 2018 15:03:59 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-12 17:52:52 -0400, Theodore Y. Ts'o wrote:</p> <blockquote> <p>We did something <em>really</em> simple/stupid. We just sent essentially an ascii test string out the netlink socket. That's because what we were doing before was essentially scraping the output of dmesg (e.g. /dev/kmssg).</p> <p>That's actually probably the simplest thing to do, and it has the advantage that it will work even on ancient enterprise kernels that PG users are likely to want to use. So you will need to implement the dmesg text scraper anyway, and that's probably good enough for most use cases.</p> </blockquote> <p>The worst part of that is, as you mention below, needing to handle a lot of different error message formats. I guess it's reasonable enough if you control your hardware, but no such luck.</p> <p>Aren't there quite realistic scenarios where one could miss kmsg style messages due to it being a ringbuffer?</p> <blockquote> <p>Right, it's a little challenging because the actual regexp's you would need to use do vary from device driver to device driver. 
Fortunately nearly everything is a SCSI/SATA device these days, so there isn't <em>that</em> much variability.</p> </blockquote> <p>There's also SAN / NAS type stuff - not all of that presents as a SCSI/SATA device, right?</p> <blockquote> <blockquote> <p>Yea, agreed on all that. I don't think anybody actually involved in postgres wants to do anything like that. Seems far outside of postgres' remit.</p> </blockquote> <p>Some people on the pg-hackers list were talking about wanting to retry the fsync() and hoping that would cause the write to somehow suceed. It's <em>possible</em> that might help, but it's not likely to be helpful in my experience.</p> </blockquote> <p>Depends on the type of error and storage. ENOSPC, especially over NFS, has some reasonable chances of being cleared up. And for networked block storage it's also not impossible to think of scenarios where that'd work for EIO.</p> <p>But I think besides hope of clearing up itself, it has the advantage that it trivially can give <em>some</em> feedback to the user. The user'll get back strerror(ENOSPC) with some decent SQL error code, which'll hopefully cause them to investigate (well, once monitoring detects high error rates). It's much nicer for the user to type COMMIT; get an appropriate error back etc, than if the database just commits suicide.</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Fri, 13 Apr 2018 08:44:04 +1000 </code></pre> <p>On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:</p> <blockquote> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> </blockquote> <p>There isn't a right thing. Whatever we do will be wrong for someone.</p> <blockquote> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> </blockquote> <p>Not to mention a POSIX IO ordering violation. Seeing stale data after a &quot;successful&quot; write is simply not allowed.</p> </blockquote> <p>I'm not so sure here, given that we're dealing with an error condition. Are we really obligated not to allow any changes to pages that we can't write back?</p> </blockquote> <p>Posix says this about write():</p> <pre><code> After a write() to a regular file has successfully returned: Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified. </code></pre> <p>IOWs, even if there is a later error, we told the user the write was successful, and so according to POSIX we are not allowed to wind back the data to what it was before the write() occurred.</p> <blockquote> <p>Given that the pages are clean after these failures, we aren't doing this even today:</p> <p>Suppose we're unable to do writes but can do reads vs. the backing store. After a wb failure, the page has the dirty bit cleared. If it gets kicked out of the cache before the read occurs, it'll have to be faulted back in. 
Poof -- your write just disappeared.</p> </blockquote> <p>Yes - I was pointing out what the specification we supposedly conform to says about this behaviour, not that our current behaviour conforms to the spec. Indeed, have you even noticed xfs_aops_discard_page() and it's surrounding context on page writeback submission errors?</p> <p>To save you looking, XFS will trash the page contents completely on a filesystem level -&gt;writepage error. It doesn't mark them &quot;clean&quot;, doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written &quot;sucessfully&quot; to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.</p> <p>This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to chose between. Hence if we chose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Fri, 13 Apr 2018 08:56:38 -0400 </code></pre> <p>On Thu, 2018-04-12 at 14:31 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 05:14:54PM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:</p> <blockquote> <p>I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.</p> </blockquote> <p>Ah; this was my suggestion to Jeff on IRC. That we add a per- superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.</p> </blockquote> <p>Not a bad idea and shouldn't be too costly. mapping_set_error could flag the superblock one before or after the one in the mapping.</p> <p>We'd need to define what happens if you interleave fsync and syncfs calls on the same inode though. How do we handle file-&gt;f_wb_err in that case? Would we need a second field in struct file to act as the per-sb error cursor?</p> </blockquote> <p>Ooh. I hadn't thought that through. Bleh. I don't want to add a field to struct file for this uncommon case.</p> <p>Maybe O_PATH could be used for this? It gets you a file descriptor on a particular filesystem, so syncfs() is defined, but it can't report a writeback error. So if you open something O_PATH, you can use the file's f_wb_err for the mapping's error cursor.</p> </blockquote> <p>That might work.</p> <p>It'd be a syscall behavioral change so we'd need to document that well. 
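<p>(For readers following the O_PATH idea: the sketch below is the userspace usage the proposal seems to have in mind. This is a proposal under discussion, not behavior current kernels promise, and the mount point is made up.)</p> <pre><code>/* Sketch of the proposed per-superblock error cursor via O_PATH.
 * /srv/pgdata is a hypothetical mount point. */
#define _GNU_SOURCE
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int cursor = open(&quot;/srv/pgdata&quot;, O_PATH | O_DIRECTORY);
    if (cursor &lt; 0) { perror(&quot;open&quot;); return 1; }

    /* ... buffered writes happen elsewhere on this filesystem ... */

    /* Under the proposal, -EIO/-ENOSPC here would mean &quot;some writeback
     * on this filesystem failed since this fd was opened&quot;.  A kernel
     * without the change may simply reject an O_PATH fd, so treat this
     * purely as an illustration of the interface being discussed. */
    if (syncfs(cursor) &lt; 0)
        perror(&quot;syncfs&quot;);

    close(cursor);
    return 0;
}
</code></pre>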
It's probably innocuous though -- I doubt we have a lot of callers in the field opening files with O_PATH and calling syncfs on them.</p> <hr> <pre><code>From: Jeff Layton &lt;jlayton@...hat.com&gt; Date: Fri, 13 Apr 2018 09:18:56 -0400 </code></pre> <p>On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:</p> <blockquote> <p>On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:</p> <blockquote> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> </blockquote> <p>There isn't a right thing. Whatever we do will be wrong for someone.</p> <blockquote> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> </blockquote> <p>Not to mention a POSIX IO ordering violation. Seeing stale data after a &quot;successful&quot; write is simply not allowed.</p> </blockquote> <p>I'm not so sure here, given that we're dealing with an error condition. Are we really obligated not to allow any changes to pages that we can't write back?</p> </blockquote> <p>Posix says this about write():</p> <p>After a write() to a regular file has successfully returned:</p> <pre><code> Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified. </code></pre> <p>IOWs, even if there is a later error, we told the user the write was successful, and so according to POSIX we are not allowed to wind back the data to what it was before the write() occurred.</p> <blockquote> <p>Given that the pages are clean after these failures, we aren't doing this even today:</p> <p>Suppose we're unable to do writes but can do reads vs. the backing store. After a wb failure, the page has the dirty bit cleared. If it gets kicked out of the cache before the read occurs, it'll have to be faulted back in. Poof -- your write just disappeared.</p> </blockquote> <p>Yes - I was pointing out what the specification we supposedly conform to says about this behaviour, not that our current behaviour conforms to the spec. Indeed, have you even noticed xfs_aops_discard_page() and it's surrounding context on page writeback submission errors?</p> <p>To save you looking, XFS will trash the page contents completely on a filesystem level -&gt;writepage error. It doesn't mark them &quot;clean&quot;, doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written &quot;sucessfully&quot; to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.</p> <p>This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to chose between. Hence if we chose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to done with the knowledge it is a specification violation. 
Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....</p> </blockquote> <p>Got it, thanks.</p> <p>Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.</p> <p>So to summarize, at this point in the discussion, I think we want to consider doing the following:</p> <ul> <li>better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.</li> <li>invalidate or clear uptodate flag on pages that experience writeback errors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.</li> </ul> <p>Did I miss anything? Would that be enough to help the Pg usecase?</p> <p>I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Fri, 13 Apr 2018 06:25:35 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-13 09:18:56 -0400, Jeff Layton wrote:</p> <blockquote> <p>Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.</p> <p>So to summarize, at this point in the discussion, I think we want to consider doing the following:</p> <ul> <li>better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.</li> <li>invalidate or clear uptodate flag on pages that experience writeback errors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.</li> </ul> <p>Did I miss anything? Would that be enough to help the Pg usecase?</p> <p>I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.</p> </blockquote> <p>It's not perfect, but I think the amount of hacky OS specific code should be acceptable. And it does allow for a wrapper tool that can be used around backup restores etc to syncfs all the necessary filesystems. Let me mull with others for a bit.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Fri, 13 Apr 2018 07:02:32 -0700 </code></pre> <p>On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:</p> <blockquote> <p>To save you looking, XFS will trash the page contents completely on a filesystem level -&gt;writepage error. It doesn't mark them &quot;clean&quot;, doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written &quot;sucessfully&quot; to the cached page is now gone. 
It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.</p> <p>This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to chose between. Hence if we chose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....</p> </blockquote> <p>Got it, thanks.</p> <p>Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.</p> <p>So to summarize, at this point in the discussion, I think we want to consider doing the following:</p> <ul> <li>better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.</li> <li>invalidate or clear uptodate flag on pages that experience writebackerrors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.</li> </ul> <p>Did I miss anything? Would that be enough to help the Pg usecase?</p> <p>I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.</p> </blockquote> <p>I think we can do better than XFS is currently doing (but I agree that we should have the same behaviour across all Linux filesystems!)</p> <ol> <li>If we get an error while wbc-&gt;for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.</li> <li>Background writebacks should skip pages which are PageError.</li> <li>for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.</li> </ol> <p>I think kupdate writes are the same as for_background writes. for_reclaim is tougher. I don't want to see us getting into OOM because we're hanging onto stale data, but we don't necessarily have an open fd to report the error on. I think I'm leaning towards behaving the same for for_reclaim as for_sync, but this is probably a subject on which reasonable people can disagree.</p> <p>And this logic all needs to be on one place, although invoked from each filesystem.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Fri, 13 Apr 2018 07:48:07 -0700 </code></pre> <p>On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote:</p> <blockquote> <p>I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, <em>THAT'S</em> the big issue.</p> <p>Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory fsync everything, and <em>then</em> assume you're safe. 
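</p> <p>That pattern -- write a batch of files without fsync, then make a single pass that opens and fsyncs each one -- is roughly the sketch below (a hypothetical helper with abbreviated error handling). Note that every descriptor handed to fsync() here was opened after the writes, which is the case at issue.</p> <pre><code>#include &lt;dirent.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;unistd.h&gt;

/* fsync every entry in dir; sketch only (no recursion, minimal checks). */
static int fsync_dir_contents(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *de;
    int err = 0;

    if (!d)
        return -1;
    while ((de = readdir(d)) != NULL) {
        if (de-&gt;d_name[0] == '.')
            continue;
        int fd = openat(dirfd(d), de-&gt;d_name, O_RDONLY);
        if (fd &lt; 0 || fsync(fd) &lt; 0)
            err = -1;   /* an error that predates this open() may never show up here */
        if (fd &gt;= 0)
            close(fd);
    }
    closedir(d);
    return err;
}
</code></pre> <p>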
But unless I severaly misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.</p> </blockquote> <p>While accepting that under memory pressure we can still evict the error indicators, we can do a better job than we do today. The current design of error reporting says that all errors which occurred before you opened the file descriptor are of no interest to you. I don't think that's necessarily true, and it's actually a change of behaviour from before the errseq work.</p> <p>Consider Stupid Task A which calls open(), write(), close(), and Smart Task B which calls open(), write(), fsync(), close() operating on the same file. If A goes entirely before B and encounters an error, before errseq_t, B would see the error from A's write.</p> <p>If A and B overlap, even a little bit, then B still gets to see A's error today. But if writeback happens for A's write before B opens the file then B will never see the error.</p> <p>B doesn't want to see historical errors that a previous invocation of B has already handled, but we know whether <em>anyone</em> has seen the error or not. So here's a patch which restores the historical behaviour of seeing old unhandled errors on a fresh file descriptor:</p> <p>Signed-off-by: Matthew Wilcox <a href="mailto:mawilcox@...rosoft.com">mawilcox@...rosoft.com</a></p> <pre><code>diff --git a/lib/errseq.c b/lib/errseq.c index df782418b333..093f1fba4ee0 100644 --- a/lib/errseq.c +++ b/lib/errseq.c @@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set); errseq_t errseq_sample(errseq_t *eseq) { errseq_t old = READ_ONCE(*eseq); - errseq_t new = old; - /* - * For the common case of no errors ever having been set, we can skip - * marking the SEEN bit. Once an error has been set, the value will - * never go back to zero. - */ - if (old != 0) { - new |= ERRSEQ_SEEN; - if (old != new) - cmpxchg(eseq, old, new); - } - return new; + /* If nobody has seen this error yet, then we can be the first. */ + if (!(old &amp; ERRSEQ_SEEN)) + old = 0; + return old; } EXPORT_SYMBOL(errseq_sample); </code></pre> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Sat, 14 Apr 2018 11:47:52 +1000 </code></pre> <p>On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote:</p> <blockquote> <p>On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:</p> <blockquote> <p>To save you looking, XFS will trash the page contents completely on a filesystem level -&gt;writepage error. It doesn't mark them &quot;clean&quot;, doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written &quot;sucessfully&quot; to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.</p> <p>This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to chose between. Hence if we chose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....</p> </blockquote> <p>Got it, thanks.</p> <p>Yes, I think we ought to probably do the same thing globally. 
It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.</p> <p>So to summarize, at this point in the discussion, I think we want to consider doing the following:</p> <ul> <li><p>better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.</p></li> <li><p>invalidate or clear uptodate flag on pages that experience writeback errors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.</p></li> </ul> <p>Did I miss anything? Would that be enough to help the Pg usecase?</p> <p>I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.</p> </blockquote> <p>I think we can do better than XFS is currently doing (but I agree that we should have the same behaviour across all Linux filesystems!)</p> <ol> <li>If we get an error while wbc-&gt;for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.</li> </ol> </blockquote> <p>So you're saying we should treat it as a transient error rather than a permanent error.</p> <blockquote> <ol> <li>Background writebacks should skip pages which are PageError.</li> </ol> </blockquote> <p>That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?</p> <p>e.g. XFS gets to enospc, runs out of reserve pool blocks so can't allocate space to write back the page, then space is freed up a few seconds later and so the next write will work just fine.</p> <p>This is a recipe for &quot;I lost data that I wrote /days/ before the system crashed&quot; bug reports.</p> <blockquote> <ol> <li>for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.</li> </ol> </blockquote> <p>Which may well be unmount. Are we really going to wait until unmount to report fatal errors?</p> <p>We used to do this with XFS metadata. We'd just keep trying to write metadata and keep the filesystem running (because it's consistent in memory and it might be a transient error) rather than shutting down the filesystem after a couple of retries. the result was that users wouldn't notice there were problems until unmount, and the most common sympton of that was &quot;why is system shutdown hanging?&quot;.</p> <p>We now don't hang at unmount by default:</p> <pre><code>$ cat /sys/fs/xfs/dm-0/error/fail_at_unmount 1 $ </code></pre> <p>And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/...</p> <p>We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. 
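</p> <p>These knobs are ordinary sysfs files, so they can be read or tuned by anything that can open a file; the sketch below reads the fail_at_unmount setting shown earlier in this message (the dm-0 path is the example from the thread, and the per-error knobs live in sibling directories whose exact names vary by kernel version).</p> <pre><code>#include &lt;stdio.h&gt;

int main(void)
{
    char val[16];
    FILE *f = fopen(&quot;/sys/fs/xfs/dm-0/error/fail_at_unmount&quot;, &quot;r&quot;);

    if (f &amp;&amp; fgets(val, sizeof(val), f))
        printf(&quot;fail_at_unmount = %s&quot;, val);
    if (f)
        fclose(f);
    return 0;
}
</code></pre> <p>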
this is a clear example of &quot;one size doesn't fit all&quot; and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....)</p> <blockquote> <p>And this logic all needs to be on one place, although invoked from each filesystem.</p> </blockquote> <p>Perhaps so, but as there's no &quot;one-size-fits-all&quot; behaviour, I really want to extend the XFS error config infrastructure to control what the filesystem does on error here.</p> <hr> <pre><code>From: Andres Freund &lt;andres@...razel.de&gt; Date: Fri, 13 Apr 2018 19:04:33 -0700 </code></pre> <p>Hi,</p> <p>On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:</p> <blockquote> <p>And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/...</p> <p>We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of &quot;one size doesn't fit all&quot; and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....)</p> </blockquote> <p>Have you considered adding an ext/fat/jfs errors=remount-ro/panic/continue style mount parameter?</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Fri, 13 Apr 2018 19:38:14 -0700 </code></pre> <p>On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:</p> <blockquote> <ol> <li>If we get an error while wbc-&gt;for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.</li> </ol> </blockquote> <p>So you're saying we should treat it as a transient error rather than a permanent error.</p> </blockquote> <p>Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.</p> <blockquote> <blockquote> <ol> <li>Background writebacks should skip pages which are PageError.</li> </ol> </blockquote> <p>That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?</p> </blockquote> <p>That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).</p> <blockquote> <p>e.g. XFS gets to enospc, runs out of reserve pool blocks so can't allocate space to write back the page, then space is freed up a few seconds later and so the next write will work just fine.</p> <p>This is a recipe for &quot;I lost data that I wrote /days/ before the system crashed&quot; bug reports.</p> </blockquote> <p>So ... exponential backoff on retries?</p> <blockquote> <blockquote> <ol> <li>for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.</li> </ol> </blockquote> <p>Which may well be unmount. Are we really going to wait until unmount to report fatal errors?</p> </blockquote> <p>Goodness, no. 
The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.</p> <hr> <hr> <pre><code>From: bfields@...ldses.org (J. Bruce Fields) Date: Wed, 18 Apr 2018 12:52:19 -0400 </code></pre> <blockquote> <p>Theodore Y. Ts'o - 10.04.18, 20:43:</p> <blockquote> <p>First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentialy be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.</p> </blockquote> </blockquote> <p>Pointers to documentation or papers or anything? The only google results I can find for &quot;power fail certified&quot; are your posts.</p> <p>I've always been confused by SSD power-loss protection, as nobody seems completely clear whether it's a safety or a performance feature.</p> <hr> <pre><code>From: bfields@...ldses.org (J. Bruce Fields) Date: Wed, 18 Apr 2018 14:09:03 -0400 </code></pre> <p>On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote:</p> <blockquote> <p>On Apr 10, 2018, at 4:07 PM, Andres Freund <a href="mailto:andres@...razel.de">andres@...razel.de</a> wrote:</p> <blockquote> <p>2018-04-10 18:43:56 Ted wrote:</p> <blockquote> <p>So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.</p> </blockquote> <p>That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can proper responses back from fsync that imply things actually were synced.</p> </blockquote> <p>Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error.</p> </blockquote> <p>It's not just postgres. dpkg (underlying apt, on debian derived distros) to take an example I just randomly guessed, does too: /* We want to guarantee the extracted files are on the disk, so that the * subsequent renames to the info database do not end up with old or zero * length files in case of a system crash. As neither dpkg-deb nor tar do * explicit fsync()s, we have to do them here. * XXX: This could be avoided by switching to an internal tar extractor. */ dir_sync_contents(cidir);</p> <p>(a bunch of other places too)</p> <p>Especially on ext3 but also on newer filesystems it's performancewise entirely infeasible to fsync() every single file individually - the performance becomes entirely attrocious if you do that.</p> </blockquote> <p>Is that still true if you're able to use some kind of parallelism? 
(async io, or fsync from multiple processes?)</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Thu, 19 Apr 2018 09:59:50 +1000 </code></pre> <p>On Fri, Apr 13, 2018 at 07:04:33PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:</p> <blockquote> <p>And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/...</p> <p>We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of &quot;one size doesn't fit all&quot; and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....)</p> </blockquote> <p>Have you considered adding an ext/fat/jfs errors=remount-ro/panic/continue style mount parameter?</p> </blockquote> <p>That's for metadata writeback error behaviour, not data writeback IO errors.</p> <p>We are definitely not planning to add mount options to configure IO error behaviors. Mount options are a horrible way to configure filesystem behaviour and we've already got other, fine-grained configuration infrastructure for configuring IO error behaviour. Which, as I just pointed out, was designed to be be extended to data writeback and other operational error handling in the filesystem (e.g. dealing with ENOMEM in different ways).</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Thu, 19 Apr 2018 10:13:43 +1000 </code></pre> <p>On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:</p> <blockquote> <ol> <li>If we get an error while wbc-&gt;for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.</li> </ol> </blockquote> <p>So you're saying we should treat it as a transient error rather than a permanent error.</p> </blockquote> <p>Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.</p> </blockquote> <p>And if it's getting IO errors because of USB stick pull? What then?</p> <blockquote> <blockquote> <blockquote> <ol> <li>Background writebacks should skip pages which are PageError.</li> </ol> </blockquote> <p>That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?</p> </blockquote> <p>That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).</p> </blockquote> <p>So if kernel ring buffer overflows and so users miss the first error report, they'll have no idea that the data writeback is still failing?</p> <blockquote> <blockquote> <p>e.g. 
XFS gets to enospc, runs out of reserve pool blocks so can't allocate space to write back the page, then space is freed up a few seconds later and so the next write will work just fine.</p> <p>This is a recipe for &quot;I lost data that I wrote /days/ before the system crashed&quot; bug reports.</p> </blockquote> <p>So ... exponential backoff on retries?</p> </blockquote> <p>Maybe, but I don't think that actually helps anything and adds yet more &quot;when should we write this&quot; complication to inode writeback....</p> <blockquote> <blockquote> <blockquote> <ol> <li>for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.</li> </ol> </blockquote> <p>Which may well be unmount. Are we really going to wait until unmount to report fatal errors?</p> </blockquote> <p>Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.</p> </blockquote> <p>But if there are no open files when the error occurs, that error won't get reported to anyone. Which means the next time anyone accesses that inode from a user context could very well be unmount or a third party sync/syncfs()....</p> <hr> <pre><code>From: Eric Sandeen &lt;esandeen@...hat.com&gt; Date: Wed, 18 Apr 2018 19:23:46 -0500 </code></pre> <p>On 4/18/18 6:59 PM, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:04:33PM -0700, Andres Freund wrote:</p> <blockquote> <p>Hi,</p> <p>On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:</p> <blockquote> <p>And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/...</p> <p>We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of &quot;one size doesn't fit all&quot; and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....)</p> </blockquote> <p>Have you considered adding an ext/fat/jfs errors=remount-ro/panic/continue style mount parameter?</p> </blockquote> <p>That's for metadata writeback error behaviour, not data writeback IO errors.</p> </blockquote> <p>/me points casually at data_err=abort &amp; data_err=ignore in ext4...</p> <pre><code> data_err=ignore Just print an error message if an error occurs in a file data buffer in ordered mode. data_err=abort Abort the journal if an error occurs in a file data buffer in ordered mode. </code></pre> <p>Just sayin'</p> <blockquote> <p>We are definitely not planning to add mount options to configure IO error behaviors. Mount options are a horrible way to configure filesystem behaviour and we've already got other, fine-grained configuration infrastructure for configuring IO error behaviour. Which, as I just pointed out, was designed to be be extended to data writeback and other operational error handling in the filesystem (e.g. 
dealing with ENOMEM in different ways).</p> </blockquote> <p>I don't disagree, but there are already mount-option knobs in ext4, FWIW.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Wed, 18 Apr 2018 17:40:37 -0700 </code></pre> <p>On Thu, Apr 19, 2018 at 10:13:43AM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:</p> <blockquote> <ol> <li>If we get an error while wbc-&gt;for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.</li> </ol> </blockquote> <p>So you're saying we should treat it as a transient error rather than a permanent error.</p> </blockquote> <p>Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.</p> </blockquote> <p>And if it's getting IO errors because of USB stick pull? What then?</p> </blockquote> <p>I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.</p> <blockquote> <blockquote> <blockquote> <blockquote> <ol> <li>Background writebacks should skip pages which are PageError.</li> </ol> </blockquote> <p>That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?</p> </blockquote> <p>That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).</p> </blockquote> <p>So if kernel ring buffer overflows and so users miss the first error report, they'll have no idea that the data writeback is still failing?</p> </blockquote> <p>I wasn't thinking about kernel ringbuffer based reporting; I was thinking about errseq_t based reporting, so the application can tell the fsync failed and maybe does something application-level to recover like send the transactions across to another node in the cluster (or whatever this hypothetical application is).</p> <blockquote> <blockquote> <blockquote> <blockquote> <ol> <li>for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.</li> </ol> </blockquote> <p>Which may well be unmount. Are we really going to wait until unmount to report fatal errors?</p> </blockquote> <p>Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.</p> </blockquote> <p>But if there are no open files when the error occurs, that error won't get reported to anyone. Which means the next time anyone accesses that inode from a user context could very well be unmount or a third party sync/syncfs()....</p> </blockquote> <p>Right. But then that's on the application.</p> <hr> <pre><code>From: &quot;Theodore Y. 
Ts'o&quot; &lt;tytso@....edu&gt; Date: Wed, 18 Apr 2018 21:08:19 -0400 </code></pre> <p>On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.o</p> </blockquote> <p>Maybe we shouldn't be trying to do any of this in the kernel, or at least as little as possible in the kernel? Perhaps it would be better to do most of this as a device mapper hack; I suspect we'll need userspace help to igure out whether the user has plugged the same USB stick in, or a different USB stick, anyway.</p> <hr> <hr> <pre><code>From: Christoph Hellwig &lt;hch@...radead.org&gt; Date: Thu, 19 Apr 2018 01:39:04 -0700 </code></pre> <p>On Wed, Apr 18, 2018 at 12:52:19PM -0400, J. Bruce Fields wrote:</p> <blockquote> <blockquote> <p>Theodore Y. Ts'o - 10.04.18, 20:43:</p> <blockquote> <p>First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentialy be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.</p> </blockquote> </blockquote> <p>Pointers to documentation or papers or anything? The only google results I can find for &quot;power fail certified&quot; are your posts.</p> <p>I've always been confused by SSD power-loss protection, as nobody seems completely clear whether it's a safety or a performance feature.</p> </blockquote> <p>Devices from reputable vendors should always be power fail safe, bugs notwithstanding. What power-loss protection in marketing slides usually means is that an SSD has a non-volatile write cache. That is once a write is ACKed data is persisted and no additional cache flush needs to be sent. This is a feature only available in expensive eterprise SSDs as the required capacitors are expensive. Cheaper consumer or boot driver SSDs have a volatile write cache, that is we need to do a separate cache flush to persist data (REQ_OP_FLUSH in Linux). But a reasonable implementation of those still won't corrupt previously written data, they will just lose the volatile write cache that hasn't been flushed. Occasional bugs, bad actors or other issues might still happen.</p> <hr> <pre><code>From: &quot;J. Bruce Fields&quot; &lt;bfields@...ldses.org&gt; Date: Thu, 19 Apr 2018 10:10:16 -0400 </code></pre> <p>On Thu, Apr 19, 2018 at 01:39:04AM -0700, Christoph Hellwig wrote:</p> <blockquote> <p>On Wed, Apr 18, 2018 at 12:52:19PM -0400, J. Bruce Fields wrote:</p> <blockquote> <blockquote> <p>Theodore Y. Ts'o - 10.04.18, 20:43:</p> <blockquote> <p>First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. 
What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentialy be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.</p> </blockquote> </blockquote> <p>Pointers to documentation or papers or anything? The only google results I can find for &quot;power fail certified&quot; are your posts.</p> <p>I've always been confused by SSD power-loss protection, as nobody seems completely clear whether it's a safety or a performance feature.</p> </blockquote> <p>Devices from reputable vendors should always be power fail safe, bugs notwithstanding. What power-loss protection in marketing slides usually means is that an SSD has a non-volatile write cache. That is once a write is ACKed data is persisted and no additional cache flush needs to be sent. This is a feature only available in expensive eterprise SSDs as the required capacitors are expensive. Cheaper consumer or boot driver SSDs have a volatile write cache, that is we need to do a separate cache flush to persist data (REQ_OP_FLUSH in Linux). But a reasonable implementation of those still won't corrupt previously written data, they will just lose the volatile write cache that hasn't been flushed. Occasional bugs, bad actors or other issues might still happen.</p> </blockquote> <p>Thanks! That was my understanding too. But then the name is terrible. As is all the vendor documentation I can find:</p> <blockquote> <p><a href="https://insights.samsung.com/2016/03/22/power-loss-protection-how-ssds-are-protecting-data-integrity-white-paper/">https://insights.samsung.com/2016/03/22/power-loss-protection-how-ssds-are-protecting-data-integrity-white-paper/</a></p> <p>&quot;Power loss protection is a critical aspect of ensuring data integrity, especially in servers or data centers.&quot;</p> <p><a href="https://www.intel.com/content/.../ssd-320-series-power-loss-data-protection-brief.pdf">https://www.intel.com/content/.../ssd-320-series-power-loss-data-protection-brief.pdf</a></p> <p>&quot;Data safety features prepare for unexpected power-loss and protect system and user data.&quot;</p> </blockquote> <p>Why do they all neglect to mention that their consumer drives are also perfectly capable of well-defined behavior after power loss, just at the expense of flush performance? It's ridiculously confusing.</p> <hr> <pre><code>From: Matthew Wilcox &lt;willy@...radead.org&gt; Date: Thu, 19 Apr 2018 10:40:10 -0700 </code></pre> <p>On Wed, Apr 18, 2018 at 09:08:19PM -0400, Theodore Y. Ts'o wrote:</p> <blockquote> <p>On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.o</p> </blockquote> <p>Maybe we shouldn't be trying to do any of this in the kernel, or at least as little as possible in the kernel? 
Perhaps it would be better to do most of this as a device mapper hack; I suspect we'll need userspace help to igure out whether the user has plugged the same USB stick in, or a different USB stick, anyway.</p> </blockquote> <p>The device mapper target (dm-removable?) was my first idea too, but I kept thinking through use cases and I think we end up wanting this functionality in the block layer. Let's try a story.</p> <p>Stephen the PFY goes into the data centre looking to hotswap a failed drive. Due to the eight pints of lager he had for lunch, he pulls out the root drive instead of the failed drive. The air raid siren warbles and he realises his mistake, shoving the drive back in.</p> <p>CYOA:</p> <p>Currently: All writes are lost, calamities ensue. The PFY is fired.</p> <p>With dm-removable: Nobody thought to set up dm-removable on the root drive. Calamities still ensue, but now it's the BOFH's fault instead of the PFY's fault.</p> <p>Built into the block layer: After a brief hiccup while we reattach the drive to its block_device, the writes resume and nobody loses their job.</p> <hr> <pre><code>From: &quot;Theodore Y. Ts'o&quot; &lt;tytso@....edu&gt; Date: Thu, 19 Apr 2018 19:27:15 -0400 </code></pre> <p>On Thu, Apr 19, 2018 at 10:40:10AM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>With dm-removable: Nobody thought to set up dm-removable on the root drive. Calamities still ensue, but now it's the BOFH's fault instead of the PFY's fault.</p> <p>Built into the block layer: After a brief hiccup while we reattach the drive to its block_device, the writes resume and nobody loses their job.</p> </blockquote> <p>What you're talking about is a deployment issue, though. Ultimately the distribution will set up dm-removable automatically if the user requests it, much like it sets up dm-crypt automatically for laptop users upon request.</p> <p>My concern is that not all removable devices have a globally unique id number available in hardware so the kernel can tell whether or not it's the same device that has been plugged in. There are hueristics you could use -- for example, you could look at the file system uuid plus the last fsck time. But they tend to be very file system specific, and not things we would want ot have in the kernel.</p> <hr> <pre><code>From: Dave Chinner &lt;david@...morbit.com&gt; Date: Fri, 20 Apr 2018 09:28:59 +1000 </code></pre> <p>On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Thu, Apr 19, 2018 at 10:13:43AM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:</p> <blockquote> <p>On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:</p> <blockquote> <p>On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:</p> <blockquote> <ol> <li>If we get an error while wbc-&gt;for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.</li> </ol> </blockquote> <p>So you're saying we should treat it as a transient error rather than a permanent error.</p> </blockquote> <p>Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.</p> </blockquote> <p>And if it's getting IO errors because of USB stick pull? What then?</p> </blockquote> <p>I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. 
Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.</p> </blockquote> <p><em>nod</em></p> <p>But in the meantime, device unplug (should give ENODEV, not EIO) is a fatal error and we need to toss away the data.</p> <blockquote> <blockquote> <blockquote> <blockquote> <blockquote> <ol> <li>Background writebacks should skip pages which are PageError.</li> </ol> </blockquote> <p>That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?</p> </blockquote> <p>That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).</p> </blockquote> <p>So if kernel ring buffer overflows and so users miss the first error report, they'll have no idea that the data writeback is still failing?</p> </blockquote> <p>I wasn't thinking about kernel ringbuffer based reporting; I was thinking about errseq_t based reporting, so the application can tell the fsync failed and maybe does something application-level to recover like send the transactions across to another node in the cluster (or whatever this hypothetical application is).</p> </blockquote> <p>But if it's still failing, then we should be still trying to report the error. i.e. if fsync fails and the page remains dirty, then the next attmept to write it is a new error and fsync should report that. IOWs, I think we should be returning errors at every occasion errors need to be reported if we have a persistent writeback failure...</p> <blockquote> <blockquote> <blockquote> <blockquote> <blockquote> <ol> <li>for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.</li> </ol> </blockquote> <p>Which may well be unmount. Are we really going to wait until unmount to report fatal errors?</p> </blockquote> <p>Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.</p> </blockquote> <p>But if there are no open files when the error occurs, that error won't get reported to anyone. Which means the next time anyone accesses that inode from a user context could very well be unmount or a third party sync/syncfs()....</p> </blockquote> <p>Right. But then that's on the application.</p> </blockquote> <p>Which we know don't do the right thing. Seems like a lot of hoops to jump through given it still won't work if the appliction isn't changed to support linux specific error handling requirements...</p> <hr> <pre><code>From: Jan Kara &lt;jack@...e.cz&gt; Date: Sat, 21 Apr 2018 18:59:54 +0200 </code></pre> <p>On Fri 13-04-18 07:48:07, Matthew Wilcox wrote:</p> <blockquote> <p>On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote:</p> <blockquote> <p>I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. 
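</p> <p>Spelled out as code, the sequence being described reduces to something like the sketch below (hypothetical file name; the comment on fsync() reflects the pre-fix behaviour being complained about, where the error predates the reopen).</p> <pre><code>#include &lt;fcntl.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

/* Process A: write without fsync, then close. */
static void task_a(void)
{
    const char buf[] = &quot;important update&quot;;
    int fd = open(&quot;datafile&quot;, O_WRONLY | O_CREAT, 0644);

    write(fd, buf, sizeof(buf) - 1);  /* data is only in the page cache */
    close(fd);
}

/* ... later, kernel writeback of A's dirty pages fails with EIO ... */

/* Process B: reopen the same file and fsync on A's behalf. */
static int task_b(void)
{
    int fd = open(&quot;datafile&quot;, O_WRONLY);
    int ret = fsync(fd);    /* reports success: the failure happened
                             * before this open(), so B never sees it */
    close(fd);
    return ret;
}

int main(void)
{
    task_a();
    return task_b() ? 1 : 0;
}
</code></pre> <p>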
The fact that one needs to have an FD open from before any failing writes occurred to get a failure, <em>THAT'S</em> the big issue.</p> <p>Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory fsync everything, and <em>then</em> assume you're safe. But unless I severaly misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.</p> </blockquote> <p>While accepting that under memory pressure we can still evict the error indicators, we can do a better job than we do today. The current design of error reporting says that all errors which occurred before you opened the file descriptor are of no interest to you. I don't think that's necessarily true, and it's actually a change of behaviour from before the errseq work.</p> <p>Consider Stupid Task A which calls open(), write(), close(), and Smart Task B which calls open(), write(), fsync(), close() operating on the same file. If A goes entirely before B and encounters an error, before errseq_t, B would see the error from A's write.</p> <p>If A and B overlap, even a little bit, then B still gets to see A's error today. But if writeback happens for A's write before B opens the file then B will never see the error.</p> <p>B doesn't want to see historical errors that a previous invocation of B has already handled, but we know whether <em>anyone</em> has seen the error or not. So here's a patch which restores the historical behaviour of seeing old unhandled errors on a fresh file descriptor:</p> <p>Signed-off-by: Matthew Wilcox <a href="mailto:mawilcox@...rosoft.com">mawilcox@...rosoft.com</a></p> </blockquote> <p>So I agree with going to the old semantics of reporting errors from before a file was open at least once to someone. As the PG case shows apps are indeed relying on the old behavior. As much as it is unreliable, it ends up doing the right thing for these apps in 99% of cases and we shouldn't break them (BTW IMO the changelog should contain a note that this fixes a regression of PostgreSQL, a reference to this thread and CC to stable). Anyway feel free to add:</p> <p>Reviewed-by: Jan Kara <a href="mailto:jack@...e.cz">jack@...e.cz</a></p> <p>Oh, and to make myself clear I do think we need to find a better way of reporting IO errors. I consider this just an immediate band-aid to avoid userspace regressions.</p> <blockquote> <p>diff --git a/lib/errseq.c b/lib/errseq.c index df782418b333..093f1fba4ee0 100644 --- a/lib/errseq.c +++ b/lib/errseq.c @@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set); errseq_t errseq_sample(errseq_t *eseq) { errseq_t old = READ_ONCE(*eseq); - errseq_t new = old;</p> <ul> <li>/*</li> <li>* For the common case of no errors ever having been set, we can skip</li> <li>* marking the SEEN bit. Once an error has been set, the value will</li> <li>* never go back to zero.</li> <li>*/</li> <li>if (old != 0) {</li> <li>new |= ERRSEQ_SEEN;</li> <li>if (old != new)</li> <li>cmpxchg(eseq, old, new);</li> <li>}</li> <li>return new;</li> <li>/* If nobody has seen this error yet, then we can be the first. 
*/</li> <li>if (!(old &amp; ERRSEQ_SEEN))</li> <li>old = 0;</li> <li>return old;</li> </ul> </blockquote> <hr> <pre><code>From: Jan Kara &lt;jack@...e.cz&gt; Date: Sat, 21 Apr 2018 20:14:29 +0200 </code></pre> <p>On Thu 12-04-18 07:09:14, Jeff Layton wrote:</p> <blockquote> <p>On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:</p> <blockquote> <p>At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open(&quot;file&quot;)); would report the most recent error.</p> <p>That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's &quot;good enough&quot; to only support it for recent files.</p> <p>Jeff, what do you think?</p> </blockquote> <p>I hate it :). We could do that, but....yecchhhh.</p> <p>Reporting errors only in the case where the inode happened to stick around in the cache seems too unreliable for real-world usage, and might be problematic for some use cases. I'm also not sure it would really be helpful.</p> </blockquote> <p>So this is never going to be perfect but I think we could do good enough by: 1) Mark inodes that hit IO error. 2) If the inode gets evicted from memory we store the fact that we hit an error for this IO in a more space efficient data structure (sparse bitmap, radix tree, extent tree, whatever). 3) If the underlying device gets destroyed, we can just switch the whole SB to an error state and forget per inode info. 4) If there's too much of per-inode error info (probably per-fs configurable limit in terms of number of inodes), we would yell in the kernel log, switch the whole fs to the error state and forget per inode info.</p> <p>This way there won't be silent loss of IO errors. Memory usage would be reasonably limited. It could happen the whole fs would switch to error state &quot;prematurely&quot; but if that's a problem for the machine, admin could tune the limit for number of inodes to keep IO errors for...</p> <blockquote> <p>I think the crux of the matter here is not really about error reporting, per-se.</p> </blockquote> <p>I think this is related but a different question.</p> <blockquote> <p>I asked this at LSF last year, and got no real answer:</p> <p>When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?</p> <p>One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.</p> <p>Maybe that's ok in the face of a writeback error though? IDK.</p> </blockquote> <p>I can see the admin wanting to rather kill the machine with OOM than having to deal with data loss due to IO errors (e.g. if he has HA server fail over set up). Or retry for some time before dropping the dirty data. Or do what we do now (possibly with invalidating pages as you say). As Dave said elsewhere there's not one strategy that's going to please everybody. 
So it might be beneficial to have this configurable like XFS has it for metadata.</p> <p>OTOH if I look at the problem from application developer POV, most apps will just declare game over at the face of IO errors (if they take care to check for them at all). And the sophisticated apps that will try some kind of error recovery have to be prepared that the data is just gone (as depending on what exactly the kernel does is rather fragile) so I'm not sure how much practical value the configurable behavior on writeback errors would bring.</p> <hr> <p><link rel="prefetch" href="//danluu.com/deconstruct-files/"></p> Computer latency: 1977-2017 input-lag/ Sun, 24 Dec 2017 00:00:00 +0000 input-lag/ <p>I've had this nagging feeling that the computers I use today feel slower than the computers I used as a kid. As a rule, I don’t trust this kind of feeling because human perception has been shown to be unreliable in empirical studies, so I carried around a high-speed camera and measured the response latency of devices I’ve run into in the past few months. Here are the results:</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}td.l {text-align:left;}</style> <table> <tr> <th>computer</th><th>latency<br>(ms)</th><th>year</th><th>clock</th><th># T</th></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Apple_IIe">apple 2e</a></td><td bgcolor=#ffffcc><font color=black>30</td><td bgcolor=#54278f><font color=white>1983</font></td><td bgcolor=#08306b><font color=white>1 MHz</font></td><td bgcolor=#08306b><font color=white>3.5k</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B004GYBZBE/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B004GYBZBE&linkId=123a7b9e283914f633fbdc9b001f255b">ti 99/4a</a></td><td bgcolor=#ffffcc><font color=black>40</td><td bgcolor=#3f007d><font color=white>1981</font></td><td bgcolor=#08519c><font color=white>3 MHz</font></td><td bgcolor=#08306b><font color=white>8k</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01FJLAITC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01FJLAITC&linkId=8ffc49fdd6c38da2429e5f3ab5527314">custom haswell-e <i>165Hz</i></a></td><td bgcolor=#ffeda0><font color=black>50</td><td bgcolor=#efedf5><font color=black>2014</font></td><td bgcolor=#deebf7><font color=black>3.5 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Commodore_PET">commodore pet 4016</a></td><td bgcolor=#ffeda0><font color=black>60</td><td bgcolor=#3f007d><font color=white>1977</font></td><td bgcolor=#08306b><font color=white>1 MHz</font></td><td bgcolor=#08306b><font color=white>3.5k</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/SGI_Indy">sgi indy</a></td><td bgcolor=#ffeda0><font color=black>60</td><td bgcolor=#807dba><font color=black>1993</font></td><td bgcolor=#6baed6><font color=black>.1 GHz</font></td><td bgcolor=#4292c6><font color=black>1.2M</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01FJLAITC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01FJLAITC&linkId=8ffc49fdd6c38da2429e5f3ab5527314">custom haswell-e <i>120Hz</i></a></td><td bgcolor=#ffeda0><font color=black>60</td><td bgcolor=#efedf5><font color=black>2014</font></td><td 
bgcolor=#deebf7><font color=black>3.5 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01LWNVFIM/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01LWNVFIM&linkId=d4f3e06001021e2bfc95eb2c8c3a9805">thinkpad 13 <b>chromeos</b></a></td><td bgcolor=#fed976><font color=black>70</td><td bgcolor=#fcfbfd><font color=black>2017</font></td><td bgcolor=#deebf7><font color=black>2.3 GHz</font></td><td bgcolor=#deebf7><font color=black>1G</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B007O35FI8/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B007O35FI8&linkId=dc8b3670da16e98e718b157661037e4b">imac g4 <b>os 9</b></a></td><td bgcolor=#fed976><font color=black>70</td><td bgcolor=#bcbddc><font color=black>2002</font></td><td bgcolor=#c6dbef><font color=black>.8 GHz</font></td><td bgcolor=#6baed6><font color=black>11M</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01FJLAITC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01FJLAITC&linkId=8ffc49fdd6c38da2429e5f3ab5527314">custom haswell-e <i>60Hz</i></a></td><td bgcolor=#fed976><font color=black>80</td><td bgcolor=#efedf5><font color=black>2014</font></td><td bgcolor=#deebf7><font color=black>3.5 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Macintosh_Color_Classic">mac color classic</a></td><td bgcolor=#feb24c><font color=black>90</td><td bgcolor=#807dba><font color=black>1993</font></td><td bgcolor=#2171b5><font color=white>16 MHz</font></td><td bgcolor=#2171b5><font color=white>273k</font></td></tr> <tr> <td class="l"><a href="http://www.powerspec.com/systems/system_specs.phtml?selection=G405">powerspec g405 <b>linux</b> <i>60Hz</i></a></td><td bgcolor=#feb24c><font color=black>90</td><td bgcolor=#fcfbfd><font color=black>2017</font></td><td bgcolor=#f7fbff><font color=black>4.2 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B0096VDM8G/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B0096VDM8G&linkId=d134e4e812fd9e68ddb6191e03f78a72">macbook pro 2014</a></td><td bgcolor=#feb24c><font color=black>100</td><td bgcolor=#efedf5><font color=black>2014</font></td><td bgcolor=#deebf7><font color=black>2.6 GHz</font></td><td bgcolor=#deebf7><font color=black>700M</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01LWNVFIM/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01LWNVFIM&linkId=d4f3e06001021e2bfc95eb2c8c3a9805">thinkpad 13 <b>linux chroot</b></a></td><td bgcolor=#feb24c><font color=black>100</td><td bgcolor=#fcfbfd><font color=black>2017</font></td><td bgcolor=#deebf7><font color=black>2.3 GHz</font></td><td bgcolor=#deebf7><font color=black>1G</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B072BDGLBC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B072BDGLBC&linkId=b316192e2edacb439be22ef01a6bb779">lenovo x1 carbon 4g <b>linux</b></a></td><td bgcolor=#fd8d3c><font color=black>110</td><td bgcolor=#efedf5><font color=black>2016</font></td><td 
bgcolor=#deebf7><font color=black>2.6 GHz</font></td><td bgcolor=#deebf7><font color=black>1G</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B007O35FI8/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B007O35FI8&linkId=dc8b3670da16e98e718b157661037e4b">imac g4 <b>os x</b></a></td><td bgcolor=#fd8d3c><font color=black>120</td><td bgcolor=#bcbddc><font color=black>2002</font></td><td bgcolor=#c6dbef><font color=black>.8 GHz</font></td><td bgcolor=#6baed6><font color=black>11M</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01FJLAITC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01FJLAITC&linkId=8ffc49fdd6c38da2429e5f3ab5527314">custom haswell-e <i>24Hz</i></a></td><td bgcolor=#fc4e2a><font color=black>140</td><td bgcolor=#efedf5><font color=black>2014</font></td><td bgcolor=#deebf7><font color=black>3.5 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B072BDGLBC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B072BDGLBC&linkId=b316192e2edacb439be22ef01a6bb779">lenovo x1 carbon 4g <b>win</b></a></td><td bgcolor=#e31a1c><font color=black>150</td><td bgcolor=#efedf5><font color=black>2016</font></td><td bgcolor=#deebf7><font color=black>2.6 GHz</font></td><td bgcolor=#deebf7><font color=black>1G</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/NeXTcube">next cube</a></td><td bgcolor=#e31a1c><font color=black>150</td><td bgcolor=#6a51a3><font color=white>1988</font></td><td bgcolor=#4292c6><font color=black>25 MHz</font></td><td bgcolor=#4292c6><font color=black>1.2M</font></td></tr> <tr> <td class="l"><a href="http://www.powerspec.com/systems/system_specs.phtml?selection=G405">powerspec g405 <b>linux</b></a></td><td bgcolor=#bd0026><font color=black>170</td><td bgcolor=#fcfbfd><font color=black>2017</font></td><td bgcolor=#f7fbff><font color=black>4.2 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l" bgcolor="silver">packet around the world</td><td bgcolor=silver><font color=black>190</td><td bgcolor=silver><font color=black></font></td><td bgcolor=silver><font color=black></font></td><td bgcolor=silver><font color=black></font></td></tr> <tr> <td class="l"><a href="http://www.powerspec.com/systems/system_specs.phtml?selection=G405">powerspec g405 <b>win</b></a></td><td bgcolor=black><font color=white>200</font></td><td bgcolor=#fcfbfd><font color=black>2017</font></td><td bgcolor=#f7fbff><font color=black>4.2 GHz</font></td><td bgcolor=#f7fbff><font color=black>2G</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Symbolics">symbolics 3620</a></td><td bgcolor=black><font color=white>300</font></td><td bgcolor=#54278f><font color=white>1986</font></td><td bgcolor=#08519c><font color=white>5 MHz</font></td><td bgcolor=#2171b5><font color=white>390k</font></td></tr> </table></p> <p>These are tests of the latency between a keypress and the display of a character in a terminal (see appendix for more details). The results are sorted from quickest to slowest. In the latency column, the background goes from green to yellow to red to black as devices get slower and the background gets darker as devices get slower. No devices are green. 
When multiple OSes were tested on the same machine, the os is <b>in bold</b>. When multiple refresh rates were tested on the same machine, the refresh rate is <i>in italics</i>.</p> <p>In the year column, the background gets darker and purple-er as devices get older. If older devices were slower, we’d see the year column get darker as we read down the chart.</p> <p>The next two columns show the clock speed and number of transistors in the processor. Smaller numbers are darker and blue-er. As above, if slower clocked and smaller chips correlated with longer latency, the columns would get darker as we go down the table, but it, if anything, seems to be the other way around.</p> <p>For reference, the latency of a packet going around the world through fiber from NYC back to NYC via <a href="https://www.extremetech.com/extreme/122989-1-5-billion-the-cost-of-cutting-london-toyko-latency-by-60ms">Tokyo and London</a> is inserted in the table.</p> <p>If we look at overall results, the fastest machines are ancient. Newer machines are all over the place. Fancy gaming rigs with unusually high refresh-rate displays are almost competitive with machines from the late 70s and early 80s, but “normal” modern computers can’t compete with thirty to forty year old machines.</p> <p>We can also look at mobile devices. In this case, we’ll look at scroll latency in the browser:</p> <table> <tr> <th>device</th><th>latency<br>(ms)</th><th>year</th></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B072V4HK9F/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B072V4HK9F&linkId=5679f47f9a699d00e721eff9592bed11">ipad pro 10.5" pencil</a></td><td bgcolor=#ffffcc><font color=black>30</td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B072V4HK9F/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B072V4HK9F&linkId=5679f47f9a699d00e721eff9592bed11">ipad pro 10.5"</a></td><td bgcolor=#fed976><font color=black>70</td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B006FMDVDK/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B006FMDVDK&linkId=9ad1ee1c78ed24a00d7aeaa7acba466b">iphone 4s</a></td><td bgcolor=#fed976><font color=black>70</td><td bgcolor=#dadaeb><font color=black>2011</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00YD547Q6/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00YD547Q6&linkId=8760bcadabac2a1a34059f5695523d18">iphone 6s</a></td><td bgcolor=#fed976><font color=black>70</td><td bgcolor=#efedf5><font color=black>2015</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B008VUNRZQ/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B008VUNRZQ&linkId=f9f973947891d7eb01a251146c8a17ed">iphone 3gs</a></td><td bgcolor=#fed976><font color=black>70</td><td bgcolor=#dadaeb><font color=black>2009</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B075QN8NDJ/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B075QN8NDJ&linkId=17f3a63a6d672e453c1bb28c26f1e680">iphone x</a></td><td bgcolor=#fed976><font color=black>80</td><td bgcolor=#fcfbfd><font 
color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B075QJSQLT/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B075QJSQLT&linkId=0e9df41aaed6dc80988793dcbc08a175">iphone 8</a></td><td bgcolor=#fed976><font color=black>80</td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B0743HK992/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B0743HK992&linkId=ad1f03111643f0e52f245f120f8efb2b">iphone 7</a></td><td bgcolor=#fed976><font color=black>80</td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00YD545CC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00YD545CC&linkId=cc899d2e45b87f032042af8eb666cef1">iphone 6</a></td><td bgcolor=#fed976><font color=black>80</td><td bgcolor=#efedf5><font color=black>2014</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B077629NJ2/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B077629NJ2&linkId=cadd23b2ef84a2b9c6abe268754e6cb4">gameboy color</a></td><td bgcolor=#fed976><font color=black>80</td><td bgcolor=#9e9ac8><font color=black>1998</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00WZR5URO/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00WZR5URO&linkId=f781b3a6200f81f6a0137a69ab1b4c0f">iphone 5</a></td><td bgcolor=#feb24c><font color=black>90</td><td bgcolor=#efedf5><font color=black>2012</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00CES5B96/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00CES5B96&linkId=fab4357d277c9716f691d15431aace90">blackberry q10</a></td><td bgcolor=#feb24c><font color=black>100</td><td bgcolor=#efedf5><font color=black>2013</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01IVV7W8M/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01IVV7W8M&linkId=be83bfddcd1ff8029f548810b405a281">huawei honor 8</a></td><td bgcolor=#fd8d3c><font color=black>110</td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B0766TPHSH/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B0766TPHSH&linkId=774b61b9fe3692df21821844dfedc18b">google pixel 2 xl</a></td><td bgcolor=#fd8d3c><font color=black>110</td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01F48QLFA/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01F48QLFA&linkId=00e8a7505d84e546c50e3d8761944d05">galaxy s7</a></td><td bgcolor=#fd8d3c><font color=black>120</td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00TJ4FFHQ/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00TJ4FFHQ&linkId=f74214ac520214bcb199ce2710ed470f">galaxy note 3</a></td><td bgcolor=#fd8d3c><font color=black>120</td><td bgcolor=#efedf5><font 
color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00EP2CLII/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00EP2CLII&linkId=f3ad24ebb82fb4b8282ae94deede1319">moto x</a></td><td bgcolor=#fd8d3c><font color=black>120</td><td bgcolor=#efedf5><font color=black>2013</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B076KTGJPG/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B076KTGJPG&linkId=9164d7895c51ee31fd8db00c3d1597e7">nexus 5x</a></td><td bgcolor=#fd8d3c><font color=black>120</td><td bgcolor=#efedf5><font color=black>2015</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01MXR13TZ/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01MXR13TZ&linkId=924c9f9d8f2413c64f0f295d505f09da">oneplus 3t</a></td><td bgcolor=#fc4e2a><font color=black>130</td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B075QBLBN8/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B075QBLBN8&linkId=439fabf5b71beb2b4820618432366e72">blackberry key one</a></td><td bgcolor=#fc4e2a><font color=black>130</td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B016QP7SKC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B016QP7SKC&linkId=207d8847bf6e6740694ea85cc97a8363">moto e (2g)</a></td><td bgcolor=#fc4e2a><font color=black>140</td><td bgcolor=#efedf5><font color=black>2015</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B071YC3G5V/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B071YC3G5V&linkId=d5c16509db8a503bfd0e9fa4480bd92f">moto g4 play</a></td><td bgcolor=#fc4e2a><font color=black>140</td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01DZJFYLC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01DZJFYLC&linkId=d173832b8e78e184d0ec5f801c582403">moto g4 plus</a></td><td bgcolor=#fc4e2a><font color=black>140</td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B0731KJVGG/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B0731KJVGG&linkId=a52d46d36b6136641937b82c5f99fa1d">google pixel</a></td><td bgcolor=#fc4e2a><font color=black>140</td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00N3FSWG8/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00N3FSWG8&linkId=a4df2be4f14002eed6fdbd267b8dca91">samsung galaxy avant</a></td><td bgcolor=#e31a1c><font color=black>150</td><td bgcolor=#efedf5><font color=black>2014</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01LZ8516T/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01LZ8516T&linkId=16070e26fe5b1eee184b3fcfa6a7f373">asus zenfone3 max</a></td><td bgcolor=#e31a1c><font color=black>150</td><td bgcolor=#efedf5><font 
color=black>2016</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B01LACBMV2/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01LACBMV2&linkId=6a38fa7701b76f17fc1ba183e2bac551">sony xperia z5 compact</a></td><td bgcolor=#e31a1c><font color=black>150</td><td bgcolor=#efedf5><font color=black>2015</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00F618DXE/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00F618DXE&linkId=481f35c95f57b730ab417f680ff1cdf3">htc one m4</a></td><td bgcolor=#e31a1c><font color=black>160</td><td bgcolor=#efedf5><font color=black>2013</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B00UZ7QJ6W/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B00UZ7QJ6W&linkId=a37479092aca4085fc7bee582232fbfe">galaxy s4 mini</a></td><td bgcolor=#bd0026><font color=black>170</td><td bgcolor=#efedf5><font color=black>2013</font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B072K1BY62/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B072K1BY62&linkId=7a9dc6d3a3400a24cba9141b09d85f32">lg k4</a></td><td bgcolor=#800026><font color=white>180</font></td><td bgcolor=#efedf5><font color=black>2016</font></td></tr> <tr> <td class="l" bgcolor="silver">packet</td><td bgcolor=silver><font color=black>190</td><td bgcolor=silver><font color=black></font></td></tr> <tr> <td class="l"><a href="https://www.amazon.com/gp/product/B0073YBGCC/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=abroaview-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B0073YBGCC&linkId=75785e57c1c5a2318dcd95a6178af60c">htc rezound</a></td><td bgcolor=black><font color=white>240</font></td><td bgcolor=#dadaeb><font color=black>2011</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Pilot_1000">palm pilot 1000</a></td><td bgcolor=black><font color=white>490</font></td><td bgcolor=#807dba><font color=black>1996</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Amazon_Kindle">kindle oasis 2</a></td><td bgcolor=black><font color=white>570</font></td><td bgcolor=#fcfbfd><font color=black>2017</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Amazon_Kindle">kindle paperwhite 3</a></td><td bgcolor=black><font color=white>630</font></td><td bgcolor=#efedf5><font color=black>2015</font></td></tr> <tr> <td class="l"><a href="https://en.wikipedia.org/wiki/Amazon_Kindle">kindle 4</a></td><td bgcolor=black><font color=white>860</font></td><td bgcolor=#dadaeb><font color=black>2011</font></td></tr> </table> <p>As above, the results are sorted by latency and color-coded from green to yellow to red to black as devices get slower. Also as above, the year gets purple-er (and darker) as the device gets older.</p> <p>If we exclude the <code>game boy color</code>, which is a different class of device than the rest, all of the quickest devices are Apple phones or tablets. The next quickest device is the <code>blackberry q10</code>. Although we don’t have enough data to really tell why the <code>blackberry q10</code> is unusually quick for a non-Apple device, one plausible guess is that it’s helped by having actual buttons, which are easier to implement with low latency than a touchscreen. 
The other two devices with actual buttons are the <code>gameboy color</code> and the <code>kindle 4</code>.</p> <p>After the <code>iphones</code> and non-kindle button devices, we have a variety of Android devices of various ages. At the bottom, we have the ancient <code>palm pilot 1000</code> followed by the kindles. The <code>palm</code> is hamstrung by a touchscreen and display created in an era with much slower touchscreen technology and the <code>kindles</code> use <a href="https://en.wikipedia.org/wiki/E_Ink">e-ink</a> displays, which are much slower than the displays used on modern phones, so it’s not surprising to see those devices at the bottom.</p> <h3 id="why-is-the-apple-2e-so-fast">Why is the <code>apple 2e</code> so fast?</h3> <p>Compared to a modern computer that’s not the latest <code>ipad pro</code>, the <code>apple 2</code> has significant advantages on both the input and the output, and it also has an advantage between the input and the output for all but the most carefully written code since the <code>apple 2</code> doesn’t have to deal with context switches, buffers involved in handoffs between different processes, etc.</p> <p>On the input, if we look at modern keyboards, it’s common to see them scan their inputs at <code>100 Hz</code> to <code>200 Hz</code> (e.g., <a href="https://github.com/benblazak/ergodox-firmware">the ergodox claims to scan at <code>167 Hz</code></a>). By comparison, the <code>apple 2e</code> effectively scans at <code>556 Hz</code>. See appendix for details.</p> <p>If we look at the other end of the pipeline, the display, we can also find latency bloat there. I have a display that advertises <code>1 ms</code> switching on the box, but if we look at how long it takes for the display to actually show a character from when you can first see the trace of it on the screen until the character is solid, it can easily be <code>10 ms</code>. You can even see this effect with some high-refresh-rate displays that are sold on their allegedly good latency.</p> <p>At <code>144 Hz</code>, each frame takes <code>7 ms</code>. A change to the screen will have <code>0 ms</code> to <code>7 ms</code> of extra latency as it waits for the next frame boundary before getting rendered (on average, we expect half of the maximum latency, or <code>3.5 ms</code>). On top of that, even though my display at home advertises a <code>1 ms</code> switching time, it actually appears to take <code>10 ms</code> to fully change color once the display has started changing color. When we add up the latency from waiting for the next frame to the latency of an actual color change, we get an expected latency of <code>7/2 + 10 = 13.5 ms</code>.</p> <p>With the old CRT in the <code>apple 2e</code>, we’d expect half of a <code>60 Hz</code> refresh (<code>16.7 ms / 2</code>) plus a negligible delay, or <code>8.3 ms</code>. That’s hard to beat today: a state of the art “gaming monitor” can get the total display latency down into the same range, but in terms of marketshare, very few people have such displays, and even displays that are advertised as being fast aren’t always actually fast.</p> <h3 id="ios-rendering-pipeline">iOS rendering pipeline</h3> <p>If we look at what’s happening between the input and the output, the differences between a modern system and an <code>apple 2e</code> are too many to describe without writing an entire book.
To get a sense of the situation in modern machines, here’s former iOS/UIKit engineer <a href="https://andymatuschak.org/">Andy Matuschak</a>’s high-level sketch of what happens on iOS, which he says should be presented with the disclaimer that “this is my out of date memory of out of date information”:</p> <ul> <li>hardware has its own scanrate (e.g. <code>120 Hz</code> for recent touch panels), so that can introduce up to <code>8 ms</code> latency</li> <li>events are delivered to the kernel through firmware; this is relatively quick but system scheduling concerns may introduce a couple <code>ms</code> here</li> <li>the kernel delivers those events to privileged subscribers (here, <code>backboardd</code>) over a mach port; more scheduling loss possible</li> <li><code>backboardd</code> must determine which process should receive the event; this requires taking a lock against the window server, which shares that information (a trip back into the kernel, more scheduling delay)</li> <li><code>backboardd</code> sends that event to the process in question; more scheduling delay possible before it is processed</li> <li>those events are only dequeued on the main thread; something else may be happening on the main thread (e.g. as result of a timer or network activity), so some more latency may result, depending on that work</li> <li>UIKit introduced <code>1-2 ms</code> event processing overhead, CPU-bound</li> <li>application decides what to do with the event; apps are poorly written, so usually this takes many <code>ms</code>. the consequences are batched up in a data-driven update which is sent to the render server over IPC <ul> <li>If the app needs a new shared-memory video buffer as a consequence of the event, which will happen anytime something non-trivial is happening, that will require round-trip IPC to the render server; more scheduling delays</li> <li>(trivial changes are things which the render server can incorporate itself, like affine transformation changes or color changes to layers; non-trivial changes include anything that has to do with text, most raster and vector operations)</li> <li>These kinds of updates often end up being triple-buffered: the GPU might be using one buffer to render right now; the render server might have another buffer queued up for its next frame; and you want to draw into another. More (cross-process) locking here; more trips into kernel-land.</li> </ul></li> <li>the render server applies those updates to its render tree (a few <code>ms</code>)</li> <li>every <code>N Hz</code>, the render tree is flushed to the GPU, which is asked to fill a video buffer <ul> <li>Actually, though, there’s often triple-buffering for the screen buffer, for the same reason I described above: the GPU’s drawing into one now; another might be being read from in preparation for another frame</li> </ul></li> <li>every <code>N Hz</code>, that video buffer is swapped with another video buffer, and the display is driven directly from that memory <ul> <li>(this <code>N Hz</code> isn’t necessarily ideally aligned with the preceding step’s <code>N Hz</code>)</li> </ul></li> </ul> <p>Andy says “the actual amount of <em>work</em> happening here is typically quite small. A few <code>ms</code> of CPU time. 
Key overhead comes from:”</p> <ul> <li>periodic scanrates (input device, render server, display) imperfectly aligned</li> <li>many handoffs across process boundaries, each an opportunity for something else to get scheduled instead of the consequences of the input event</li> <li>lots of locking, especially across process boundaries, necessitating trips into kernel-land</li> </ul> <p>By comparison, on the Apple 2e, there basically aren’t handoffs, locks, or process boundaries. Some very simple code runs and writes the result to the display memory, which causes the display to get updated on the next scan.</p> <h3 id="refresh-rate-vs-latency">Refresh rate vs. latency</h3> <p>One thing that’s curious about the computer results is the impact of refresh rate. We get a <code>90 ms</code> improvement from going from <code>24 Hz</code> to <code>165 Hz</code>. At <code>24 Hz</code> each frame takes <code>41.67 ms</code> and at <code>165 Hz</code> each frame takes <code>6.061 ms</code>. As we saw above, if there weren’t any buffering, we’d expect the average latency added by frame refreshes to be <code>20.8 ms</code> in the former case and <code>3.03 ms</code> in the latter case (because we’d expect to arrive at a uniform random point in the frame and have to wait between <code>0 ms</code> and the full frame time), which is a difference of about <code>18 ms</code>. But the difference is actually <code>90 ms</code>, implying we have latency equivalent to <code>(90 - 18) / (41.67 - 6.061) = 2</code> buffered frames.</p> <p>If we plot the results from the other refresh rates on the same machine (not shown), we can see that they’re roughly in line with a “best fit” curve that we get if we assume that, for that machine running powershell, we get 2.5 frames worth of latency regardless of refresh rate. This lets us estimate what the latency would be if we equipped this low latency gaming machine with an <code>infinity Hz</code> display -- we’d expect latency to be <code>140 - 2.5 * 41.67 = 36 ms</code>, almost as fast as quick but standard machines from the 70s and 80s.</p>
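<p>To make that estimate concrete, here’s a minimal sketch in Python of the “fixed cost plus buffered frames times frame time” model, using the <code>165 Hz</code> and <code>24 Hz</code> <code>custom haswell-e</code> numbers from the table above (the function and variable names are just illustrative):</p> <pre><code>def frame_ms(hz):
    # Duration of one frame in milliseconds.
    return 1000.0 / hz

measured = {165: 50.0, 24: 140.0}  # end-to-end latency at two refresh rates

# Without buffering, only the average wait for the next frame boundary
# (half a frame) should change with refresh rate.
expected_delta = frame_ms(24) / 2 - frame_ms(165) / 2  # about 18 ms
actual_delta = measured[24] - measured[165]            # 90 ms

buffered = (actual_delta - expected_delta) / (frame_ms(24) - frame_ms(165))
print(buffered)  # about 2 buffered frames

# With roughly 2.5 frames of frame-rate-dependent latency (buffering plus
# the half-frame wait), an infinity Hz display would be expected to give:
print(measured[24] - 2.5 * frame_ms(24))  # about 36 ms
</code></pre>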
<h3 id="complexity">Complexity</h3> <p>Almost every computer and mobile device that people buy today is slower than common models of computers from the 70s and 80s. Low-latency gaming desktops and the <code>ipad pro</code> can get into the same range as quick machines from thirty to forty years ago, but most off-the-shelf devices aren’t even close.</p> <p>If we had to pick one root cause of latency bloat, we might say that it’s because of “complexity”. Of course, we all know that complexity is bad. If you’ve been to a non-academic non-enterprise tech conference in the past decade, there’s a good chance that there was at least one talk on how complexity is the root of all evil and we should aspire to reduce complexity.</p> <p>Unfortunately, it's a lot harder to remove complexity than to give a talk saying that we should remove complexity. A lot of the complexity buys us something, either directly or indirectly. When we looked at the input of a fancy modern keyboard vs. the <code>apple 2</code> keyboard, we saw that using a relatively powerful and expensive general purpose processor to handle keyboard inputs can be slower than dedicated logic for the keyboard, which would both be simpler and cheaper. However, using the processor gives people the ability to easily customize the keyboard, and also pushes the problem of “programming” the keyboard from hardware into software, which reduces the cost of making the keyboard. The more expensive chip increases the manufacturing cost, but considering how much of the cost of these small-batch artisanal keyboards is the design cost, it seems like a net win to trade manufacturing cost for ease of programming.</p> <p>We see this kind of tradeoff in every part of the pipeline. One of the biggest examples of this is the OS you might run on a modern desktop vs. the loop that’s running on the <code>apple 2</code>. Modern OSes let programmers write generic code that can deal with having other programs simultaneously running on the same machine, and do so with pretty reasonable general performance, but we pay a huge complexity cost for this and the handoffs involved in making this easy result in a significant latency penalty.</p> <p>A lot of the complexity might be called <a href="http://wiki.c2.com/?AccidentalComplexity">accidental complexity</a>, but most of that accidental complexity is there because it’s so convenient. At every level from the hardware architecture to the syscall interface to the I/O framework we use, we take on complexity, much of which could be eliminated if we could sit down and re-write all of the systems and their interfaces today, but it’s too inconvenient to re-invent the universe to reduce complexity and we get benefits from economies of scale, so we live with what we have.</p> <p>For those reasons and more, in practice, the solution to poor performance caused by “excess” complexity is often to add more complexity. In particular, the gains we’ve seen that get us back to the quickness of the quickest machines from thirty to forty years ago have come not from listening to exhortations to reduce complexity, but from piling on more complexity.</p> <p>The <code>ipad pro</code> is a feat of modern engineering; the engineering that went into increasing the refresh rate on both the input and the output as well as making sure the software pipeline doesn’t have unnecessary buffering is complex! The design and manufacture of high-refresh-rate displays that can push system latency down is also non-trivially complex in ways that aren’t necessary for bog standard <code>60 Hz</code> displays.</p> <p>This is actually a common theme when working on latency reduction. A common trick to reduce latency is to add a cache, but adding a cache to a system makes it more complex. For systems that generate new data and can’t tolerate a cache, the solutions are often even more complex. An example of this might be <a href="https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet">large scale RoCE deployments</a>. These can push remote data access latency from the millisecond range down to the microsecond range, <a href="//danluu.com/infinite-disk/">which enables new classes of applications</a>. However, this has come at a large cost in complexity. Early large-scale RoCE deployments easily took tens of person years of effort to get right and also came with a tremendous operational burden.</p> <h3 id="conclusion">Conclusion</h3> <p>It’s a bit absurd that a modern gaming machine running at <code>4,000x</code> the speed of an <code>apple 2</code>, with a CPU that has <code>500,000x</code> as many transistors (with a GPU that has <code>2,000,000x</code> as many transistors) can maybe manage the same latency as an <code>apple 2</code> in very carefully coded applications if we have a monitor with nearly <code>3x</code> the refresh rate.
It’s perhaps even more absurd that the default configuration of the <code>powerspec g405</code>, which had the fastest single-threaded performance you could get until October 2017, had more latency from keyboard-to-screen (approximately <code>3 feet</code>, maybe <code>10 feet</code> of actual cabling) than sending a packet around the world (<code>16187 mi</code> from NYC to Tokyo to London back to NYC; the actual route is longer than that because it’s too expensive to run fiber along the shortest possible path).</p> <p>On the bright side, we’re arguably emerging from the latency dark ages and it’s now possible to assemble a computer or buy a tablet with latency that’s in the same range as you could get off-the-shelf in the 70s and 80s. This reminds me a bit of the screen resolution &amp; density dark ages, where CRTs from the 90s offered better resolution and higher pixel density than affordable non-laptop LCDs until relatively recently. 4k displays have now become normal and affordable 8k displays are on the horizon, blowing past anything we saw on consumer CRTs. I don’t know that we’ll see the same kind of improvement with respect to latency, but one can hope. There are individual developers improving the experience for people who use certain, very carefully coded, applications, but it's not clear what force could cause a significant improvement in the default experience most users see.</p> <h3 id="other-posts-on-latency-measurement">Other posts on latency measurement</h3> <ul> <li><a href="//danluu.com/term-latency/">Terminal latency</a></li> <li><a href="//danluu.com/keyboard-latency/">Keyboard latency</a></li> <li><a href="//danluu.com/keyboard-v-mouse/">Mouse vs. keyboard latency</a> (human factors, not device latency)</li> <li><a href="https://pavelfatin.com/typing-with-pleasure/">Editor latency</a> (by Pavel Fatin)</li> <li><a href="http://www.lofibucket.com/articles/dwm_latency.html">Windows 10 compositing latency</a> (by Pekka Vaananen)</li> <li><a href="http://blogs.valvesoftware.com/abrash/latency-the-sine-qua-non-of-ar-and-vr/">AR/VR latency</a> (by Michael Abrash)</li> <li><a href="//danluu.com/latency-mitigation/">Latency mitigation strategies</a> (by John Carmack)</li> </ul> <h3 id="appendix-why-measure-latency">Appendix: why measure latency?</h3> <p>Latency matters! For very simple tasks, <a href="https://pdfs.semanticscholar.org/386a/15fd85c162b8e4ebb6023acdce9df2bd43ee.pdf">people can perceive latencies down to <code>2 ms</code> or less</a>. Moreover, increasing latency is not only noticeable to users, <a href="http://www.tactuallabs.com/papers/howMuchFasterIsFastEnoughCHI15.pdf">it causes users to execute simple tasks less accurately</a>. If you want a visual demonstration of what latency looks like and you don’t have a super-fast old computer lying around, <a href="https://www.youtube.com/watch?v=vOvQCPLkPt4">check out this MSR demo on touchscreen latency</a>.</p> <p>The most commonly cited document on response time is the nielsen group article on response times, which claims that latencies below <code>100ms</code> feel equivalent and are perceived as instantaneous. One easy way to see that this is false is to go into your terminal and try <code>sleep 0; echo &quot;pong&quot;</code> vs. <code>sleep 0.1; echo &quot;pong&quot;</code> (or for that matter, try playing an old game that doesn't have latency compensation, like quake 1, with <code>100 ms</code> ping, or even <code>30 ms</code> ping, or try typing in a terminal with <code>30 ms</code> ping).
For more info on this and other latency fallacies, <a href="//danluu.com/keyboard-latency/#appendix-counter-arguments-to-common-arguments-that-latency-doesn-t-matter">see this document on common misconceptions about latency</a>.</p> <p><a href="https://stackoverflow.com/a/39187441/334816">Throughput</a> also matters, but this is widely understood and measured. If you go to pretty much any mainstream review or benchmarking site, you can find a wide variety of throughput measurements, so there’s less value in writing up additional throughput measurements.</p> <h3 id="appendix-apple-2-keyboard">Appendix: apple 2 keyboard</h3> <p>The <code>apple 2e</code>, instead of using a programmed microcontroller to read the keyboard, uses a much simpler custom chip designed for reading keyboard input, the AY 3600. If we look at <a href="//danluu.com/AY3600.pdf">the AY 3600 datasheet</a>, we can see that the scan time is <code>(90 * 1/f)</code> and the debounce time is listed as <code>strobe_delay</code>. These quantities are determined by some capacitors and a resistor, which appear to be <code>47pf</code>, <code>100k ohms</code>, and <code>0.022uf</code> for the Apple 2e. Plugging these numbers into <a href="//danluu.com/AY3600.pdf">the AY3600 datasheet</a>, we can see that <code>f = 50 kHz</code>, giving us a <code>1.8 ms</code> scan delay and a <code>6.8 ms</code> debounce delay (assuming the values are accurate -- <a href="http://www.kemet.com/Lists/TechnicalArticles/Attachments/191/Why%2047%20uF%20capacitor%20drops%20to%2037%20uF-%2030%20uF-%20or%20lower.pdf">capacitors can degrade over time</a>, so we should expect the real delays to be shorter on our old Apple 2e), giving us less than <code>8.6 ms</code> for the internal keyboard logic.</p> <p>Comparing to a keyboard with a <code>167 Hz</code> scan rate that <a href="https://www.arduino.cc/en/Tutorial/Debounce">scans two extra times to debounce</a>, the equivalent figure is <code>3 * 6 ms = 18 ms</code>. With a <code>100 Hz</code> scan rate, that becomes <code>3 * 10 ms = 30 ms</code>. <code>18 ms</code> to <code>30 ms</code> of keyboard scan plus debounce latency is in line with <a href="//danluu.com/keyboard-latency/">what we saw when we did some preliminary keyboard latency measurements</a>.</p> <p>For reference, the ergodox uses a <code>16 MHz</code> microcontroller with ~80k transistors and the <code>apple 2e</code> CPU is a <code>1 MHz</code> chip with 3.5k transistors.</p>
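<p>The keyboard arithmetic above is simple enough to sanity check in a few lines of Python (a sketch using the datasheet-derived AY3600 values and the scan rates quoted above; the helper function is just illustrative):</p> <pre><code># AY3600 (apple 2e): scan time is 90 clock periods at f = 50 kHz, plus the
# datasheet-derived debounce (strobe) delay quoted above.
ay3600_scan_ms = 90 / 50e3 * 1000   # 1.8 ms
ay3600_debounce_ms = 6.8
print(ay3600_scan_ms + ay3600_debounce_ms)   # 8.6 ms upper bound for keyboard logic

# A modern keyboard that scans at some rate and takes two extra scans to
# debounce sees roughly three scan periods of latency.
def modern_keyboard_ms(scan_hz, scans_to_register=3):
    return scans_to_register * 1000.0 / scan_hz

print(modern_keyboard_ms(167))   # about 18 ms
print(modern_keyboard_ms(100))   # 30 ms
</code></pre>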
<h3 id="appendix-why-should-android-phones-have-higher-latency-than-old-apple-phones">Appendix: why should android phones have higher latency than old apple phones?</h3> <p>As we've seen, raw processing power doesn't help much with many of the causes of latency in the pipeline, like handoffs between different processes, so an android phone with a 10x more powerful processor than an ancient iphone isn't guaranteed to be quicker to respond, even if it can render javascript heavy pages faster.</p> <p>If you talk to people who work on non-Apple mobile CPUs, you'll find that they run benchmarks like dhrystone (a synthetic benchmark that was irrelevant even when it was created, in 1984) and SPEC2006 (an updated version of a workstation benchmark that was relevant in the 90s and perhaps even as late as the early 2000s if you care about workstation workloads, which are completely different from mobile workloads). This is a problem where the vendor who makes the component has <a href="//danluu.com/percentile-latency/">an intermediate target</a> that's only weakly correlated with the actual user experience. I've heard that there are people working on the pixel phones who care about end-to-end latency, but it's difficult to get good latency when you have to use components that are optimized for things like dhrystone and SPEC2006.</p> <p>If you talk to people at Apple, you'll find that they're quite cagey, but that they've been targeting the end-to-end user experience for quite a long time and they can do &quot;full stack&quot; optimizations that are difficult for android vendors to pull off. They're not literally impossible, but making a change to a chip that has to be threaded up through the OS is something you're very unlikely to see unless google is doing the optimization, and google hasn't really been serious about the end-to-end experience until recently.</p> <p>Having relatively poor performance in aspects that aren't measured is a common theme and one we saw when we looked at <a href="//danluu.com/term-latency/">terminal latency</a>. Prior to examining terminal latency, public benchmarks were all throughput oriented and the terminals that prioritized performance worked on increasing throughput, even though increasing terminal throughput isn't really useful. After those terminal latency benchmarks, some terminal authors looked into their latency and found places they could trim down buffering and remove latency. You get what you measure.</p> <h3 id="appendix-experimental-setup">Appendix: experimental setup</h3> <p>Most measurements were taken with the 240fps camera (<code>4.167 ms</code> resolution) in <a href="https://www.amazon.com/gp/product/B071W3DDM7/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B071W3DDM7&amp;linkId=fde44f1346bdf4bc60896580c2f3dffa">the iPhone SE</a>. Devices with response times below <code>40 ms</code> were re-measured with a 1000fps camera (<code>1 ms</code> resolution), the <a href="https://www.amazon.com/gp/product/B01M62I8Q7/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01M62I8Q7&amp;linkId=d5f90ebe0b220a8c6e550761f663fb19">Sony RX100 V</a> in PAL mode. Results in the tables are the results of multiple runs and are rounded to the nearest <code>10 ms</code> to avoid the impression of false precision. For desktop results, results are measured from when the key started moving until the screen finished updating. Note that this is different from most key-to-screen-update measurements you can find online, which typically use a setup that effectively removes much or all of the keyboard latency, which, as an end-to-end measurement, is only realistic if you have a psychic link to your computer (this isn't to say the measurements aren't useful -- if, as a programmer, you want a reproducible benchmark, it's nice to reduce measurement noise from sources that are beyond your control, but that's not relevant to end users). People often advocate measuring from one of: {the key bottoming out, the tactile feel of the switch}. Other than for measurement convenience, there appears to be no reason to do either of these, but people often claim that's when the user expects the keyboard to &quot;really&quot; work. But these are independent of when the switch actually fires.
Both the distance between the key bottoming out and activation as well as the distance between feeling feedback and activation are arbitrary and can be tuned. See <a href="//danluu.com/keyboard-latency/">this post on keyboard latency measurements for more info on keyboard fallacies</a>.</p> <p>Another significant difference is that measurements were done with settings as close to the default OS settings as possible since approximately 0% of users will futz around with display settings to reduce buffering, disable the compositor, etc. Waiting until the screen has finished updating is also different from what most end-to-end measurements do -- most consider the update &quot;done&quot; when any movement has been detected on the screen. Waiting until the screen is finished changing is analogous to webpagetest's &quot;visually complete&quot; time.</p> <p>Computer results were taken using the “default” terminal for the system (e.g., powershell on windows, lxterminal on lubuntu), <a href="//danluu.com/term-latency/">which could easily cause <code>20 ms</code> to <code>30 ms</code> difference between a fast terminal and a slow terminal</a>. Between measuring time in a terminal and measuring the full end-to-end time, measurements in this article should be slower than measurements in other, similar, articles (which tend to measure time to first change in games).</p> <p>The <code>powerspec g405</code> baseline result is using integrated graphics (the machine doesn’t come with a graphics card) and the <code>60 Hz</code> result is with a cheap video card. The baseline result was at <code>30 Hz</code> because the integrated graphics only supports <code>hdmi</code> output and the display it was attached to only runs at <code>30 Hz</code> over <code>hdmi</code>.</p> <p>Mobile results were done by using the default browser, browsing to <a href="https://danluu.com">https://danluu.com</a>, and measuring the latency from finger movement until the screen first updates to indicate that scrolling has occurred. In the cases where this didn’t make sense (kindles, gameboy color, etc.), some action that makes sense for the platform was taken (changing pages on the kindle, pressing the joypad on the gameboy color in a game, etc.). Unlike with the desktop/laptop measurements, the end time for the measurement was the first visual change, to avoid including many frames of scrolling. To make the measurement easy, the measurement was taken with a finger on the touchscreen and the timer was started when the finger started moving (to avoid having to determine when the finger first contacted the screen).</p> <p>In the case of “ties”, results are ordered by the unrounded latency as a tiebreaker, but this shouldn’t be considered significant. Differences of <code>10 ms</code> should probably also not be considered significant.</p> <p>The <code>custom haswell-e</code> was tested with <code>gsync</code> on and there was no observable difference. The year for that box is somewhat arbitrary, since the CPU is from <code>2014</code>, but the display is newer (I believe you couldn’t get a <code>165 Hz</code> display until <code>2015</code>).</p> <p>The number of transistors for some modern machines is a rough estimate because exact numbers aren’t public.
Feel free to ping me if you have a better estimate!</p> <p>The color scales for latency and year are linear and the color scales for clock speed and number of transistors are log scale.</p> <p>All Linux results were done with a <a href="https://lwn.net/SubscriberLink/741878/eb6c9d3913d7cb2b/">pre-KPTI</a> kernel. It's possible that KPTI will impact user perceivable latency.</p> <p>Measurements were done as cleanly as possible (without other things running on the machine/device when possible, with a device that was nearly full on battery for devices with batteries). Latencies when other software is running on the device or when devices are low on battery might be much higher.</p> <p>If you want a reference to compare the kindle against, a moderately quick page turn in a physical book appears to be about <code>200 ms</code>.</p> <p><i>This is a work in progress. I expect to get benchmarks from a lot more old computers the next time I visit Seattle. If you know of old computers I can test in the NYC area (that have their original displays or something like them), let me know! If you have a device you’d like to donate for testing, feel free to mail it to</i></p> <p>Dan Luu<br> Recurse Center<br> 455 Broadway, 2nd Floor<br> New York, NY 10013</p> <p><i>Thanks to <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>, David Albert, Bert Muthalaly, Christian Ternus, Kate Murphy, Ikhwan Lee, Peter Bhat Harkins, Leah Hanson, Alicia Thilani Singham Goodwin, Amy Huang, Dan Bentley, Jacquin Mininger, Rob, Susan Steinman, Raph Levien, Max McCrea, Peter Town, Jon Cinque, Anonymous, and Jonathan Dahan for donating devices to test and thanks to Leah Hanson, Andy Matuschak, Milosz Danczak, amos (@fasterthanlime), @emitter_coupled, Josh Jordan, mrob, and David Albert for comments/corrections/discussion.</i></p> How good are decisions? Evaluating decision quality in domains where evaluation is easy bad-decisions/ Tue, 21 Nov 2017 00:00:00 +0000 bad-decisions/ <p>A statement I commonly hear in tech-utopian circles is that some seeming inefficiency can’t actually be inefficient because the market is efficient and inefficiencies will quickly be eliminated. A contentious example of this is <a href="//danluu.com/tech-discrimination/">the claim that companies can’t be discriminating because the market is too competitive to tolerate discrimination</a>. A less contentious example is that when you see a big company doing something that seems bizarrely inefficient, maybe it’s not inefficient and you just lack the information necessary to understand why the decision was efficient. These kinds of statements are often accompanied by statements about how &quot;incentives matter&quot; or the CEO has &quot;skin in the game&quot; whereas the commentator does not.</p> <p>Unfortunately, arguments like this are difficult to settle because, even in retrospect, it’s usually not possible to get enough information to determine the precise “value” of a decision. Even in cases where the decision led to an unambiguous success or failure, there are so many factors that led to the result that it’s difficult to figure out precisely why something happened.</p> <p>In this post, we'll look at two classes of examples where we can see how good people's decisions are and how they respond to easy-to-obtain data showing that people are making bad decisions.
Both classes of examples are from domains where the people making or discussing the decision seem to care a lot about the decision and the data clearly show that the decisions are very poor.</p> <p>The first class of example comes from sports and the second comes from board games. One nice thing about sports is that they often have detailed play-by-play data and well-defined win criteria which lets us tell, on average, what the expected value of a decision is. In this post, we’ll look at the cost of bad decision making in one sport and then briefly discuss why decision quality in sports might be the same as or better than decision quality in other fields. Sports are fertile ground because decision making was non-data driven and generally terrible until fairly recently, so we have over a century of information for major U.S. sports and, for a decent fraction of that time period, fans would write analyses about how poor decision making was and how much it cost teams, which teams would ignore (this has since changed and basically every team has a staff of stats-PhDs or the equivalent looking at data).</p> <h4 id="baseball">Baseball</h4> <p>In <a href="talent/">another post, we looked at how &quot;hiring&quot; decisions in sports were total nonsense</a>. In this post, because one of the top &quot;rationality community&quot; thought leaders gave the common excuse that in-game baseball decision making by coaches isn't that costly (&quot;Do bad in-game decisions cost games? Absolutely. But not <em>that many</em> games. Maybe they lose you 4 a year out of 162.&quot;; the entire post implies this isn't a big deal and it's fine to throw away 4 games), we'll look at how costly bad decision making is and how much teams spend to buy an equivalent number of wins in other ways. However, you could do the same kind of analysis for football, hockey, basketball, etc., and my understanding is that you’d get a roughly similar result in all of those cases.</p> <p>We’re going to model baseball as a state machine, both because that makes it easy to understand the expected value of particular decisions and because this lets us talk about the value of decisions without having to go over most of the rules of baseball.</p> <p>We can treat each baseball game as an independent event. In each game, two teams play against each other and the team that scores more runs (points) wins. Each game is split into 9 “innings” and in each inning each team will get one set of chances on offense. In each inning, each team will play until it gets 3 “outs”. Any given play may or may not result in an out.</p> <p>One chunk of state in our state machine is the number of outs and the inning. The other chunks of state we’re going to track are who’s “on base” and which player is “at bat”. Each team defines some order of batters for their active players and after each player bats once this repeats in a loop until the team collects 3 outs and the inning is over. The state of who is at bat is saved between innings. Just for example, you might see batters 1-5 bat in the first inning, 6-9 and then 1 again in the second inning, 2- … etc.</p> <p>When a player is at bat, the player may advance to a base and players who are on base may also advance, depending on what happens. When a player advances 4 bases (that is, through 1B, 2B, 3B, to what would be 4B except that it isn’t called that) a run is scored and the player is removed from the base.
As mentioned above, various events may cause a player to be out, in which case they also stop being on base.</p> <p>An example state from our state machine is:</p> <pre><code>{1B, 3B; 2 outs} </code></pre> <p>This says that there’s a player on 1B, a player on 3B, there are two outs. Note that this is independent of the score, who’s actually playing, and the inning.</p> <p>Another state is:</p> <pre><code>{--; 0 outs} </code></pre> <p>With a model like this, if we want to determine the expected value of the above state, we just need to look up the total number of runs across all innings played in a season divided by the number of innings to find the expected number of runs from the state above (ignoring the 9th inning because a quirk of baseball rules distorts statistics from the 9th inning). If we do this, we find that, from the above state, a team will score .555 runs in expectation.</p> <p>We can then compute the expected number of runs for all of the other states:</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <table> <tr> <th></th><th>0</th><th>1</th><th>2</th></tr> <tr><th>bases</th><th colspan="3">outs</th></tr> <tr> <th>--</th><td bgcolor=#f7f7f7><font color=black>.555</font></td><td bgcolor=#d6604d><font color=black>.297</font></td><td bgcolor=#67001f><font color=white>.117</font></td></tr> <tr> <th>1B</th><td bgcolor=#d1e5f0><font color=black>.953</font></td><td bgcolor=#f7f7f7><font color=black>.573</font></td><td bgcolor=#b2182b><font color=black>.251</font></td></tr> <tr> <th>2B</th><td bgcolor=#d1e5f0><font color=black>1.189</font></td><td bgcolor=#f7f7f7><font color=black>.725</font></td><td bgcolor=#d6604d><font color=black>.344</font></td></tr> <tr> <th>3B</th><td bgcolor=#92c5de><font color=black>1.482</font></td><td bgcolor=#d1e5f0><font color=black>.983</font></td><td bgcolor=#f4a582><font color=black>.387</font></td></tr> <tr> <th>1B,2B</th><td bgcolor=#92c5de><font color=black>1.573</font></td><td bgcolor=#d1e5f0><font color=black>.971</font></td><td bgcolor=#f4a582><font color=black>.466</font></td></tr> <tr> <th>1B,3B</th><td bgcolor=#4393c3><font color=black>1.904</font></td><td bgcolor=#d1e5f0><font color=black>1.243</font></td><td bgcolor=#fddbc7><font color=black>.538</font></td></tr> <tr> <th>2B,3B</th><td bgcolor=#2166ac><font color=black>2.052</font></td><td bgcolor=#92c5de><font color=black>1.467</font></td><td bgcolor=#f7f7f7><font color=black>.634</font></td></tr> <tr> <th>1B,2B,3B</th><td bgcolor=#053061><font color=white>2.417</font></td><td bgcolor=#92c5de><font color=black>1.650</font></td><td bgcolor=#f7f7f7><font color=black>.815</font></td></tr> <tr> </tr> </table></p> <p>In this table, each entry is the expected number of runs from the remainder of the inning from some particular state. Each column shows the number of outs and each row shows the state of the bases. The color coding scheme is: the starting state (<code>.555</code> runs) has a white background. States with higher run expectation are more blue and states with lower run expectation are more red.</p> <p>This table and the other stats in this post come from <a href="https://www.amazon.com/gp/product/1494260174/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1494260174&amp;linkId=1d8e432457b958c87f46820d5aa51539">The Book by Tango et al.</a>, which mostly discussed baseball between 1999 and 2002. 
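If you want to play with these run values yourself, here’s a minimal sketch in Python (the state labels and helper function are just illustrative; the numbers are transcribed from the table above) that computes the value of the transitions discussed next: <pre><code># Expected runs scored in the rest of the inning, by (bases occupied, outs),
# transcribed from the table above.
RUN_EXP = {
    ('--', 0): 0.555, ('--', 1): 0.297, ('--', 2): 0.117,
    ('1B', 0): 0.953, ('1B', 1): 0.573, ('1B', 2): 0.251,
    ('2B', 0): 1.189, ('2B', 1): 0.725, ('2B', 2): 0.344,
    ('3B', 0): 1.482, ('3B', 1): 0.983, ('3B', 2): 0.387,
    ('1B,2B', 0): 1.573, ('1B,2B', 1): 0.971, ('1B,2B', 2): 0.466,
    ('1B,3B', 0): 1.904, ('1B,3B', 1): 1.243, ('1B,3B', 2): 0.538,
    ('2B,3B', 0): 2.052, ('2B,3B', 1): 1.467, ('2B,3B', 2): 0.634,
    ('1B,2B,3B', 0): 2.417, ('1B,2B,3B', 1): 1.650, ('1B,2B,3B', 2): 0.815,
}

def transition_value(before, after, runs_scored=0):
    # Run value of moving from one state to another, plus any runs scored.
    return runs_scored + RUN_EXP[after] - RUN_EXP[before]

# Putting someone on first with no outs:
print(transition_value(('--', 0), ('1B', 0)))    # 0.398 runs

# Stealing 2B with a runner on 1B and no outs:
gain = transition_value(('1B', 0), ('2B', 0))    # +0.236 runs on success
loss = -transition_value(('1B', 0), ('--', 1))   # 0.656 runs lost on failure
print(loss / (loss + gain))                      # ~0.735 break-even success rate
</code></pre>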
See the appendix if you're curious about how things change if we use a more detailed model.</p> <p>The state we’re tracking for an inning here is who’s on base and the number of outs. Innings start with nobody on base and no outs.</p> <p>As above, we see that we start the inning with <code>.555</code> runs in expectation. If a play puts someone on 1B without getting an out, we now have <code>.953</code> runs in expectation, i.e., putting someone on first without an out is worth <code>.953 - .555 = .398</code> runs.</p> <p>This immediately gives us the value of some decisions, e.g., trying to “steal” <code>2B</code> with no outs and someone on first. If we look at cases where the batter’s state doesn’t change, a successful steal moves us to the <code>{2B, 0 outs}</code> state, i.e., it gives us <code>1.189 - .953 = .236</code> runs. A failed steal moves us to the <code>{--, 1 out}</code> state, i.e., it gives us <code>.297 - .953 = -.656</code> runs. To break even, we need to succeed <code>.656 / .236 = 2.78x</code> more often than we fail, i.e., we need a <code>.735</code> success rate to break even. If we want to compute the average value of a stolen base, we can compute the weighted sum over all states, but for now, let’s just say that it’s possible to do so and that you need something like a <code>.735</code> success rate for stolen bases to make sense.</p> <p>We can then look at the stolen base success rate of teams to see that, in any given season, maybe 5-10 teams are doing better than breakeven, leaving 20-25 teams at breakeven or below (mostly below). If we look at a bad but not historically bad stolen-base team of that era, they might have a .6 success rate. It wouldn’t be unusual for a team from that era to make between 100 and 200 attempts. Just so we can compute an approximation, if we assume they were all attempts from the {1B, 0 outs} state, the average run value per attempt would be .4 * (-.656) + .6 * .236 = -0.12 runs per attempt. Another first-order approximation is that a delta of 10 runs is worth 1 win, so at 100 attempts we have -1.2 wins and at 200 attempts we have -2.4 wins.</p> <p>If we run the math across actual states instead of using the first order approximation, we see that the average failed steal is worth -.467 runs and the average successful steal is worth .175 runs. In that case, a steal attempt with a .6 success rate is worth .4 * (-.467) + .6 * .175 = -0.082 runs. With this new approximation, our estimate for the approximate cost in wins of stealing “as normal” vs. having a “no stealing” rule for a team that steals badly and often is .82 to 1.64 wins per season. Note that this underestimates the cost of stealing since getting into position to steal increases the odds of a successful “pickoff”, which we haven’t accounted for. From our state-machine standpoint, a pickoff is almost equivalent to a failed steal, but the analysis necessary to compute the difference in pickoff probability is beyond the scope of this post.</p> <p>We can also do this for other plays coaches can cause (or prevent). For the “intentional walk”, we see that an intentional walk appears to be worth .102 runs for the opposing team. In 2002, a team that issued “a lot” of intentional walks might have issued 50, resulting in 50 * .102 runs for the opposing team, giving a loss of roughly 5 runs or .5 wins.</p>
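<p>Putting the per-attempt values above together, the season-level arithmetic for steals and intentional walks is only a few more lines (again a sketch, using the illustrative attempt counts from this section and the first-order approximation that 10 runs is worth 1 win):</p> <pre><code>RUNS_PER_WIN = 10.0   # first-order approximation used above

def steal_attempt_runs(success_rate, success_value=0.175, failure_value=-0.467):
    # Average run value of a steal attempt, using the average values above.
    return success_rate * success_value + (1 - success_rate) * failure_value

per_attempt = steal_attempt_runs(0.6)              # about -0.082 runs per attempt
for attempts in (100, 200):
    print(attempts * per_attempt / RUNS_PER_WIN)   # about -0.8 to -1.6 wins

# 50 intentional walks at .102 runs each, given to the opposing team:
print(50 * 0.102 / RUNS_PER_WIN)                   # about 0.5 wins handed to the opponent
</code></pre>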
If we look at the league where pitchers don’t bat, a team that was heavy on sac bunts might’ve done 49 sac bunts (we do this to avoid “pitcher” bunts, which add complexity to the approximation), costing a total of 49 * .027 = 1.32 runs or .132 wins.</p> <p>Another decision that’s made by a coach is setting the batting order. Players bat (take a turn) in order, 1-9, mod 9. That is, when the 10th “player” is up, we actually go back around and the 1st player bats. At some point the game ends, so not everyone on the team ends up with the same number of “at bats”.</p> <p>There’s a just-so story that justifies putting the fastest player first, someone with a high “batting average” second, someone pretty good third, your best batter fourth, etc. This story, or something like it, has been standard for over 100 years.</p> <p>I’m not going to walk through the math for computing a better batting order because I don’t think there’s a short, easy-to-describe approximation. It turns out that if we compute the difference between an “optimal” order and a “typical” order justified by the story in the previous paragraph, using an optimal order appears to be worth between 1 and 2 wins per season.</p> <p>These approximations all leave out important information. In three out of the four cases, we assumed an average player at all times and didn’t look at who was at bat. The information above actually takes this into account to some extent, but not fully. How exactly this differs from a better approximation is a long story and probably too much detail for a post that’s using baseball to talk about decisions outside of baseball, so let’s just say that we have a pretty decent but not amazing approximation that says that a coach who makes bad decisions following conventional wisdom, in the normal range of bad decisions during a baseball season, might cost their team something like 1 + 1.2 + .5 + .132 = 2.83 wins on these four decisions alone vs. a decision rule that says “never do these actions that, on average, have negative value”. If we compare to a better decision rule such as “do these actions when they have positive value and not when they have negative value” or a manager that generally makes good decisions, let’s conservatively estimate that’s maybe worth 3 wins.</p> <p>We’ve looked at four decisions (sac bunt, steal, intentional walk, and batting order). But there are a lot of other decisions! Let’s arbitrarily say that if we look at all decisions and not just these four decisions, having a better heuristic for all decisions might be worth 4 or 5 wins per season.</p> <p>What does 4 or 5 wins per season really mean? One way to look at it is that baseball teams play 162 games, so an “average” team wins 81 games. If we look at the seasons covered, the number of wins that teams that made the playoffs had was {103, 94, 103, 99, 101, 97, 98, 95, 95, 91, 116, 102, 88, 93, 93, 92, 95, 97, 95, 94, 87, 91, 91, 95, 103, 100, 97, 97, 98, 95, 97, 94}. Because of the structure of the system, we can’t name a single number for a season and say that N wins are necessary to make the playoffs and that teams with fewer than N wins won’t make the playoffs, but we can say that 95 wins gives a team decent odds of making the playoffs. 95 - 81 = 14. 5 wins is more than a third of the difference between an average team and a team that makes the playoffs.
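</p> <p>As a quick sanity check on that arithmetic, here’s a small Python sketch using the win totals listed above (the 5-win figure is the rough estimate from earlier in this post):</p> <pre><code>
# Win totals of the playoff teams from the seasons discussed above.
playoff_wins = [103, 94, 103, 99, 101, 97, 98, 95, 95, 91, 116, 102, 88, 93,
                93, 92, 95, 97, 95, 94, 87, 91, 91, 95, 103, 100, 97, 97, 98,
                95, 97, 94]

average_team_wins = 162 / 2      # 81 wins for a .500 team
playoff_threshold = 95           # "decent odds of making the playoffs"
decision_value = 5               # rough estimate for better in-game decisions

gap = playoff_threshold - average_team_wins   # 14 wins
print(min(playoff_wins))                      # lowest win total among these playoff teams
print(decision_value / gap)                   # ~0.36, i.e., more than a third of the gap
</code></pre> <p>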
This is a huge deal both in terms of prestige and also direct economic value.</p> <p>If we want to look at it at the margin instead of on average, the smallest delta in wins between teams that made the playoffs and teams that didn’t in each league was {1, 7, 8, 1, 6, 2, 6, 3}. For teams that are on the edge, a delta of 5 wins wouldn’t always be the difference between a successful season (making playoffs) and an unsuccessful season (not making playoffs), but there are teams within a 5 win delta of making the playoffs in most seasons. If we were actually running a baseball team, we’d want to use a much more fine-grained model, but as a first approximation we can say that in-game decisions are a significant factor in team performance and that, using some kind of computation, we can determine the expected cost of non-optimal decisions.</p> <p>Another way to look at what 5 wins is worth is to look at what it costs to get a non-pitcher who’s 5 wins above average (WAA) (we look at non-pitchers because non-pitchers tend to play in every game and pitchers tend to play in parts of some games, making a comparison between pitchers and non-pitchers more complicated). Of the 8 non-pitcher positions (we look at non-pitcher positions because it makes comparisons simpler), there are 30 teams, so we have 240 team-position pairs. In 2002, of these 240 team-position pairs, <a href="https://www.baseball-reference.com/leagues/team_compare.cgi?request=1&amp;year=2002&amp;lg=MLB&amp;stat=WAA">there are two that were &gt;= 5 WAA</a>, Texas-SS (Alex Rodriguez, paid $22m) and SF-LF (Barry Bonds, paid $15m). If we look at the other seasons in the range of dates we’re looking at, there are either 2 or 3 team-position pairs where a team is able to get &gt;= 5 WAA in a season. These aren’t stable across seasons because player performance is volatile, so it’s not as easy as finding someone great and paying them $15m. For example, in 2002, there were 7 non-pitchers paid $14m or more and only two of them were worth 5 WAA or more. For reference, the average total team payroll (teams have 26 players each) in 2002 was $67m, with a minimum of $34m and a max of $126m. At the time a $1m salary for a manager would’ve been considered generous, making a 5 WAA manager an incredible deal.</p> <p>5 WAA assumes typical decision making lining up with events in a bad, but not worst-case way. A more typical case might be that a manager costs a team 3 wins. In that case, in 2002, there were 25 team-position pairs out of 240 where a single player could make up for the loss caused by managing according to conventional wisdom. Players who provide that much value and who aren’t locked up in artificially cheap deals with particular teams due to the mechanics of player transfers are still much more expensive than managers.</p> <p>If we look at how teams have adopted data analysis in order to improve both in-game decision making and team-composition decisions, it’s been a slow, multi-decade process. <a href="https://www.amazon.com/gp/product/0393324818/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0393324818&amp;linkId=391fa6566a95327d7f06f533fb3e2daf">Moneyball</a> describes part of the shift from using intuition and observation to select players to incorporating statistics into the process.
Stats nerds had been talking about how you could do this since at least <a href="https://en.wikipedia.org/wiki/Sabermetrics">1971</a>, but no team really took it seriously until the 90s, and the ideas didn’t really become mainstream until the mid 2000s, after a bestseller had been published.</p> <p>If we examine how much teams have improved at the in-game decisions we looked at here, the process has been even slower. It’s still true today that statistics-driven decisions aren’t mainstream. Things are getting better, and the aggregate cost of the non-optimal decisions mentioned here has been getting lower over the past couple decades as intuition-driven decisions slowly converge to more closely match what stats nerds have been saying for decades. For example, if we look at the total number of sac bunts recorded across all teams from 1999 until now, we see:</p> <div style="overflow-x:auto;"> <table> <tr> <th>1999</th><th>2000</th><th>2001</th><th>2002</th><th>2003</th><th>2004</th><th>2005</th><th>2006</th><th>2007</th><th>2008</th><th>2009</th><th>2010</th><th>2011</th><th>2012</th><th>2013</th><th>2014</th><th>2015</th><th>2016</th><th>2017</th></tr> <tr> <td bgcolor=#e31a1c><font color=black>1604</font></td><td bgcolor=#e31a1c><font color=black>1628</font></td><td bgcolor=#e31a1c><font color=black>1607</font></td><td bgcolor=#bd0026><font color=black>1633</font></td><td bgcolor=#e31a1c><font color=black>1626</font></td><td bgcolor=#800026><font color=white>1731</font></td><td bgcolor=#e31a1c><font color=black>1620</font></td><td bgcolor=#bd0026><font color=black>1651</font></td><td bgcolor=#e31a1c><font color=black>1540</font></td><td bgcolor=#fc4e2a><font color=black>1526</font></td><td bgcolor=#bd0026><font color=black>1635</font></td><td bgcolor=#e31a1c><font color=black>1544</font></td><td bgcolor=#bd0026><font color=black>1667</font></td><td bgcolor=#fc4e2a><font color=black>1479</font></td><td bgcolor=#fd8d3c><font color=black>1383</font></td><td bgcolor=#fd8d3c><font color=black>1343</font></td><td bgcolor=#fed976><font color=black>1200</font></td><td bgcolor=#ffffcc><font color=black>1025</font></td><td bgcolor=#ffffcc><font color=black>925</font></td></tr> </table> </div> <p>Despite decades of statistical evidence that sac bunts are overused, we didn’t really see a decline across all teams until 2012 or so. Why this is varies on a team-by-team and case-by-case basis, but the fundamental story that’s been repeated over and over again, both for statistically-driven team composition and statistically-driven in-game decisions, is that the people who have the power to make decisions often stick to conventional wisdom instead of using “radical” statistically-driven ideas. There are a number of reasons why this happens. One high-level reason is that the change we’re talking about was a cultural change and cultural change is slow. Even as this change was happening and teams that were more data-driven were outperforming relative to their budgets, anti-data folks ridiculed anyone who was using data. If you were one of the early data folks, you'd have to be <a href="look-stupid/">willing to tolerate a lot of the biggest names in the game calling you stupid, as well as fans, friends, etc.</a>.
It doesn’t surprise people when it takes a generation for scientific consensus to shift in the face of this kind of opposition, so why should baseball be any different?</p> <p>One specific lower-level reason “obviously” non-optimal decisions can persist for so long is that there’s a lot of noise in team results. You sometimes see a manager make some radical decisions (not necessarily statistics-driven), followed by some poor results, causing management to fire the manager. There’s so much volatility that you can’t really judge players or managers based on small samples, but this doesn’t stop people from doing so. The combination of volatility and skepticism of radical ideas heavily disincentivizes going against conventional wisdom.</p> <p>Among the many consequences of this noise is the fact that the winner of the &quot;world series&quot; (the baseball championship) is heavily determined by randomness. Whether or not a team makes the playoffs is determined over 162 games, which isn't enough to remove all randomness, but is enough that the result isn't mostly determined by randomness. This isn't true of the playoffs, which are too short for the outcome to be primarily determined by the difference in the quality of teams. Once a team wins the world series, people come up with all kinds of just-so stories to justify why the team should've won, but if we look across all games, we can see that the stories are just stories. This is, perhaps, not so different from listening to people tell you why their startup was successful.</p> <p>There are metrics we can use that are better predictors of future wins and losses (i.e., are less volatile than wins and losses), but, until recently, convincing people that those metrics were meaningful was also a radical idea.</p> <h4 id="board-games">Board games</h4> <p>That's the baseball example. Now on to the board game example. In this example, we'll look at people who make comments on &quot;modern&quot; board game strategy, by which I mean they comment on strategy for games like Catan, Puerto Rico, Ark Nova, etc.</p> <p>People often vehemently disagree about what works and what doesn't work. Today, most online discussions of this sort happen on boardgamegeek (BGG), which is, by far, the largest forum for discussing board games. A quirk of these discussions is that people often use the same username on BGG as on boardgamearena (BGA), an online boardgame site where people's ratings (Elos) are tracked and publicly visible.</p> <p>So, in these discussions, you'll see someone saying that strategy X is dominant. Then someone else will come in and say, no, strategy Y beats strategy X, I win with strategy Y all the time when people do strategy X, etc. If you understand the game, you'll see that the person arguing for X is correct and the person arguing for Y is wrong, and then you'll look up these people's Elos and find that the X-player is a high-ranked player and the Y-player is a low-ranked player.</p> <p>The thing that's odd about this is, how come the low-ranked players so confidently argue that their position is correct? Not only do they get per-game information indicating that they're wrong (because they often lose), they have a rating that aggregates all of their gameplay and tells them, roughly, how good they are.
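</p> <p>For readers who haven't played on a rated game site: an Elo-style rating is just a running aggregate of game results, where each win or loss shifts your rating by an amount that depends on how strong your opponent was. The exact formulas vary by site and by game, so the sketch below is the generic textbook version of Elo, not any particular site's implementation:</p> <pre><code>
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the classic Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a, rating_b, a_won, k=20):
    """Return A's new rating after one game against B."""
    actual = 1.0 if a_won else 0.0
    return rating_a + k * (actual - expected_score(rating_a, rating_b))

# A player who keeps losing will see their rating drift down, no matter
# how convinced they are that their strategy works.
r = 1300
for _ in range(10):
    r = update(r, 1600, a_won=False)
</code></pre> <p>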
Despite this rating telling them that they don't know what they're doing in the game, they're utterly convinced that they're strong players who are playing well and that they not only have good strategies, but that their strategies are good enough that they should be advising much higher rated players on how to play.</p> <p>When people correct these folks, they often get offended because they're sure that they're good and they'll say things like &quot;I'm a good [game name] player. I win a lot of games&quot;, followed by some indignation that their advice isn't taken seriously and/or huffy comments about how people who think strategy X works are all engaging in group think, even when these people are playing in the same pool of competitive online players where, if it were true that strategy X players were engaging in incorrect group think, strategy Y players would beat them and have higher ratings. And, as we noted <a href="p95-skill/">when we looked at video game skill</a>, players often express great frustration and anger at losing and not being better at the game, so it's clear that they want to do better and win. But even having a rating that pretty accurately sums up your skill displayed on your screen at all times doesn't seem to be enough to get people to realize that they're, on average, making poor decisions and could easily make better decisions by taking advice from higher-rated players instead of insisting that their losing strategies work.</p> <p>When looking at the video game Overwatch, we noted that people often overestimated their own skill and blamed teammates for losses. But in these kinds of boardgames, people are generally not playing on teams, so there's no one else to blame. And not only is there no teammate to blame, in most games, the most serious rated game format is 1v1 and not some kind of multi-player FFA, so you can't even blame a random person who's not on your team. In general, someone's rating in a 1v1 game is about as accurate a metric as you're going to get for someone's domain-specific decision making skill in any domain.</p> <p>And yet, people are extremely confident about their own skills despite their low ratings. If you look at board game strategy commentary today, almost all of it is wrong and, when you look up people's ratings, almost all of it comes from people who are low rated in every game they play, who don't appear to understand how to play any game well. Of course there's nothing inherently wrong with playing a game poorly if that's what someone enjoys. The incongruity here comes from people playing poorly, having a well-defined rating that shows that they're playing poorly, being convinced that they're playing well, and taking offence when people note that the strategies they advocate for don't work.</p> <h4 id="life-outside-of-games">Life outside of games</h4> <p>In the world, it's rare to get evidence of the quality of our decision making that's as clear as we see in sports and board games. When making an engineering decision, you almost never have data that's as clean as you do in baseball, nor do you ever have an Elo rating that can basically accurately sum up how good your past decision making is. This makes it much easier to adjust to feedback and make good decisions in sports and board games and yet, we can observe that most decision making in sports and board games is poor.
This was true basically forever in sports despite a huge amount of money being on the line, and is true in board games despite people getting quite worked up over them and seeming to care a lot.</p> <p>If we think about the general version of the baseball decision we examined, what’s happening is that decisions have probabilistic payoffs. There’s very high variance in actual outcomes (wins and losses), so it’s possible to make good decisions and not see the direct effect of them for a long time. Even if there are metrics that give us a better idea of what the “true” value of a decision is, if you’re operating in an environment where your management doesn’t believe in those metrics, you’re going to have a hard time keeping your job (or getting a job in the first place) if you want to do something radical whose value is only demonstrated by some obscure-sounding metric unless they take a chance on you for a year or two. There have been some major phase changes in what metrics are accepted, but they’ve taken decades.</p> <p>If we look at business or engineering decisions, the situation is much messier. If we look at product or infrastructure success as a “win”, there seems to be much more noise in whether or not a team gets a “win”. Moreover, unlike in baseball, the sort of play-by-play or even game data that would let someone analyze “wins” and “losses” to determine the underlying cause isn’t recorded, so it’s impossible to determine the true value of decisions. And even if the data were available, there are so many more factors that determine whether or not something is a “win” that it’s not clear if we’d be able to determine the expected value of decisions even if we had the data.</p> <p>We’ve seen that in a field where one can sit down and determine the expected value of decisions, it can take decades for this kind of analysis to influence some important decisions. If we look at fields where it’s more difficult to determine the true value of decisions, how long should we expect it to take for “good” decision making to surface? It seems like it would be a while, perhaps forever, unless there’s something about the structure of baseball and other sports that makes it particularly difficult to remove a poor decision maker and insert a better decision maker.</p> <p>One might argue that baseball is different because there are a fixed number of teams and it’s quite unusual for a new team to enter the market, but if you look at things like public clouds, operating systems, search engines, car manufacturers, etc., the situation doesn’t look that different. If anything, it appears to be much cheaper to take over a baseball team and replace management (you sometimes see baseball teams sell for roughly a billion dollars) and there are more baseball teams than there are competitive products in the markets we just discussed, at least in the U.S. One might also argue that, if you look at the structure of baseball teams, it’s clear that positions are typically not handed out based on decision-making merit and that <a href="https://thezvi.wordpress.com/2017/10/29/leaders-of-men/">other factors tend to dominate</a>, but this doesn’t seem obviously more true in baseball than in engineering fields.</p> <p>This isn’t to say that we expect obviously bad decisions everywhere. 
You might get that idea if you hung out on baseball stats nerd forums before <a href="https://www.amazon.com/gp/product/0393324818/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0393324818&amp;linkId=3ec395b394d2f8caaa30df2b2fc15d5d">Moneyball</a> was published (and for quite some time after), but if you looked at formula 1 (F1) around the same time, you’d see teams employing PhDs who are experts in economics and game theory to make sure they were making reasonable decisions. This doesn’t mean that F1 teams always make perfect decisions, but they at least avoided making decisions that interested amateurs could identify as inefficient for decades. There are some fields where competition is cutthroat and you have to do rigorous analysis to survive and there are some fields where competition is more sedate. In living memory, <a href="https://repositories.lib.utexas.edu/bitstream/handle/2152/17760/bournen.pdf?%852">there was a time when training for sports was considered ungentlemanly</a> and someone who trained with anything resembling modern training techniques would’ve had a huge advantage. Over the past decade or so, we’re seeing the same kind of shift, but for statistical techniques in baseball instead of training in various sports.</p> <p>If we want to look at the quality of decision making, it's too simplistic to say that we expect a firm to make good decisions because they're exposed to markets and there's economic value in making good decisions and people within the firm will probably be rewarded greatly if they make good decisions. You can't even tell if this is happening by asking people if they're making rigorous, data-driven decisions. If you'd asked people in baseball whether they were using data in their decisions, they would've said yes throughout the 70s and 80s. Baseball has long been known as a sport where people track all kinds of numbers and then use those numbers. It's just that people didn't backtest their predictions, let alone backtest their predictions with holdouts.</p> <p>The <a href="https://www.uky.edu/~eushe2/Pajares/Kuhn.html">paradigm shift</a> of using data effectively to drive decisions has been hitting different fields at different rates over the past few decades, both inside and outside of sports. Why this change happened in F1 before it happened in baseball comes down to a combination of the difference in incentive structure in F1 teams vs. baseball teams and the difference in <a href="wat/">institutional culture</a>. We may take a look at this in a future post, but this turns out to be a fairly complicated issue that requires a lot more background.</p> <p>Looking at the overall picture, we could view this glass as being half empty (wow, people suck at making easy decisions that they consider very important, so they must be absolutely horrible at making non-easy decisions) or the glass as being half full (wow, you can find good opportunities for improvement in many places, even in areas that people claim must be hard, where <a href="cocktail-ideas/">econ 101 reasoning</a> like &quot;they must be making the right call because they're highly incentivized&quot; could trick one into thinking that there aren't easy opportunities available).</p> <h3 id="appendix-non-idealities-in-our-baseball-analysis">Appendix: non-idealities in our baseball analysis</h3> <p>In order to make this a short blog post and not a book, there are a lot of simplifications in the approximation we discussed.
One major simplification is the idea that all runs are equivalent. This is close enough to true to be a decent approximation. But there are situations where the approximation isn’t very good, such as when it’s the 9th inning and the game is tied. In that case, a decision that increases the probability of scoring 1 run but decreases the probability of scoring multiple runs is actually the right choice.</p> <p>This is often given as a justification for a relatively late-game sac bunt. But if we look at the probability of a successful sac bunt, we see that it goes down in later innings. We didn’t talk about how the defense is set up, but defenses can set up in ways that reduce the probability of a successful sac bunt but increase the probability of success of non-bunts and vice versa. Before the last inning, this actually makes the sac bunt worse late in the game, not better! If we take all of that into account in the last inning of a tie game, whether a sac bunt is a good idea then depends on something else we haven’t discussed: the batter at the plate.</p> <p>In our simplified model, we computed the expected value in runs across all batters. But at any given time, a particular player is batting. A successful sac bunt advances runners and increases the number of outs by one. The alternative is to let the batter “swing away”, which will result in some random outcome. The better the batter, the higher the probability of an outcome that’s better than the outcome of a sac bunt. To determine the optimal decision, we not only need to know how good the current batter is but how good the subsequent batters are. One common justification for the sac bunt is that pitchers are terrible hitters and they’re not bad at sac bunting because they have so much practice doing it (because they’re terrible hitters), but it turns out that pitchers are also below average sac bunters and that the argument that we should expect pitchers to sac because they’re bad hitters doesn’t hold up if we look at the data in detail.</p> <p>Another reason to sac bunt (or bunt in general) is that the tendency to sometimes do this induces changes in defense which make non-bunt plays work better.</p> <p>A full computation should also take into account the number of balls and strikes the current batter has (a piece of state we haven’t discussed at all), as well as the speed of the batter and the players on base, the particular stadium the game is being played in, the opposing pitcher, and the quality of their defense. All of this can be done, even on a laptop -- this is all “small data” as far as computers are concerned, but walking through the analysis even for one particular decision would be substantially longer than everything in this post combined, including this disclaimer. It’s perhaps a little surprising that taking all of these non-idealities into account doesn’t overturn the general result, but it turns out that it doesn’t (it finds that there are many situations in which sac bunts have positive expected value, but that sac bunts were still heavily overused for decades).</p> <p>There’s a similar situation for intentional walks, where the non-idealities in our analysis appear to support issuing intentional walks. In particular, the two main conventional justifications for an intentional walk are</p> <ol> <li>By walking the current batter, we can set up a “force” or a “double play” (increase the probability of getting one out or two outs in one play).
If the game is tied in the last inning, putting another player on base has little downside and has the upside of increasing the probability of allowing zero runs and continuing the tie.</li> <li>By walking the current batter, we can get to the next, worse batter.</li> </ol> <p>An example situation where people apply the justification in (1) is in the <code>{1B, 3B; 2 out}</code> state. The team that’s on defense will lose if the player at <code>3B</code> advances one base. The reasoning goes, walking a player and changing the state to <code>{1B, 2B, 3B; 2 out}</code> won’t increase the probability that the player at <code>3B</code> will score and end the game if the current batter “puts the ball into play”, and putting another player on base increases the probability that the defense will be able to get an out.</p> <p>The hole in this reasoning is that the batter won’t necessarily put the ball into play. After the state is <code>{1B, 2B, 3B; 2 out}</code>, the pitcher may issue an unintentional walk, causing each runner to advance and losing the game. It turns out that being in this state doesn’t affect the probability of an unintentional walk very much. The pitcher tries very hard to avoid a walk but, at the same time, the batter tries very hard to induce a walk!</p> <p>On (2), the two situations where the justification tends to be applied are when the current player at bat is good or great, or the current player is batting just before the pitcher. Let’s look at these two separately.</p> <p>Barry Bonds’s seasons from 2001, 2002, and 2004 were some of the statistically best seasons of all time and are as extreme a case as one can find in modern baseball. If we run our same analysis and account for the quality of the players batting after Bonds, we find that it’s sometimes the correct decision for the opposing team to intentionally walk Bonds, but it was still the case that most situations do not warrant an intentional walk and that Bonds was often intentionally walked in a situation that didn’t warrant an intentional walk. In the case of a batter who is not having one of the statistically best seasons on record in modern baseball, intentional walks are even less good.</p> <p>In the case of the pitcher batting, doing the same kind of analysis as above also reveals that there are situations where an intentional walk is appropriate (not late in the game, {1B, 2B; 2 out}, when the pitcher is not a significantly above average batter for a pitcher). Even though it’s not always the wrong decision to issue an intentional walk, the intentional walk is still grossly overused.</p> <p>One might argue that the fact that our simple analysis has all of these non-idealities that could have invalidated it is a sign that decision making in baseball wasn’t so bad after all, but I don’t think that holds. A first-order approximation that someone could do in an hour or two finds that decision making seems quite bad, on average. If a team was interested in looking at data, that ought to lead them into doing a more detailed analysis that takes into account the conventional-wisdom based critiques of the obvious one-hour analysis. It appears that this wasn’t done, at least not for decades.</p> <p>The problem is that before people started running the data, all we had to go by were stories. Someone would say &quot;with 2 outs, you should walk the batter before the pitcher [in some situations] to get to the pitcher and get the guaranteed out&quot;.
Someone else might respond &quot;we obviously shouldn't do that late game because the pitcher will get subbed out for a pinch hitter, and early game, we shouldn't do it because even if it works and we get the easy out, it sets the other team up to lead off the next inning with their #1 hitter instead of an easy out&quot;. Which of these stories is the right story turns out to be an empirical question. The thing that I find most unfortunate is that, after people started running the numbers and the argument became one of stories vs. data, people persisted in sticking with the story-based argument for decades. We see the same thing in business and engineering, but it's arguably more excusable there because decisions in those areas tend to be harder to quantify. Even if you can reduce something to a simple engineering equation, someone can always argue that the engineering decision isn't what really matters and this other business concern that's hard to quantify is the most important thing.</p> <h3 id="appendix-possession">Appendix: possession</h3> <p>Something I find interesting is that statistical analysis in football, baseball, and basketball has found that teams have overwhelmingly undervalued possessions for decades. Baseball doesn't have the concept of possession per se, but if you look at being on offense as &quot;having possession&quot; and getting 3 outs as &quot;losing possession&quot;, it's quite similar.</p> <p>In football, we see that <a href="https://eml.berkeley.edu/~dromer/papers/JPE_April06.pdf">maintaining possession</a> is such a big deal that it is usually an error to punt on 4th down, but this hasn't stopped teams from punting by default basically forever. And in basketball, players who shoot a lot with a low shooting percentage were (and arguably still are) overrated.</p> <p>I don't think this is fundamental -- that possessions are as valuable as they are comes out of the rules of each game. It's arbitrary. I still find it interesting, though.</p> <h3 id="appendix-other-analysis-of-management-decisions">Appendix: other analysis of management decisions</h3> <p><a href="https://people.stanford.edu/nbloom/sites/default/files/dmm.pdf">Bloom et al., Does management matter? Evidence from India</a> looks at the impact of management interventions and the effect on productivity.</p> <p><a href="https://people.stanford.edu/nbloom/research">Other work by Bloom</a>.</p> <p><a href="http://web.stanford.edu/~gentzkow/research/uniform-pricing.pdf">DellaVigna et al., Uniform pricing in US retail chains</a> allegedly finds a significant amount of money left on the table by retail chains (seven percent of profits) and explores why that might happen and what the impacts are.</p> <p>The upside of work like this vs. sports work is that it attempts to quantify the impact of things outside of a contrived game. The downside is that the studies are on things that are quite messy and it's hard to tell what the study actually means. Just for example, if you look at studies on innovation, economists often use patents as a proxy for innovation and then come to some conclusion based on some variable vs. number of patents. But if you're familiar with engineering patents, you'll know that number of patents is an incredibly poor proxy for innovation.
In the hardware world, IBM is known for cranking out a very large number of useless patents (both in the sense of useless for innovation and also in the narrow sense of being useless as a counter-attack in patent lawsuits) and there are some companies that get much more mileage out of filing many fewer patents.</p> <p>AFAICT, our options here are to know a lot about decisions in a context that's arguably completely irrelevant, or to have ambiguous information and probably know very little about a context that seems relevant to the real world. I'd love to hear about more studies in either camp (or even better, studies that don't have either problem).</p> <p><i>Thanks to Leah Hanson, David Turner, Milosz Dan, Andrew Nichols, Justin Blank, @hoverbikes, Kate Murphy, Ben Kuhn, Patrick Collison, and an anonymous commenter for comments/corrections/discussion.</i></p> How out of date are Android devices? android-updates/ Sun, 12 Nov 2017 00:00:00 +0000 android-updates/ <p>It's common knowledge that Android devices tend to be more out of date than iOS devices, but what does this actually mean? Let’s look at Android marketshare data to see how old devices in the wild are. The x axis of the plot below is date, and the y axis is Android marketshare. The share of all devices sums to 100% (with some artifacts because the public data Google provides is low precision).</p> <p>Color indicates age:</p> <ul> <li>blue: current (API major version)</li> <li>yellow: 6 months</li> <li>orange: 1 year</li> <li>dark red: 2 years</li> <li>bright red/white: 3 years</li> <li>light grey: 4 years</li> <li>grey: 5 years</li> <li>black: 6 years or more</li> </ul> <p>If we look at the graph, we see a number of reverse-S shaped contours; between each pair of contours, devices get older as we go from left to right. Each contour corresponds to the release of a new Android version and the associated devices running that Android version. As time passes, devices on that version get older. When a device is upgraded, it’s effectively moved from one contour into a new contour and the color changes to a less outdated color.</p> <p><img src="images/android-updates/android-age.svg" width="1152" height="1152" alt="Marketshare of outdated Android devices is increasing"></p> <p>There are three major ways in which this graph understates the number of outdated devices:</p> <p>First, we’re using API version data for this and don’t have access to the marketshare of point releases and minor updates, so we assume that all devices on the same API version are up to date until the moment a new API version is released, but many (and perhaps most) devices won’t receive updates within an API version.</p> <p>Second, this graph shows marketshare, but the number of Android devices has dramatically increased over time. For example, if we look at the 80%-ile most outdated devices (i.e., draw a line 20% up from the bottom), the 80%-ile device today is a few months more outdated than it was in 2014. The huge growth of Android means that there are many, many more outdated devices now than there were in 2014.</p> <p>Third, this data comes from scraping Google Play Store marketshare info. That data shows marketshare of devices that have visited the Play Store in the last 7 days. In general, it seems reasonable to believe that devices that visit the Play Store are more up to date than devices that don’t, so we should expect an unknown amount of bias in this data that causes the graph to show that devices are newer than they actually are.
This seems plausible both for devices that are used as conventional mobile devices and for mobile devices that have replaced things like traditional embedded devices, PoS boxes, etc.</p> <p>If we're looking at this from a security standpoint, some devices will receive security updates without updating their major version, which would skew the data to make devices look more outdated than they actually are. However, when researchers have used more fine-grained data to see which devices are taking updates, <a href="https://www.cl.cam.ac.uk/~drt24/dissertation.pdf">they found that this was not a large effect</a>.</p> <p>One thing we can see from that graph is that, as time goes on, the world accumulates a larger fraction of old devices. This makes sense and we could have figured this out without looking at the data. After all, back at the beginning of 2010, Android phones couldn’t be much more than a year old, and now it’s possible to have Android devices that are nearly a decade old.</p> <p>Something that wouldn’t have been obvious without looking at the data is that the uptake of new versions seems to be slowing down -- we can see this by looking at the last few contour lines at the top right of the graph, corresponding to the most recent Android releases. These lines have a shallower slope than the contour lines for previous releases. Unfortunately, with this data alone, we can’t tell why the slope is shallower. Some possible reasons might be:</p> <ul> <li>Android growth is slowing down</li> <li>Android device turnover (device upgrade rate) is slowing down</li> <li>Fewer devices are receiving updates</li> </ul> <p>Without more data, it’s impossible to tell how much each of these is contributing to the problem. BTW, <a href="https://twitter.com/danluu/">let me know</a> if you know of a reasonable source for the active number of Android devices going back to 2010! I’d love to produce a companion graph of the total number of outdated devices.</p> <p>But even with the data we have, we can take a guess at how many outdated devices are in use. In May 2017, Google announced that there are over two billion active Android devices. If we look at the latest stats (the far right edge), we can see that nearly half of these devices are two years out of date. At this point, we should expect that there are more than one billion devices that are two years out of date! Given Android's update model, we should expect approximately 0% of those devices to ever get updated to a modern version of Android.</p> <h4 id="percentiles">Percentiles</h4> <p>Since there’s a lot going on in the graph, we might be able to see something if we look at some subparts of the graph. If we look at a single horizontal line across the graph, that corresponds to the device age at a certain percentile:</p> <p><img src="images/android-updates/android-percentile-age.svg" width="1152" height="1152" alt="Over time, the Nth percentile out of date device is getting more out of date"></p> <p>In this graph, the date is on the x axis and the age in months is on the y axis. Each line corresponds to a different percentile (higher percentile is older), which corresponds to a horizontal slice of the top graph at that percentile.</p> <p>Each individual line seems to have two large phases (with some other stuff, too). There’s one phase where devices for that percentile get older as quickly as time is passing, followed by a phase where, on average, devices only get slightly older.
In the second phase, devices sometimes get younger as new releases push younger versions into a certain percentile, but this doesn’t happen often enough to counteract the general aging of devices. Taken as a whole, this graph indicates that, if current trends continue, we should expect to see proportionally more old Android devices as time goes on, which is exactly what we’d expect from the first, busier, graph.</p> <h4 id="dates">Dates</h4> <p>Another way to look at the graph is to look at a vertical slice instead of a horizontal slice. In that case, each slice corresponds to looking at the ages of devices at one particular date:</p> <p><img src="images/android-updates/android-age-over-time.svg" width="1152" height="1152"></p> <p>In this plot, the x axis indicates the age percentile and the y axis indicates the raw age in months. Each line is one particular date, with older dates being lighter / yellower and newer dates being darker / greener.</p> <p>As with the other views of the same data, we can see that Android devices appear to be getting more out of date as time goes on. This graph would be too busy to read if we plotted data for all of the dates that are available, but we can see it as an animation:</p> <p><img src="images/android-updates/android-age.gif" width="480" height="480"></p> <h4 id="ios">iOS</h4> <p>For reference, iOS 11 was released two months ago and it now has just under 50% iOS marketshare despite November’s numbers coming before the release of the iPhone X (this is compared to &lt; 1% marketshare for the latest Android version, which was released in August). It’s overwhelmingly likely that, by the start of next year, iOS 11 will have more than 50% marketshare and there’s an outside chance that it will have 75% marketshare, i.e., it’s likely that the corresponding plot for iOS would have the 50%-ile (red) line in the second plot at age = 0 and it’s not implausible that the 75%-ile (orange) line would sometimes dip down to 0. As is the case with Android, there are some older devices that stubbornly refuse to update; iOS 9.3, released a bit over two years ago, sits at just a bit above 5% marketshare. This means that, in the iOS version of the plot, it’s plausible that we’d see the corresponding 99%-ile (green) line in the second plot at a bit over two years (half of what we see for the Android plot).</p> <h4 id="windows-xp">Windows XP</h4> <p>People sometimes compare Android to Windows XP because there are a large number of both in the wild and in both cases, most devices will not get security updates. However, this is tremendously unfair to Windows XP, which was released on 10/2001 and got security updates until 4/2014, twelve and a half years later. Additionally, Microsoft has released at least one security update after the official support period (there was an update in 5/2017 in response to the WannaCry ransomware). It's unfortunate that Microsoft decided to end support for XP while there are still so many XP boxes in the wild, but supporting an old OS for over twelve years and then issuing an emergency security patch after more than fifteen years puts Microsoft into a completely different league than Google and Apple when it comes to device support.</p> <p>Another difference between Android and Windows is that Android's scale is unprecedented in the desktop world. There were roughly 200 million PCs sold in 2017. Samsung alone has been selling that many mobile devices per year since 2008.
Of course, those weren't Android devices in 2008, but Android's dominance in the non-iOS mobile space means that, overall, those have mostly been Android devices. Today, <a href="https://twitter.com/danluu/status/864194210752344064">we still see nearly 50 year old PDP-11 devices in use</a>. There are few enough PDPs around that running into one is a cute, quaint surprise (0.6 million PDP-11s were sold). Desktop boxes age out of service more quickly than PDPs and mobile devices age out of service even more quickly, but the sheer difference in number of devices caused by the ubiquity of modern computing devices means that we're going to see many more XP-era PCs in use 50 years after the release of XP and it's plausible we'll see even more mobile devices around 50 years from now. Many of these ancient PDP, VAX, DOS, etc. boxes are basically safe because they're run in non-networked configurations, but it looks like the same thing is not going to be true for many of these old XP and Android boxes that are going to stay in service for decades.</p> <h3 id="conclusion">Conclusion</h3> <p>We’ve seen that Android devices appear to be getting more out of date over time. This makes it difficult for developers to target “new” Android API features, where new means anything introduced in the past few years. It also means that there are <em>a lot</em> of Android devices out there that are behind in terms of security. This is true both in absolute terms and also relative to iOS.</p> <p>Until recently, Android was directly tied to the hardware it ran on, making it very painful to keep old devices up to date because that requires a custom Android build with phone-specific (or at least SoC-specific) work. <a href="https://android-developers.googleblog.com/2017/05/here-comes-treble-modular-base-for.html">Google claims that this problem is fixed in the latest Android version (8.0, Oreo)</a>. People who remember Google's <a href="https://www.google.com/search?client=ubuntu&amp;hs=xVc&amp;channel=fs&amp;ei=laMJWombGJSvjwSmu5TQCA&amp;q=android+update+alliance&amp;oq=android+update+alliance&amp;gs_l=psy-ab.3..35i39k1l2.41466.41684.0.42014.2.2.0.0.0.0.111.193.1j1.2.0....0...1.1.64.psy-ab..0.2.192...0i22i30k1.0.SNY-JCWhD9M">&quot;Android update alliance&quot; announcement in 2011</a> may be a bit skeptical of the more recent announcement. In 2011, Google and U.S. carriers announced that they'd keep devices up to date for 18 months, <a href="https://www.pcmag.com/article2/0,2817,2397729,00.asp">which mostly didn't happen</a>. However, even if the current announcement isn't smoke and mirrors and the latest version of Android solves the update problem, we've seen that it takes years for Android releases to get adopted and we've also seen that the last few Android releases have significantly slower uptake than previous releases. Additionally, even though this is supposed to make updates easier, it looks like Android is still likely to stay behind iOS in terms of updates for a while. Google has promised that its latest phone (Pixel 2, 10/2017) will get updates for three years. That seems like a step in the right direction, but as we’ve seen from the graphs above, extending support by a year isn’t nearly enough to keep most Android devices up to date.
But if you have an iPhone, the latest version of iOS (released 9/2017) works on devices back to the iPhone 5S (released 9/2013).</p> <p>If we look at the newest Android release (8.0, 8/2017), it looks like you’re quite lucky if you have a two year old device that will get the latest update. The oldest “Google” phone supported is the Nexus 6P (9/2015), giving it just under two years of support.</p> <p>If you look back at devices that were released around the same time as the iPhone 5S, the situation looks even worse. Back then, I got a free Moto X for working at Google; the Moto X was about as close to an official Google phone as you could get at the time (this was back when Google owned Moto). The Moto X was released on 8/2013 (a month before the iPhone 5S) and the latest version of Android it supports is 5.1, which was released on 2/2015, a little more than a year and a half later. For an Android phone of its era, the Moto X was supported for an unusually long time. It's a good sign that things look worse as we look further back in time, but at the rate things are improving, it will be years before there's a decently supported Android device released and then years beyond those years before that Android version is in widespread use. It's possible that <a href="https://lwn.net/Articles/718267/">Fuchsia</a> will fix this, but Fuchsia is also many years away from widespread use.</p> <p><a href="input-lag/">In a future post, we'll look at Android response latency</a>, which is much higher than iPhone and iPad latency.</p> <p><i>Thanks to Leah Hanson, Kate Murphy, Daniel Thomas, Marek Majkowski, @zofrex, @Aissn, Chris Palmer, JonLuca De Caro, and an anonymous person for comments/corrections/related discussion.</i></p> <p><i>Also, thanks to Victorien Villard for making the data these graphs were based on available!</i></p> UI backwards compatibility ui-compatibility/ Thu, 09 Nov 2017 00:00:00 +0000 ui-compatibility/ <p>About once a month, an app that I regularly use will change its UI in a way that breaks muscle memory, basically tricking the user into doing things they don’t want.</p> <h3 id="zulip">Zulip</h3> <p>In recent memory, Zulip (a Slack competitor) changed its newline behavior so that <code>ctrl + enter</code> sends a message instead of inserting a new line. After this change, I sent a number of half-baked messages and it seemed like some other people did too.</p> <p>Around the time they made that change, they made another change such that a series of clicks that would cause you to send a private message to someone would instead cause you to send a private message to the alphabetically first person who was online. Most people didn’t notice that this was a change, but when I mentioned that this had happened to me a few times in the past couple weeks, multiple people immediately said that the exact same thing happened to them. Some people also mentioned that the behavior of navigation shortcut keys was changed in a way that could cause people to broadcast a message instead of sending a private message.
In both cases, some people blamed themselves and didn’t know why they’d just started making mistakes that caused them to send messages to the wrong place.</p> <h3 id="doors">Doors</h3> <p>A while back, I was at Black Seed Bagel, which has a door that <a href="https://www.google.com/search?tbm=isch&amp;source=hp&amp;biw=1739&amp;bih=1689&amp;ei=ADIFWq7SLcTPmwH8g6y4Dg&amp;q=black+seed+bagel&amp;oq=black+se&amp;gs_l=img.3.0.35i39k1l2j0l8.761.2250.0.3272.9.9.0.0.0.0.165.806.4j4.8.0....0...1.1.64.img..1.8.805.0...0.cicb-SDN7SU#imgrc=zwf_w60g8r1uWM:">looks 75% like a “push” door from both sides</a> when it’s actually a push door from the outside and a pull door from the inside. An additional clue that makes it seem even more like a &quot;push&quot; door from the inside is that most businesses have outward opening doors (this is required for exit doors in the U.S. when the room occupancy is above 50 and many businesses in smaller spaces voluntarily follow the same convention). During the course of an hour-long conversation, I saw a lot of people go in and out and my guess is that ten people failed on their first attempt to use the door while exiting. When people were travelling in pairs or groups, the person in front would often say something like “I’m dumb. We just used this door a minute ago”. But the people were not, in fact, acting dumb. If anything is dumb, it’s designing doors such that users have to memorize which doors act like “normal” doors and which doors have their cues reversed.</p> <p>If you’re interested in the physical world, <a href="https://www.amazon.com/gp/product/0465050654/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0465050654&amp;linkId=671c3ae6f7143212c71afe93c7c66951">The Design of Everyday Things</a> gives many real-world examples where users are subtly nudged into doing the wrong thing. It also discusses general principles in a way that allows you to see the general ideas and apply them to avoid the same issues when designing software.</p> <h3 id="facebook">Facebook</h3> <p>Last week, FB changed its interface so that my normal sequence of clicks to hide a story saves the story instead of hiding it. Saving is pretty much the opposite of hiding! It’s the opposite both from the perspective of the user and also as a ranking signal to the feed ranker. The really “great” thing about a change like this is that it A/B tests incredibly well if you measure new feature “engagement” by number of clicks because many users will accidentally save a story when they meant to hide it. Earlier this year, Twitter did something similar by swapping the location of “moments” and “notifications”.</p> <p>Even if the people making the change didn’t create the tricky interface in order to juice their engagement numbers, this kind of change is still problematic because it poisons analytics data. While it’s technically possible to build a model to separate out accidental clicks vs.
purposeful clicks, that’s quite rare (I don’t know of any A/B tests where people have done that) and even in cases where it’s clear that users are going to accidentally trigger an action, I still see devs and PMs justify a feature because of how great it looks on naive statistics like <a href="https://en.wikipedia.org/wiki/Daily_active_users">DAU</a>/MAU.</p> <h3 id="api-backwards-compatibility">API backwards compatibility</h3> <p>When it comes to software APIs, there’s a school of thought that says that you should never break backwards compatibility for some classes of widely used software. A well-known example is <a href="http://lkml.iu.edu/hypermail/linux/kernel/1710.3/02487.html">Linus Torvalds</a>:</p> <blockquote> <p>People should basically always feel like they can update their kernel and simply not have to worry about it.</p> <p>I refuse to introduce &quot;you can only update the kernel if you also update that other program&quot; kind of limitations. If the kernel used to work for you, the rule is that it continues to work for you. … I have seen, and can point to, lots of projects that go &quot;We need to break that use case in order to make progress&quot; or &quot;you relied on undocumented behavior, it sucks to be you&quot; or &quot;there's a better way to do what you want to do, and you have to change to that new better way&quot;, and I simply don't think that's acceptable outside of very early alpha releases that have experimental users that know what they signed up for. The kernel hasn't been in that situation for the last two decades. ... We do API breakage <em>inside</em> the kernel all the time. We will fix internal problems by saying &quot;you now need to do XYZ&quot;, but then it's about internal kernel API's, and the people who do that then also obviously have to fix up all the in-kernel users of that API. Nobody can say &quot;I now broke the API you used, and now <em>you</em> need to fix it up&quot;. Whoever broke something gets to fix it too. ... And we simply do not break user space.</p> </blockquote> <p><a href="https://blogs.msdn.microsoft.com/oldnewthing/20031223-00/?p=41373">Raymond Chen quoting Colen</a>:</p> <blockquote> <p>Look at the scenario from the customer’s standpoint. You bought programs X, Y and Z. You then upgraded to Windows XP. Your computer now crashes randomly, and program Z doesn’t work at all. You’re going to tell your friends, &quot;Don’t upgrade to Windows XP. It crashes randomly, and it’s not compatible with program Z.&quot; Are you going to debug your system to determine that program X is causing the crashes, and that program Z doesn’t work because it is using undocumented window messages? Of course not. You’re going to return the Windows XP box for a refund. (You bought programs X, Y, and Z some months ago. The 30-day return policy no longer applies to them. The only thing you can return is Windows XP.)</p> </blockquote> <p>While this school of thought is a minority, it’s a vocal minority with a lot of influence. It’s much rarer to hear this kind of case made for UI backwards compatibility. You might argue that this is fine -- people are forced to upgrade nowadays, so it doesn’t matter if stuff breaks. But even if users can’t escape, it’s still a bad user experience.</p> <p>The counterargument to this school of thought is that maintaining compatibility creates technical debt. It’s true! Just for example, Linux is full of slightly to moderately wonky APIs due to the “do not break user space” dictum. 
One example is <a href="http://man7.org/linux/man-pages/man2/recvmmsg.2.html"><code>int recvmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen, unsigned int flags, struct timespec *timeout); </code></a>. You might expect the timeout to fire if you don’t receive a packet, but the manpage reads:</p> <blockquote> <p>The <em>timeout</em> argument points to a <em>struct timespec</em> (see clock_gettime(2)) defining a timeout (seconds plus nanoseconds) for the receive operation (<em>but see BUGS!</em>).</p> </blockquote> <p>The BUGS section reads:</p> <blockquote> <p>The <em>timeout</em> argument does not work as intended. The timeout is checked only after the receipt of each datagram, so that if up to <em>vlen-1</em> datagrams are received before the timeout expires, but then no further datagrams are received, the call will block forever.</p> </blockquote> <p>This is arguably not even the worst mis-feature of <code>recvmmsg</code>, <a href="http://www.openwall.com/lists/musl/2014/06/07/5">which returns an <code>ssize_t</code> into a field of size <code>int</code></a>.</p> <p>If you have a policy like “we simply do not break user space”, this sort of technical debt sticks around forever. But it seems to me that it’s not a coincidence that the most widely used desktop, laptop, and server operating systems in the world bend over backwards to maintain backwards compatibility.</p> <p>The case for UI backwards compatibility is arguably stronger than the case for API backwards compatibility because breaking API changes can be mechanically fixed and, <a href="//danluu.com/monorepo/">with the proper environment</a>, all callers can be fixed at the same time as the API changes. There's no equivalent way to reach into people's brains and change user habits, so a breaking UI change inevitably results in pain for some users.</p> <p>The case for UI backwards compatibility is arguably weaker than the case for API backwards compatibility because API backwards compatibility has a lower cost -- if some API is problematic, you can make a new API and then document the old API as something that shouldn’t be used (you’ll see lots of these if you look at Linux syscalls). This doesn’t really work with GUIs since UI elements compete with each other for a small amount of screen real-estate. An argument that I think is underrated is that changing UIs isn’t as great as most companies seem to think -- very dated-looking UIs that haven’t been refreshed to keep up with trends can be successful (e.g., plentyoffish and craigslist). Companies can even become wildly successful without any significant UI updates, let alone UI redesigns -- a large fraction of linkedin’s rocketship growth happened in a period where the UI was basically frozen. I’m told that freezing the UI wasn’t a deliberate design decision; instead, it was a side effect of severe technical debt, and that the UI was unfrozen the moment a re-write allowed people to confidently change the UI.
Linkedin has managed to <a href="https://darkpatterns.org/">add a lot of dark patterns</a> since they unfroze their front-end, but the previous UI seemed to work just fine in terms of growth.</p> <p>Despite the success of a number of UIs which aren’t always updated to track the latest trends, at most companies, it’s basically impossible to make the case that UIs shouldn’t be arbitrarily changed without adding functionality, let alone make the case that UIs shouldn’t push out old functionality with new functionality.</p> <h3 id="ui-deprecation">UI deprecation</h3> <p>A case that might be easier to make is that shortcuts and shortcut-like UI elements can be deprecated before removal, similar to the way evolving APIs will add deprecation warnings before making breaking changes. Instead of regularly changing UIs so that users’ muscle memory is used against them and causes users to do the opposite of what they want, UIs can be changed so that doing the previously trained set of actions causes nothing to happen. For example, FB could have moved “hide post” down and inserted a no-op item in the old location, and then after people had gotten used to not clicking in the old “hide post” location for “hide post”, they could have then put “save post” in the old location for “hide post”.</p> <p>Zulip could’ve done something similar and caused the series of actions that used to let you send a private message to the person you want to cause no message to be sent instead of sending a private message to the alphabetically first person on the <code>online</code> list.</p> <p>These solutions aren’t ideal because the user still has to retrain their muscle memory on the new thing, but it’s still a lot better than the current situation, where many UIs regularly introduce arbitrary-seeming changes that sow confusion and chaos.</p> <p>In some cases (e.g., the no-op menu item), this presents a pretty strange interface to new users. Users don’t expect to see a menu item that does nothing with an arrow that says to click elsewhere on the menu instead. This can be fixed by only rolling out deprecation “warnings” to users who regularly use the old shortcut or shortcut-like path. If there are multiple changes being deprecated, this results in a combinatorial explosion of possibilities, but if you're regularly deprecating multiple independent items, that's pretty extreme and users are probably going to be confused regardless of how it's handled. Given <a href="https://twitter.com/danluu/status/891508449414197248">the amount of effort made to avoid user-hostile changes</a> and the dominance of the “move fast and break things” mindset, the case for adding this kind of complexity just to avoid giving users a bad experience probably won’t hold at most companies, but this at least seems plausible in principle.</p> <p>Breaking existing user workflows arguably doesn’t matter for an app like FB, which is relatively sticky as a result of its dominance in its area, but most applications are more like Zulip than FB. Back when Zulip and Slack were both young, Zulip messages couldn’t be edited or deleted. This was on purpose -- messages were immutable and everyone I know who suggested allowing edits was shot down because mutable messages didn’t fit into the immutable model. Back then, if there was a UI change or bug that caused users to accidentally send a public message instead of a private message, that was basically permanent.
I saw people accidentally send public messages often enough that I got into the habit of moving private message conversations to another medium. That didn’t bother me too much since I’m used to <a href="everything-is-broken/">quirky software</a>, but I know people who tried Zulip back then and, to this day, still refuse to use Zulip due to UI issues they hit back then. That’s a bit of an extreme case, but the general idea that users will tend to avoid apps that repeatedly cause them pain isn’t much of a stretch.</p> <p>In studies on user retention, it appears to be the case that an additional 500ms of page-load latency negatively impacts retention. If that's the case, it seems like switching the UI around so that the user has to spend 5s undoing an action or broadcasts a private message publicly in a way that can't be undone should have a noticeable impact on retention, although I don't know of any public studies that look at this.</p> <h3 id="conclusion">Conclusion</h3> <p>If I worked on UI, I might have some suggestions or a call to action. But as an outsider, I’m wary of making actual suggestions -- programmers seem especially prone to coming into an area they’re not familiar with and telling experts how they should solve their problems. While this occasionally works, the most likely outcome is that the outsider either re-invents something that’s been known for decades or completely misses the most important parts of the problem.</p> <p>It sure would be nice if shortcuts didn’t break so often that I spend as much time consciously stopping myself from using shortcuts as I do actually using the app. But there are probably reasons this is difficult to test/enforce. The huge number of platforms that need to be tested for robust UI testing makes testing hard even without adding this extra kind of test. And, even when we’re talking about functional correctness problems, “move fast and break things” is much trendier than “try to break relatively few things”. Since UI “correctness” often has even lower priority than functional correctness, it’s not clear how someone could successfully make a case for spending more effort on it.</p> <p>On the other hand, despite all these disclaimers, Google sometimes does the exact things described in this post. Chrome recently removed <code>backspace</code> to go backwards; if you hit <code>backspace</code>, you get a note telling you to use <code>alt+left</code> instead, and when maps moved some items around a while back, they put in no-op placeholders that pointed people to the new location. This doesn't mean that Google always does this well -- on April fools day of 2016, gmail replaced <code>send and archive</code> with <code>send and attach a gif that's offensive in some contexts</code> -- but these examples indicate that maintaining backwards compatibility through significant changes isn't just a hypothetical idea; it can and has been done.</p> <p><i> Thanks to Leah Hanson, Allie Jones, Randall Koutnik, Kevin Lynagh, David Turner, Christian Ternus, Ted Unangst, Michael Bryc, Tony Finch, Stephen Tigner, Steven McCarthy, Julia Evans, @BaudDev, and an anonymous person who has a moral objection to public acknowledgements for comments/corrections/discussion.</p> <p>If you're curious why &quot;anon&quot; is against acknowledgements, it's because they first saw these in Paul Graham's writing, whose acknowledgements are sort of a who's who of SV. anon's belief is that these sorts of lists serve as a kind of signalling.
I won't claim that's wrong, but I get a lot of help with my writing both from people reading drafts and also from the occasional helpful public internet comment and I think it's important to make it clear that this isn't a one-person effort to combat <a href="https://www.bunniestudios.com/blog/?p=5046">what Bunnie Huang calls &quot;the idol effect&quot;</a>.</p> <p>In a future post, we'll look at empirical work on how line length affects readability. I've read every study I could find, but I might be missing some. If you know of a good study you think I should include, <a href="https://twitter.com/danluu/">please let me know</a>. </i></p> Filesystem error handling filesystem-errors/ Mon, 23 Oct 2017 00:00:00 +0000 filesystem-errors/ <p>We’re going to reproduce some <a href="//danluu.com/file-consistency/">results from papers on filesystem robustness that were written up roughly a decade ago</a>: <a href="http://research.cs.wisc.edu/wind/Publications/iron-sosp05.pdf">Prabhakaran et al. SOSP 05 paper</a>, which injected errors below the filesystem and <a href="https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al. FAST 08</a>, which looked at how often filesystems failed to check return codes of functions that can return errors.</p> <p><a href="http://research.cs.wisc.edu/wind/Publications/iron-sosp05.pdf">Prabhakaran et al.</a> injected errors at the block device level (just underneath the filesystem) and found that <code>ext3</code>, <code>reiserfs</code>, <code>ntfs</code>, and <code>jfs</code> mostly handled read errors reasonably but <code>ext3</code>, <code>ntfs</code>, and <code>jfs</code> mostly ignored write errors. While the paper is interesting, someone installing Linux on a system today is much more likely to use <code>ext4</code> than any of the now-dated filesystems tested by Prabhakaran et al. We’ll try to reproduce some of the basic results from the paper on more modern filesystems like <code>ext4</code> and <code>btrfs</code>, some legacy filesystems like <code>exfat</code>, <code>ext3</code>, and <code>jfs</code>, as well as on <code>overlayfs</code>.</p> <p><a href="https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al. </a> found that errors weren’t checked most of the time. After we look at error injection on modern filesystems, we’ll look at how much (or little) filesystems have improved their error handling code.</p> <h3 id="error-injection">Error injection</h3> <p>A cartoon view of a file read might be: <code>pread syscall -&gt; OS generic filesystem code -&gt; filesystem specific code -&gt; block device code -&gt; device driver -&gt; device controller -&gt; disk</code>. Once the disk gets the request, it sends the data back up: <code>disk -&gt; device controller -&gt; device driver -&gt; block device code -&gt; filesystem specific code -&gt; OS generic filesystem code -&gt; pread</code>. We’re going to look at error injection at the block device level, right below the file system.</p> <p>Let’s look at what happened when we injected errors in 2017 vs. what Prabhakaran et al.
found in 2005.</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <table> <tr><th rowspan="3"></td><th colspan="3">2005</td><th colspan="6">2017</td></tr> <tr> <th>read</th><th>write</th><th>silent</th><th>read</th><th>write</th><th>silent</th><th>read</th><th>write</th><th>silent</th></tr> <tr><th colspan="6">file</td><th colspan="3">mmap</td></tr> <tr> <th>btrfs</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td></tr> <tr> <th>exfat</th><td colspan="3" bgcolor=gainsboro></td></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>ext3</th><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>ext4</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>fat</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>jfs</th><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>reiserfs</th><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td colspan="6" bgcolor=gainsboro></td></tr> <tr> <th>xfs</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> </table></p> <p>Each row shows results for one filesystem. <code>read</code> and <code>write</code> indicating reading and writing data, respectively, where the block device returns an error indicating that the operation failed. <code>silent</code> indicates a read failure (incorrect data) where the block device didn’t indicate an error. This could happen if there’s disk corruption, a transient read failure, or a transient write failure silently caused bad data to be written. <code>file</code> indicates that the operation was done on a file opened with <code>open</code> and <code>mmap</code> indicates that the test was done on a file mapped with <code>mmap</code>. <code>ignore</code> (red) indicates that the error was ignored, <code>prop</code> (yellow) indicates that the error was propagated and that the <code>pread</code> or <code>pwrite</code> syscall returned an error code, and <code>fix</code> (green) indicates that the error was corrected. 
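</p> <p>To make the &quot;prop&quot; case concrete, here’s a minimal sketch (not the test code actually used here -- that’s in the repo linked in the appendix -- and the file path below is made up) of how a propagated error looks to a userspace program: a failed read on a file opened with <code>open</code> comes back as <code>EIO</code> from <code>pread</code>, while a failed read through <code>mmap</code> is delivered as a <code>SIGBUS</code> signal.</p> <pre><code>/* Hypothetical illustration of how propagated block-device errors surface
   to userspace; the path below is made up. */
#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;setjmp.h&gt;
#include &lt;signal.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;unistd.h&gt;

static sigjmp_buf bus_jmp;
static void on_sigbus(int sig) { (void)sig; siglongjmp(bus_jmp, 1); }

int main(void)
{
    char buf[4096];
    int fd = open("/mnt/fserror_test/test.txt", O_RDONLY);
    if (fd &lt; 0) { perror("open"); return 1; }

    /* "prop" for a file read: pread returns -1 and errno is EIO. */
    if (pread(fd, buf, sizeof(buf), 0) &lt; 0 &amp;&amp; errno == EIO)
        puts("read error propagated (EIO)");

    /* "prop" for an mmap read: touching the bad page raises SIGBUS. */
    signal(SIGBUS, on_sigbus);
    char *p = mmap(NULL, sizeof(buf), PROT_READ, MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED) {
        if (sigsetjmp(bus_jmp, 1) == 0)
            printf("first byte: %d\n", p[0]);
        else
            puts("mmap read error propagated (SIGBUS)");
        munmap(p, sizeof(buf));
    }
    close(fd);
    return 0;
}
</code></pre> <p>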
No errors were corrected. Grey entries indicate configurations that weren’t tested.</p> <p>From the table, we can see that, in 2005, <code>ext3</code> and <code>jfs</code> ignored write errors even when the block device indicated that the write failed and that things have improved, and that any filesystem you’re likely to use will correctly tell you that a write failed. <code>jfs</code> hasn’t improved, but <code>jfs</code> is now rarely used outside of legacy installations.</p> <p>No tested filesystem other than <code>btrfs</code> handled silent failures correctly. The other filesystems tested neither duplicate nor checksum data, making it impossible for them to detect silent failures. <code>zfs</code> would probably also handle silent failures correctly but wasn’t tested. <code>apfs</code>, despite post-dating <code>btrfs</code> and <code>zfs</code>, made the explicit decision to not checksum data and silently fail on silent block device errors. We’ll discuss this more later.</p> <p>In all cases tested where errors were propagated, file reads and writes returned <code>EIO</code> from <code>pread</code> or <code>pwrite</code>, respectively; <code>mmap</code> reads and writes caused the process to receive a <code>SIGBUS</code> signal.</p> <p>The 2017 tests above used an 8k file where the first block that contained file data either returned an error at the block device level or was corrupted, depending on the test. The table below tests the same thing, but with a 445 byte file instead of an 8k file. The choice of 445 was arbitrary.</p> <table> <tr><th rowspan="3"></td><th colspan="3">2005</td><th colspan="6">2017</td></tr> <tr> <th>read</th><th>write</th><th>silent</th><th>read</th><th>write</th><th>silent</th><th>read</th><th>write</th><th>silent</th></tr> <tr><th colspan="6">file</td><th colspan="3">mmap</td></tr> <tr> <th>btrfs</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=green>fix</td><td bgcolor=green>fix</td><td bgcolor=green>fix</td><td bgcolor=green>fix</td><td bgcolor=green>fix</td><td bgcolor=green>fix</td></tr> <tr> <th>exfat</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>ext3</th><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>ext4</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>fat</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>jfs</th><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> <tr> <th>reiserfs</th><td 
bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td colspan="6" bgcolor=gainsboro></td></tr> <tr> <th>xfs</th><td colspan="3" bgcolor=gainsboro></td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=palegoldenrod>prop</td><td bgcolor=crimson>ignore</td></tr> </table> <p>In the small file test table, all the results are the same, except for <code>btrfs</code>, which returns correct data in every case tested. What’s happening here is that the filesystem was created on a rotational disk and, by default, <code>btrfs</code> duplicates filesystem metadata on rotational disks (it can be configured to do so on SSDs, but that’s not the default). Since the file was tiny, <code>btrfs</code> packed the file into the metadata and the file was duplicated along with the metadata, allowing the filesystem to fix the error when one block either returned bad data or reported a failure.</p> <h4 id="overlay">Overlay</h4> <p><a href="https://en.wikipedia.org/wiki/OverlayFS">Overlayfs</a> allows one file system to be “overlaid” on another. <a href="https://github.com/torvalds/linux/commit/e9be9d5e76e34872f0c37d72e25bc27fe9e2c54c">As explained in the initial commit</a>, one use case might be to put an (upper) read-write directory tree on top of a (lower) read-only directory tree, where all modifications go to the upper, writable layer.</p> <p>Although not listed on the tables, we also tested every filesystem other than <code>fat</code> as the lower filesystem with overlayfs (ext4 was the upper filesystem for all tests). Every filesystem tested showed the same results when used as the bottom layer in <code>overlay</code> as when used alone. <code>fat</code> wasn’t tested because mounting <code>fat</code> resulted in a <code>filesystem not supported</code> error.</p> <h4 id="error-correction">Error correction</h4> <p><code>btrfs</code> doesn’t, by default, duplicate metadata on SSDs because the developers believe that redundancy wouldn’t provide protection against errors on SSD (which is the same reason <code>apfs</code> doesn’t have redundancy). SSDs do a kind of write coalescing, which is likely to cause writes which happen consecutively to fall into the same block. If that block has a total failure, the redundant copies would all be lost, so redundancy doesn’t provide as much protection against failure as it would on a rotational drive.</p> <p>I’m not sure that this means that redundancy wouldn’t help -- individual flash cells degrade with operation and lose charge as they age. SSDs have built-in <a href="https://en.wikipedia.org/wiki/Wear_leveling">wear-leveling</a> and <a href="https://en.wikipedia.org/wiki/Low-density_parity-check_code">error-correction</a> that’s designed to reduce the probability that a block returns bad data, but over time, some blocks will develop so many errors that the error-correction won’t be able to fix the error and the block will return bad data. In that case, a read should return some bad bits along with mostly good bits. AFAICT, the publicly available data on SSD error rates seems to line up with this view.</p> <h4 id="error-detection">Error detection</h4> <p>Relatedly, it appears that <a href="http://dtrace.org/blogs/ahl/2016/06/19/apfs-part5/"><code>apfs</code> doesn’t checksum data because “[apfs] engineers contend that Apple devices basically don’t return bogus data”</a>.
Publicly available studies on SSD reliability have not found that there’s a model that doesn’t sometimes return bad data. It’s a common conception that SSDs are less likely to return bad data than rotational disks, but when Google studied this across their drives, they found:</p> <blockquote> <p>The annual replacement rates of hard disk drives have previously been reported to be 2-9% [19,20], which is high compared to the 4-10% of flash drives we see being replaced in a 4 year period. However, flash drives are less attractive when it comes to their error rates. More than 20% of flash drives develop uncorrectable errors in a four year period, 30-80% develop bad blocks and 2-7% of them develop bad chips. In comparison, previous work [1] on HDDs reports that only 3.5% of disks in a large population developed bad sectors in a 32 months period – a low number when taking into account that the number of sectors on a hard disk is orders of magnitudes larger than the number of either blocks or chips on a solid state drive, and that sectors are smaller than blocks, so a failure is less severe.</p> </blockquote> <p>While there is one sense in which SSDs are more reliable than rotational disks, there’s also a sense in which they appear to be less reliable. It’s not impossible that Apple uses some kind of custom firmware on its drive that devotes more bits to error correction than you can get in publicly available disks, but even if that’s the case, you might plug a non-apple drive into your apple computer and want some kind of protection against data corruption.</p> <h3 id="internal-error-handling">Internal error handling</h3> <p>Now that we’ve reproduced some tests from Prabhakaran et al., we’re going to move on to <a href="https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al.</a>. Since the paper is fairly involved, we’re just going to look at one small part of the paper, the part where they examined three function calls, <code>filemap_fdatawait</code>, <code>filemap_fdatawrite</code>, and <code>sync_blockdev</code>, to see how often errors weren’t checked for these functions.</p> <p>Their justification for looking at these functions is given as:</p> <blockquote> <p>As discussed in Section 3.1, a function could return more than one error code at the same time, and checking only one of them suffices. However, if we know that a certain function only returns a single error code and yet the caller does not save the return value properly, then we would know that such call is really a flaw. To find real flaws in the file system code, we examined three important functions that we know only return single error codes: sync_blockdev, filemap_fdatawrite, and filemap_fdatawait. A file system that does not check the returned error codes from these functions would obviously let failures go unnoticed in the upper layers.</p> </blockquote> <p>Ignoring errors from these functions appears to have fairly serious consequences. The documentation for <code>filemap_fdatawait</code> says:</p> <blockquote> <p>filemap_fdatawait — wait for all under-writeback pages to complete ... Walk the list of under-writeback pages of the given address space and wait for all of them. Check error status of the address space and return it.
Since the error status of the address space is cleared by this function, callers are responsible for checking the return value and handling and/or reporting the error.</p> </blockquote> <p>The comment next to the code for <code>sync_blockdev</code> reads:</p> <blockquote> <p>Write out and wait upon all the dirty data associated with a block device via its mapping. Does not take the superblock lock.</p> </blockquote> <p>In both of these cases, it appears that ignoring the error code could mean that data would fail to get written to disk without notifying the writer that the data wasn’t actually written.</p> <p>Let’s look at how often calls to these functions didn’t completely ignore the error code:</p> <table> <thead> <tr> <th>fn</th> <th>2008</th> <th>'08 %</th> <th>2017</th> <th>'17 %</th> </tr> </thead> <tbody> <tr> <td>filemap_fdatawait</td> <td>7 / 29</td> <td>24</td> <td>12 / 17</td> <td>71</td> </tr> <tr> <td>filemap_fdatawrite</td> <td>17 / 47</td> <td>36</td> <td>13 / 22</td> <td>59</td> </tr> <tr> <td>sync_blockdev</td> <td>6 / 21</td> <td>29</td> <td>7 / 23</td> <td>30</td> </tr> </tbody> </table> <p>This table is for all code in linux under <code>fs</code>. Each row shows data for calls of one function. For each year, the leftmost cell shows the number of calls that do something with the return value over the total number of calls. The cell to the right shows the percentage of calls that do something with the return value. “Do something” is used very loosely here -- branching on the return value and then failing to handle the error in either branch, returning the return value and having the caller fail to handle the return value, as well as saving the return value and then ignoring it are all considered doing something for the purposes of this table.</p> <p>For example, Gunawi et al. noted that <code>cifs/transport.c</code> had</p> <pre><code>int SendReceive () {
    int rc;
    rc = cifs_sign_smb();
    // ...
    rc = smb_send();
}
</code></pre> <p>Although <code>cifs_sign_smb</code> returned an error code, it was never checked before being overwritten by <code>smb_send</code>, which counted as being used for our purposes even though the error wasn’t handled.</p> <p>Overall, the table appears to show that many more errors are handled now than were handled in 2008 when Gunawi et al. did their analysis, but it’s hard to say what this means from looking at the raw numbers because it might be ok for some errors not to be handled and different lines of code are executed with different probabilities.</p> <h3 id="conclusion">Conclusion</h3> <p>Filesystem error handling seems to have improved. Reporting an error on a <code>pwrite</code> if the block device reports an error is perhaps the most basic error propagation a robust filesystem should do; few filesystems reported that error correctly in 2005. Today, if there are no complicating factors, most filesystems will correctly report an error when the simplest possible error condition that doesn’t involve the entire drive being dead occurs.</p> <p>Most filesystems don’t have checksums for data and leave error detection and correction up to userspace software. When I talk to server-side devs at big companies, their answer is usually something like “who cares? All of our file accesses go through a library that checksums things anyway and redundancy across machines and datacenters takes care of failures, so we only need error detection and not correction”.
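</p> <p>The kind of wrapper those devs are describing doesn’t have to be fancy. Here’s a minimal sketch -- not any particular company’s library; the record format and function names are made up, and it leans on zlib’s <code>crc32</code> -- that appends a checksum to each record on write and verifies it on read. As they say, this gives you detection, not correction: recovering the data is left to redundancy elsewhere.</p> <pre><code>/* Hypothetical userspace checksumming wrapper (illustration only).
   Record layout: [u32 length][payload][u32 crc32 of payload]. Link with -lz. */
#include &lt;stdint.h&gt;
#include &lt;unistd.h&gt;
#include &lt;zlib.h&gt;

int checked_write(int fd, const void *buf, uint32_t len)
{
    uint32_t crc = crc32(0L, buf, len);
    if (write(fd, &amp;len, sizeof len) != sizeof len) return -1;
    if (write(fd, buf, len) != (ssize_t)len) return -1;
    if (write(fd, &amp;crc, sizeof crc) != sizeof crc) return -1;
    return 0;
}

/* Returns the payload length, or -1 on a short read or checksum mismatch.
   A mismatch only tells you the data is bad; fixing it is up to a replica
   on some other disk or machine. */
ssize_t checked_read(int fd, void *buf, uint32_t max)
{
    uint32_t len, crc;
    if (read(fd, &amp;len, sizeof len) != sizeof len || len &gt; max) return -1;
    if (read(fd, buf, len) != (ssize_t)len) return -1;
    if (read(fd, &amp;crc, sizeof crc) != sizeof crc) return -1;
    return crc32(0L, buf, len) == crc ? (ssize_t)len : -1;
}
</code></pre> <p>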
While that’s true for developers at certain big companies, there’s a lot of software out there that isn’t written robustly and just assumes that filesystems and disks don’t have errors.</p> <p><i>This was a joint project with Wesley Aptekar-Cassels; the vast majority of the work for the project was done while pair programming at <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>. We also got </i>a lot<i> of help from Kate Murphy. Both Wesley (w.aptekar@gmail.com) and Kate (hello@kate.io) are looking for work. They’re great and I highly recommend talking to them if you’re hiring!</i></p> <h3 id="appendix-error-handling-in-c">Appendix: error handling in C</h3> <p>A fair amount of effort has been applied to get error handling right. But C makes it very easy to get things wrong, even when you apply a fair amount of effort and even apply extra tooling. One example of this in the code is the <code>submit_one_bio</code> function. If you look at the definition, you can see that it’s annotated with <code>__must_check</code>, which will cause a compiler warning when the result is ignored. But if you look at calls of <code>submit_one_bio</code>, you’ll see that its callers aren’t annotated and can ignore errors. If you dig around enough you’ll find one path of error propagation that looks like:</p> <pre><code>submit_one_bio
submit_extent_page
__extent_writepage
extent_write_full_page
write_cache_pages
generic_writepages
do_writepages
__filemap_fdatawrite_range
__filemap_fdatawrite
filemap_fdatawrite
</code></pre> <p>Nine levels removed from <code>submit_one_bio</code>, we see our old friend, <code>filemap_fdatawrite</code>, which we know often doesn’t get checked for errors.</p> <p>There's a very old debate over how to prevent things like this from accidentally happening. One school of thought, which I'll call the Uncle Bob (UB) school, believes that <a href="https://twitter.com/danluu/status/916315156199636992">we can't fix these kinds of issues with tools or processes and simply need to be better programmers in order to avoid bugs</a>. You'll often hear people of the UB school say things like, &quot;you can't get rid of all bugs with better tools (or processes)&quot;. In his famous and well-regarded talk, <a href="https://github.com/matthiasn/talk-transcripts/blob/master/Hickey_Rich/SimpleMadeEasy.md">Simple Made Easy</a>, Rich Hickey says</p> <blockquote> <p>What's true of every bug found in the field?</p> <p>[Audience reply: Someone wrote it?] [Audience reply: It got written.]</p> <p>It got written. Yes. What's a more interesting fact about it? It passed the type checker.</p> <p>[Audience laughter]</p> <p>What else did it do?</p> <p>[Audience reply: (Indiscernible)]</p> <p>It passed all the tests. Okay. So now what do you do? Right? I think we're in this world I'd like to call guardrail programming. Right? It's really sad. We're like: I can make change because I have tests. Who does that? Who drives their car around banging against the guardrail saying, &quot;Whoa! I'm glad I've got these guardrails because I'd never make it to the show on time.&quot;</p> <p>[Audience laughter]</p> </blockquote> <p>If you watch the talk, Rich uses &quot;simplicity&quot; the way Uncle Bob uses &quot;discipline&quot;. The way these statements are used, they're roughly equivalent to Ken Thompson saying &quot;<a href="https://twitter.com/danluu/status/885214004649615360">Bugs are bugs. You write code with bugs because you do</a>&quot;.
The UB school throws tools and processes under the bus, saying that it's unsafe to rely solely on tools or processes.</p> <p>Rich's rhetorical trick is brilliant -- I've heard that line quoted tens of times since the talk to argue against tests or tools or types. But, like guardrails, most tools and processes aren't about eliminating all bugs; they're about reducing the severity or probability of bugs. If we look at this particular function call, we can see that a static analysis tool failed to find this bug. Does that mean that we should give up on static analysis tools? A static analysis tool could look for all calls of <code>submit_one_bio</code> and show you the cases where the error is propagated up N levels only to be dropped. Gunawi et al. did exactly that and found <em>a lot</em> of bugs. A person basically can't do the same thing without tooling. They could try, but people are lucky if they get 95% accuracy when manually digging through things like this. The sheer volume of code guarantees that a human doing this by hand would make mistakes.</p> <p>Even better than a static analysis tool would be a language that makes it harder to accidentally forget about checking for an error. One of the issues here is that it's sometimes valid to drop an error. There are a number of places where there's no interface that allows an error to get propagated out of the filesystem, making it correct to drop the error, modulo changing the interface. In the current situation, as an outsider reading the code, if you look at a bunch of calls that drop errors, it's very hard to say, for all of them, which of those is a bug and which of those is correct. If the default is that we have a kind of guardrail that says &quot;this error must be checked&quot;, people can still incorrectly ignore errors, but you at least get an annotation that the omission was on purpose. For example, if you're forced to specifically write code that indicates that you're ignoring an error, then in code that's intended to be robust, like filesystem code, code that drops an error on purpose is relatively likely to be accompanied by a comment explaining why the error was dropped.</p> <h3 id="appendix-why-wasn-t-this-done-earlier">Appendix: why wasn't this done earlier?</h3> <p>After all, it would be nice if we knew if modern filesystems could do basic tasks correctly. Filesystem developers probably know this stuff, but since I don't <a href="http://lkml.iu.edu/hypermail/linux/kernel/0908.3/01481.html">follow LKML</a>, I had no idea whether or not things had improved since 2005 until we ran the experiment.</p> <p>The papers we looked at here came out of Andrea and Remzi Arpaci-Dusseau's research lab. Remzi has a talk where he mentioned that grad students don't want to reproduce and update old work. That's entirely reasonable, given the incentives they face. And I don't mean to pick on academia here -- this work came out of academia, not industry. It's possible this kind of work simply wouldn't have happened if not for the academic incentive system.</p> <p>In general, it seems to be quite difficult to fund work on correctness. There are a fair number of papers on new ways to find bugs, but there's relatively little work on applying existing techniques to existing code. In academia, that seems to be hard to get a good publication out of; in the open source world, that seems to be less interesting to people than writing new code.
That's also entirely reasonable -- people should work on what they want, and even if they enjoy working on correctness, that's probably not a great career decision in general. I was at the <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> career fair the other night and my badge said I was interested in testing. The first person who chatted me up opened with &quot;do you work in QA?&quot;. Back when I worked in hardware, that wouldn't have been a red flag, but in software, &quot;QA&quot; is code for a low-skill, tedious, and poorly paid job. Much of industry considers testing and QA to be an afterthought. As a result, open source projects that companies rely on are often <a href="https://en.wikipedia.org/wiki/OpenSSL">woefully underfunded</a>. Google funds some great work (like afl-fuzz), but that's the exception and not the rule, even within Google, and most companies don't fund any open source work. The work in this post was done by a few people who are intentionally temporarily unemployed, which isn't really a scalable model.</p> <p>Occasionally, you'll see someone spend a lot of effort on improving correctness, but that's usually done as a massive amount of free labor. Kyle Kingsbury might be the canonical example of this -- my understanding is that he worked on the <a href="http://jepsen.io/">Jepsen distributed systems testing tool</a> on nights and weekends for years before turning that into a consulting business. It's great that he did that -- he showed that almost every open source distributed system had serious data loss or corruption bugs. I think that's great, but stories about heroic effort like that always worry me because heroism doesn't scale. If Kyle hadn't come along, would most of the bugs that he and his tool found still plague open source distributed systems today? That's a scary thought.</p> <p>If I knew how to fund more work on correctness, I'd try to convince you that we should switch to this new model, but I don't know of a funding model that works. I've set up a <a href="https://www.patreon.com/danluu">patreon (donation account)</a>, but it would be quite extraordinary if that was sufficient to actually fund a significant amount of work. If you look at how much programmers make off of donations, if I made two orders of magnitude less than I could if I took a job in industry, that would already put me in the top 1% of programmers on patreon. If I made one order of magnitude less than I'd make in industry, that would be extraordinary. Off the top of my head, the only programmers who make more than that off of patreon either make something with much broader appeal (like games) or are Evan You, who makes one of the most widely used front-end libraries in existence. And if I actually made as much as I can make in industry, I suspect that would make me the highest-grossing programmer on patreon, even though, by industry standards, my compensation hasn't been anything special.</p> <p>If I had to guess, I'd say that part of the reason it's hard to fund this kind of work is that consumers don't incentivize companies to fund this sort of work. If you look at &quot;big&quot; tech companies, two of them are substantially more serious about correctness than their competitors. This results in many fewer horror stories about lost emails and documents as well as lost entire accounts. If you look at the impact on consumers, it might be something like the difference between 1% of people seeing lost/corrupt emails vs. 0.001%.
I think that's pretty significant if you multiply that cost across all consumers, but the vast majority of consumers aren't going to make decisions based on that kind of difference. If you look at an area where correctness problems are much more apparent, like databases or backups, you'll find that even the worst solutions have defenders who will pop into any discussions and say &quot;works for me&quot;. A backup solution that works 90% of the time is quite bad, but if you have one that works 90% of the time, it will still have staunch defenders who drop into discussions to say things like &quot;I've restored from backup three times and it's never failed! You must be making stuff up!&quot;. I don't blame companies for rationally responding to consumers, but I do think that the result is unfortunate for consumers.</p> <p>Just as an aside, one of the great wonders of doing open work for free is that the more free work you do, the more people complain that you didn't do enough free work. As David MacIver has said, <a href="https://www.youtube.com/watch?v=2XCDm3MoXUY">doing open source work is like doing normal paid work, except that you get paid in complaints instead of cash</a>. It's basically guaranteed that the most common comment on this post, for all time, will be that we didn't test someone's pet filesystem because we're <code>btrfs</code> shills or just plain lazy, even though we include a link to a repo that lets anyone add tests as they please. Pretty much every time I've done any kind of free experimental work, people who obviously haven't read the experimental setup or the source code complain that the experiment couldn't possibly be right because of [thing that isn't true that anyone could see by looking at the setup] and that it's absolutely inexcusable that I didn't run the experiment on the exact pet thing they wanted to see. Having played video games competitively in the distant past, I'm used to much more intense internet trash talk, but in general, this incentive system seems to be backwards.</p> <h3 id="appendix-experimental-setup">Appendix: experimental setup</h3> <p>For the error injection tests, a high-level view of the experimental setup is that <a href="https://stackoverflow.com/q/1870696/334816"><code>dmsetup</code></a> was used to simulate bad blocks on the disk.</p> <p>A list of the commands run looks something like:</p> <pre><code>cp images/btrfs.img.gz /tmp/tmpeas9efr6.gz
gunzip -f /tmp/tmpeas9efr6.gz
losetup -f
losetup /dev/loop19 /tmp/tmpeas9efr6
blockdev --getsize /dev/loop19
# dmsetup table: everything maps straight through to the loop device,
# except sector 74078, which returns I/O errors
# 0 74078 linear /dev/loop19 0
# 74078 1 error
# 74079 160296 linear /dev/loop19 74079
dmsetup create fserror_test_1508727591.4736078
mount /dev/mapper/fserror_test_1508727591.4736078 /mnt/fserror_test_1508727591.4736078/
mount -t overlay -o lowerdir=/mnt/fserror_test_1508727591.4736078/,upperdir=/tmp/tmp4qpgdn7f,workdir=/tmp/tmp0jn83rlr overlay /tmp/tmpeuot7zgu/
./mmap_read /tmp/tmpeuot7zgu/test.txt
umount /tmp/tmpeuot7zgu/
rm -rf /tmp/tmp4qpgdn7f
rm -rf /tmp/tmp0jn83rlr
umount /mnt/fserror_test_1508727591.4736078/
dmsetup remove fserror_test_1508727591.4736078
losetup -d /dev/loop19
rm /tmp/tmpeas9efr6
</code></pre> <p><a href="https://github.com/danluu/fs-errors/">See this github repo for the exact set of commands run to execute tests</a>.</p> <p>Note that all of these tests were done on linux, so <code>fat</code> means the linux <code>fat</code> implementation, not the windows <code>fat</code> implementation.
<code>zfs</code> and <code>reiserfs</code> weren’t tested because they couldn’t be trivially tested in the exact same way that we tested other filesystems (one of us spent an hour or two trying to get <code>zfs</code> to work, but its configuration interface is inconsistent with all of the filesystems tested; <code>reiserfs</code> appears to have a consistent interface but testing it requires doing extra work for a filesystem that appears to be dead). <code>ext3</code> support is now provided by the <code>ext4</code> code, so what <code>ext3</code> means now is different from what it meant in 2005.</p> <p>All tests were run on both ubuntu 17.04, 4.10.0-37, as well as on arch, 4.12.8-2. We got the same results on both machines. All filesystems were configured with default settings. For <code>btrfs</code>, this meant duplicated metadata without duplicated data and, as far as we know, the settings wouldn't have made a difference for other filesystems.</p> <p>The second part of this doesn’t have much experimental setup to speak of. The setup was to grep the linux source code for the relevant functions.</p> <p><em>Thanks to Leah Hanson, David Wragg, Ben Kuhn, Wesley Aptekar-Cassels, Joel Borggrén-Franck, Yuri Vishnevsky, and Dan Puttick for comments/corrections on this post.</em></p> Keyboard latency keyboard-latency/ Mon, 16 Oct 2017 00:00:00 +0000 keyboard-latency/ <p>If you look at “gaming&quot; keyboards, a lot of them sell for $100 or more on the promise that they’re fast. Ad copy that you’ll see includes:</p> <ul> <li>a custom designed keycap that has been made shorter to reduce the time it takes for your actions to register</li> <li>8x FASTER - Polling Rate of 1000Hz: Response time 0.1 milliseconds</li> <li>Wield the ultimate performance advantage over your opponents with light operation 45g key switches and an actuation 40% faster than standard Cherry MX Red switches</li> <li>World's Fastest Ultra Polling 1000Hz</li> <li>World's Fastest Gaming Keyboard, 1000Hz Polling Rate, 0.001 Second Response Time</li> </ul> <p>Despite all of these claims, I can only find <a href="http://forums.blurbusters.com/viewtopic.php?f=10&amp;t=1836">one person who’s publicly benchmarked keyboard latency</a> and they only tested two keyboards. In general, my belief is that if someone makes performance claims without benchmarks, the claims probably aren’t true, just like how code that isn’t tested (or otherwise verified) should be assumed broken.</p> <p>The situation with gaming keyboards reminds me a lot of talking to car salesmen:</p> <blockquote> <p>Salesman: this car is super safe! It has 12 airbags! Me: that’s nice, but how does it fare in crash tests? 
Salesman: 12 airbags!</p> </blockquote> <p>Sure, gaming keyboards have 1000Hz polling, but so what?</p> <p>Two obvious questions are:</p> <ol> <li>Does keyboard latency matter?</li> <li>Are gaming keyboards actually quicker than other keyboards?</li> </ol> <h3 id="does-keyboard-latency-matter">Does keyboard latency matter?</h3> <p>A year ago, if you’d asked me if I was going to build a custom setup to measure keyboard latency, I would have said that’s silly, and yet here I am, measuring keyboard latency with a <a href="https://www.amazon.com/gp/product/B074TVSLN1/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B074TVSLN1&amp;linkId=9377ad585ac27e528ee655cc01b48cef">logic analyzer</a>.</p> <p>It all started because <a href="//danluu.com/input-lag/">I had this feeling that some old computers feel much more responsive than modern machines</a>. For example, an iMac G4 running macOS 9 or an Apple 2 both feel quicker than my <a href="https://www.amazon.com/gp/product/B01MXSI216/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01MXSI216&amp;linkId=ce7b5129c29c9c012b53e5402ae33a18">4.2 GHz Kaby Lake</a> system. I never trust feelings like this because there’s decades of research showing that users often have feelings that are the literal opposite of reality, so I got a high-speed camera and started measuring actual keypress-to-screen-update latency as well as mouse-move-to-screen-update latency. It turns out the machines that feel quick are actually quick, much quicker than my modern computer -- computers from the 70s and 80s commonly have keypress-to-screen-update latencies in the 30ms to 50ms range out of the box, whereas modern computers are often in the 100ms to 200ms range when you press a key in a terminal. It’s possible to get down to the 50ms range in well optimized games with a fancy gaming setup, and there’s one extraordinary consumer device that can easily get below 50ms, but the default experience is much slower. Modern computers have much better <a href="https://stackoverflow.com/a/39187441/334816">throughput, but their latency</a> isn’t so great.</p> <p>Anyway, at the time I did these measurements, my 4.2 GHz kaby lake had the fastest single-threaded performance of any machine you could buy but had worse latency than a quick machine from the 70s (roughly 6x worse than an Apple 2), which seems a bit curious. To figure out where the latency comes from, I started measuring keyboard latency because that’s the first part of the pipeline. My plan was to look at the end-to-end pipeline and start at the beginning, ruling out keyboard latency as a real source of latency. But it turns out keyboard latency is significant! I was surprised to find that the median keyboard I tested has more latency than the entire end-to-end pipeline of the Apple 2. If this doesn’t immediately strike you as absurd, consider that an Apple 2 has 3500 transistors running at 1MHz and an Atmel employee estimates that the core used in a number of high-end keyboards today has <a href="https://www.embeddedrelated.com/showthread/comp.arch.embedded/14362-1.php">80k transistors</a> running at 16MHz. That's 20x the transistors running at 16x the clock speed -- keyboards are often more powerful than entire computers from the 70s and 80s!
And yet, the median keyboard today adds as much latency as the entire end-to-end pipeline as a fast machine from the 70s.</p> <p>Let’s look at the measured keypress-to-USB latency on some keyboards:</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <div style="overflow-x:auto;"></p> <table> <thead> <tr> <th>keyboard</th> <th>latency<br>(ms)</th> <th>connection</th> <th>gaming</th> </tr> </thead> <tbody> <tr> <td><a href="https://www.amazon.com/gp/product/B016QO64FI/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B016QO64FI&amp;linkId=37ca409628c850d85d4b8649699cae9b">apple magic</a> (usb)</td> <td>15</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B0000U1DJ2/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B0000U1DJ2&amp;linkId=20d58842cee7c22d03e405c59cb1d2b9">hhkb lite 2</a></td> <td>20</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B004XMAL00/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B004XMAL00&amp;linkId=82939c8d37d45d1a0a8a77d22eb3ac77">MS natural 4000</a></td> <td>20</td> <td>USB</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B008PFABI8/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B008PFABI8&amp;linkId=3d1945a4fa3cfb25d6f7de0278fbbdfc">das</a> 3</td> <td>25</td> <td>USB</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B0086G2YE0/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B0086G2YE0&amp;linkId=0772fce67ce00f03760bbd53b179e355">logitech k120</a></td> <td>30</td> <td>USB</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B01M29PYF4/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01M29PYF4&amp;linkId=d885df4dca893dd7deaf36c9d0dd3213">unicomp model M</a></td> <td>30</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B00OFM51L2/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B00OFM51L2&amp;linkId=96bf16aaba80391dd38bfbb763d02bcc">pok3r vortex</a></td> <td>30</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B004WOF7S0/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B004WOF7S0&amp;linkId=3dc1bc2fd4fa06be77f90a73f4992911">filco majestouch</a></td> <td>30</td> <td>USB</td> <td></td> </tr> <tr> <td>dell OEM</td> <td>30</td> <td>USB</td> <td></td> </tr> <tr> <td>powerspec OEM</td> <td>30</td> <td>USB</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B00CMALD3E/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B00CMALD3E&amp;linkId=8873c99f91b35e17f61d8510fbb50a39">kinesis freestyle 2</a></td> <td>30</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a 
href="https://www.amazon.com/gp/product/B01D2TSGL0/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01D2TSGL0&amp;linkId=966405c7866aff4b1e24e1dbfacd5035">chinfai silicone</a></td> <td>35</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B01LVTI3TO/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01LVTI3TO&amp;linkId=b1f13ca07d80d302103bb0c313a994d9">razer ornata chroma</a></td> <td>35</td> <td>USB FS</td> <td>Yes</td> </tr> <tr> <td><a href="https://olkb.com/planck/">olkb planck rev 4</a></td> <td>40</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B01CJ4LJ68/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01CJ4LJ68&amp;linkId=7ce482c03631b402385c5b0bb5fca221">ergodox</a></td> <td>40</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B014W20C90/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B014W20C90&amp;linkId=a72b89b476d857d4df4389bb7995bc8a">MS comfort 5000</a></td> <td>40</td> <td>wireless</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B01DBJTZU2/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01DBJTZU2&amp;linkId=fb0a73e976666bebd9c1e6a5657e83b9">easterntimes i500</a></td> <td>50</td> <td>USB FS</td> <td>Yes</td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B01K32MU2A/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01K32MU2A&amp;linkId=99b7e5c9a826b61a0f30f43210e21d38">kinesis advantage</a></td> <td>50</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B003YGVDDU/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B003YGVDDU&amp;linkId=c8c51873d132176213b864de76f31dbe">genius luxemate i200</a></td> <td>55</td> <td>USB</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B00DGJALYW/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B00DGJALYW&amp;linkId=96a1e65f5e4dee9f5d20306048c1dd0c">topre type heaven</a></td> <td>55</td> <td>USB FS</td> <td></td> </tr> <tr> <td><a href="https://www.amazon.com/gp/product/B00L1Y11D4/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B00L1Y11D4&amp;linkId=8a84bfe004ab161c183e26c35f5966c3">logitech k360</a></td> <td>60</td> <td>&quot;unifying&quot;</td> <td></td> </tr> </tbody> </table> <p>The latency measurements are the time from when the key starts moving to the time when the <a href="https://docs.mbed.com/docs/ble-hid/en/latest/api/md_doc_HID.html">USB packet associated with the key</a> makes it out onto the USB bus. Numbers are rounded to the nearest <code>5 ms</code> in order to avoid giving a false sense of precision. The <code>easterntimes i500</code> is also sold as the <code>tomoko MMC023</code>.</p> <p>The connection column indicates the connection used. <code>USB FS</code> stands for the <code>usb full speed</code> protocol, which allows up to 1000Hz polling, a feature commonly advertised by high-end keyboards. 
<code>USB</code> is the <code>usb low speed</code> protocol, which is the protocol most keyboards use. The ‘gaming’ column indicates whether or not the keyboard is branded as a gaming keyboard. <code>wireless</code> indicates some kind of keyboard-specific dongle and <code>unifying</code> is logitech's wireless device standard.</p> <p>We can see that, even with the limited set of keyboards tested, there can be as much as a 45ms difference in latency between keyboards. Moreover, a modern computer with one of the slower keyboards attached can’t possibly be as responsive as a quick machine from the 70s or 80s because the keyboard alone is slower than the entire response pipeline of some older computers.</p> <p>That establishes the fact that modern keyboards contribute to the latency bloat we’ve seen over the past forty years. The other half of the question is, does the latency added by a modern keyboard actually make a difference to users? From looking at the table, we can see that among the keyboards tested, we can get up to a 45ms difference in average latency. Is 45ms of latency noticeable? Let’s take a look at the empirical research on how much latency users notice.</p> <p>There’s a fair amount of empirical evidence on this and we can see that, for very simple tasks, <a href="https://pdfs.semanticscholar.org/386a/15fd85c162b8e4ebb6023acdce9df2bd43ee.pdf">people can perceive latencies down to 2ms or less</a>. Moreover, increasing latency is not only noticeable to users, <a href="http://www.tactuallabs.com/papers/howMuchFasterIsFastEnoughCHI15.pdf">it causes users to execute simple tasks less accurately</a>. If you want a visual demonstration of what latency looks like and you don’t have a super-fast old computer lying around, <a href="https://www.youtube.com/watch?v=vOvQCPLkPt4">check out this MSR demo on touchscreen latency</a>.</p> <h3 id="are-gaming-keyboards-faster-than-other-keyboards">Are gaming keyboards faster than other keyboards?</h3> <p>I’d really like to test more keyboards before making a strong claim, but from the preliminary tests here, it appears that gaming keyboards aren’t generally faster than non-gaming keyboards.</p> <p>Gaming keyboards often claim to have features that reduce latency, like connecting over USB FS and using 1000Hz polling. The USB low speed spec states that the minimum time between packets is <code>10ms</code>, or 100 Hz. However, it’s common to see USB devices round this down to the nearest power of two and run at <code>8ms</code>, or 125Hz. With <code>8ms</code> polling, the average latency added from having to wait until the next polling interval is <code>4ms</code>. With <code>1ms</code> polling, the average latency from USB polling is <code>0.5ms</code>, giving us a <code>3.5ms</code> delta. While that might be a significant contribution to latency for a quick keyboard like the Apple magic keyboard, it’s clear that other factors dominate keyboard latency for most keyboards and that the gaming keyboards tested here are so slow that shaving off <code>3.5ms</code> won’t save them.</p> <p>Another thing to note about gaming keyboards is that they often advertise &quot;n-key rollover&quot; (the ability to have n simultaneous keys pressed at once — for many key combinations, typical keyboards will often only let you press two keys at once, excluding modifier keys). 
Although n-key rollover wasn't generally tested here, I tried a &quot;Razer DeathStalker Expert Gaming Keyboard&quot; that advertises &quot;Anti-ghosting capability for up to 10 simultaneous key presses&quot;. The Razer gaming keyboard did not have this capability in a useful manner and many combinations of three keys didn't work. Their advertising claim could, I suppose, be technically true in that 3 in some cases could be &quot;up to 10&quot;, but like gaming keyboards claiming to have lower latency due to 1000 Hz polling, the claim is highly misleading at best.</p> <h3 id="conclusion">Conclusion</h3> <p>Most keyboards add enough latency to make the user experience noticeably worse, and keyboards that advertise speed aren’t necessarily faster. The two gaming keyboards we measured weren’t faster than non-gaming keyboards, and the fastest keyboard measured was a minimalist keyboard from Apple that’s marketed more on design than speed.</p> <p>Previously, we've seen that <a href="//danluu.com/term-latency/">terminals can add significant latency, up to 100ms in mildly pessimistic conditions if you choose the &quot;right&quot; terminal</a>. In a future post, we'll look at the entire end-to-end pipeline to see other places latency has crept in and we'll also look at how some modern devices keep latency down.</p> <h3 id="appendix-where-is-the-latency-coming-from">Appendix: where is the latency coming from?</h3> <p>A major source of latency is key travel time. It’s not a coincidence that the quickest keyboard measured also has the shortest key travel distance by a large margin. The video setup I’m using to measure end-to-end latency is a 240 fps camera, which means that frames are 4ms apart. When videoing &quot;normal&quot; keypresses and typing, it takes 4-8 frames for a key to become fully depressed. Most switches will start firing before the key is fully depressed, but the key travel time is still significant and can easily add <code>10ms</code> of delay (or more, depending on the switch mechanism). Contrast this to the Apple &quot;magic&quot; keyboard measured, where the key travel is so short that it can’t be captured with a 240 fps camera, indicating that the key travel time is &lt; 4ms.</p> <p>Note that, unlike the other measurement I was able to find online, this measurement was from the start of the keypress instead of the switch activation. This is because, as a human, you don't activate the switch, you press the key. A measurement that starts from switch activation time misses this large component of latency. If, for example, you're playing a game and you switch from moving forward to moving backwards when you see something happen, you have to pay the cost of the key movement, which is different for different keyboards. A common response to this is that &quot;real&quot; gamers will preload keys so that they don't have to pay the key travel cost, but if you go around with a high speed camera and look at how people actually use their keyboards, the fraction of keypresses that are significantly preloaded is basically zero even when you look at gamers. It's possible you'd see something different if you look at high-level competitive gamers, but even then, just for example, people who use a standard wasd or esdf layout will typically not preload a key when going from back to forward. Also, the idea that it's fine that keys have a bunch of useless travel because you can pre-depress the key before really pressing the key is just absurd. 
That's like saying latency on modern computers is fine because some people build gaming boxes that, when run with unusually well optimized software, get 50ms response time. Normal, non-hardcore-gaming users simply aren't going to do this. Since that's the vast majority of the market, even if all &quot;serious&quot; gamers did this, that would still be a rounding error.</p> <p>The other large sources of latency are scanning the <a href="http://blog.komar.be/how-to-make-a-keyboard-the-matrix/">keyboard matrix</a> and debouncing. Neither of these delays is inherent -- keyboards use a matrix that has to be scanned instead of having a wire per-key because it saves a few bucks, and most keyboards scan the matrix at such a slow rate that it induces human-noticeable delays because that saves a few bucks, but a manufacturer willing to spend a bit more on manufacturing a keyboard could make the delay from that far below the threshold of human perception. See below for debouncing delay.</p> <p>Although we didn't discuss throughput in this post, when I measure my typing speed, I find that I can type faster with the <a href="https://www.amazon.com/gp/product/B01NABDNPH/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01NABDNPH&amp;linkId=ccde207372bd2239204e1ebc3a36a9cd">low-travel Apple keyboard</a> than with any of the other keyboards. There's no way to do a blinded experiment for this, but Gary Bernhardt and others have also observed the same thing. Some people claim that key travel doesn't matter for typing speed because they use the minimum amount of travel necessary and that this therefore can't matter, but as with the above claims on keypresses, if you walk around with a high speed camera and observe what actually happens when people type, it's very hard to find someone who actually does this.</p> <h3 id="2022-update">2022 update</h3> <p>When I ran these experiments, it didn't seem that anyone was testing latency across multiple keyboards. I found the results I got so unintuitive that I tried to find anyone else's keyboard latency measurements and all I could find was a forum post from someone who tried to measure their keyboard (just one) and got results in the same range, but using a setup that wasn't fast enough to really measure the latency properly. I also video'd my test as well as non-test keypresses with a high-speed camera to see how much time it took to depress keys, and the results weren't obviously inconsistent with the results reported here.</p> <p>Starting a year or two after I wrote the post, I witnessed some discussions from some gaming mouse and keyboard makers on how to make lower latency devices and they started releasing devices that actually have lower latency, as opposed to the devices they had, which basically had gaming skins and would often light up.</p> <p>If you want a low-latency keyboard that isn't the Apple keyboard (quite a few people I've talked to report finger pain after using the Apple keyboard for an extended period of time), the <a href="https://amzn.to/3AYD7FG">SteelSeries Apex Pro</a> is fairly low latency; for a mouse, the <a href="https://amzn.to/3gPLoVl">Corsair Sabre</a> is also pretty quick.</p> <p>Another change since then is that more people understand that debouncing doesn't have to add noticeable latency. When I wrote the original post, I had multiple keyboard makers explain to me that the post is wrong and it's impossible to not add latency when debouncing. 
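(For reference, here's a minimal sketch of how debouncing can avoid adding latency -- an illustrative, firmware-style loop written for this post, not code from any real keyboard: report the state change as soon as an edge is seen, and only use the debounce window to ignore the bounce that follows.)</p> <pre><code>DEBOUNCE_MS = 5  # hypothetical debounce window

def scan_loop(read_raw_key, send_event, now_ms):
    # read_raw_key, send_event, and now_ms are hypothetical hooks for the raw
    # switch state, the USB report path, and a millisecond clock.
    last_reported = read_raw_key()  # debounced state the host knows about
    ignore_until = 0                # end of the current debounce window
    while True:
        t = now_ms()
        raw = read_raw_key()
        if raw != last_reported and t &gt;= ignore_until:
            send_event(raw)                 # report immediately, on the edge
            last_reported = raw
            ignore_until = t + DEBOUNCE_MS  # then ignore bounce for a few ms
</code></pre> <p>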
I found that very odd since I'd expect a freshman EE or, for that matter, a high school kid who plays with electronics, to understand why that's not the case but, for whatever reason, multiple people who made keyboards for a living didn't understand this. Now, how to debounce without adding latency has become common knowledge and, when I see discussions where someone says debouncing must add a lot of latency, they usually get corrected. This knowledge has spread to most keyboard makers and reduced keyboard latency for some new keyboards, although I know there's still at least one keyboard maker that doesn't believe that you can debounce with low latency and they still add quite a bit of latency to their new keyboards as a result.</p> <h3 id="appendix-counter-arguments-to-common-arguments-that-latency-doesn-t-matter">Appendix: counter-arguments to common arguments that latency doesn’t matter</h3> <p>Before writing this up, I read what I could find about latency and it was hard to find non-specialist articles or comment sections that didn’t have at least one of the arguments listed below:</p> <h4 id="computers-and-devices-are-fast">Computers and devices are fast</h4> <p>The most common response to questions about latency is that input latency is basically zero, or so close to zero that it’s a rounding error. For example, two of the top comments on <a href="https://hardware.slashdot.org/story/13/07/13/1331235/ask-slashdot-low-latency-ps2usb-gaming-keyboards" rel="nofollow">this slashdot post asking about keyboard latency</a> are that keyboards are so fast that keyboard speed doesn’t matter. One person even says</p> <blockquote> <p>There is not a single modern keyboard that has 50ms latency. You (humans) have that sort of latency.</p> <p>As far as response times, all you need to do is increase the poll time on the USB stack</p> </blockquote> <p>As we’ve seen, some devices do have latencies in the 50ms range. This quote as well as other comments in the thread illustrate another common fallacy -- that input devices are limited by the speed of the USB polling. While that’s technically possible, most devices are nowhere near being fast enough to be limited by USB polling latency.</p> <p>Unfortunately, most online explanations of input latency <a href="https://byuu.org/articles/latency/" rel="nofollow">assume that the USB bus is the limiting factor</a>.</p> <h4 id="humans-can-t-notice-100ms-or-200ms-latency">Humans can’t notice 100ms or 200ms latency</h4> <p><a href="https://stackoverflow.com/a/39985088/334816" rel="nofollow">Here’s a &quot;cognitive neuroscientist who studies visual perception and cognition&quot;</a> who refers to the fact that human reaction time is roughly 200ms, and then throws in a bunch more scientific mumbo jumbo to say that no one could really notice latencies below 100ms. This is a little unusual in that the commenter claims some kind of special authority and uses a lot of terminology, but it’s common to hear people claim that you can’t notice 50ms or 100ms of latency because human reaction time is 200ms. This doesn’t actually make sense because these are independent quantities. 
This line of argument is like saying that you wouldn’t notice a flight being delayed by an hour because the duration of the flight is six hours.</p> <p>Another problem with this line of reasoning is that the full pipeline from keypress to screen update is quite long and if you say that it’s always fine to add 10ms here and 10ms there, you end up with a much larger amount of bloat through the entire pipeline, which is how we got where we are today, where you can buy a system with the CPU that gives you the fastest single-threaded performance money can buy and get 6x the latency of a machine from the 70s.</p> <h4 id="it-doesn-t-matter-because-the-game-loop-runs-at-60-hz">It doesn’t matter because the game loop runs at 60 Hz</h4> <p>This is fundamentally the same fallacy as above. If you have a delay that’s half the duration of a clock period, there’s a 50% chance the delay will push the event into the next processing step. That’s better than a 100% chance, but it’s not clear to me why people think that you’d need a delay as long as the clock period for the delay to matter. And for reference, the <code>45ms</code> delta between the slowest and fastest keyboard measured here corresponds to 2.7 frames at 60fps.</p> <h4 id="keyboards-can-t-possibly-response-faster-more-quickly-than-5ms-10ms-20ms-due-to-debouncing-http-www-eng-utah-edu-cs5780-debouncing-pdf">Keyboards can’t possibly respond more quickly than 5ms/10ms/20ms due to <a href="http://www.eng.utah.edu/~cs5780/debouncing.pdf">debouncing</a></h4> <p>Even without going through contortions to optimize the switch mechanism, if you’re willing to put hysteresis into the system, there’s no reason that the keyboard can’t assume a keypress (or release) is happening the moment it sees an edge. This is commonly done for other types of systems and AFAICT there’s no reason keyboards couldn’t do the same thing (and perhaps some do). The debounce time might limit the repeat rate of the key, but there’s no inherent reason that it has to affect the latency. And if we're looking at the repeat rate, imagine we have a 5ms limit on the rate of change of the key state due to introducing hysteresis. That gives us one full keypress cycle (press and release) every 10ms, or 100 keypresses per second per key, which is well beyond the capacity of any human. You might argue that this introduces a kind of imprecision, which might matter in some applications (music, rhythm games), but that's limited by the switch mechanism. Using a debouncing mechanism with hysteresis doesn't make us any worse off than we were before.</p> <p>An additional problem with debounce delay is that most keyboard manufacturers seem to have confounded scan rate and debounce delay. It's common to see keyboards with scan rates in the 100 Hz to 200 Hz range. This is justified by statements like &quot;there's no point in scanning faster because the debounce delay is 5ms&quot;, which combines two fallacies mentioned above. If you pull out the schematics for the Apple 2e, you can see that the scan rate is roughly 50 kHz. Its debounce time is roughly 6ms, which corresponds to a frequency of 167 Hz. Why scan so quickly? 
The fast scan allows the keyboard controller to start the clock on the debounce time almost immediately (after at most 20 microseconds), as opposed to a modern keyboard that scans at 167 Hz, which might not start the clock on debouncing for 6ms, or after 300x as much time.</p> <p>Apologies for not explaining terminology here, but I think that anyone making this objection should understand the explanation :-).</p> <h3 id="appendix-experimental-setup">Appendix: experimental setup</h3> <p>The USB measurement setup was a <a href="https://www.amazon.com/gp/product/B000E5CYW8/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B000E5CYW8&amp;linkId=4e14a96314e989f8166fe183a0a9b0ee">USB cable</a> that was cut open to expose the wires, connected to a logic analyzer. The exact model of logic analyzer doesn’t really matter, but if you’re curious about the details, this set of experiments used a <a href="https://www.amazon.com/gp/product/B074TWM5WX/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B074TWM5WX&amp;linkId=562b98ee01041c876c27311abdff418c">Saleae Pro</a>. While the exact model of USB cable doesn't matter, if you want to reproduce the experiment, I recommend using a <a href="https://www.amazon.com/gp/product/B000E5CYW8/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B000E5CYW8&amp;linkId=4e14a96314e989f8166fe183a0a9b0ee">relatively short USB cable</a>. Cutting open the cable damages the signal integrity and I found that, with a very long cable, some keyboards that weakly drive the data lines didn't drive them strongly enough to get a good signal with the cheap logic analyzer I used.</p> <p>The start-of-input was measured by pressing two keys at once -- one key on the keyboard and a button that was also connected to the logic analyzer. This introduces some jitter as the two buttons won’t be pressed at exactly the same time. To calibrate the setup, we used two identical buttons connected to the logic analyzer. The median jitter was &lt; 1ms and the 90%-ile jitter was roughly 5ms. This is enough that tail latency measurements for quick keyboards aren’t really possible with this setup, but average latency measurements like the ones done here seem like they should be ok. The input jitter could probably be reduced to a negligible level by building a device to both trigger the logic analyzer and press a key on the keyboard under test at the same time. Average latency measurements would also get better with such a setup (because it would be easier to run a large number of measurements).</p> <p>If you want to know the exact setup, a <a href="https://www.digikey.com/product-detail/en/e-switch/LL1105AF065Q/LL1105AF065Q-ND/3777946">E-switch LL1105AF065Q</a> switch was used. Power and ground were supplied by <a href="https://www.amazon.com/gp/product/B008GRTSV6/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B008GRTSV6&amp;linkId=51afaa1d76d1e6ef7c1e6bc0eeebdf9d">an arduino board</a>. There’s no particular reason to use this setup. In fact, it’s a bit absurd to use an entire arduino to provide power, but this was done with spare parts that were lying around and this stuff just happened to be stuff that <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> had in their lab, with the exception of the switches. 
There weren’t two identical copies of any switch, so we bought a few switches so we could do calibration measurements with two identical switches. The exact type of switch isn’t important here; any low-resistance switch would do.</p> <p>Tests were done by pressing the <code>z</code> key and then looking for byte 29 on the USB bus and then marking the end of the first packet containing the appropriate information. But, as above, any key would do.</p> <p>I don't actually trust this setup and I'd like to build a completely automated setup before testing more keyboards. While the measurements are in line with the one other keyboard measurement I could find online, this setup has an inherent imprecision that's probably in the 1ms to 10ms range. While averaging across multiple measurements reduces that imprecision, since the measurements are done by a human, it's not guaranteed and perhaps not even likely that the errors are independent and will average out.</p> <p>This project was done with help from Wesley Aptekar-Cassels, Leah Hanson, and Kate Murphy.</p> <p>Thanks to <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>, Ahmad Jarara, Raph Levien, Peter Bhat Harkins, Brennan Chesley, Dan Bentley, Kate Murphy, Christian Ternus, Sophie Haskins, and Dan Puttick for letting us use their keyboards for testing.</p> <p>Thanks to Leah Hanson, Mark Feeney, Greg Kennedy, and Zach Allaun for comments/corrections/discussion on this post. </i></p> Branch prediction branch-prediction/ Wed, 23 Aug 2017 00:00:00 +0000 branch-prediction/ <p><em>This is a pseudo-transcript for a talk on branch prediction given at Two Sigma on 8/22/2017 to kick off &quot;localhost&quot;, a talk series organized by <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>.</em></p> <p>How many of you use branches in your code? Could you please raise your hand if you use if statements or pattern matching?</p> <p><code>Most of the audience raises their hands</code></p> <p>I won’t ask you to raise your hands for this next part, but my guess is that if I asked, how many of you feel like you have a good understanding of what your CPU does when it executes a branch and what the performance implications are, and how many of you feel like you could understand a modern paper on branch prediction, fewer people would raise their hands.</p> <p>The purpose of this talk is to explain how and why CPUs do “branch prediction” and then explain enough about classic branch prediction algorithms that you could read a modern paper on branch prediction and basically know what’s going on.</p> <p>Before we talk about branch prediction, let’s talk about why CPUs do branch prediction. To do that, we’ll need to know a bit about how CPUs work.</p> <p>For the purposes of this talk, you can think of your computer as a CPU plus some memory. The instructions live in memory and the CPU executes a sequence of instructions from memory, where instructions are things like “add two numbers”, “move a chunk of data from memory to the processor”. Normally, after executing one instruction, the CPU will execute the instruction that’s at the next sequential address. However, there are instructions called “branches” that let you change the address the next instruction comes from.</p> <p>Here’s an abstract diagram of a CPU executing some instructions. 
The x-axis is time and the y-axis distinguishes different instructions.</p> <p><img src="images/branch-prediction/1-pipeline.png" alt="Instructions executing sequentially" height="130" width="1262"></p> <p>Here, we execute instruction <code>A</code>, followed by instruction <code>B</code>, followed by instruction <code>C</code>, followed by instruction <code>D</code>.</p> <p>One way you might design a CPU is to have the CPU do all of the work for one instruction, then move on to the next instruction, do all of the work for the next instruction, and so on. There’s nothing wrong with this; a lot of older CPUs did this, and some modern very low-cost CPUs still do this. But if you want to make a faster CPU, you might make a CPU that works like an assembly line. That is, you break the CPU up into two parts, so that half the CPU can do the “front half” of the work for an instruction while half the CPU works on the “back half” of the work for an instruction, like an assembly line. This is typically called a pipelined CPU.</p> <p><img src="images/branch-prediction/2-pipeline.png" alt="Instructions with overlapping execution" height="275" width="1263"></p> <p>If you do this, the execution might look something like the above. After the first half of instruction A is complete, the CPU can work on the second half of instruction A while the first half of instruction B runs. And when the second half of A finishes, the CPU can start on both the second half of B and the first half of C. In this diagram, you can see that the pipelined CPU can execute twice as many instructions per unit time as the unpipelined CPU above.</p> <p>There’s no reason that a CPU has to be broken up into only two parts. We could break the CPU into three parts, and get a 3x speedup, or four parts and get a 4x speedup. This isn’t strictly true, and we generally get less than a 3x speedup for a three-stage pipeline or 4x speedup for a 4-stage pipeline because there’s overhead in breaking the CPU up into more parts and having a deeper pipeline.</p> <p>One source of overhead is how branches are handled. One of the first things the CPU has to do for an instruction is to get the instruction; to do that, it has to know where the instruction is. For example, consider the following code:</p> <pre><code>if (x == 0) {
    // Do stuff
} else {
    // Do other stuff (things)
}
// Whatever happens later
</code></pre> <p>This might turn into assembly that looks something like</p> <pre><code>branch_if_not_equal x, 0, else_label
// Do stuff
goto end_label
else_label:
// Do things
end_label:
// whatever happens later
</code></pre> <p>In this example, we compare <code>x</code> to 0. If they’re not equal (<code>if_not_equal</code>), we branch to <code>else_label</code> and execute the code in the else block. If that comparison fails (i.e., if <code>x</code> is 0), we fall through, execute the code in the <code>if</code> block, and then jump to <code>end_label</code> in order to avoid executing the code in the <code>else</code> block.</p> <p>The particular sequence of instructions that’s problematic for pipelining is</p> <pre><code>branch_if_not_equal x, 0, else_label
???
</code></pre> <p>The CPU doesn’t know if this is going to be</p> <pre><code>branch_if_not_equal x, 0, else_label
// Do stuff
</code></pre> <p>or</p> <pre><code>branch_if_not_equal x, 0, else_label
// Do things
</code></pre> <p>until the branch has finished (or nearly finished) executing. 
Since one of the first things the CPU needs to do for an instruction is to get the instruction from memory, and we don’t know which instruction <code>???</code> is going to be, we can’t even start on <code>???</code> until the previous instruction is nearly finished.</p> <p>Earlier, when we said that we’d get a 3x speedup for a 3-stage pipeline or a 20x speedup for a 20-stage pipeline, that assumed that you could start a new instruction every cycle, but in this case the two instructions are nearly serialized.</p> <p><img src="images/branch-prediction/3-branch-stall.png" alt="non-overlapping execution due to branch stall" height="72" width="640"></p> <p>One way around this problem is to use branch prediction. When a branch shows up, the CPU will guess if the branch was taken or not taken.</p> <p><img src="images/branch-prediction/4-branch-speculation.png" alt="speculating about a branch result" height="69" width="326"></p> <p>In this case, the CPU predicts that the branch won’t be taken and starts executing the first half of <code>stuff</code> while it’s executing the second half of the branch. If the prediction is correct, the CPU will execute the second half of <code>stuff</code> and can start another instruction while it’s executing the second half of <code>stuff</code>, like we saw in the first pipeline diagram.</p> <p><img src="images/branch-prediction/5-branch-correct-prediction.png" alt="overlapped execution after a correct prediction" height="72" width="482"></p> <p>If the prediction is wrong, when the branch finishes executing, the CPU will throw away the result from <code>stuff.1</code> and start executing the correct instructions instead of the wrong instructions. Since we would’ve stalled the processor and not executed any instructions if we didn’t have branch prediction, we’re no worse off than we would’ve been had we not made a prediction (at least at the level of detail we’re looking at).</p> <p><img src="images/branch-prediction/6-branch-misprediction.png" alt="aborted prediction" height="101" width="640"></p> <p>What’s the performance impact of doing this? To make an estimate, we’ll need a performance model and a workload. For the purposes of this talk, our cartoon model of a CPU will be a pipelined CPU where non-branches take an average of one cycle, unpredicted or mispredicted branches take 20 cycles, and correctly predicted branches take one cycle.</p> <p>If we look at the most commonly used benchmark of “workstation” integer workloads, SPECint, the composition is maybe 20% branches, and 80% other operations. Without branch prediction, we then expect the “average” instruction to take <code>branch_pct * 20 + non_branch_pct * 1 = 0.2 * 20 + 0.8 * 1 = 4 + 0.8 = 4.8 cycles</code>. With perfect, 100% accurate, branch prediction, we’d expect the average instruction to take 0.8 * 1 + 0.2 * 1 = 1 cycle, a 4.8x speedup! Another way to look at it is that if we have a pipeline with a 20-cycle branch misprediction penalty, we have nearly a 5x overhead from our ideal pipelining speedup just from branches alone.</p> <p>Let’s see what we can do about this. We’ll start with the most naive things someone might do and work our way up to something better.</p> <h3 id="predict-taken">Predict taken</h3> <p>Instead of predicting randomly, we could look at all branches in the execution of all programs. If we do this, we’ll see that taken and not-taken branches aren’t exactly balanced -- there are substantially more taken branches than not-taken branches. 
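(As an aside, the cycles-per-instruction numbers quoted for each scheme in the rest of this talk come from the cartoon cost model above; here's a tiny helper, using the 20% branch fraction and 20-cycle misprediction penalty we just assumed, in case you want to check the numbers as we go.)</p> <pre><code># Cartoon cost model from above: non-branches and correctly predicted branches
# cost 1 cycle, mispredicted (or unpredicted) branches cost 20 cycles.
BRANCH_FRACTION = 0.2
MISPREDICT_PENALTY = 20

def cycles_per_instruction(prediction_accuracy):
    non_branches = (1 - BRANCH_FRACTION) * 1
    predicted = BRANCH_FRACTION * prediction_accuracy * 1
    mispredicted = BRANCH_FRACTION * (1 - prediction_accuracy) * MISPREDICT_PENALTY
    return non_branches + predicted + mispredicted

# cycles_per_instruction(0.0) gives 4.8, cycles_per_instruction(1.0) gives 1.0,
# and cycles_per_instruction(0.7) gives 2.14 (the predict-taken scheme below).
</code></pre> <p>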
One reason there are more taken than not-taken branches is that loop branches are often taken.</p> <p>If we predict that every branch is taken, we might get 70% accuracy, which means we’ll pay the misprediction cost for 30% of branches, making the cost of an average instruction <code>(0.8 + 0.7 * 0.2) * 1 + 0.3 * 0.2 * 20 = 0.94 + 1.2 = 2.14</code>. If we compare always predicting taken to no prediction and perfect prediction, always predicting taken gets a large fraction of the benefit of perfect prediction despite being a very simple algorithm.</p> <p><img src="images/branch-prediction/7-always-taken-cpi.png" alt="2.14 cycles per instruction" height="156" width="1147"></p> <h3 id="backwards-taken-forwards-not-taken-btfnt">Backwards taken forwards not taken (BTFNT)</h3> <p>Predicting branches as taken works well for loops, but not so great for all branches. If we look at whether or not branches are taken based on whether or not the branch is forward (skips over code) or backwards (goes back to previous code), we can see that backwards branches are taken more often than forward branches, so we could try a predictor which predicts that backward branches are taken and forward branches aren’t taken (BTFNT). If we implement this scheme in hardware, compiler writers will conspire with us to arrange code such that branches the compiler thinks will be taken will be backwards branches and branches the compiler thinks won’t be taken will be forward branches.</p> <p>If we do this, we might get something like 80% prediction accuracy, making our cost function <code>(0.8 + 0.8 * 0.2) * 1 + 0.2 * 0.2 * 20 = 0.96 + 0.8 = 1.76</code> cycles per instruction.</p> <p><img src="images/branch-prediction/8-btfnt-cpi.png" alt="1.76 cycles per instruction" height="157" width="1147"></p> <h4 id="used-by">Used by</h4> <ul> <li>PPC 601 (1993): also uses compiler generated branch hints</li> <li>PPC 603</li> </ul> <h3 id="one-bit">One-bit</h3> <p>So far, we’ve looked at schemes that don’t store any state, i.e., schemes where the prediction ignores the program’s execution history. These are called <em>static</em> branch prediction schemes in the literature. These schemes have the advantage of being simple but they have the disadvantage of being bad at predicting branches whose behavior changes over time. If you want an example of a branch whose behavior changes over time, you might imagine some code like</p> <pre><code>if (flag) {
    // things
}
</code></pre> <p>Over the course of the program, we might have one phase of the program where the flag is set and the branch is taken and another phase of the program where flag isn’t set and the branch isn’t taken. There’s no way for a static scheme to make good predictions for a branch like that, so let’s consider <em>dynamic</em> branch prediction schemes, where the prediction can change based on the program history.</p> <p>One of the simplest things we might do is to make a prediction based on the last result of the branch, i.e., we predict taken if the branch was taken last time and we predict not taken if the branch wasn’t taken last time.</p> <p>Since having one bit for every possible branch is too many bits to feasibly store, we’ll keep a table of some number of branches we’ve seen and their last results. 
For this talk, let’s store <code>not taken</code> as <code>0</code> and <code>taken</code> as <code>1</code>.</p> <p><img src="images/branch-prediction/9-1bit.png" alt="prediction table with 1-bit entries indexed by low bits of branch address" height="420" width="1358"></p> <p>In this case, just to make things fit on a diagram, we have a 64-entry table, which means that we can index into the table with 6 bits, so we index into the table with the low 6 bits of the branch address. After we execute a branch, we update the entry in the prediction table (highlighted below) and the next time the branch is executed again, we index into the same entry and use the updated value for the prediction.</p> <p><img src="images/branch-prediction/10-1bit-update.png" alt="indexed entry changes on update" height="420" width="1358"></p> <p>It’s possible that we’ll observe aliasing and two branches at two different locations will map to the same table entry. This isn’t ideal, but there’s a tradeoff between table speed &amp; cost vs. size that effectively limits the size of the table.</p> <p>If we use a one-bit scheme, we might get 85% accuracy, a cost of <code>(0.8 + 0.85 * 0.2) * 1 + 0.15 * 0.2 * 20 = 0.97 + 0.6 = 1.57</code> cycles per instruction.</p> <p><img src="images/branch-prediction/11-1bit-cpi.png" alt="1.57 cycles per instruction" height="201" width="1147"></p> <h4 id="used-by-1">Used by</h4> <ul> <li>DEC EV4 (1992)</li> <li>MIPS R8000 (1994)</li> </ul> <h3 id="two-bit-https-courses-cs-washington-edu-courses-cse590g-04sp-smith-1981-a-study-of-branch-prediction-strategies-pdf"><a href="https://courses.cs.washington.edu/courses/cse590g/04sp/Smith-1981-A-Study-of-Branch-Prediction-Strategies.pdf">Two-bit</a></h3> <p>A one-bit scheme works fine for patterns like <code>TTTTTTTT…</code> or <code>NNNNNNN…</code> but will have a misprediction for a stream of branches that’s mostly taken but has one branch that’s not taken, <code>...TTTNTTT...</code> This can be fixed by adding a second bit for each address and implementing a saturating counter. Let’s arbitrarily say that we count down when a branch is not taken and count up when it’s taken. If we look at the binary values, we’ll then end up with:</p> <pre><code>00: predict Not
01: predict Not
10: predict Taken
11: predict Taken
</code></pre> <p>The “saturating” part of saturating counter means that if we count down from <code>00</code>, instead of underflowing, we stay at <code>00</code>, and similar for counting up from <code>11</code> staying at <code>11</code>. 
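(In code, a sketch of this scheme might look something like the following -- the table size and the use of the low bits of the branch address as the index mirror the diagrams in this section and aren't anything fundamental.)</p> <pre><code># Sketch of a two-bit saturating-counter predictor (illustrative, not any real
# CPU's design). Each entry is a counter in 0..3; 0-1 predict not taken,
# 2-3 predict taken.
TABLE_SIZE = 64              # 64 entries, indexed by 6 bits of branch address
table = [1] * TABLE_SIZE     # arbitrary initial state: weakly not taken

def index(branch_address):
    return branch_address % TABLE_SIZE        # low bits of the branch address

def predict(branch_address):
    return table[index(branch_address)] &gt;= 2  # True means predict taken

def update(branch_address, taken):
    i = index(branch_address)
    if taken:
        table[i] = min(table[i] + 1, 3)       # count up, saturating at 11
    else:
        table[i] = max(table[i] - 1, 0)       # count down, saturating at 00
</code></pre> <p>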
This scheme is identical to the one-bit scheme, except that each entry in the prediction table is two bits instead of one bit.</p> <p><img src="images/branch-prediction/12-2bit.png" alt="same as 1-bit, except that the table has 2 bits" height="481" width="1371"></p> <p>Compared to a one-bit scheme, a two-bit scheme can have half as many entries at the same size/cost (if we only consider the cost of storage and ignore the cost of the logic for the saturating counter), but even so, for most reasonable table sizes a two-bit scheme provides better accuracy.</p> <p>Despite being simple, this works quite well, and we might expect to see something like 90% accuracy for a two bit predictor, which gives us a cost of 1.38 cycles per instruction.</p> <p><img src="images/branch-prediction/13-2bit-cpi.png" alt="1.38 cycles per instruction" height="234" width="1147"></p> <p>One natural thing to do would be to generalize the scheme to an n-bit saturating counter, but it turns out that adding more bits has a relatively small effect on accuracy. We haven’t really discussed the cost of the branch predictor, but going from 2 bits to 3 bits per branch increases the table size by 1.5x for little gain, which makes it not worth the cost in most cases. The simplest and most common things that we won’t predict well with a two-bit scheme are patterns like <code>NTNTNTNTNT...</code> or <code>NNTNNTNNT…</code>, but going to n-bits won’t let us predict those patterns well either!</p> <h4 id="used-by-2">Used by</h4> <ul> <li>LLNL S-1 (1977)</li> <li>CDC Cyber? (early 80s)</li> <li>Burroughs B4900 (1982): state stored in instruction stream; hardware would over-write instruction to update branch state</li> <li>Intel Pentium (1993)</li> <li>PPC 604 (1994)</li> <li>DEC EV45 (1993)</li> <li>DEC EV5 (1995)</li> <li>PA 8000 (1996): actually a 3-bit shift register with majority vote</li> </ul> <h3 id="two-level-adaptive-global-http-www-seas-upenn-edu-cis501-papers-two-level-branch-pred-pdf-1991"><a href="http://www.seas.upenn.edu/~cis501/papers/two-level-branch-pred.pdf">Two-level adaptive, global</a> (1991)</h3> <p>If we think about code like</p> <pre><code>for (int i = 0; i &lt; 3; ++i) { // code here. } </code></pre> <p>That code will produce a pattern of branches like <code>TTTNTTTNTTTN...</code>.</p> <p>If we know the last three executions of the branch, we should be able to predict the next execution of the branch:</p> <pre><code>TTT:N TTN:T TNT:T NTT:T </code></pre> <p>The previous schemes we’ve considered use the branch address to index into a table that tells us if the branch is, according to recent history, more likely to be taken or not taken. That tells us which direction the branch is biased towards, but it can’t tell us that we’re in the middle of a repetitive pattern. To fix that, we’ll store the history of the most recent branches as well as a table of predictions.</p> <p><img src="images/branch-prediction/14-global.png" alt="Use global branch history and branch address to index into prediction table" height="481" width="1371"></p> <p>In this example, we concatenate 4 bits of branch history together with 2 bits of branch address to index into the prediction table. As before, the prediction comes from a 2-bit saturating counter. We don’t want to only use the branch history to index into our prediction table since, if we did that, any two branches with the same history would alias to the same table entry. 
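(As a sketch, the only changes relative to the two-bit predictor sketched above are a global history register and how the table index is formed -- here with 4 bits of history and 2 bits of branch address, as in the diagram.)</p> <pre><code># Sketch of the index calculation for the two-level adaptive (global) scheme:
# concatenate the global branch history with low bits of the branch address.
HISTORY_BITS = 4
ADDRESS_BITS = 2
history = 0   # last 4 branch outcomes; the newest outcome shifts in on the right

def index(branch_address):
    low_bits = branch_address % (1 &lt;&lt; ADDRESS_BITS)
    return (history &lt;&lt; ADDRESS_BITS) | low_bits   # 6-bit index into the table

def update_history(taken):
    global history
    history = ((history &lt;&lt; 1) | int(taken)) % (1 &lt;&lt; HISTORY_BITS)

# predict() and update() work exactly as in the saturating-counter sketch above,
# just using this index, and update_history() is called after every branch.
</code></pre> <p>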
In a real predictor, we’d probably have a larger table and use more bits of branch address, but in order to fit the table on a slide, we have an index that’s only 6 bits long.</p> <p>Below, we’ll see what gets updated when we execute a branch.</p> <p><img src="images/branch-prediction/15-global-update.png" alt="Update changes index because index uses bits from branch history" height="481" width="1371"></p> <p>The bolded parts are the parts that were updated. In this diagram, we shift new bits of branch history in from right to left, updating the branch history. Because the branch history is updated, the low bits of the index into the prediction table are updated, so the next time we take the same branch again, we’ll use a different entry in the table to make the prediction, unlike in previous schemes where the index is fixed by the branch address. The old entry’s value is updated so that the next time we take the same branch again with the same branch history, we’ll have the updated prediction.</p> <p>Since the history in this scheme is global, this will correctly predict patterns like <code>NTNTNTNT…</code> in inner loops, but may not always make correct predictions for higher-level branches because the history is global and will be contaminated with information from other branches. However, the tradeoff here is that keeping a global history is cheaper than keeping a table of local histories. Additionally, using a global history lets us correctly predict correlated branches. For example, we might have something like:</p> <pre><code>if x &gt; 0:
    x -= 1
if y &gt; 0:
    y -= 1
if x * y &gt; 0:
    foo()
</code></pre> <p>If either the first branch or the next branch isn’t taken, then the third branch definitely will not be taken (assuming <code>x</code> and <code>y</code> aren’t negative).</p> <p>With this scheme, we might get 93% accuracy, giving us 1.27 cycles per instruction.</p> <p><img src="images/branch-prediction/16-global-cpi.png" alt="1.27 cycles per instruction" height="290" width="1147"></p> <h4 id="used-by-3">Used by</h4> <ul> <li>Pentium MMX (1996): 4-bit global branch history</li> </ul> <h3 id="two-level-adaptive-local-http-www-cse-iitd-ac-in-srsarangi-col-718-2017-papers-branchpred-alternative-impl-pdf-1992"><a href="http://www.cse.iitd.ac.in/~srsarangi/col_718_2017/papers/branchpred/alternative-impl.pdf">Two-level adaptive, local</a> [1992]</h3> <p>As mentioned above, an issue with the global history scheme is that the branch history for local branches that could be predicted cleanly gets contaminated by other branches.</p> <p>One way to get good local predictions is to keep separate branch histories for separate branches.</p> <p><img src="images/branch-prediction/17-local.png" alt="keep a table of per-branch histories instead of a global history" height="505" width="1369"></p> <p>Instead of keeping a single global history, we keep a table of local histories, indexed by the branch address. This scheme is identical to the global scheme we just looked at, except that we keep multiple branch histories. 
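(Sketching just the delta from the global scheme: the single history register becomes a small table of per-branch histories, indexed by the branch address, and everything else stays the same; the table sizes here are made up for illustration.)</p> <pre><code># Sketch of the local-history variant: the history used to form the index is
# looked up per branch instead of being shared by all branches.
HISTORY_TABLE_SIZE = 16
HISTORY_BITS = 4
histories = [0] * HISTORY_TABLE_SIZE

def local_history(branch_address):
    return histories[branch_address % HISTORY_TABLE_SIZE]

def update_local_history(branch_address, taken):
    i = branch_address % HISTORY_TABLE_SIZE
    histories[i] = ((histories[i] &lt;&lt; 1) | int(taken)) % (1 &lt;&lt; HISTORY_BITS)
</code></pre> <p>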
One way to think about this is that having global history is a special case of local history, where the number of histories we keep track of is <code>1</code>.</p> <p>With this scheme, we might get something like 94% accuracy, which gives us a cost of 1.23 cycles per instruction.</p> <p><img src="images/branch-prediction/18-local-cpi.png" alt="1.23 cycles per instruction" height="362" width="1147"></p> <h4 id="used-by-4">Used by</h4> <ul> <li>Pentium Pro (1996): <a href="http://www.ece.uah.edu/~milenka/docs/milenkovic_WDDD02.pdf">4 bit local branch history, low bits of PC used for index</a>. Note that this is under some dispute and Agner Fog claims that the PPro and follow-on processors use 4-bit global history</li> <li>Pentium II (1997): same as PPro</li> <li>Pentium III (1999): same as PPro</li> </ul> <h3 id="gshare">gshare</h3> <p>One tradeoff a global two-level scheme has to make is that, for a prediction table of a fixed size, bits must be dedicated to either the branch history or the branch address. We’d like to give more bits to the branch history because that allows correlations across greater “distance” as well as tracking more complicated patterns and we’d like to give more bits to the branch address to avoid interference between unrelated branches.</p> <p>We can try to get the best of both worlds by hashing both the branch history and the branch address instead of concatenating them. One of the simplest reasonable things one might do, and the first proposed mechanism, was to <a href="https://en.wikipedia.org/wiki/XOR_gate"><code>xor</code></a> them together. This two-level adaptive scheme, where we <code>xor</code> the bits together, is called <code>gshare</code>.</p> <p><img src="images/branch-prediction/19-gshare.png" alt="hash branch address and branch history instead of appending" height="488" width="1371"></p> <p>With this scheme, we might see something like 94% accuracy. That’s the accuracy we got from the local scheme we just looked at, but gshare avoids having to keep a large table of local histories; getting the same accuracy while having to track less state is a significant improvement.</p> <h4 id="used-by-5">Used by</h4> <ul> <li>MIPS R12000 (1998): <a href="https://courses.cs.washington.edu/courses/csep548/06au/lectures/branchPred.pdf">2K entries, 11 bits of PC, 8 bits of history</a></li> <li>UltraSPARC-3 (2001): <a href="https://courses.cs.washington.edu/courses/csep548/06au/lectures/branchPred.pdf">16K entries, 14 bits of PC, 12 bits of history</a></li> </ul> <h3 id="agree-http-meseec-ce-rit-edu-eecc722-fall2006-papers-branch-prediction-5-agree-isca24-pdf-1997"><a href="http://meseec.ce.rit.edu/eecc722-fall2006/papers/branch-prediction/5/agree_isca24.pdf">agree</a> (1997)</h3> <p>One reason for branch mispredictions is interference between different branches that alias to the same location. There are many ways to reduce interference between branches that alias to the same predictor table entry. In fact, the reason this talk doesn’t get past schemes invented in the 90s is that a wide variety of interference-reducing schemes were proposed and there are too many to cover in half an hour.</p> <p>We’ll look at one scheme which might give you an idea of what an interference-reducing scheme could look like, the “agree” predictor. When two branch-history pairs collide, the predictions either match or they don’t. If they match, we’ll call that neutral interference and if they don’t, we’ll call that negative interference. 
The idea is that most branches tend to be strongly biased (that is, if we use two-bit entries in the predictor table, we expect that, without interference, most entries will be <code>00</code> or <code>11</code> most of the time, not <code>01</code> or <code>10</code>). For each branch in the program, we’ll store one bit, which we call the “bias”. The table of predictions will, instead of storing the absolute branch predictions, store whether or not the prediction matches the bias.</p> <p><img src="images/branch-prediction/20-agree.png" alt="predict whether or not a branch agrees with its bias as opposed to whether or not it's taken" height="782" width="1371"></p> <p>If we look at how this works, the predictor is identical to a gshare predictor, except that we make the changes mentioned above -- the prediction is agree/disagree instead of taken/not-taken and we have a bias bit that’s indexed by the branch address, which gives us something to agree or disagree with. In the original paper, they propose using the first thing you see as the bias and other people have proposed using profile-guided optimization (basically running the program and feeding the data back to the compiler) to determine the bias.</p> <p>Note that, when we execute a branch and then later come back around to the same branch, we’ll use the same bias bit because the bias is indexed by the branch address, but we’ll use a different predictor table entry because that’s indexed by both the branch address and the branch history.</p> <p><img src="images/branch-prediction/21-agree-update.png" alt="updating uses the same bias but a different meta-prediction table entry" height="782" width="1371"></p> <p>If it seems weird that this would do anything, let’s look at a concrete example. Say we have two branches, branch A which is taken with 90% probability and branch B which is taken with 10% probability. If those two branches alias and we assume the probabilities that each branch is taken are independent, the probability that they disagree and negatively interfere is <code>P(A taken) * P(B not taken) + P(A not taken) * P(B taken) = (0.9 * 0.9) + (0.1 * 0.1) = 0.82</code>.</p> <p>If we use the agree scheme, we can re-do the calculation above, but the probability that the two branches disagree and negatively interfere is <code>P(A agree) * P(B disagree) + P(A disagree) * P(B agree) = P(A taken) * P(B taken) + P(A not taken) * P(B not taken) = (0.9 * 0.1) + (0.1 * 0.9) = 0.18</code>. Another way to look at it is, to have destructive interference, one of the branches must disagree with its bias. By definition, if we’ve correctly determined the bias, this is unlikely to happen.</p> <p>With this scheme, we might get something like 95% accuracy, giving us 1.19 cycles per instruction.</p> <p><img src="images/branch-prediction/22-agree-cpi.png" alt="1.19 cycles per instruction" height="432" width="1147"></p> <h4 id="used-by-6">Used by</h4> <ul> <li>PA-RISC 8700 (2001)</li> </ul> <h3 id="hybrid-http-www-hpl-hp-com-techreports-compaq-dec-wrl-tn-36-pdf-1993"><a href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-TN-36.pdf">Hybrid</a> (1993)</h3> <p>As we’ve seen, local predictors can predict some kinds of branches well (e.g., inner loops) and global predictors can predict some kinds of branches well (e.g., some correlated branches). One way to try to get the best of both worlds is to have both predictors, then have a meta predictor that predicts if the local or the global predictor should be used. 
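(One detail worth spelling out is how the chooser gets trained. A common approach, sketched below with the two component predictors passed in as plain functions, is to nudge the chooser toward whichever component was right on branches where the two disagree; as with the other sketches above, this is illustrative rather than any specific CPU's design.)</p> <pre><code># Sketch of the chooser (&quot;meta-predictor&quot;) in a hybrid predictor. Each entry is
# a two-bit counter: 0-1 means trust predictor A, 2-3 means trust predictor B.
CHOOSER_SIZE = 64
chooser = [2] * CHOOSER_SIZE

def choose(branch_address, predict_a, predict_b):
    entry = chooser[branch_address % CHOOSER_SIZE]
    return predict_b(branch_address) if entry &gt;= 2 else predict_a(branch_address)

def train_chooser(branch_address, taken, predicted_a, predicted_b):
    # Only adjust the chooser on branches where the components disagreed,
    # moving it toward whichever one turned out to be right.
    if predicted_a == predicted_b:
        return
    i = branch_address % CHOOSER_SIZE
    if predicted_b == taken:
        chooser[i] = min(chooser[i] + 1, 3)
    else:
        chooser[i] = max(chooser[i] - 1, 0)
</code></pre> <p>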
A simple way to do this is to have the meta-predictor use the same scheme as the two-bit predictor above, except that instead of predicting <code>taken</code> or <code>not taken</code> it predicts <code>local predictor</code> or <code>global predictor</code>.</p> <p><img src="images/branch-prediction/23-hybrid.png" alt="predict which of two predictors is correct instead of predicting if the branch is taken" height="752" width="1358"></p> <p>Just as there are many possible interference-reducing schemes, of which the <code>agree</code> predictor above is one, there are many possible hybrid schemes. We could use any two predictors, not just a local and global predictor, and we could even use more than two predictors.</p> <p>If we use a local and global predictor, we might get something like 96% accuracy, giving us 1.15 cycles per instruction.</p> <p><img src="images/branch-prediction/24-hybrid-cpi.png" alt="1.15 cycles per instruction" height="496" width="1147"></p> <h4 id="used-by-7">Used by</h4> <ul> <li>DEC EV6 (1998): combination of local (1k entries, 10 history bits, 3 bit counter) &amp; global (4k entries, 12 history bits, 2 bit counter) predictors</li> <li>IBM POWER4 (2001): local (16k entries) &amp; gshare (16k entries, 11 history bits, xor with branch address, 16k selector table)</li> <li>IBM POWER5 (2004): combination of bimodal (not covered) and two-level adaptive</li> <li>IBM POWER7 (2010)</li> </ul> <h3 id="not-covered">Not covered</h3> <p>There are a lot of things we didn’t cover in this talk! As you might expect, the set of material that we didn’t cover is much larger than what we did cover. I’ll briefly describe a few things we didn’t cover, with references, so you can look them up if you’re interested in learning more.</p> <p>One major thing we didn’t talk about is <a href="https://en.wikipedia.org/wiki/Branch_target_predictor">how to predict the branch target</a>. Note that this needs to be done even for some unconditional branches (that is, branches that don’t need directional prediction because they’re always taken), since <a href="http://www.engr.uconn.edu/~zshi/course/cse5302/ref/bray91btb_tr.pdf">(some) unconditional branches have unknown branch targets</a>.</p> <p>Branch target prediction is expensive enough that some early CPUs had a branch prediction policy of “always predict not taken” because a branch target isn’t necessary when you predict the branch won’t be taken! Always predicting not taken has poor accuracy, but it’s still better than making no prediction at all.</p> <p>Among the interference reducing predictors we didn’t discuss are <a href="https://people.eecs.berkeley.edu/~kubitron/courses/cs152-S04/handouts/papers/p4-lee.pdf">bi-mode</a>, <a href="http://meseec.ce.rit.edu/eecc722-fall2002/papers/branch-prediction/7/michaud97trading.pdf">gskew</a>, and <a href="http://web.eecs.umich.edu/~tnm/papers/yags.pdf">YAGS</a>. Very briefly, bi-mode is somewhat like agree in that it tries to separate out branches based on direction, but the mechanism used in bi-mode is that we keep multiple predictor tables and a third predictor based on the branch address is used to predict which predictor table gets used for the particular combination of branch and branch history. Bi-mode appears to be more successful than agree in that it's seen wider use. With gskew, we keep at least three predictor tables and use a different hash to index into each table. 
The idea is that, even if two branches alias, those two branches will only alias in one of the tables, so we can use a vote and the result from the other two tables will override the potentially bad result from the aliasing table. I don't know how to describe YAGS very briefly :-).</p> <p>Because we didn't talk about speed (as in latency), a prediction strategy we didn't talk about is to have a small/fast predictor that can be overridden by a slower and more accurate predictor when the slower predictor computes its result.</p> <p>Some modern CPUs have completely different branch predictors; AMD Zen (2017) and AMD Bulldozer (2011) chips appear to use <a href="https://www.cs.utexas.edu/~lin/papers/hpca01.pdf">perceptron-based branch predictors</a>. Perceptrons are single-layer neural nets.</p> <p><a href="https://hal.inria.fr/hal-01100647/document">It’s been argued that</a> Intel Haswell (2013) uses a variant of a <a href="http://www.irisa.fr/caps/people/seznec/JILP-COTTAGE.pdf">TAGE predictor</a>. TAGE stands for TAgged GEometric history length predictor. If we look at the predictors we’ve covered and look at actual executions of programs to see which branches we’re not predicting correctly, one major class of branches is branches that need a lot of history -- a significant number of branches need tens or hundreds of bits of history and some even need more than a thousand bits of branch history. If we have a single predictor or even a hybrid predictor that combines a few different predictors, it’s counterproductive to keep a thousand bits of history because that will make predictions worse for the branches which need a relatively small amount of history (especially relative to the cost), which is most branches. One of the ideas in the TAGE predictor is that, by keeping a geometric series of history lengths, each branch can use the appropriate history. That explains the GE. The TA part is that branches are tagged, which is a mechanism we don’t discuss that the predictor uses to track which branches should use which set of history.</p> <p>Modern CPUs often have specialized predictors, e.g., a loop predictor can accurately predict loop branches in cases where a generalized branch predictor couldn’t reasonably store enough history to make perfect predictions for every iteration of the loop.</p> <p>We didn’t talk at all about the tradeoff between using up more space and getting better predictions. Not only does changing the size of the table change the performance of a predictor, it also changes which predictors are better relative to each other.</p> <p>We also didn’t talk at all about how different workloads affect different branch predictors. Predictor performance varies not only based on table size but also based on which particular program is run.</p> <p>We’ve also talked about branch misprediction cost as if it’s a fixed thing, <a href="http://users.elis.ugent.be/~leeckhou/papers/ispass06-eyerman.pdf">but it is not</a>, and for that matter, the cost of non-branch instructions also varies widely between different workloads.</p> <p>I tried to avoid introducing non-self-explanatory terminology when possible, so if you read the literature, terminology will be somewhat different.</p> <h3 id="conclusion">Conclusion</h3> <p>We’ve looked at a variety of classic branch predictors and very briefly discussed a couple of newer predictors. 
Some of the classic predictors we discussed are still used in CPUs today, and if this were an hour long talk instead of a half-hour long talk, we could have discussed state-of-the-art predictors. I think that a lot of people have an idea that CPUs are mysterious and hard to understand, but I think that CPUs are actually easier to understand than software. I might be biased because I used to work on CPUs, but I think that this is not a result of my bias but something fundamental.</p> <p>If you think about the complexity of software, the main limiting factor on complexity is your imagination. If you can imagine something in enough detail that you can write it down, you can make it. Of course there are cases where that’s not the limiting factor and there’s something more practical (e.g., the performance of large scale applications), but I think that most of us spend most of our time writing software where the limiting factor is the ability to create and manage complexity.</p> <p>Hardware is quite different from this in that there are forces that push back against complexity. Every chunk of hardware you implement costs money, so you want to implement as little hardware as possible. Additionally, performance matters for most hardware (whether that’s absolute performance or performance per dollar or per watt or per other cost), and adding complexity makes hardware slower, which limits performance. Today, you can buy an off-the-shelf CPU for $300 which can be overclocked to 5 GHz. At 5 GHz, one unit of work is one-fifth of one nanosecond. For reference, light travels roughly one foot in one nanosecond. Another limiting factor is that people get pretty upset when CPUs don’t work perfectly all of the time. Although <a href="cpu-bugs/">CPUs do have bugs</a>, the rate of bugs is much lower than in almost all software, i.e., the standard to which they’re verified/tested is much higher. Adding complexity makes things harder to test and verify. Because CPUs are held to a higher correctness standard than <a href="everything-is-broken/">most software</a>, adding complexity creates a much higher test/verification burden on CPUs, which makes adding a similar amount of complexity much more expensive in hardware than in software, even ignoring the other factors we discussed.</p> <p>A side effect of these factors that push back against chip complexity is that, for any particular “high-level” general purpose CPU feature, it is generally conceptually simple enough that it can be described in a half-hour or hour-long talk. CPUs are simpler than many programmers think! 
BTW, I say “high-level” to rule out things like how transistors and circuit design, which can require a fair amount of low-level (physics or solid-state) background to understand.</p> <h3 id="cpu-internals-series">CPU internals series</h3> <ul> <li><a href="//danluu.com/new-cpu-features/">New CPU features since the 80s</a></li> <li><a href="//danluu.com/integer-overflow/">The cost of branches and integer overflow checking in real code</a></li> <li><a href="cpu-bugs/">CPU bugs</a></li> <li><a href="//danluu.com/branch-prediction/">A brief history of branch prediction</a></li> <li><a href="//danluu.com/hardware-unforgiving/">Why CPU development is hard</a></li> <li><a href="//danluu.com/why-hardware-development-is-hard/">Verilog sucks, part 1</a></li> <li><a href="//danluu.com/pl-troll/">Verilog sucks, part 2</a></li> </ul> <p><em>Thanks to Leah Hanson, Hari Angepat, and Nick Bergson-Shilcock for reviewing practice versions of the talk and to Fred Clausen Jr for finding a typo in this post. Apologies for the somewhat slapdash state of this post -- I wrote it quickly so that people who attended the talk could refer to the “transcript ” soon afterwards and look up references, but this means that there are probably more than the usual number of errors and that the organization isn’t as nice as it would be for a normal blog post. In particular, things that were explained using a series of animations in the talk are not explained in the same level of detail and on skimming this, I notice that there’s less explanation of what sorts of branches each predictor doesn’t handle well, and hence less motivation for each predictor. I may try to go back and add more motivation, but I’m unlikely to restructure the post completely and generate a new set of graphics that better convey concepts when there are a couple of still graphics next to text. Thanks to Julien Vivenot, Ralph Corderoy, Vaibhav Sagar, Mindy Preston, Stefan Kanthak, and Uri Shaked for catching typos in this hastily written post.</em></p> Sattolo's algorithm sattolo/ Wed, 09 Aug 2017 00:00:00 +0000 sattolo/ <p>I recently had a problem where part of the solution was to do a series of pointer accesses that would walk around a chunk of memory in pseudo-random order. Sattolo's algorithm provides a solution to this because it produces a permutation of a list with exactly one cycle, which guarantees that we will reach every element of the list even though we're traversing it in random order.</p> <p>However, the explanations of why the algorithm worked that I could find online either used some kind of mathematical machinery (Stirling numbers, assuming familiarity with cycle notation, etc.), or used logic that was hard for me to follow. I find that this is common for explanations of concepts that could, but don't have to, use a lot of mathematical machinery. I don't think there's anything wrong with using existing mathematical methods per se -- it's a nice mental shortcut if you're familiar with the concepts. If you're taking a combinatorics class, it makes sense to cover Stirling numbers and then rattle off a series of results whose proofs are trivial if you're familiar with Stirling numbers, but for people who are only interested in a single result, I think it's unfortunate that it's hard to find a relatively simple explanation that <a href="https://twitter.com/danluu/status/1147984717238562816">doesn't require any background</a>. 
When I was looking for a simple explanation, I also found a lot of people who were using Sattolo's algorithm in places where it wasn't appropriate and also people who didn't know that Sattolo's algorithm is what they were looking for, so here's an attempt at an explanation of why the algorithm works that doesn't assume an undergraduate combinatorics background.</p> <p>Before we look at Sattolo's algorithm, let's look at Fisher-Yates, which is an <a href="https://en.wikipedia.org/wiki/In-place_algorithm">in-place</a> algorithm that produces a random permutation of an array/vector, where every possible permutation occurs with uniform probability.</p> <p>We'll look at the code for Fisher-Yates and then how to prove that the algorithm produces the intended result.</p> <pre><code>import random

def shuffle(a):
    n = len(a)
    for i in range(n - 1):           # i from 0 to n-2, inclusive.
        j = random.randrange(i, n)   # j from i to n-1, inclusive.
        a[i], a[j] = a[j], a[i]      # swap a[i] and a[j].
</code></pre> <p><code>shuffle</code> takes an array and produces a permutation of the array, i.e., it shuffles the array. We can think of this loop as placing each element of the array, <code>a</code>, in turn, from <code>a[0]</code> to <code>a[n-2]</code>. On some iteration, <code>i</code>, we choose one of <code>n-i</code> elements to swap with and swap element <code>i</code> with some random element. The last element in the array, <code>a[n-1]</code>, is skipped because it would always be swapped with itself. One way to see that this produces every possible permutation with uniform probability is to write down the probability that each element will end up in any particular location<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">1</a></sup>. Another way to do it is to observe two facts about this algorithm:</p> <ol> <li>Every output that Fisher-Yates produces is produced with uniform probability</li> <li>Fisher-Yates produces as many outputs as there are permutations (and each output is a permutation)</li> </ol> <p>(1) For each random choice we make in the algorithm, if we make a different choice, we get a different output. For example, if we look at the resultant <code>a[0]</code>, the only way to place the element that was originally in <code>a[k]</code> (for some <code>k</code>) in the resultant <code>a[0]</code> is to swap <code>a[0]</code> with <code>a[k]</code> in iteration <code>0</code>. If we choose a different element to swap with, we'll end up with a different resultant <code>a[0]</code>. Once we place <code>a[0]</code> and look at the resultant <code>a[1]</code>, the same thing is true of <code>a[1]</code> and so on for each <code>a[i]</code>. Additionally, each choice reduces the range by the same amount -- there's a kind of symmetry, in that although we place <code>a[0]</code> first, we could have placed any other element first; every choice has the same effect. This is vaguely analogous to the reason that you can pick an integer uniformly at random by picking digits uniformly at random, one at a time.</p> <p>(2) How many different outputs does Fisher-Yates produce? On the first iteration, we fix one of <code>n</code> possible choices for <code>a[0]</code>, then given that choice, we fix one of <code>n-1</code> choices for <code>a[1]</code>, then one of <code>n-2</code> for <code>a[2]</code>, and so on, so there are <code>n * (n-1) * (n-2) * ... 2 * 1 = n!</code> possible different outputs.</p> <p>This is exactly the same number of possible permutations of <code>n</code> elements, by pretty much the same reasoning. If we want to count the number of possible permutations of <code>n</code> elements, we first pick one of <code>n</code> possible elements for the first position, <code>n-1</code> for the second position, and so on resulting in <code>n!</code> possible permutations.</p> <p>Since Fisher-Yates only produces unique permutations and there are exactly as many outputs as there are permutations, Fisher-Yates produces every possible permutation. Since Fisher-Yates produces each output with uniform probability, it produces all possible permutations with uniform probability.</p> <p>Now, let's look at Sattolo's algorithm, which is almost identical to Fisher-Yates and also produces a shuffled version of the input, but produces something quite different:</p> <pre><code>def sattolo(a):
    n = len(a)
    for i in range(n - 1):
        j = random.randrange(i+1, n)   # i+1 instead of i
        a[i], a[j] = a[j], a[i]
</code></pre> <p>Instead of picking an element at random to swap with, like we did in Fisher-Yates, we pick an element at random that is not the element being placed, i.e., we do not allow an element to be swapped with itself. One side effect of this is that no element ends up where it originally started.</p> <p>Before we talk about why this produces the intended result, let's make sure we're on the same page regarding terminology. One way to look at an array is to view it as a description of a graph where the index indicates the node and the value indicates where the edge points to. For example, if we have the list <code>0 2 3 1</code>, this can be thought of as a directed graph from its indices to its values, which is a graph with the following edges:</p> <pre><code>0 -&gt; 0
1 -&gt; 2
2 -&gt; 3
3 -&gt; 1
</code></pre> <p>Node 0 points to itself (because the value at index 0 is 0), node 1 points to node 2 (because the value at index 1 is 2), and so on. If we traverse this graph, we see that there are two cycles: <code>0 -&gt; 0 -&gt; 0 ...</code> and <code>1 -&gt; 2 -&gt; 3 -&gt; 1...</code>.</p> <p>Let's say we swap the element in position 0 with some other element. It could be any element, but let's say that we swap it with the element in position 2. Then we'll have the list <code>3 2 0 1</code>, which can be thought of as the following graph:</p> <pre><code>0 -&gt; 3
1 -&gt; 2
2 -&gt; 0
3 -&gt; 1
</code></pre> <p>If we traverse this graph, we see the cycle <code>0 -&gt; 3 -&gt; 1 -&gt; 2 -&gt; 0...</code>. This is an example of a permutation with exactly one cycle.</p> <p>If we swap two elements that belong to different cycles, we'll merge the two cycles into a single cycle. One way to see this is that, when we swap two elements in the list, we're essentially picking up the arrow-heads pointing to each element and swapping where they point (rather than the arrow-tails, which stay put). Tracing the result of this is like tracing a figure-8. For example, say we swap <code>0</code> with an arbitrary element of the other cycle, let's say element 2; we'll end up with <code>3 2 0 1</code>, whose only cycle is <code>0 -&gt; 3 -&gt; 1 -&gt; 2 -&gt; 0...</code>. Note that this operation is reversible -- if we do the same swap again, we end up with two cycles again. In general, if we swap two elements from the same cycle, we break the cycle into two separate cycles.</p> <p>If we feed a list consisting of <code>0 1 2 ...
n-1</code> to Sattolo's algorithm we'll get a permutation with exactly one cycle. Furthermore, we have the same probability of generating any permutation that has exactly one cycle. Let's look at why Sattolo's generates exactly one cycle. Afterwards, we'll figure out why it produces all possible cycles with uniform probability.</p> <p>For Sattolo's algorithm, let's say we start with the list <code>0 1 2 3 ... n-1</code>, i.e., a list with <code>n</code> cycles of length <code>1</code>. On each iteration, we do one swap. If we swap elements from two separate cycles, we'll merge the two cycles, reducing the number of cycles by 1. We'll then do <code>n-1</code> iterations, reducing the number of cycles from <code>n</code> to <code>n - (n-1) = 1</code>.</p> <p>Now let's see why it's safe to assume we always swap elements from different cycles. In each iteration of the algorithm, we swap some element with index &gt; <code>i</code> with the element at index <code>i</code> and then increment <code>i</code>. Since <code>i</code> gets incremented, the element that gets placed into index <code>i</code> can never be swapped again, i.e., each swap puts one of the two elements that was swapped into its final position, i.e., for each swap, we take two elements that were potentially swappable and render one of them unswappable.</p> <p>When we start, we have <code>n</code> cycles of length <code>1</code>, each with <code>1</code> element that's swappable. When we swap the initial element with some random element, we'll take one of the swappable elements and render it unswappable, creating a cycle of length <code>2</code> with <code>1</code> swappable element and leaving us with <code>n-2</code> other cycles, each with <code>1</code> swappable element.</p> <p>The key invariant that's maintained is that each cycle has exactly <code>1</code> swappable element. The invariant holds in the beginning when we have <code>n</code> cycles of length <code>1</code>. And as long as this is true, every time we merge two cycles of any length, we'll take the swappable element from one cycle and swap it with the swappable element from the other cycle, rendering one of the two elements unswappable and creating a longer cycle that still only has one swappable element, maintaining the invariant.</p> <p>Since we cannot swap two elements from the same cycle, we merge two cycles with every swap, reducing the number of cycles by 1 with each iteration until we've run <code>n-1</code> iterations and have exactly one cycle remaining.</p> <p>To see that we generate each cycle with equal probability, note that there's only one way to produce each output, i.e., changing any particular random choice results in a different output. In the first iteration, we randomly choose one of <code>n-1</code> placements, then <code>n-2</code>, then <code>n-3</code>, and so on, so there are <code>(n-1) * (n-2) * (n-3) * ... * 2 * 1 = (n-1)!</code> equally likely sequences of random choices, and we produce any particular output with probability <code>1 / (n-1)!</code>.</p>
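<p>As a quick empirical sanity check on both claims (this snippet is just a sketch added here, reusing the <code>sattolo</code> function above), we can run the algorithm many times on a short list and count the distinct outputs and their frequencies; for a list of length 4 we expect <code>(4-1)! = 6</code> distinct outputs, each appearing about a sixth of the time:</p> <pre><code>from collections import Counter

counts = Counter()
for _ in range(600000):
    a = list(range(4))
    sattolo(a)
    counts[tuple(a)] += 1

print(len(counts))              # 6 distinct outputs, all one-cycle permutations
for perm, c in sorted(counts.items()):
    print(perm, c / 600000.0)   # each frequency is close to 1/6
</code></pre> <p>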
If we can show that there are <code>(n-1)!</code> permutations with exactly one cycle, then we'll know that we generate every permutation with exactly one cycle with uniform probability.</p> <p>Let's say we have an arbitrary list of length <code>n</code> that has exactly one cycle and we add a single element, there are <code>n</code> ways to extend that to become a cycle of length <code>n+1</code> because there are <code>n</code> places we could add in the new element and keep the cycle, which means that the number of cycles of length <code>n+1</code>, <code>cycles(n+1)</code>, is <code>n * cycles(n)</code>.</p> <p>For example, say we have a cycle that produces the path <code>0 -&gt; 1 -&gt; 2 -&gt; 0 ...</code> and we want to add a new element, <code>3</code>. We can substitute <code>-&gt; 3 -&gt;</code> for any <code>-&gt;</code> and get a cycle of length 4 instead of length 3.</p> <p>In the base case, there's one cycle of length 2, the permutation <code>1 0</code> (the other permutation of length two, <code>0 1</code>, has two cycles of length one instead of having a cycle of length 2), so we know that <code>cycles(2) = 1</code>. If we apply <a href="https://en.wikipedia.org/wiki/Recurrence_relation">the recurrence above</a>, we get that <code>cycles(n) = (n-1)!</code>, which is exactly the number of different permutations that Sattolo's algorithm generates, which means that we generate all possible permutations with one cycle. Since we know that we generate each cycle with uniform probability, we now know that we generate all possible one-cycle permutations with uniform probability.</p> <p>An alternate way to see that there are <code>(n-1)!</code> permutations with exactly one cycle, is that we rotate each cycle around so that <code>0</code> is at the start and write it down as <code>0 -&gt; i -&gt; j -&gt; k -&gt; ...</code>. The number of these is the same as the number of permutations of elements to the right of the <code>0 -&gt;</code>, which is <code>(n-1)!</code>.</p> <h3 id="conclusion">Conclusion</h3> <p>We've looked at two algorithms that are identical, except for a two character change. These algorithms produce quite different results -- one algorithm produces a random permutation and the other produces a random permutation with exactly one cycle. I think these algorithms are neat because they're so simple, just a double for loop with a swap.</p> <p>In practice, you probably don't &quot;need&quot; to know how these algorithms work because the standard library for most modern languages will have some way of producing a random shuffle. And if you have a function that will give you a shuffle, you can produce a permutation with exactly one cycle if you don't mind a non-in-place algorithm that takes an extra pass. I'll leave that as an exercise for the reader, but if you want a hint, one way to do it parallels the &quot;alternate&quot; way to see that there are <code>(n-1)!</code> permutations with exactly one cycle.</p> <p>Although I said that you probably don't need to know this stuff, you do actually need to know it if you're going to implement a custom shuffling algorithm! That may sound obvious, but there's a long history of people implementing incorrect shuffling algorithms. 
This was common in games and on <a href="http://www.datamation.com/entdev/article.php/616221/How-We-Learned-to-Cheat-at-Online-Poker-A-Study-in-Software-Security.htm">online gambling sites in the 90s and even the early 2000s</a> and you still see the occasional mis-implemented shuffle, e.g., when <a href="http://www.robweir.com/blog/2010/02/microsoft-random-browser-ballot.html">Microsoft implemented a bogus shuffle and failed to properly randomize a browser choice poll</a>. At the time, the top Google hit for <code>javascript random array sort</code> was <a href="https://web.archive.org/web/20100102004604/http://www.javascriptkit.com/javatutors/arraysort.shtml">the incorrect algorithm that Microsoft ended up using</a>. That site has been fixed, but you can still find incorrect tutorials floating around online.</p> <h4 id="appendix-generating-a-random-derangement">Appendix: generating a random derangement</h4> <p>A permutation where no element ends up in its original position is called a derangement. When I searched for uses of Sattolo's algorithm, I found many people using Sattolo's algorithm to generate random derangements. While Sattolo's algorithm generates derangements, it only generates derangements with exactly one cycle, and there are derangements with more than one cycle (e.g., <code>3 2 1 0</code>), so it can't possibly generate random derangements with uniform probability.</p> <p>One way to generate random derangements is to generate random shuffles using Fisher-Yates and then retry until we get a derangement:</p> <pre><code>def is_derangement(a):
    # True if no element is in its original position.
    return all(x != i for i, x in enumerate(a))

def derangement(n):
    assert n != 1, &quot;can't have a derangement of length 1&quot;
    a = list(range(n))
    while not is_derangement(a):
        shuffle(a)
    return a
</code></pre> <p>This algorithm is simple, and is overwhelmingly likely to eventually return a derangement (for n != 1), but it's not immediately obvious how long we should expect this to run before it returns a result. Maybe we'll get a derangement on the first try and run <code>shuffle</code> once, or maybe it will take 100 tries and we'll have to do 100 shuffles before getting a derangement.</p> <p>To figure this out, we'll want to know the probability that a random permutation (shuffle) is a derangement. To get that, we'll want to know, given a list of length <code>n</code>, how many permutations there are and how many derangements there are.</p> <p>Since we're deep in the appendix, I'll assume that you know <a href="https://en.wikipedia.org/wiki/Permutation">the number of permutations of <code>n</code> elements is <code>n!</code></a>, know what <a href="https://en.wikipedia.org/wiki/Binomial_coefficient">binomial coefficients</a> are, and are comfortable with <a href="https://en.wikipedia.org/wiki/Taylor_series">Taylor series</a>.</p> <p>To count the number of derangements, we can start with the number of permutations, <code>n!</code>, and subtract off the permutations where an element remains in its starting position, <code>(n choose 1) * (n - 1)!</code>. That isn't quite right because this double subtracts permutations where two elements remain in their starting positions, so we'll have to add back <code>(n choose 2) * (n - 2)!</code>. That isn't quite right either because we've now overcounted permutations where three elements remain in their starting positions, so we'll have to subtract those off, <a href="https://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle">and so on and so forth</a>, resulting in <code>∑ (−1)ᵏ (n choose k)(n−k)!</code>. If we expand this out and divide by <code>n!</code> and cancel things out, we get <code>∑ (−1)ᵏ (1 / k!)</code>.</p>
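<p>To see numerically where this sum is heading (a quick sketch added here, not part of the derivation), we can evaluate the partial sums for a few values of <code>n</code>:</p> <pre><code>import math

# Partial sums of the alternating series: sum of (-1)^k / k! for k = 0..n
for n in (1, 2, 3, 5, 10):
    s = sum((-1.0) ** k / math.factorial(k) for k in range(n + 1))
    print(n, s)   # approaches 0.36787944... as n grows
</code></pre> <p>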
If we look at the limit as the number of elements goes to infinity, this looks just like <a href="https://en.wikipedia.org/wiki/Taylor_series">the Taylor series</a> for <code>e^x</code> where <code>x = -1</code>, i.e., <code>1/e</code>, i.e., in the limit, we expect that the fraction of permutations that are derangements is <code>1/e</code>, i.e., we expect to have to do <code>e</code> times as many swaps to generate a derangement as we do to generate a random permutation. Like many alternating series, this series converges quickly. It gets within 7 significant figures of <code>e</code> when <code>k = 10</code>!</p> <p>One silly thing about our algorithm is that, if we place the first element in the first location, we already know that we don't have a derangement, but we continue placing elements until we've created an entire permutation. If we reject illegal placements, we can do even better than a factor of <code>e</code> overhead. It's also possible to come up with <a href="http://mathforum.org/library/drmath/view/61957.html">a non-rejection based algorithm</a>, but I really enjoy the naive rejection based algorithm because I find it delightful when <a href="https://www.cs.princeton.edu/courses/archive/fall13/cos521/lecnotes/lec2final.pdf">basic randomized algorithms that consist of &quot;keep trying again&quot; work well</a>.</p> <h4 id="appendix-wikipedia-s-explanation-of-sattolo-s-algorithm">Appendix: wikipedia's explanation of Sattolo's algorithm</h4> <p>I wrote this explanation because I found <a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle">the explanation in Wikipedia</a> relatively hard to follow, but if you find the explanation above difficult to understand, maybe you'll prefer wikipedia's version:</p> <blockquote> <p>The fact that Sattolo's algorithm always produces a cycle of length n can be shown by induction. Assume by induction that after the initial iteration of the loop, the remaining iterations permute the first n - 1 elements according to a cycle of length n - 1 (those remaining iterations are just Sattolo's algorithm applied to those first n - 1 elements). This means that tracing the initial element to its new position p, then the element originally at position p to its new position, and so forth, one only gets back to the initial position after having visited all other positions. Suppose the initial iteration swapped the final element with the one at (non-final) position k, and that the subsequent permutation of first n - 1 elements then moved it to position l; we compare the permutation π of all n elements with that remaining permutation σ of the first n - 1 elements. Tracing successive positions as just mentioned, there is no difference between σ and π until arriving at position k. But then, under π the element originally at position k is moved to the final position rather than to position l, and the element originally at the final position is moved to position l. From there on, the sequence of positions for π again follows the sequence for σ, and all positions will have been visited before getting back to the initial position, as required.</p> <p>As for the equal probability of the permutations, it suffices to observe that the modified algorithm involves (n-1)! distinct possible sequences of random numbers produced, each of which clearly produces a different permutation, and each of which occurs--assuming the random number source is unbiased--with equal probability. The (n-1)! 
different permutations so produced precisely exhaust the set of cycles of length n: each such cycle has a unique cycle notation with the value n in the final position, which allows for (n-1)! permutations of the remaining values to fill the other positions of the cycle notation</p> </blockquote> <p><em>Thanks to Mathieu Guay-Paquet, Leah Hanson, Rudi Chen, Kamal Marhubi, Michael Robert Arntzenius, Heath Borders, Shreevatsa R, @chozu@fedi.absturztau.be, and David Turner for comments/corrections/discussion.</em></p> <p><link rel="prefetch" href=""> <link rel="prefetch" href="about/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:P"><p><code>a[0]</code> is placed on the first iteration of the loop. Assuming <code>randrange</code> generates integers with uniform probability in the appropriate range, the original <code>a[0]</code> has <code>1/n</code> probability of being swapped with any element (including itself), so the resultant <code>a[0]</code> has a 1/n chance of being any element from the original <code>a</code>, which is what we want.</p> <p><code>a[1]</code> is placed on the second iteration of the loop. At this point, <code>a[0]</code> is some element from the array before it was mutated. Let's call the unmutated array <code>original</code>. <code>a[0]</code> is <code>original[k]</code>, for some <code>k</code>. For any particular value of <code>k</code>, it contains <code>original[k]</code> with probability <code>1/n</code>. We then swap <code>a[1]</code> with some element from the range <code>[1, n-1]</code>.</p> <p>If we want to figure out the probability that <code>a[1]</code> is some particular element from <code>original</code>, we might think of this as follows: <code>a[0]</code> is <code>original[k_0]</code> for some <code>k_0</code>. <code>a[1]</code> then becomes <code>original[k_1]</code> for some <code>k_1</code> where <code>k_1 != k_0</code>. Since <code>k_0</code> was chosen uniformly at random, if we integrate over all <code>k_0</code>, <code>k_1</code> is also uniformly random.</p> <p>Another way to look at this is that it's arbitrary that we place <code>a[0]</code> and choose <code>k_0</code> before we place <code>a[1]</code> and choose <code>k_1</code>. We could just have easily placed <code>a[1]</code> and chosen <code>k_1</code> first so, over all possible choices, the choice of <code>k_0</code> cannot bias the choice of <code>k_1</code>.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> </ol> </div> Terminal latency term-latency/ Tue, 18 Jul 2017 00:00:00 +0000 term-latency/ <p>There’s <a href="https://www.youtube.com/watch?v=vOvQCPLkPt4">a great MSR demo from 2012 that shows the effect of latency on the experience of using a tablet</a>. If you don’t want to watch the three minute video, they basically created a device which could simulate arbitrary latencies down to a fraction of a millisecond. At 100ms (1/10th of a second), which is typical of consumer tablets, the experience is terrible. At 10ms (1/100th of a second), the latency is noticeable, but the experience is ok, and at &lt; 1ms the experience is great, as good as pen and paper. If you want to see a mini version of this for yourself, you can try a random Android tablet with a stylus vs. the current generation iPad Pro with the Apple stylus. 
The Apple device has well above 10ms end-to-end latency, but the difference is still quite dramatic -- it’s enough that I’ll actually use the new iPad Pro to take notes or draw diagrams, whereas I find Android tablets unbearable as a pen-and-paper replacement.</p> <p>You can also see something similar if you try VR headsets with different latencies. <a href="http://oculusrift-blog.com/john-carmacks-message-of-latency/682/">20ms feels fine, 50ms feels laggy, and 150ms feels unbearable</a>.</p> <p>Curiously, I rarely hear complaints about keyboard and mouse input being slow. One reason might be that keyboard and mouse input are quick and that inputs are reflected nearly instantaneously, but I don’t think that’s true. People often tell me that’s true, but I think it’s just the opposite. The idea that computers respond quickly to input, so quickly that humans can’t notice the latency, is the most common performance-related fallacy I hear from professional programmers.</p> <p>When people measure actual end-to-end latency for games on normal computer setups, they usually find latencies in the <a href="http://renderingpipeline.com/2013/09/measuring-input-latency/">100ms</a> <a href="https://www.youtube.com/watch?v=GxaEJY-zd_4&amp;index=5&amp;list=PLfOoCUS0PSkXVGjhB63KMDTOT5sJ0vWy8&amp;t=187s">range</a>.</p> <p>If we look at <a href="http://renderingpipeline.com/2013/09/measuring-input-latency/">Robert Menzel’s breakdown of the end-to-end pipeline for a game</a>, it’s not hard to see why we expect to see 100+ ms of latency:</p> <ul> <li>~2 msec (mouse)</li> <li>8 msec (average time we wait for the input to be processed by the game)</li> <li>16.6 msec (game simulation)</li> <li>16.6 msec (rendering code)</li> <li>16.6 msec (GPU is rendering the previous frame, current frame is cached)</li> <li>16.6 msec (GPU rendering)</li> <li>8 msec (average for missing the vsync)</li> <li>16.6 msec (frame caching inside of the display)</li> <li>16.6 msec (redrawing the frame)</li> <li>5 msec (pixel switching)</li> </ul> <p>Note that this assumes a gaming mouse and a pretty decent LCD; it’s common to see substantially slower latency for the mouse and for pixel switching.</p> <p>It’s possible to tune things to get into the 40ms range, but the vast majority of users don’t do that kind of tuning, and even if they do, that’s still quite far from the 10ms to 20ms range, where tablets and VR start to feel really “right”.</p> <p>Keypress-to-display measurements are mostly done in games because gamers care more about latency than most people, but I don’t think that most applications are all that different from games in terms of latency. While games often do much more work per frame than “typical” applications, they’re also much better optimized than “typical” applications. Menzel budgets 33ms to the game, half for game logic and half for rendering. How much time do non-game applications take? Pavel Fatin measured this for text editors and found latencies ranging from <a href="https://pavelfatin.com/typing-with-pleasure/">a few milliseconds to hundreds of milliseconds</a>. He did this <a href="https://github.com/pavelfatin/typometer">with an app he wrote that we can use to measure the latency of other applications</a>, which uses <a href="https://docs.oracle.com/javase/7/docs/api/java/awt/Robot.html">java.awt.Robot</a> to generate keypresses and do screen captures.</p> <p>Personally, I’d like to see the latency of different terminals and shells for a couple of reasons.
First, I spend most of my time in a terminal and usually do editing in a terminal, so the latency I see is at least the latency of the terminal. Second, the most common terminal benchmark I see cited (by at least two orders of magnitude) is the rate at which a terminal can display output, often measured by running <code>cat</code> on a large file. This is pretty much as useless a benchmark as I can think of. I can’t recall the last task I did which was limited by the speed at which I can <code>cat</code> a file to <code>stdout</code> on my terminal (well, unless I’m using eshell in emacs), nor can I think of any task for which that sub-measurement is useful. The closest thing that I care about is the speed at which I can <code>^C</code> a command when I’ve accidentally output too much to <code>stdout</code>, but as we’ll see when we look at actual measurements, a terminal’s ability to absorb a lot of input to <code>stdout</code> is only weakly related to its responsiveness to <code>^C</code>. The speed at which I can scroll up or down an entire page sounds related, but in actual measurements the two are not highly correlated (e.g., emacs-eshell is quick at scrolling but extremely slow at sinking <code>stdout</code>). Another thing I care about is latency, but knowing that a particular terminal has high <code>stdout</code> throughput tells me little to nothing about its latency.</p> <p>Let’s look at some different terminals to see if any terminals add enough latency that we’d expect the difference to be noticeable. If we measure the latency from keypress to internal screen capture on my laptop, we see the following latencies for different terminals:</p> <p><img src="images/term-latency/idle-terminal-latency.svg" alt="Plot of terminal tail latency"> <img src="images/term-latency/loaded-terminal-latency.svg" alt="Plot of terminal tail latency"></p> <p>These graphs show the distribution of latencies for each terminal. The y-axis has the latency in milliseconds. The x-axis is the percentile (e.g., 50 represents the 50%-ile keypress, i.e., the median keypress). Measurements are on macOS unless otherwise stated. The graph on the left is when the machine is idle, and the graph on the right is under load. If we just look at median latencies, some setups don’t look too bad -- terminal.app and emacs-eshell are at roughly 5ms unloaded, small enough that many people wouldn’t notice. But most terminals (st, alacritty, hyper, and iterm2) are in the range <a href="https://pdfs.semanticscholar.org/386a/15fd85c162b8e4ebb6023acdce9df2bd43ee.pdf">where you might expect people</a> to <a href="http://www.tactuallabs.com/papers/howMuchFasterIsFastEnoughCHI15.pdf">notice the additional latency</a> even when the machine is idle. If we look at the tail when the machine is idle, say the 99.9%-ile latency, every terminal gets into the range where the additional latency ought to be perceptible, according to studies on user interaction. For reference, the internally generated keypress to GPU memory trip for some terminals is slower than <a href="http://ipnetwork.bgtmo.ip.att.net/pws/network_delay.html">the time it takes to send a packet from Boston to Seattle <em>and back</em></a>, about 70ms.</p> <p>All measurements were done with input only happening on one terminal at a time, with full battery and running off of A/C power.</p>
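<p>For readers who haven't worked with percentile plots before, here's a small sketch (added here, not the code used for these measurements) of how raw per-keypress latency samples turn into percentile numbers like the ones plotted above; the sample values are made up purely for illustration:</p> <pre><code>def percentile(samples, p):
    # Nearest-rank percentile; p is between 0 and 100.
    s = sorted(samples)
    k = int(round(p / 100.0 * (len(s) - 1)))
    return s[k]

latencies_ms = [5.2, 4.9, 6.1, 5.5, 48.0]   # made-up per-keypress latencies
for p in (50, 90, 99.9):
    print(p, percentile(latencies_ms, p))
</code></pre> <p>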
The loaded measurements were done while compiling Rust (as before, with full battery and running off of A/C power, and in order to make the measurements reproducible, each measurement started 15s after a clean build of Rust after downloading all dependencies, with enough time between runs to avoid thermal throttling interference across runs).</p> <p>If we look at median loaded latencies, other than emacs-term, most terminals don’t do much worse than at idle. But as we look at tail measurements, like 90%-ile or 99.9%-ile measurements, every terminal gets much slower. Switching between macOS and Linux makes some difference, but the difference is different for different terminals.</p> <p>These measurements aren't anywhere near the worst case (if we run off of battery when the battery is low, and wait 10 minutes into the compile in order to exacerbate thermal throttling, it’s easy to see latencies that are multiple hundreds of ms) but even so, every terminal has tail latency that should be observable. Also, recall that this is only a fraction of the total end-to-end latency.</p> <p>Why don’t people complain about keyboard-to-display latency the way they complain about stylus-to-display latency or VR latency? My theory is that, for both VR and tablets, people have a lot of experience with a much lower latency application. For tablets, the “application” is pen-and-paper, and for VR, the “application” is turning your head without a VR headset on. But input-to-display latency is so bad for every application that most people just expect terrible latency.</p> <p>An alternate theory might be that keyboard and mouse input are fundamentally different from tablet input in a way that makes latency less noticeable. Even without any data, I’d find that implausible because, when I access a remote terminal in a way that adds tens of milliseconds of extra latency, I find typing to be noticeably laggy.
And it turns out that when extra latency is A/B tested, <a href="http://forums.blurbusters.com/viewtopic.php?f=10&amp;t=1134">people can and do notice latency in the range we’re discussing here</a>.</p> <p>Just so we can compare the most commonly used benchmark (throughput of stdout) to latency, let’s measure how quickly different terminals can sink input on stdout: <style>table {border-collapse: collapse;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <div style="overflow-x:auto;"></p> <table> <thead> <tr> <th>terminal</th> <th>stdout<br>(MB/s)</th> <th>idle50<br>(ms)</th> <th>load50<br>(ms)</th> <th>idle99.9<br>(ms)</th> <th>load99.9<br>(ms)</th> <th>mem<br>(MB)</th> <th>^C</th> </tr> </thead> <tbody> <tr> <td>alacritty</td> <td>39</td> <td>31</td> <td>28</td> <td>36</td> <td>56</td> <td>18</td> <td>ok</td> </tr> <tr> <td>terminal.app</td> <td>20</td> <td>6</td> <td>13</td> <td>25</td> <td>30</td> <td>45</td> <td>ok</td> </tr> <tr> <td>st</td> <td>14</td> <td>25</td> <td>27</td> <td>63</td> <td>111</td> <td>2</td> <td>ok</td> </tr> <tr> <td>alacritty tmux</td> <td>14</td> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr> <tr> <td>terminal.app tmux</td> <td>13</td> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr> <tr> <td>iterm2</td> <td>11</td> <td>44</td> <td>45</td> <td>60</td> <td>81</td> <td>24</td> <td>ok</td> </tr> <tr> <td>hyper</td> <td>11</td> <td>32</td> <td>31</td> <td>49</td> <td>53</td> <td>178</td> <td>fail</td> </tr> <tr> <td>emacs-eshell</td> <td>0.05</td> <td>5</td> <td>13</td> <td>17</td> <td>32</td> <td>30</td> <td>fail</td> </tr> <tr> <td>emacs-term</td> <td>0.03</td> <td>13</td> <td>30</td> <td>28</td> <td>49</td> <td>30</td> <td>ok</td> </tr> </tbody> </table> <p></div> The relationship between the rate that a terminal can sink <code>stdout</code> and its latency is non-obvious. For the matter, the relationship between the rate at which a terminal can sink <code>stdout</code> and how fast it looks is non-obvious. During this test, terminal.app looked very slow. The text that scrolls by jumps a lot, as if the screen is rarely updating. Also, hyper and emacs-term both had problems with this test. Emacs-term can’t really keep up with the output and it takes a few seconds for the display to finish updating after the test is complete (the status bar that shows how many lines have been output appears to be up to date, so it finishes incrementing before the test finishes). Hyper falls further behind and pretty much doesn’t update the screen after a flickering a couple of times. The <code>Hyper Helper</code> process gets pegged at 100% CPU for about two minutes and the terminal is totally unresponsive for that entire time.</p> <p>Alacritty was tested with tmux because alacritty doesn’t support scrolling back up, and the docs indicate that you should use tmux if you want to be able to scroll up. Just to have another reference, terminal.app was also tested with tmux. For most terminals, tmux doesn’t appear to reduce <code>stdout</code> speed, but alacritty and terminal.app are fast enough that they’re actually limited by the speed of tmux.</p> <p>Emacs-eshell is technically not a terminal, but I also tested eshell because it can be used as a terminal alternative for some use cases. Emacs, with both eshell and term, is actually slow enough that I care about the speed at which it can sink <code>stdout</code>. 
When I’ve used eshell or term in the past, I find that I sometimes have to wait for a few thousand lines of text to scroll by if I run a command with verbose logging to <code>stdout</code> or <code>stderr</code>. Since that happens very rarely, it’s not really a big deal to me unless it’s so slow that I end up waiting half a second or a second when it happens, and no other terminal is slow enough for that to matter.</p> <p>Conversely, I type individual characters often enough that I’ll notice tail latency. Say I type at 120wpm and that results in 600 characters per minute, or 10 characters per second of input. Then I’d expect to see the 99.9% tail (1 in 1000) every 100 seconds!</p> <p>Anyway, the <code>cat</code> “benchmark” that I care about more is whether or not I can <code>^C</code> a process when I’ve accidentally run a command that outputs millions of lines to the screen instead of thousands of lines. For that benchmark, every terminal is fine except for hyper and emacs-eshell, both of which hung for at least ten minutes (I killed each process after ten minutes, rather than waiting for the terminal to catch up).</p> <p>Memory usage at startup is also included in the table for reference because that's the other measurement I see people benchmark terminals with. While I think that it's a bit absurd that terminals can use 40MB at startup, even the three year old hand-me-down laptop I'm using has 16GB of RAM, so squeezing that 40MB down to 2MB doesn't have any appreciable effect on user experience. Heck, even the $300 chromebook we recently got has 16GB of RAM.</p> <h3 id="conclusion">Conclusion</h3> <p>Most terminals have enough latency that the user experience could be improved if the terminals concentrated more on latency and less on other features or other aspects of performance. However, when I search for terminal benchmarks, I find that terminal authors, if they benchmark anything, <a href="https://github.com/jwilm/alacritty/issues/289">benchmark the speed of</a> <a href="https://github.com/jwilm/alacritty/issues/205">sinking stdout</a> or memory usage at startup. This is unfortunate because most “low performance” terminals can already sink <code>stdout</code> many orders of magnitude faster than humans can keep up with, so further optimizing <code>stdout</code> throughput has a relatively small impact on actual user experience for most users. Likewise for reducing memory usage when an idle terminal uses 0.01% of the memory on my old and now quite low-end laptop.</p> <p>If you work on a terminal, perhaps consider spending relatively more effort optimizing latency and interactivity (e.g., responsiveness to <code>^C</code>) and relatively less optimizing throughput and idle memory usage.</p> <p><em>Update: In response to this post, <a href="https://github.com/jwilm/alacritty/issues/673">the author of alacritty explains where alacritty's latency comes from and describes how alacritty could reduce its latency</a>.</em></p> <h3 id="appendix-negative-results">Appendix: negative results</h3> <p>Tmux and latency: I tried tmux and various terminals and found that the differences were within the range of measurement noise.</p> <p>Shells and latency: I tried a number of shells and found that, even in the quickest terminal, the difference between shells was within the range of measurement noise.
Powershell was somewhat problematic to test with the setup I was using because it doesn’t handle colors correctly (the first character typed shows up with the color specified by the terminal, but other characters are yellow regardless of setting, <a href="https://github.com/lzybkr/PSReadLine/issues/472">which appears to be an open issue</a>), which confused the image recognition setup I used. <a href="https://www.youtube.com/watch?v=cz5Hczlzvio">Powershell also doesn’t consistently put the cursor where it should be</a> -- it jumps around randomly within a line, which also confused the image recognition setup I used. However, despite its other problems, powershell had comparable performance to other shells.</p> <p>Shells and stdout throughput: As above, the speed difference between different shells was within the range of measurement noise.</p> <p>Single-line vs. multiline text and throughput: Although some text editors bog down with extremely long lines, throughput was similar when I shoved a large file into a terminal whether the file was all one line or was line broken every 80 characters.</p> <p>Head of line blocking / coordinated omission: I ran these tests with input at a rate of 10.3 characters per second. But it turns out this doesn't matter much at input rates that humans are capable of, and the latencies are quite similar to doing input once every 10.3 seconds. It's possible to overwhelm a terminal, and hyper is the first to start falling over at high input rates, but the speed necessary to make the tail latency worse is beyond the rate at which any human I know of can type.</p> <h3 id="appendix-experimental-setup">Appendix: experimental setup</h3> <p>All tests were done on a dual core 2.6GHz 13” Mid-2014 Macbook pro. The machine has 16GB of RAM and a 2560x1600 screen. The OS X version was 10.12.5. Some tests were done in Linux (Lubuntu 16.04) to get a comparison between macOS and Linux. 10k keypresses were used for each latency measurement.</p> <p>Latency measurements were done with the <code>.</code> key and throughput was done with default <code>base32</code> output, which is all plain ASCII text. George King notes that different kinds of text can change output speed:</p> <blockquote> <p>I’ve noticed that Terminal.app slows dramatically when outputting non-latin unicode ranges. I’m aware of three things that might cause this: having to load different font pages, and having to parse code points outside of the BMP, and wide characters.</p> <p>The first probably boils down to a very complicated mix of lazy loading of font glyphs, font fallback calculations, and caching of the glyph pages or however that works.</p> <p>The second is a bit speculative, but I would bet that Terminal.app uses Cocoa’s UTF16-based NSString, which almost certainly hits a slow path when code points are above the BMP due to surrogate pairs.</p> </blockquote> <p>Terminals were fullscreened before running tests. This affects test results, and resizing the terminal windows can and does significantly change performance (e.g., it’s possible to get hyper to be slower than iterm2 by changing the window size while holding everything else constant). st on macOS was running as an X client under XQuartz. To see if XQuartz is inherently slow, I tried <a href="https://github.com/doy/runes/">runes</a>, another &quot;native&quot; Linux terminal that uses XQuartz; runes had much better tail latency than st and iterm2.</p> <p>The “idle” latency tests were done on a freshly rebooted machine.
All terminals were running, but input was only fed to one terminal at a time.</p> <p>The “loaded” latency tests were done with rust compiling in the background, 15s after the compilation started.</p> <p>Terminal bandwidth tests were done by creating a large, pseudo-random text file with</p> <pre><code>timeout 64 sh -c 'cat /dev/urandom | base32 &gt; junk.txt' </code></pre> <p>and then running</p> <pre><code>timeout 8 sh -c 'cat junk.txt | tee junk.term_name' </code></pre> <p>Terminator and urxvt weren’t tested because they weren’t completely trivial to install on mac and I didn’t want to futz around to make them work. Terminator was easy to build from source, but it hung on startup and didn’t get to a shell prompt. Urxvt installed through brew, but one of its dependencies (also installed through brew) was the wrong version, which prevented it from starting.</p> <p><small> Thanks to Kamal Marhubi, Leah Hanson, Wesley Aptekar-Cassels, David Albert, Vaibhav Sagar, Indradhanush Gupta, Rudi Chen, Laura Lindzey, Ahmad Jarara, George King, Tim Dierks, Nikith Naide, Veit Heller, and Nick Bergson-Shilcock for comments/corrections/discussion. </small></p> The widely cited studies on mouse vs. keyboard efficiency are completely bogus keyboard-v-mouse/ Tue, 13 Jun 2017 00:00:00 +0000 keyboard-v-mouse/ <p>Which is faster, keyboard or mouse? A large number of programmers believe that the keyboard is faster for all (programming-related) tasks. However, there are a few widely cited webpages on AskTog which claim that Apple studies show that using the mouse is faster than using the keyboard for everything and that people who think that using the keyboard is faster are just deluding themselves. This might sound extreme, but, just for example, one page says that the author has “never seen [the keyboard] outperform the mouse”.</p> <p>But it can’t be the case that the mouse is faster for everything — almost no one is faster at clicking on an on-screen keyboard with a mouse than typing at a physical keyboard. Conversely, there are tasks for which mice are much better suited than keyboards (e.g., aiming in FPS games). For someone without an agenda, the question shouldn’t be, which is faster at all tasks, but which tasks are faster with a keyboard, which are faster with a mouse, and which are faster when both are used?</p> <p>You might ask if any of this matters. It depends! One of the best programmers I know is a hunt-and-peck typist, so it's clearly possible to be a great programmer without having particularly quick input speed. But I'm in the middle of an easy data munging task where I'm limited by the speed at which I can type in a large amount of boring code. If I were quicker, this task would be quicker, and there are tasks that I don't do that I might do. I can type at &gt; 100 wpm, which isn't bad, but I can talk at &gt; 400 wpm and I can think much faster than I can talk. I'm often rate limited even when talking; typing is much worse and the half-a-second here and one-second there I spend on navigation certainly doesn't help. When I first got started in tech, I had a mundane test/verification/QA role where my primary job was to triage test failures. Even before I started automating tasks, I could triage nearly twice as many bugs per day as other folks in the same role because I took being efficient at basic navigation tasks seriously.
Nowadays, my jobs aren't 90% rote anymore, but my guess is that about a third of the time I spend in front of a computer is spent on mindless tasks that are rate-limited by my input and navigation speed. If I could get faster at those mundane tasks and have to spend less time on them and more time doing things that are fun, that would be great.</p> <p>Anyway, to start, let’s look at the cited studies to see where the mouse is really faster. Most references on the web, when followed all the way back, point to AskTog, a site by <a href="https://en.wikipedia.org/wiki/Bruce_Tognazzini">Bruce Tognazzini</a>, who describes himself as a &quot;recognized leader in human/computer interaction design&quot;.</p> <p><a href="http://www.asktog.com/TOI/toi06KeyboardVMouse1.html">The most cited AskTog page on the topic claims that they've spent $50M on R&amp;D and done all kinds of studies</a>; the page claims that, among other things, the $50M in R&amp;D showed “Test subjects consistently report that keyboarding is faster than mousing” and “The stopwatch consistently proves mousing is faster than keyboarding.” The claim is that this both proves that the mouse is faster than the keyboard, and explains why programmers think the keyboard is faster than the mouse even though it’s slower. However, the result is unreproducible because “Tog” not only doesn’t cite the details of the experiments, Tog doesn’t even describe the experiments and just makes a blanket claim.</p> <p>The <a href="http://www.asktog.com/TOI/toi22KeyboardVMouse2.html">second widely cited AskTog page</a> is in response to a response to the previous page, and it simply repeats that the first page showed that keyboard shortcuts are slower. While there’s a lot of sarcasm, like “Perhaps we have all been misled these years. Perhaps the independent studies that show over and over again that Macintosh users are more productive, can learn quicker, buy more software packages, etc., etc., etc., are somehow all flawed. Perhaps....”, no actual results are cited, as before. There is, however, a pseudo-scientific explanation of why the mouse is faster than the keyboard:</p> <blockquote> <p>Command Keys Aren’t Faster. As you know from my August column, it takes just as long to decide upon a command key as it does to access the mouse. The difference is that the command-key decision is a high-level cognitive function of which there is no long-term memory generated. Therefore, subjectively, keys seem faster when in fact they usually take just as long to use.</p> <p>Since mouse acquisition is a low-level cognitive function, the user need not abandon cognitive process on the primary task during the acquisition period. Therefore, the mouse acquirer achieves greater productivity.</p> </blockquote> <p>One question this raises is, why should typing on the keyboard be any different from using command keys? There certainly are people who aren’t fluent at touch typing who have to think about which key they’re going to press when they type. Those people are very slow typists, perhaps even slower than someone who’s quick at using the mouse to type via an on-screen keyboard. But there are also people who are fluent with the keyboard and can type without consciously thinking about which keys they’re going to press. The implicit claim here is that it’s not possible to be fluent with command keys in the same way it’s possible to be fluent with the keyboard for typing.
It’s possible that’s true, but I find the claim to be highly implausible, both in principle, and from having observed people who certainly seem to be fluent with command keys, and the claim has no supporting evidence.</p> <p><a href="http://www.asktog.com/SunWorldColumns/S02KeyboardVMouse3.html">The third widely cited AskTog page cites a single experiment</a>, where the author typed a paragraph and then had to replace every “e” with a “|”, either using cursor keys or the mouse. The author found that the average time for using cursor keys was 99.43 seconds and the average time for the mouse was 50.22 seconds. No information about the length of the paragraph or the number of “e”s was given. The third page was in response to a user who cited specific editing examples where they found that they were faster with a keyboard than with a mouse.</p> <p>My experience with benchmarking is that the vast majority of microbenchmarks have wrong or misleading results because they’re difficult to set up properly, and even when set up properly, understanding how the microbenchmark results relate to real-world results requires a deep understanding of the domain. As a result, I’m deeply skeptical of broad claims that come from microbenchmarks unless the author has a demonstrated, deep understanding of benchmarking their particular domain, and even then I’ll ask why they believe their result generalizes. The opinion that microbenchmarks are very difficult to interpret properly is <a href="https://twitter.com/shipilev/status/709982588673388544">widely shared among people who understand benchmarking</a>.</p> <p>The <code>e -&gt; |</code> replacement task described is not only a microbenchmark, it's a bizarrely artificial microbenchmark.</p> <p>Based on the times given in the result, the task was either for very naive users, or disallowed any kind of search and replace functionality. This particular AskTog column is in response to a programmer who mentioned editing tasks, so the microbenchmark is meaningless unless that programmer is trapped in an experiment where they’re not allowed to use their editor’s basic functionality. Moreover, the replacement task itself is unrealistic — how often do people replace <code>e</code> with <code>|</code>?</p> <p>I timed this task with the bizarre no-search-and-replace restriction removed and got the following results:</p> <ul> <li>Keyboard shortcut: 1.26s</li> <li>M-x, “replace-string” (instead of using mapped keyboard shortcut): 2.8s</li> <li>Navigate to search and replace with mouse: 5.39s</li> </ul> <p>The first result was from using a keyboard shortcut. The second result is something I might do if I were in someone else’s emacs setup, which has different keyboard shortcuts mapped; emacs lets you run a command by hitting “M-x” and typing the entire name of the command. That’s much slower than using a keyboard shortcut directly, but still faster than using the mouse (at least for me, here). Does this mean that keyboards are great and mice are terrible? No, the result is nearly totally meaningless because I spend almost none of my time doing single-character search-and-replace, making the speed of single-character search-and-replace irrelevant.</p> <p>Also, since I’m used to using the keyboard, the mouse speed here is probably unusually slow. That’s doubly true here because my normal editor setup (<code>emacs -nw</code>) doesn’t allow for mouse usage, so I ended up using an unfamiliar editor, <code>TextEdit</code>, for the mouse test.
I did each task once in order to avoid “practicing” the exact task, which could unrealistically make the keyboard-shortcut version nearly instantaneous because it’s easy to hit a practiced sequence of keys very quickly. However, this meant that I was using an unfamiliar mouse in an unfamiliar set of menus for the mouse. Furthermore, like many people who’ve played video games in the distant past, I’m used to having “<a href="https://www.google.com/search?q=mouse+acceleration">mouse acceleration</a>” turned off, but the Mac has this on by default and I didn’t go through the rigmarole necessary to disable mouse acceleration. Additionally, the recording program I used (QuickTime) made the entire machine laggy, which probably affects mousing speed more than keyboard speed, and the menu setup for the program I happened to use forced me to navigate through two levels of menus.</p> <p>That being said, despite not being used to the mouse, if I want to find a microbenchmark where I’m faster with the mouse than with the keyboard, that’s easy: let me try selecting a block of text that’s on the screen but not near my cursor:</p> <ul> <li>Keyboard: 1.8s</li> <li>Mouse: 0.7s</li> </ul> <p>I tend to do selection of blocks in emacs by searching for something at the start of the block, setting a mark, and then searching for something at the end of the block. I typically type three characters to make sure that I get a unique chunk of text (and I’ll type more if it’s text where I don’t think three characters will cut it). This makes the selection task somewhat slower than the replacement task because the replacement task used single characters and this task used multiple characters.</p> <p>The mouse is so much better suited for selecting a block of text that even with an unfamiliar mouse setup where I end up having to make a correction instead of being able to do the selection in one motion, the mouse is over twice as fast. But, if I wanted to select something that was off screen and the selection was so large that it wouldn’t fit on one screen, the keyboard time wouldn’t change and the mouse time would get much slower, making the keyboard faster.</p> <p>In addition to doing the measurements, I also (informally) polled people to ask if they thought the keyboard or the mouse would be faster for specific tasks. Both search-and-replace and select-text are tasks where the result was obvious to most people. But not all tasks are obvious; scrolling was one where people didn’t have strong opinions one way or another. Let’s look at scrolling, which is a task both the keyboard and the mouse are well suited for. To have something concrete, let’s look at scrolling down 4 pages:</p> <ul> <li>Keyboard: 0.49s</li> <li>Mouse: 0.57s</li> </ul> <p>There’s some difference, and I suspect that if I repeated the experiment enough times I could get a statistically significant result, but the difference is small enough that it isn’t of practical significance.</p> <p>Contra Tog’s result, which was that everyone believes the keyboard was faster even though the mouse is faster, I find that people are pretty good at estimating which device is faster for which tasks and also at estimating when both devices will give a similar result. One possible reason is that I’m polling programmers, and in particular, programmers at <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>, who are probably a different population than whoever Tog might’ve studied in his studies.
He was in a group that was looking at how to design the UI for a general purpose computer in the 80s, when it actually would have been unreasonable to focus on studying people like the ones I polled, many of whom grew up using computers and then chose a career where they use computers all day. The equivalent population would’ve had to start using computers in the 60s or even earlier, but even if they had, input devices were quite different (the ball mouse wasn’t invented until 1972, and it certainly wasn’t in wide use the moment it was invented). There’s nothing wrong with studying populations who aren’t relatively expert at using computer input devices, but there is something wrong with generalizing those results to people who are relatively expert.</p> <p>Unlike claims by either keyboard or mouse advocates, when I do experiments myself, the results are mixed. Some tasks are substantially faster if I use the keyboard and some are substantially faster if I use the mouse. Moreover, most of the results are easily predictable (when the results are similar, the prediction is that it’s hard to say which is faster). If we look at the most widely cited, authoritative, results on the web, we find that they make very strong claims that the mouse is much faster than the keyboard but back up the claim with nothing but a single, bogus, experiment. It’s possible that some of the vaunted $50M in R&amp;D went into valid experiments, but those experiments, if they exist, aren’t cited.</p> <p>I spent some time reviewing the literature on the subject, but couldn’t find anything conclusive. Rather than do a point-by-point summary of each study (<a href="empirical-pl/">like I did here for another controversial topic</a>), I’ll mention the high-level issues that make the studies irrelevant to me. All studies I could find had at least one of the issues listed below; if you have a link to a study that isn’t irrelevant for one of the following reasons, I’d love to hear about it!</p> <ol> <li>Age of study: it’s unclear how a study on interacting with computers from the mid-80s transfers to how people interact with computers today. Even ignoring differences in editing programs, there are large differences in the interface. Mice are more precise and a decent modern optical mouse can be moved as fast as a human can move it without the tracking becoming erratic, something that isn’t true of any mouse I’ve tried from the 80s and was only true of high quality mice from the 90s when the balls were recently cleaned and the mouse was on a decent quality mousepad. Keyboards haven’t improved as much, but even so, I can type substantially faster on a modern, low-travel, keyboard than on any keyboard I’ve tried from the 80s.</li> <li>Narrow microbenchmarking: not all of these are as irrelevant as the <code>e -&gt; |</code> without search and replace task, but even in the case of tasks that aren’t obviously irrelevant, it’s not clear what the impact of the result is on actual work I might do.</li> <li>Not keyboard vs. mouse: a tiny fraction of published studies are on keyboard vs. mouse interaction.
When a study is on device interaction, it’s often about some new kind of device or a new interaction model.</li> <li>Vague description: a lot of studies will say something like they found a 7.8% improvement, with results being significant with p &lt; 0.005, without providing enough information to tell if the results are actually significant or merely statistically significant (recall that the practically insignificant scrolling result was a 0.08s difference, which could also be reported as a 16.3% improvement).</li> <li>Unskilled users: in one, typical, paper, they note that it can take users as long as two seconds to move the mouse from one side of the screen to a scrollbar on the other side of the screen. While there’s something to be said for doing studies on unskilled users in order to figure out what sorts of interfaces are easiest for users who have the hardest time, a study on users who take 2 seconds to get their mouse onto the scrollbar doesn’t appear to be relevant to my user experience. When I timed this for myself, it took 0.21s to get to the scrollbar from the other side of the screen and scroll a short distance, despite using an unfamiliar mouse with different sensitivity than I’m used to and running a recording program which made mousing more difficult than usual.</li> <li>Seemingly unreasonable results: some studies claim to show large improvements in overall productivity when switching from one type of device to another (e.g., a 20% total productivity gain from switching types of mice).</li> </ol> <h4 id="conclusion">Conclusion</h4> <p>It’s entirely possible that the mysterious studies Tog’s org spent $50M on prove that the mouse is faster than the keyboard for all tasks other than raw text input, but there doesn’t appear to be enough information to tell what the actual studies were. There are many public studies on user input, but I couldn’t find any that are relevant to whether or not I should use the mouse more or less at the margin.</p> <p>When I look at various tasks myself, the results are mixed, and they’re mixed in the way that most programmers I polled predicted. This result is so boring that it would barely be worth mentioning if not for the large groups of people who believe that either the keyboard is always faster than the mouse or vice versa.</p> <p>Please let me know if there are relevant studies on this topic that I should read! I’m not familiar with the relevant fields, so it’s possible that I’m searching with the wrong keywords and reading the wrong papers.</p> <h4 id="appendix-note-to-self">Appendix: note to self</h4> <p>I didn't realize that scrolling was so fast relative to searching (not explicitly mentioned in the blog post, but 1/2 of the text selection task). I tend to use search to scroll to things that are offscreen, but it appears that I should consider scrolling instead when I don't want to drop my cursor in a specific position.</p> <p><small>Thanks to Leah Hanson, Quentin Pradet, Alex Wilson, and Gaxun for comments/corrections on this post and to Annie Cherkaev, Chris Ball, Stefan Lesser, and David Isaac Lee for related discussion.</small></p> Startup options v. cash startup-options/ Wed, 07 Jun 2017 00:00:00 +0000 startup-options/ <p>I often talk to startups that claim that their compensation package has a higher expected value than the equivalent package at a place like Facebook, Google, Twitter, or Snapchat.
One thing I don’t understand about this claim is, if the claim is true, why shouldn’t the startup go to an investor, sell the options for what they claim the options are worth, and then pay me in cash? The non-obvious value of options combined with their volatility is a barrier to recruiting.</p> <p>Additionally, given my risk function and the risk function of VCs, this appears to be a better deal for everyone. Like most people, I get diminishing utility from extra income, but VCs arguably have nearly linear utility in income. Moreover, even if VCs shared my risk function, because VCs hold a diversified portfolio of investments, the same options would be worth more to them than they are to me because they can diversify away downside risk much more effectively than I can. If these startups are making a true claim about the value of their options, there should be a trade here that makes all parties better off.</p> <p>In a classic series of essays written a decade ago, seemingly aimed at convincing people to either found or join startups, Paul Graham stated &quot;If you wanted to get rich, how would you do it? I think your best bet would be to start or join a startup. That's been a reliable way to get rich for hundreds of years&quot; and &quot;Risk and reward are always proportionate.&quot; This risk-reward assertion is used to back the claim that people can make more money, in expectation, by joining startups and taking risky equity packages than they can by taking jobs that pay cash or cash plus public equity. However, the premise — that risk and reward are <em>always</em> proportionate — isn’t true in the general case. It's basic finance 101 that only assets whose risk cannot be diversified away carry a risk premium (on average). Since VCs can and do diversify risk away, there’s no reason to believe that an individual employee who “invests” in startup options by working at a startup is getting a deal because of the risk involved. And by the way, when you look at historical returns, <a href="https://web-beta.archive.org/web/20121118012653/http://www.kauffman.org/uploadedfiles/vc-enemy-is-us-report.pdf">VC funds don’t appear to outperform other investment classes</a> even though they get to buy a kind of startup equity that has less downside risk than the options you get as a normal employee.</p> <p>So how come startups can’t or won’t take on more investment and pay their employees in cash? Let’s start by looking at some cynical reasons, followed by some less cynical reasons.</p> <h3 id="cynical-reasons">Cynical reasons</h3> <p>One possible answer, perhaps the simplest possible answer, is that options aren’t worth what startups claim they’re worth and startups prefer options because their lack of value is less obvious than it would be with cash. A simplistic argument that this might be the case is that, if you look at the amount investors pay for a fraction of an early-stage or mid-stage startup and look at the extra cash the company would have been able to raise if they gave their employee option pool to investors, it usually isn’t enough to pay employees competitive compensation packages. Given that VCs don’t, on average, have outsized returns, this seems to imply that employee options aren’t worth as much as startups often claim.
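</p> <p>To make that concrete, here's a rough back-of-the-envelope sketch in Python. All of the numbers (valuation, pool size, headcount, vesting period) are hypothetical and chosen only to illustrate the shape of the argument, not taken from any particular company:</p> <pre><code># If a startup sold its employee option pool to investors at the same price
# investors pay, how much cash per employee-year would that buy?
# All numbers below are hypothetical.

post_money_valuation = 10_000_000   # a typical seed-stage post-money valuation
option_pool_fraction = 0.10         # 10% of the company reserved for employees
early_employees = 8                 # people the pool is meant to cover
vesting_years = 4

# Cash the company could raise by selling the pool at the investors' price.
# In practice it would be less, since employees get common, not preferred.
pool_value = post_money_valuation * option_pool_fraction
cash_per_employee_year = pool_value / (early_employees * vesting_years)

print(f"pool is 'worth' ${pool_value:,.0f}")
print(f"which is ${cash_per_employee_year:,.0f} per employee per year")
# prints $1,000,000 and $31,250 per employee per year, which is nowhere near
# enough to close the cash gap with a large-company compensation package.
</code></pre> <p>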
Compensation is much cheaper if you can convince people to take an arbitrary number of lottery tickets in a lottery of unknown value instead of cash.</p> <p>Some common ways that employee options are misrepresented are:</p> <h4 id="strike-price-as-value">Strike price as value</h4> <p>A company that gives you 1M options with a strike price of $10 might claim that those are “worth” $10M. However, if the share price stays at $10 for the lifetime of the option, the options will end up being worth $0 because an option with a $10 strike price is an option to buy the stock at $10, which is not the same as a grant of actual shares worth $10 apiece.</p> <h4 id="public-valuation-as-value">Public valuation as value</h4> <p>Let’s say a company raised $300M by selling 30% of the company, giving the company an implied valuation of $1B. The most common misrepresentation I see is that the company will claim that because they’re giving an option for, say, 0.1% of the company, your option is worth $1B * 0.001 = $1M. A related, common, misrepresentation is that the company raised money last year and has increased in value since then, e.g., the company has since doubled in value, so your option is worth $2M. Even if you assume the strike price was $0 and go with the last valuation at which the company raised money, the implied value of your option isn’t $1M because investors buy a different class of stock than you get as an employee.</p> <p>There are a lot of differences between the preferred stock that VCs get and the common stock that employees get; let’s look at a couple of concrete scenarios.</p> <p>Let’s say those investors that paid $300M for 30% of the company have a straight (1x) liquidation preference, and the company sells for $500M. The 1x liquidation preference means that the investors will get 1x of their investment back before lowly common stock holders get anything, so the investors will get $300M for their 30% of the company. The other 70% of equity will split $200M: your 0.1% common stock option with a $0 strike price is worth $285k (instead of the $500k you might expect it to be worth if you multiply $500M by 0.001).</p> <p>The preferred stock VCs get usually has <em>at least</em> a 1x liquidation preference. Let’s say the investors had a 2x liquidation preference in the above scenario. They would get 2x their investment back before the common stockholders split the rest of the company. Since 2 * $300M is greater than $500M, the investors would get everything and the remaining equity holders would get $0.</p> <p>Another difference between your common stock and preferred stock is that preferred stock sometimes comes with an anti-dilution clause, which you have no chance of getting as a normal engineering hire. Let’s look at <a href="https://news.ycombinator.com/item?id=11200296">an actual example of dilution at a real company</a>. Mayhar got 0.4% of a company when it was valued at $5M. By the time the company was worth $1B, Mayhar’s share of the company was diluted by 8x, which made his share of the company worth less than $500k (minus the cost of exercising his options) instead of $4M (minus the cost of exercising his options).</p> <p>This story has a few additional complications which illustrate other reasons options are often worth less than they seem.
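</p> <p>Before getting to those complications, here's a minimal sketch of the preferred-vs-common arithmetic from the scenarios above. It assumes a single round of non-participating preferred stock and ignores option pools, strike prices, and later rounds, so it's only meant to show the mechanics, not to value any real grant:</p> <pre><code># Payout to a common-stock holder when there is one round of preferred stock
# with a non-participating liquidation preference. Simplified on purpose:
# no option pool, no multiple rounds, strike price treated as $0.

def common_payout(sale_price, invested, investor_fraction,
                  preference_multiple, your_fraction):
    """What a holder of `your_fraction` of the company (in common stock)
    receives when the company sells for `sale_price`."""
    preference = preference_multiple * invested
    # Non-participating preferred: investors take the larger of their
    # liquidation preference or their pro-rata share (they would convert
    # to common if that were worth more), capped at the sale price.
    investor_take = min(sale_price, max(preference, investor_fraction * sale_price))
    remaining = sale_price - investor_take
    # Common holders split whatever is left, pro rata.
    return remaining * your_fraction / (1 - investor_fraction)

# $300M invested for 30% with a 1x preference, $500M sale, 0.1% in common:
print(common_payout(500e6, 300e6, 0.30, 1, 0.001))  # about 285,714, not 500,000

# Same deal with a 2x preference: the preference exceeds the sale price,
# so common gets nothing.
print(common_payout(500e6, 300e6, 0.30, 2, 0.001))  # 0.0
</code></pre> <p>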
Mayhar couldn’t afford to exercise his options (by paying the strike price times the number of shares he had an option for) when he joined, which is common for people who take startup jobs out of college who don’t come from wealthy families. When he left four years later, he could afford to pay the cost of exercising the options, but due to a quirk of U.S. tax law, he either couldn’t afford the tax bill or didn’t want to pay that cost for what was still a lottery ticket — when you exercise your options, you’re effectively taxed on the difference between the fair market value at the time of exercise and the strike price. Even if the company has a successful IPO for 10x as much in a few years, you’re still liable for the tax bill the year you exercise (and if the company stays private indefinitely or fails, you get nothing but a future tax deduction). Because, like most options, Mayhar’s options had a 90-day exercise window, he didn’t get anything from his options.</p> <p>While that’s more than the average amount of dilution, there are much worse cases, for example, cases where investors and senior management basically get to keep their equity and everyone else gets <a href="http://avc.com/2010/11/employee-equity-how-much/#comment-100654148">diluted to the point where their equity is worthless</a>.</p> <p>Those are just a few of the many ways in which the differences between preferred and common stock can cause the value of options to be wildly different from a value naively calculated from a public valuation. I often see both companies and employees use public preferred stock valuations as a benchmark in order to precisely value common stock options, but this isn’t possible, even in principle, without access to a company’s cap table (which shows how much of the company different investors own) as well as access to the specific details of each investment. Even if you can get that (which you usually can’t), determining the appropriate numbers to plug into a model that will give you the expected value is non-trivial because it requires answering questions like “what’s the probability that, in an acquisition, upper management will collude with investors to keep everything and leave the employees with nothing?”</p> <h4 id="black-scholes-valuation-as-value">Black-Scholes valuation as value</h4> <p>Because of the issues listed above, people will sometimes try to use a model to estimate the value of options. Black-Scholes is commonly used because it’s well known and has an easy-to-use closed-form solution. Unfortunately, most of the major assumptions for Black-Scholes are false for startup options, making the relationship between the Black-Scholes output and the actual value of your options non-obvious.</p> <h4 id="options-are-often-free-to-the-company">Options are often free to the company</h4> <p>A large fraction of options get returned to the employee option pool when employees leave, either voluntarily or involuntarily. I haven’t been able to find comprehensive numbers on this, but anecdotally, I hear that more than 50% of options end up getting taken back from employees and returned to the general pool. Dan McKinley points out <a href="https://medium.com/eshares-blog/broken-cap-tables-bbf84574a76a">an (unvetted) analysis that shows that only 5% of employee grants are exercised</a>. Even with a conservative estimate, a 50% discount on options granted sounds pretty good.
A 20x discount sounds amazing, and would explain why companies like options so much.</p> <h4 id="present-value-of-a-future-sum-of-money">Present value of a future sum of money</h4> <p>When someone says that a startup’s compensation package is worth as much as Facebook’s, they often mean that the total value paid out over N years is similar. But a fixed nominal amount of money is worth more the sooner you get it because you can (at a minimum) invest it in a low-risk asset, like Treasury bonds, and get some return on the money.</p> <p>That’s an abstract argument you’ll hear in an econ 101 class, but in practice, if you live somewhere with a relatively high cost of living, like SF or NYC, there’s an even greater value to getting paid sooner rather than later because it lets you live in a relatively nice place (however you define nice) without having to cram into a space with more roommates than would be considered reasonable elsewhere in the U.S. Many startups from the last two generations seem to be putting off their IPOs; for folks in those companies with contracts that prevent them from selling options on a secondary market, that could easily mean that the majority of their potential wealth is locked up for the first decade of their working life. Even if the startup’s compensation package is worth more when adjusting for inflation and interest, it’s not clear if that’s a great choice for most people who aren’t already moderately well off.</p> <h3 id="non-cynical-reasons">Non-cynical reasons</h3> <p>We’ve looked at some cynical reasons companies might want to offer options instead of cash, namely that they can claim that their options are worth more than they’re actually worth. Now, let’s look at some non-cynical reasons companies might want to give out stock options.</p> <p>From an employee standpoint, one non-cynical reason might have been stock option backdating, at least until that loophole was mostly closed. Up until the early 2000s, many companies backdated the dates of option grants. Let’s <a href="http://www.law.harvard.edu/faculty/jfried/option_backdating_and_its_implications.pdf">look at this example</a>, explained by Jesse M. Fried:</p> <blockquote> <p>Options covering 1.2 million shares were given to Reyes. The reported grant date was October 1, 2001, when the firm's stock was trading at around $13 per share, the lowest closing price for the year. A week later, the stock was trading at $20 per share, and a month later the stock closed at almost $26 per share.</p> <p>Brocade disclosed this grant to investors in its 2002 proxy statement in a table titled &quot;Option Grants in the Last Fiscal Year,&quot; prepared in the format specified by SEC rules. Among other things, the table describes the details of this and other grants to executives, including the number of shares covered by the option grants, the exercise price, and the options' expiration date. The information in this table is used by analysts, including those assembling Standard &amp; Poor's well-known ExecuComp database, to calculate the Black Scholes value for each option grant on the date of grant. In calculating the value, the analysts assumed, based on the firm's representations about its procedure for setting exercise prices, that the options were granted at-the-money. The calculated value was then widely used by shareholders, researchers, and the media to estimate the CEO's total pay.
The Black Scholes value calculated for Reyes' 1.2 million stock option grant, which analysts assumed was at-the-money, was $13.2 million.</p> <p>However, the SEC has concluded that the option grant to Reyes was backdated, and the market price on the actual date of grant may have been around $26 per share. Let us assume that the stock was in fact trading at $26 per share when the options were actually granted. Thus, if Brocade had adhered to its policy of giving only at-the-money options, it should have given Reyes options with a strike price of $26 per share. Instead, it gave Reyes options with a strike price of $13 per share, so that the options were $13 in the money. And it reported the grant as if it had given Reyes at-the-money options when the stock price was $13 per share.</p> <p>Had Brocade given Reyes at-the-money options at a strike price of $26 per share, the Black Scholes value of the option grant would have been approximately $26 million. But because the options were $13 in the money, they were even more valuable. According to one estimate, they were worth $28 million. Thus, if analysts had been told that Reyes received options with a strike price of $13 when the stock was trading for $26, they would have reported their value as $28 million rather than $13.2 million. In short, backdating this particular option grant, in the scenario just described, would have enabled Brocade to give Reyes $2 million more in options (Black Scholes value) while reporting an amount that was $15 million less.</p> </blockquote> <p>While stock option backdating isn’t (easily) possible anymore, there might be other loopholes or consequences of tax law that make options a better deal than cash. I could only think of one reason off the top of my head, so I spent a couple weeks asking folks (including multiple founders) for their non-cynical reasons why startups might prefer options to an equivalent amount of cash.</p> <h4 id="tax-benefit-of-isos">Tax benefit of ISOs</h4> <p>In the U.S., <a href="https://en.wikipedia.org/wiki/Incentive_stock_option">Incentive stock options</a> (ISOs) have the property that, if held for one year after the exercise date and two years after the grant date, the owner of the option pays long-term capital gains tax instead of ordinary income tax on the difference between the sale price and the strike price. In general, capital gains are taxed at a lower rate than ordinary income.</p> <p>This isn’t quite as good as it sounds because the difference between the fair market value at exercise and the strike price is subject to the Alternative Minimum Tax (AMT). I don’t find this personally relevant since I prefer to sell employer stock as quickly as possible in order to be as diversified as possible, but if you’re interested in figuring out how the AMT affects your tax bill when you exercise ISOs, see <a href="https://www.nceo.org/articles/stock-options-alternative-minimum-tax-amt">this explanation</a> for more details. For people in California, the state also taxes capital gains as ordinary income, which makes this difference smaller than you might expect from looking at capital gains vs. ordinary income tax rates.</p> <h4 id="tax-benefit-of-qsbs">Tax benefit of QSBS</h4> <p>There’s a certain class of stock that is exempt from <a href="https://blog.wealthfront.com/qualified-small-business-stock-2016/">federal capital gains tax</a> and state tax in many states (though not in CA).
This is interesting, but it seems like people rarely take advantage of this when eligible, and many startups aren’t eligible.</p> <h4 id="tax-benefit-of-other-options">Tax benefit of other options</h4> <p><a href="https://www.irs.gov/taxtopics/tc427.html">The IRS says</a>:</p> <blockquote> <p>Most nonstatutory options don't have a readily determinable fair market value. For nonstatutory options without a readily determinable fair market value, there's no taxable event when the option is granted but you must include in income the fair market value of the stock received on exercise, less the amount paid, when you exercise the option. You have taxable income or deductible loss when you sell the stock you received by exercising the option. You generally treat this amount as a capital gain or loss.</p> </blockquote> <h4 id="valuations-are-bogus">Valuations are bogus</h4> <p>One quirk of stock options is that, to qualify as ISOs, the strike price must be at least the fair market value. That’s easy to determine for public companies, but the fair market value of a share in a private company is somewhat arbitrary. For ISOs, my reading of the requirement is that companies must make “<a href="https://www.law.cornell.edu/uscode/text/26/422">an attempt, made in good faith</a>” to determine the fair market value. For other types of options, there’s <a href="http://avc.com/2010/11/employee-equity-the-option-strike-price/">other regulation which determines the definition of fair market value</a>. Either way, startups usually go to an outside firm between 1 and N times a year to get an estimate of the fair market value for their common stock. This results in at least two possible gaps between a hypothetical “real” valuation and the fair market value for options purposes.</p> <p>First, the valuation is updated relatively infrequently. A common pitch I’ve heard is that the company hasn’t had its valuation updated for ages, and the company is worth twice as much now, so you’re basically getting a 2x discount.</p> <p>Second, the firms doing the valuations are poorly incentivized to produce “correct” valuations. The firms are paid by startups, which gain something when the legal valuation is as low as possible.</p> <p>I don’t really believe that these things make options amazing, because I hear these exact things from startups and founders, which means that their offers take these into account and are priced accordingly. However, if there’s a large gap between the legal valuation and the “true” valuation and this allows companies to effectively give out higher compensation, the way stock option backdating did, I could see how this would tilt companies towards favoring options.</p> <h4 id="control">Control</h4> <p>Even if employees got the same class of stock that VCs get, founders would retain less control if they transferred the equity from employees to VCs because employee-owned equity is spread between a relatively large number of people.</p> <h4 id="retention">Retention</h4> <p>This answer was commonly given to me as a non-cynical reason. The idea is that, if you offer employees options and have a clause that prevents them from selling options on a secondary market, many employees won’t be able to leave without walking away from the majority of their compensation. Personally, this strikes me as a cynical reason, but that’s not how everyone sees it.
For example, Andreessen Horowitz managing partner <a href="http://www.benkuhn.net/clawback">Scott Kupor recently proposed a scheme under which employees would lose their options under all circumstances if they leave before a liquidity event</a>, supposedly in order to help employees.</p> <p>Whether or not you view employers being able to lock in employees for indeterminate lengths of time as good or bad, options lock-in appears to be a poor retention mechanism — companies that pay cash seem to have better retention. Just for example, Netflix pays salaries that are comparable to the total compensation in the senior band at places like Google and, anecdotally, they seem to have less attrition than trendy Bay Area startups. In fact, even though Netflix makes a lot of noise about showing people the door if they’re not a good fit, they don’t appear to have a higher involuntary attrition rate than trendy Bay Area startups — they just seem more honest about it, something which they can do because their recruiting pitch doesn’t involve you walking away with below-market compensation if you leave. If you think this comparison is unfair because Netflix hasn’t been a startup in recent memory, you can compare to finance startups, e.g. Headlands, which was founded in the same era as Uber, Airbnb, and Stripe. They (and some other finance startups) pay out hefty sums of cash and this does not appear to result in higher attrition than similarly aged startups which give out illiquid option grants.</p> <p>In the cases where this results in the employee staying longer than they otherwise would, options lock-in is often a bad deal for all parties involved. The situation is obviously bad for employees and, on average, companies don’t want unhappy people who are just waiting for a vesting cliff or liquidity event.</p> <h4 id="incentive-alignment">Incentive alignment</h4> <p>Another commonly stated reason is that, if you give people options, they’ll work harder because they’ll do well when the company does well. This was the reason that was given most vehemently (“you shouldn’t trust someone who’s only interested in a paycheck”, etc.).</p> <p>However, as far as I can tell, paying people in options almost totally decouples job performance and compensation. If you look at companies that have made a lot of people rich, like Microsoft, Google, Apple, and Facebook, almost none of the employees who became rich had an instrumental role in the company’s success. Google and Microsoft each made thousands of people rich, but the vast majority of those folks just happened to be in the right place at the right time and could have just as easily taken a different job where they didn't get rich. Conversely, the vast majority of startup option packages end up being worth little to nothing, but nearly none of the employees whose options end up being worthless were instrumental in causing their options to become worthless.</p> <p>If options are a large fraction of compensation, choosing a company that’s going to be successful is much more important than working hard. For reference, <a href="http://www.nytimes.com/1992/06/28/business/microsoft-s-unlikely-millionaires.html?pagewanted=all">Microsoft is estimated to have created roughly <code>10^3</code> millionaires by 1992</a> (adjusted for inflation, a 1992 millionaire would have about $1.75M today). The stock then went up by more than 20x. Microsoft was legendary for making people who didn't particularly do much rich; all told, it's been estimated that they made <code>10^4</code> people rich by the late 90s.
The vast majority of those people were no different from people in similar roles at Microsoft's competitors. They just happened to pick a winning lottery ticket. This is the opposite of what founders claim they get out of giving options. As above, companies that pay cash, like Netflix, don’t seem to have a problem with employee productivity.</p> <p>By the way, a large fraction of the people who were made rich by working at Microsoft joined after their IPO, which was in 1986. The same is true of Google, and while Facebook is too young for us to have a good idea what the long-term post-IPO story is, the folks who joined a year or two after the IPO (5 years ago, in 2012) have done quite well for themselves. People who joined pre-IPO have done better, but as mentioned above, most people have diminishing returns to individual wealth. The same power-law-like distribution that makes VC work also means that it's entirely plausible that Microsoft alone made more post-IPO people rich from 1986-1999 than all pre-IPO tech companies combined during that period. Something similar is plausibly true for Google from 2004 until FB's IPO in 2012, even including the people who got rich from FB's IPO as people who were made rich by a pre-IPO company, and you can do a similar calculation for Apple.</p> <h4 id="vc-firms-vs-the-market">VC firms vs. the market</h4> <p>There are several potential counter-arguments to the statement that VC returns (and therefore startup equity) don’t beat the market.</p> <p>One argument is, when people say that, they typically mean that after VCs take their fees, returns to VC funds don’t beat the market. As an employee who gets startup options, you don’t (directly) pay VC fees, which means you can beat the market by keeping the VC fees for yourself.</p> <p>Another argument is that some investors (like YC) seem to consistently do pretty well. If you join a startup that’s funded by savvy investors, you too can do pretty well. For this to make sense, you have to realize that the company is worth more than “expected” while the company doesn’t realize the same thing, because you need the company to give you an option package without properly accounting for its value. For you to have that expectation and get a good deal, this requires the founders to not only not be overconfident in the company’s probability of success, but actually requires that the founders are underconfident. While this isn’t impossible, the majority of startup offers I hear about have the opposite problem.</p> <h3 id="investing">Investing</h3> <p>This section is an update written in 2020. This post was originally written when I didn't realize that it was possible for people who aren't extremely wealthy to invest in startups. But once I moved to SF, I found that it's actually very easy to invest in startups and that you don't have to be particularly wealthy (for a programmer) to do so — <a href="https://www.freshpaint.io/blog/anatomy-of-a-seed-round-during-covid-19">people will often take small checks (as small as $5k or sometimes even less) in seed rounds</a>. If you can invest directly in a seed round, this is a strictly better deal than joining as an early employee.</p> <p>As of this writing, it's quite common for companies to raise a seed round at a $10M valuation. This means you'd have to invest $100k to get 1%, or about as much equity as you'd expect to get as a very early employee.
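</p> <p>Here's a rough sketch of that comparison in Python. The $10M valuation and 1% grant come from the paragraph above; the four-year vesting schedule is the standard one mentioned below, and everything else (dilution, share class, taxes) is ignored even though it further favors the investor:</p> <pre><code># Buying startup equity directly in a seed round vs. earning it as an early
# employee. Numbers are illustrative, not from any particular company.

seed_post_money = 10_000_000    # $10M post-money seed round
early_employee_grant = 0.01     # roughly a very-early-employee grant
vesting_years = 4               # standard vesting schedule

cost_of_equivalent_stake = seed_post_money * early_employee_grant
cost_per_employee_year = cost_of_equivalent_stake / vesting_years

print(f"${cost_of_equivalent_stake:,.0f} buys the whole grant as an investor")
print(f"${cost_per_employee_year:,.0f} buys one year's worth of early-employee equity")
# prints $100,000 and $25,000. The investor also gets preferred stock, no cliff,
# no exercise window, and can spread the money across several companies.
</code></pre> <p>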
However, if you were to join the company, your equity would vest over four years, you'd get a worse class of equity, and you'd (typically) get much less information about the share structure of the company. As an investor, you only need to invest $25k to get 1 year's worth of early employee equity. Moreover, you can invest in multiple companies, which gives you better risk-adjusted returns. At rates big companies are paying today (mid-band of perhaps $380k/yr for senior engineer, $600k/yr for staff engineer), working at a big company and spending $25k/yr investing in startups is strictly superior to working at a startup from the standpoint of financial return.</p> <h3 id="conclusion">Conclusion</h3> <p>There are a number of factors that can make options more or less valuable than they seem. From an employee standpoint, the factors that make options more valuable than they seem can cause equity to be worth tens of percent more than a naive calculation. The factors that make options less valuable than they seem do so in ways that mostly aren’t easy to quantify.</p> <p>Whether the factors that make options relatively more valuable dominate or the factors that make options relatively less valuable dominate is an empirical question. My intuition is that the factors that make options relatively less valuable are stronger, but that’s just a guess. A way to get an idea about this from public data would be to go through successful startup S-1 filings. Since this post is already ~5k words, I’ll leave that for another post, but I’ll note that in my preliminary skim of a handful of 99%-ile exits (&gt; $1B), the median employee seems to do worse than someone who’s on the standard Facebook/Google/Amazon career trajectory.</p> <p>From a company standpoint, there are a couple factors that allow companies to retain more leverage/control by giving relatively more options to employees and relatively less equity to investors.</p> <p>All of this sounds fine for founders and investors, but I don’t see what’s in it for employees. If you have additional reasons that I’m missing, I’d love to hear them.</p> <p>If you liked this post, you may also like <a href="//danluu.com/startup-tradeoffs/">this other post on the tradeoff between working at a big company and working at a startup</a>.</p> <h3 id="appendix-caveats">Appendix: caveats</h3> <p>Many startups don’t claim that their offers are financially competitive. As time goes on, I hear less “If you wanted to get rich, how would you do it? I think your best bet would be to start or join a startup. That's been a reliable way to get rich for hundreds of years.” and more “we’re not financially competitive with Facebook, but ... ”. I’ve heard from multiple founders that joining as an early employee is an incredibly bad deal when you compare early-employee equity and workload vs. founder equity and workload.</p> <p>Some startups are giving out offers that are actually competitive with large company offers. Something I’ve seen from startups that are trying to give out compelling offers is that, for “senior” folks, they’re willing to pay substantially higher salaries than public companies because it’s understood that options aren’t great for employees because of their timeline, risk profile, and expected value.</p> <p>There’s a huge amount of variation in offers, much of which is effectively random.
I know of cases where an individual got a more lucrative offer from a startup (that doesn’t tend to give particularly strong offers) than from Google, and if you ask around you’ll hear about a lot of cases like that. It’s not always true that startup offers are lower than Google/Facebook/Amazon offers, even at startups that don’t pay competitively (on average).</p> <p>Anything in this post that’s related to taxes is U.S. specific. For example, I’m told that in Canada, “you can defer the payment of taxes when exercising options whose strike price is way below fair market valuation until disposition, as long as the company is Canadian-controlled and operated in Canada”.</p> <p>You might object that the same line of reasoning we looked at for options can be applied to RSUs, even RSUs for public companies. That’s true; although the largest downsides of startup options are mitigated or non-existent with public-company RSUs, cash still has significant advantages to employees over RSUs. Unfortunately, the only non-finance company I know of that uses this to their advantage in recruiting is Netflix; please let me know if you can think of other tech companies that use the same compensation model.</p> <p>Some startups have a sliding scale that lets you choose different amounts of option/salary compensation. I haven't seen an offer that will let you put the slider to 100% cash and 0% options (or 100% options and 0% cash), but someone out there will probably be willing to give you an all-cash offer.</p> <p>In the current environment, looking at public exits may bias the data towards less successful companies. The most successful startups from the last couple of generations that haven't exited by acquisition have so far chosen not to IPO. It's possible that, once all the data are in, the average returns to joining a startup will look quite different (although I doubt the median return will change much).</p> <p>BTW, I don't have anything against taking a startup offer, even if it's low. When I graduated from college, I took the lowest offer I had, and my partner recently took the lowest offer she got (nearly a 2x difference over the highest offer). There are plenty of reasons you might want to take an offer that isn't the best possible financial offer. However, I think you should know what you're getting into and not take an offer that you think is financially great when it's merely mediocre or even bad.</p> <h3 id="appendix-non-counterarguments">Appendix: non-counterarguments</h3> <p>The most common objection I’ve heard to this is that most startups don’t have enough money to pay equivalent cash and couldn’t raise that much money by selling off what would “normally” be their employee option pool. Maybe so, but that’s not a counter-argument — it’s an argument that most startups don’t have options that are valuable enough to be exchanged for the equivalent sum of money, i.e., that the options simply aren’t as valuable as claimed.
This argument can be phrased in a variety of ways (e.g., paying salary instead of options increases burn rate, reduces runway, makes the startup default dead, etc.), but arguments of this form are fundamentally equivalent to admitting that startup options aren’t worth much because these arguments wouldn't hold up if the options were worth enough that a typical compensation package was worth as much as a typical &quot;senior&quot; offer at Google or Facebook.</p> <p>If you don't buy this, imagine a startup with a typical valuation that's at a stage where they're giving out 0.1% equity in options to new hires. Now imagine that some irrational bystander is willing to make a deal where they take 0.1% of the company for $1B. Is it worth it to take the money and pay people out of the $1B cash pool instead of paying people with 0.1% slices of the option pool? Your answer should be yes, unless you believe that the ratio between the value of cash on hand and equity is nearly infinite. Absolute statements like &quot;options are preferred to cash because paying cash increases burn rate, making the startup default dead&quot; at any valuation are equivalent to stating that the correct ratio is infinity. That's clearly nonsensical; there's some correct ratio, and we might disagree over what the correct ratio is, but for typical startups it should not be the case that the correct ratio is infinite. Since this was such a common objection, if you have this objection, my question to you is, why don't you argue that startups should pay even less cash and even more options? Is the argument that the current ratio is exactly optimal, and if so, why? Also, why does the ratio vary so much between different companies at the same stage which have raised roughly the same amount of money? Are all of those companies giving out optimal deals?</p> <p>The second most common objection is that startup options are actually worth a lot, if you pick the right startup and use a proper model to value the options. Perhaps, but if that’s true, why couldn’t they have raised a bit more money by giving away more equity to VCs at its true value, and then paid cash?</p> <p>Another common objection is something like &quot;I know lots of people who've made $1m from startups&quot;. Me too, but I also know lots of people who've made much more than that working at public companies. This post is about the relative value of compensation packages, not the absolute value.</p> <h4 id="acknowledgements">Acknowledgements</h4> <p>Thanks to Leah Hanson, Ben Kuhn, Tim Abbott, David Turner, Nick Bergson-Shilcock, Peter Fraenkel, Joe Ardent, Chris Ball, Anton Dubrau, Sean Talts, Danielle Sucher, Dan McKinley, Bert Muthalaly, Dan Puttick, Indradhanush Gupta, and Gaxun for comments and corrections.</p> How web bloat impacts users with slow connections web-bloat/ Wed, 08 Feb 2017 00:00:00 +0000 web-bloat/ <p>A couple years ago, I took a road trip from Wisconsin to Washington and mostly stayed in rural hotels on the way. I expected the internet in rural areas too sparse to have cable internet to be slow, but I was still surprised that a large fraction of the web was inaccessible. Some blogs with lightweight styling were readable, as were pages by academics who hadn’t updated the styling on their website since 1995. But very few commercial websites were usable (other than Google). When I measured my connection, I found that the bandwidth was roughly comparable to what I got with a 56k modem in the 90s.
The latency and packet loss were significantly worse than the average day on dialup: latency varied between 500ms and 1000ms and packet loss varied between 1% and 10%. Those numbers are comparable to what I’d see on dialup on a bad day.</p> <p>Despite my connection being only a bit worse than it was in the 90s, the vast majority of the web wouldn’t load. Why shouldn’t the web work with dialup or a dialup-like connection? It would be one thing if I tried to watch youtube and read pinterest. It’s hard to serve videos and images without bandwidth. But my online interests are quite boring from a media standpoint. Pretty much everything I consume online is plain text, even if it happens to be styled with images and fancy javascript. In fact, I recently tried using w3m (a terminal-based web browser that, by default, doesn’t support css, javascript, or even images) for a week and it turns out there are only two websites I regularly visit that don’t really work in w3m (twitter and zulip, both fundamentally text-based sites, at least as I use them)<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">1</a></sup>.</p> <p>More recently, I was reminded of how poorly the web works for people on slow connections when I tried to read a joelonsoftware post while using a flaky mobile connection. The HTML loaded but either one of the five CSS requests or one of the thirteen javascript requests timed out, leaving me with a broken page. Instead of seeing the article, I saw <a href="https://twitter.com/danluu/status/823286780560437248">three entire pages of sidebar, menu, and ads</a> before getting to the title because the page required some kind of layout modification to display reasonably. Pages are often designed so that they're hard or impossible to read if some dependency fails to load. On a slow connection, it's quite common for at least one dependency to fail. After refreshing the page twice, the page loaded as it was supposed to and I was able to read the blog post, a fairly compelling post on eliminating dependencies.</p> <p>Complaining that people don’t care about performance like they used to and that we’re letting bloat slow things down for no good reason is “old man yells at cloud” territory; I probably sound like that dude who complains that his word processor, which used to take 1MB of RAM, takes 1GB of RAM. Sure, that could be trimmed down, but there’s a real cost to spending time doing optimization and even a $300 laptop comes with 2GB of RAM, so why bother? But it’s not quite the same situation -- it’s not just nerds like me who care about web performance. When Microsoft looked at actual measured connection speeds, they found that <a href="https://blogs.microsoft.com/on-the-issues/2018/12/03/the-rural-broadband-divide-an-urgent-national-problem-that-we-can-solve/">half of Americans don't have broadband speed</a>. Heck, AOL had 2 million dial-up subscribers in 2015, just AOL alone. Outside of the U.S., there are even more people with slow connections. I recently chatted with Ben Kuhn, who spends a fair amount of time in Africa, about his internet connection:</p> <blockquote> <p>I've seen ping latencies as bad as ~45 sec and packet loss as bad as 50% on a mobile hotspot in the evenings from Jijiga, Ethiopia. (I'm here now and currently I have 150ms ping with no packet loss but it's 10am). There are some periods of the day where it ~never gets better than 10 sec and ~10% loss.
The internet has gotten a lot better in the past ~year; it used to be that bad all the time except in the early mornings.</p> <p>…</p> <p>Speedtest.net reports 2.6 mbps download, 0.6 mbps upload. I realized I probably shouldn't run a speed test on my mobile data because bandwidth is really expensive.</p> <p>Our server in Ethiopia is has a fiber uplink, but it frequently goes down and we fall back to a 16kbps satellite connection, though I think normal people would just stop using the Internet in that case.</p> </blockquote> <p>If you think <a href="https://1-minute-modem.branchable.com/">browsing on a 56k connection</a> is bad, try a 16k connection from Ethiopia!</p> <p>Everything we’ve seen so far is anecdotal. Let’s load some websites that programmers might frequent with a variety of simulated connections to get data on page load times. <a href="https://www.webpagetest.org/">webpagetest</a> lets us see how long it takes a web site to load (and why it takes that long) from locations all over the world. It even lets us simulate different kinds of connections as well as load sites on a variety of mobile devices. The times listed in the table below are the time until the page is “visually complete”; as measured by webpagetest, that’s the time until the above-the-fold content stops changing.</p> <style type="text/css" > #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03 .row_heading, .blank { display: none;; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col1 { background-color: #00441b; color: white; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col3 { background-color: #41ae76; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col4 { background-color: #66c2a4; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col5 { background-color: #66c2a4; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col6 { background-color: white; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col7 { background-color: #fff7ec; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col8 { background-color: white; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col9 { background-color: #fdd49e; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col10 { background-color: #fdd49e; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col1 { background-color: #00441b; color: white; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col3 { background-color: #006d2c; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col4 { background-color: #006d2c; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col5 { background-color: #41ae76; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col6 { background-color: #ccece6; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col7 { background-color: #fff7ec; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col8 { background-color: white; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col9 { background-color: #fee8c8; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col10 { background-color: #fdd49e; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col1 { background-color: #00441b; color: white; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col3 { background-color: #238b45; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col4 { background-color: #41ae76; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col5 { background-color: #99d8c9; color: black; } #T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col6 { background-color: 
</style>
style="overflow-x:auto;"> <table id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" None cellspacing="0" align="center"> <thead> <tr> <th>URL <th>Size <th>C <th colspan=8> Load time in seconds </tr> <tr> <th class="blank level0" > <th class="col_heading level0 col0" colspan=1> <th class="col_heading level0 col1" colspan=1> MB <th class="col_heading level0 col2" colspan=1> <th class="col_heading level0 col3" colspan=1> FIOS <th class="col_heading level0 col4" colspan=1> Cable <th class="col_heading level0 col5" colspan=1> LTE <th class="col_heading level0 col6" colspan=1> 3G <th class="col_heading level0 col7" colspan=1> 2G <th class="col_heading level0 col8" colspan=1> Dial <th class="col_heading level0 col9" colspan=1> Bad <th class="col_heading level0 col10" colspan=1> 😱 </tr> </thead> <tbody> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row0" rowspan=1> 0 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col0" class="data row0 col0" > http://bellard.org <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col1" class="data row0 col1" > 0.01 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col2" class="data row0 col2" > 5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col3" class="data row0 col3" > 0.40 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col4" class="data row0 col4" > 0.59 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col5" class="data row0 col5" > 0.60 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col6" class="data row0 col6" > 1.2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col7" class="data row0 col7" > 2.9 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col8" class="data row0 col8" > 1.8 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col9" class="data row0 col9" > 9.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row0_col10" class="data row0 col10" > 7.6 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row1" rowspan=1> 1 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col0" class="data row1 col0" > http://danluu.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col1" class="data row1 col1" > 0.02 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col2" class="data row1 col2" > 2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col3" class="data row1 col3" > 0.20 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col4" class="data row1 col4" > 0.20 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col5" class="data row1 col5" > 0.40 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col6" class="data row1 col6" > 0.80 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col7" class="data row1 col7" > 2.7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col8" class="data row1 col8" > 1.6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col9" class="data row1 col9" > 6.4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row1_col10" class="data row1 col10" > 7.6 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row2" rowspan=1> 2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col0" class="data row2 col0" > news.ycombinator.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col1" class="data row2 col1" > 0.03 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col2" class="data row2 col2" > 1 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col3" class="data row2 col3" > 0.30 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col4" class="data row2 col4" > 0.49 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col5" class="data 
row2 col5" > 0.69 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col6" class="data row2 col6" > 1.6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col7" class="data row2 col7" > 5.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col8" class="data row2 col8" > 5.0 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col9" class="data row2 col9" > 14 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row2_col10" class="data row2 col10" > 27 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row3" rowspan=1> 3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col0" class="data row3 col0" > danluu.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col1" class="data row3 col1" > 0.03 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col2" class="data row3 col2" > 2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col3" class="data row3 col3" > 0.20 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col4" class="data row3 col4" > 0.40 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col5" class="data row3 col5" > 0.49 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col6" class="data row3 col6" > 1.1 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col7" class="data row3 col7" > 3.6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col8" class="data row3 col8" > 3.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col9" class="data row3 col9" > 9.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row3_col10" class="data row3 col10" > 15 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row4" rowspan=1> 4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col0" class="data row4 col0" > http://jvns.ca <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col1" class="data row4 col1" > 0.14 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col2" class="data row4 col2" > 7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col3" class="data row4 col3" > 0.49 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col4" class="data row4 col4" > 0.69 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col5" class="data row4 col5" > 1.2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col6" class="data row4 col6" > 2.9 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col7" class="data row4 col7" > 10 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col8" class="data row4 col8" > 19 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col9" class="data row4 col9" > 29 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row4_col10" class="data row4 col10" > 108 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row5" rowspan=1> 5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col0" class="data row5 col0" > jvns.ca <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col1" class="data row5 col1" > 0.15 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col2" class="data row5 col2" > 4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col3" class="data row5 col3" > 0.50 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col4" class="data row5 col4" > 0.80 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col5" class="data row5 col5" > 1.2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col6" class="data row5 col6" > 3.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col7" class="data row5 col7" > 11 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col8" class="data row5 col8" > 21 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col9" class="data row5 col9" > 31 <td 
id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row5_col10" class="data row5 col10" > 97 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row6" rowspan=1> 6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col0" class="data row6 col0" > fgiesen.wordpress.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col1" class="data row6 col1" > 0.37 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col2" class="data row6 col2" > 12 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col3" class="data row6 col3" > 1.0 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col4" class="data row6 col4" > 1.1 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col5" class="data row6 col5" > 1.4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col6" class="data row6 col6" > 5.0 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col7" class="data row6 col7" > 16 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col8" class="data row6 col8" > 66 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col9" class="data row6 col9" > 68 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row6_col10" class="data row6 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row7" rowspan=1> 7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col0" class="data row7 col0" > google.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col1" class="data row7 col1" > 0.59 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col2" class="data row7 col2" > 6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col3" class="data row7 col3" > 0.80 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col4" class="data row7 col4" > 1.8 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col5" class="data row7 col5" > 1.4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col6" class="data row7 col6" > 6.8 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col7" class="data row7 col7" > 19 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col8" class="data row7 col8" > 94 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col9" class="data row7 col9" > 96 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row7_col10" class="data row7 col10" > 236 </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row8" rowspan=1> 8 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col0" class="data row8 col0" > joelonsoftware.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col1" class="data row8 col1" > 0.72 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col2" class="data row8 col2" > 19 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col3" class="data row8 col3" > 1.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col4" class="data row8 col4" > 1.7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col5" class="data row8 col5" > 1.9 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col6" class="data row8 col6" > 9.7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col7" class="data row8 col7" > 28 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col8" class="data row8 col8" > 140 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col9" class="data row8 col9" > FAIL <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row8_col10" class="data row8 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row9" rowspan=1> 9 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col0" class="data row9 col0" > bing.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col1" class="data row9 col1" > 
1.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col2" class="data row9 col2" > 12 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col3" class="data row9 col3" > 1.4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col4" class="data row9 col4" > 2.9 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col5" class="data row9 col5" > 3.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col6" class="data row9 col6" > 11 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col7" class="data row9 col7" > 43 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col8" class="data row9 col8" > 134 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col9" class="data row9 col9" > FAIL <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row9_col10" class="data row9 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row10" rowspan=1> 10 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col0" class="data row10 col0" > reddit.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col1" class="data row10 col1" > 1.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col2" class="data row10 col2" > 26 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col3" class="data row10 col3" > 7.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col4" class="data row10 col4" > 6.9 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col5" class="data row10 col5" > 7.0 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col6" class="data row10 col6" > 20 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col7" class="data row10 col7" > 58 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col8" class="data row10 col8" > 179 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col9" class="data row10 col9" > 210 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row10_col10" class="data row10 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row11" rowspan=1> 11 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col0" class="data row11 col0" > signalvnoise.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col1" class="data row11 col1" > 2.1 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col2" class="data row11 col2" > 7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col3" class="data row11 col3" > 2.0 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col4" class="data row11 col4" > 3.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col5" class="data row11 col5" > 3.7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col6" class="data row11 col6" > 16 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col7" class="data row11 col7" > 47 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col8" class="data row11 col8" > 173 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col9" class="data row11 col9" > 218 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row11_col10" class="data row11 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row12" rowspan=1> 12 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col0" class="data row12 col0" > amazon.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col1" class="data row12 col1" > 4.4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col2" class="data row12 col2" > 47 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col3" class="data row12 col3" > 6.6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col4" class="data row12 col4" > 13 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col5" 
class="data row12 col5" > 8.4 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col6" class="data row12 col6" > 36 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col7" class="data row12 col7" > 65 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col8" class="data row12 col8" > 265 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col9" class="data row12 col9" > 300 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row12_col10" class="data row12 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row13" rowspan=1> 13 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col0" class="data row13 col0" > steve-yegge.blogspot.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col1" class="data row13 col1" > 9.7 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col2" class="data row13 col2" > 19 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col3" class="data row13 col3" > 2.2 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col4" class="data row13 col4" > 3.6 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col5" class="data row13 col5" > 3.3 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col6" class="data row13 col6" > 12 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col7" class="data row13 col7" > 36 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col8" class="data row13 col8" > 206 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col9" class="data row13 col9" > 188 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row13_col10" class="data row13 col10" > FAIL </tr> <tr> <th id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03" class="row_heading level0 row14" rowspan=1> 14 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col0" class="data row14 col0" > blog.codinghorror.com <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col1" class="data row14 col1" > 23 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col2" class="data row14 col2" > 24 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col3" class="data row14 col3" > 6.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col4" class="data row14 col4" > 15 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col5" class="data row14 col5" > 9.5 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col6" class="data row14 col6" > 83 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col7" class="data row14 col7" > 235 <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col8" class="data row14 col8" > FAIL <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col9" class="data row14 col9" > FAIL <td id="T_3d17f176_ecb9_11e6_bbf7_0cc47ac41d03row14_col10" class="data row14 col10" > FAIL </tr> </tbody> </table> </div> <p>Each row is a website. For sites that support both plain HTTP as well as HTTPS, both were tested; URLs are HTTPS except where explicitly specified as HTTP. The first two columns show the amount of data transferred over the wire in MB (which includes headers, handshaking, compression, etc.) and the number of TCP connections made. The rest of the columns show the time in seconds to load the page on a variety of connections from fiber (FIOS) to less good connections. “Bad” has the bandwidth of dialup, but with 1000ms ping and 10% packetloss, which is roughly what I saw when using the internet in small rural hotels. “😱” simulates a 16kbps satellite connection from Jijiga, Ethiopia. Rows are sorted by the measured amount of data transferred.</p> <p>The timeout for tests was 6 minutes; anything slower than that is listed as FAIL. 
Pages that failed to load are also listed as FAIL. A few things that jump out from the table are:</p> <ol> <li>A large fraction of the web is unusable on a bad connection. Even on a good (0% packetloss, no ping spike) dialup connection, some sites won’t load.</li> <li>Some sites will use a lot of data!</li> </ol> <h4 id="the-web-on-bad-connections">The web on bad connections</h4> <p>As commercial websites go, Google is basically as good as it gets for people on a slow connection. On dialup, the 50%-ile page load time is a minute and a half. But at least it loads -- when I was on a slow, shared, satellite connection in rural Montana, virtually no commercial websites would load at all. I could view websites that only had static content via Google cache, but the live site had no hope of loading.</p> <h4 id="some-sites-will-use-a-lot-of-data">Some sites will use a lot of data</h4> <p>Although only two really big sites were tested here, there are plenty of sites that will use 10MB or 20MB of data. If you’re reading this from the U.S., maybe you don’t care, but if you’re browsing from Mauritania, Madagascar, or Vanuatu, loading codinghorror once will cost you <a href="https://web.archive.org/web/20190220011242if_/https://whatdoesmysitecost.com/test/170206_0G_2QG#gniCost">more than 10% of the daily per capita GNI</a>.</p> <h4 id="page-weight-matters">Page weight matters</h4> <p>Despite the best efforts of <a href="http://idlewords.com/talks/website_obesity.htm">Maciej</a>, the meme that page weight doesn’t matter keeps getting spread around. AFAICT, the top HN link of all time on web page optimization is to an article titled “Ludicrously Fast Page Loads - A Guide for Full-Stack Devs”. At the bottom of the page, the author links to another one of his posts, titled “Page Weight Doesn’t Matter”.</p> <blockquote> <p>Usually, the boogeyman that gets pointed at is bandwidth: users in low-bandwidth areas (3G, developing world) are getting shafted. But the math doesn’t quite work out. Akamai puts the global connection speed average at 3.9 megabits per second.</p> </blockquote> <p>The “ludicrously fast” guide fails to display properly on dialup or slow mobile connections because the images time out. On reddit, <a href="https://www.reddit.com/r/web_design/comments/3oppvo/ludicrously_fast_page_loads_a_guide_for_fullstack/">it also fails under load</a>: &quot;Ironically, that page took so long to load that I closed the window.&quot;, &quot;a lot of … gifs that do nothing but make your viewing experience worse&quot;, &quot;I didn't even make it to the gifs; the header loaded then it just hung.&quot;, etc.</p> <p>The flaw in the “page weight doesn’t matter because average speed is fast” argument is that if you average the connection of someone in my apartment building (which is wired for 1Gbps internet) and someone on 56k dialup, you get an average speed of 500 Mbps. That doesn’t mean the person on dialup is actually going to be able to load a 5MB website. The average speed of 3.9 Mbps comes from a 2014 Akamai report, but it’s just an average. If you look at Akamai’s 2016 report, you can find entire countries where more than 90% of IP addresses are slower than that!</p>
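<p>To make the arithmetic concrete, here’s a rough back-of-the-envelope sketch using the round numbers above (a 1Gbps connection, 56k dialup, and a 5MB page); it isn’t part of the measurements, it just spells out why the average is meaningless for the person on the slow end:</p>
<pre><code># Back-of-the-envelope numbers for the average-speed argument.
fast_kbps = 1_000_000   # 1 Gbps apartment connection, in kbps
dialup_kbps = 56        # 56k dialup, in kbps

average_kbps = (fast_kbps + dialup_kbps) / 2
print(average_kbps)     # ~500,000 kbps, i.e. the average of the two users is ~500 Mbps

# The dialup user still has to pull a 5MB page through 56 kbps:
page_bits = 5 * 1000 * 1000 * 8
seconds = page_bits / (dialup_kbps * 1000)
print(seconds / 60)     # roughly 12 minutes, ignoring overhead and packet loss
</code></pre>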
<p>Yes, there are a lot of factors besides page weight that matter, and yes, it's possible to create a contrived page that's very small but loads slowly, as well as a huge page that loads ok because all of the weight isn't blocking, but total page weight is still pretty decently correlated with load time.</p> <p>Since its publication, the &quot;ludicrously fast&quot; guide was updated with some javascript that only loads images if you scroll down far enough. That makes it look a lot better on webpagetest if you're looking at the page size number (if webpagetest isn't being scripted to scroll), but it's a worse user experience for people on slow connections who want to read the page. If you're going to read the entire page anyway, the weight increases, and you can no longer preload images by loading the site. Instead, if you're reading, you have to stop for a few minutes at every section to wait for the images from that section to load. And that's if you're lucky and the javascript for loading images didn't fail to load.</p> <h4 id="the-average-user-fallacy">The average user fallacy</h4> <p>Just like many people develop with an average connection speed in mind, many people have a fixed view of who a user is. Maybe they think there are customers with a lot of money with fast connections and customers who won't spend money on slow connections. That is, very roughly speaking, perhaps true on average, but sites don't operate on average, they operate in particular domains. Jamie Brandon writes the following about his experience with Airbnb:</p> <blockquote> <p>I spent three hours last night trying to book a room on airbnb through an overloaded wifi and presumably a satellite connection. OAuth seems to be particularly bad over poor connections. Facebook's OAuth wouldn't load at all and Google's sent me round a 'pick an account' -&gt; 'please reenter you password' -&gt; 'pick an account' loop several times. It took so many attempts to log in that I triggered some 2fa nonsense on airbnb that also didn't work (the confirmation link from the email led to a page that said 'please log in to view this page') and eventually I was just told to send an email to account.disabled@airbnb.com, who haven't replied.</p> <p>It's particularly galling that airbnb doesn't test this stuff, because traveling is pretty much the whole point of the site so they can't even claim that there's no money in servicing people with poor connections.</p> </blockquote> <h4 id="what-about-tail-latency">What about tail latency?</h4> <p>My original plan for this post was to show 50%-ile, 90%-ile, 99%-ile, etc., tail load times. But the 50%-ile results are so bad that I don’t know if there’s any point to showing the other results.
If you were to look at the 90%-ile results, you’d see that most pages fail to load on dialup and the “Bad” and “😱” connections are hopeless for almost all sites.</p> <h4 id="http-vs-https">HTTP vs HTTPS</h4> <div style="overflow-x:auto;"> <table cellspacing="0" align="center">
<thead>
<tr> <th> <th>URL <th>Size <th>C <th colspan=8> Load time in seconds </tr>
<tr> <th> <th> <th>kB <th> <th>FIOS <th>Cable <th>LTE <th>3G <th>2G <th>Dial <th>Bad <th>😱 </tr>
</thead>
<tbody>
<tr> <th>1 <td>http://danluu.com <td>21.1 <td>2 <td>0.20 <td>0.20 <td>0.40 <td>0.80 <td>2.7 <td>1.6 <td>6.4 <td>7.6 </tr>
<tr> <th>3 <td>https://danluu.com <td>29.3 <td>2 <td>0.20 <td>0.40 <td>0.49 <td>1.1 <td>3.6 <td>3.5 <td>9.3 <td>15 </tr>
</tbody>
</table> </div> <p>You can see that for a very small site that doesn’t load many blocking resources, HTTPS is noticeably slower than HTTP, especially on slow connections. Practically speaking, this doesn’t matter today because virtually no sites are that small, but if you design a web site as if people with slow connections actually matter, this is noticeable.</p>
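<p>Most of that gap is extra round trips before the first byte arrives. As a rough sketch (not one of the measurements above, and assuming a full TLS 1.2 handshake with no session resumption), the handshake cost scales with round-trip time, which is exactly what the slow connections have the most of:</p>
<pre><code># Rough time-to-first-byte estimate in seconds, ignoring server think time.
# Assumes 1 RTT for the TCP handshake, 1 RTT for the request/response, and
# roughly 2 extra RTTs for a full TLS 1.2 handshake without session resumption.
def first_byte_seconds(rtt_ms, tls=False):
    round_trips = 2 + (2 if tls else 0)
    return round_trips * rtt_ms / 1000.0

for rtt_ms in (20, 300, 1000):   # FIOS-ish, 2G-ish, and Bad-ish pings
    print(rtt_ms, first_byte_seconds(rtt_ms), first_byte_seconds(rtt_ms, tls=True))
</code></pre>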
<h3 id="how-to-make-pages-usable-on-slow-connections">How to make pages usable on slow connections</h3> <p>The long version is: to really understand what’s going on, consider reading <a href="https://hpbn.co/">High Performance Browser Networking</a>, a great book on web performance that’s available for free.</p> <p>The short version is that most sites are so poorly optimized that someone who has no idea what they’re doing can get a 10x improvement in page load times for a site whose job is to serve up text with the occasional image. When I started this blog in 2013, I used Octopress because Jekyll/Octopress was the most widely recommended static site generator back then. A plain blog post with one or two images took 11s to load on a cable connection because the Octopress defaults included multiple useless javascript files in the header (for never-used-by-me things like embedding flash videos and delicious integration), which blocked page rendering. Just moving those javascript includes to the footer halved page load time, and <a href="//danluu.com/octopress-speedup/">making a few other tweaks decreased page load time by another order of magnitude</a>. At the time I made those changes, I knew nothing about web page optimization, other than what I heard during a 2-minute blurb on optimization from a 40-minute talk on how the internet works, and I was able to get a 20x speedup on my blog in a few hours. You might argue that I’ve now gone too far and removed too much CSS, but I got a 20x speedup for people on fast connections before making changes that affected the site’s appearance (and the speedup on slow connections was much larger).</p> <p>That’s normal. Popular themes for many different kinds of blogging software and CMSs contain anti-optimizations so blatant that any programmer, even someone with no front-end experience, can find large gains by just pointing <a href="https://www.webpagetest.org/">webpagetest</a> at their site and looking at the output.</p> <h3 id="what-about-browsers">What about browsers?</h3> <p>While it's easy to blame page authors because there's a lot of low-hanging fruit on the page side, there's just as much low-hanging fruit on the browser side. Why does my browser open up 6 TCP connections to try to download six images at once when I'm on a slow satellite connection? That just guarantees that all six images will time out! Even if I tweak the timeout on the client side, servers that are configured to protect against DoS attacks won't allow long-lived connections that aren't doing anything. I can sometimes get some images to load by refreshing the page a few times (and waiting ten minutes each time), but why shouldn't the browser handle retries for me? If you think about it for a few minutes, there are a lot of optimizations that browsers could do for people on slow connections, but because they don't, the best current solution for users appears to be: use w3m when you can, and then switch to a browser with ad-blocking when that doesn't work.
But why should users have to use two entirely different programs, one of which has a text-based interface only computer nerds will find palatable?</p> <h3 id="conclusion">Conclusion</h3> <p>When I was at Google, someone told me a story about a time that “they” completed a big optimization push only to find that measured page load times increased. When they dug into the data, they found that the reason load times had increased was that they got a lot more traffic from Africa after doing the optimizations. The team’s product went from being unusable for people with slow connections to usable, which caused so many users with slow connections to start using the product that load times actually increased.</p> <p>Last night, at a presentation on the websockets protocol, Gary Bernhardt made the observation that the people who designed the websockets protocol did things like using a variable length field for frame length to save a few bytes. By contrast, if you look at the Alexa top 100 sites, almost all of them have a huge amount of slop in them; it’s plausible that the total bandwidth used for those 100 sites is probably greater than the total bandwidth for all websockets connections combined. Despite that, if we just look at the three top 35 sites tested in this post, two send uncompressed javascript over the wire, two redirect the bare domain to the www subdomain, and two send a lot of extraneous information by not compressing images as much as they could be compressed without sacrificing quality. If you look at twitter, which isn’t in our table but was mentioned above, they actually do <a href="https://twitter.com/danluu/status/705815510479302656">an anti-optimization where, if you upload a PNG which isn’t even particularly well optimized, they’ll re-encode it as a jpeg which is larger and has visible artifacts</a>!</p> <p>“Use bcrypt” has become the mantra for a reasonable default if you’re not sure what to do when storing passwords. The web would be a nicer place if “use webpagetest” caught on in the same way. It’s not always the best tool for the job, but it sure beats the current defaults.</p> <h3 id="appendix-experimental-caveats">Appendix: experimental caveats</h3> <p>The above tests were done by repeatedly loading pages via a private webpagetest image in AWS west 2, on a c4.xlarge VM, with simulated connections on a first page load in Chrome with no other tabs open and nothing running on the VM other than the webpagetest software and the browser. This is unrealistic in many ways.</p> <p>In relative terms, this disadvantages sites that have a large edge presence. When I was in rural Montana, I ran some tests and found that I had noticeably better latency to Google than to basically any other site. This is not reflected in the test results. Furthermore, this setup means that pages are nearly certain to be served from a CDN cache. That shouldn't make any difference for sites like Google and Amazon, but it reduces the page load time of less-trafficked sites that aren't &quot;always&quot; served out of cache. For example, when I don't have a post trending on social media, between 55% and 75% of traffic is served out of a CDN cache, and when I do have something trending on social media, it's more like 90% to 99%. 
But the test setup means that the CDN cache hit rate during the test is likely to be &gt; 99% for my site and other blogs which aren't so widely read that they'd normally always have a cached copy available.</p> <p>All tests were run assuming a first page load, but it’s entirely reasonable for sites like Google and Amazon to assume that many or most of their assets are cached. Testing first page load times is perhaps reasonable for sites with a traffic profile like mine, where much of the traffic comes from social media referrals of people who’ve never visited the site before.</p> <p>A c4.xlarge is a fairly powerful machine. Today, most page loads come from mobile and even the fastest mobile devices aren’t as fast as a c4.xlarge; most mobile devices are much slower than the fastest mobile devices. Most desktop page loads will also be from a machine that’s slower than a c4.xlarge. Although the results aren’t shown, I also ran a set of tests using a t2.micro instance: for simple sites, like mine, the difference was negligible, but for complex sites, like Amazon, page load times were as much as 2x worse. As you might expect, for any particular site, the difference got smaller as the connection got slower.</p> <p>As Joey Hess pointed out, many dialup providers attempt to do compression or other tricks to reduce the effective weight of pages, and none of these tests take that into account.</p> <p>Firefox, IE, and Edge often have substantially different performance characteristics from Chrome. For that matter, different versions of Chrome can have different performance characteristics. I just used Chrome because it’s the most widely used desktop browser, and running this set of tests took over a full day of VM time with a single browser.</p> <p>The simulated bad connections add a constant latency and fixed (10%) packetloss. In reality, poor connections have highly variable latency with peaks that are much higher than the simulated latency and periods of much higher packetloss that can last for minutes, hours, or days. Putting 😱 at the rightmost side of the table may make it seem like the worst possible connection, but packetloss can get much worse.</p> <p>Similarly, while codinghorror happens to be at the bottom of the table, it's nowhere near being the slowest loading page. Just for example, I originally considered including slashdot in the table but it was so slow that it caused a significant increase in total test run time because it timed out at six minutes so many times. Even on FIOS it takes 15s to load by making a whopping 223 requests over 100 TCP connections despite weighing in at &quot;only&quot; 1.9MB. Amazingly, slashdot also pegs the CPU at 100% for 17 entire seconds while loading on FIOS. In retrospect, this might have been a good site to include because it's pathologically mis-optimized sites like slashdot that allow the &quot;page weight doesn't matter&quot; meme to sound reasonable.</p> <p>The websites compared don't do the same thing. Just looking at the blogs, some blogs put entire blog entries on the front page, which is more convenient in some ways, but also slower. Commercial sites are even more different -- they often can't reasonably be static sites and have to have relatively large javascript payloads in order to work well.</p> <h3 id="appendix-irony">Appendix: irony</h3> <p>The main table in this post is almost 50kB of HTML (without compression or minification); that’s larger than everything else in this post combined. That table is curiously large because I used a library (pandas) to generate the table instead of just writing a script to do it by hand, and as we know, the default settings for most libraries generate a massive amount of bloat. It didn’t even save time because every single built-in time-saving feature that I wanted to use was buggy, which forced me to write all of the heatmap/gradient/styling code myself anyway! Due to laziness, I left the pandas table-generating scaffolding code, resulting in a table that looks like it’s roughly an order of magnitude larger than it needs to be.</p> <p>This isn't a criticism of pandas. Pandas is probably quite good at what it's designed for; it's just not designed to produce slim websites. The CSS class names are huge, which is reasonable if you want to avoid accidental name collisions for generated CSS. Almost every <code>td</code>, <code>th</code>, and <code>tr</code> element is tagged with a redundant <code>rowspan=1</code> or <code>colspan=1</code>, which is reasonable for generated code if you don't care about size. Each cell has its own CSS class, even though many cells share styling with other cells; again, this probably simplified things on the code generation side. Every piece of bloat is totally reasonable. And unfortunately, there's no tool that I know of that will take a bloated table and turn it into a slim table. A pure HTML minifier can't change the class names because it doesn't know that some external CSS or JS doesn't depend on the class name. An HTML minifier could theoretically determine that different cells have the same styling and merge them, except for the aforementioned problem with potential but non-existent external dependencies, but that's beyond the capability of the tools I know of.</p>
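<p>For a sense of scale, a hand-rolled generator along these lines (a generic sketch, not the script actually used for this post) emits nothing per cell beyond the tag and the value:</p>
<pre><code># Minimal sketch of emitting a slim table by hand: no per-cell ids or classes,
# no rowspan=1/colspan=1 noise, and one shared style for the whole table.
rows = [
    ('http://danluu.com', 0.02, 2),
    ('google.com', 0.59, 6),
]

def render(rows):
    parts = ['&lt;table&gt;', '&lt;tr&gt;&lt;th&gt;URL&lt;th&gt;MB&lt;th&gt;C&lt;/tr&gt;']
    for url, size_mb, connections in rows:
        parts.append('&lt;tr&gt;&lt;td&gt;{}&lt;td&gt;{}&lt;td&gt;{}&lt;/tr&gt;'.format(url, size_mb, connections))
    parts.append('&lt;/table&gt;')
    return '\n'.join(parts)

print(render(rows))
</code></pre>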
<p>For another level of irony, consider that while I think of a 50kB table as bloat, this page is 12kB when gzipped, even with all of the bloat. Google's AMP currently has &gt; 100kB of blocking javascript that has to load before the page loads! There's no reason for me to use AMP pages because AMP is slower than my current setup of pure HTML with a few lines of embedded CSS and the occasional image, but, as a result, I'm penalized by Google (relative to AMP pages) for not &quot;accelerating&quot; (decelerating) my page with AMP.</p> <p><small> Thanks to Leah Hanson, Jason Owen, Ethan Willis, and Lindsey Kuper for comments/corrections </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:M">excluding internal Microsoft stuff that’s required for work. Many of the sites are IE only and don’t even work in Edge. I didn’t try those sites in w3m but I doubt they’d work! In fact, I doubt that even half of the non-IE specific internal sites would work in w3m. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> </ol> </div> HN: the good parts hn-comments/ Sun, 23 Oct 2016 00:00:00 +0000 hn-comments/ <p><a href="https://twitter.com/danluu/status/1433297087362486272">HN</a> <a href="https://twitter.com/altluu/status/1449285906360266754">comments</a> <a href="https://twitter.com/altluu/status/1452707802921652224">are</a> <a href="https://twitter.com/danluu/status/1584260625920032768">terrible</a>.
<a href="https://twitter.com/altluu/status/1586623057019699200">On</a> <a href="https://twitter.com/danluu/status/1346392037365436418">any</a> <a href="https://twitter.com/danluu/status/1280651255275110400">topic</a> <a href="https://twitter.com/danluu/status/1498343295545581568">I’m</a> <a href="https://twitter.com/danluu/status/1512835522002972680">informed</a> <a href="https://twitter.com/danluu/status/1298477505272147968">about</a>, <a href="https://twitter.com/danluu/status/1277222715456282625">the</a> <a href="https://twitter.com/danluu/status/872676217350172673">vast</a> <a href="https://twitter.com/danluu/status/1484268111687663620">majority</a> <a href="https://twitter.com/danluu/status/1449113998293491716">of</a> <a href="https://twitter.com/danluu/status/1344219344985677824">comments</a> <a href="https://twitter.com/danluu/status/1135308282334023680">are</a> <a href="https://twitter.com/altluu/status/1572989465504915458">pretty</a> <a href="https://twitter.com/danluu/status/1271392053314781184">clearly</a> <a href="https://twitter.com/danluu/status/1311217788720001024">wrong</a>. <a href="https://twitter.com/danluu/status/1449113998293491716">Most</a> of the time, there are zero comments from people who know anything about the topic and the top comment is reasonable sounding but totally incorrect. Additionally, many comments are gratuitously mean. You'll often hear mean comments backed up with something like &quot;this is better than the other possibility, where everyone just pats each other on the back with comments like 'this is great'&quot;, as if being an asshole is some sort of talisman against empty platitudes. I've seen people push back against that; when pressed, people often say that it’s either impossible or inefficient to teach someone without being mean, as if telling someone that they're stupid somehow helps them learn. It's as if people learned how to explain things by watching Simon Cowell and can't comprehend the concept of an explanation that isn't littered with personal insults. Paul Graham has said, &quot;<a href="https://news.ycombinator.com/item?id=5937839">Oh, you should never read Hacker News comments about anything you write</a>”. Most of the negative things you hear about HN comments are true.</p> <p>And yet, I haven’t found a public internet forum with better technical commentary. On topics I'm familiar with, while it's rare that a thread will have even a single comment that's well-informed, when those comments appear, they usually float to the top. On other forums, well-informed comments are either non-existent or get buried by reasonable sounding but totally wrong comments when they appear, and they appear even more rarely than on HN.</p> <p>By volume, there are probably more interesting technical “posts” in comments than in links. Well, that depends on what you find interesting, but that’s true for my interests. If I see a low-level optimization comment from nkurz, a comment on business from patio11, a comment on how companies operate by nostrademons, I almost certainly know that I’m going to read an interesting comment. There are maybe 20 to 30 people I can think of who don’t blog much, but write great comments on HN and I doubt I even know of half the people who are writing great comments on HN<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">1</a></sup>.</p> <p>I compiled a very abbreviated list of comments I like because comments seem to get lost. 
If you write a blog post, people will refer it years later, but comments mostly disappear. I think that’s sad -- there’s a lot of great material on HN (and yes, even more not-so-great material).</p> <h4 id="what-s-the-deal-with-ms-word-s-file-format-https-news-ycombinator-com-item-id-12472505"><a href="https://news.ycombinator.com/item?id=12472505">What’s the deal with MS Word’s file format</a>?</h4> <blockquote> <p>Basically, the Word file format is a binary dump of memory. I kid you not. They just took whatever was in memory and wrote it out to disk. We can try to reason why (maybe it was faster, maybe it made the code smaller), but I think the overriding reason is that the original developers didn't know any better.</p> <p>Later as they tried to add features they had to try to make it backward compatible. This is where a lot of the complexity lies. There are lots of crazy workarounds for things that would be simple if you allowed yourself to redesign the file format. It's pretty clear that this was mandated by management, because no software developer would put themselves through that hell for no reason.</p> <p>Later they added a fast-save feature (I forget what it is actually called). This appends changes to the file without changing the original file. The way they implemented this was really ingenious, but complicates the file structure a lot.</p> <p>One thing I feel I must point out (I remember posting a huge thing on slashdot when this article was originally posted) is that 2 way file conversion is next to impossible for word processors. That's because the file formats do not contain enough information to format the document. The most obvious place to see this is pagination. The file format does not say where to paginate a text flow (unless it is explicitly entered by the user). It relies of the formatter to do it. Each word processor formats text completely differently. Word, for example famously paginates footnotes incorrectly. They can't change it, though, because it will break backwards compatibility. This is one of the only reasons that Word Perfect survives today -- it is the only word processor that paginates legal documents the way the US Department of Justice requires.</p> <p>Just considering the pagination issue, you can see what the problem is. When reading a Word document, you have to paginate it like Word -- only the file format doesn't tell you what that is. Then if someone modifies the document and you need to resave it, you need to somehow mark that it should be paginated like Word (even though it might now have features that are not in Word). If it was only pagination, you might be able to do it, but practically everything is like that.</p> <p>I recommend reading (a bit of) the XML Word file format for those who are interested. You will see large numbers of flags for things like &quot;Format like Word 95&quot;. The format doesn't say what that is -- because it's pretty obvious that the authors of the file format don't know. 
It's lost in a hopeless mess of legacy code and nobody can figure out what it does now.</p> </blockquote> <h4 id="fun-with-null-https-news-ycombinator-com-item-id-12007286"><a href="https://news.ycombinator.com/item?id=12007286">Fun with NULL</a></h4> <blockquote> <p>Here's another example of this fine feature:</p> </blockquote> <pre><code> #include &lt;stdio.h&gt; #include &lt;string.h&gt; #include &lt;stdlib.h&gt; #define LENGTH 128 int main(int argc, char **argv) { char *string = NULL; int length = 0; if (argc &gt; 1) { string = argv[1]; length = strlen(string); if (length &gt;= LENGTH) exit(1); } char buffer[LENGTH]; memcpy(buffer, string, length); buffer[length] = 0; if (string == NULL) { printf(&quot;String is null, so cancel the launch.\n&quot;); } else { printf(&quot;String is not null, so launch the missiles!\n&quot;); } printf(&quot;string: %s\n&quot;, string); // undefined for null but works in practice #if SEGFAULT_ON_NULL printf(&quot;%s\n&quot;, string); // segfaults on null when bare &quot;%s\n&quot; #endif return 0; } nate@skylake:~/src$ clang-3.8 -Wall -O3 null_check.c -o null_check nate@skylake:~/src$ null_check String is null, so cancel the launch. string: (null) nate@skylake:~/src$ icc-17 -Wall -O3 null_check.c -o null_check nate@skylake:~/src$ null_check String is null, so cancel the launch. string: (null) nate@skylake:~/src$ gcc-5 -Wall -O3 null_check.c -o null_check nate@skylake:~/src$ null_check String is not null, so launch the missiles! string: (null) </code></pre> <blockquote> <p>It appear that Intel's ICC and Clang still haven't caught up with GCC's optimizations. Ouch if you were depending on that optimization to get the performance you need! But before picking on GCC too much, consider that all three of those compilers segfault on printf(&quot;string: &quot;); printf(&quot;%s\n&quot;, string) when string is NULL, despite having no problem with printf(&quot;string: %s\n&quot;, string) as a single statement. Can you see why using two separate statements would cause a segfault? If not, see here for a hint: <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25609">https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25609</a></p> </blockquote> <h4 id="how-do-you-make-sure-the-autopilot-backup-is-paying-attention-https-news-ycombinator-com-item-id-11017034"><a href="https://news.ycombinator.com/item?id=11017034">How do you make sure the autopilot backup is paying attention</a>?</h4> <blockquote> <p>Good engineering eliminates users being able to do the wrong thing as much as possible. . . . You don't design a feature that invites misuse and then use instructions to try to prevent that misuse.</p> <p>There was a derailment in Australia called the Waterfall derailment [1]. It occurred because the driver had a heart attack and was responsible for 7 deaths (a miracle it was so low, honestly). The root cause was the failure of the dead-man's switch.</p> <p>In the case of Waterfall, the driver had 2 dead-man switches he could use - 1) the throttle handle had to be held against a spring at a small rotation, or 2) a bar on the floor could be depressed. You had to do 1 of these things, the idea being that you prevent wrist or foot cramping by allowing the driver to alternate between the two. Failure to do either triggers an emergency brake.</p> <p>It turns out that this driver was fat enough that when he had a heart attack, his leg was able to depress the pedal enough to hold the emergency system off. 
Thus, the dead-man's system never triggered with a whole lot of dead man in the driver's seat.</p> <p>I can't quite remember the specifics of the system at Waterfall, but one method to combat this is to require the pedal to be held halfway between released and fully depressed. The idea being that a dead leg would fully depress the pedal so that would trigger a brake, and a fully released pedal would also trigger a brake. I don't know if they had that system but certainly that's one approach used in rail.</p> <p>Either way, the problem is equally possible in cars. If you lose consciousness and your foot goes limp, a heavy enough leg will be able to hold the pedal down a bit depending on where it's positioned relative to the pedal and the leverage it has on the floor.</p> <p>The other major system I'm familiar with for ensuring drivers are alive at the helm is called 'vigilance'. The way it works is that periodically, a light starts flashing on the dash and the driver has to acknowledge that. If they do not, a buzzer alarm starts sounding. If they still don't acknowledge it, the train brakes apply and the driver is assumed incapacitated. Let me tell you some stories of my involvement in it.</p> <p>When we first started, we had a simple vigi system. Every 30 seconds or so (for example), the driver would press a button. Ok cool. Except that then drivers became so hard-wired to pressing the button every 30 seconds that we were having instances of drivers falling asleep/dozing off and still pressing the button right on every 30 seconds because it was so ingrained into them that it was literally a subconscious action.</p> <p>So we introduced random-timing vigilance, where the time varies 30-60 seconds (for example) and you could only acknowledge it within a small period of time once the light started flashing. Again, drivers started falling asleep/semi asleep and would hit it as soon as the alarm buzzed, each and every time.</p> <p>So we introduced random-timing, task-linked vigilance and that finally broke the back of the problem. Now, the driver has to press a button, or turn a knob, or do a number of different activities and they must do that randomly-chosen activity, at a randomly-chosen time, for them to acknowledge their consciousness. It was only at that point that we finally nailed out driver alertness.</p> </blockquote> <p><a href="https://news.ycombinator.com/item?id=12013713">See also</a>.</p> <h4 id="prestige-https-news-ycombinator-com-item-id-11833832"><a href="https://news.ycombinator.com/item?id=11833832">Prestige</a></h4> <blockquote> <p>Curious why he would need to move to a more prestigious position? Most people realize by their 30s that prestige is a sucker's game; it's a way of inducing people to do things that aren't much fun and they wouldn't really want to do on their own, by lauding them with accolades from people they don't really care about.</p> </blockquote> <h4 id="why-is-fedex-based-in-mephis-https-news-ycombinator-com-item-id-9282104"><a href="https://news.ycombinator.com/item?id=9282104">Why is FedEx based in Mephis</a>?</h4> <blockquote> <p>. . . 
we noticed that we also needed:<br> (1) A suitable, existing airport at the hub location.<br> (2) Good weather at the hub location, e.g., relatively little snow, fog, or rain.<br> (3) Access to good ramp space, that is, where to park and service the airplanes and sort the packages.<br> (4) Good labor supply, e.g., for the sort center.<br> (5) Relatively low cost of living to keep down prices.<br> (6) Friendly regulatory environment.<br> (7) Candidate airport not too busy, e.g., don't want arriving planes to have to circle a long time before being able to land.<br> (8) Airport with relatively little in cross winds and with more than one runway to pick from in case of winds.<br> (9) Runway altitude not too high, e.g., not high enough to restrict maximum total gross take off weight, e.g., rule out Denver.<br> (10) No tall obstacles, e.g., mountains, near the ends of the runways.<br> (11) Good supplies of jet fuel.<br> (12) Good access to roads for 18 wheel trucks for exchange of packages between trucks and planes, e.g., so that some parts could be trucked to the hub and stored there and shipped directly via the planes to customers that place orders, say, as late as 11 PM for delivery before 10 AM.<br> So, there were about three candidate locations, Memphis and, as I recall, Cincinnati and Kansas City.<br> The Memphis airport had some old WWII hangers next to the runway that FedEx could use for the sort center, aircraft maintenance, and HQ office space. Deal done -- it was Memphis.<br></p> </blockquote> <h4 id="why-etherpad-joined-wave-and-why-it-didn-t-work-out-as-expected-https-news-ycombinator-com-item-id-2320172"><a href="https://news.ycombinator.com/item?id=2320172">Why etherpad joined Wave, and why it didn’t work out as expected</a></h4> <blockquote> <p>The decision to sell to Google was one of the toughest decisions I and my cofounders ever had to wrestle with in our lives. We were excited by the Wave vision though we saw the flaws in the product. The Wave team told us about how they wanted our help making wave simpler and more like etherpad, and we thought we could help with that, though in the end we were unsuccessful at making wave simpler. We were scared of Google as a competitor: they had more engineers and more money behind this project, yet they were running it much more like an independent startup than a normal big-company department. The Wave office was in Australia and had almost total autonomy. And finally, after 1.5 years of being on the brink of failure with AppJet, it was tempting to be able to declare our endeavor a success and provide a decent return to all our investors who had risked their money on us.</p> <p>In the end, our decision to join Wave did not work out as we had hoped. The biggest lessons learned were that having more engineers and money behind a project can actually be more harmful than helpful, so we were wrong to be scared of Wave as a competitor for this reason. It seems obvious in hindsight, but at the time it wasn't. Second, I totally underestimated how hard it would be to iterate on the Wave codebase. I was used to rewriting major portions of software in a single all-nighter. Because of the software development process Wave was using, it was practically impossible to iterate on the product. I should have done more diligence on their specific software engineering processes, but instead I assumed because they seemed to be operating like a startup, that they would be able to iterate like a startup. 
A lot of the product problems were known to the whole Wave team, but we were crippled by a large complex codebase built on poor technical choices and a cumbersome engineering process that prevented fast iteration.</p> </blockquote> <h4 id="the-accuracy-of-tech-news-https-news-ycombinator-com-item-id-12345365"><a href="https://news.ycombinator.com/item?id=12345365">The accuracy of tech news</a></h4> <blockquote> <p>When I've had inside information about a story that later breaks in the tech press, I'm always shocked at how differently it's perceived by readers of the article vs. how I experienced it. Among startups &amp; major feature launches I've been party to, I've seen: executives that flat-out say that they're not working on a product category when there's been a whole department devoted to it for a year; startups that were founded 1.5 years before the dates listed in Crunchbase/Wikipedia; reporters that count the number of people they meet in a visit and report that as the &quot;team size&quot;, because the company refuses to release that info; funding rounds that never make it to the press; acquisitions that are reported as &quot;for an undisclosed sum&quot; but actually are less than the founders would've made if they'd taken a salaried job at the company; project start dates that are actually when the project was staffed up to its current size and ignore the year or so that a small team spent working on the problem (or the 3-4 years that other small teams spent working on the problem); and algorithms or other technologies that are widely reported as being the core of the company's success, but actually aren't even used by the company.</p> </blockquote> <h4 id="self-destructing-speakers-from-dell-https-news-ycombinator-com-item-id-7205875"><a href="https://news.ycombinator.com/item?id=7205875">Self-destructing speakers from Dell</a></h4> <blockquote> <p>As the main developer of VLC, we have known about this story for a long time, and this is just Dell putting crap components on their machine and blaming others. Any discussion was impossible with them. So let me explain a bit...</p> <p>In this case, VLC just uses the Windows APIs (DirectSound), and sends signed integers of 16bits (s16) to the Windows Kernel.</p> <p>VLC allows amplification of the INPUT above the sound that was decoded. This is just like replay gain, broken codecs, badly recorded files or post-amplification and can lead to saturation.</p> <p>But this is exactly the same if you put your mp3 file through Audacity and increase it and play with WMP, or if you put a DirectShow filter that amplifies the volume after your codec output. For example, for a long time, VLC ac3 and mp3 codecs were too low (-6dB) compared to the reference output.</p> <p>At worst, this will reduce the dynamics and saturate a lot, but this is not going to break your hardware.</p> <p>VLC does not (and cannot) modify the OUTPUT volume to destroy the speakers. VLC is a Software using the OFFICIAL platform APIs.</p> <p>The issue here is that Dell sound cards output power (roughly proportional to the square of the amplitude) that Dell speakers cannot handle. Simply said, the sound card outputs at max 10W, and the speakers only can take 6W in, and neither their BIOS nor their drivers block this.</p> <p>And as VLC is present on a lot of machines, it's simple to blame VLC.
&quot;Correlation does not mean causation&quot; is something that seems too complex for cheap Dell support…</p> </blockquote> <h4 id="learning-on-the-job-startups-vs-big-companies-https-news-ycombinator-com-item-id-8278533"><a href="https://news.ycombinator.com/item?id=8278533">Learning on the job, startups vs. big companies</a></h4> <blockquote> <p>Working for someone else's startup, I learned how to quickly cobble solutions together. I learned about uncertainty and picking a direction regardless of whether you're sure it'll work. I learned that most startups fail, and that when they fail, the people who end up doing well are the ones who were looking out for their own interests all along. I learned a lot of basic technical skills, how to write code quickly and learn new APIs quickly and deploy software to multiple machines. I learned how quickly problems of scaling a development team crop up, and how early you should start investing in automation.</p> <p>Working for Google, I learned how to fix problems once and for all and build that culture into the organization. I learned that even in successful companies, everything is temporary, and that great products are usually built through a lot of hard work by many people rather than great ah-ha insights. I learned how to architect systems for scale, and a lot of practices used for robust, high-availability, frequently-deployed systems. I learned the value of research and of spending a lot of time on a single important problem: many startups take a scattershot approach, trying one weekend hackathon after another and finding nobody wants any of them, while oftentimes there are opportunities that nobody has solved because nobody wants to put in the work. I learned how to work in teams and try to understand what other people want. I learned what problems are really painful for big organizations. I learned how to rigorously research the market and use data to make product decisions, rather than making decisions based on what seems best to one person.</p> </blockquote> <h4 id="we-failed-this-person-what-are-we-going-to-do-differently-https-news-ycombinator-com-item-id-4324615"><a href="https://news.ycombinator.com/item?id=4324615">We failed this person, what are we going to do differently</a>?</h4> <blockquote> <p>Having been in on the company's leadership meetings where departures were noted with a simple 'regret yes/no' flag it was my experience that no single departure had any effect. Mass departures did, trends did, but one person never did, even when that person was a founder.</p> <p>The rationalizations always put the issue back on the departing employee, &quot;They were burned out&quot;, &quot;They had lost their ability to be effective&quot;, &quot;They have moved on&quot;, &quot;They just haven't grown with the company&quot; never was it &quot;We failed this person, what are we going to do differently?&quot;</p> </blockquote> <h4 id="aws-s-origin-story-https-news-ycombinator-com-item-id-3102129"><a href="https://news.ycombinator.com/item?id=3102129">AWS’s origin story</a></h4> <blockquote> <p>Anyway, the SOA effort was in full swing when I was there. It was a pain, and it was a mess because every team did things differently and every API was different and based on different assumptions and written in a different language.</p> <p>But I want to correct the misperception that this lead to AWS. It didn't. S3 was written by its own team, from scratch. 
At the time I was at Amazon, working on the retail site, none of Amazon.com was running on AWS. I know, when AWS was announced, with great fanfare, they said &quot;the services that power Amazon.com can now power your business!&quot; or words to that effect. This was a flat out lie. The only thing they shared was data centers and a standard hardware configuration. Even by the time I left, when AWS was running full steam ahead (and probably running Reddit already), none of Amazon.com was running on AWS, except for a few, small, experimental and relatively new projects. I'm sure more of it has been adopted now, but AWS was always a separate team (and a better managed one, from what I could see.)</p> </blockquote> <h4 id="why-is-windows-so-slow-https-news-ycombinator-com-item-id-3368867"><a href="https://news.ycombinator.com/item?id=3368867">Why is Windows so slow</a>?</h4> <blockquote> <p>I (and others) have put a lot of effort into making the Linux Chrome build fast. Some examples are multiple new implementations of the build system (<a href="http://neugierig.org/software/chromium/notes/2011/02/ninja.h..">http://neugierig.org/software/chromium/notes/2011/02/ninja.h..</a>. ), experimentation with the gold linker (e.g. measuring and adjusting the still off-by-default thread flags <a href="https://groups.google.com/a/chromium.org/group/chromium-dev/..">https://groups.google.com/a/chromium.org/group/chromium-dev/..</a>. ) as well as digging into bugs in it, and other underdocumented things like 'thin' ar archives.</p> <p>But it's also true that people who are more of Windows wizards than I am a Linux apprentice have worked on Chrome's Windows build. If you asked me the original question, I'd say the underlying problem is that on Windows all you have is what Microsoft gives you and you can't typically do better than that. For example, migrating the Chrome build off of Visual Studio would be a large undertaking, large enough that it's rarely considered. (Another way of phrasing this is it's the IDE problem: you get all of the IDE or you get nothing.)</p> <p>When addressing the poor Windows performance people first bought SSDs, something that never even occurred to me (&quot;your system has enough RAM that the kernel cache of the file system should be in memory anyway!&quot;). But for whatever reason on the Linux side some Googlers saw it fit to rewrite the Linux linker to make it twice as fast (this effort predated Chrome), and all Linux developers now get to benefit from that. Perhaps the difference is that when people write awesome tools for Windows or Mac they try to sell them rather than give them away.</p> </blockquote> <h4 id="why-is-windows-so-slow-an-insider-view-http-blog-zorinaq-com-i-contribute-to-the-windows-kernel-we-are-slower-than-other-oper"><a href="http://blog.zorinaq.com/i-contribute-to-the-windows-kernel-we-are-slower-than-other-oper/">Why is Windows so slow, an insider view</a></h4> <blockquote> <p>I'm a developer in Windows and contribute to the NT kernel. (Proof: the SHA1 hash of revision #102 of [Edit: filename redacted] is [Edit: hash redacted].) I'm posting through Tor for obvious reasons.</p> <p>Windows is indeed slower than other operating systems in many scenarios, and the gap is worsening. The cause of the problem is social. There's almost none of the improvement for its own sake, for the sake of glory, that you see in the Linux world.</p> <p>Granted, occasionally one sees naive people try to make things better. These people almost always fail. 
We can and do improve performance for specific scenarios that people with the ability to allocate resources believe impact business goals, but this work is Sisyphean. There's no formal or informal program of systemic performance improvement. We started caring about security because pre-SP3 Windows XP was an existential threat to the business. Our low performance is not an existential threat to the business.</p> <p>See, component owners are generally openly hostile to outside patches: if you're a dev, accepting an outside patch makes your lead angry (due to the need to maintain this patch and to justify in shiproom the unplanned design change), makes test angry (because test is on the hook for making sure the change doesn't break anything, and you just made work for them), and makes PM angry (due to the schedule implications of code churn). There's just no incentive to accept changes from outside your own team. You can always find a reason to say &quot;no&quot;, and you have very little incentive to say &quot;yes&quot;.</p> </blockquote> <h4 id="what-s-the-probability-of-a-successful-exit-by-city-https-news-ycombinator-com-item-id-11987223"><a href="https://news.ycombinator.com/item?id=11987223">What’s the probability of a successful exit by city?</a></h4> <p>See link for giant table :-).</p> <h4 id="the-hiring-crunch-https-news-ycombinator-com-item-id-7260087"><a href="https://news.ycombinator.com/item?id=7260087">The hiring crunch</a></h4> <blockquote> <p>Broken record: startups are also probably rejecting a lot of engineering candidates that would perform as well or better than anyone on their existing team, because tech industry hiring processes are folkloric and irrational.</p> </blockquote> <p>Too long to excerpt. See the link!</p> <h4 id="should-you-leave-a-bad-job-https-news-ycombinator-com-item-id-7789438"><a href="https://news.ycombinator.com/item?id=7789438">Should you leave a bad job?</a></h4> <blockquote> <p>I am a 42-year-old very successful programmer who has been through a lot of situations in my career so far, many of them highly demotivating. And the best advice I have for you is to get out of what you are doing. Really. Even though you state that you are not in a position to do that, you really are. It is okay. You are free. Okay, you are helping your boyfriend's startup but what is the appropriate cost for this? Would he have you do it if he knew it was crushing your soul?</p> <p>I don't use the phrase &quot;crushing your soul&quot; lightly. When it happens slowly, as it does in these cases, it is hard to see the scale of what is happening. But this is a very serious situation and if left unchecked it may damage the potential for you to do good work for the rest of your life.</p> <p>The commenters who are warning about burnout are right. Burnout is a very serious situation. If you burn yourself out hard, it will be difficult to be effective at any future job you go to, even if it is ostensibly a wonderful job. Treat burnout like a physical injury. I burned myself out once and it took at least 12 years to regain full productivity. Don't do it.</p> <ul> <li><p>More broadly, the best and most creative work comes from a root of joy and excitement. If you lose your ability to feel joy and excitement about programming-related things, you'll be unable to do the best work. Note that this issue is separate from and parallel to burnout!
If you are burned out, you might still be able to feel the joy and excitement briefly at the start of a project/idea, but they will fade quickly as the reality of day-to-day work sets in. Alternatively, if you are not burned out but also do not have a sense of wonder, it is likely you will never get yourself started on the good work.</p></li> <li><p>The earlier in your career it is now, the more important this time is for your development. Programmers learn by doing. If you put yourself into an environment where you are constantly challenged and are working at the top threshold of your ability, then after a few years have gone by, your skills will have increased tremendously. It is like going to intensively learn kung fu for a few years, or going into Navy SEAL training or something. But this isn't just a one-time constant increase. The faster you get things done, and the more thorough and error-free they are, the more ideas you can execute on, which means you will learn faster in the future too. Over the long term, programming skill is like compound interest. More now means a LOT more later. Less now means a LOT less later.</p></li> </ul> <p>So if you are putting yourself into a position that is not really challenging, that is a bummer day in and day out, and you get things done slowly, you aren't just having a slow time now. You are bringing down that compound interest curve for the rest of your career. It is a serious problem. If I could go back to my early career I would mercilessly cut out all the shitty jobs I did (and there were many of them).</p> </blockquote> <h4 id="creating-change-when-politically-unpopular-https-news-ycombinator-com-item-id-5541517"><a href="https://news.ycombinator.com/item?id=5541517">Creating change when politically unpopular</a></h4> <blockquote> <p>A small anecdote. An acquaintance related a story of fixing the 'drainage' in their back yard. They were trying to grow some plants that were sensitive to excessive moisture, and the plants were dying. Not watering them, or watering them a little, didn't seem to change anything. They died. A professional gardener suggested that their problem was drainage. So they dug down about 3' (where the soil was very very wet) and tried to build in better drainage. As they were on the side of a hill, water table issues were not considered. It turned out their &quot;problem&quot; was that the water main that fed their house and the houses up the hill was so pressurized at their property (because it had to maintain pressure at the top of the hill too) that the pipe seams were leaking and it was pumping gallons of water into the ground underneath their property. The problem wasn't their garden, the problem was that the city water supply was poorly designed.</p> <p>While I have never been asked if I was an engineer on the phone, I have experienced similar things to Rachel in meetings and with regard to suggestions. Co-workers will create an internal assessment of your value and then respond based on that assessment. If they have written you off they will ignore you; if you prove their assessment wrong in a public forum they will attack you. These are management issues, and something which was sorely lacking in the stories.</p> <p>If you are the &quot;owner&quot; of a meeting, and someone is trying to be heard and isn't, it is incumbent on you to let them be heard. By your position power as &quot;the boss&quot; you can naturally interrupt a discussion to collect more data from other members.
It's also important to ask questions like &quot;does anyone have any concerns?&quot; to draw out people who have valid input but are too timid to share it.</p> <p>In a highly political environment there are two ways to create change: one is through overt manipulation, which is to collect political power to yourself and then exert it to enact change, and the other is covert manipulation, which is to enact change subtly enough that the political organism doesn't react (sometimes called &quot;triggering the antibodies&quot;).</p> <p>The problem with the latter is that if you help make positive change while keeping everyone not pissed off, no one attributes it to you (which is good for the change agent, because if they knew, the antibodies would react, but bad if your manager doesn't recognize it). I asked my manager what change he wanted to be 'true' yet he (or others) had been unsuccessful making true, he gave me one, and 18 months later that change was in place. He didn't believe that I was the one who had made the change. I suggested he pick a change he wanted to happen and not tell me, then in 18 months we could see if that one happened :-). But he also didn't understand enough about organizational dynamics to know that making change without having the source of that change point back at you was even possible.</p> </blockquote> <h4 id="how-to-get-tech-support-from-google-https-news-ycombinator-com-item-id-6837249"><a href="https://news.ycombinator.com/item?id=6837249">How to get tech support from Google</a></h4> <blockquote> <p>Heavily relying on Google product? ✓<br> Hitting a dead-end with Google's customer service? ✓<br> Have an existing audience you can leverage to get some random Google employee's attention? ✓<br> Reach front page of Hacker News? ✓<br> Good news! You should have your problem fixed in 2-5 business days. The rest of us suckers relying on google services get to stare at our inboxes helplessly, waiting for a response to our support ticket (which will never come). I feel like it's almost a [rite] of passage these days to rely heavily on a Google service, only to have something go wrong and be left out in the cold.</p> </blockquote> <h4 id="taking-funding-https-news-ycombinator-com-item-id-7462112"><a href="https://news.ycombinator.com/item?id=7462112">Taking funding</a></h4> <blockquote> <p>IIRC PayPal was very similar - it was sold for $1.5B, but Max Levchin's share was only about $30M, and Elon Musk's was only about $100M. By comparison, many early Web 2.0 darlings (Del.icio.us, Blogger, Flickr) sold for only $20-40M, but their founders had only taken small seed rounds, and so the vast majority of the purchase price went to the founders. 75% of a $40M acquisition = 3% of a $1B acquisition.</p> <p>Something for founders to think about when they're taking funding. If you look at the gigantic tech fortunes - Gates, Page/Brin, Omidyar, Bezos, Zuckerberg, Hewlett/Packard - they usually came from having a company that was already profitable or was already well down the hockey-stick user growth curve and had a clear path to monetization by the time they sought investment. Companies that fight tooth &amp; nail for customers and need lots of outside capital to do it usually have much worse financial outcomes.</p> </blockquote> <h4 id="stackoverflow-vs-experts-exchange-https-news-ycombinator-com-item-id-3656581"><a href="https://news.ycombinator.com/item?id=3656581">StackOverflow vs.
Experts-Exchange</a></h4> <blockquote> <p>A lot of the people who were involved in some way in Experts-Exchange don't understand Stack Overflow.</p> <p>The basic value flow of EE is that &quot;experts&quot; provide valuable &quot;answers&quot; for novices with questions. In that equation there's one person asking a question and one person writing an answer.</p> <p>Stack Overflow recognizes that for every person who asks a question, 100 - 10,000 people will type that same question into Google and find an answer that has already been written. In our equation, we are a community of people writing answers that will be read by hundreds or thousands of people. Ours is a project more like wikipedia -- collaboratively creating a resource for the Internet at large.</p> <p>Because that resource is provided by the community, it belongs to the community. That's why our data is freely available and licensed under creative commons. We did this specifically because of the negative experience we had with EE taking a community-generated resource and deciding to slap a paywall around it.</p> <p>The attitude of many EE contributors, like Greg Young who calculates that he &quot;worked&quot; for half a year for free, is not shared by the 60,000 people who write answers on SO every month. When you talk to them you realize that on Stack Overflow, answering questions is about learning. It's about creating a permanent artifact to make the Internet better. It's about helping someone solve a problem in five minutes that would have taken them hours to solve on their own. It's not about working for free.</p> <p>As soon as EE introduced the concept of money they forced everybody to think of their work on EE as just that -- work.</p> </blockquote> <h4 id="making-money-from-amazon-bots-https-news-ycombinator-com-item-id-2475986"><a href="https://news.ycombinator.com/item?id=2475986">Making money from amazon bots</a></h4> <blockquote> <p>I saw that one of my old textbooks was selling for a nice price, so I listed it along with two other used copies. I priced it $1 cheaper than the lowest price offered, but within an hour both sellers had changed their prices to $.01 and $.02 cheaper than mine. I reduced it two times more by $1, and each time they beat my price by a cent or two. So what I did was reduce my price by a few dollars every hour for one day until everybody was priced under $5. Then I bought their books and changed my price back.</p> </blockquote> <h4 id="what-running-a-business-is-like-https-news-ycombinator-com-item-id-4060229"><a href="https://news.ycombinator.com/item?id=4060229">What running a business is like</a></h4> <blockquote> <p>While I like the sentiment here, I think the danger is that engineers might come to the mistaken conclusion that making pizzas is the primary limiting reagent to running a successful pizzeria. 
Running a successful pizzeria is more about schlepping to local hotels and leaving them 50 copies of your menu to put at the front desk, hiring drivers who will both deliver pizzas in a timely fashion and not embezzle your (razor-thin) profits while also costing next-to-nothing to employ, maintaining a kitchen in sufficient order to pass your local health inspector's annual visit (and dealing with 47 different pieces of paper related to that), being able to juggle priorities like &quot;Do I take out a bank loan to build a new brick-oven, which will make the pizza taste better, in the knowledge that this will commit $3,000 of my cash flow every month for the next 3 years, or do I hire an extra cook?&quot;, sourcing ingredients such that they're available in quantity and quality every day for a fairly consistent price, setting prices such that they're locally competitive for your chosen clientele but generate a healthy gross margin for the business, understanding why a healthy gross margin really doesn't imply a healthy net margin and that the rent still needs to get paid, keeping good-enough records such that you know whether your business is dying before you can't make payroll and such that you can provide a reasonably accurate picture of accounts for the taxation authorities every year, balancing 50% off medium pizza promotions with the desire to not cannibalize the business of your regulars, etc etc, and by the way tomato sauce should be tangy but not sour and cheese should melt with just the faintest whisp of a crust on it.</p> <p>Do you want to write software for a living? Google is hiring. Do you want to run a software business? Godspeed. Software is now 10% of your working life.</p> </blockquote> <h4 id="how-to-handle-mismanagement-https-news-ycombinator-com-item-id-7179946"><a href="https://news.ycombinator.com/item?id=7179946">How to handle mismanagement?</a></h4> <blockquote> <p>The way I prefer to think of it is: it is not your job to protect people (particularly senior management) from the consequences of their decisions. Make your decisions in your own best interest; it is up to the organization to make sure that your interest aligns with theirs.</p> <p>Google used to have a severe problem where code refactoring &amp; maintenance was not rewarded in performance reviews while launches were highly regarded, which led to the effect of everybody trying to launch things as fast as possible and nobody cleaning up the messes left behind. Eventually launches started getting slowed down, Larry started asking &quot;Why can't we have nice things?&quot;, and everybody responded &quot;Because you've been paying us to rack up technical debt.&quot; As a result, teams were formed with the express purpose of code health &amp; maintenance, those teams that were already working on those goals got more visibility, and refactoring contributions started counting for something in perf. Moreover, many ex-Googlers who were fed up with the situation went to Facebook and, I've heard, instituted a culture there where grungy engineering maintenance is valued by your peers.</p> <p>None of this would've happened if people had just heroically fallen on their own sword and burnt out doing work nobody cared about. Sometimes it takes highly visible consequences before people with decision-making power realize there's a problem and start correcting it. 
If those consequences never happen, they'll keep believing it's not a problem and won't pay much attention to it.</p> </blockquote> <h4 id="some-downsides-of-immutability-https-news-ycombinator-com-item-id-9145197"><a href="https://news.ycombinator.com/item?id=9145197">Some downsides of immutability</a></h4> <h4 id="taking-responsibility-https-news-ycombinator-com-item-id-8082029"><a href="https://news.ycombinator.com/item?id=8082029">Taking responsibility</a></h4> <blockquote> <p>The thing my grandfather taught me was that you live with all of your decisions for the rest of your life. When you make decisions which put other people at risk, you take on the risk that you are going to make someone's life harder, possibly much harder. What is perhaps even more important is that no amount of &quot;I'm so sorry I did that ...&quot; will ever undo it. Sometimes it's little things, like taking the last serving because you thought everyone had eaten; sometimes it's big things, like deciding that home is close enough and you're sober enough to get there safely. They are all decisions we make every day. And as I've gotten older the weight of ones I wish I had made differently doesn't get any lighter. You can lie to yourself about your choices, rationalize them, but that doesn't change them either.</p> <p>I didn't understand any of that when I was younger.</p> </blockquote> <h4 id="people-who-aren-t-exactly-lying-https-news-ycombinator-com-item-id-2819426"><a href="https://news.ycombinator.com/item?id=2819426">People who aren’t exactly lying</a></h4> <blockquote> <p>It took me too long to figure this out. There are some people who truly, and passionately, believe something they say to you, and realistically they personally can't make it happen so you can't really bank on that 'promise.'</p> <p>I used to think those people were lying to take advantage, but as I've gotten older I have come to recognize that these 'yes' people get promoted a lot. And for some of them, they really do believe what they are saying.</p> <p>As an engineer I've found that once I can 'calibrate' someone's 'yes-ness' I can then work with them, understanding that they only make 'wishful' commitments rather than 'reasoned' commitments.</p> <p>So when someone, like Steve Jobs, says &quot;we're going to make it an open standard!&quot;, my first question then is &quot;Great, I've got your support in making this an open standard so I can count on you to wield your position influence to aid me when folks line up against that effort, right?&quot; If the answer to that question is no, then they were lying.</p> <p>The difference is subtle of course but important. Steve clearly doesn't go to standards meetings and vote etc, but if Manager Bob gets push back from accounting that he's going to exceed his travel budget by sending 5 guys to the Open Video Chat Working Group which is championing the Facetime protocol as an open standard, then Manager Bob goes to Steve and says &quot;I need your help here, these 5 guys are needed to argue this standard and keep it from being turned into a turd by the 5 guys from Google who are going to attend.&quot; and then Steve whips off a one liner to accounting that says &quot;Get off this guy's back we need this.&quot; Then it's all good.
If on the other hand he says &quot;We gotta save money, send one guy.&quot; well in that case I'm more sympathetic to the accusation of prevarication.</p> </blockquote> <h4 id="what-makes-engineers-productive-https-news-ycombinator-com-item-id-5496914"><a href="https://news.ycombinator.com/item?id=5496914">What makes engineers productive</a>?</h4> <blockquote> <p>For those who work inside Google, it's well worth it to look at Jeff &amp; Sanjay's commit history and code review dashboard. They aren't actually all that much more productive in terms of code written than a decent SWE3 who knows his codebase.</p> <p>The reason they have a reputation as rockstars is that they can apply this productivity to things that really matter; they're able to pick out the really important parts of the problem and then focus their efforts there, so that the end result ends up being much more impactful than what the SWE3 wrote. The SWE3 may spend his time writing a bunch of unit tests that catch bugs that wouldn't really have happened anyway, or migrating from one system to another that isn't really a large improvement, or going down an architectural dead end that'll just have to be rewritten later. Jeff or Sanjay (or any of the other folks operating at that level) will spend their time running a proposed API by clients to ensure it meets their needs, or measuring the performance of subsystems so they fully understand their building blocks, or mentally simulating the operation of the system before building it so they rapidly test out alternatives. They don't actually write more code than a junior developer (oftentimes, they write less), but the code they do write gives them more information, which makes them ensure that they write the rightcode.</p> <p>I feel like this point needs to be stressed a whole lot more than it is, as there's a whole mythology that's grown up around 10x developers that's not all that helpful. In particular, people need to realize that these developers rapidly become 1x developers (or worse) if you don't let them make their own architectural choices - the reason they're excellent in the first place is because they know how to determine if certain work is going to be useless and avoid doing it in the first place. If you dictate that they do it anyway, they're going to be just as slow as any other developer</p> </blockquote> <h4 id="do-the-work-be-a-hero-https-news-ycombinator-com-item-id-2682645"><a href="https://news.ycombinator.com/item?id=2682645">Do the work, be a hero</a></h4> <blockquote> <p>I got the hero speech too, once. If anyone ever mentions the word &quot;heroic&quot; again and there isn't a burning building involved, I will start looking for new employment immediately. 
It seems that in our industry it is universally a code word for &quot;We're about to exploit you because the project is understaffed and under budgeted for time and that is exactly as we planned it so you'd better cowboy up.&quot;</p> <p>Maybe it is different if you're writing Quake, but I guarantee you the 43rd best selling game that year also had programmers &quot;encouraged onwards&quot; by tales of the glory that awaited after the death march.</p> </blockquote> <h4 id="learning-english-from-watching-movies-https-news-ycombinator-com-item-id-7333625"><a href="https://news.ycombinator.com/item?id=7333625">Learning English from watching movies</a></h4> <blockquote> <p>I was once speaking to a good friend of mine here, in English.<br> &quot;Do you want to go out for yakitori?&quot;<br> &quot;Go fuck yourself!&quot;<br> &quot;... [switches to Japanese] Have I recently done anything very major to offend you?&quot;<br> &quot;No, of course not.&quot;<br> &quot;Oh, OK, I was worried. So that phrase, that's something you would only say under extreme distress when you had maximal desire to offend me, or I suppose you could use it jokingly between friends, but neither you nor I generally talk that way.&quot;<br> &quot;I learned it from a movie. I thought it meant ‘No.’&quot;<br></p> </blockquote> <h4 id="being-smart-and-getting-things-done-https-news-ycombinator-com-item-id-2838149"><a href="https://news.ycombinator.com/item?id=2838149">Being smart and getting things done</a></h4> <blockquote> <p>True story: I went to a talk given by one of the 'engineering elders' (these were low Emp# engineers who were considered quite successful and were to be emulated by the workers :-) This person stated that when they came to work at Google they were given the XYZ system to work on (sadly I'm prevented from disclosing the actual system). They remarked how they spent a couple of days looking over the system, which was complicated and creaky; they couldn't figure it out, so they wrote a new system. Yup, and they committed that. This person is a coding God, are they not? (sarcasm) I asked what happened to the old system (I knew but was interested in their perspective) and they said it was still around because a few things still used it, but (quite proudly) nearly everything else had moved to their new system.</p> <p>So if you were reading carefully, this person created a new system to 'replace' an existing system which they didn't understand and got nearly everyone to move to the new system. That made them uber because they got something big to put on their internal resume, and a whole crapload of folks had to write new code to adapt from the old system to this new system, which imperfectly recreated the old system (remember they didn't understand the original), such that those parts of the system that relied on the more obscure bits had yet to be converted (because nobody understood either the dependent code or the old system apparently).</p> <p>Was this person smart? Blindingly brilliant according to some of their peers. Did they get things done? Hell yes, they wrote the replacement for the XYZ system from scratch! One person? Can you imagine? Would I hire them?
Not unless they were the last qualified person in my pool and I was out of time.</p> <p>That anecdote encapsulates the dangerous side of smart people who get things done.</p> </blockquote> <h4 id="public-speaking-tips-https-news-ycombinator-com-item-id-6199544"><a href="https://news.ycombinator.com/item?id=6199544">Public speaking tips</a></h4> <blockquote> <p>Some kids grow up on football. I grew up on public speaking (as behavioral therapy for a speech impediment, actually). If you want to get radically better in a hurry:</p> </blockquote> <p>Too long to excerpt. See the link.</p> <h4 id="a-reason-a-company-can-be-a-bad-fit-https-news-ycombinator-com-item-id-3018643"><a href="https://news.ycombinator.com/item?id=3018643">A reason a company can be a bad fit</a></h4> <blockquote> <p>I can relate to this, but I can also relate to the other side of the question. Sometimes it isn't me, its you. Take someone who gets things done and suddenly in your organization they aren't delivering. Could be them, but it could also be you.</p> <p>I had this experience working at Google. I had a horrible time getting anything done there. Now I spent a bit of time evaluating that since it had never been the case in my career, up to that point, where I was unable to move the ball forward and I really wanted to understand that. The short answer was that Google had developed a number of people who spent much, if not all, of their time preventing change. It took me a while to figure out what motivated someone to be anti-change.</p> <p>The fear was risk and safety. Folks moved around a lot and so you had people in charge of systems they didn't build, didn't understand all the moving parts of, and were apt to get a poor rating if they broke. When dealing with people in that situation one could either educate them and bring them along, or steam roll over them. Education takes time, and during that time the 'teacher' doesn't get anything done. This favors steamrolling evolutionarily :-)</p> <p>So you can hire someone who gets stuff done, but if getting stuff done in your organization requires them to be an asshole, and they aren't up for that, well they aren't going to be nearly as successful as you would like them to be.</p> </blockquote> <h4 id="what-working-at-google-is-like-https-news-ycombinator-com-item-id-3473117"><a href="https://news.ycombinator.com/item?id=3473117">What working at Google is like</a></h4> <blockquote> <p>I can tell that this was written by an outsider, because it focuses on the perks and rehashes several cliches that have made their way into the popular media but aren't all that accurate.</p> <p>Most Googlers will tell you that the best thing about working there is having the ability to work on really hard problems, with really smart coworkers, and lots of resources at your disposal. I remember asking my interviewer whether I could use things like Google's index if I had a cool 20% idea, and he was like &quot;Sure. That's encouraged. Oftentimes I'll just grab 4000 or so machines and run a MapReduce to test out some hypothesis.&quot; My phone screener, when I asked him what it was like to work there, said &quot;It's a place where really smart people go to be average,&quot; which has turned out to be both true and honestly one of the best things that I've gained from working there.</p> </blockquote> <h4 id="nsa-vs-black-hat-https-news-ycombinator-com-item-id-6136338"><a href="https://news.ycombinator.com/item?id=6136338">NSA vs. 
Black Hat</a></h4> <blockquote> <p>This entire event was a staged press op. Keith Alexander is a ~30 year veteran of SIGINT, electronic warfare, and intelligence, and a Four-Star US Army General --- which is a bigger deal than you probably think it is. He's a spy chief in the truest sense and a master politician. Anyone who thinks he walked into that conference hall in Caesars without a near perfect forecast of the outcome of the speech is kidding themselves.</p> <p>Heckling Alexander played right into the strategy. It gave him an opportunity to look reasonable compared to his detractors, and, more generally (and alarmingly), to have the NSA look more reasonable compared to opponents of NSA surveillance. It allowed him to &quot;split the vote&quot; with audience reactions, getting people who probably have serious misgivings about NSA programs to applaud his calm and graceful handling of shouted insults; many of those people probably applauded simply to protest the hecklers, who after all were making it harder for them to follow what Alexander was trying to say.</p> <p>There was no serious Q&amp;A on offer at the keynote. The questions were pre-screened; all attendees could do was vote on them. There was no possibility that anything would come of this speech other than an effectively unchallenged full-throated defense of the NSA's programs.</p> </blockquote> <h4 id="are-deadlines-necessary-https-news-ycombinator-com-item-id-2507946"><a href="https://news.ycombinator.com/item?id=2507946">Are deadlines necessary</a>?</h4> <blockquote> <p>Interestingly one of the things that I found most amazing when I was working for Google was a nearly total inability to grasp the concept of 'deadline.' For so many years the company just shipped it by committing it to the release branch and having the code deploy over the course of a small number of weeks to the 'fleet'.</p> <p>Sure there were 'processes', like &quot;Canary it in some cluster and watch the results for a few weeks before turning it loose on the world.&quot; but being completely vertically integrated is a unique sort of situation.</p> </blockquote> <h4 id="debugging-on-windows-vs-linux-https-news-ycombinator-com-item-id-5125078"><a href="https://news.ycombinator.com/item?id=5125078">Debugging on Windows vs. Linux</a></h4> <blockquote> <p>Being a very experienced game developer who tried to switch to Linux, I have posted about this before (and gotten flamed heavily by reactionary Linux people).</p> <p>The main reason is that debugging is terrible on Linux. gdb is just bad to use, and all these IDEs that try to interface with gdb to &quot;improve&quot; it do it badly (mainly because gdb itself is not good at being interfaced with). Someone needs to nuke this site from orbit and build a new debugger from scratch, and provide a library-style API that IDEs can use to inspect executables in rich and subtle ways.</p> <p>Productivity is crucial. If the lack of a reasonable debugging environment costs me even 5% of my productivity, that is too much, because games take so much work to make. At the end of a project, I just don't have 5% effort left any more. It requires everything. (But the current Linux situation is way more than a 5% productivity drain. 
I don't know exactly what it is, but if I were to guess, I would say it is something like 20%.)</p> </blockquote> <h4 id="what-happens-when-you-become-rich-https-news-ycombinator-com-item-id-7528556"><a href="https://news.ycombinator.com/item?id=7528556">What happens when you become rich</a>?</h4> <blockquote> <p>What is interesting is that people don't even know they have a complex about money until they get &quot;rich.&quot; I've watched many people, perhaps a hundred, go from &quot;working to pay the bills&quot; to &quot;holy crap I can pay all my current and possibly my future bills with the money I now have.&quot; That doesn't include the guy who lived in our neighborhood and won the CA lottery one year.</p> <p>It affects people in ways they don't expect. If its sudden (like lottery winning or sudden IPO surge) it can be difficult to process. But it is an important thing to realize that one is processing an exceptional event. Like having a loved one die or a spouse suddenly divorcing you.</p> <p>Not everyone feels &quot;guilty&quot;, not everyone feels &quot;smug.&quot; A lot of millionaires and billionaires in the Bay Area are outwardly unchanged. But the bottom line is that the emotion comes from the cognitive dissonance between values and reality. What do you value? What is reality?</p> <p>One woman I knew at Google was massively conflicted when she started work at Google. She always felt that she would help the homeless folks she saw, if she had more money than she needed. Upon becoming rich (on Google stock value), now she found that she wanted to save the money she had for her future kids education and needs. Was she a bad person? Before? After? Do your kids hate you if you give away their college education to the local foodbank? Do your peers hate you because you could close the current food gap at the foodbank and you don't?</p> </blockquote> <h4 id="microsoft-s-skype-acquisition-https-news-ycombinator-com-item-id-2531332"><a href="https://news.ycombinator.com/item?id=2531332">Microsoft’s Skype acquisition</a></h4> <blockquote> <p>This is Microsoft's ICQ moment. Overpaying for a company at the moment when its core competency is becoming a commodity. Does anyone have the slightest bit of loyalty to Skype? Of course not. They're going to use whichever video chat comes built into their SmartPhone, tablet, computer, etc. They're going to use FaceBook's eventual video chat service or something Google offers. No one is going to actively seek out Skype when so many alternatives exist and are deeply integrated into the products/services they already use. Certainly no one is going to buy a Microsoft product simply because it has Skype integration. Who cares if it's FaceTime, FaceBook Video Chat, Google Video Chat? It's all the same to the user.</p> <p>With $7B they should have just given away about 15 million Windows Mobile phones in the form of an epic PR stunt. It's not a bad product -- they just need to make people realize it exists. If they want to flush money down the toilet they might as well engage users in the process right?</p> </blockquote> <h4 id="what-happened-to-google-fiber-https-news-ycombinator-com-item-id-12793300"><a href="https://news.ycombinator.com/item?id=12793300">What happened to Google Fiber</a>?</h4> <blockquote> <p>I worked briefly on the Fiber team when it was very young (basically from 2 weeks before to 2 weeks after launch - I was on loan from Search specifically so that they could hit their launch goals). 
The bottleneck when I was there were local government regulations, and in fact Kansas City was chosen because it had a unified city/county/utility regulatory authority that was very favorable to Google. To lay fiber to the home, you either need right-of-ways on the utility poles (which are owned by Google's competitors) or you need permission to dig up streets (which requires a mess of permitting from the city government). In either case, the cable &amp; phone companies were in very tight with local regulators, and so you had hostile gatekeepers whose approval you absolutely needed.</p> <p>The technology was awesome (1G Internet and HDTV!), the software all worked great, and the economics of hiring contractors to lay the fiber itself actually worked out. The big problem was regulatory capture.</p> <p>With Uber &amp; AirBnB's success in hindsight, I'd say that the way to crack the ISP business is to provide your customers with the tools to break the law en masse. For example, you could imagine an ISP startup that basically says &quot;Here's a box, a wire, and a map of other customers' locations. Plug into their jack, and if you can convince others to plug into yours, we'll give you a discount on your monthly bill based on how many you sign up.&quot; But Google in general is not willing to break laws - they'll go right up to the boundary of what the law allows, but if a regulatory agency says &quot;No, you can't do that&quot;, they won't do it rather than fight the agency.</p> <p>Indeed, Fiber is being phased out in favor of Google's acquisition of WebPass, which does basically exactly that but with wireless instead of fiber. WebPass only requires the building owner's consent, and leaves the city out of it.</p> </blockquote> <h4 id="what-it-s-like-to-talk-at-microsoft-s-teched-https-news-ycombinator-com-item-id-6100368"><a href="https://news.ycombinator.com/item?id=6100368">What it's like to talk at Microsoft's TechEd</a></h4> <blockquote> <p>I've spoken at TechEds in the US and Europe, and been in the top 10 for attendee feedback twice.</p> <p>I'd never speak at TechEd again, and I told Microsoft the same thing, same reasons. The event staff is overly demanding and inconsiderate of speaker time. They repeatedly dragged me into mandatory virtual and in-person meetings to cover inane details that should have been covered via email. They mandated the color of pants speakers wore. Just ridiculously micromanaged.</p> </blockquote> <h4 id="why-did-hertz-suddenly-become-so-flaky-https-news-ycombinator-com-item-id-12855236"><a href="https://news.ycombinator.com/item?id=12855236">Why did Hertz suddenly become so flaky</a>?</h4> <blockquote> <p>Hertz laid off nearly the entirety of their rank and file IT staff earlier this year.</p> <p>In order to receive our severance, we were forced to train our IBM replacements, who were in India. Hertz's strategy of IBM and Austerity is the new SMT's solution for a balance sheet that's in shambles, yet they have rewarded themselves by increasing executive compensation 35% over the prior year, including a $6 million bonus to the CIO.</p> <p>I personally landed in an Alphabet company, received a giant raise, and now I get to work on really amazing stuff, so I'm doing fine. 
But to this day I'm sad to think how our once-amazing Hertz team, staffed with really smart people, led by the best boss I ever had, and really driving the innovation at Hertz, was just thrown away like yesterday's garbage.</p> </blockquote> <h4 id="before-startups-put-clauses-in-contracts-forbidden-they-sometimes-blocked-sales-via-backchannel-communications-https-news-ycombinator-com-item-id-10705646"><a href="https://news.ycombinator.com/item?id=10705646">Before startups put clauses in contracts forbidding sales, they sometimes blocked sales via backchannel communications</a></h4> <blockquote> <p>Don't count on definitely being able to sell the stock to finance the taxes. I left after seven years in very good standing (I believed) but when I went to sell, the deal was shut down [1]. Luckily I had a backup plan and I was ok [2].</p> <p>[1] Had a handshake deal with an investor in the company, then the investor went silent on me. When I followed up he said the deal was &quot;just much too small.&quot; I reached out to the company for help, and they said they'd actually told him not to buy from me. I never would have known if they hadn't decided to tell me for some reason. The takeaway is that the markets for private company stock tend to be small, and the buyers care more about their relationships with the company than they do about having your shares. Even if the stock terms allow them to buy, and they might not.</p> </blockquote> <h4 id="an-amazon-pilot-program-designed-to-reduce-the-cost-of-interviewing-https-news-ycombinator-com-item-id-13076821"><a href="https://news.ycombinator.com/item?id=13076821">An Amazon pilot program designed to reduce the cost of interviewing</a></h4> <blockquote> <p>I took the first test just like the OP; the logical reasoning part seemed kind of irrelevant and a waste of time for me. That was nothing compared to the second online test.</p> <p>The environment of the second test was like a scenario out of Black Mirror. Not only did they want to have the webcam and microphone on the entire time, I also had to install their custom software so the proctors could monitor my screen and control my computer. They opened up the macOS system preferences so they could disable all shortcuts to take screenshots, and they also manually closed all the background services I had running (even f.lux!).</p> <p>Then they asked me to pick up my laptop and show them around my room with the webcam. They specifically asked to see the contents of my desk and the walls and ceiling of my room. I had some pencil and paper on my desk to use as scratch paper for the obvious reasons and they told me that wasn't allowed. Obviously that made me a little upset because I use it to sketch out examples and concepts. They also saw my phone on the desk and asked me to put it out of arm's reach.</p> <p>After that they told me I couldn't leave the room until the 5 minute bathroom break allowed half-way through the test. I had forgotten to tell my roommate I was taking this test and he was making a bit of a ruckus playing L4D2 online (obviously a bit distracting). I asked the proctor if I could briefly leave the room to ask him to quiet down. They said I couldn't leave until the bathroom break so there was nothing I could do. Later on, I was busy thinking about a problem and had adjusted how I was sitting in my chair and moved my face slightly out of the camera's view.
The proctor messaged me again telling me to move so they could see my entire face.</p> </blockquote> <h4 id="amazon-interviews-part-2-https-news-ycombinator-com-item-id-13077230"><a href="https://news.ycombinator.com/item?id=13077230">Amazon interviews, part 2</a></h4> <blockquote> <p>The first part of the interview was exactly like the linked experience. No coding questions, just reasoning. For the second part I had to use ProctorU instead of Proctorio. Personally, I thought the experience was super weird but understandable (I'll get to that later): somebody watched me through my webcam the entire time with my microphone on. They needed to check my ID before the test. They needed me to show them the entire room I was in (which was my bedroom). My desktop computer was on behind my laptop so I turned off my computer (I don't remember if I offered to or if they asked me to), but they also asked me to cover my monitors up with something, which I thought was silly after I turned them off, so I covered them with a towel. They then used LogMeIn to remote into my machine so they could check running programs. I quit all my personal chat programs and pretty much only had the Chrome window running.</p> <p>...</p> <p>I didn't talk to a real person who actually worked at Amazon (by email or through webcam) until I received an offer.</p> </blockquote> <h4 id="what-s-getting-acquired-by-oracle-like-https-news-ycombinator-com-item-id-13199717"><a href="https://news.ycombinator.com/item?id=13199717">What's getting acquired by Oracle like</a>?</h4> <blockquote> <p>[M]y company got acquired by Oracle. We thought things would be OK. Nothing changed immediately. Slowly but surely they turned the screws. 5 year laptop replacement policy. You get the corporate standard laptop and you'll like it. Sales? Oh those guys can buy new Macs every two years, they get whatever they want. Then you understand where Software Engineers rank in the company hierarchy. Oracle took the average price of our product from $100k to $5 million for the same size deals. Our sales went from $5-7m to more than $40m with no increase in engineering headcount (team of 15). Didn't matter when bonus time came; we all got stack-ranked and some people got nothing. As a top performer I got a few options, worth maybe $5k.</p> <p>Oracle exists to extract the maximum amount of money possible from the Fortune 1000. Everyone else can fuck off. Your impotent internet rage is meaningless. If it doesn't piss off the CTO of $X then it doesn't matter. If it gets that CTO to cut a bigger check then it will be embraced with extreme enthusiasm.</p> <p>The culture wears down a lot (but not all) of the good people, who then leave. What's left is a lot of mediocrity and architecture astronauts. The more complex the product the better - it means extra consulting dollars!</p> <p>My relative works at a business dependent on Micros. When Oracle announced the acquisition I told them to start on the backup plan immediately because Oracle was going to screw them sooner or later.
A few years on and that is proving true: Oracle is slowly excising the Micros dealers and ISVs out of the picture, gobbling up all the revenue while hiking prices.</p> </blockquote> <h4 id="how-do-you-avoid-hiring-developers-who-do-negative-work-https-news-ycombinator-com-item-id-13209317"><a href="https://news.ycombinator.com/item?id=13209317">How do you avoid hiring developers who do negative work</a>?</h4> <blockquote> <p>In practice, we have to face that our quest for more stringent hiring standards is not really selecting the best, but just selecting fewer people, in ways that might, or might not, have anything to do with being good at a job. Let's go through a few examples in my career:</p> <p>A guy that was the most prolific developer I have ever seen: He'd rewrite entire subsystems over a weekend. The problem is that said subsystems were not necessarily better than when they started, trading bugs for bugs, and anyone that wanted to work on them would have to relearn that programmer's idiosyncrasies of the week. He easily cost his project 12 man/months of work in 4 months, the length of time it took for management to realize that he had to be let go.</p> <p>A company's big UI framework was quite broken, and a new developer came in and fixed it. Great, right? Well, he was handed code review veto over changes to the framework, and his standards and his demeanor made people stop contributing after two or three attempts. In practice, the framework died as people found it antiquated, and they decided to build a new one: Well, the same developer was tasked with building the new framework, which was made mandatory for 200+ developers to use. Total contribution was clearly negative.</p> <p>A developer that was very fast, and wrote working code, had been managing a rather large 500K line codebase, and received some developers as help. He didn't believe in internal documentation or in keeping interfaces stable. He also didn't believe in writing code that wasn't brittle, or in unit tests: Code changes from the new developers often broke things; the veteran would come in, fix everything in the middle of the emergency, and look absolutely great, while all the other developers looked to management as if they were incompetent. They were not, however: they were quite successful when moved to other teams. It just happens that the original developer made sure nobody else could touch anything. Eventually, the experiment was retried after the original developer was sent to do other things. It took a few months, but the new replacement team managed to modularize the code, and new people could actually modify the codebase productively.</p> <p>All of those negative value developers could probably be very valuable in very specific conditions, and they'd look just fine in a tough job interview. They were still terrible hires. In my experience, if anything, a harder process that demands that people appear smarter or work faster in an interview has the opposite effect of what I'd want: it ends up selecting for people that think less and do more quickly, building debt faster.</p> <p>My favorite developers ever all do badly in your typical stringent Silicon Valley interview. They work slower, do more thinking, and consider every line of code they write technical debt. They won't have a million algorithms memorized: They'll go look at sources more often than not, and will spend a lot of time on tests that might as well be documentation.
Very few of those traits are positive in an interview, but I think they are vital in creating good teams, yet few select for them at all.</p> </blockquote> <h4 id="linux-and-the-demise-of-solaris-https-news-ycombinator-com-item-id-13081465"><a href="https://news.ycombinator.com/item?id=13081465">Linux and the demise of Solaris</a></h4> <blockquote> <p>I worked on Solaris for over a decade, and for a while it was usually a better choice than Linux, especially due to price/performance (which includes how many instances it takes to run a given workload). It was worth fighting for, and I fought hard. But Linux has now become technically better in just about every way. Out-of-box performance, tuned performance, observability tools, reliability (on patched LTS), scheduling, networking (including TCP feature support), driver support, application support, processor support, debuggers, syscall features, etc. Last I checked, ZFS worked better on Solaris than Linux, but it's an area where Linux has been catching up. I have little hope that Solaris will ever catch up to Linux, and I have even less hope for illumos: Linux now has around 1,000 monthly contributors, whereas illumos has about 15.</p> <p>In addition to technology advantages, Linux has a community and workforce that's orders of magnitude larger, staff with invested skills (re-education is part of a TCO calculation), companies with invested infrastructure (rewriting automation scripts is also part of TCO), and also much better future employment prospects (a factor that can influence people wanting to work at your company on that OS). Even with my considerable and well-known Solaris expertise, the employment prospects with Solaris are bleak and getting worse every year. With my Linux skills, I can work at awesome companies like Netflix (which I highly recommend), Facebook, Google, SpaceX, etc.</p> <p>Large technology-focused companies, like Netflix, Facebook, and Google, have the expertise and appetite to make a technology-based OS decision. We have dedicated teams for the OS and kernel with deep expertise. On Netflix's OS team, there are three staff who previously worked at Sun Microsystems and have more Solaris expertise than they do Linux expertise, and I believe you'll find similar people at Facebook and Google as well. And we are choosing Linux.</p> <p>The choice of an OS includes many factors. If an OS came along that was better, we'd start with a thorough internal investigation, involving microbenchmarks (including an automated suite I wrote), macrobenchmarks (depending on the expected gains), and production testing using canaries. We'd be able to come up with a rough estimate of the cost savings based on price/performance. Most microservices we have run hot in user-level applications (think 99% user time), not the kernel, so it's difficult to find large gains from the OS or kernel. Gains are more likely to come from off-CPU activities, like task scheduling and TCP congestion, and indirect, like NUMA memory placement: all areas where Linux is leading. It would be very difficult to find a large gain by changing the kernel from Linux to something else. Just based on CPU cycles, the target that should have the most attention is Java, not the OS. But let's say that somehow we did find an OS with a significant enough gain: we'd then look at the cost to switch, including retraining staff, rewriting automation software, and how quickly we could find help to resolve issues as they came up.
Linux is so widely used that there's a good chance someone else has found an issue, had it fixed in a certain version, or documented a workaround.</p> <p>What's left where Solaris/SmartOS/illumos is better? 1. There's more marketing of the features and people. Linux develops great technologies and has some highly skilled kernel engineers, but I haven't seen any serious effort to market these. Why does Linux need to? And 2. Enterprise support. Large enterprise companies where technology is not their focus (e.g., a breakfast cereal company) and who want to outsource these decisions to companies like Oracle and IBM. Oracle still has Solaris enterprise support that I believe is very competitive compared to Linux offerings.</p> </blockquote> <h4 id="why-wasn-t-rethinkdb-more-sucessful-https-news-ycombinator-com-item-id-12902689"><a href="https://news.ycombinator.com/item?id=12902689">Why wasn't RethinkDB more successful</a>?</h4> <blockquote> <p>I'd argue that where RethinkDB fell down is on a step you don't list, &quot;Understand the context of the problem&quot;, which you'd ideally do before figuring out how many people it's a problem for. Their initial idea was a MySQL storage engine for SSDs - the environmental change was that SSD prices were falling rapidly, SSDs have wildly different performance characteristics from disk, and so they figured there was an opportunity to catch the next wave. Only problem is that the biggest corporate buyers of SSDs are gigantic tech companies (e.g., Google, Amazon) with large amounts of proprietary software, and so a generic MySQL storage engine isn't going to be useful to them anyway.</p> <p>Unfortunately they'd already taken funding, built a team, and written a lot of code by the time they found that out, and there's only so far you can pivot when you have an ecosystem like that.</p> </blockquote> <h4 id="on-falsehoods-programmers-believe-about-x-https-news-ycombinator-com-item-id-13260082"><a href="https://news.ycombinator.com/item?id=13260082">On falsehoods programmers believe about X</a></h4> <blockquote> <p>This unfortunately follows the conventions of the genre called &quot;Falsehood programmers believe about X&quot;: ...</p> <p>I honestly think this genre is horrible and counterproductive, even though the writer's intentions are good. It gives no examples, no explanations, no guidelines for proper implementations - just a list of condescending gotchas, showing off the superior intellect and perception of the author.</p> </blockquote> <h4 id="what-does-it-mean-if-a-company-rescinds-an-offer-because-you-tried-to-negotiate-https-news-ycombinator-com-item-id-13189972"><a href="https://news.ycombinator.com/item?id=13189972">What does it mean if a company rescinds an offer because you tried to negotiate</a>?</h4> <blockquote> <p>It happens sometimes. Usually it's because of one of two situations:</p> <p>1) The company was on the fence about wanting you anyway, and negotiating takes you from the &quot;maybe kinda sorta want to work with&quot; to the &quot;don't want to work with&quot; pile.</p> <p>2) The company is looking for people who don't question authority and don't stick up for their own interests.</p> <p>Both of these are red flags.
It's not really a matter of ethics - they're completely within their rights to withdraw an offer for any reason - but it's a matter of &quot;Would you really want to work there anyway?&quot; For both corporations and individuals, it usually leads to a smoother life if you only surround yourself with people who really value you.</p> </blockquote> <h3 id="hn-comments-https-news-ycombinator-com-item-id-13291567"><a href="https://news.ycombinator.com/item?id=13291567">HN comments</a></h3> <blockquote> <p>I feel like this is every HN discussion about &quot;rates, comma, raising them&quot;: a mean-spirited attempt to convince the audience on the site that high rates aren't really possible, because if they were, the person telling you they're possible would be wealthy beyond the dreams of avarice. Once again: Patrick is just offering a more refined and savvy version of advice me and my Matasano friends gave him, and our outcomes are part of the record of a reasonably large public company.</p> <p>This, by the way, is why I'll never write this kind of end-of-year wrap-up post (and, for the same reasons, why I'll never open source code unless I absolutely have to). It's also a big part of what I'm trying to get my hands around for the Starfighter wrap-up post. When we started Starfighter, everyone said &quot;you're going to have such an amazing time because of all the HN credibility you have&quot;. But pretty much every time Starfighter actually came up on HN, I just wanted to hide under a rock. Even when the site is civil, it's still committed to grinding away any joy you take either in accomplishing something neat or even in just sharing something interesting you learned. You could sort of understand an atavistic urge to shit all over someone sharing an interesting experience that was pleasant or impressive. There's a bad Morrissey song about that. But look what happens when you share an interesting story that obviously involved significant unpleasantness and an honest accounting of one's limitations: a giant thread full of people piling on to question your motives and life choices. You can't win.</p> </blockquote> <h4 id="on-the-journalistic-integrity-of-quartz-https-news-ycombinator-com-item-id-13339723"><a href="https://news.ycombinator.com/item?id=13339723">On the journalistic integrity of Quartz</a></h4> <blockquote> <p>I was the first person to be interviewed by this journalist (Michael Thomas @curious_founder). He approached me on Twitter to ask questions about digital nomad and remote work life (as I founded Nomad List and have been doing it for years).</p> <p>I told him it'd be great to see more honest depictions, as most articles are heavily idealized, making it sound all great, when it's not necessarily. It's ups and downs (just like regular life really).</p> <p>What happened next may surprise you. He wrote a hit piece on me, changing my entire story that I told him over Skype into a clickbait article of how digital nomadism doesn't work and one of the main people doing it for a while (en public) even settled down and gave up altogether.</p> <p>I didn't settle down. I spent the summer in Amsterdam. Cause you know, it's a nice place! But he needed to say this to make a polarized hit piece with an angle. And that piece went viral. Resulting in me having to tell people daily that I didn't and getting lots of flack. You may understand it doesn't help if your entire startup is about something and a journalist writes a viral piece about how you yourself don't even believe in that anymore.
I contacted the journalist and Quartz but they didn't change a thing.</p> <p>It's great this meant his journalistic breakthrough but it hurt me in the process.</p> <p>I'd argue journalists like this are the whole problem we have these days. The articles they write can't be balanced because they need to get pageviews. Every potential to write something interesting quickly turns into clickbait. It turned me off from being interviewed ever again. Doing my own PR by posting in the comment sections of Hacker News or Reddit seems like a better idea (also see how Elon Musk does exactly this, seems smarter).</p> </blockquote> <h4 id="how-did-click-and-clack-always-manage-to-solve-the-problem-https-news-ycombinator-com-item-id-13348944"><a href="https://news.ycombinator.com/item?id=13348944">How did Click and Clack always manage to solve the problem</a>?</h4> <blockquote> <p>Hope this doesn't ruin it for you, but I knew someone who had a problem presented on the show. She called in and reached an answering machine. Someone called her and qualified the problem. Then one of the brothers called and talked to her for a while. Then a few weeks later (there might have been some more calls, I don't know) both brothers called her and talked to her for a while. Her parts of that last call were edited into the radio show so it sounded like she had called and they just figured out the answer on the spot.</p> </blockquote> <h4 id="why-are-so-many-people-down-on-blockchain-https-news-ycombinator-com-item-id-13420777"><a href="https://news.ycombinator.com/item?id=13420777">Why are so many people down on blockchain</a>?</h4> <blockquote> <p>Blockchain is the world's worst database, created entirely to maintain the reputations of venture capital firms who injected hundreds of millions of dollars into a technology whose core defining insight was &quot;You can improve on a Ponzi scam by making it self-organizing and distributed; that gets vastly more distribution, reduces the single point of failure, and makes it censorship-resistant.&quot;</p> <p>That's more robust than I usually phrase things on HN, but you did ask. In slightly more detail:</p> <p>Databases are wonderful things. We have a number which are actually employed in production, at a variety of institutions. They run the world. Meaningful applications run on top of Postgres, MySQL, Oracle, etc etc.</p> <p>No meaningful applications run on top of &quot;blockchain&quot;, because it is a marketing term. You cannot install blockchain just like you cannot install database. (Database sounds much cooler without the definite article, too.) If you pick a particular instantiation of a blockchain-style database, it is a horrible, horrible database.</p> <p>Can I pick on Bitcoin? Let me pick on Bitcoin. Bitcoin is claimed to be a global financial network and ready for production right now. Bitcoin cannot sustain 5 transactions per second, worldwide.</p> <p>You might be sensibly interested in Bitcoin governance if, for some reason, you wanted to use Bitcoin. Bitcoin is a software artifact; it matters to users who makes changes to it and by what process. (Bitcoin is a software artifact, not a protocol, even though the Bitcoin community will tell you differently. There is a single C++ codebase which matters. It is essentially impossible to interoperate with Bitcoin without bugs-and-all replicating that codebase.) Bitcoin governance is captured by approximately 5 people.
This is a robust claim and requires extraordinary evidence.</p> <p>Ordinary evidence would be pointing you, in a handwavy fashion, to the depth of acrimony with regards to raising the block size, which would let Bitcoin scale to the commanding heights of 10 or, nay, 100 transactions per second worldwide.</p> <p>Extraordinary evidence might be pointing you to the time when the entire Bitcoin network was de-facto shut down based on the consensus of N people in an IRC channel. c.f. <a href="https://news.ycombinator.com/item?id=9320989">https://news.ycombinator.com/item?id=9320989</a> This was back in 2013. Long story short: a software update went awry so they rolled back global state by a few hours by getting the right two people to agree to it on a Skype call.</p> <p>But let's get back to discussing that sole technical artifact. Bitcoin has a higher cost-to-value ratio than almost any technology conceivable; the cost to date is the market capitalization of Bitcoin. Because Bitcoin enters through a seigniorage mechanism, every Bitcoin existing was minted as compensation for &quot;securing the integrity of the blockchain&quot; (by doing computationally expensive makework).</p> <p>This cost is high. Today, routine maintenance of the Bitcoin network will cost the network approximately $1.5 million. That's on the order of $3 per write on a maximum committed capacity basis. It will cost another $1.5 million tomorrow, exchange rate depending.</p> <p>(Bitcoin has successfully shifted much of the cost of operating its database to speculators rather than people who actually use Bitcoin for transaction processing. That game of musical chairs has gone on for a while.)</p> <p>Bitcoin has some properties which one does not associate with many databases. One is that write acknowledgments average 5 minutes. Another is that they can stop, non-deterministically, for more than an hour at a time, worldwide, for all users simultaneously. This behavior is by design.</p> </blockquote> <h4 id="how-big-is-the-proprietary-database-market-https-news-ycombinator-com-item-id-13442042"><a href="https://news.ycombinator.com/item?id=13442042">How big is the proprietary database market</a>?</h4> <blockquote> <ol> <li><p>The database market is NOT closed. In fact, we are in a database boom. Since 2009 (the year RethinkDB was founded), there have been over 100 production grade databases released in the market. These span document stores, Key/Value, time series, MPP, relational, in-memory, and the ever increasing &quot;multi model databases.&quot;</p></li> <li><p>Since 2009, over $600 MILLION dollars (publicly announced) has been invested in these database companies (RethinkDB represents 12.2M or about 2%). That's aside from money invested in the bigger established databases.</p></li> <li><p>Almost all of the companies that have raised funding in this period generate revenue from one or more of the following areas:</p></li> </ol> <p>a) exclusive hosting (meaning AWS et al. do not offer this product) b) multi-node/cluster support c) product enhancements d) enterprise support</p> <p>Looking at each of the above revenue paths as executed by RethinkDB:</p> <p>a) RethinkDB never offered a hosted solution. Compose offered a hosted solution in October of 2014. b) RethinkDB didn't support true high availability until the 2.1 release in August 2015. It was released as open source and to my knowledge was not monetized. c/d) I've heard that an enterprise version of RethinkDB was offered near the end.
Enterprise Support is, empirically, a bad approach for a venture backed company. I don't know that RethinkDB ever took this avenue seriously. Correct me if I am wrong.</p> <p>A model that is not popular among RECENT databases but is popular among traditional databases is a standard licensing model (e.g. Oracle, Microsoft SQL Server). Even these are becoming more rare with the advent of A, but never underestimate the licensing market.</p> <p>Again, this is complete conjecture, but I believe RethinkDB failed for a few reasons:</p> <p>1) not pursuing one of the above revenue models early enough. This has serious effects on the order of the feature enhancements (for instance, the HA released in 2015 could have been released earlier at a premium or to help facilitate a hosted solution).</p> <p>2) incorrect priority of enhancements:</p> <p>2a) general database performance never reached the point it needed to. RethinkDB struggled with both write and read performance well into 2015. There was no clear value add in this area compared to many write or read focused databases released around this time.</p> <p>2b) lack of (proper) High Availability for too long.</p> <p>2c) ReQL was not necessary - most developers use ORMs when interacting with SQL. When you venture into analytical queries, we actually seem to make great effort to provide SQL: look at the number of projects or companies that exist to bring SQL to databases and filesystems that don't support it (Hive, Pig, Slam Data, etc).</p> <p>2d) push notifications. This has not been demonstrated to be a clear market need yet. There are a small handful of companies that are promoting development stacks around this, but no database company is doing the same.</p> <p>2e) lack of focus. What was RethinkDB REALLY good at? It pushed ReQL and joins at first, but it lacked HA until 2015, struggled with high write or read loads into 2015. It then started to focus on real time notifications. Again, there just aren't many databases focusing on these areas.</p> <p>My final thought is that RethinkDB didn't raise enough capital. Perhaps this is because of previous points, but without capital, the above can't be corrected. RethinkDB actually raised far less money than basically any other venture backed company in this space during this time.</p> <p>Again, I've never run a database company so my thoughts are just from an outsider. However, I am the founder of a company that provides database integration products so I monitor this industry like a hawk. I simply don't agree that the database market has been &quot;captured.&quot;</p> <p>I expect to see even bigger growth in databases in the future. I'm happy to share my thoughts about what types of databases are working and where the market needs solutions. Additionally, companies are increasingly relying on third-party cloud services for data they previously captured themselves. Anything from payment processing, order fulfillment, traffic analytics, etc. is now being handled by someone else.</p> </blockquote> <h3 id="a-google-maps-employee-s-opinion-on-the-google-maps-pricing-change-https-news-ycombinator-com-item-id-32403665"><a href="https://news.ycombinator.com/item?id=32403665">A Google Maps employee's opinion on the Google Maps pricing change</a></h3> <blockquote> <p>I was a googler working on Google Maps at the time of the API self-immolation.</p> <p>There were strong complaints from within about the price changes.
Obviously everyone couldn't believe what was being planned, and there were countless spreadsheets and reports and SQL queries showing how this was going to shit all over a lot of customers that we'd be guaranteed to lose to a competitor.</p> <p>Management didn't give a shit.</p> <p>I don't know what the rationale was apart from some vague claim about &quot;charging for value&quot;. A lot of users of the API apparently were basically under the free limits or only spending less than 100 USD on API usage so I can kind of understand the line of thought, but I still think they went way too far.</p> <p>I don't know what happened to the architects of the plan. I presume promo.</p> <p>Edit: I should add that this was not a knee-jerk thing or some exec just woke up one day with an idea in their dreams. It was a planned change that took many months to plan and prepare for, with endless preparations and reporting and so on.</p> </blockquote> <h3 id="toc_56">???</h3> <p>How did HN get the commenter base that it has? If you read HN, on any given week, there are at least as many good, substantial comments as there are posts. This is different from every other modern public news aggregator I can find out there, and I don’t really know what the ingredients are that make HN successful.</p> <p>For the last couple years (ish?), the moderation regime has been really active in trying to get a good mix of stories on the front page and in tamping down on gratuitously mean comments. But there was a period of years where the moderation could be described as sparse, arbitrary, and capricious, and while there are fewer “bad” comments now, it doesn’t seem like good moderation actually generates more “good” comments.</p> <p>The ranking scheme seems to penalize posts that have a lot of comments on the theory that flamebait topics will draw a lot of comments. That sometimes prematurely buries stories with good discussion, but much more often, it buries stories that draw pointless flamewars. If you just read HN, it’s hard to see the effect, but if you look at forums that use comments as a positive factor in ranking, the difference is dramatic -- those other forums that boost topics with many comments (presumably on the theory that vigorous discussion should be highlighted) often have content-free flame wars pinned at the top for long periods of time.</p> <p>Something else that HN does that’s different from most forums is that user flags are weighted very heavily. On reddit, a downvote only cancels out an upvote, which means that flamebait topics that draw a lot of upvotes like “platform X is cancer” or “Y is doing some horrible thing” often get pinned to the top of r/programming for an entire day, since the number of people who don’t want to see that is drowned out by the number of people who upvote outrageous stories. If you read the comments for one of the &quot;X is cancer&quot; posts on r/programming, the top comment will almost inevitably be that the post has no content, that the author of the post is a troll who never posts anything with content, and that we'd be better off with less flamebait by the author at the top of r/programming. But the people who will upvote outrage porn outnumber the people who will downvote it, so that kind of stuff dominates aggregators that use raw votes for ranking.
Having flamebait drop off the front page quickly is significant, but it doesn’t seem sufficient to explain why there are so many more well-informed comments on HN than on other forums with roughly similar traffic.</p> <p>Maybe the answer is that people come to HN for the same reason people come to Silicon Valley -- despite all the downsides, there’s a relatively large concentration of experts there across a wide variety of CS-related disciplines. If that’s true, and it’s a combination of path dependence and network effects, that’s pretty depressing since that’s not replicable.</p> <p><em>If you liked this curated list of comments, you'll probably also like <a href="//danluu.com/programming-books/">this list of books</a> and <a href="//danluu.com/programming-blogs/">this list of blogs</a>.</em></p> <p><small> This is part of an experiment where I write up thoughts quickly, without proofing or editing. Apologies if this is less clear than a normal post. This is probably going to be the last post like this, for now, since, by quickly writing up a post whenever I have something that can be written up quickly, I'm building up a backlog of post ideas that require re-reading the literature in an area or running experiments.</p> <p>P.S. <a href="https://twitter.com/danluu">Please suggest other good comments</a>! By their nature, HN comments are much less discoverable than stories, so there are a lot of great comments that I haven't seen. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:B"><p>if you’re one of those people, you’ve probably already thought of this, but maybe consider, at the margin, blogging more and commenting on HN less? As a result of writing this post, I looked through my old HN comments and noticed that I <a href="https://news.ycombinator.com/item?id=6315483">wrote this comment three years ago</a>, which is another way of stating <a href="bimodal-compensation/#appendix-b-wtf">the second half of this post I wrote recently</a>. Comparing the two, I think the HN comment is substantially better written. But, like most HN comments, it got some traffic while the story was still current and is now buried, and AFAICT, nothing really happened as a result of the comment. The blog post, despite being “worse”, has gotten some people to contact me personally, and I’ve had some good discussions about that and other topics as a result. Additionally, people occasionally contact me about older posts I’ve written; I continue to get interesting stuff in my inbox as a result of having written posts years ago. Writing your comment up as a blog post will almost certainly provide more value to you, and if it gets posted to HN, it will probably provide no less value to HN.</p> <p>Steve Yegge has <a href="https://sites.google.com/site/steveyegge2/you-should-write-blogs">a pretty good list of reasons why you should blog</a> that I won’t recapitulate here. And if you’re writing substantial comments on HN, you’re already doing basically everything you’d need to do to write a blog except that you’re putting the text into a little box on HN instead of into a static site generator or some hosted blogging service. BTW, I’m not just saying this for your benefit: my selfish reason for writing this appeal is that I really want to read the Nathan Kurz blog on low-level optimizations, the Jonathan Tang blog on what it’s like to work at startups vs.
big companies, etc.</p> <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> </ol> </div> Programming book recommendations and anti-recommendations programming-books/ Sun, 16 Oct 2016 01:06:34 -0700 programming-books/ <p>There are a lot of “12 CS books every programmer must read” lists floating around out there. That's nonsense. The field is too broad for almost any topic to be required reading for all programmers, and even if a topic is that important, people's learning preferences differ too much for any book on that topic to be the best book on the topic for all people.</p> <p>This is a list of topics and books <a href="https://twitter.com/danluu/status/1574819407058325505">where I've read the book</a>, am familiar enough with the topic to say what you might get out of learning more about the topic, and have read other books and can say why you'd want to read one book over another.</p> <h3 id="algorithms-data-structures-complexity">Algorithms / Data Structures / Complexity</h3> <p>Why should you care? Well, there's the pragmatic argument: even if you never use this stuff in your job, most of the best paying companies will quiz you on this stuff in interviews. On the non-bullshit side of things, I find algorithms to be useful in the same way I find math to be useful. The probability of any particular algorithm being useful for any particular problem is low, but having a general picture of what kinds of problems are solved problems, what kinds of problems are intractable, and when approximations will be effective, is often useful.</p> <h4 id="mcdowell-cracking-the-coding-interview-https-www-amazon-com-gp-product-0984782850-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0984782850-linkid-a34501f41a8ccd1ba8d604198c026551">McDowell; <a href="https://www.amazon.com/gp/product/0984782850/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0984782850&amp;linkId=a34501f41a8ccd1ba8d604198c026551">Cracking the Coding Interview</a></h4> <p>Some problems and solutions, with explanations, matching the level of questions you see in entry-level interviews at Google, Facebook, Microsoft, etc. I usually recommend this book to people who want to pass interviews but not really learn about algorithms. It has just enough to get by, but doesn't really teach you the <em>why</em> behind anything. If you want to actually learn about algorithms and data structures, see below.</p> <h4 id="dasgupta-papadimitriou-and-vazirani-algorithms-https-www-amazon-com-gp-product-0073523402-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0073523402-linkid-51557d68c39707af447ec02339249dd1">Dasgupta, Papadimitriou, and Vazirani; <a href="https://www.amazon.com/gp/product/0073523402/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0073523402&amp;linkId=51557d68c39707af447ec02339249dd1">Algorithms</a></h4> <p>Everything about this book seems perfect to me. It breaks up algorithms into classes (e.g., divide and conquer or greedy), and teaches you how to recognize what kind of algorithm should be used to solve a particular problem. It has a good selection of topics for an intro book, it's the right length to read over a few weekends, and it has exercises that are appropriate for an intro book. 
Additionally, it has sub-questions in the middle of chapters to make you reflect on non-obvious ideas and make sure you don't miss anything.</p> <p>I know some folks don't like it because it's relatively math-y/proof focused. If that's you, you'll probably prefer Skiena.</p> <h4 id="skiena-the-algorithm-design-manual-https-www-amazon-com-gp-product-1848000693-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-1848000693-linkid-59bca0c3da96693a0c5384c97f6e59bb">Skiena; <a href="https://www.amazon.com/gp/product/1848000693/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1848000693&amp;linkId=59bca0c3da96693a0c5384c97f6e59bb">The Algorithm Design Manual</a></h4> <p>The longer, more comprehensive, more practical, less math-y version of Dasgupta. It's similar in that it attempts to teach you how to identify problems, use the correct algorithm, and give a clear explanation of the algorithm. The book is well motivated with “war stories” that show the impact of algorithms in real world programming.</p> <h4 id="clrs-introduction-to-algorithms-https-www-amazon-com-gp-product-0262033844-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0262033844-linkid-f7dc38d160a69e72427774682d9c0bc6">CLRS; <a href="https://www.amazon.com/gp/product/0262033844/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0262033844&amp;linkId=f7dc38d160a69e72427774682d9c0bc6">Introduction to Algorithms</a></h4> <p>This book somehow manages to make it into half of these “N books all programmers must read” lists despite being so comprehensive and rigorous that almost no practitioners actually read the entire thing. It's great as a textbook for an algorithms class, where you get a selection of topics. As a class textbook, it's a nice bonus that it has exercises that are hard enough that they can be used for graduate level classes (about half the exercises from my grad level algorithms class were pulled from CLRS, and the other half were from Kleinberg &amp; Tardos), but this is wildly impractical as a standalone introduction for most people.</p> <p>Just for example, there's an entire chapter on Van Emde Boas trees. They're really neat -- it's a little surprising that a balanced-tree-like structure with <code>O(lg lg n)</code> insert, delete, as well as find, successor, and predecessor is possible, but a first introduction to algorithms shouldn't include Van Emde Boas trees.</p> <h4 id="kleinberg-tardos-algorithm-design-https-www-amazon-com-gp-product-9332518645-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-9332518645-linkid-284dc1e6677bb43997fac393456bbc70">Kleinberg &amp; Tardos; <a href="https://www.amazon.com/gp/product/9332518645/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=9332518645&amp;linkId=284dc1e6677bb43997fac393456bbc70">Algorithm Design</a></h4> <p>Same comments as for CLRS -- it's widely recommended as an introductory book even though it doesn't make sense as an introductory book.
Personally, I found the exposition in Kleinberg to be much easier to follow than in CLRS, but plenty of people find the opposite.</p> <h4 id="demaine-advanced-data-structures-http-courses-csail-mit-edu-6-851-spring14">Demaine; <a href="http://courses.csail.mit.edu/6.851/spring14/">Advanced Data Structures</a></h4> <p>This is a set of lectures and notes and not a book, but if you want a coherent (but not intractably comprehensive) set of material on data structures that you're unlikely to see in most undergraduate courses, this is great. The notes aren't designed to be standalone, so you'll want to watch the videos if you haven't already seen this material.</p> <h4 id="okasaki-purely-functional-data-structures-https-www-amazon-com-gp-product-0521663504-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0521663504-linkid-2638d1c24b973d9ddf71f9968a309fea">Okasaki; <a href="https://www.amazon.com/gp/product/0521663504/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0521663504&amp;linkId=2638d1c24b973d9ddf71f9968a309fea">Purely Functional Data Structures</a></h4> <p>Fun to work through, but, unlike the other algorithms and data structures books, I've yet to be able to apply anything from this book to a problem domain where performance really matters.</p> <p>For a couple years after I read this, when someone would tell me that it's not that hard to reason about the performance of purely functional lazy data structures, I'd ask them about part of a proof that stumped me in this book. I'm not talking about some obscure super hard exercise, either. I'm talking about something that's in the main body of the text that was considered too obvious to the author to explain. No one could explain it. Reasoning about this kind of thing is harder than people often claim.</p> <h4 id="dominus-higher-order-perl-http-hop-perl-plover-com-book">Dominus; <a href="http://hop.perl.plover.com/book/">Higher Order Perl</a></h4> <p>A gentle introduction to functional programming that happens to use Perl. You could probably work through this book just as easily in Python or Ruby.</p> <p>If you keep up with what's trendy, this book might seem a bit dated today, but only because so many of the ideas have become mainstream. 
If you're wondering why you should care about this &quot;functional programming&quot; thing people keep talking about, and some of the slogans you hear don't speak to you or are even off-putting (types are propositions, it's great because it's math, etc.), give this book a chance.</p> <h4 id="levitin-algorithms-https-www-amazon-com-gp-product-0132316811-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0132316811-linkid-26aa7611b1a13ec1c73c2ffe0512a777">Levitin; <a href="https://www.amazon.com/gp/product/0132316811/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0132316811&amp;linkId=26aa7611b1a13ec1c73c2ffe0512a777">Algorithms</a></h4> <p>I ordered this off amazon after seeing these two blurbs: “Other learning-enhancement features include chapter summaries, hints to the exercises, and a detailed solution manual.” and “Student learning is further supported by exercise hints and chapter summaries.” One of these blurbs is even printed on the book itself, but after getting the book, the only self-study resources I could find were some yahoo answers posts asking where you could find hints or solutions.</p> <p>I ended up picking up Dasgupta instead, which was available off an author's website for free.</p> <h4 id="mitzenmacher-upfal-probability-and-computing-randomized-algorithms-and-probabilistic-analysis-https-www-amazon-com-gp-product-0521835402-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0521835402-linkid-d65b4aa027a011a1a5875776aee9b9d7">Mitzenmacher &amp; Upfal; Probability and Computing: <a href="https://www.amazon.com/gp/product/0521835402/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0521835402&amp;linkId=d65b4aa027a011a1a5875776aee9b9d7">Randomized Algorithms and Probabilistic Analysis</a></h4> <p>I've probably gotten more mileage out of this than out of any other algorithms book. A lot of randomized algorithms are <a href="http://danluu.com/2choices-eviction/">trivial to port to other applications</a> and can simplify things a lot.</p> <p>The text has enough of an intro to probability that you don't need to have any probability background. Also, the material on tail bounds (e.g., Chernoff bounds) is useful for a lot of CS theory proofs and isn't covered in the intro probability texts I've seen.</p>
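<p>To give a quick taste of why this material ports so easily (my own toy sketch of the classic “power of two choices” result, not code from the book): assigning each item to the less loaded of two randomly chosen bins keeps the maximum load far lower than picking a single random bin, and a few lines of simulation are enough to see the effect.</p> <pre><code># Toy simulation of the 'power of two choices' load balancing result
# (an illustrative sketch, not code from Mitzenmacher and Upfal).
import random

def max_load(n_items, n_bins, choices):
    bins = [0] * n_bins
    for _ in range(n_items):
        # Sample `choices` candidate bins and put the item in the least loaded one.
        candidates = [random.randrange(n_bins) for _ in range(choices)]
        best = min(candidates, key=lambda b: bins[b])
        bins[best] += 1
    return max(bins)

random.seed(0)
n = 100_000
print('one random choice: ', max_load(n, n, choices=1))   # max load grows like log n / log log n
print('best of two choices:', max_load(n, n, choices=2))  # max load grows like log log n
</code></pre>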
<h4 id="sipser-introduction-to-the-theory-of-computation-https-www-amazon-com-gp-product-113318779x-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-113318779x-linkid-aaee0c6fd8df18a1372efdb8295bb426">Sipser; <a href="https://www.amazon.com/gp/product/113318779X/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=113318779X&amp;linkId=aaee0c6fd8df18a1372efdb8295bb426">Introduction to the Theory of Computation</a></h4> <p>Classic intro to theory of computation. Turing machines, etc. Proofs are often given at an intuitive, “proof sketch”, level of detail. A lot of important results (e.g., Rice's Theorem) are pushed into the exercises, so you really have to do the key exercises. Unfortunately, most of the key exercises don't have solutions, so you can't check your work.</p> <p>For something with a more modern topic selection, maybe see <a href="http://theory.cs.princeton.edu/complexity/">Arora &amp; Barak</a>.</p> <h4 id="bernhardt-computation-https-www-destroyallsoftware-com-screencasts-catalog-introduction-to-computation">Bernhardt; <a href="https://www.destroyallsoftware.com/screencasts/catalog/introduction-to-computation">Computation</a></h4> <p>Covers a few theory of computation highlights. The explanations are delightful and I've watched some of the videos more than once just to watch Bernhardt explain things. Targeted at a general programmer audience with no background in CS.</p> <h4 id="kearns-vazirani-an-introduction-to-computational-learning-theory-https-www-amazon-com-gp-product-0262111934-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0262111934-linkid-373c1d6780793c4552728116254da17c">Kearns &amp; Vazirani; <a href="https://www.amazon.com/gp/product/0262111934/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0262111934&amp;linkId=373c1d6780793c4552728116254da17c">An Introduction to Computational Learning Theory</a></h4> <p>Classic, but dated and riddled with errors, with no errata available. When I wanted to learn this material, I ended up cobbling together notes from a couple of courses, one by <a href="http://www.cs.utexas.edu/~klivans/cl.html">Klivans</a> and one by <a href="https://www.cs.cmu.edu/~avrim/ML14/">Blum</a>.</p> <h3 id="operating-systems">Operating Systems</h3> <p>Why should you care? Having a bit of knowledge about operating systems can save days or weeks of debugging time. This is a regular theme on <a href="http://jvns.ca/">Julia Evans's blog</a>, and I've found the same thing to be true of my experience. I'm hard pressed to think of anyone who builds practical systems and knows a bit about operating systems who hasn't found their operating systems knowledge to be a time saver. However, there's a bias in who reads operating systems books -- it tends to be people who do related work! It's possible you won't get the same thing out of reading these if you do really high-level stuff.</p> <h4 id="silberchatz-galvin-and-gagne-operating-system-concepts-https-www-amazon-com-gp-product-1118063333-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-1118063333-linkid-c60719f7dd6b3355d7f8535a25d74f07">Silberschatz, Galvin, and Gagne; <a href="https://www.amazon.com/gp/product/1118063333/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1118063333&amp;linkId=c60719f7dd6b3355d7f8535a25d74f07">Operating System Concepts</a></h4> <p>This was what we used at Wisconsin before <a href="http://pages.cs.wisc.edu/~remzi/OSTEP/">the comet book</a> became standard. I guess it's ok. It covers concepts at a high level and hits the major points, but it's lacking in technical depth, details on how things work, advanced topics, and clear exposition.</p> <h4 id="cox-kasshoek-and-morris-xv6-https-pdos-csail-mit-edu-6-828-2014-xv6-book-rev8-pdf">Cox, Kaashoek, and Morris; <a href="https://pdos.csail.mit.edu/6.828/2014/xv6/book-rev8.pdf">xv6</a></h4> <p>This book is great! It explains how you can actually implement things in a real system, and it <a href="https://en.wikipedia.org/wiki/Xv6">comes with its own implementation of an OS that you can play with</a>.
By design, the authors favor simple implementations over optimized ones, so the algorithms and data structures used are often quite different from what you see in production systems.</p> <p>This book goes well when paired with a book that talks about how more modern operating systems work, like Love's Linux Kernel Development or Russinovich's Windows Internals.</p> <h4 id="arpaci-dusseau-and-arpaci-dusseau-operating-systems-three-easy-pieces-https-www-amazon-com-gp-product-b0722mjycb-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-b0722mjycb-linkid-ed01591ad6b853092462c17fb987f11e">Arpaci-Dusseau and Arpaci-Dusseau; <a href="https://www.amazon.com/gp/product/B0722MJYCB/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B0722MJYCB&amp;linkId=ed01591ad6b853092462c17fb987f11e">Operating Systems: Three Easy Pieces</a></h4> <p>Nice explanation of a variety of OS topics. Goes into much more detail than any other intro OS book I know of. For example, the chapters on file systems describe the details of multiple real filesystems and discuss the major implementation features of <code>ext4</code>. If I have one criticism about the book, it's that it's very *nix focused. Many things that are described are simply how things are done in *nix and not inherent, but the text mostly doesn't say when something is inherent vs. when it's a *nix implementation detail.</p> <h4 id="love-linux-kernel-development-https-www-amazon-com-gp-product-0672329468-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0672329468-linkid-e20733e35b36d9e4fb51b5db3b74058d">Love; <a href="https://www.amazon.com/gp/product/0672329468/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0672329468&amp;linkId=e20733e35b36d9e4fb51b5db3b74058d">Linux Kernel Development</a></h4> <p>The title can be a bit misleading -- this is basically a book about how the Linux kernel works: how things fit together, what algorithms and data structures are used, etc. I read the 2nd edition, which is now quite dated. The 3rd edition has some updates, but <a href="https://lwn.net/Articles/419855/">introduced some errors and inconsistencies</a>, and is still dated (it was published in 2010, and covers 2.6.34). Even so, it's a nice introduction to how a relatively modern operating system works.</p> <p>The other downside of this book is that the author loses all objectivity any time Linux and Windows are compared. Basically every time they're compared, the author says that Linux has clearly and incontrovertibly made the right choice and that Windows is doing something stupid. On balance, I prefer Linux to Windows, but there are a number of areas where Windows is superior, as well as areas where there's parity but Windows was ahead for years.
You'll never find out what they are from this book, though.</p> <h4 id="russinovich-solomon-and-ionescu-windows-internals-https-www-amazon-com-gp-product-0735648735-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0735648735-linkid-0d5e22fbe4c2130f4f19196ef2082f57">Russinovich, Solomon, and Ionescu; <a href="https://www.amazon.com/gp/product/0735648735/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0735648735&amp;linkId=0d5e22fbe4c2130f4f19196ef2082f57">Windows Internals</a></h4> <p>The most comprehensive book about how a modern operating system works. It just happens to be about Windows. Coming from a *nix background, I found this interesting to read just to see the differences.</p> <p>This is definitely not an intro book, and you should have some knowledge of operating systems before reading this. If you're going to buy a physical copy of this book, you might want to wait until the 7th edition is released (early in 2017).</p> <h4 id="downey-the-little-book-of-semaphores-http-greenteapress-com-wp-semaphores">Downey; <a href="http://greenteapress.com/wp/semaphores/">The Little Book of Semaphores</a></h4> <p>Takes a topic that's normally one or two sections in an operating systems textbook and turns it into its own 300 page book. The book is a series of exercises, a bit like The Little Schemer, but with more exposition. It starts by explaining what a semaphore is, and then has a series of exercises that build up higher level concurrency primitives.</p> <p>This book was very helpful when I first started to write threading/concurrency code. I subscribe to the <a href="http://danluu.com/butler-lampson-1999/">Butler Lampson school of concurrency</a>, which is to say that I prefer to have all the concurrency-related code stuffed into a black box that someone else writes. But sometimes you're stuck writing the black box, and if so, this book has a nice introduction to the style of thinking required to write maybe possibly not totally wrong concurrent code.</p> <p>I wish someone would write a book in this style, but both lower level and higher level. I'd love to see exercises like this, but starting with instruction-level primitives for a couple different architectures with different memory models (say, x86 and Alpha) instead of semaphores. If I'm writing grungy low-level threading code today, I'm overwhelmingly likely to be using <code>c++11</code> threading primitives, so I'd like something that uses those instead of semaphores, which I might have used if I was writing threading code against the <code>Win32</code> API. But since that book doesn't exist, this seems like the next best thing.</p> <p>I've heard that Doug Lea's <a href="https://www.amazon.com/gp/product/0201310090/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0201310090&amp;linkId=72b21afe79c9b254cd4383310cff1e42">Concurrent Programming in Java</a> is also quite good, but I've only taken a quick look at it.</p>
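<p>To give a sense of the flavor of the semaphore exercises, here's the kind of thing you end up building early on: a one-shot barrier for n threads made out of nothing but counting semaphores and a counter (this sketch is mine, not code from Downey's book).</p> <pre><code># A one-shot barrier for n threads built from counting semaphores,
# in the spirit of the early exercises in The Little Book of Semaphores
# (my sketch, not code from the book).
import threading

class Barrier:
    def __init__(self, n):
        self.n = n
        self.count = 0
        self.mutex = threading.Semaphore(1)      # protects count
        self.turnstile = threading.Semaphore(0)  # holds threads until everyone has arrived

    def wait(self):
        self.mutex.acquire()
        self.count += 1
        arrived_last = (self.count == self.n)
        self.mutex.release()
        if arrived_last:
            # The last thread to arrive releases everyone, including itself.
            for _ in range(self.n):
                self.turnstile.release()
        self.turnstile.acquire()

def worker(barrier, i):
    print('thread', i, 'before the barrier')
    barrier.wait()
    print('thread', i, 'after the barrier')

barrier = Barrier(4)
threads = [threading.Thread(target=worker, args=(barrier, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
</code></pre> <p>Getting even something this small right, and then noticing that this version can't safely be reused for a second round without more machinery, is a good warmup for the style of thinking the later exercises demand.</p>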
<h3 id="computer-architecture">Computer architecture</h3> <p>Why should you care? The specific facts and trivia you'll learn will be useful when you're doing low-level performance optimizations, but the real value is learning how to reason about tradeoffs between performance and other factors, whether that's power, cost, size, weight, or something else.</p> <p>In theory, that kind of reasoning should be taught regardless of specialization, but my experience is that comp arch folks are much more likely to “get” that kind of reasoning and do back of the envelope calculations that will save them from throwing away a 2x or 10x (or 100x) factor in performance for no reason. This sounds obvious, but I can think of multiple production systems at large companies that are giving up 10x to 100x in performance which are operating at a scale where even a 2x difference in performance could pay a VP's salary -- all because people didn't think through the performance implications of their design.</p> <h4 id="hennessy-patterson-computer-architecture-a-quantitative-approach">Hennessy &amp; Patterson; Computer Architecture: A Quantitative Approach</h4> <p>This book teaches you how to do systems design with multiple constraints (e.g., performance, TCO, and power) and how to reason about tradeoffs. It happens to mostly do so using microprocessors and supercomputers as examples.</p> <p>New editions of this book have substantive additions and you really want the latest version. For example, the latest version added, among other things, a chapter on data center design, and it answers questions like, <a href="//danluu.com/datacenter-power/">how much opex/capex is spent on power, power distribution, and cooling, and how much is spent on support staff and machines</a>, what's the effect of using lower power machines on tail latency and result quality (bing search results are used as an example), and what other factors should you consider when designing a data center.</p> <p>Assumes some background, but that background is presented in the appendices (which are available online for free).</p> <h4 id="shen-lipasti-modern-processor-design-https-www-amazon-com-gp-product-1478607831-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-1478607831-linkid-415aef1c0fccd4f6b5f931dbd63a1852">Shen &amp; Lipasti; <a href="https://www.amazon.com/gp/product/1478607831/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1478607831&amp;linkId=415aef1c0fccd4f6b5f931dbd63a1852">Modern Processor Design</a></h4> <p>Presents most of what you need to know to architect a high performance Pentium Pro (1995) era microprocessor. That's no mean feat, considering the complexity involved in such a processor. Additionally, presents some more advanced ideas and bounds on how much parallelism can be extracted from various workloads (and how you might go about doing such a calculation).
Has an unusually large section on <a href="http://pharm.ece.wisc.edu/mikko/oldpapers/asplos7.pdf">value prediction</a>, because the authors invented the concept and it was still hot when the first edition was published.</p> <p>For pure CPU architecture, this is probably the best book available.</p> <h4 id="hill-jouppi-and-sohi-readings-in-computer-architecture-https-www-amazon-com-gp-product-1558605398-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-1558605398-linkid-c066a06415728c60470dbee9a41c79fa">Hill, Jouppi, and Sohi; <a href="https://www.amazon.com/gp/product/1558605398/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1558605398&amp;linkId=c066a06415728c60470dbee9a41c79fa">Readings in Computer Architecture</a></h4> <p>Read for historical reasons and to see how much better we've gotten at explaining things. For example, compare Amdahl's paper on Amdahl's law (two pages, with a single non-obvious graph presented, and no formulas), vs. the presentation in a modern textbook (one paragraph, one formula, and maybe one graph to clarify, although it's usually clear enough that no extra graph is needed).</p> <p>This seems to be worse the further back you go; since comp arch is a relatively young field, nothing here is really hard to understand. If you want to see a dramatic example of how we've gotten better at explaining things, compare Maxwell's original paper on Maxwell's equations to a modern treatment of the same material. Fun if you like history, but a bit of a slog if you're just trying to learn something.</p> <h3 id="algorithmic-game-theory-auction-theory-mechanism-design">Algorithmic game theory / auction theory / mechanism design</h3> <p>Why should you care? Some of the world's biggest tech companies run on ad revenue, and those ads are sold through auctions. This field explains how and why they work. Additionally, this material is useful any time you're trying to figure out how to design systems that allocate resources effectively.<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">1</a></sup></p> <p>In particular, incentive compatible mechanism design (roughly, how to create systems that provide globally optimal outcomes when people behave in their own selfish best interest) should be required reading for anyone who designs internal incentive systems at companies. If you've ever worked at a large company that &quot;gets&quot; this and one that doesn't, you'll see that the company that doesn't get it has giant piles of money that are basically being lit on fire because the people who set up incentives created systems that are hugely wasteful. This field gives you the background to understand what sorts of mechanisms give you what sorts of outcomes; reading case studies gives you a very long (and entertaining) list of mistakes that can cost millions or even billions of dollars.</p>
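<p>As a tiny illustration of what incentive compatibility means here (my own sketch, not an example from the books below): in a sealed-bid second-price auction the winner pays the second-highest bid, and you can check by brute force that bidding your true value is always at least as good as any other bid, no matter what the other bidders do.</p> <pre><code># Sealed-bid second-price (Vickrey) auction: the highest bidder wins and pays
# the second-highest bid. Brute-force check that truthful bidding is a dominant
# strategy for one bidder (illustrative sketch with made-up integer values).
import itertools

def payoff(my_value, my_bid, other_bids):
    bids = [my_bid] + list(other_bids)
    top = max(bids)
    if my_bid == top and bids.count(top) == 1:   # win only on a strictly highest bid
        second_price = sorted(bids)[-2]
        return my_value - second_price           # winner pays the second-highest bid
    return 0                                     # losers pay nothing

my_value = 5
possible_bids = range(0, 11)
for other_bids in itertools.product(possible_bids, repeat=2):
    truthful = payoff(my_value, my_value, other_bids)
    best_alternative = max(payoff(my_value, b, other_bids) for b in possible_bids)
    assert truthful == best_alternative   # no deviation ever beats bidding my_value
print('bidding the true value was optimal against every combination of rival bids')
</code></pre>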
<h4 id="krishna-auction-theory-https-www-amazon-com-gp-product-0123745071-ref-as-li-tl-ie-utf8-camp-1789-creative-9325-creativeasin-0123745071-linkcode-as2-tag-abroaview-20-linkid-cc4354c8796c223d9612f5dc12e7afd8">Krishna; <a href="https://www.amazon.com/gp/product/0123745071/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0123745071&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=cc4354c8796c223d9612f5dc12e7afd8">Auction Theory</a></h4> <p>The last time I looked, this was the only game in town for a comprehensive, modern introduction to auction theory. Covers the classic <a href="https://en.wikipedia.org/wiki/Vickrey_auction">second price auction result</a> in the first chapter, and then moves on to cover risk aversion, bidding rings, interdependent values, multiple auctions, asymmetrical information, and other real-world issues.</p> <p>Relatively dry. Unlikely to be motivating unless you're already interested in the topic. Requires an understanding of basic probability and calculus.</p> <h4 id="steighlitz-snipers-shills-and-sharks-ebay-and-human-behavior-https-www-amazon-com-gp-product-0691127131-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0691127131-linkid-a0ed27b464ee65f5551f60e3a796cf26">Steiglitz; <a href="https://www.amazon.com/gp/product/0691127131/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0691127131&amp;linkId=a0ed27b464ee65f5551f60e3a796cf26">Snipers, Shills, and Sharks: eBay and Human Behavior</a></h4> <p>Seems designed as an entertaining introduction to auction theory for the layperson. Requires no mathematical background and relegates math to the small print. Covers maybe 1/10th of the material of Krishna, if that. Fun read.</p> <h4 id="crampton-shoham-and-steinberg-combinatorial-auctions-ftp-cramton-umd-edu-ca-book-cramton-shoham-steinberg-combinatorial-auctions-pdf">Cramton, Shoham, and Steinberg; <a href="ftp://cramton.umd.edu/ca-book/cramton-shoham-steinberg-combinatorial-auctions.pdf">Combinatorial Auctions</a></h4> <p>Discusses things like how FCC spectrum auctions got to be the way they are and how “bugs” in mechanism design can leave hundreds of millions or billions of dollars on the table. This is one of those books where each chapter is by a different author. Despite that, it still manages to be coherent and I didn't mind reading it straight through. It's self-contained enough that you could probably read this without reading Krishna first, but I wouldn't recommend it.</p> <h4 id="shoham-and-leyton-brown-multiagent-systems-algorithmic-game-theoretic-and-logical-foundations-http-www-masfoundations-org">Shoham and Leyton-Brown; <a href="http://www.masfoundations.org/">Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations</a></h4> <p>The title is the worst thing about this book. Otherwise, it's a nice introduction to algorithmic game theory. The book covers basic game theory, auction theory, and other classic topics that CS folks might not already know, and then covers the intersection of CS with these topics. Assumes no particular background in the topic.</p> <h4 id="nisan-roughgarden-tardos-and-vazirani-algorithmic-game-theory-https-www-amazon-com-gp-product-0521872820-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0521872820-linkid-7caf296d20541c2fc320dd423ecff19a">Nisan, Roughgarden, Tardos, and Vazirani; <a href="https://www.amazon.com/gp/product/0521872820/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0521872820&amp;linkId=7caf296d20541c2fc320dd423ecff19a">Algorithmic Game Theory</a></h4> <p>A survey of various results in algorithmic game theory. Requires a fair amount of background (consider reading Shoham and Leyton-Brown first). For example, chapter five is basically Devanur, Papadimitriou, Saberi, and Vazirani's JACM paper, <em>Market Equilibrium via a Primal-Dual Algorithm for a Convex Program</em>, with a bit more motivation and some related problems thrown in.
The exposition is good and the result is interesting (if you're into that kind of thing), but it's not necessarily what you want if you want to read a book straight through and get an introduction to the field.</p> <h3 id="misc">Misc</h3> <h4 id="beyer-jones-petoff-and-murphy-site-reliability-engineering-https-www-amazon-com-gp-product-149192912x-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-149192912x-linkid-3bfad60d002c35ba1809816e159efd77">Beyer, Jones, Petoff, and Murphy; <a href="https://www.amazon.com/gp/product/149192912X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=149192912X&amp;linkId=3bfad60d002c35ba1809816e159efd77">Site Reliability Engineering</a></h4> <p>A description of how Google handles operations. Has the typical Google tone, which is off-putting to a lot of folks with a “traditional” ops background, and assumes that many things can only be done with the SRE model when they can, in fact, be done without going full SRE.</p> <p>For a much longer description, <a href="//danluu.com/google-sre-book/">see this 22 page set of notes on Google's SRE book</a>.</p> <h4 id="fowler-beck-brant-opdyke-and-roberts-refactoring-https-www-amazon-com-gp-product-0201485672-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0201485672-linkid-72ba04417a765056c7e41111361dcfba">Fowler, Beck, Brant, Opdyke, and Roberts; <a href="https://www.amazon.com/gp/product/0201485672/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0201485672&amp;linkId=72ba04417a765056c7e41111361dcfba">Refactoring</a></h4> <p>At the time I read it, it was worth the price of admission for the section on code smells alone. But this book has been so successful that the ideas of refactoring and code smells have become mainstream.</p> <p>Steve Yegge has <a href="https://sites.google.com/site/steveyegge2/ten-great-books">a great pitch for this book</a>:</p> <blockquote> <p>When I read this book for the first time, in October 2003, I felt this horrid cold feeling, the way you might feel if you just realized you've been coming to work for 5 years with your pants down around your ankles. I asked around casually the next day: &quot;Yeah, uh, you've read that, um, Refactoring book, of course, right? Ha, ha, I only ask because I read it a very long time ago, not just now, of course.&quot; Only 1 person of 20 I surveyed had read it. Thank goodness all of us had our pants down, not just me.</p> <p>...</p> <p>If you're a relatively experienced engineer, you'll recognize 80% or more of the techniques in the book as things you've already figured out and started doing out of habit. But it gives them all names and discusses their pros and cons objectively, which I found very useful. And it debunked two or three practices that I had cherished since my earliest days as a programmer. Don't comment your code? Local variables are the root of all evil? Is this guy a madman? 
Read it and decide for yourself!</p> </blockquote> <h4 id="demarco-lister-peopleware-https-www-amazon-com-gp-product-0321934113-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0321934113-linkid-37e2eb92615c6926852e31fa057c3df5">DeMarco &amp; Lister, <a href="https://www.amazon.com/gp/product/0321934113/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0321934113&amp;linkId=37e2eb92615c6926852e31fa057c3df5">Peopleware</a></h4> <p>This book seemed convincing when I read it in college. It even had <a href="//danluu.com/dunning-kruger/">all sorts of studies</a> backing up what they said. No deadlines is better than having deadlines. Offices are better than cubicles. Basically all devs I talk to agree with this stuff.</p> <p>But virtually every successful company is run the opposite way. Even Microsoft is remodeling buildings from individual offices to open plan layouts. Could it be that all of this stuff just doesn't matter that much? If it really is that important, how come companies that are true believers, like Fog Creek, aren't running roughshod over their competitors?</p> <p>This book agrees with my biases and I'd love for this book to be right, but the meta evidence makes me want to re-read this with a critical eye and look up primary sources.</p> <h4 id="drummond-renegades-of-the-empire-https-www-amazon-com-gp-product-0609604163-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0609604163-linkid-367bcc7edf76352c97b6050c04af4aaa">Drummond; <a href="https://www.amazon.com/gp/product/0609604163/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0609604163&amp;linkId=367bcc7edf76352c97b6050c04af4aaa">Renegades of the Empire</a></h4> <p>This book explains how Microsoft's aggressive culture got to be the way it is today. The intro reads:</p> <blockquote> <p>Microsoft didn't necessarily hire clones of Gates (although there were plenty on the corporate campus) so much as recruit those who shared some of Gates's more notable traits -- arrogance, aggressiveness, and high intelligence.</p> <p>…</p> <p>Gates is infamous for ridiculing someone's idea as “stupid”, or worse, “random”, just to see how he or she defends a position. This hostile managerial technique invariably spread through the chain of command and created a culture of conflict.</p> <p>…</p> <p>Microsoft nurtures a Darwinian order where resources are often plundered and hoarded for power, wealth, and prestige. A manager who leaves on vacation might return to find his turf raided by a rival and his project put under a different command or canceled altogether</p> </blockquote> <p>On interviewing at Microsoft:</p> <blockquote> <p>“What do you like about Microsoft?” “Bill kicks ass”, St. John said. “I like kicking ass. I enjoy the feeling of killing competitors and dominating markets”.</p> <p>…</p> <p>He was unsure how he was doing and thought he stumbled when asked if he was a &quot;people person&quot;. &quot;No, I think most people are idiots&quot;, St. John replied.</p> </blockquote> <p>These answers were exactly what Microsoft was looking for.
They resulted in a strong offer and an aggressive courtship.</p> <p>On developer evangelism at Microsoft:</p> <blockquote> <p>At one time, Microsoft evangelists were also usually chartered with disrupting competitors by showing up at their conferences, securing positions on and then tangling standards committees, and trying to influence the media.</p> <p>…</p> <p>&quot;We're the group at Microsoft whose job is to fuck Microsoft's competitors&quot;</p> </blockquote> <p>Read this book if you're considering a job at Microsoft. Although it's been a long time since the events described in this book, you can still see strains of this culture in Microsoft today.</p> <h4 id="bilton-hatching-twitter-https-www-amazon-com-gp-product-1591847087-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-1591847087-linkid-19754f59bce55c3f588419feffb68c87">Bilton; <a href="https://www.amazon.com/gp/product/1591847087/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1591847087&amp;linkId=19754f59bce55c3f588419feffb68c87">Hatching Twitter</a></h4> <p>An entertaining book about the backstabbing, mismanagement, and random firings that happened in Twitter's early days. When I say random, I mean that there were instances where critical engineers were allegedly fired so that the &quot;decider&quot; could show other important people that current management was still in charge.</p> <p>I don't know folks who were at Twitter back then, but I know plenty of folks who were at the next generation of startups in their early days and there are a couple of companies where people had eerily similar experiences. Read this book if you're considering a job at a trendy startup.</p> <h4 id="galenson-old-masters-and-young-geniuses-https-www-amazon-com-gp-product-0691133808-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0691133808-linkid-9c6bc8a67b097a8f79bc4b10e0243fa0">Galenson; <a href="https://www.amazon.com/gp/product/0691133808/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0691133808&amp;linkId=9c6bc8a67b097a8f79bc4b10e0243fa0">Old Masters and Young Geniuses</a></h4> <p>This book is about art and how productivity changes with age, but if its thesis is valid, it probably also applies to programming. Galenson applies statistics to determine the &quot;greatness&quot; of art and then uses that to draw conclusions about how the productivity of artists changes as they age. I don't have time to go over the data in detail, so I'll have to remain skeptical of this until I have more free time, but I think it's interesting reading even for a skeptic.</p> <h3 id="math">Math</h3> <p>Why should you care? From a pure ROI perspective, I doubt learning math is “worth it” for 99% of jobs out there. AFAICT, I use math more often than most programmers, and I don't use it all that often. But having the right math background sometimes comes in handy and I really enjoy learning math.
YMMV.</p> <h4 id="bertsekas-introduction-to-probability-https-www-amazon-com-gp-product-188652923x-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-188652923x-linkid-31f59f124c2c8a7ee7917b01c7fbed52">Bertsekas; <a href="https://www.amazon.com/gp/product/188652923X/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=188652923X&amp;linkId=31f59f124c2c8a7ee7917b01c7fbed52">Introduction to Probability</a></h4> <p>Introductory undergrad text that tends towards intuitive explanations over epsilon-delta rigor. For anyone who cares to do more rigorous derivations, there are some exercises at the back of the book that go into more detail.</p> <p>Has many exercises with available solutions, making this a good text for self-study.</p> <h4 id="ross-a-first-course-in-probability-https-www-amazon-com-gp-product-013603313x-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-013603313x-linkid-ba9a3635504a6aadc9f40d3faf3c8785">Ross; <a href="https://www.amazon.com/gp/product/013603313X/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=013603313X&amp;linkId=ba9a3635504a6aadc9f40d3faf3c8785">A First Course in Probability</a></h4> <p>This is one of those books where they regularly crank out new editions to make students pay for new copies of the book (this is presently priced at a whopping $174 on Amazon)<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">2</a></sup>. This was the standard text when I took probability at Wisconsin, and I literally cannot think of a single person who found it helpful. Avoid.</p> <h4 id="brualdi-introductory-combinatorics-https-www-amazon-com-gp-product-0136020402-ref-as-li-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0136020402-linkid-9910ad28e52b0c86afa82f449644458a">Brualdi; <a href="https://www.amazon.com/gp/product/0136020402/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0136020402&amp;linkId=9910ad28e52b0c86afa82f449644458a">Introductory Combinatorics</a></h4> <p>Brualdi is a great lecturer, one of the best I had in undergrad, but this book was full of errors and not particularly clear. There have been two new editions since I used this book, but according to the Amazon reviews the book still has a lot of errors.</p> <p>For an alternate introductory text, I've heard good things about <a href="https://www.amazon.com/gp/product/B00F5QU8W4/ref=as_li_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B00F5QU8W4&amp;linkId=77558d398300005cc19a20780da38822">Camina &amp; Lewis's book</a>, but I haven't read it myself. 
Also, <a href="http://www.amazon.com/Combinatorial-Problems-Exercises-Chelsea-Publishing/dp/0821842625">Lovasz</a> is a great book on combinatorics, but it's not exactly introductory.</p> <h4 id="apostol-calculus-https-www-amazon-com-gp-product-0471000051-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0471000051-linkid-cf7635e3dcc9e368a8ca72ee7a8af3a0">Apostol; <a href="https://www.amazon.com/gp/product/0471000051/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0471000051&amp;linkId=cf7635e3dcc9e368a8ca72ee7a8af3a0">Calculus</a></h4> <p>Volume 1 covers what you'd expect in a calculus I + calculus II book. Volume 2 covers linear algebra and multivariable calculus. It covers linear algebra before multivariable calculus, which makes multivariable calculus a lot easier to understand.</p> <p>It also makes a lot of sense from a programming standpoint, since a lot of the value I get out of calculus is its applications to approximations, etc., and that's a lot clearer when taught in this sequence.</p> <p>This book is probably a rough intro if you don't have a professor or TA to help you along. The Springer SUMS series tends to be pretty good for self-study introductions to various areas, but I haven't actually read their intro calculus book so I can't actually recommend it.</p> <h4 id="stewart-calculus-https-www-amazon-com-gp-product-0538497815-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0538497815-linkid-e25774196137c043bb638fdf39bcf473">Stewart; <a href="https://www.amazon.com/gp/product/0538497815/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0538497815&amp;linkId=e25774196137c043bb638fdf39bcf473">Calculus</a></h4> <p>Another one of those books where they crank out new editions with trivial changes to make money. This was the standard text for non-honors calculus at Wisconsin, and the result of that was that I taught a lot of people to do complex integrals with the methods covered in Apostol, which are much more intuitive to many folks.</p> <p>This book takes the approach that, for a type of problem, you should pattern match to one of many possible formulas and then apply the formula. Apostol is more about teaching you a few tricks and some intuition that you can apply to a wide variety of problems. I'm not sure why you'd buy this unless you were required to for some class.</p> <h3 id="hardware-basics">Hardware basics</h3> <p>Why should you care? People often claim that, <a href="https://news.ycombinator.com/item?id=12095869">to be a good programmer, you have to understand every abstraction you use</a>. That's nonsense. Modern computing is too complicated for any human to have a real full-stack understanding of what's going on. In fact, one reason modern computing can accomplish what it does is that it's possible to be productive without having a deep understanding of much of the stack that sits below the level you're operating at.</p> <p>That being said, if you're curious about what sits below software, here are a few books that will get you started.</p> <h4 id="nisan-shocken-nand2tetris-http-www-nand2tetris-org">Nisan &amp; Schocken; <a href="http://www.nand2tetris.org/">nand2tetris</a></h4> <p>If you only want to read one single thing, this should probably be it. It's a “101” level intro that goes down to gates and Boolean logic.
As implied by the name, it takes you from <a href="https://en.wikipedia.org/wiki/NAND_gate">NAND gates</a> to a working Tetris program.</p> <h4 id="roth-fundamentals-of-logic-design-https-www-amazon-com-gp-product-0534378048-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0534378048-linkid-c60cb3be0ffa6e190224ffce81ece8f8">Roth; <a href="https://www.amazon.com/gp/product/0534378048/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0534378048&amp;linkId=c60cb3be0ffa6e190224ffce81ece8f8">Fundamentals of Logic Design</a></h4> <p>Much more detail on gates and logic design than you'll see in nand2tetris. The book is full of exercises and appears to be designed to work for self-study. Note that the link above is to the 5th edition. There are newer editions, but they don't seem to be much improved, they have a lot of errors in the new material, and they're much more expensive.</p> <h4 id="weste-harris-and-bannerjee-cmos-vlsi-design-https-www-amazon-com-gp-product-0321547748-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0321547748-linkid-6019b50b8c3474e0939bd14e6eabf6bb">Weste, Harris, and Bannerjee; <a href="https://www.amazon.com/gp/product/0321547748/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0321547748&amp;linkId=6019b50b8c3474e0939bd14e6eabf6bb">CMOS VLSI Design</a></h4> <p>One level below Boolean gates, you get to VLSI, a historical acronym (very large scale integration) that doesn't really have any meaning today.</p> <p>Broader and deeper than the alternatives, with clear exposition. Explores the design space (e.g., the section on adders doesn't just mention a few different types in an ad hoc way, it explores all the tradeoffs you can make). It also has both problems and solutions, which makes it great for self-study.</p> <h4 id="kang-leblebici-cmos-digital-integrated-circuits">Kang &amp; Leblebici; CMOS Digital Integrated Circuits</h4> <p>This was the standard text at Wisconsin way back in the day. It was hard enough to follow that the TA basically re-explained pretty much everything necessary for the projects and the exams. I find that it's ok as a reference, but it wasn't a great book to learn from.</p> <p>Compared to this book, Weste et al. spend a lot more effort talking about tradeoffs in design (e.g., when creating a parallel prefix tree adder, what does it really mean to be at some particular point in the design space?).</p> <h4 id="pierret-semiconductor-device-fundamentals-https-www-amazon-com-gp-product-0201543931-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0201543931-linkid-bb2f39f1f5776a37b62a2c7e5841d065">Pierret; <a href="https://www.amazon.com/gp/product/0201543931/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0201543931&amp;linkId=bb2f39f1f5776a37b62a2c7e5841d065">Semiconductor Device Fundamentals</a></h4> <p>One level below VLSI, you have how transistors actually work.</p> <p>Really beautiful explanation of solid state devices.
The text nails the fundamentals of what you need to know to really understand this stuff (e.g., band diagrams), and then uses those fundamentals along with clear explanations to give you a good mental model of how different types of junctions and devices work.</p> <h4 id="streetman-bannerjee-solid-state-electronic-devices-https-www-amazon-com-gp-product-013149726x-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-013149726x-linkid-467c5ec3606c2fd43280ac8b81cc4c44">Streetman &amp; Bannerjee; <a href="https://www.amazon.com/gp/product/013149726X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=013149726X&amp;linkId=467c5ec3606c2fd43280ac8b81cc4c44">Solid State Electronic Devices</a></h4> <p>Covers the same material as Pierret, but seems to substitute mathematical formulas for the intuitive understanding that Pierret goes for.</p> <h4 id="ida-engineering-electromagnetics-https-www-amazon-com-gp-product-3319078054-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-3319078054-linkid-700743f63b21bf95abcaa8b4f80dee0d">Ida; <a href="https://www.amazon.com/gp/product/3319078054/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=3319078054&amp;linkId=700743f63b21bf95abcaa8b4f80dee0d">Engineering Electromagnetics</a></h4> <p>One level below transistors, you have electromagnetics.</p> <p>Two to three times thicker than other intro texts because it has more worked examples and diagrams. Breaks things down into types of problems and subproblems, making things easy to follow. For self-study, a much gentler introduction than Griffiths or Purcell.</p> <h4 id="shanley-pentium-pro-and-pentium-ii-system-architecture-https-www-amazon-com-gp-product-0201309734-ref-as-li-qf-sp-asin-il-tl-ie-utf8-tag-abroaview-20-camp-1789-creative-9325-linkcode-as2-creativeasin-0201309734-linkid-60d8203080d940d21a7191d5195513a7">Shanley; <a href="https://www.amazon.com/gp/product/0201309734/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0201309734&amp;linkId=60d8203080d940d21a7191d5195513a7">Pentium Pro and Pentium II System Architecture</a></h4> <p>Unlike the other books in this section, this book is about practice instead of theory. It's a bit like Windows Internals, in that it goes into the details of a real, working system. Topics include hardware bus protocols, how I/O actually works (e.g., <a href="https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller">APIC</a>), etc.</p> <p>The problem with a practical introduction is that there's been an exponential increase in complexity ever since the 8080. The further back you go, the easier it is to understand the most important moving parts in the system, and the more irrelevant the knowledge. This book seems like an ok compromise in that the bus and I/O protocols had to handle multiprocessors, and many of the elements that are in modern systems were in these systems, just in a simpler form.</p> <h3 id="not-covered">Not covered</h3> <p>Of the books that I've liked, I'd say this captures at most 25% of the software books and 5% of the hardware books. On average, the books that have been left off the list are more specialized.
This list is also missing many entire topic areas, like PL, practical books on how to learn languages, networking, etc.</p> <p>The reasons for leaving off topic areas vary; I don't have any PL books listed because I don't read PL books. I don't have any networking books because, although I've read a couple, I don't know enough about the area to really say how useful the books are. The vast majority of hardware books aren't included because they cover material that you wouldn't care about unless you were a specialist (e.g., <a href="https://www.amazon.com/gp/product/155860636X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=155860636X&amp;linkId=495e99808323168df92ec8d4b0b31fca">Skew-Tolerant Circuit Design</a> or <a href="https://www.amazon.com/gp/product/0471415391/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0471415391&amp;linkId=8087c8c45edba5ee615b4043388c9adf">Ultrafast Optics</a>). The same goes for areas like math and CS theory, where I left off a number of books that I think are great but have basically zero probability of being useful in my day-to-day programming life, e.g., <a href="http://www.thi.informatik.uni-frankfurt.de/~jukna/EC_Book_2nd/index.html">Extremal Combinatorics</a>. I also didn't include books I didn't read all or most of, unless I stopped because the book was atrocious. This means that I don't list classics I haven't finished like SICP and The Little Schemer, since those books seem fine and I just didn't finish them for one reason or another.</p> <p>This list also doesn't include many books on history and culture, like Inside Intel or Masters of Doom. I'll probably add more at some point, but I've been trying an experiment where I try to write more like Julia Evans (stream of consciousness, fewer or no drafts). I'd have to go back and re-read the books I read 10+ years ago to write meaningful comments, which doesn't exactly fit with the experiment. On that note, since this list is from memory and I got rid of almost all of my books a couple years ago, I'm probably forgetting a lot of books that I meant to add.</p> <p><em>If you liked this, you might also like Thomas Ptacek's <a href="https://www.amazon.com/gp/richpub/listmania/fullview/R2EN4JTQOCHNBA/ref=cm_lm_pthnk_view?ie=UTF8&amp;lm_bb=">Application Security Reading List</a> or <a href="//danluu.com/programming-blogs/">this list of programming blogs</a>, which is written in a similar style</em></p> <p><em>Thanks to @tytr_dev for comments/corrections/discussion.</em></p> <div class="footnotes"> <hr /> <ol> <li id="fn:B">Also, if you play board games, auction theory explains why fixing game imbalance via an auction mechanism is non-trivial and often makes the game worse. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:C">I talked to the author of one of these books. He griped that the used book market destroys revenue from textbooks after a couple years, and that authors don't get much in royalties, so you have to charge a lot of money and keep producing new editions every couple of years to make money. That griping goes double in cases where a new author picks up a classic book that someone else originally wrote, since the original author often has a much larger share of the royalties than the new author, despite doing no work on the later editions.
<a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> </ol> </div> Hiring and the market for lemons hiring-lemons/ Sun, 09 Oct 2016 02:44:14 -0700 hiring-lemons/ <p>Joel Spolsky has a classic blog post on &quot;Finding Great Developers&quot; where he popularized the meme that great developers are impossible to find, a corollary of which is that if you can find someone, they're not great. Joel writes,</p> <blockquote> <p>The great software developers, indeed, the best people in every field, are quite simply never on the market.</p> <p>The average great software developer will apply for, total, maybe, four jobs in their entire career.</p> <p>...</p> <p>If you're lucky, if you're really lucky, they show up on the open job market once, when, say, their spouse decides to accept a medical internship in Anchorage and they actually send their resume out to what they think are the few places they'd like to work at in Anchorage.</p> <p>But for the most part, great developers (and this is almost a tautology) are, uh, great, (ok, it is a tautology), and, usually, prospective employers recognize their greatness quickly, which means, basically, they get to work wherever they want, so they honestly don't send out a lot of resumes or apply for a lot of jobs.</p> <p>Does this sound like the kind of person you want to hire? It should.The corollary of that rule--the rule that the great people are never on the market--is that the bad people--the seriously unqualified--are on the market quite a lot. They get fired all the time, because they can't do their job. Their companies fail--sometimes because any company that would hire them would probably also hire a lot of unqualified programmers, so it all adds up to failure--but sometimes because they actually are so unqualified that they ruined the company. Yep, it happens.</p> <p>These morbidly unqualified people rarely get jobs, thankfully, but they do keep applying, and when they apply, they go to Monster.com and check off 300 or 1000 jobs at once trying to win the lottery.</p> <p>Astute readers, I expect, will point out that I'm leaving out the largest group yet, the solid, competent people. They're on the market more than the great people, but less than the incompetent, and all in all they will show up in small numbers in your 1000 resume pile, but for the most part, almost every hiring manager in Palo Alto right now with 1000 resumes on their desk has the same exact set of 970 resumes from the same minority of 970 incompetent people that are applying for every job in Palo Alto, and probably will be for life, and only 30 resumes even worth considering, of which maybe, rarely, one is a great programmer. OK, maybe not even one.</p> </blockquote> <p>Joel's claim is basically that &quot;great&quot; developers won't have that many jobs compared to &quot;bad&quot; developers because companies will try to keep &quot;great&quot; developers. Joel also posits that companies can recognize prospective &quot;great&quot; developers easily. But these two statements are hard to reconcile. If it's so easy to identify prospective &quot;great&quot; developers, why not try to recruit them? You could just as easily make the case that &quot;great&quot; developers are overrepresented in the market because they have better opportunities and it's the &quot;bad&quot; developers who will cling to their jobs. 
This kind of adverse selection is common in companies that are declining; I saw that in my intern cohort at IBM<sup class="footnote-ref" id="fnref:I"><a rel="footnote" href="#fn:I">1</a></sup>, among other places.</p> <p>Should &quot;good&quot; developers be overrepresented in the market or underrepresented? If we listen to the anecdotal griping about hiring, we might ask if the market for developers is a market for lemons. This idea goes back to Akerlof's Nobel prize winning 1970 paper, &quot;<a href="http://www.econ.yale.edu/~dirkb/teach/pdf/akerlof/themarketforlemons.pdf">The Market for 'Lemons': Quality Uncertainty and the Market Mechanism</a>&quot;. Akerlof takes used car sales as an example, splitting the market into good used cars and bad used cars (bad cars are called &quot;lemons&quot;). If there's no way to distinguish between good cars and lemons, good cars and lemons will sell for the same price. Since buyers can't distinguish between good cars and bad cars, the price they're willing to pay is based on the average quality of cars in the market. Since owners know if their car is a lemon or not, owners of non-lemons won't sell because the average price is driven down by the existence of lemons. This results in a feedback loop which causes lemons to be the only thing available (there's a toy simulation of this unraveling a few paragraphs below).</p> <p>This model is certainly different from Joel's model. Joel's model assumes that &quot;great&quot; developers are sticky -- that they stay at each job for a long time. This comes from two assumptions: first, that it's easy for prospective employers to identify who's &quot;great&quot;, and second, that once someone is identified as &quot;great&quot;, their current employer will do anything to keep them (as in the market for lemons). But the first assumption alone is enough to prevent the developer job market from being a market for lemons. If you can tell that a <em>potential</em> employee is great, you can simply go and offer them twice as much as they're currently making (something that I've seen actually happen). You need an information asymmetry to create a market for lemons, and Joel posits that there's no information asymmetry.</p> <p>If we put aside Joel's argument and look at the job market, there's incomplete information, but both current and prospective employers have incomplete information, and whose information is better varies widely. It's actually quite common for prospective employers to have better information than current employers!</p> <p>Just for example, there's someone I've worked with, let's call him Bob, who's saved two different projects by doing the grunt work necessary to keep the project from totally imploding. The projects were both declared successes, promotions went out, they did <a href="http://www.paulgraham.com/submarine.html">a big PR blitz which involved seeding articles in all the usual suspects: Wired, Fortune</a>, and so on and so forth. That's worked out great for the people who are good at taking credit for things, but it hasn't worked out so well for Bob. In fact, someone else I've worked with recently mentioned to me that management keeps asking him why Bob takes so long to do simple tasks. The answer is that Bob's busy making sure the services he works on don't have global outages when they launch, but that's not the kind of thing you get credit for in Bob's org. The result of that is that Bob has a network that knows he's great, which makes it easy for him to get a job anywhere else at market rate.
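</p> <p>As an aside, here's a minimal simulation of Akerlof's unraveling argument from a few paragraphs up. This sketch is mine, with made-up numbers: car quality is uniform on [0, 100], owners only sell if the market price is at least their car's quality, and buyers value a car at 1.5x its quality but can only observe the average quality of what's actually for sale:</p> <pre><code># Toy version of Akerlof's lemons feedback loop, with made-up numbers.
qualities = [q + 0.5 for q in range(100)]        # car qualities: 0.5 .. 99.5
price = 1.5 * (sum(qualities) / len(qualities))  # naive price off the overall average

for step in range(12):
    # Owners of cars worth more than the going price withdraw from the market.
    for_sale = [q for q in qualities if q &lt;= price]
    if not for_sale:
        print("market has fully unraveled")
        break
    avg_quality = sum(for_sale) / len(for_sale)
    price = 1.5 * avg_quality  # buyers re-price based on what's actually offered
    print(f"round {step}: {len(for_sale)} cars offered, price falls to {price:.1f}")
</code></pre> <p>With these particular numbers, each round of re-pricing drives more good cars out and the market shrinks toward nothing but the worst cars; the point is only to make the feedback loop concrete, not to model the real used car (or developer) market.</p> <p>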
But Bob's management chain has no idea, and based on what I've seen of offers today, they're paying him about half what he could make elsewhere. There's no shortage of cases where information transfer inside a company is so poor that external management has a better view of someone's productivity than internal management. I have one particular example in mind, but if I just think of the Bob archetype, off the top of my head, I know of four people who are currently in similar situations. It helps that I currently work at a company that's notorious for being dysfunctional in this exact way, but this happens everywhere. When I worked at a small company, we regularly hired great engineers from big companies that were too clueless to know what kind of talent they had.</p> <p>Another problem with the idea that &quot;great&quot; developers are sticky is that this assumes that companies are capable of creating groups that developers want to work for on demand. This is usually not the case. Just for example, I once joined a team where the TL was pretty strongly against using version control or having tests. As a result of those (and other) practices, it took five devs one year to produce 10k lines of kinda-sorta working code for a straightforward problem. Additionally, it was a pressure cooker where people were expected to put in 80+ hour weeks, and the PM would shame people into putting in longer hours. Within a year, three of the seven people who were on the team when I joined had left; two of them went to different companies. The company didn't want to lose those two people, but it wasn't capable of creating an environment that would keep them.</p> <p>Around when I joined that team, a friend of mine joined a really great team. They do work that materially impacts the world, they have room for freedom and creativity, a large component of their jobs involves learning new and interesting things, and so on and so forth. Whenever I heard about someone who was looking for work, I'd forward them to that team. That team is now full for the foreseeable future because everyone whose network included that team forwarded people into that team. But if you look at the team that lost three out of seven people in a year, that team is hiring. A lot. The result of this dynamic is that, as a dev, if you join a random team, you're overwhelmingly likely to join a team that has a lot of churn. Additionally, if you know of a good team, it's likely to be full.</p> <p>Joel's model implicitly assumes that, proportionally, there are many more dysfunctional developers than dysfunctional work environments.</p> <p>At the last conference I attended, I asked most people I met two questions:</p> <ol> <li>Do you know of any companies that aren't highly dysfunctional?</li> <li>Do you know of any particular teams that are great and are hiring?</li> </ol> <p>Not one single person told me that their company meets the criteria in (1). A few people suggested that, maybe, Dropbox is ok, or that, maybe, Jane Street is ok, but the answers were of the form &quot;I know a few people there and I haven't heard any terrible horror stories yet, plus I sometimes hear good stories&quot;, not &quot;that company is great and you should definitely work there&quot;. Most people said that they didn't know of any companies that weren't <a href="//danluu.com/wat/">a total mess</a>.</p> <p>A few people had suggestions for (2), but the most common answer was something like &quot;LOL no, if I knew that I'd go work there&quot;.
The second most common answer was of the form &quot;I know some people on the Google Brain team and it sounds great&quot;. There are a few teams that are well known for being great places to work, but they're so few and far between that it's basically impossible to get a job on one of those teams. A few people knew of actual teams that they'd strongly recommend and that were hiring, but that was rare. Much rarer than finding a developer who I'd want to work with who would consider moving. If I flipped the question around and asked if they knew of any good developers who were looking for work, the answer was usually &quot;yes&quot;<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">2</a></sup>.</p> <p>Another problem with the idea that &quot;great&quot; developers are impossible to find because they join companies and then stick is that developers (and companies) aren't immutable. Because I've been lucky enough to work in environments that allow people to really flourish, I've seen a lot of people go from unremarkable to amazing. Because most companies invest pretty much nothing in helping people, you can do really well here without investing much effort.</p> <p>On the flip side, I've seen entire teams of devs go on the market because their environment changed. Just for example, I used to know <em>a lot</em> of people who worked at company X under <a href="https://www.linkedin.com/in/marcyun">Marc Yun</a>. It was the kind of place that has low attrition because people really enjoy working there. And then Marc left. Over the next two years, literally everyone I knew who worked there left. This one change both created a lemon in the searching-for-a-team job market and put a bunch of good developers on the market. This kind of thing happens all the time, even more now than in the past because of today's acquisition-heavy environment.</p> <p>Is developer hiring a market for lemons? Well, it depends on what you mean by that. Both developers and hiring managers have incomplete information. It's not obvious if having a market for lemons in one direction makes the other direction better or worse. The fact that joining a new team is uncertain makes developers less likely to leave existing teams, which makes it harder to hire developers. But the fact that developers often join teams which they dislike makes it easier to hire developers. What's the net effect of that? I have no idea.</p> <p>From where I'm standing, it seems really hard to find a good manager/team, and I don't know of any replicable strategy for doing so; I have a lot of sympathy for people who can't find a good fit because I get how hard that is. But I have seen replicable strategies for hiring, so I don't have nearly as much sympathy for hiring managers who complain that hiring &quot;great&quot; developers is impossible.</p> <p>When a hiring manager complains about hiring, in every single case I've seen so far, the hiring manager has one of the following problems:</p> <ol> <li><p>They pay too little. The last time I went looking for work, I found a 6x difference in compensation between companies who might hire me <em>in the same geographic region</em>. Basically all of the companies thought that they were competitive, even when they were at the bottom end of the range. I don't know what it is, but companies always seem to think that they pay well, even when they're not even close to being in the right range. Almost everyone I talk to tells me that they pay as much as any <em>reasonable</em> company.
Sure, there are some companies out there that pay a bit more, but they're overpaying! You can actually see this if you read Joel's writing -- back when he wrote the post I'm quoting above, he talked about how well Fog Creek paid. A couple years later, he complained that Google was overpaying for college kids with no experience, and more recently <a href="https://twitter.com/danluu/status/784929052519837696">he's pretty much said that you don't want to work at companies that pay well</a>.</p></li> <li><p>They pass on good or even &quot;great&quot; developers<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">3</a></sup>. Earlier, I claimed that I knew lots of good developers who are looking for work. You might ask, if there are so many good developers looking for work, why's it so hard to find them? Joel claims that out of 1000 resumes, maybe 30 people will be &quot;solid&quot; and 970 will be &quot;incompetent&quot;. It seems to me it's more like 400 will be solid and 20 will be really good. It's just that almost everyone uses the same filters, so everyone ends up fighting over the 30 people who they think are solid. When people do randomized trials on what actually causes resumes to get filtered out, it often turns out that traits that are tangentially related or unrelated to job performance make huge differences. For example, <a href="http://www-2.rotman.utoronto.ca/facbios/file/RiveraTilcsik.pdf">in this study of law firm recruiting</a>, the authors found that a combination of being male and having &quot;high-class&quot; signifiers on the resume (sailing, polo, and classical music instead of track and field, pick-up soccer, and country music) with no other changes caused a 4x increase in interview invites.</p> <p>The first company I worked at, <a href="https://www.amazon.com/gp/product/B01FSZU6FK/ref=as_li_qf_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01FSZU6FK&amp;linkId=d15e514c6ecefa224be8f05d4d5837e3">Centaur</a>, had an onsite interview process that was less stringent than the phone screen at places like Google and Facebook. If you listen to people like Joel, you'd think that Centaur was full of bozos, but after over a decade in industry (including time at Google), Centaur had the best mean and median level of developer productivity of any place I've worked.</p> <p>Matasano famously solved their hiring problem by using a different set of filters and getting a different set of people. Despite the resounding success of their strategy, pretty much everyone insists on sticking with the standard strategy of picking people with brand name pedigrees and running basically the same interview process as everyone else, <a href="http://danluu.com/programmer-moneyball/">bidding up the price of folks who are trendy and ignoring everyone else</a>.</p> <p>If I look at developers I know who are in high demand today, a large fraction of them went through a multi-year period where they were underemployed and practically begging for interesting work. These people are very easy to hire if you can find them.</p></li> <li><p>They're trying to hire for some combination of rare skills. Right now, if you're trying to hire for someone with experience in deep learning and, well, anything else, you're going to have a bad time.</p></li> <li><p>They're much more dysfunctional than they realize. I know one hiring manager who complains about how hard it is to hire.
What he doesn't realize is that literally everyone on his team is bitterly unhappy and a significant fraction of his team gives anti-referrals to friends and tells them to stay away.</p> <p>That's an extreme case, but it's quite common to see a VP or founder baffled by why hiring is so hard when employees consider the place to be mediocre or even bad.</p></li> </ol> <p>Of these problems, (1), low pay, is both the most common and the simplest to fix.</p> <p>In the past few years, Oracle and Alibaba have spun up new cloud computing groups in Seattle. This is a relatively competitive area, and both companies have reputations that work against them when hiring<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">4</a></sup>. If you believe the complaints about how hard it is to hire, you wouldn't think one company, let alone two, could spin up entire cloud teams in Seattle. Both companies solved the problem by paying substantially more than their competitors were offering for people with similar experience. Alibaba became known for such generous offers that when I was negotiating my offer from Microsoft, MS told me that they'd match an offer from any company except Alibaba. I believe Oracle and Alibaba have hired hundreds of engineers over the past few years.</p> <p>Most companies don't need to hire anywhere near hundreds of people; they can pay competitively without hiring so many developers that the entire market moves upwards, but they still refuse to do so, while complaining about how hard it is to hire.</p> <p>(2), filtering out good potential employees, seems like the modern version of &quot;no one ever got fired for hiring IBM&quot;. If you hire someone with a trendy background who's good at traditional coding interviews and they don't work out, who could blame you? And no one's going to notice all the people you missed out on. Like (1), this is something that almost everyone thinks they do well and they'll say things like &quot;we'd have to lower our bar to hire more people, and no one wants that&quot;. But I've never worked at a place that doesn't filter out a lot of people who end up doing great work elsewhere. I've tried to get underrated programmers<sup class="footnote-ref" id="fnref:U"><a rel="footnote" href="#fn:U">5</a></sup> hired at places I've worked, and I've literally never succeeded in getting one hired. Once, someone I failed to get hired managed to get a job at Google after something like four years of being underemployed (and is a star there). That guy then got me hired at Google. Not hiring that guy didn't only cost them my brilliant friend, it eventually cost them me!</p> <p>BTW, this illustrates a problem with Joel's idea that &quot;great&quot; devs never apply for jobs. There's often a long time period where a &quot;great&quot; dev has an extremely hard time getting hired, even through their network who knows that they're great, because they don't look like what people think &quot;great&quot; developers look like. Additionally, Google, which has heavily studied which hiring channels give good results, has found that referrals and internal recommendations don't actually generate much signal. While people will refer &quot;great&quot; devs, they'll also refer terrible ones. The referral bonus scheme that most companies set up skews incentives in a way that makes referrals worse than you might expect.
Because of this and other problems, many companies don't weight referrals particularly heavily, and &quot;great&quot; developers still go through the normal hiring process, just like everyone else.</p> <p>(3), needing a weird combination of skills, can be solved by hiring people with half or a third of the expertise you need and training people. People don't seem to need much convincing on this one, and I see this happen all the time.</p> <p>(4), dysfunction <a href="//danluu.com/learning-to-program/#fixing-totally-broken-danluu-com-wat-situations">seems hard to fix</a>. If I knew how to do that, I'd be a manager.</p> <p>As a dev, it seems to me that teams I know of that are actually good environments that pay well have no problems hiring, and that teams that have trouble hiring can pretty easily solve that problem. But I'm biased. I'm not a hiring manager. There's probably some hiring manager out there thinking: &quot;every developer I know who complains that it's hard to find a good team has one of these four obvious problems; if only my problems were that easy to solve!&quot;</p> <p><small> Thanks to Leah Hanson, David Turner, Tim Abbott, Vaibhav Sagar, Victor Felder, Ezekiel Smithburg, Juliano Bortolozzo Solanho, Stephen Tu, Pierre-Yves Baccou, Jorge Montero, Alkin Kaz, Ben Kuhn, and Lindsey Kuper for comments and corrections. </small></p> <p>If you liked this post, you'd probably enjoy <a href="http://danluu.com/tech-discrimination/">this other post on the bogosity of claims that there can't possibly be discrimination in tech hiring</a>.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:I">The folks who stayed describe an environment that's mostly missing mid-level people they'd want to work with. There are lifers who've been there forever and will be there until retirement, and there are new grads who land there at random. But, compared to their competitors, there are relatively few people with 5-15 years of experience. The person I knew who lasted the longest stayed until the 8 year mark, but he started interviewing with an eye on leaving when he found out the other person on his team who was competent was interviewing; neither one wanted to be the only person on the team doing any work, so they raced to get out the door first. <a class="footnote-return" href="#fnref:I"><sup>[return]</sup></a></li> <li id="fn:L">This section kinda makes it sound like I'm looking for work. I'm not looking for work, although I may end up forced into it if my partner takes a job outside of Seattle. <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:H"><p><a href="https://www.youtube.com/watch?v=r8RxkpUvxK0">Moishe Lettvin has a talk I really like</a>, where he talks about a time when he was on a hiring committee and they rejected every candidate that came up, only to find that the &quot;candidates&quot; were actually anonymized versions of their own interviews!</p> <p>The bit about when he first started interviewing at Microsoft should sound familiar to MS folks. As is often the case, he got thrown into the interview with no warning and no preparation. He had no idea what to do and, as a result, wrote up interview feedback that wasn't great. &quot;In classic Microsoft style&quot;, his manager forwarded the interview feedback to the entire team and said &quot;don't do this&quot;. &quot;In classic Microsoft style&quot; is a quote from Moishe, but I've observed the same thing.
I'd like to talk about how we have a tendency to do <em>extremely blameful</em> postmortems and how that warps incentives, but that probably deserves its own post.</p> <p>Well, I'll tell one story, in remembrance of someone who recently left my former team for Google. Shortly after that guy joined, he was in the office on a weekend (a common occurrence on his team). A manager from another team pinged him on chat and asked him to sign off on some code from the other team. The new guy, wanting to be helpful, signed off on the code. On Monday, the new guy talked to his mentor and his mentor suggested that he not help out other teams like that. Later, there was an outage related to the code. In classic Microsoft style, the manager from the other team successfully pushed the blame for the outage from his team to the new guy.</p> <p>Note that this guy isn't included in my 3/7 stat because he joined shortly after I did, and I'm not trying to cherry pick a window with the highest possible attrition.</p> <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> <li id="fn:O">For a while, Oracle claimed that the culture of the Seattle office is totally different from mainline-Oracle culture, but from what I've heard, they couldn't resist Oracle-ifying the Seattle group and that part of the pitch is no longer convincing. <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:U"><p>This footnote is a response to Ben Kuhn, who asked me, what types of devs are underrated and how would you find them? I think this group is diverse enough that there's no one easy way to find them. There are people like &quot;Bob&quot;, who do critical work that's simply not noticed. There are also people who are just terrible at interviewing, like <a href="https://www.linkedin.com/in/jeshua-smith-1a873858">Jeshua Smith</a>. I believe he's only once gotten a performance review that wasn't excellent (that semester, his manager said he could only give out one top rating, and it wouldn't be fair to give it to only one of his two top performers, so he gave them both average ratings). In every place he's worked, he's been well known as someone who you can go to with hard problems or questions, and much higher ranking engineers often go to him for help. I tried to get him hired at two different companies I've worked at and he failed both interviews. He sucks at interviews. My understanding is that his interview performance almost kept him from getting his current job, but his references were so numerous and strong that his current company decided to take a chance on him anyway. But he only had those references because his old org has been disintegrating. His new company picked up <em>a lot</em> of people from his old company, so there were many people at the new company that knew him. He can't get the time of day almost anywhere else. Another person I've tried and failed to get hired is someone I'll call Ashley, who got rejected in the recruiter screening phase at Google for not being technical enough, despite my internal recommendation that she was one of the strongest programmers I knew. But she came from a &quot;nontraditional&quot; background that didn't fit the recruiter's idea of what a programmer looked like, so that was that. Nontraditional is a funny term because it seems like most programmers have a &quot;nontraditional&quot; background, but you know what I mean.</p> <p>There's enough variety here that there isn't one way to find all of these people. 
Having a filtering process that's more like Matasano's and less like Google, Microsoft, Facebook, almost any YC startup you can name, etc., is probably a good start.</p> <a class="footnote-return" href="#fnref:U"><sup>[return]</sup></a></li> </ol> </div> I could do that in a weekend! sounds-easy/ Mon, 03 Oct 2016 01:14:27 -0700 sounds-easy/ <p>I can't think of a single large software company that doesn't regularly draw internet comments of the form “What do all the employees do? I could build their product myself.” <a href="https://bitquabit.com/post/one-which-i-call-out-hacker-news/">Benjamin Pollack</a> and <a href="https://blog.codinghorror.com/code-its-trivial/">Jeff Atwood</a> called out people who do that with Stack Overflow. But <a href="http://nickcraver.com/blog/2016/03/29/stack-overflow-the-hardware-2016-edition/">Stack Overflow is relatively obviously lean</a>, so the general response is something like “oh, sure maybe Stack Overflow is lean, but FooCorp must really be bloated”. And since most people have relatively little visibility into FooCorp, for any given value of FooCorp, that sounds like a plausible statement. After all, what product could possibly require hundreds, or even thousands, of engineers?</p> <p>A few years ago, in the wake of the rapgenius SEO controversy, a number of folks called for someone to write a better Google. Alex Clemmer responded that <a href="http://blog.nullspace.io/building-search-engines.html">maybe building a better Google is a non-trivial problem</a>. Considering how much of Google's $500B market cap comes from search, and how much money has been spent by tens (hundreds?) of competitors in an attempt to capture some of that value, it seems plausible to me that search isn't a trivial problem. But in the comments on Alex's posts, multiple people respond and say that Lucene basically does the same thing Google does and that Lucene is poised to surpass Google's capabilities in the next few years. It's been long enough since then that we can look back and say that Lucene hasn't improved so much that Google is in danger from a startup that puts together a Lucene cluster. If anything, the cost of creating a viable competitor to Google search has gone up.</p> <p>For making a viable Google competitor, I believe that ranking is a harder problem than indexing, but even if we just look at indexing, there are individual domains that contain on the order of one trillion pages we might want to index (like Twitter) and I'd guess that we can find on the order of a trillion domains. If you try to configure any off-the-shelf search index to hold an index of some number of trillions of items to handle a load of, say, 1/100th Google's load, with a latency budget of, say, 100ms (most of the latency should be for ranking, not indexing), I think you'll find that this isn't trivial. And if you use Google to search Twitter, you can observe that, at least for select users or tweets, Google indexes Twitter quickly enough that it's basically real-time from the standpoint of users. Anyone who's tried to do real-time indexing with Lucene on a large corpus under high load will also find this to be non-trivial. You might say that this isn't totally fair since it's possible to find tweets that aren't indexed by major search engines, but if you want to make a call on what to index or not, well, that's also a problem that's non-trivial in the general case.
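</p> <p>To make &quot;this isn't trivial&quot; slightly more concrete, here's the sort of back-of-envelope arithmetic involved. Every number below is an assumption I'm making up for illustration (index bytes per document, usable RAM per machine, what 1/100th of Google's load might mean in queries per second); none of them come from this post:</p> <pre><code># Back-of-envelope sizing for an index of roughly a trillion documents.
# Every constant here is an illustrative assumption, not a measured number.
docs            = 1e12      # "some number of trillions of items"
bytes_per_doc   = 1_000     # assumed average index footprint per document
ram_per_machine = 256e9     # assumed usable index RAM per machine (256 GB)
queries_per_sec = 1_000     # assumed stand-in for "1/100th Google's load"
latency_budget  = 0.100     # 100 ms end-to-end

index_bytes = docs * bytes_per_doc
machines = index_bytes / ram_per_machine
print(f"index size: {index_bytes / 1e15:.0f} PB, "
      f"machines just to hold it in RAM: {machines:,.0f}")

# Queries fan out to every shard, and most of the 100 ms budget is supposed
# to go to ranking, so retrieval only gets a small slice of it.
retrieval_budget = latency_budget * 0.2
print(f"per-shard retrieval budget: {retrieval_budget * 1e3:.0f} ms "
      f"at {queries_per_sec:,} queries/sec across {machines:,.0f} shards")
</code></pre> <p>Even with generous assumptions, you're into thousands of machines and a retrieval budget of a couple tens of milliseconds before you've ranked a single result, which is the point of the paragraph above.</p> <p>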
<p>Businesses that actually care about turning a profit will spend a lot of time (hence, a lot of engineers) working on optimizing systems, even if an MVP for the system could have been built in a weekend. There's also a wide body of research that's found that decreasing latency has a significant effect on revenue over a pretty wide range of latencies for some businesses. Increasing performance also has the benefit of reducing costs. Businesses should keep adding engineers to work on optimization until the cost of adding an engineer equals the revenue gain plus the cost savings at the margin. This is often many more engineers than people realize.</p> <p>And that's just performance. Features also matter: when I talk to engineers working on basically any product at any company, they'll often point out that there are seemingly trivial individual features that can add integer percentage points to revenue. Just as with performance, people underestimate how many engineers you can add to a product before engineers stop paying for themselves.</p> <p>Additionally, features are often much more complex than outsiders realize. If we look at search, how do we make sure that different forms of dates and phone numbers give the same results? How about internationalization? Each language has unique quirks that have to be accounted for. In French, “l'foo” should often match “un foo” and vice versa, but American search engines from the 90s didn't actually handle that correctly. How about tokenizing Chinese queries, where words don't have spaces between them, and sentences don't have unique tokenizations? How about Japanese, where queries can easily contain four different alphabets? How about handling Arabic, which is mostly read right-to-left, except for the bits that are read left-to-right? <a href="https://r12a.github.io/scripts/tutorial/part3">And that's not even the most complicated part of handling Arabic</a>! It's fine to ignore this stuff for a weekend-project MVP, but ignoring it in a real business means ignoring the majority of the market! Some of these are handled ok by open source projects, but many of them involve open research problems.</p> <p>There's also security! If you don't “bloat” your company by hiring security people, you'll end up like Hotmail or Yahoo, where your product is better known for how often it's hacked than for any of its other features.</p> <p>Everything we've looked at so far is a technical problem. Compared to organizational problems, technical problems are straightforward. Distributed systems are considered hard because real systems might drop something like 0.1% of messages, corrupt an even smaller percentage of messages, and see latencies in the microsecond to millisecond range. When I talk to higher-ups and compare what they think they're saying to what my coworkers think they're saying, I find that the rate of lost messages is well over 50%, every message gets corrupted, and latency can be months or years<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">1</a></sup>. When people imagine how long it should take to build something, they're often imagining a team that works perfectly and spends 100% of its time coding. But that's impossible to scale up. The question isn't whether or not there will be inefficiencies, but how much inefficiency. 
A company that could eliminate organizational inefficiency would be a larger innovation than any tech startup, ever. But when doing the math on how many employees a company “should” have, people usually assume that the company is an efficient organization.</p> <p>This post happens to use search as an example because I ran across some people who claimed that Lucene was going to surpass Google's capabilities any day now, but there's nothing about this post that's unique to search. If you talk to people in almost any field, you'll hear stories about how people wildly underestimate the complexity of the problems in the field. The point here isn't that it would be impossible for a small team to build something better than Google search. It's entirely plausible that someone will have an innovation as great as PageRank, and that a small team could turn that into a viable company. But once that company is past the VC-funded hyper-growth phase and wants to maximize its profits, it will end up with a multi-thousand-person platforms org, just like Google's, unless the company wants to leave hundreds of millions or billions of dollars a year on the table due to hardware and software inefficiency. And the company will want to handle languages like Thai, Arabic, Chinese, and Japanese, each of which is non-trivial. And the company will want to have relatively good security. And there are the hundreds of little features that users don't even realize are there, each of which provides a noticeable increase in revenue. It's &quot;obvious&quot; that companies should outsource their billing, except that when you talk to companies that handle their own billing, they can point to individual features that increase conversion by single- or double-digit percentages that they can't get from Stripe or Braintree. That fifty-person billing team is totally worth it, beyond a certain size. And then there's sales, which most engineers don't even think of<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">2</a></sup>; the exact same line of reasoning that applies to optimization also applies to sales -- as long as the marginal benefit of adding another salesperson exceeds the cost, you should expect the company to keep adding salespeople, which can often result in a sales force that's larger than the engineering team. There's also research, which, almost by definition, involves a lot of bets that don't pan out!</p> <p>It's not that all of those things are necessary to run a service at all; it's that almost every large service is leaving money on the table if they don't seriously address those things. This reminds me of a common fallacy we see in unreliable systems, where people build the happy path with the idea that the happy path is the “real” work, and that error handling can be tacked on later. <a href="//danluu.com/postmortem-lessons/">For reliable systems, error handling is more work than the happy path</a>. The same thing is true for large services -- all of this stuff that people don't think of as “real” work is more work than the core service<sup class="footnote-ref" id="fnref:W"><a rel="footnote" href="#fn:W">3</a></sup>.</p> <h3 id="correction">Correction</h3> <p>I often make minor tweaks and add new information without comment, but the original version of this post had an error, and removing the error was a large enough change that I believe it's worth pointing out. 
I had a back-of-the-envelope calculation on the cost of indexing the web with Lucene, but the numbers were based on benchmark results from some papers and comments from people who work on a commercial search engine. When I tried to reproduce the results from the papers, I found that <a href="https://twitter.com/danluu/status/814167684954738688?lang=ro">it was trivial to get orders of magnitude better performance than reported in one paper</a>, and when I tried to track down the underlying source for the comments by people who work on a commercial search engine, I found that there was no experimental evidence underlying the comments, so I removed the example.</p> <p><small>I'm experimenting with writing blog posts stream-of-consciousness, without much editing. Both this post <a href="//danluu.com/bimodal-compensation/">and my last post</a> were written that way. <a href="https://twitter.com/danluu">Let me know what you think</a> of these posts relative to my “normal” posts!</small></p> <p><small>Thanks to Leah Hanson, Joel Wilder, Kay Rhodes, Heath Borders, Kris Shamloo, Justin Blank, and Ivar Refsdal for corrections.</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:A">Recently, I was curious about why an org that's notorious for unreliable services produces so many of them. When I asked around about why, I found that upper management was afraid of sending out any sort of positive message about reliability because they were afraid that people would use that as an excuse to slip schedules. Upper management changed their message to include reliability about a year ago, but if you talk to individual contributors, they still believe that the message is that features are the #1 priority and slowing down on features to make things more reliable is bad for your career (and, based on who's getting promoted, the individual contributors appear to be right). Maybe in another year, the org will have really gotten the message through to the people who hand out promotions, and in another couple of years, enough software will have been written with reliability in mind that they'll actually have reliable services. Maybe. That's just the first-order effect. The second-order effect is that their policies have caused a lot of people who care about reliability to go to companies that care more about reliability and less about demo-ing shiny new features. They might be able to fix that in a decade. Maybe. That's made harder by the fact that the org is in a company that's well known for having PMs drive features above all else. If that reputation is possible to change, it will probably take multiple decades. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:S">For a lot of products, the sales team is more important than the engineering team. If we build out something rivaling Google search, we'll probably also end up with the infrastructure required to sell a competitive cloud offering. Google actually tried to do that without having a serious enterprise sales force and the result was that AWS and Azure basically split the enterprise market between them. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:W">This isn't to say that there isn't waste or that different companies don't have different levels of waste. I see waste everywhere I look, but it's usually not what people on the outside think of as waste. Whenever I read outsiders' descriptions of what's wasteful at the companies I've worked at, they're almost inevitably wrong. 
Friends of mine who work at other places also describe the same dynamic. <a class="footnote-return" href="#fnref:W"><sup>[return]</sup></a></li> </ol> </div> Is dev compensation bimodal? bimodal-compensation/ Mon, 26 Sep 2016 23:33:26 -0700 bimodal-compensation/ <p>Developer compensation has skyrocketed since the demise of the <a href="google-wage-fixing/">Google et al. wage-suppressing no-hire agreement</a>, to the point where compensation rivals and maybe even exceeds compensation in traditionally remunerative fields like law, consulting, etc. In software, &quot;senior&quot; dev salary at a high-paying tech company is $350k/yr, where &quot;senior&quot; can mean &quot;someone three years out of school&quot; and it's not uncommon for someone who's considered a high-performing engineer to make seven figures.</p> <p>Those fields have sharply bimodal income distributions. Are programmers in for the same fate? Let's see what data we can find. First, let's look at <a href="http://www.nalp.org/uploads/0812Research.pdf">data from the National Association for Law Placement</a>, which shows when legal salaries became bimodal.</p> <h3 id="lawyers-in-1991">Lawyers in 1991</h3> <p><img src="images/bimodal-compensation/law-1991.png" alt="First-year lawyer salaries in 1991. $40k median, trailing off with the upper end just under $90k" width="1280" height="914"></p> <p>Median salary is $40k, with the numbers slowly trickling off until about $90k. According to the BLS, $90k in 1991 is worth $160k in 2016 dollars. That's a pretty generous starting salary.</p> <h3 id="lawyers-in-2000">Lawyers in 2000</h3> <p><img src="images/bimodal-compensation/law-2000.png" alt="First-year lawyer salaries in 2000. $50k median; bimodal with peaks at $40k and $125k" width="1280" height="908"></p> <p>By 2000, the distribution had become bimodal. The lower peak has only moved up a bit in nominal (non-inflation-adjusted) terms, leaving it roughly flat in real (inflation-adjusted) terms, and there's an upper peak at around $125k, with almost everyone coming in under $130k. $130k in 2000 is $180k in 2016 dollars. The peak on the left has moved from roughly $30k in 1991 dollars to roughly $40k in 2000 dollars; both of those translate to roughly $55k in 2016 dollars. People in the right mode are doing better, while people in the left mode are doing about the same.</p> <p>I won't belabor the point with more graphs, but if you look at more recent data, the middle area between the two modes has hollowed out, increasing the level of inequality within the field. As a profession, lawyers have gotten hit hard by automation, and in real terms, 95%-ile offers today aren't really better than they were in 2000. But 50%-ile and even 75%-ile offers are worse off due to the bimodal distribution.</p> <h3 id="programmers-in-2015">Programmers in 2015</h3> <p>Enough about lawyers! What about programmers? Unfortunately, it's hard to get good data on this. Anecdotally, it sure seems to me like we're going down the same road. Unfortunately, almost all of the public data sources that are available, like H1B data, have salary numbers and not total compensation numbers. Since compensation at the upper end is disproportionately bonus and stock, most data sets I can find don't capture what's going on.</p> <p>One notable exception is the new grad compensation data recorded by Dan Zhang and Jesse Collins:</p> <p><img src="images/bimodal-compensation/cs-2015.png" alt="First-year programmer compensation in 2016. Compensation ranges from $50k to $250k" width="1280" height="1006"></p> <p>There's certainly a wide range here, and while it's technically bimodal, there isn't a huge gulf in the middle like you see in law and business. Note that this data is mostly bachelor's grads with a few master's grads. PhD numbers, which sometimes go much higher, aren't included.</p>
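<p>As a methodological aside, if you have the raw numbers and want something a bit more principled than eyeballing a histogram, one rough check is to fit one- and two-component Gaussian mixtures and compare BIC. The snippet below does that on synthetic data; the numbers are made up for illustration and aren't Dan and Jesse's actual data:</p> <pre><code># Sketch: is a compensation sample meaningfully bimodal? Compare 1- vs 2-component
# Gaussian mixtures by BIC. The sample below is synthetic, for illustration only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Pretend new-grad total comp in $k: a big cluster near 105 and a smaller one near 175.
comp = np.concatenate([rng.normal(105, 15, 300), rng.normal(175, 20, 100)])
X = comp.reshape(-1, 1)

gmm1 = GaussianMixture(n_components=1, random_state=0).fit(X)
gmm2 = GaussianMixture(n_components=2, random_state=0).fit(X)

print('BIC, 1 component: ', round(gmm1.bic(X)))
print('BIC, 2 components:', round(gmm2.bic(X)))  # lower BIC favors the 2-mode model

modes = sorted(gmm2.means_.ravel())
print('fitted modes ($k):', [round(m) for m in modes])
</code></pre> <p>With only ~100 real data points, any test like this will be noisy, which is part of why I wouldn't read too much into the exact shape of the distribution above.</p>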
<p>Do you know of a better (larger) source of data? This is from about 100 data points, members of the &quot;Hackathon Hackers&quot; Facebook group, in 2015. Dan and Jesse also have data from 2014, but it would be nice to get data over a wider timeframe and just plain more data. Also, this data is pretty clearly biased towards the high end — if you look at national averages for programmers at all levels of experience, the average comes in much lower than the average for new grads in this data set. The data here match the numbers I hear when we compete for people, but the population of &quot;people negotiating offers at Microsoft&quot; also isn't representative.</p> <p>If we had more representative data, it's possible that we'd see a lot more data points in the $40k to $60k range along with the data we have here, which would make the data look bimodal. It's also possible that we'd see a lot more points in the $40k to $60k range, many more in the $70k to $80k range, some more in the $90k+ range, etc., and we'd see a smooth drop-off instead of two distinct modes.</p> <p>Stepping back from the meager data we have and looking at the circumstances, &quot;should&quot; programmer compensation be bimodal? Most other fields that have bimodal compensation have a very different compensation structure than we see in programming. For example, top law and consulting firms have an up-or-out structure, which is effectively a tournament; that distorts compensation and makes it more likely that compensation ends up bimodal. Additionally, competitive firms pay the same rate to all first-year employees, which they determine by matching whoever appears to be paying the most. For example, this year, <a href="http://abovethelaw.com/2016/06/breaking-ny-to-180k-cravath-raises-associate-base-salaries/2/">Cravath announced that it would pay first-year associates $180k, and many other firms followed suit</a>. Like most high-end firms, Cravath has a salary schedule that's entirely based on experience:</p> <ul> <li>0 years: $180k</li> <li>1 year: $190k</li> <li>2 years: $210k</li> <li>3 years: $235k</li> <li>4 years: $260k</li> <li>5 years: $280k</li> <li>6 years: $300k</li> <li>7 years: $315k</li> </ul> <p>In software, compensation tends to be on a case-by-case basis, which makes it much less likely that we'll see a sharp peak the way we do in law. If I had to guess, I'd say that while the dispersion in programmer compensation is increasing, it's not bimodal, but I don't really have the right data set to conclusively say anything. Please point me to any data you have that's better.</p> <h3 id="appendix-a-please-don-t-send-me-these">Appendix A: please don't send me these</h3> <ul> <li>H-1B: mostly salary only.</li> <li>Stack Overflow survey: salary only. Also, data is skewed by the heavy web focus of the survey — I stopped doing the survey when none of their job descriptions matched anyone in my entire building, and I know other people who stopped for the same reason.</li> <li>Glassdoor: weirdly inconsistent about whether or not it includes stock compensation. 
Numbers for some companies seem to, but numbers for other companies don't.</li> <li>O'Reilly survey: salary-focused.</li> <li>BLS: doesn't make fine-grained distribution available.</li> <li>IRS: they must have the data, but they're not sharing.</li> <li>IDG: only has averages.</li> <li>internal company data: too narrow.</li> <li>compensation survey companies like PayScale: when I've talked to people from these companies, they acknowledge that they have very poor visibility into large company compensation, but that's what drives the upper end of the market (outside of finance).</li> <li>#talkpay on Twitter: numbers skew low<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">1</a></sup>.</li> </ul> <h3 id="appendix-b-why-are-programmers-well-paid">Appendix B: why are programmers well paid?</h3> <p>Since we have both programmer and lawyer compensation handy, let's examine that. Programming pays so well that it seems a bit absurd. If you look at other careers with similar compensation, there are multiple factors that act as barriers or disincentives to entry.</p> <p>If you look at law, you have to win the prestige lottery and get into a top school, which will cost hundreds of thousands of dollars (while it's possible to get a full scholarship, a relatively small fraction of students at top schools are on full scholarships). Then you have to win the grades lottery and get good enough grades to get into a top firm. And then you have to continue winning tournaments to avoid getting kicked out, which requires sacrificing any semblance of a personal life. Consulting, investment banking, etc., are similar. Compensation appears to be proportional to the level of sacrifice (e.g., investment bankers are paid better but work even longer hours than lawyers; private equity is somewhere between investment banking and law in hours and compensation; etc.).</p> <p>Medicine seems to be a bit better from the sacrifice standpoint because there's a cartel which limits entry into the field, but the combination of medical school and residency is still incredibly brutal compared to most jobs at places like Facebook and Google.</p> <p>Programming also doesn't have a licensing body limiting the number of programmers, nor is there the same prestige filter where you have to go to a top school to get a well-paying job. Sure, there are a lot of startups that basically only hire from MIT, Stanford, CMU, and a few other prestigious schools, and I see job ads like the following whenever I look at startups (the following is from a company that was advertising on Slate Star Codex for quite a long time):</p> <blockquote> <p>Our team of 14 includes 6 MIT alumni, 3 ex-Googlers, 1 Wharton MBA, 1 MIT Master in CS, 1 CMU CS alum, and 1 &quot;20 under 20&quot; Thiel fellow. Candidates often remark we're the strongest team they've ever seen.</p> <p>We’re not for everyone. We’re an enterprise SaaS company your mom will probably never hear of. We work really hard 6 days a week because we believe in the future of mobile and we want to win.</p> </blockquote> <p>Prestige-obsessed places exist. But, in programming, measuring people by markers of prestige seems to be a Silicon Valley startup thing, not a top-paying-company thing. Big companies, <a href="startup-tradeoffs/">which pay a lot better than startups</a>, don't filter people out by prestige nearly as often. Not only do you not need the right degree from the right school, you also don't need to have the right kind of degree, or any degree at all. 
Although it's getting rarer to not have a degree, I still meet new hires with no experience and either no degree or a degree in an unrelated field (like sociology or philosophy).</p> <p>How is it possible that programmers are paid so well without these other barriers to entry that similarly remunerative fields have? One possibility is that we have a shortage of programmers. If that's the case, you'd expect more programmers to enter the field, bringing down compensation. CS enrollments have been at record levels recently, so this may already be happening. Another possibility is that programming is uniquely hard in some way, but that seems implausible to me. Programming doesn't seem inherently harder than electrical engineering or chemical engineering and it certainly hasn't gotten much harder over the past decade, but during that timeframe, programming has gone from having similar compensation to most engineering fields to paying much better. The last time I was negotiating with an EE company about offers, they remarked to me that their VPs don't make as much as I do, and I work at a software company that pays relatively poorly compared to its peers. There's no reason to believe that we won't see a flow of people from engineering fields into programming until compensation is balanced.</p> <p>Another possibility is that U.S. immigration laws act as a protectionist barrier to prop up programmer compensation. It seems impossible for this to last (why shouldn't there be really valuable non-U.S. companies?), but it does appear to be somewhat true for now. When I was at Google, one thing that was remarkable to me was that they'd pay you approximately the same thing in Washington or Colorado as they do in Silicon Valley, but they'd pay you much less in London. Whenever one of these discussions comes up, people always bring up the &quot;fact&quot; that SV salaries aren't really as good as they sound because the cost of living is so high, but companies will not only match SV offers in Seattle, they'll match them in places like Pittsburgh. My best guess for why this happens is that someone in the Midwest can credibly threaten to move to SV and take a job at any company there, whereas someone in London can't<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">2</a></sup>. While we seem unlikely to loosen current immigration restrictions, our immigration restrictions have caused and continue to cause people who would otherwise have founded companies in the U.S. to found companies elsewhere. Given that the U.S. doesn't have a monopoly on people who found startups and that we do our best to keep people who want to found startups here out, it seems inevitable that there will eventually be Facebooks and Googles founded outside of the U.S. that compete for programmers the same way companies compete inside the U.S.</p> <p>Another theory that I've heard a lot lately is that programmers at large companies get paid a lot because of the phenomenon described in <a href="https://en.wikipedia.org/wiki/O-ring_theory_of_economic_development">Kremer's O-ring model</a>. This model assumes that productivity is multiplicative. If your co-workers are better, you're more productive and produce more value. If that's the case, you expect a kind of assortative matching where you end up with high-skill firms that pay better, and low-skill firms that pay worse. This model has a kind of intuitive appeal to it, but it can't explain why programming compensation has higher dispersion than (for example) electrical engineering compensation. With the prevalence of open source, it's much easier to utilize the work of productive people outside your firm than in most fields. This model should be less true of programming than of most engineering fields, but the dispersion in compensation is higher.</p>
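<p>To see where the assortative matching prediction comes from, here's a toy version of the model with made-up numbers: two firms, two workers each, and output modeled as the product of worker skills. The only point of the sketch is that multiplicative production rewards sorting similar workers into the same firm:</p> <pre><code># Toy O-ring example: two firms, two workers each, output = product of skills.
# The numbers are made up; the point is that multiplicative production rewards sorting.

def firm_output(skill_a, skill_b):
    # O-ring assumption: one weak link drags everyone's output down multiplicatively.
    return skill_a * skill_b

# Four workers: two high-skill (0.9) and two low-skill (0.5).
assortative_total = firm_output(0.9, 0.9) + firm_output(0.5, 0.5)  # 0.81 + 0.25
mixed_total = firm_output(0.9, 0.5) + firm_output(0.9, 0.5)        # 0.45 + 0.45

print('high with high, low with low:', round(assortative_total, 2))  # 1.06
print('one of each per firm:        ', round(mixed_total, 2))        # 0.9
</code></pre> <p>In that toy world, a high-skill worker is worth more at the high-skill firm than at a mixed one, so pay can diverge across firms. The objection above still applies, though: this story should hold at least as strongly for other engineering fields, and their pay is less dispersed.</p>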
<p>A related theory that can't be correct for similar reasons is that high-paid software engineers are extra elite, the best of the best, and are simply paid more because they're more productive. If you look at how many programmers the BLS says exist in the U.S. (on the order of a few million) and how many engineers high-paying tech companies employ in the U.S. (on the order of a couple or a few hundred thousand), high-paying software companies literally can't consist of the top 1%. Even if their filters were perfect (as opposed to the complete joke that they're widely regarded to be), they couldn't be better than 90%-ile. Realistically, it's more likely that the median programmer at a high-paying tech company is a bit above 50%-ile.</p> <p>The most common theory I've heard is that &quot;software is eating the world&quot;. The theory goes: of course programmers get paid a lot and will continue to get paid a lot because software is important and only becoming more important. Despite being the most commonly stated theory I've heard, this seems nonsensical if you compare it to other fields. You could've said this about microprocessor design in the late 90s as well as fiber optics. Those fields are both more important today than they were in the 90s: not only is there more demand for processing power and bandwidth than ever before, but demand for software actually depends on them. And yet, the optics engineering job market still hasn't recovered from the dot-com crash and the microprocessor design engineer market, after recovering, still pays experienced PhDs less than a CS new grad at Facebook.</p> <p>Furthermore, any argument for high programmer pay that relies on some inherent property of market conditions, the economy at large, the impact of programming, etc., seems like it cannot be correct if you look at what's actually driven up programmer pay. FB declined to participate in <a href="google-wage-fixing/">the Google/Apple wage-fixing agreement that became basically industry-wide</a>, which meant that FB was outpaying other major tech companies. When the wage-fixing agreement was lifted, other companies &quot;had to&quot; come close to matching FB compensation to avoid losing people both to FB and to each other. When they did that, FB kept raising the bar on compensation and compensation kept getting better. [2022 update] This can most clearly be seen with changes to benefits and pay structure, where FB would make a change, Google would follow suit immediately, and other companies would pick up the change later, as when FB removed vesting cliffs and Google did the same within weeks and the change trickled out across the industry. 
There are companies that were paying programmers as well as or better than FB, like Netflix and a variety of finance companies, but major tech companies tended not to match offers from those places because those companies were too small to hire away enough programmers to be concerning. FB, on the other hand, is large and hires enough to be a concern to Google, which matches FB; combined, the two are large enough to be a concern to the other major tech companies.</p> <p>Because the mechanism for compensation increases has been arbitrary (FB might not have existed, or Zuckerberg, who has total control of FB, could have decided on a different compensation policy), it's quite arbitrary that programmer pay is as good as it is.</p> <p>In conclusion, high programmer pay seems like a mystery to me and I would love to hear a compelling theory for why programming &quot;should&quot; pay more than other similar fields, or why it should pay as much as fields that have much higher barriers to entry.</p> <h3 id="update">Update</h3> <p>Eric Roberts has observed that <a href="https://cs.stanford.edu/people/eroberts/CSCapacity.pdf">it takes a long time for CS enrollments to recover after a downturn, leading to a large deficit in the number of people with CS degrees vs. demand</a>.</p> <p><img src="images/bimodal-compensation/cs-degrees.png" alt="More than a one-decade lag between downturn and recovery in enrollment" width="1800" height="1031"></p> <p>The 2001 bubble bursting caused a severe drop in CS enrollment. CS enrollment didn't hit its previous peak again until 2014, and if you fit the graph and extrapolate against the peaks, it took another year or two for enrollments to hit the historical trend. Even if we didn't have any data, it wouldn't be surprising to find a delay of five years or so. Of the people who graduate in four years (as opposed to five or more), most aren't going to change their major after mid or late sophomore year, so that's already two to three years of delay right there. And after a downturn, it takes some time to recover, so we'd expect at least another two to three years. Roberts makes a case that the additional latency came from a number of other factors, including the fear that, even though things looked ok, jobs would soon be outsourced, as well as a slow response by colleges.</p> <p><a href="https://danwang.co/why-so-few-computer-science-majors/">Dan Wang</a> has noted that, according to the SO survey, 3/4 of developers have a BS degree (or higher). If it's statistically &quot;hard&quot; to get a high-paying job without a CS degree and there's a decade-plus hangover from the 2001 downturn, that could explain why programmer compensation is so high. Of course, most of us know people in the industry without a degree, but it seems to be harder to find an entry-level position without a credential.</p> <p>It's not clear what this means for the future. Even if the lack of candidates with the appropriate credential is a major driver of programmer compensation, it's unclear what the record CS enrollments over the past few years mean for future compensation. It's possible that record enrollments mean that we should expect compensation to come back down to the levels we see in other fields that require similar skills, like electrical engineering. 
It's also possible that enrollment continues to lag behind demand by a decade and that record enrollments are just keeping pace with demand from a decade ago, in which case we might expect elevated compensation to persist (as long as other factors, like hiring outside of the U.S., don't influence things too much). Since there's so much latency, another possibility is that enrollment has overshot, or will overshoot, demand and we should expect programmer compensation to decline. And it's not even clear that the Roberts paper makes sense as an explanation for high current comp because Roberts also found a huge capacity crunch in the 80s and, while some programmers were paid very well, the fraction of programmers who were paid &quot;very well&quot; seems to have been much smaller than it is today. Google alone employs 30k engineers. If 20k of those are programmers in the U.S. and we use the estimate that there are 3 million programmers in the U.S., Google alone employs roughly 0.7% of programmers in the U.S. If you add in the other large companies that are known to pay competitively (Amazon, Facebook, etc.), that's a significant fraction of all programmers in the U.S., which I believe is quite different from the situation in the 80s.</p> <p>The most common response I've gotten to this post is that we should expect programmers to be well-paid because software is everywhere and there will be at least as much software in the future. This exact same line of reasoning could apply to electrical engineering, which is more fundamental than software, in that software requires hardware, and yet electrical engineering comp isn't in the same league as programmer comp. Highly paid programmers couldn't get their work done without microprocessors, and there are more processors sold than ever before, but the comp packages for a &quot;senior&quot; person at places like Intel and Qualcomm aren't even within a factor of two of those at Google or Facebook. You could also make a similar argument for people who work on water and sewage systems, but those folks don't see compensation that's in the same range as programmers either. Any argument of the form &quot;the price for X is high because X is important&quot; implicitly assumes that there's some force constraining the supply of X. The claim that &quot;X is important&quot; or &quot;we need a lot of X&quot; is missing half the story. Another problem with claims like &quot;X is important&quot; or &quot;X is hard&quot; is that these statements don't seem any less true of industries that pay much less. If your explanation of why programmers are well paid is just as true of any &quot;classical&quot; engineering discipline, you need some explanation of why those other fields shouldn't be as well paid.</p> <p>The second most common comment that I hear is that of course programmers are well paid: software companies are worth so much that it's inevitable. But there's nothing inevitable about workers actually being well compensated because a company is profitable. Someone who made this argument sent me a link to <a href="http://www.businessinsider.com/apple-facebook-alphabet-most-profitable-companies-per-employee-2017-12">this list of the most profitable companies per employee</a>. 
The list has some software companies that pay quite well, like Alphabet (Google) and Facebook, but we also see hardware companies like Qualcomm, Cisco, TSMC (and arguably SoftBank now that they've acquired ARM) that don't even pay as well as software companies that don't turn a profit or that barely make money and have no path to being wildly profitable in the future. Moreover, the compensation at the software companies that are listed isn't very strongly related to their profit per employee.</p> <p>To take a specific example that I'm familiar with because I grew up in Madison, the execs at Epic Systems have built a company that's generated so much wealth that its founder has an estimated net worth of $3.6 billion, which is much more than all but the most successful founders in tech. But line engineers at Epic are paid significantly less than engineers at tech companies that compete with SV for talent, even tech companies that have never made any money. What is it about some software companies that make a similar amount of money that prevents them from funneling virtually all of the wealth they generate up to the top? The typical answer to this is cost of living, but as we've seen, that makes even less sense than usual in this case since Google has an office in the same city as Epic, and Google pays well over double what Epic does for a typical dev. If there were some kind of simple cost-of-living adjustment, you'd expect Google to pay less in Madison than in Toronto or London, but it seems to be the other way around. This isn't unique to Madison — just for example, you can find a number of successful software companies in Austin that pay roughly half what Amazon and Facebook pay in the same city, where upper management does very well for themselves and line engineers make a fine living, but nowhere near as much as they'd make if they moved to a company like Amazon or Facebook.</p> <p>The thing all of these theories have in common is that they apply to other fields as well, so they cannot be, as stated, the reason programmers are better paid than people in these other fields. Someone could argue that programming has a unique combination of many of these, or that one of these reasons should be expected to apply much more strongly to programming than to any other field, but I haven't seen anyone make that case. Instead, people just make obviously bogus statements like &quot;programming is really hard&quot; (which is only valid as a reason, in this discussion, if programming is literally the hardest field in existence and much harder than other engineering fields).</p> <div class="footnotes"> <hr /> <ol> <li id="fn:T">People often worry that comp surveys will skew high because people want to brag, but the reality seems to be that numbers skew low because people feel embarrassed about sounding like they're bragging. I have a theory that you can see this reflected in the prices of other goods. For example, if you look at house prices, they're generally predictable based on location, square footage, amenities, and so on. But there's a significant penalty for having the largest house on the block, for what (I suspect) is the same reason people with the highest compensation disproportionately don't participate in #talkpay: people don't want to admit that they have the highest pay, have the biggest house, or drive the fanciest car. Well, some people do, but on average, bragging about that stuff is seen as gauche. 
<a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:M">There's a funny move some companies will do where they station the new employee in Canada for a year before importing them into the U.S., which gets them into a visa process that's less competitive. But this is enough of a hassle that most employees balk at the idea. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> </ol> </div> How I learned to program learning-to-program/ Mon, 12 Sep 2016 01:41:26 -0700 learning-to-program/ <p>Tavish Armstrong has a great document <a href="https://github.com/tarmstrong/longcv/blob/master/bio.md">where he describes how and when he learned the programming skills he has</a>. I like this idea because I've found that the paths that people take to get into programming are much more varied than stereotypes give credit for, and I think it's useful to see that there are many possible paths into programming.</p> <p>Personally, I spent a decade working as an electrical engineer before taking a programming job. When I talk to people about this, they often want to take away a smooth narrative of my history. Maybe it's that my math background gives me tools I can apply to a lot of problems, maybe it's that my hardware background gives me a good understanding of performance and testing, or maybe it's that the combination makes me a great fit for hardware/software co-design problems. <a href="https://www.youtube.com/watch?v=RoEEDKwzNBw">People like a good narrative</a>. One narrative people seem to like is that I'm a good problem solver, and that problem solving ability is generalizable. But reality is messy. Electrical engineering seemed like the most natural thing in the world, and I picked it up without trying very hard. Programming was unnatural for me, and didn't make any sense at all for years. If you believe in the common &quot;you either have it or you don't&quot; narrative about programmers, I definitely don't have it. And yet, I now make a living programming, and people seem to be pretty happy with the work I do.</p> <p>How'd that happen? Well, if we go back to the beginning, before becoming a hardware engineer, I spent a fair amount of time doing failed kid-projects (e.g., writing a tic-tac-toe game and AI) and not really &quot;getting&quot; programming. I do sometimes get a lot of value out of my math or hardware skills, but I suspect I could teach someone the actually applicable math and hardware skills I have in less than a year. Spending five years in a school and a decade in industry to pick up those skills was a circuitous route to getting where I am. Amazingly, I've found that my path has been more direct than that of most of my co-workers, giving the lie to the narrative that most programmers are talented whiz kids who took to programming early.</p> <p>And while I only use a small fraction of the technical skills I've learned on any given day, I find that I have a meta-skill set that I use all the time. There's nothing profound about the meta-skill set, but because I often work in new (to me) problem domains, I find my meta skillset to be more valuable than my actual skills. I don't think that you can communicate the importance of meta-skills (like communication) by writing a blog post any more than you can <a href="https://byorgey.wordpress.com/2009/01/12/abstraction-intuition-and-the-monad-tutorial-fallacy/">explain what a monad is by saying that it's like a burrito</a>. 
That being said, I'm going to tell this story anyway.</p> <h3 id="ineffective-fumbling-1980s-1996">Ineffective fumbling (1980s - 1996)</h3> <p>Many of my friends and I tried and failed multiple times to learn how to program. We tried BASIC, and could write some simple loops, use conditionals, and print to the screen, but never figured out how to do anything fun or useful.</p> <p>We were exposed to some kind of lego-related programming, uhhh, thing in school, but none of us had any idea how to do anything beyond what was in the instructions. While it was fun, it was no more educational than a video game and had a similar impact.</p> <p><a href="https://www.linkedin.com/in/jeshua-smith-1a873858">One of us</a> got <a href="https://www.amazon.com/gp/product/1568301839/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1568301839&amp;linkId=408c9bc67c8e6b0a405ce7ccdd7ed7b4">a game programming book</a>. We read it, tried to do a few things, and made no progress.</p> <h3 id="high-school-1996-2000">High school (1996 - 2000)</h3> <p>Our ineffective fumbling continued through high school. Due to an interest in gaming, I got interested in benchmarking, which eventually led to learning about CPUs and CPU microarchitecture. This was in the early days of Google, before Google Scholar, and before most CS/EE papers could be found online for free, so this was mostly material from enthusiast sites. Luckily, the internet was relatively young, as were the users on the sites I frequented. Much of the material on hardware was targeted at (and even written by) people like me, which made it accessible. Unfortunately, a lot of the material on programming was written by and targeted at professional programmers, things like <a href="http://www.azillionmonkeys.com/qed/optimize.html">Paul Hsieh's optimization guide</a>. There were some beginner-friendly guides to programming out there, but my friends and I didn't stumble across them.</p> <p>We had programming classes in high school: an introductory class that covered Visual Basic and an <a href="https://en.wikipedia.org/wiki/Advanced_Placement_exams">AP class</a> that taught C++. Both classes were taught by someone who didn't really know how to program or how to teach programming. My class had a couple of kids who already knew how to program and <a href="https://www.topcoder.com/members/po/">would make good money doing programming competitions on topcoder when it opened</a>, but they failed to test out of the intro class because that test included things like a screenshot of the VB6 IDE, where you got a point for correctly identifying what each button did. The class taught about as much as you'd expect from a class where the pre-test involved identifying UI elements from an IDE.</p> <p>The AP class the year after was similarly effective. About halfway through the class, a couple of students organized an independent study group which worked through an alternate textbook because the class was clearly not preparing us for the AP exam. I passed the AP exam because it was one of those multiple choice tests that's possible to pass without knowing the material.</p> <p>Although I didn't learn much, I wouldn't have graduated high school if not for AP classes. I failed enough individual classes that I almost didn't have enough credits to graduate. 
I got those necessary credits for two reasons. First, a lot of the teachers had a deal where, if you scored well on the AP exam, they would give you a passing grade in the class (usually an A, but sometimes a B). Second, even that wouldn't have been enough if my chemistry teacher hadn't also changed my grade to a passing grade when he found out I did well on the AP chemistry test<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <p>Other than not failing out of high school, I'm not sure I got much out of my AP classes. My AP CS class actually had a net negative effect on my learning to program because the AP test let me opt out of the first two intro CS classes in college (an introduction to programming and a data structures course). In retrospect, I should have taken the intro classes, but I didn't, which left me with huge holes in my knowledge that I didn't really fill in for nearly a decade.</p> <h3 id="college-2000-2003">College (2000 - 2003)</h3> <p>Because I'd nearly failed out of high school, there was no reasonable way I could have gotten into a &quot;good&quot; college. Luckily, I grew up in Wisconsin, a state with a &quot;good&quot; school that used a formula to determine who would automatically get admitted: the GPA cutoff depended on standardized test scores, and anyone with standardized test scores above a certain mark was admitted regardless of GPA. During orientation, I talked to someone who did admissions and found out that my year was the last year they used the formula.</p> <p>I majored in computer engineering and math for reasons that seem quite bad in retrospect. I had no idea what I really wanted to study. I settled on either computer engineering or engineering mechanics because both of those sounded &quot;hard&quot;.</p> <p>I made a number of attempts to come up with better criteria for choosing a major. The most serious was when I spent a week talking to professors in an attempt to find out what day-to-day life in different fields was like. That approach had two key flaws. First, most professors don't know what it's like to work in industry; now that I work in industry and talk to folks in academia, I see that most academics who haven't done stints in industry have a lot of misconceptions about what it's like. Second, even if I managed to get accurate descriptions of different fields, it turns out that there's a wide body of research that indicates that humans are basically hopeless at predicting which activities they'll enjoy. Ultimately, I decided by coin flip.</p> <h4 id="math">Math</h4> <p>I wasn't planning on majoring in math, but my freshman intro calculus course was so much fun that I ended up adding a math major. That only happened because a high-school friend of mine passed me the application form for the honors calculus sequence because he thought I might be interested in it (he'd already taken the entire calculus sequence as well as linear algebra). The professor for the class covered the material at an unusually fast pace: he finished what was supposed to be a year-long calculus textbook partway through the semester and then lectured on his research for the rest of the semester. The class was theorem-proof oriented and didn't involve any of that yucky memorization that I'd previously associated with math. That was the first time I'd found school engaging in my entire life and it made me really look forward to going to math classes. 
I later found out that non-honors calculus involved a lot of memorization when the engineering school required me to go back and take calculus II, which I'd skipped because I'd already covered the material in the intro calculus course.</p> <p>If I hadn't had a friend drop the application for honors calculus in my lap, I probably wouldn't have majored in math and it's possible I never would have found any classes that seemed worth attending. Even as it was, all of the most engaging undergrad professors I had were math professors<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">2</a></sup> and I mostly skipped my other classes. I don't know how much of that was because my math classes were much smaller, and therefore much more customized to the people in the class (computer engineering was very trendy at the time, and classes were overflowing), and how much was because these professors were really great teachers.</p> <p>Although I occasionally get some use out of the math that I learned, most of the value was in becoming confident that I can learn and work through the math I need to solve any particular problem.</p> <h4 id="engineering">Engineering</h4> <p>In my engineering classes, I learned how to <a href="http://danluu.com/teach-debugging/">debug</a> and how computers work down to the transistor level. I spent a fair amount of time skipping classes and reading about topics of interest in the library, which included things like computer arithmetic and circuit design. I still have fond memories of <a href="https://www.amazon.com/gp/product/1568811608/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=1568811608&amp;linkId=ebede6c98dc79cb83fa8695049df4dc4">Koren's Computer Arithmetic Algorithms</a> and <a href="https://www.amazon.com/gp/product/078036001X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=078036001X&amp;linkId=e806434c2ef1eeef9e3d5963ecb68393">Chandrakasan et al.'s Design of High-Performance Microprocessor Circuits</a>. I also started reading papers; I spent a lot of time in libraries reading physics and engineering papers that mostly didn't make sense to me. The notable exception was systems papers, which I found to be easy reading. I distinctly remember reading <a href="http://www.hpl.hp.com/techreports/1999/HPL-1999-77.html">the Dynamo paper</a> (this was HP's paper on JITs, not the more recent Amazon work of the same name), but I can't recall any other papers I read back then.</p> <h4 id="internships">Internships</h4> <p>I had two internships, one at Micron where I &quot;worked on&quot; flash memory, and another at IBM where I worked on the POWER6. The Micron internship was a textbook example of a bad internship. When I showed up, my manager was surprised that he was getting an intern and had nothing for me to do. After a while (perhaps a day), he found an assignment for me: press buttons on a phone. He'd managed to find a phone that used Micron flash chips; he handed it to me, told me to test it, and walked off.</p> <p>After poking at the phone for an hour or two and not being able to find any obvious bugs, I walked around and found people who had tasks I could do. Most of them were only slightly less manual than &quot;testing&quot; a phone by mashing buttons, but I did one not-totally-uninteresting task, which was to verify that a flash chip's controller behaved correctly. 
Unlike my other tasks, this was amenable to automation and I was able to write a Perl script to do the testing for me.</p> <p>I chose Perl because someone had a Perl book on their desk that I could borrow, which seemed like as good a reason as any at the time. I called up a friend of mine to tell him about this great &quot;new&quot; language and we implemented Age of Renaissance, a board game we'd played in high school. We didn't finish, but Perl was easy enough to use that we felt like we could write a program that actually did something interesting.</p> <p>Besides learning Perl, I learned that I could ask people for books and read them, and I spent most of the rest of my internship half keeping an eye on a manual task while reading the books people had lying around. Most of the books had to do with either analog circuit design or flash memory, so that's what I learned. None of the specifics have really been useful to me in my career, but I learned two meta-items that were useful.</p> <p>First, no one's going to stop you from spending time reading at work or spending time learning (on most teams). Micron did its best to keep interns from learning by having a default policy of blocking interns from having internet access (managers could override the policy, but mine didn't), but no one will go out of their way to prevent an intern from reading books when their other task is to randomly push buttons on a phone.</p> <p>Second, I learned that there are a lot of engineering problems we can solve without anyone really knowing why the solution works. One of the books I read was a survey of then-current research on flash memory. At the time, flash memory relied on some behaviors that were well characterized but not really understood. There were theories about how the underlying physical mechanisms might work, but determining which theory was correct was still an open question.</p> <p>The next year, I had a much more educational internship at IBM. I was attached to a logic design team on the POWER6, and since they didn't really know what to do with me, they had me do verification on the logic they were writing. They had a relatively new tool called <a href="//danluu.com/testing/#fn:S">SixthSense, which you can think of as a souped-up quickcheck</a>. The obvious skill I learned was how to write tests using a fancy testing framework, but the meta-lesson, which has been even more useful, is that writing a test-case generator and a checker is often much more productive than the manual test-case writing that passes for automated testing in most places.</p>
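<p>For anyone who hasn't seen that style of testing, here's a minimal sketch of the generator-plus-checker idea. It's just an illustration of the general shape (it's not what SixthSense looks like), with a made-up function under test and an obviously-correct reference model:</p> <pre><code># Minimal generator + checker sketch: random test-case generation plus a
# reference model, instead of hand-written test cases.
import random

def saturating_add_u8(a, b):
    # Implementation under test: 8-bit add that clamps at 255.
    s = a + b
    return 255 if s &gt; 255 else s

def reference_add_u8(a, b):
    # Slow-but-obviously-correct reference model.
    return min(a + b, 255)

def run_random_tests(n=10_000, seed=0):
    rng = random.Random(seed)
    for _ in range(n):
        a, b = rng.randrange(256), rng.randrange(256)  # the generator
        got, want = saturating_add_u8(a, b), reference_add_u8(a, b)
        # The checker: compare against the reference model.
        assert got == want, f'mismatch: {a} + {b} gave {got}, expected {want}'

run_random_tests()
print('all random cases matched the reference model')
</code></pre> <p>The checker here is trivial, but the structure scales: the generator gets smarter about hitting interesting corners, the reference model gets more detailed, and you end up with coverage that nobody would ever write by hand.</p>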
<p>The other thing I encountered for the first time at IBM was version control (CVS, unfortunately). Looking back, I find it a bit surprising that not only did I never use version control in any of my classes, but I'd never met any other students who were using version control. My IBM internship was between undergrad and grad school, so I managed to get a B.S. degree without ever using or seeing anyone use version control.</p> <h4 id="computer-science">Computer Science</h4> <p>I took a couple of CS classes. The first was algorithms, which was so poorly taught and, as a result, so heavily curved that I got an A despite not learning anything at all. The course involved no programming and while I could have done some implementation in my free time, I was much more interested in engineering and didn't try to apply any of the material.</p> <p>The second course was databases. There were a couple of programming projects, but they were all projects where you got some scaffolding and only had to implement a few key methods to make things work, so it was possible to do ok without having any idea how to program. I got involved in a competition to see who could attend the fewest possible classes, didn't learn anything, and scraped by with a B.</p> <h3 id="grad-school-2003-2005">Grad school (2003 - 2005)</h3> <p>After undergrad, I decided to go to grad school for a couple of silly reasons. One was a combination of &quot;why not?&quot; and the argument that most of my professors gave, which was that you'll never go if you don't go immediately after undergrad because it's really hard to go back to school later. But the reason that people don't go back later is because they have more information (they know what both school and work are like), and they almost always choose work! The other major reason was that I thought I'd get a more interesting job with a master's degree. That's not obviously wrong, but it appears to be untrue in general for people going into electrical engineering and programming.</p> <p>I don't know that I learned anything that I use today, either in the direct sense or in a meta sense. I had some great professors<sup class="footnote-ref" id="fnref:W"><a rel="footnote" href="#fn:W">3</a></sup> and I made some good friends, but I think that this wasn't a good use of time because of two bad decisions I made at the age of 19 or 20. Rather than attend a school that had a lot of people working in an area I was interested in, I went with a school that gave me a fellowship but only had one person working in an area I was really interested in. That person left just before I started.</p> <p>I ended up studying optics, and while learning a new field was a lot of fun, the experience was of no particular value to me, and I could have had fun studying something I had more of an interest in.</p> <p>While I was officially studying optics, I still spent a lot of time learning unrelated things. At one point, I decided I should learn Lisp or Haskell, probably because of something Paul Graham wrote. I couldn't find a Lisp textbook in the library, but I found a Haskell textbook. After I worked through the exercises, I had no idea how to accomplish anything practical. But I did learn about list comprehensions and got in the habit of using higher-order functions.</p> <p>Based on internet comments and advice, I had the idea that learning more languages would teach me how to be a good programmer, so I worked through introductory books on Python and Ruby. As far as I can tell, this taught me basically nothing useful and I would have been much better off learning about a specific area (like algorithms or networking) than learning lots of languages.</p> <h3 id="first-real-job-2005-2013">First real job (2005 - 2013)</h3> <p>Towards the end of grad school, I mostly looked for, and found, electrical/computer engineering jobs. The one notable exception was Google, which called me up in order to fly me out to Mountain View for an interview. I told them that they probably had the wrong person because they hadn't even done a phone screen, so they offered to do a phone interview instead. I took the phone interview expecting to fail because I didn't have any CS background, and I failed as expected. 
In retrospect, I should have asked to interview for a hardware position, but at the time I didn't know they had hardware positions, even though they'd been putting together their own servers and designing some of their own hardware for years.</p> <p>Anyway, I ended up at a little chip company called <a href="https://www.amazon.com/gp/product/B01FSZU6FK/ref=as_li_qf_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01FSZU6FK&amp;linkId=d15e514c6ecefa224be8f05d4d5837e3">Centaur</a>. I was hesitant about taking the job because the interview was the easiest interview I had at any company<sup class="footnote-ref" id="fnref:G"><a rel="footnote" href="#fn:G">4</a></sup>, which made me wonder if they had a low hiring bar, and therefore relatively weak engineers. It turns out that they were, on average, the best group of people I've ever worked with. I didn't realize it at the time, but this would later teach me that companies that claim to have brilliant engineers because they have super hard interviews are full of it, and that the interview difficulty one-upmanship a lot of companies promote is more of a prestige play than anything else.</p> <p>But I'm getting ahead of myself — my first role was something they call &quot;regression debug&quot;, which included debugging test failures for both newly generated tests as well as regression tests. The main goal of this job was to teach new employees the ins-and-outs of the x86 architecture. At the time, Centaur's testing was very heavily based on chip-level testing done by injecting real instructions, interrupts, etc., onto the bus, so debugging test failures taught new employees everything there is to know about x86.</p> <p>The Intel x86 manual is thousands of pages long and it isn't sufficient to implement a compatible x86 chip. When Centaur made its first x86 chip, they followed the Intel manual in perfect detail and treated everything the manual leaves undefined as being up to the individual implementer. When they got their first chip back and tried it, they found that some compilers produced code that relied on behavior that's technically undefined on x86, but happened to always be the same on Intel chips. While that's technically a compiler bug, you can't ship a chip that isn't compatible with actually existing software, and ever since then, Centaur has implemented x86 chips by making sure that the chips match the exact behavior of Intel chips, down to matching officially undefined behavior<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">5</a></sup>.</p> <p>For years afterwards, I had encyclopedic knowledge of x86 and could set bits in control registers and <a href="https://en.wikipedia.org/wiki/Model-specific_register">MSRs</a> from memory. I didn't have a use for any of that knowledge at any future job, but the meta-skill of not being afraid of low-level hardware comes in handy pretty often, especially when I run into compiler or chip bugs. People look at you like you're a crackpot if you say you've found a hardware bug, but because we were so careful about characterizing the exact behavior of Intel chips, we would regularly find bugs and then have discussions about whether we should match the bug or match the spec (the Intel manual).</p> <p>The other thing I took away from the regression debug experience was a lifelong love of automation. Debugging often involves a large number of <a href="//danluu.com/teach-debugging/">mechanical steps</a>.
After I learned enough about x86 that debugging became boring, I started automating debugging. At that point, I knew how to write simple scripts but didn't really know how to program, so I wasn't able to totally automate the process. However, I was able to automate enough that, for 99% of failures, I just had to glance at a quick summary to figure out what the bug was, rather than spend what might be hours debugging. That turned what was previously a full-time job into something that took maybe 30-60 minutes a day (excluding days when I'd hit a bug that involved some obscure corner of x86 I wasn't already familiar with, or some bug that my script couldn't give a useful summary of).</p> <p>At that point, I did two things that I'd previously learned in internships. First, I started reading at work. I began with online commentary about programming, but there wasn't much of that, so I asked if I could expense books and read them at work. This seemed perfectly normal because a lot of other people did the same thing, and there were at least two people who averaged more than one technical book per week, including one person who averaged a technical book every 2 or 3 days.</p> <p>I settled in at a pace of somewhere between a book a week and a book a month. I read a lot of engineering books that imparted some knowledge that I no longer use, now that I spend most of my time writing software; some &quot;big idea&quot; software engineering books like Design Patterns and Refactoring, which I didn't really appreciate because I was just writing scripts; and a ton of books on different programming languages, which doesn't seem to have had any impact on me.</p> <p>The only book I read back then that changed how I write software in a way that's obvious to me was <a href="https://www.amazon.com/gp/product/0465050654/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0465050654&amp;linkId=48b09d6605f7fcaa58922dcf4060a7df">The Design of Everyday Things</a>. The core idea of the book is that while people beat themselves up for failing to use hard-to-understand interfaces, we should blame designers for designing poor interfaces, not users for failing to use them.</p> <p>If you ever run into a door that you incorrectly try to pull instead of push (or vice versa) and have some spare time, try watching how other people use the door. Whenever I do this, I'll see something like half the people who try the door use it incorrectly. That's a design flaw!</p> <p>The Design of Everyday Things has made me a lot more receptive to API and UX feedback, and a lot less tolerant of programmers who say things like &quot;it's fine — everyone knows that the arguments to <code>foo</code> and <code>bar</code> just have to be given in the opposite order&quot; or &quot;Duh! 
Everyone knows that you just need to click on the menu <code>X</code>, select <code>Y</code>, navigate to tab <code>Z</code>, open <code>AA</code>, go to tab <code>AB</code>, and then slide the setting to <code>AC</code>.&quot;</p> <p>I don't think all of that reading was a waste of time, exactly, but I would have been better off picking a few sub-fields in CS or EE and learning about them, rather than reading the sorts of books O'Reilly and Manning produce.</p> <p>It's not that these books aren't useful; it's that almost all of them are written to make sense without any particular background beyond what any random programmer might have, and you can only get so much out of reading your 50th book targeted at random programmers. IMO, most non-academic conferences have the same problem. As a speaker, you want to give a talk that works for everyone in the audience, but a side effect of that is that many talks have relatively little educational value to experienced programmers who have been to a few conferences.</p> <p>I think I got positive things out of all that reading as well, but I don't know yet how to figure out what those things are.</p> <p>As a result of my reading, I also did two things that were, in retrospect, quite harmful.</p> <p>One was that I really got into functional programming and used a functional style everywhere I could. Immutability, higher-order X for any possible value of X, etc. The result was code that I could write and modify quickly that was incomprehensible to anyone but a couple of coworkers who were also into functional programming.</p> <p>The second big negative was that I became convinced that Perl was causing us a lot of problems. We had Perl scripts that were hard to understand and modify. They'd often be thousands of lines of code with only one or two functions and no tests, and they used every obscure Perl feature you could think of. Static! Magic sigils! Implicit everything! You name it, we used it. For me, the last straw was when I inserted a new function between two functions that didn't explicitly pass any arguments or return values and broke the script: one of the functions was returning a value into an implicit variable that the next function read, so putting another function in between the two closely coupled functions broke the script.</p> <p>After that, I convinced a bunch of people to use Ruby and started using it myself. The problem was that I only managed to convince half of my team to do this. The other half kept using Perl, which resulted in language fragmentation. Worse yet, in another group, they also got fed up with Perl, but started using Python, resulting in the company having code in Perl, Python, and Ruby.</p> <p>Centaur has an explicit policy of not telling people how to do anything, which precludes having team-wide or company-wide standards. Given the environment, using a &quot;better&quot; language seemed like a natural thing to do, but I didn't recognize the cost of fragmentation until, later in my career, I saw a company that uses standardization to good effect.</p> <p>Anyway, while I was causing horrific fragmentation, I also automated away most of my regression debug job. I got bored of spending 80% of my time at work reading and I started poking around for other things to do, which is something I continued for my entire time at Centaur. I like learning new things, so I did almost everything you can do related to chip design.
The only things I didn't do were circuit design (the TL of circuit design didn't want a non-specialist interfering in his area) and a few roles where I was told &quot;Dan, you can do that if you really want to, but we pay you too much to have you do it full-time.&quot;</p> <p>If I hadn't interviewed regularly (about once a year, even though I was happy with my job), I probably would've wondered if I was stunting my career by doing so many different things, because the big chip companies produce specialists pretty much exclusively. But in interviews I found that my experience was valued because it was something they couldn't get in-house. The irony is that every single role I was offered would have turned me into a specialist. Big chip companies talk about wanting their employees to move around and try different things, but when you dig into what that means, it's that they like to have people work one very narrow role for two or three years before moving on to their next very narrow role.</p> <p>For a while, I wondered if I was doomed to either eventually move to a big company and pick up a hyper-specialized role, or stay at Centaur for my entire career (not a bad fate — Centaur has, by far, the lowest attrition rate of any place I've worked because people like it so much). But I later found that software companies building hardware accelerators actually have generalist roles for hardware engineers, and that software companies have generalist roles for programmers, although that might be a moot point since most software folks would probably consider me an extremely niche specialist.</p> <p>Regardless of whether spending a lot of time in different hardware-related roles makes you think of me as a generalist or a specialist, I picked up a lot of skills which came in handy <a href="http://www.anandtech.com/show/10340/googles-tensor-processing-unit-what-we-know">when I worked on hardware accelerators</a>, but that don't really generalize to the pure software project I'm working on today. A lot of the meta-skills I learned transfer over pretty well, though.</p> <p>If I had to pick the three most useful meta-skills I learned back then, I'd say they were debugging, bug tracking, and figuring out how to approach hard problems.</p> <p>Debugging is a funny skill to claim to have because everyone thinks they know how to debug. For me, I wouldn't even say that I learned how to debug at Centaur, but that I learned how to be persistent. Non-deterministic hardware bugs are so much worse than non-deterministic software bugs that I always believe I can track down software bugs. In the absolute worst case, when there's a bug that isn't caught in logs and can't be caught in a debugger, I can always add tracing information until the bug becomes obvious. The same thing's true in hardware, but &quot;recompiling&quot; to add tracing information takes 3 months per &quot;recompile&quot;; compared to that experience, tracking down a software bug that takes three months to figure out feels downright pleasant.</p> <p>Bug tracking is another meta-skill that everyone thinks they have, but when I look at most projects I find that they literally don't know what bugs they have and they lose bugs all the time due to a failure to triage bugs effectively. I didn't even know that I'd developed this skill until after I left Centaur and saw teams that don't know how to track bugs. At Centaur, depending on the phase of the project, we'd have between zero and a thousand open bugs.
The people I worked with most closely kept a mental model of what bugs were open; this seemed totally normal at the time, and the fact that a bunch of people did this made it easy for people to be on the same page about the state of the project and which areas were ahead of schedule and which were behind.</p> <p>Outside of Centaur, I find that I'm lucky to even find one person who's tracking what the major outstanding bugs are. Until I've been on the team for a while, people are often uncomfortable with the idea of taking a major problem and putting it into a bug instead of fixing it immediately because they're so used to bugs getting forgotten that they don't trust bugs. But that's what bug tracking is for! I view this as analogous to teams whose test coverage is so low and staging system is so flaky that they don't trust themselves to make changes because they don't have confidence that issues will be caught before hitting production. It's a huge drag on productivity, but people don't really see it until they've seen the alternative.</p> <p>Perhaps the most important meta-skill I picked up was learning how to solve large problems. When I joined Centaur, I saw people solving problems I didn't even know how to approach. There were folks like Glenn Henry, a fellow from IBM back when IBM was at the forefront of computing, and Terry Parks, who Glenn called the best engineer he knew at IBM. It wasn't that they were 10x engineers; they didn't just work faster. In fact, I can probably type 10x as quickly as Glenn (a hunt and peck typist) and could solve trivial problems that are limited by typing speed more quickly than him. But Glenn, Terry, and some of the other wizards knew how to approach problems that I couldn't even get started on.</p> <p>I can't cite any particular a-ha moment. It was just eight years of work. When I went looking for problems to solve, Glenn would often hand me a problem that was slightly harder than I thought possible for me. I'd tell him that I didn't think I could solve the problem, he'd tell me to try anyway, and maybe 80% of the time I'd solve the problem. We repeated that for maybe five or six years before I stopped telling Glenn that I didn't think I could solve the problem. Even though I don't know when it happened, I know that I eventually started thinking of myself as someone who could solve any open problem that we had.</p> <h3 id="grad-school-again-2008-2010">Grad school, again (2008 - 2010)</h3> <p>At some point during my tenure at Centaur, I switched to being part-time and did a stint taking classes and doing a bit of research at the local university. For reasons which I can't recall, I split my time between software engineering and CS theory.</p> <p>I read a lot of software engineering papers and came to the conclusion that we know very little about what makes teams (or even individuals) productive, and that the field is unlikely to have actionable answers in the near future. I also got my name on a couple of papers that I don't think made meaningful contributions to the state of human knowledge.</p> <p>On the CS theory side of things, I took some graduate level theory classes. That was genuinely educational and I really &quot;got&quot; algorithms for the first time in my life, as well as complexity theory, etc. 
I could have gotten my name on a paper that I didn't think made a meaningful contribution to the state of human knowledge, but my would-be co-author felt the same way and we didn't write it up.</p> <p>I originally tried grad school again because I was considering getting a PhD, but I didn't find the work I was doing to be any more &quot;interesting&quot; than the work I had at Centaur, and after seeing the job outcomes of people in the program, I decided there was less than a 1% chance that a PhD would provide any real value to me and went back to Centaur full time.</p> <h3 id="rc-https-www-recurse-com-scout-click-t-b504af89e87b77920c9b60b2a1f6d5e8-spring-2013"><a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> (Spring 2013)</h3> <p>After eight years at Centaur, I wanted to do something besides microprocessors. I had enough friends at other hardware companies to know that I'd be downgrading in basically every dimension except name recognition if I switched to another hardware company, so I started applying to software jobs.</p> <p>While I was applying to jobs, I heard about <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>. It sounded great, maybe even too great: when I showed my friends what people were saying about it, they thought the comments were fake. It was a great experience, and I can see why so many people raved about it, to the point where real comments sound impossibly positive. It was transformative for a lot of people; I heard a lot of exclamations like &quot;I learned more in 3 months here than in N years of school&quot; or &quot;I was totally burnt out and this was the first time I've been productive in a year&quot;. It wasn't transformative for me, but it was as fun a 3 month period as I've ever had, and I even learned a thing or two.</p> <p>From a learning standpoint, the one major thing I got out of RC was feedback from <a href="https://github.com/majek">Marek</a>, whom I worked with for about two months. While the freedom and lack of oversight at Centaur were great for letting me develop my ability to work independently, I basically didn't get any feedback on my work<sup class="footnote-ref" id="fnref:F"><a rel="footnote" href="#fn:F">6</a></sup> since they didn't do code review while I was there, and I never really got any actionable feedback in performance reviews.</p> <p>Marek is really great at giving feedback while pair programming, and working with him broke me of a number of bad habits as well as teaching me some new approaches for solving problems. At a meta level, RC is more focused on pair programming than most places and it got me to pair program for the first time. I hadn't realized how effective pair programming with someone is in terms of learning how they operate and what makes them effective.
Since then, I've asked a number of super productive programmers to pair program and I've gotten something out of it every time.</p> <h3 id="second-real-job-2013-2014">Second real job (2013 - 2014)</h3> <p>I was in the right place at the right time to land on a project that was just transitioning from <a href="https://www.linkedin.com/in/andrew-phelps-31438b6">Andy Phelps' pet 20% time project</a> into what would later be called the <a href="https://www.google.com/patents/WO2016186801A1">Google</a> <a href="https://www.google.com/patents/US20160342889">TPU</a>.</p> <p>As far as I can tell, it was pure luck that I was the second engineer on the project as opposed to the fifth or the tenth. I got to see what it looks like to take a project from its conception and turn it into something real. There was a sense in which I got that at Centaur, but every project I worked on was either part of a CPU, or a tool whose goal was to make CPU development better. This was the first time I worked on a non-trivial project from its inception, where I wasn't just working on part of the project but the whole thing.</p> <p>That would have been educational regardless of the methodology used, but it was a particularly great learning experience because of how the design was done. We started with a lengthy discussion on what core algorithm we were going to use. After we figured out an algorithm that would give us acceptable performance, we wrote design docs for every major module before getting serious about implementation.</p> <p><a href="https://twitter.com/codinghorror/status/724541041827713024">Many people consider writing design docs to be a waste of time nowadays</a>, but going through this process, which took months, had a couple big advantages. The first is that working through a design collaboratively teaches everyone on the team everyone else's tricks. It's a lot like the kind of skill transfer you get with pair programming, but applied to design. This was great for me, because as someone with only a decade of experience, I was one of the least experienced people in the room.</p> <p>The second is that the iteration speed is much faster in the design phase, where throwing away a design just means erasing a whiteboard. Once you start coding, iterating on the design can mean throwing away code; for infrastructure projects, that can easily be person-years or even tens of person-years of work. Since working on the TPU project, I've seen a couple of teams on projects of similar scope insist on getting &quot;working&quot; code as soon as possible. In every single case, that resulted in massive delays as huge chunks of code had to be re-written, and in a few cases the project was fundamentally flawed in a way that required the team to start over from scratch.</p> <p>I get that on product-y projects, where you can't tell how much traction you're going to get from something, you might want to get an <a href="https://en.wikipedia.org/wiki/Minimum_viable_product">MVP</a> out the door and iterate, but for pure infrastructure, it's often possible to predict how useful something will be in the design phase.</p> <p>The other big thing I got out of the job was a better understanding of what's possible when a company makes a real effort to make engineers productive. Something I'd seen repeatedly at Centaur was that someone would come in, take a look around, find the tooling to be a huge productivity sink, and then make a bunch of improvements.
They'd then feel satisfied that they'd improved things a lot and then move on to other problems. Then the next new hire would come in, have the same reaction, and do the same thing. The result was tools that improved a lot while I was there, but not to the point where someone coming in would be satisfied with them. Google was the only place I'd worked where a lot of the tools seem like magic compared to what exists in the outside world<sup class="footnote-ref" id="fnref:magic"><a rel="footnote" href="#fn:magic">7</a></sup>. Sure, people complain that a lot of the tooling is falling over, that there isn't enough documentation, and that a lot of it is out of date. All true. But the situation is much better than it's been at any other company I've worked at. That doesn't seem to actually be a competitive advantage for Google's business, but it makes the development experience really pleasant.</p> <h3 id="third-real-job-2015-2017">Third real job (2015 - 2017)</h3> <p>This was a surprising experience. I think I'm too close to it to really know what I got out of the experience, so fully filling in this section is a TODO.</p> <p>One thing that was really interesting is that there are a lot of things I used to think of as &quot;table stakes&quot; for getting things done that it turns out one can do without. An example is version control. I was and still am strongly in favor of using version control, but the project I worked on with a TL who was strongly against version control was still basically successful. There was a lot of overhead until we started using version control, but dealing with the fallout of not having version control and having people not really sync changes only cost me a day or two a week of manually merging in changes in my private repo to get the build to consistently work. That's obviously far from ideal, but, across the entire team, not enough of a cost to make the difference between success and failure.</p> <h3 id="rc-https-www-recurse-com-scout-click-t-b504af89e87b77920c9b60b2a1f6d5e8-2017-present"><a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> (2017 - present)</h3> <p>I wanted a fun break after my last job, so I went back to RC to do fun programming-related stuff and recharge. I haven't written up most of what I've worked on (e.g., an analysis of 80k games on Terra Mystica, MTA (NYC) subway data analysis, etc.). I've written up a few things, like latency analysis of <a href="//danluu.com/input-lag/">computers</a>, <a href="//danluu.com/term-latency/">terminals</a>, <a href="//danluu.com/keyboard-latency/">keyboards</a>, and <a href="//danluu.com/web-bloat/">websites</a>, though.</p> <p>One thing my time at RC has got me thinking about is why it's so hard to get <a href="//twitter.com/danluu/status/931018170420400128">paid well</a> to write. There appears to be a lot of demand for &quot;good&quot; writing, but companies don't seem very willing to create roles for people who could program but want to write. Steve Klabnik has had a tremendous impact on Rust through his writing, probably more impact than the median programmer on most projects, but my impression is that he's taking a significant pay cut over what he could make as a programmer in order to do this really useful and important thing.</p> <p>I've tried pitching this kind of role at a few places and the response so far has mostly been a combination of:</p> <ul> <li>We value writing!
I don't think it makes sense to write full-time or even half-time, but you can join my team, where we support writing, and write as a 20%-time project or in your spare time!</li> <li>Uhhh, we could work something out, but why would anyone who can program want to write?</li> </ul> <p>Neither of these responses makes me think that writing would actually be as valued as programming on those teams, even if writing is more valued on those teams relative to most. There are some &quot;developer evangelist&quot; roles that involve writing, but when I read engineering blogs written by people with that title, most of the writing appears to be thinly disguised press releases (there are obviously exceptions to this, but even in the cases where blogs have interesting engineering output, the interesting output is often interleaved with pseudo press releases). In addition to being boring, that kind of thing seems pretty ineffective. At one company I worked for, I ran the traffic numbers for their developer evangelist blogs vs. my own blog, and there were a lot of months where my blog got more traffic than all of their hosted evangelist blogs combined. I don't think it's surprising to find that programmers would rather read explanations/analysis/history than PR, but it seems difficult to convince the right people of this, so I'll probably go back to a programming job after this. We'll see.</p> <p>BTW, this isn't to say that I don't enjoy programming or don't think that it's important. It's just that writing seems undervalued in a way that makes it relatively easy to have outsized impact through writing. But the same forces that make it easy to have outsized impact also make it difficult to get <a href="//danluu.com/startup-tradeoffs/">paid well</a>!</p> <h3 id="what-about-the-bad-stuff">What about the bad stuff?</h3> <p>When I think about my career, it seems to me that it's been one lucky event after the next. I've been unlucky a few times, but I don't really know what to take away from the times I've been unlucky.</p> <p>For example, I'd consider my upbringing to be mildly abusive. I remember having nights where I couldn't sleep because I'd have nightmares about my father every time I fell asleep. Being awake during the day wasn't a great experience, either. That's obviously not good and in retrospect it seems pretty directly related to the academic problems I had until I moved out, but I don't know that I could give useful advice to a younger version of myself. Don't be born into an abusive family? That's something people would already do if they had any control over the matter.</p> <p>Or to pick a more recent example, I once joined a team that scored a 1 on the <a href="http://www.joelonsoftware.com/articles/fog0000000043.html">Joel Test</a>. The Joel Test is now considered to be obsolete because it awards points for things like &quot;Do you have testers?&quot; and &quot;Do you fix bugs before writing new code?&quot;, which aren't considered best practices by most devs today. Of the items that aren't controversial, many seem so obvious that they're not worth asking about, things like:</p> <ul> <li>Do you use source control?</li> <li>Can you make a build in one step?</li> <li>Do you make (at least) daily builds?</li> <li>Do you have a bug database?</li> </ul> <p>For anyone who cares about this kind of thing, it's clearly not a great idea to join a team that does, at most, 1 item off of Joel's checklist (and the 1 wasn't any of the above).
Getting first-hand experience on a team that scored a 1 didn't give me any new information that would make me reconsider my opinion.</p> <p>You might say that I should have asked about those things. It's true! I should have, and I probably will in the future. However, when I was hired, the TL who was against version control and other forms of automation hadn't been hired yet, so I wouldn't have found out about this if I'd asked. Furthermore, even if he'd already been hired, I'm still not sure I would have found out about it — this is the only time I've joined a team and then found that most of the factual statements made during the recruiting process were untrue. I made sure to ask specific, concrete, questions about the state of the project, processes, experiments that had been run, etc., but it turned out the answers were outright falsehoods. When I was on that team, every day featured a running joke between team members about how false the recruiting pitch was!</p> <p>I could try to prevent similar problems in the future by asking for concrete evidence of factual claims (e.g., if someone claims the attrition rate is X, I could ask for access to the HR database to verify), but considering that I have a finite amount of time and the relatively low probability of being told outright falsehoods, I think I'm going to continue to prioritize finding out other information when I'm considering a job and just accept that there's a tiny probability I'll end up in a similar situation in the future.</p> <p>When I look at the bad career-related stuff I've experienced, almost all of it falls into one of two categories: something obviously bad that was basically unavoidable, or something obviously bad that I don't know how to reasonably avoid, given limited resources. I don't see much to learn from that. That's not to say that I haven't made and learned from mistakes. I've made a lot of mistakes and do a lot of things differently as a result of mistakes! But my worst experiences have come out of things that I don't know how to prevent in any reasonable way.</p> <p>This also seems to be true for most people I know. For example, something I've seen a lot is that a friend of mine will end up with a manager whose view is that managers are people who dole out rewards and punishments (as opposed to someone who believes that managers should make the team as effective as possible, or someone who believes that managers should help people grow). When you have a manager like that, a common failure mode is that you're given work that's a bad fit, and then maybe you don't do a great job because the work is a bad fit. If you ask for something that's a better fit, that's refused (why should you be rewarded with doing something you want when you're not doing good work, instead you should be punished by having to do more of this thing you don't like), which causes a spiral that ends in the person leaving or getting fired. In the most recent case I saw, the firing was a surprise to both the person getting fired and their closest co-workers: my friend had managed to find a role that was a good fit despite the best efforts of management; when management decided to fire my friend, they didn't bother to consult the co-workers on the new project, who thought that my friend was doing great and had been doing great for months!</p> <p>I hear a lot of stories like that, and I'm happy to listen because I like stories, but I don't know that there's anything actionable here. 
Avoid managers who prefer doling out punishments to helping their employees? Obvious but not actionable.</p> <h3 id="conclusion">Conclusion</h3> <p>The most common sort of career advice I see is &quot;you should do what I did because I'm successful&quot;. It's usually phrased differently, but that's the gist of it. That basically never works. When I compare notes with friends and acquaintances, it's pretty clear that my career has been unusual in a number of ways, but it's not really clear why.</p> <p>Just for example, I've almost always had a supportive manager who's willing to not only let me learn whatever I want on my own, but who's willing to expend substantial time and effort to help me improve as an engineer. Most folks I've talked to have never had that. Why the difference? I have no idea.</p> <p>One <a href="https://www.youtube.com/watch?v=RoEEDKwzNBw">story</a> might be: the two times I had unsupportive managers, I quickly found other positions, whereas a lot of friends of mine will stay in roles that are a bad fit for years. Maybe I could spin it to make it sound like the moral of the story is that you should leave roles sooner than you think, but both of the bad situations I ended up in, I only ended up in because I left a role sooner than I should have, so the advice can't be &quot;prefer to leave roles sooner than you think&quot;. Maybe the moral of the story should be &quot;leave bad roles more quickly and stay in good roles longer&quot;, but that's so obvious that it's not even worth stating. This is arguably non-obvious because people do, in fact, stay in roles where they're miserable, but when I think of people who do so, they fall into one of two categories. Either they're stuck for extrinsic reasons (e.g., need to wait out the visa clock) or they know that they should leave but can't bring themselves to do so. There's not much to do about the former case, and in the latter case, knowing that they should leave isn't the problem. Every strategy that I can think of is either incorrect in the general case, or so obvious there's no reason to talk about it.</p> <p>Another story might be: I've learned a lot of meta-skills that are valuable, so you should learn these skills. But you probably shouldn't. The particular set of meta-skills I've picked have been great for me because they're skills I could easily pick up in places I worked (often because I had a great mentor) and because they're things I really strongly believe in doing. Your circumstances and core beliefs are probably different from mine and you have to figure out for yourself what it makes sense to learn.</p> <p>Yet another story might be: while a lot of opportunities come from serendipity, I've had a lot of opportunities because I spend a lot of time generating possible opportunities. When I passed around the draft of this post to some friends, basically everyone told me that I emphasized luck too much in my narrative and that all of my lucky breaks came from a combination of hard work and trying to create opportunities. While there's a sense in which that's true, many of my opportunities also came out of making outright bad decisions.</p> <p>For example, I ended up at Centaur because I turned down the chance to work at IBM for a terrible reason! At the end of my internship, my manager made an attempt to convince me to stay on as a full-time employee, but I declined because I was going to grad school. 
But I was only going to grad school because I wanted to get a microprocessor logic design position, something I thought I couldn't get with just a bachelor's degree. But I could have gotten that position if I hadn't turned my manager down! I'd just forgotten the reason that I'd decided to go to grad school and incorrectly used the cached decision as a reason to turn down the job. By sheer luck, that happened to work out well and I got better opportunities than anyone I know from my intern cohort who decided to take a job at IBM. Have I &quot;mostly&quot; been lucky or prepared? Hard to say; maybe even impossible.</p> <p>Careers don't have the logging infrastructure you'd need to determine the impact of individual decisions. Careers in programming, anyway. Many sports now track play-by-play data in a way that makes it possible to try to determine how much of success in any particular game or any particular season was luck and how much was skill.</p> <p>Take baseball, which is one of the better understood sports. If we look at the statistical understanding we have of performance today, it's clear that almost no one had a good idea about what factors made players successful 20 years ago. One thing I find particularly interesting is that we now have a much better understanding of which factors are fundamental and which factors come down to luck, and it's not at all what almost anyone would have thought 20 years ago. We can now look at a pitcher and say something like &quot;they've gotten unlucky this season, but their foo, bar, and baz rates are all great so it appears to be bad luck on balls in play as opposed to any sort of decline in skill&quot;, and we can also make statements like &quot;they've done well this season but their fundamental stats haven't moved so it's likely that their future performance will be no better than their past performance before this season&quot;. We couldn't have made a statement like that 20 years ago. And this is a sport that's had play-by-play video available going back what seems like forever, where play-by-play stats have been kept for a century, etc.</p> <p>In this sport where everything is measured, it wasn't until relatively recently that we could disambiguate between fluctuations in performance due to luck and fluctuations due to changes in skill. And then there's programming, where it's generally believed to be impossible to measure people's performance and the state of the art in grading people's performance is that you ask five people for their comments on someone and then aggregate the comments. If we're only just now able to make comments on what's attributable to luck and what's attributable to skill in a sport where every last detail of someone's work is available, how could we possibly be anywhere close to making claims about what comes down to luck vs. other factors in something as nebulous as a programming career?</p> <p>In conclusion, life is messy and I don't have any advice.</p> <h3 id="appendix-a-meta-skills-i-d-like-to-learn">Appendix A: meta-skills I'd like to learn</h3> <h4 id="documentation">Documentation</h4> <p>I once worked with <a href="https://www.linkedin.com/in/jared-davis-a8b62611b">Jared Davis, a documentation wizard</a> whose documentation was so good that I'd go to him to understand how a module worked before I talked to the owner of the module.
As far as I could tell, he wrote documentation on things he was trying to understand to make life easier for himself, but his documentation was so good that it was a force multiplier for the entire company.</p> <p>Later, at Google, I noticed a curiously strong correlation between the quality of initial design docs and the success of projects. Since then, I've tried to write solid design docs and documentation for my projects, but I still have a ways to go.</p> <h4 id="fixing-totally-broken-danluu-com-wat-situations">Fixing <a href="//danluu.com/wat/">totally broken</a> situations</h4> <p>So far, I've only landed on teams where things are much better than average and on teams where things are much worse than average. You might think that, because there's so much low hanging fruit on teams that are much worse than average, it should be easier to improve things on teams that are terrible, but it's just the opposite. The places that have a lot of problems have problems because something makes it hard to fix the problems.</p> <p>When I joined the team that scored a 1 on the Joel Test, it took months of campaigning just to get everyone to use version control.</p> <p>I've never seen an environment go from &quot;bad&quot; to &quot;good&quot; and I'd be curious to know what that looks like and how it happens. Yossi Kreinen's thesis is that <a href="http://yosefk.com/blog/people-can-read-their-managers-mind.html">only management can fix broken situations</a>. That might be true, but I'm not quite ready to believe it just yet, even though I don't have any evidence to the contrary.</p> <h3 id="appendix-b-other-how-i-became-a-programmer-stories">Appendix B: other &quot;how I became a programmer&quot; stories</h3> <p><a href="https://www.mail-archive.com/kragen-tol@canonical.org/msg00184.html">Kragen</a>. Describes 27 years of learning to program. Heavy emphasis on conceptual phases of development (e.g., understanding how to use provided functions vs. understanding that you can write arbitrary functions)</p> <p><a href="http://jvns.ca/blog/2015/02/17/how-i-learned-to-program-in-10-years/">Julia Evans</a>. Started programming on a TI-83 in 2004. Dabbled in programming until college (2006-2011) and has been working as a professional programmer ever since. Some emphasis on the &quot;journey&quot; and how long it takes to improve.</p> <p><a href="http://pgbovine.net/how-i-learned-programming.htm">Philip Guo</a>. A non-traditional story of learning to program, which might be surprising if you know that Philip's career path was MIT -&gt; Stanford -&gt; Google.</p> <p><a href="https://github.com/tarmstrong/longcv/blob/master/bio.md">Tavish Armstrong</a>. 4th grade through college. Emphasis on particular technologies (e.g., LaTeX or Python).</p> <p><a href="https://caitiem.com/2013/03/30/origin-story-becoming-a-game-developer/">Caitie McCaffrey</a>. Started programming in AP computer science. Emphasis on how interests led to a career in programming.</p> <p><a href="http://mattdeboard.net/2011/11/23/how-i-became-a-programmer/">Matt DeBoard</a>. Spent 12 weeks learning Django with the help of a mentor. Emphasis on the fact that it's possible to become a programmer without programming background.</p> <p><a href="https://www.kchodorow.com/blog/2010/11/30/how-i-became-a-programmer/">Kristina Chodorow</a>. Started in college. Emphasis on alternatives (math, grad school).</p> <p><a href="http://michaelrbernste.in/2014/12/11/you-are-learning-haskell-right-now.html">Michael Bernstein</a>. 
Story of learning Haskell over the course of years. Emphasis on how long it took to become even minimally proficient.</p> <p><small> Thanks to Leah Hanson, Lindsey Kuper, Kelley Eskridge, Jeshua Smith, Tejas Sapre, Joe Wilder, Adrien Lamarque, Maggie Zhou, Lisa Neigut, Steve McCarthy, Darius Bacon, Kaylyn Gibilterra, Sarah Ransohoff, @HamsterRaging, Alex Allain, and &quot;biktian&quot; for comments/criticism/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">If you happen to have contact information for Mr. Swanson, I'd love to be able to send a note saying thanks. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:B">Wayne Dickey, Richard Brualdi, Andreas Seeger, and a visiting professor whose name escapes me. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:W">I strongly recommend Andy Weiner for any class, as well as the guy who taught mathematical physics when I sat in on it, but I don't remember who that was or if that's even the exact name of the class. <a class="footnote-return" href="#fnref:W"><sup>[return]</sup></a></li> <li id="fn:G">with the exception of one government lab, which gave me an offer on the strength of a non-technical on-campus interview. I believe that was literally the first interview I did when I was looking for work, but they didn't get back to me until well after interview season was over and I'd already accepted an offer. I wonder if that's because they went down the list of candidates in some order and only got to me after N people turned them down or if they just had a six month latency on offers. <a class="footnote-return" href="#fnref:G"><sup>[return]</sup></a></li> <li id="fn:E"><p>Because Intel sees no reason to keep its competitors informed about what it's doing, this results in a substantial latency when matching new features. They usually announce enough information that you can implement the basic functionality, but behavior on edge cases may vary. We once had a bug (noticed and fixed well before we shipped, but still problematic) where we bought an engineering sample off of eBay and implemented some new features based on the engineering sample. This resulted in an MWAIT bug that caused Windows to hang; Intel had changed the behavior of MWAIT between shipping the engineering sample and shipping the final version.</p> <p>I recently saw a post that claims that you can get great performance per dollar by buying some engineering samples off of eBay. Don't do this. Engineering samples regularly have bugs. Sometimes those bugs are actual bugs, and sometimes it's just that Intel changed their minds. Either way, you really don't want to run production systems off of engineering samples.</p> <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> <li id="fn:F">I occasionally got feedback by taking a problem I'd solved to someone and asking them if they had any better ideas, but that's much less in depth than the kind of feedback I'm talking about here. <a class="footnote-return" href="#fnref:F"><sup>[return]</sup></a></li> <li id="fn:magic"><p>To pick one arbitrary concrete example, <a href="http://moishelettvin.blogspot.ca/2006/11/windows-shutdown-crapfest.html">look at version control at Microsoft</a> from someone who worked on Windows Vista:</p> <blockquote> <p>In small programming projects, there's a central repository of code. Builds are produced, generally daily, from this central repository. 
Programmers add their changes to this central repository as they go, so the daily build is a pretty good snapshot of the current state of the product.</p> <p>In Windows, this model breaks down simply because there are far too many developers to access one central repository. So Windows has a tree of repositories: developers check in to the nodes, and periodically the changes in the nodes are integrated up one level in the hierarchy. At a different periodicity, changes are integrated down the tree from the root to the nodes. In Windows, the node I was working on was 4 levels removed from the root. The periodicity of integration decayed exponentially and unpredictably as you approached the root so it ended up that it took between 1 and 3 months for my code to get to the root node, and some multiple of that for it to reach the other nodes. It should be noted too that the only common ancestor that my team, the shell team, and the kernel team shared was the root.</p> </blockquote> <p>Google and Microsoft both maintained their own forks of perforce because that was the most scalable source control system available at the time. Google would go on to build piper, a distributed version control system (in the distributed systems sense, not in the git sense) that solved the scaling problem while having a dev experience that wasn't nearly as painful. But that option wasn't really on the table at Microsoft. In the comments to the post quoted above, a then-manager at Microsoft commented that the possible options were:</p> <blockquote> <ol> <li>federate out the source tree, and pay the forward and reverse integration taxes (primarily delay in finding build breaks), or...</li> <li>remove a large number of the unnecessary dependencies between the various parts of Windows, especially the circular dependencies.</li> <li>Both 1&amp;2</li> </ol> <p>#1 was the winning solution in large part because it could be executed by a small team over a defined period of time. #2 would have required herding all the Windows developers (and PMs, managers, UI designers...), and is potentially an unbounded problem.</p> </blockquote> <p>Someone else commented to me that they were on an offshoot team that got the one-way latency down from months to weeks. That's certainly an improvement, but why didn't anyone build a system like piper? I asked that question of people who were at Microsoft at the time, and I got answers like &quot;when we started using perforce, it was so much faster than what we'd previously had that it didn't occur to people that we could do much better&quot; and &quot;perforce was so much faster than xcopy that it seemed like magic&quot;.</p> <p>This general phenomenon, where people don't attempt to make a major improvement because the current system is already such a huge improvement over the previous system, is something I'd seen before and even something I'd done before. This example happens to use Microsoft and Google, but please don't read too much into that. There are systems where things are flipped around and the system at Google is curiously unwieldy compared to the same system at Microsoft.</p> <a class="footnote-return" href="#fnref:magic"><sup>[return]</sup></a></li> </ol> </div> Notes on concurrency bugs concurrency-bugs/ Thu, 04 Aug 2016 20:32:26 -0700 concurrency-bugs/ <p>Do concurrency bugs matter?
From the literature, we know that <a href="http://danluu.com/postmortem-lessons/">most reported bugs in distributed systems</a> have really simple causes and can be caught by trivial tests, even when we only look at bugs that cause really bad failures, like loss of a cluster or data corruption. The filesystem literature echoes this result -- <a href="http://danluu.com/file-consistency/">a simple checker that looks for totally unimplemented error handling can find hundreds of serious data corruption bugs</a>. Most bugs are simple, at least if you measure by bug count. But if you measure by debugging time, the story is a bit different.</p> <p>Just from personal experience, I've spent more time debugging complex non-deterministic failures than all other types of bugs combined. In fact, I've spent more time debugging some individual non-deterministic bugs (weeks or months) than on all other bug types combined. Non-deterministic bugs are rare, but they can be extremely hard to debug and they're a productivity killer. Bad non-deterministic bugs take so long to debug that relatively large investments in tools and prevention can be worth it<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <p>Let's see what the academic literature has to say on non-deterministic bugs. There's a lot of literature out there, so let's narrow things down by looking at one relatively well studied area: concurrency bugs. We'll start with the literature on single-machine concurrency bugs and then look at distributed concurrency bugs.</p> <h3 id="fonseca-et-al-dsn-10-http-concurrency-mpi-sws-org-dsn2010-concurrencybugs-pdf"><a href="http://concurrency.mpi-sws.org/dsn2010-concurrencybugs.pdf">Fonseca et al. DSN '10</a></h3> <p>They studied MySQL concurrency bugs from 2003 to 2009 and found the following:</p> <h4 id="more-non-deadlock-bugs-63-than-deadlock-https-en-wikipedia-org-wiki-deadlock-bugs-40">More non-deadlock bugs (63%) than <a href="https://en.wikipedia.org/wiki/Deadlock">deadlock</a> bugs (40%)</h4> <p>Note that these numbers sum to more than 100% because some bugs are tagged with multiple causes. This is roughly in line with the Lu et al. ASPLOS '08 paper (which we'll look at later), which found that 30% of the bugs they examined were deadlock bugs.</p> <h4 id="15-of-examined-failures-were-semantic">15% of examined failures were semantic</h4> <p>The paper defines a semantic failure as one &quot;where the application provides the user with a result that violates the intended semantics of the application&quot;. The authors also find that &quot;the vast majority of semantic bugs (92%) generated subtle violations of application semantics&quot;. By their nature, these failures are likely to be undercounted -- it's pretty hard to miss a deadlock, but it's easy to miss subtle data corruption.</p> <h4 id="15-of-examined-failures-were-latent">15% of examined failures were latent</h4> <p>The paper defines latent failures as ones that &quot;do not become immediately visible to users.&quot; Unsurprisingly, the paper finds that latent failures are closely related to semantic failures; 92% of latent failures are semantic and vice versa. The 92% number makes this finding sound more precise than it really is -- it's just that 11 out of the 12 semantic failures are latent and vice versa.
That could have easily been 11 out of 11 (100%) or 10 out of 12 (83%).</p> <p>That's interesting, but it's hard to tell from that if the results generalize to projects that aren't databases, or even projects that aren't MySQL.</p> <h3 id="lu-et-al-asplos-08-http-web1-cs-columbia-edu-junfeng-10fa-e6998-papers-concurrency-bugs-pdf"><a href="http://web1.cs.columbia.edu/~junfeng/10fa-e6998/papers/concurrency-bugs.pdf">Lu et al. ASPLOS '08</a></h3> <p>They looked at concurrency bugs in MySQL, Firefox, OpenOffice, and Apache. Some of their findings are:</p> <h4 id="97-of-examined-non-deadlock-bugs-were-atomicity-violation-or-order-violation-bugs">97% of examined non-deadlock bugs were atomicity-violation or order-violation bugs</h4> <p>Of the 74 non-deadlock bugs studied, 51 were atomicity bugs, 24 were ordering bugs, and 2 were categorized as &quot;other&quot;.</p> <p>An example of an atomicity violation is this bug from MySQL:</p> <p>Thread 1:</p> <pre><code>if (thd-&gt;proc_info) fputs(thd-&gt;proc_info, ...) </code></pre> <p>Thread 2:</p> <pre><code>thd-&gt;proc_info = NULL; </code></pre> <p>For anyone who isn't used to C or C++, <code>thd</code> is a pointer, and <code>-&gt;</code> is the operator to access a field through a pointer. The first line in thread 1 checks if the field is null. The second line calls <code>fputs</code>, which writes the field. The intent is to only call <code>fputs</code> if and only if <code>proc_info</code> isn't <code>NULL</code>, but there's nothing preventing another thread from setting <code>proc_info</code> to <code>NULL</code> &quot;between&quot; the first and second lines of thread 1.</p> <p>Like most bugs, this bug is obvious in retrospect, but if we look at the original bug report, <a href="http://bugs.mysql.com/bug.php?id=3596">we can see that it wasn't obvious at the time</a>:</p> <blockquote> <p>Description: I've just noticed with the latest bk tree than MySQL regularly crashes in InnoDB code ... How to repeat: I've still no clues on why this crash occurs.</p> </blockquote> <p>As is common with large codebases, fixing the bug once it was diagnosed was more complicated than it first seemed. This bug was partially fixed in 2004, <a href="https://bugs.mysql.com/bug.php?id=38883">resurfaced again and was fixed in 2008</a>. <a href="https://bugs.mysql.com/bug.php?id=38816">A fix for another bug caused a regression in 2009</a>, which was also fixed in 2009. That fix introduced <a href="https://bugs.mysql.com/bug.php?id=60682">a deadlock that was found in 2011</a>.</p> <p>An example ordering bug is the following bug from Firefox:</p> <p>Thread 1:</p> <pre><code>mThread=PR_CreateThread(mMain, ...); </code></pre> <p>Thread 2:</p> <pre><code>void mMain(...) { mState = mThread-&gt;State; } </code></pre> <p><code>Thread 1</code> launches <code>Thread 2</code> with <code>PR_CreateThread</code>. <code>Thread 2</code> assumes that, because the line that launched it assigned to <code>mThread</code>, <code>mThread</code> is valid. But <code>Thread 2</code> can start executing before <code>Thread 1</code> has assigned to <code>mThread</code>! 
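</p> <p>To make the ordering hazard concrete, here's a minimal, hypothetical C++ sketch of the same pattern and one common way to repair it. This is not Firefox's actual code; the names (<code>Worker</code>, <code>self_ready</code>, etc.) are made up for illustration. The idea is that the child thread must not touch the handle until the parent has published it, which is enforced here with a mutex and a condition variable:</p> <pre><code>#include &lt;condition_variable&gt;
#include &lt;mutex&gt;
#include &lt;thread&gt;

struct Worker {
    std::thread self;                 // assigned by the parent after launch
    std::mutex m;
    std::condition_variable cv;
    bool self_ready = false;          // set once the parent has assigned `self`
    std::thread::id observed_id;

    void run() {
        // Without the wait below, this thread could read `self` before the
        // parent's assignment to `self` has completed -- the same ordering
        // hazard as the Firefox bug above.
        std::unique_lock&lt;std::mutex&gt; lock(m);
        cv.wait(lock, [this] { return self_ready; });
        observed_id = self.get_id();  // safe: the assignment happened-before
    }
};

int main() {
    Worker w;
    w.self = std::thread(&amp;Worker::run, &amp;w);  // parent assigns the handle...
    {
        std::lock_guard&lt;std::mutex&gt; lock(w.m);
        w.self_ready = true;                 // ...and only then flips the flag
    }
    w.cv.notify_one();
    w.self.join();
}
</code></pre> <p>An often simpler fix with the same effect is to not share the handle at all and instead hand the child thread everything it needs as arguments when it's created.</p> <p>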
The authors note that they call this an ordering bug and not an atomicity bug even though the bug could have been prevented if the line in thread 1 were atomic because their &quot;bug pattern categorization is based on root cause, regardless of possible fix strategies&quot;.</p> <p>An example of an &quot;other&quot; bug, one of only two studied, is this bug in MySQL:</p> <p>Threads 1...n:</p> <pre><code>rw_lock(&amp;lock); </code></pre> <p>Watchdog thread:</p> <pre><code>if (lock_wait_time[i] &gt; fatal_timeout)
  assert(0);
</code></pre> <p>This can cause a spurious crash when there's more than the expected amount of work. Note that the study doesn't look at performance bugs, so a bug where lock contention causes things to slow to a crawl but a watchdog doesn't kill the program wouldn't be considered.</p> <p>An aside that's probably a topic for another post is that hardware often has deadlock or livelock detection built in, and that when a deadlock or livelock is detected, hardware will often try to push things into a state where normal execution can continue. After detecting and breaking the deadlock/livelock, an error will typically be logged in a way that will be noticed if the issue is caught in the lab, but that external customers won't see. For some reason, that strategy seems rare in the software world, although it seems like it should be easier in software than in hardware.</p> <p><a href="https://en.wikipedia.org/wiki/Deadlock">Deadlock occurs if and only if the following four conditions are true</a>:</p> <ol> <li>Mutual exclusion: at least one resource must be held in a non-shareable mode. Only one process can use the resource at any given instant of time.</li> <li>Hold and wait or resource holding: a process is currently holding at least one resource and requesting additional resources which are being held by other processes.</li> <li>No preemption: a resource can be released only voluntarily by the process holding it.</li> <li>Circular wait: a process must be waiting for a resource which is being held by another process, which in turn is waiting for the first process to release the resource.</li> </ol> <p>There's nothing about these conditions that's unique to either hardware or software, and it's easier to build mechanisms that can back off and replay to relax (2) in software than in hardware. Anyway, back to the study findings.</p> <h4 id="96-of-examined-concurrency-bugs-could-be-reproduced-by-fixing-the-relative-order-of-2-specific-threads">96% of examined concurrency bugs could be reproduced by fixing the relative order of 2 specific threads</h4> <p>This sounds like great news for testing. Testing only orderings between thread pairs is much more tractable than testing all orderings between all threads. Similarly, 92% of examined bugs could be reproduced by fixing the order of four (or fewer) memory accesses. However, there's a kind of sampling bias here -- only bugs that could be reproduced could be analyzed for a root cause, and bugs that only require ordering between two threads or only a few memory accesses are easier to reproduce.</p> <h4 id="97-of-examined-deadlock-bugs-were-caused-by-two-threads-waiting-for-at-most-two-resources">97% of examined deadlock bugs were caused by two threads waiting for at most two resources</h4> <p>Moreover, 22% of examined deadlock bugs were caused by a thread acquiring a resource held by the thread itself. 
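</p> <p>To make that concrete, here's a minimal, hypothetical example (not one of the bugs from the study) of the self-deadlock pattern: a public method takes a non-recursive lock and then calls an internal helper that tries to take the same lock again. The usual fix is to keep an internal variant of the helper that assumes the lock is already held (a recursive mutex also works, but tends to paper over the design problem).</p> <pre><code>// self_deadlock.cc -- hypothetical example of a thread deadlocking on a
// lock it already holds.
#include &lt;cstddef&gt;
#include &lt;mutex&gt;
#include &lt;vector&gt;

class Cache {
 public:
  void Insert(int key) {
    std::lock_guard&lt;std::mutex&gt; lock(m_);
    keys_.push_back(key);
    EvictIfNeeded();  // bug: called with m_ already held by this thread
  }

 private:
  void EvictIfNeeded() {
    // Second acquisition of a non-recursive std::mutex on the same thread:
    // undefined behavior, and in practice the thread hangs here forever.
    std::lock_guard&lt;std::mutex&gt; lock(m_);
    if (keys_.size() &gt; kMaxKeys) keys_.erase(keys_.begin());
  }

  static constexpr std::size_t kMaxKeys = 128;
  std::mutex m_;
  std::vector&lt;int&gt; keys_;
};

int main() {
  Cache c;
  c.Insert(42);  // never returns
}
</code></pre> <p>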
The authors state that pairwise testing of acquisition and release sequences should be able to catch most deadlock bugs, and that pairwise testing of thread orderings should be able to catch most non-deadlock bugs. The claim seems plausibly true when read as written; the implication seems to be that virtually all bugs can be caught through some kind of pairwise testing, but I'm a bit skeptical of that due to the sample bias of the bugs studied.</p> <p>I've seen bugs with many moving parts take months to track down. The worst bug I've seen consumed nearly a person-year's worth of time. Bugs like that mostly don't make it into studies like this because it's rare that a job allows someone the time to chase bugs that elusive. How many bugs like that are out there is still an open question.</p> <h4 id="caveats">Caveats</h4> <p>Note that all of the programs studied were written in C or C++, and that this study predates C++11. Moving to C++11 and using atomics and scoped locks would probably change the numbers substantially, not to mention moving to an entirely different concurrency model. There's some academic work on how different concurrency models affect bug rates, but it's not really clear how that work generalizes to codebases as large and mature as the ones studied, and by their nature, large and mature codebases are hard to do randomized trials on when the trial involves changing the fundamental primitives used. The authors note that 39% of examined bugs could have been prevented by using transactional memory, but it's not clear how many other bugs might have been introduced if transactional memory were used.</p> <h3 id="tools">Tools</h3> <p>There are other papers on characterizing single-machine concurrency bugs, but in the interest of space, I'm going to skip those. There are also papers on distributed concurrency bugs, but before we get to that, let's look at some of the tooling for finding single-machine concurrency bugs that's in the literature. I find the papers to be pretty interesting, especially the model checking work, but realistically, I'm probably not going to build a tool from scratch if something is available, so let's look at what's out there.</p> <h4 id="hapset">HapSet</h4> <p>Uses run-time coverage to generate interleavings that haven't been covered yet. This is out of NEC labs; googling <code>NEC labs HapSet</code> returns the paper, <a href="https://www.google.com/patents/US20120089873">some patent listings</a>, but no obvious download for the tool.</p> <h4 id="chess-https-chesstool-codeplex-com"><a href="https://chesstool.codeplex.com/">CHESS</a></h4> <p>Generates unique interleavings of threads for each run. They claim that, by not tracking state, the checker is much simpler than it would otherwise be, and that they're able to avoid many of the disadvantages of tracking state via a detail that can't properly be described in this tiny little paragraph; read the paper if you're interested! Supports C# and C++. The page claims that it requires Visual Studio 2010 and that it's only been tested with 32-bit code. I haven't tried to run this on a modern *nix compiler, but IME requiring Visual Studio 2010 means that it would be a moderate effort to get it running on a modern version of Visual Studio, and a substantial effort to get it running on a modern version of gcc or clang. 
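</p> <p>Availability aside, the basic idea behind this class of tools is easy to sketch. The toy program below (plain sequential C++; this is not how CHESS itself is implemented, it just shows the flavor of the approach) models two threads as numbered steps over shared state borrowed from the MySQL example earlier, enumerates every interleaving, and reports which schedules trip the bug.</p> <pre><code>// toy_explore.cc -- toy sketch of systematic interleaving exploration.
#include &lt;cstdio&gt;
#include &lt;vector&gt;

struct State {
  bool proc_info_set = true;   // thd-&gt;proc_info != NULL
  bool saw_non_null = false;   // result of thread 0's check
  bool crashed = false;        // fputs() would be called on a NULL pointer
};

// Run one step of one thread. Thread 0 is the check-then-use thread,
// thread 1 is the thread that clears proc_info.
void RunStep(State&amp; s, int thread, int step) {
  if (thread == 0 &amp;&amp; step == 0) s.saw_non_null = s.proc_info_set;
  if (thread == 0 &amp;&amp; step == 1 &amp;&amp; s.saw_non_null &amp;&amp; !s.proc_info_set) s.crashed = true;
  if (thread == 1 &amp;&amp; step == 0) s.proc_info_set = false;
}

// Enumerate every interleaving of the two threads' remaining steps and run
// each complete schedule against a fresh copy of the shared state.
void Explore(std::vector&lt;int&gt;&amp; schedule, int steps_left[2]) {
  if (steps_left[0] == 0 &amp;&amp; steps_left[1] == 0) {
    State s;
    int step_index[2] = {0, 0};
    for (int thread : schedule) RunStep(s, thread, step_index[thread]++);
    std::printf(&quot;schedule&quot;);
    for (int thread : schedule) std::printf(&quot; T%d&quot;, thread);
    std::printf(&quot;: %s\n&quot;, s.crashed ? &quot;CRASH&quot; : &quot;ok&quot;);
    return;
  }
  for (int thread = 0; thread &lt; 2; thread++) {
    if (steps_left[thread] == 0) continue;
    schedule.push_back(thread);
    steps_left[thread]--;
    Explore(schedule, steps_left);
    steps_left[thread]++;
    schedule.pop_back();
  }
}

int main() {
  std::vector&lt;int&gt; schedule;
  int steps_left[2] = {2, 1};  // thread 0 has two steps, thread 1 has one
  Explore(schedule, steps_left);
}
</code></pre> <p>Running it prints three schedules, exactly one of which (check, clear, use) hits the crash; real tools have to do this against actual schedulers and prune or bound the search, which is where all of the interesting engineering is.</p> <p>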
A quick Google search indicates that CHESS might be patent encumbered<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">2</a></sup>.</p> <h4 id="maple-https-github-com-jieyu-maple"><a href="https://github.com/jieyu/maple">Maple</a></h4> <p>Uses coverage to generate interleavings that haven't been covered yet. Instruments pthreads. The source is up on GitHub. It's possible this tool is still usable, and I'll probably give it a shot at some point, but it depends on at least one old, apparently unmaintained tool (PIN, a binary instrumentation tool from Intel). Googling (Binging?) for either Maple or PIN gives a number of results where people can't even get the tool to compile, let alone use the tool.</p> <h4 id="pacer">PACER</h4> <p>Samples using the FastTrack algorithm in order to keep overhead low enough &quot;to consider in production software&quot;. Ironically, this was implemented on top of the Jikes RVM, which is unlikely to be used in actual production software. The only reference I could find for an actually downloadable tool is <a href="https://github.com/pangloss/pacer">a completely different pacer</a>.</p> <h4 id="conlock-magiclock-magicfuzzer">ConLock / MagicLock / MagicFuzzer</h4> <p>There's a series of tools from one group that claims to get good results using various techniques, but AFAICT the source isn't available for any of the tools. There's a page that claims there's a version of MagicFuzzer available, but it's a link to a binary that doesn't specify what platform the binary is for and the link 404s.</p> <h4 id="omen-wolf">OMEN / WOLF</h4> <p>I couldn't find a page for these tools (other than their papers), let alone a download link.</p> <h4 id="sherlock-atomchase-racageddon">SherLock / AtomChase / Racageddon</h4> <p>Another series of tools that aren't obviously available.</p> <h3 id="tools-you-can-actually-easily-use">Tools you can actually easily use</h3> <h4 id="valgrind-drd-helgrind">Valgrind / DRD / Helgrind</h4> <p>Instruments pthreads and easy to use -- just run valgrind with the appropriate options (<code>--tool=drd</code> or <code>--tool=helgrind</code>) on the binary. May require <a href="http://valgrind.org/docs/manual/drd-manual.html#drd-manual.C++11">a couple tweaks</a> if using C++11 threading.</p> <h4 id="clang-thread-sanitizer-http-clang-llvm-org-docs-threadsanitizer-html-tsan"><a href="http://clang.llvm.org/docs/ThreadSanitizer.html">clang thread sanitizer</a> (TSan)</h4> <p>Can find data races. Flags when <a href="http://stackoverflow.com/questions/11970428/how-to-understand-happens-before-consistent">happens-before</a> is violated. Works with pthreads and C++11 threads. Easy to use (just pass <code>-fsanitize=thread</code> to clang).</p> <p>A side effect of being so easy to use and actually available is that <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43217.pdf">tsan has had a very large impact in the real world</a>:</p> <blockquote> <p>One interesting incident occurred in the open source Chrome browser. Up to 15% of known crashes were attributed to just one bug [5], which proved difficult to understand - the Chrome engineers spent over 6 months tracking this bug without success. On the other hand, the TSAN V1 team found the reason for this bug in a 30 minute run, without even knowing about these crashes. The crashes were caused by data races on a couple of reference counters. 
Once this reason was found, a relatively trivial fix was quickly made and patched in, and subsequently the bug was closed.</p> </blockquote> <h4 id="clang-wthread-safety">clang <code>-Wthread-safety</code></h4> <p>Static analysis that uses annotations on shared state to determine if state wasn't correctly guarded.</p> <h4 id="findbugs-https-en-wikipedia-org-wiki-findbugs"><a href="https://en.wikipedia.org/wiki/FindBugs">FindBugs</a></h4> <p>General static analysis for Java with many features. Has <code>@GuardedBy</code> annotations, similar to <code>-Wthread-safety</code>.</p> <h4 id="checkerframework-http-types-cs-washington-edu-checker-framework"><a href="http://types.cs.washington.edu/checker-framework/">CheckerFramework</a></h4> <p>Java framework for writing checkers. Has many different checkers. For concurrency in particular, uses <code>@GuardedBy</code>, like FindBugs.</p> <h4 id="rr-http-rr-project-org"><a href="http://rr-project.org/">rr</a></h4> <p>Deterministic replay for debugging. Easy to get and use, and appears to be actively maintained. Adds support for time-travel debugging in gdb.</p> <h4 id="drdebug-https-software-intel-com-en-us-articles-pintool-drdebug-pinplay-https-sites-google-com-site-pinplaypldi2016tutorial"><a href="https://software.intel.com/en-us/articles/pintool-drdebug">DrDebug</a>/<a href="https://sites.google.com/site/pinplaypldi2016tutorial/">PinPlay</a></h4> <p>General toolkit that can give you deterministic replay for debugging. Also gives you &quot;dynamic slicing&quot;, which is watchpoint-like: it can tell you what statements affected a variable, as well as what statements are affected by a variable. Currently Linux only; claims Windows and Android support coming soon.</p> <h3 id="other-tools">Other tools</h3> <p>This isn't an exhaustive list -- there's a ton of literature on this, and this is an area where, frankly, I'm pretty unlikely to have the time to implement a tool myself, so there's not much value for me in reading more papers to find out about techniques that I'd have to implement myself<sup class="footnote-ref" id="fnref:I"><a rel="footnote" href="#fn:I">3</a></sup>. However, I'd be interested in hearing about other tools that are usable.</p> <p>One thing I find interesting about this is that almost all of the papers for the academic tools claim to do something novel that lets them find bugs not found by other tools. They then run their tool on some codebase and show that the tool is capable of finding new bugs. But since almost no one goes and runs the older tools on any codebase, you'd never know if one of the newer tools only found a subset of the bugs that one of the older tools could catch.</p> <p>Furthermore, you see cycles (livelock?) in how papers claim to be novel. Paper I will claim that it does X. Paper II will claim that it's novel because it doesn't need to do X, unlike Paper I. Then Paper III will claim that it's novel because, unlike Paper II, it does X.</p> <h3 id="distributed-systems">Distributed systems</h3> <p>Now that we've looked at some of the literature on single-machine concurrency bugs, what about distributed concurrency bugs?</p> <h3 id="leesatapornwongsa-et-al-asplos-2016-http-ucare-cs-uchicago-edu-pdf-asplos16-taxdc-pdf"><a href="http://ucare.cs.uchicago.edu/pdf/asplos16-TaxDC.pdf">Leesatapornwongsa et al. ASPLOS 2016</a></h3> <p>They looked at 104 bugs in Cassandra, MapReduce, HBase, and Zookeeper. 
Let's look at some example bugs, which will clarify the terminology used in the study and make it easier to understand the main findings.</p> <h4 id="message-message-race">Message-message race</h4> <p><a href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">This diagram</a> is just for reference, so that we have a high-level idea of how different parts fit together in MapReduce:</p> <p><img src="images/concurrency-bugs/mapreduce_block_diagram.gif" alt="Block diagram of MapReduce"></p> <p><a href="https://issues.apache.org/jira/browse/MAPREDUCE-3274">In MapReduce bug #3274</a>, a resource manager sends a task-init message to a node manager. Shortly afterwards, an application master sends a task-kill preemption to the same node manager. The intent is for the task-kill message to kill the task that was started with the task-init message, but the task-kill can win the race and arrive before the task-init. This example happens to be a case where two messages from different nodes are racing to get to a single node.</p> <p>Another example is <a href="https://issues.apache.org/jira/browse/MAPREDUCE-5358">MapReduce bug #5358</a>, where an application master sends a kill message to a node manager running a speculative task because another copy of the task finished. However, before the message is received by the node manager, the node manager's task completes, causing a complete message to be sent to the application master, which results in an exception because a <code>complete</code> message was received after the task had already completed.</p> <h4 id="message-compute-race">Message-compute race</h4> <p>One example is <a href="https://issues.apache.org/jira/browse/MAPREDUCE-4157">MapReduce bug #4157</a>, where the application master unregisters with the resource manager. The application master then cleans up, but that clean-up races against the resource manager sending kill messages to the application's containers via node managers, causing the application master to get killed. Note that this is classified as a race and not an atomicity bug, which we'll get to shortly.</p> <p>Compute-compute races can happen, but they're outside the scope of this study since this study only looks at distributed concurrency bugs.</p> <h4 id="atomicity-violation">Atomicity violation</h4> <p>For the purposes of this study, atomicity bugs are defined as &quot;whenever a message comes in the middle of a set of events, which is a local computation or global communication, but not when the message comes either before or after the events&quot;. According to this definition, the message-compute race we looked at above isn't an atomicity bug because it would still be a bug if the message came in before the &quot;computation&quot; started. This definition also means that hardware failures that occur inside a block that must be atomic are not considered atomicity bugs.</p> <p>I can see why you'd want to define those as separate types of bugs, but I find this a bit counterintuitive, since I consider all of these to be different kinds of atomicity bugs -- they're all caused by breaking up something that needs to be atomic.</p> <p>In any case, by the definition of this study, <a href="https://issues.apache.org/jira/browse/MAPREDUCE-5009">MapReduce bug #5009</a> is an atomicity bug. A node manager is in the process of committing data to HDFS. The resource manager kills the task, which doesn't cause the commit state to change. 
Any time the node tries to rerun the commit task, the task is killed by the application master because a commit is believed to already be in progress.</p> <h4 id="fault-timing">Fault timing</h4> <p>A fault is defined to be a &quot;component failure&quot;, such as a crash, timeout, or unexpected latency. At one point, the paper refers to &quot;hardware faults such as machine crashes&quot;, which seems to indicate that some faults that could be considered software faults are defined as hardware faults for the purposes of this study.</p> <p>Anyway, for the purposes of this study, an example of a fault-timing issue is <a href="https://issues.apache.org/jira/browse/MAPREDUCE-3858">MapReduce bug #3858</a>. A node manager crashes while committing results. When the task is re-run, later attempts to commit all fail.</p> <h4 id="reboot-timing">Reboot timing</h4> <p>In this study, reboots are classified separately from other faults. <a href="https://issues.apache.org/jira/browse/MAPREDUCE-3186">MapReduce bug #3186</a> illustrates a reboot bug.</p> <p>A resource manager sends a job to an application master. If the resource manager is rebooted before the application master sends a commit message back to the resource manager, the resource manager loses its state and throws an exception because it's getting an unexpected complete message.</p> <p>Some of their main findings are:</p> <h4 id="47-of-examined-bugs-led-to-latent-failures">47% of examined bugs led to latent failures</h4> <p>That's a pretty large difference when compared to the DSN '10 paper, which found that 15% of examined multithreading bugs were latent failures. It's plausible that this is a real difference and not just something due to a confounding variable, but it's hard to tell from the data.</p> <h4 id="63-of-examined-bugs-were-related-to-hardware-faults">63% of examined bugs were related to hardware faults</h4> <p>This is a large difference from what studies on &quot;local&quot; concurrency bugs found. I wonder how much of that is just because <a href="http://danluu.com/file-consistency/#error-frequency">people mostly don't even bother filing and fixing bugs on hardware faults in non-distributed software</a>.</p> <h4 id="64-of-examined-bugs-were-triggered-by-a-single-message-s-timing">64% of examined bugs were triggered by a single message's timing</h4> <p>44% were ordering violations, and 20% were atomicity violations. Furthermore, over 90% of bugs involved three or fewer messages.</p> <p>32% of examined bugs were due to fault or reboot timing. Note that, for the purposes of the study, a hardware fault or a reboot that breaks up a block that needed to be atomic isn't considered an atomicity bug -- here, atomicity bugs are bugs where a message arrives in the middle of a computation that needs to be atomic.</p> <h4 id="70-of-bugs-had-simple-fixes">70% of bugs had simple fixes</h4> <p>30% were fixed by ignoring the badly timed message and 40% were fixed by delaying or ignoring the message.</p> <h4 id="bug-causes">Bug causes?</h4> <p>After reviewing the bugs, the authors propose common fallacies that lead to bugs:</p> <ol> <li>One hop is faster than two hops</li> <li>Zero hops are faster than one hop</li> <li>Atomic blocks can't be broken</li> </ol> <p>On (3), the authors note that it's not just hardware faults or reboots that break up atomic blocks -- systems can send kill or pre-emption messages that break up an atomic block. 
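</p> <p>To make the &quot;simple fixes&quot; point above concrete, here's a hypothetical sketch (not actual Hadoop code, and not necessarily how any of these bugs were really fixed) of the &quot;ignore the badly timed message&quot; style of fix for a race like the speculative-task bug earlier: instead of treating an unexpected <code>complete</code> message as an invariant violation and throwing, the handler checks the task's state and drops messages that no longer apply.</p> <pre><code>// stale_message.cc -- hypothetical sketch of an &quot;ignore the badly timed
// message&quot; fix; none of these names come from the real systems.
#include &lt;cstdio&gt;
#include &lt;unordered_map&gt;

enum class TaskState { kRunning, kSucceeded, kKilled };

struct AppMaster {
  std::unordered_map&lt;int, TaskState&gt; attempts;  // attempt id -&gt; state

  void StartAttempt(int attempt_id) { attempts[attempt_id] = TaskState::kRunning; }

  void KillSpeculative(int attempt_id) {
    // Send a kill message (not shown); the kill can lose the race against
    // the attempt's own completion message.
    attempts[attempt_id] = TaskState::kKilled;
  }

  void OnTaskComplete(int attempt_id) {
    auto it = attempts.find(attempt_id);
    if (it == attempts.end() || it-&gt;second != TaskState::kRunning) {
      // A completion that raced with a kill (or refers to an attempt we
      // don't know about) is dropped instead of raising an exception.
      std::printf(&quot;ignoring stale completion for attempt %d\n&quot;, attempt_id);
      return;
    }
    it-&gt;second = TaskState::kSucceeded;
  }
};

int main() {
  AppMaster am;
  am.StartAttempt(3);
  am.KillSpeculative(3);  // we decided to kill the speculative attempt...
  am.OnTaskComplete(3);   // ...but its completion arrives anyway: dropped
}
</code></pre> <p>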
A fallacy that I've commonly seen in post-mortems but that isn't listed here goes something like &quot;<a href="https://en.wikipedia.org/wiki/Byzantine_fault_tolerance">bad nodes are obviously bad</a>&quot;. A classic example of this is when a system starts &quot;handling&quot; queries by dropping them quickly, causing a load balancer to shift traffic to the bad node because it's handling traffic so quickly.</p> <p>One of my favorite bugs in this class from an actual system was in a ring-based storage system where nodes could do health checks on their neighbors and declare that their neighbors should be dropped if the health check fails. One node went bad, dropped all of its storage, and started reporting its neighbors as bad nodes. Its neighbors noticed that the bad node was bad, but because the bad node had dropped all of its storage, it was super fast and was able to report its good neighbors before the good neighbors could report the bad node. After ejecting its immediate neighbors, the bad node got new neighbors and raced the new neighbors, winning again for the same reason. This was repeated until the entire cluster died.</p> <h3 id="tools-1">Tools</h3> <h4 id="mace">Mace</h4> <p>A set of language extensions (on C++) that helps you build distributed systems. Mace has a model checker that can check all possible event orderings of messages, interleaved with crashes, reboots, and timeouts. The Mace model checker is actually available, but AFAICT it requires using the Mace framework, and most distributed systems aren't written in Mace.</p> <h4 id="modist">Modist</h4> <p>Another model checker that checks different orderings. Runs only one interleaving of independent actions (partial order reduction) to avoid checking redundant states. Also interleaves timeouts. Unlike Mace, doesn't inject reboots. Doesn't appear to be available.</p> <h4 id="demeter">Demeter</h4> <p>Like Modist, in that it's a model checker that injects the same types of faults. Uses a different technique to reduce the state space, which I don't know how to summarize succinctly. See paper for details. Doesn't appear to be available. Googling for Demeter returns some software used to model X-ray absorption?</p> <h4 id="samc">SAMC</h4> <p>Another model checker. Can inject multiple crashes and reboots. Uses some understanding of the system to avoid redundant re-orderings (e.g., if a series of messages is invariant to when a reboot is injected, the system tries to avoid injecting the reboot between each message). Doesn't appear to be available.</p> <h4 id="jepsen">Jepsen</h4> <p>As was the case for non-distributed concurrency bugs, there's a vast literature on academic tools, most of which appear to be grad-student code that hasn't been made available.</p> <p>And of course there's Jepsen, which doesn't have any attached academic papers, but has probably had more real-world impact than any of the other tools because <a href="https://github.com/aphyr/jepsen">it's actually available and maintained</a>. There's also <a href="https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey">chaos monkey</a>, but if I'm understanding it correctly, unlike the other tools listed, it doesn't attempt to create reproducible failures.</p> <h3 id="conclusion">Conclusion</h3> <p>Is this where you're supposed to have a conclusion? I don't have a conclusion. We've looked at some literature and found out some information about bugs that's interesting, but not necessarily actionable. We've read about tools that are interesting, but not actually available. 
And then there are some tools based on old techniques that are available and useful.</p> <p>For example, the idea inside clang's TSan, using &quot;happens-before&quot; to find data races, goes back ages. There's <a href="http://www.cs.columbia.edu/~junfeng/10fa-e6998/papers/hybrid.pdf">a 2003 paper</a> that discusses &quot;combining two previously known race detection techniques -- lockset-based detection and happens-before-based detection -- to obtain fewer false positives than lockset-based detection alone&quot;. That's actually what TSan v1 did, but with TSan v2 they realized the tool would be more impactful if they only used happens-before because that avoids false positives, which means that people will actually use the tool. That's not something that's likely to turn into a paper that gets cited zillions of times, though. For anyone who's looked at how <a href="http://lcamtuf.coredump.cx/afl/README.txt">afl</a> works, this story should sound familiar. AFL is eminently practical and has had a very large impact in the real world, mostly by eschewing fancy techniques from the recent literature.</p> <p>If you must have a conclusion, maybe the conclusion is that individuals like Kyle Kingsbury or Michal Zalewski have had an outsized impact on industry, and that you too can probably pick an underserved area in testing and have a curiously large impact on an entire industry.</p> <h4 id="unrelated-miscellania">Unrelated miscellania</h4> <p><a href="https://github.com/rose">Rose Ames</a> asked me to tell more &quot;big company&quot; stories, so here's a set of stories that explains why I haven't put a blog post up for a while. The proximal cause is that my VP has been getting negative comments about my writing. But the reasons for that are a bit of a long story. Part of it is the usual thing, where the comments I receive personally skew very heavily positive, but the comments my manager gets run the other way because it's weird to email someone's manager because you like their writing, but you might send an email if their writing really strikes a nerve.</p> <p>That explains why someone in my management chain was getting emailed about my writing, but it doesn't explain why the emails went to my VP. That's because I switched teams a few months ago, and the org that I was going to switch into overhired and didn't have any headcount. I've heard conflicting numbers about how much they overhired, from 10 or 20 people to 10% or 20% (the org is quite large, and 10% would be much more than 20 people), as well as conflicting stories about why it happened (honest mistake vs. some group realizing that there was a hiring crunch coming and hiring as much as possible to take all of the reqs from the rest of the org). Anyway, for some reason, the org I would have worked in hired more than it was allowed to by at least one person and instituted a hiring freeze. Since my new manager couldn't hire me into that org, he transferred into an org that had spare headcount and hired me into the new org. The new org happens to be a sales org, which means that I technically work in sales now; this has some impact on my day-to-day life since there are some resources and tech talks that are only accessible by people in product groups, but that's another story. 
Anyway, for reasons that I don't fully understand, I got hired into the org before my new manager, and during the months it took for the org chart to get updated I was shown as being parked under my VP, which meant that anyone who wanted to fire off an email to my manager would look me up in the directory and accidentally email my VP instead.</p> <p>It didn't seem like any individual email was a big deal, but since I don't have much interaction with my VP and I don't want to only be known as that guy who writes stuff which generates pushback from inside the company, I paused blogging for a while. I don't exactly want to be known that way to my manager either, but I interact with my manager frequently enough that at least I won't only be known for that.</p> <p>I also wonder if these emails to my manager/VP are more likely at my current employer than at previous employers. I've never had this happen (that I know of) at another employer, but the total number of times it's happened here is low enough that it might just be coincidence.</p> <p>Then again, I was just reading the archives of a really insightful internal blog and ran across a note that mentioned that the series of blog posts was being published internally because the author got static from <a href="https://news.ycombinator.com/item?id=11761437">Sinofsky</a> about publishing posts that contradicted the party line, which eventually resulted in the author agreeing to email Sinofsky comments related to anything under Sinofsky's purview instead of publishing the comments publicly. But now that Sinofsky has moved on, the author wanted to share emails that would have otherwise been posts internally.</p> <p>That kind of thing doesn't seem to be a freak occurrence around here. At the same time I saw that thing about Sinofsky, I ran across a discussion on whether or not a PM was within their rights to tell someone to take down a negative review from the app store. Apparently, a PM found out that someone had written a negative rating on the PM's product in some app store and emailed the rater, telling them that they had to take the review down. It's not clear how the PM found out that the rater worked for us (do they search the internal directory for every negative rating they find?), but they somehow found out and then issued their demand. Most people thought that the PM was out of line, but there were a non-zero number of people (in addition to the PM) who thought that employees should not say anything that could be construed as negative about the company in public.</p> <p>I feel like I see more of this kind of thing now than I have at other companies, but the company's really too big to tell if anyone's personal experience generalizes. Anyway, I'll probably start blogging again now that the org chart shows that I report to my actual manager, and maybe my manager will get some emails about that. Or maybe not.</p> <p><em>Thanks to Leah Hanson, David Turner, Justin Mason, Joe Wilder, Matt Dziubinski, Alex Blewitt, Bruno Kim Medeiros Cesar, Luke Gilliam, Ben Karas, Julia Evans, Michael Ernst, and Stephen Tu for comments/corrections.</em></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">If you're going to debug bugs. I know some folks at startups who give up on bugs that look like they'll take more than a few hours to debug because their todo list is long enough that they can't afford the time. That might be the right decision given the tradeoffs they have, but it's not the right decision for everyone. 
<a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:P">Funny thing about US patent law: you owe treble damages for willfully infringing on a patent. A direct effect of this is that two out of three of my full-time employers have very strongly recommended that I don't read patents, so I avoid reading patents that aren't obviously frivolous. And by frivolous, I don't mean patents for obvious things that any programmer might independently discover, because patents like that are often upheld as valid. I mean patents for things like <a href="https://www.google.com/patents/US6368227">how to swing on a swing</a>. <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:I">I get the incentives that lead to this, and I don't begrudge researchers for pursuing career success by responding to those incentives, but as <a href="http://sei.pku.edu.cn/~yaoguo/PhDReading07/parnas-review.pdf">a lowly practitioner</a>, it sure would be nice if the incentives were different. <a class="footnote-return" href="#fnref:I"><sup>[return]</sup></a></li> </ol> </div> Some programming blogs to consider reading programming-blogs/ Mon, 18 Apr 2016 00:06:34 -0700 programming-blogs/ <p>This is one of those “N technical things every programmer must read” lists, except that “programmer” is way too broad a term and the styles of writing people find helpful for them are too different for any such list to contain a non-zero number of items (if you want the entire list to be helpful to everyone). So here's a list of some things you might want to read, and why you might (or might not) want to read them.</p> <h4 id="aleksey-shipilev-https-shipilev-net"><a href="https://shipilev.net/">Aleksey Shipilev</a></h4> <p>If you want to understand how the JVM really works, this is one of the best resources on the internet.</p> <h4 id="bruce-dawson-https-randomascii-wordpress-com"><a href="https://randomascii.wordpress.com/">Bruce Dawson</a></h4> <p>Performance explorations of a Windows programmer. Often implicitly has nice demonstrations of tooling that has no publicly available peer on Linux.</p> <h4 id="chip-huyen-https-huyenchip-com-blog"><a href="https://huyenchip.com/blog/">Chip Huyen</a></h4> <p>A mix of summaries of ML conferences, data analyses (e.g., <a href="https://huyenchip.com/2019/08/21/glassdoor-interview-reviews-tech-hiring-cultures.html">on interview data posted to glassdoor</a> or <a href="https://huyenchip.com/2020/01/18/tech-workers-19k-compensation-details.html">compensation data posted to levels.fyi</a>), and generaly commentary on the industry.</p> <p>One of the rare blogs that has data-driven position pieces about the industry.</p> <h4 id="chris-fenton-http-www-chrisfenton-com"><a href="http://www.chrisfenton.com/">Chris Fenton</a></h4> <p>Computer related projects, by which I mean things like <a href="http://www.chrisfenton.com/homebrew-cray-1a/">reconstructing the Cray-1A</a> and <a href="http://www.chrisfenton.com/the-numbotron/">building mechanical computers</a>. 
Rarely updated, presumably due to the amount of work that goes into the creations, but almost always interesting.</p> <p>The blog posts tend to be high-level, more like pitch decks than design docs, but there's often source code available if you want more detail.</p> <h4 id="cindy-sridharan-https-twitter-com-copyconstruct"><a href="https://twitter.com/copyconstruct">Cindy Sridharan</a></h4> <p>More active on Twitter than on her blog, but has posts that review papers as well as some on &quot;big&quot; topics, like <a href="https://medium.com/@copyconstruct/distributed-tracing-weve-been-doing-it-wrong-39fc92a857df">distributed tracing</a> and <a href="https://medium.com/@copyconstruct/testing-in-production-the-hard-parts-3f06cefaf592">testing in production</a>.</p> <h4 id="dan-mckinley-http-mcfunley-com"><a href="http://mcfunley.com/">Dan McKinley</a></h4> <p>A lot of great material on how engineering companies should be run. He has a lot of ideas that sound like common sense, e.g., <a href="http://mcfunley.com/choose-boring-technology">choose boring technology</a>, until you realize that it's actually uncommon to find opinions that are so sensible.</p> <p>Mostly distilled wisdom (as opposed to, say, detailed explanations of code).</p> <h4 id="eli-bendersky-http-eli-thegreenplace-net-archives-all"><a href="http://eli.thegreenplace.net/archives/all">Eli Bendersky</a></h4> <p>I think of this as “<a href="http://eli.thegreenplace.net/tag/c-c">the C++ blog</a>”, but it's much wider ranging than that. It's too wide ranging for me to sum up, but if I had to commit to a description I might say that it's a collection of deep dives into various topics, often (but not always) relatively low-level, along with short blurbs about books, often (but not always) technical.</p> <p>The book reviews tend to be easy reading, but the programming blog posts are often a mix of code and exposition that really demands your attention; usually not a light read.</p> <h4 id="erik-sink-https-ericsink-com-index-html"><a href="https://ericsink.com/index.html">Erik Sink</a></h4> <p>I think Erik has been <a href="https://twitter.com/danluu/status/1340059909090549761">the most consistently insightful writer about tech culture over the past 20 years</a>. If you look at people who were blogging back when he started blogging, much of <a href="https://twitter.com/danluu/status/1336499111789436931">Steve Yegge's</a> writing holds up as well as Erik's, but Steve hasn't continued writing consistently.</p> <p>If you look at popular writers from that era, I think they generally tend to <a href="https://twitter.com/danluu/status/784929052519837696">not</a> <a href="https://web.archive.org/web/20210912095555/https://twitter.com/rakyll/status/1413996396814823431">really</a> <a href="https://twitter.com/danluu/status/1172593136503160832">hold</a> up very well.</p> <h4 id="fabian-giesen-https-fgiesen-wordpress-com"><a href="https://fgiesen.wordpress.com/">Fabian Giesen</a></h4> <p>Covers a wide variety of technical topics. 
Emphasis on computer architecture, compression, graphics, and signal processing, but you'll find many other topics as well.</p> <p>Posts tend towards being technically intense and not light reading, and they usually explain concepts or ideas (as opposed to taking sides and writing opinion pieces).</p> <h4 id="fabien-sanglard-http-fabiensanglard-net"><a href="http://fabiensanglard.net/">Fabien Sanglard</a></h4> <p>In-depth technical dives on game-related topics, such as <a href="http://fabiensanglard.net/doomIphone/doomClassicRenderer.php">this readthrough of the Doom source code</a>, <a href="http://fabiensanglard.net/cuda/index.html">this history of Nvidia GPU architecture</a>, or <a href="http://fabiensanglard.net/rayTracing_back_of_business_card/index.php">this read of a business card raytracer</a>.</p> <h4 id="fabrice-bellard-http-bellard-org"><a href="http://bellard.org/">Fabrice Bellard</a></h4> <p>Not exactly a blog, but every time a new project appears on the front page, it's amazing. Some examples are QEMU, FFMPEG, a 4G LTE base station that runs on a PC, <a href="http://bellard.org/jslinux/">a JavaScript PC emulator that can boot Linux</a>, etc.</p> <h4 id="fred-akalin-https-www-akalin-com"><a href="https://www.akalin.com/">Fred Akalin</a></h4> <p>Explanations of CS-related math topics (with a few that aren't directly CS related).</p> <h4 id="gary-bernhardt-https-twitter-com-garybernhardt"><a href="https://twitter.com/garybernhardt">Gary Bernhardt</a></h4> <p>Another “not exactly a blog”, but it's more informative than most blogs, not to mention more entertaining. This is the best “blog” on <a href="https://twitter.com/garybernhardt/status/676560697262604288">the pervasive brokenness of modern software</a> that I know of.</p> <h4 id="jaana-dogan-https-rakyll-org"><a href="https://rakyll.org/">Jaana Dogan</a></h4> <p><a href="https://rakyll.org/">rakyll.org</a> has posts on Go, some of which are quite in depth, e.g., <a href="https://rakyll.org/generics-proposal/">this set of notes on the Go generics proposal</a>, and <a href="https://medium.com/@rakyll">Jaana's medium blog</a> has some posts on Go as well as posts on various topics in distributed systems.</p> <p>Also, <a href="https://twitter.com/rakyll">Jaana's Twitter</a> has what I think of as &quot;intellectually honest critiques of the industry&quot;, which I think is unusual for critiques of the industry on Twitter. It's more typical to see people scoring points at the expense of nuance or even being vaguely in the vicinity of correctness, which is why I think it's worth calling out these honest critiques.</p> <h4 id="jamie-brandon-https-scattered-thoughts-net"><a href="https://scattered-thoughts.net/">Jamie Brandon</a></h4> <p>I'm so happy that I managed to convince Jamie that, given his preferences, it would make sense to take a crack at blogging full-time to support himself. 
From when <a href="https://twitter.com/danluu/status/1354239729483407360">Jamie started taking donations</a> until today, this blog has been an absolute powerhouse with posts like <a href="https://scattered-thoughts.net/writing/select-wat-from-sql/">this</a> <a href="https://scattered-thoughts.net/writing/against-sql">series</a> on problems with SQL, <a href="https://scattered-thoughts.net/writing/an-opinionated-map-of-incremental-and-streaming-systems/">this</a> <a href="https://scattered-thoughts.net/writing/internal-consistency-in-streaming-systems/">series</a> <a href="https://scattered-thoughts.net/writing/why-query-planning-for-streaming-systems-is-hard">on</a> streaming systems, great work on technical projects like <a href="https://github.com/jamii/dida">dida</a> and <a href="https://github.com/jamii/imp">imp</a>, etc.</p> <p>It remains to be seen whether or not Jamie will be able to convince me to try blogging as a full-time job.</p> <h4 id="janet-davis-http-blogs-whitman-edu-countingfromzero"><a href="http://blogs.whitman.edu/countingfromzero/">Janet Davis</a></h4> <p>This is the story of how a professor moved from Grinnell to Whitman and <a href="http://blogs.whitman.edu/countingfromzero/2015/07/22/how-i-got-here/">started a CS program from scratch</a>. The archives are great reading if you're interested in how organizations form or CS education.</p> <h4 id="jeff-preshing-https-preshing-com"><a href="https://preshing.com/">Jeff Preshing</a></h4> <p>Mostly technical content relating to C++ and Python, but also includes topics that are generally useful for programmers, such as <a href="https://preshing.com/20150402/you-can-do-any-kind-of-atomic-read-modify-write-operation/">read-modify-write operations</a>, <a href="https://preshing.com/20121105/arithmetic-encoding-using-fixed-point-math/">fixed-point math</a>, and <a href="https://preshing.com/20120930/weak-vs-strong-memory-models/">memory models</a>.</p> <h4 id="jessica-kerr-http-blog-jessitron-com"><a href="http://blog.jessitron.com/">Jessica Kerr</a></h4> <p>Jessica is probably better known for <a href="http://jessitron.com/talks.html">her talks</a> than her blog? Her talks are great! My favorite is probably <a href="https://www.youtube.com/watch?v=yhguOt863nw">this talk, which explains different concurrency models in an easy-to-understand way</a>, but <a href="http://blog.jessitron.com/2015/01/systems-thinking-about-wit.html">the blog also has a lot of material I like</a>.</p> <p>As is the case with her talks, the diagrams often take a concept and clarify it, making something that wasn't obvious seem very obvious in retrospect.</p> <h4 id="john-regehr-http-blog-regehr-org"><a href="http://blog.regehr.org/">John Regehr</a></h4> <p>I think of this as the <a href="http://blog.regehr.org/archives/213">“C is harder than you think, even if you think C is really hard” blog</a>, although the blog actually covers a lot more than that. Some commonly covered topics are fuzzing, compiler optimization, and testing in general.</p> <p>Posts tend to be conceptual. 
When there are code examples, they're often pretty easy to read, but there are also examples of bizarro behavior that won't be easy to skim unless you're someone who knows the C standard by heart.</p> <h4 id="juho-snellman-https-www-snellman-net-blog"><a href="https://www.snellman.net/blog/">Juho Snellman</a></h4> <p>A lot of <a href="https://www.snellman.net/blog/archive/2015-08-25-tcp-optimization-in-mobile-networks/">posts about networking</a>, generally <a href="https://www.snellman.net/blog/archive/2016-02-01-tcp-rst/">written so that they make sense even with minimal networking background</a>. I wish more people with this kind of knowledge (in depth knowledge of systems, not just networking knowledge in particular) would write up explanations for a general audience. Also has interesting non-networking content, like <a href="https://www.snellman.net/blog/archive/2015-05-11-okcupid-for-voting-the-finnish-election-engines/">this post on Finnish elections</a>.</p> <h4 id="julia-evans-http-jvns-ca"><a href="http://jvns.ca/">Julia Evans</a></h4> <p>AFAICT, the theme is “things Julia has learned recently”, which can be anything from <a href="http://jvns.ca/blog/2015/02/22/how-gzip-uses-huffman-coding/">Huffman coding</a> to <a href="http://jvns.ca/blog/2014/10/22/working-remote-8-months-in/">how to be happy when working in a remote job</a>. When the posts are on a topic I don't already know, I learn something new. When they're on a topic I know, they remind me that the topic is exciting and contains a lot of wonder and mystery.</p> <p>Many posts have more questions than answers, and are more of a live-blogged exploration of a topic than an explanation of the topic.</p> <h4 id="karla-burnett-https-karla-io"><a href="https://karla.io/">Karla Burnett</a></h4> <p>A mix of security-related topics and explanations of practical programming knowledge. <a href="https://karla.io/files/ichthyology-wp.pdf">This article on phishing</a>, which includes a set of fun case studies on how effective phishing can be, even after people take anti-phishing training, is an example of a security post. <a href="https://karla.io/2016/06/13/dont-panic.html">This post on printing out text via tracert</a> is another. This <a href="https://karla.io/2016/04/30/ssh-for-fun-and-profit.html">post on writing an SSH client</a> and <a href="https://karla.io/2015/06/28/brain-teaser-2.html">this post</a> on <a href="https://karla.io/2015/06/28/brain-teaser-2.html">some coreutils puzzles</a> are examples of practical programming explanations.</p> <p>Although the blog is security oriented, posts are written for a general audience and don't assume specific expertise in security.</p> <h4 id="kate-murphy-https-kate-io"><a href="https://kate.io/">Kate Murphy</a></h4> <p>Mostly small, self-contained explorations like <a href="https://kate.io/blog/2017/08/22/weird-python-integers/">what's up with this Python integer behavior</a>, <a href="https://kate.io/blog/making-your-own-exploding-git-repos/">how do you make git blow up with a simple repo</a>, or <a href="https://kate.io/blog/simple-hash-collisions-in-lua/">how do you generate hash collisions in Lua</a>?</p> <h4 id="kavya-joshi">Kavya Joshi</h4> <p>I generally prefer technical explanations in text over video, but her exposition is so clear that I'm putting these talks in this list of blogs. 
Some examples include <a href="https://www.youtube.com/watch?v=5erqWdlhQLA">an explanation of the go race detector</a>, <a href="https://www.youtube.com/watch?v=Ivss-VtmbDY">simple math that's handy for performance modeling</a>, and <a href="https://www.youtube.com/watch?v=BRvj8PykSc4">time</a>.</p> <h4 id="kyle-kingsbury-https-aphyr-com"><a href="https://aphyr.com/">Kyle Kingsbury</a></h4> <p>90% of Kyle's posts are explanations of <a href="https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads">distributed systems testing, which expose bugs in real systems that most of us rely on</a>. The other 10% are <a href="https://aphyr.com/posts/319-clojure-from-the-ground-up-debugging">musings on programming that are as rigorous as Kyle's posts on distributed systems</a>. Possibly the most educational programming blog of all time.</p> <p>For those of us without a distributed systems background, understanding posts often requires a bit of Googling, despite the extensive explanations in the posts. Most new posts are now at <a href="http://jepsen.io/">jepsen.io</a>.</p> <h4 id="laura-lindzey-https-lindzey-github-io-index-html"><a href="https://lindzey.github.io/index.html">Laura Lindzey</a></h4> <p>Very infrequently updated (on the order of once a year) with explanations of things Laura has been working on, from <a href="https://lindzey.github.io/blog/2019/11/26/origami-butterfly/">Origami PCB</a> to <a href="https://lindzey.github.io/blog/2015/07/27/a-brief-introduction-to-ice-penetrating-radar/">Ice-Penetrating Radar</a>.</p> <h4 id="laurie-tratt-https-tratt-net-laurie-blog"><a href="https://tratt.net/laurie/blog/">Laurie Tratt</a></h4> <p>This blog has been going since 2004 and it's changed over the years. Recently, it's had some of the best posts on benchmarking around:</p> <ul> <li><a href="https://tratt.net/laurie/blog/entries/why_arent_more_users_more_happy_with_our_vms_part_1.html">VM performance, part 1</a> <ul> <li>Thoroughly refutes the idea that you can run a language VM for some warmup period and then take some numbers when they become stable</li> </ul></li> <li><a href="https://tratt.net/laurie/blog/entries/why_arent_more_users_more_happy_with_our_vms_part_2.html">VM performance, part 2</a></li> <li><a href="https://tratt.net/laurie/blog/entries/minimum_times_tend_to_mislead_when_benchmarking.html">Why not use minimum times when benchmarking</a> <ul> <li>&quot;Everyone&quot; who's serious about performance knows this and it's generally considered too obvious to write up, but this is still a widely used technique in benchmarking even though it's only appropriate in limited circumstances</li> </ul></li> </ul> <p>The blog isn't purely technical; <a href="https://tratt.net/laurie/blog/entries/alternative_sources_of_advice.html">this blog post on advice is also stellar</a>. If those posts don't sound interesting to you, it's worth <a href="https://tratt.net/laurie/blog/archive.html">checking out the archives</a> to see if some of the topics Laurie used to write about more frequently are to your taste.</p> <h4 id="marc-brooker-https-brooker-co-za-blog"><a href="https://brooker.co.za/blog/">Marc Brooker</a></h4> <p>A mix of <a href="https://brooker.co.za/blog/2012/01/17/two-random.html">theory</a> and <a href="https://brooker.co.za/blog/2016/01/03/correlation.html">wisdom</a> from a distributed systems engineer on EBS at Amazon. 
The theory posts tend to be relatively short and easy to swallow; not at all intimidating, as theory sometimes is.</p> <h4 id="marek-majkowski-https-idea-popcount-org"><a href="https://idea.popcount.org/">Marek Majkowski</a></h4> <p>This used to be a blog about random experiments Marek was doing, <a href="https://idea.popcount.org/2013-01-30-bitsliced-siphash/">like this post on bitsliced SipHash</a>. Since Marek joined Cloudflare, this has turned into a list of things Marek has learned while working in Cloudflare's networking stack, like <a href="https://blog.cloudflare.com/the-curious-case-of-slow-downloads/">this story about debugging slow downloads</a>.</p> <p>Posts tend to be relatively short, but with enough technical specifics that they're not light reads.</p> <h4 id="nicole-express-https-nicole-express"><a href="https://nicole.express/">Nicole Express</a></h4> <p>Explorations of old systems, often gaming related. Some examples are <a href="https://nicole.express/2021/alf-2-alf-harder.html">this post on collision detection in Alf for the Sega Master System</a>, <a href="https://nicole.express/2021/composite-conflict-completed.html">this post on getting decent quality output from composite video</a>, and <a href="https://nicole.express/2020/neo-geo-cd-zee.html">this post on the Neo Geo CDZ</a>.</p> <h4 id="nikita-prokopov-https-tonsky-me"><a href="https://tonsky.me/">Nikita Prokopov</a></h4> <p>Nikita has two blogs, both on related topics. <a href="https://tonsky.me/">The main blog</a> has long-form articles, often about how modern software is terrible. Then there's <a href="https://grumpy.website/">grumpy.website</a>, which gives examples of software being terrible.</p> <h4 id="nitsan-wakart-http-psy-lob-saw-blogspot-com"><a href="http://psy-lob-saw.blogspot.com/">Nitsan Wakart</a></h4> <p>More than you ever wanted to know about writing fast code for the JVM, from <a href="http://psy-lob-saw.blogspot.com/2016/03/gc-nepotism-and-linked-queues.html">how GC affects data structures</a> to <a href="http://psy-lob-saw.blogspot.com/2014/08/the-volatile-read-suprise.html">the subtleties of volatile reads</a>.</p> <p>Posts tend to involve lots of Java code, but the takeaways are often language agnostic.</p> <h4 id="oona-raisanen-http-www-windytan-com"><a href="http://www.windytan.com/">Oona Raisanen</a></h4> <p>Adventures in signal processing. Everything from <a href="http://www.windytan.com/2016/02/barcode-recovery-using-priori.html">deblurring barcodes</a> to figuring out <a href="http://www.windytan.com/2014/02/mystery-signal-from-helicopter.html">what those signals from helicopters mean</a>. 
If I'd known that signals and systems could be this interesting, I would have paid more attention in class.</p> <h4 id="paul-khuong-http-www-pvk-ca"><a href="http://www.pvk.ca/">Paul Khuong</a></h4> <p><a href="http://www.pvk.ca/Blog/2013/11/22/the-weaknesses-of-sbcls-type-propagation/">Some content on Lisp</a>, and <a href="http://www.pvk.ca/Blog/2014/04/13/k-ary-heapsort/">some on low-level optimizations</a>, with <a href="http://www.pvk.ca/Blog/2015/06/27/linear-log-bucketing-fast-versatile-simple/">a trend towards low-level optimizations</a>.</p> <p>Posts are usually relatively long and self-contained explanations of technical ideas with very little fluff.</p> <h4 id="rachel-kroll-http-rachelbythebay-com-w"><a href="http://rachelbythebay.com/w/">Rachel Kroll</a></h4> <p>Years of <a href="http://rachelbythebay.com/w/2015/09/07/noleap/">debugging stories</a> from a long-time SRE, along with stories about <a href="https://rachelbythebay.com/w/2012/02/29/derailing/">big company nonsense</a>. Many of the stories come from Lyft, Facebook, and Google. They're anonymized, but if you know about the companies, you can tell which ones are which.</p> <p>The degree of anonymization often means that the stories won't really make sense unless you're familiar with the operation of systems similar to the ones in the stories.</p> <h4 id="sophie-haskins-http-blog-pizzabox-computer"><a href="http://blog.pizzabox.computer/">Sophie Haskins</a></h4> <p>A blog about restoring old &quot;pizza box&quot; computers, with posts that generally describe the work that goes into getting these machines working again.</p> <p>An example is the HP 712 (a &quot;low cost&quot; PA-RISC workstation that went for roughly $5k to $15k in 1994 dollars and ended up doomed due to the Intel workstation onslaught that started with the Pentium Pro in 1995), <a href="https://blog.pizzabox.computer/posts/hp712-nextstep-part-1/">where the restoration process is described here in part 1</a> and <a href="https://blog.pizzabox.computer/posts/hp712-nextstep-part-2/">then here in part 2</a>.</p> <h4 id="vyacheslav-egorov-http-mrale-ph"><a href="http://mrale.ph/">Vyacheslav Egorov</a></h4> <p><a href="http://mrale.ph/blog/2015/01/11/whats-up-with-monomorphism.html">In-depth explanations</a> on how V8 works and <a href="http://mrale.ph/blog/2014/07/30/constructor-vs-objectcreate.html">how various constructs get optimized</a> by a compiler dev on the V8 team. If I'd known compilers were this interesting, I would have taken a compilers class back when I was in college.</p> <p>Often takes topics that are considered hard and explains them in a way that makes them seem easy. Lots of diagrams, where appropriate, and detailed exposition on all the tricky bits.</p> <h4 id="whitequark-https-whitequark-org"><a href="https://whitequark.org/">whitequark</a></h4> <p>Her main site links to a variety of interesting tools she's made or worked on, many of which are FPGA or open hardware related, but some of which are completely different. 
<a href="https://lab.whitequark.org/">Whitequark's lab notebook</a> has a really wide variety of different results, from things like undocumented hardware quirks, to fairly serious home chemistry experiments, to various tidbits about programming and hardware development (usually low level, but not always).</p> <p>She's also fairly active <a href="https://twitter.com/whitequark/">on twitter</a>, with some commentary on hardware/firmware/low-level programming combined with a set of diverse topics that's too broad to easily summarize.</p> <h4 id="yossi-kreinin-http-yosefk-com-blog"><a href="http://yosefk.com/blog/">Yossi Kreinin</a></h4> <p>Mostly dormant since <a href="http://yosefk.com/blog/a-better-future-animated-post.html">the author started doing art</a>, but the archives have a lot of great content about hardware, low-level software, and general programming-related topics that aren't strictly programming.</p> <p>90% of the time, when I get the desire to write a post about a common misconception software folks have about hardware, <a href="http://yosefk.com/blog/a-better-future-animated-post.html">Yossi has already written the post</a> and <a href="http://yosefk.com/blog/high-level-cpu-follow-up.html">taken a lot of flak for it</a> so I don't have to :-).</p> <p>I also really like Yossi's career advice, like <a href="http://yosefk.com/blog/do-call-yourself-a-programmer-and-other-career-advice.html">this response to Patrick McKenzie</a> and <a href="http://yosefk.com/blog/people-can-read-their-managers-mind.html">this post on how managers get what they want and not what they ask for</a>.</p> <p>He's <a href="https://twitter.com/yossikreinin">active on Twitter</a>, where he posts extremely cynical and snarky takes on management and the industry.</p> <h4 id="this-blog">This blog?</h4> <p>Common themes include:</p> <ul> <li><a href="//danluu.com/file-consistency/">This thing that's often considered easy is harder than you might think</a></li> <li><a href="//danluu.com/edit-binary/">This thing that's often considered hard is easier than you might think</a></li> <li><a href="//danluu.com/dunning-kruger/">This obvious fact</a> <a href="//danluu.com/tech-discrimination/">is not obvious</a> <a href="//danluu.com/integer-overflow/">at all</a></li> <li><a href="//danluu.com/postmortem-lessons/">Humans are human</a>: humans make mistakes, <a href="//danluu.com/google-sre-book/">and systems must be designed to account for that</a></li> <li><a href="//danluu.com/postmortem-lessons/">Computers will have faults</a>, <a href="//danluu.com/limplock/">and systems must be designed to account for that</a></li> <li><a href="//danluu.com/broken-builds/">Is it just me</a>, <a href="//danluu.com/everything-is-broken/">or is stuff really broken</a>?</li> <li><a href="//danluu.com/infinite-disk/">Hey</a>! <a href="//danluu.com/intel-cat/">Look at this set of papers</a>! <a href="//danluu.com/clwb-pcommit/">Let's talk about their context and why they're interesting</a></li> </ul> <h4 id="the-end">The end</h4> <p>This list also doesn't include blogs that mostly aren't about programming, so it doesn't include, for example, <a href="http://www.benkuhn.net/">Ben Kuhn's excellent blog</a>.</p> <p>Anyway, that's all for now, but this list is pretty much off the top of my head, so I'll add more as more blogs come to mind. I'll also keep this list updated with what I'm reading as I find new blogs. 
Please please please <a href="https://twitter.com/danluu">suggest other blogs I might like</a>, and don't assume that I already know about a blog because it's popular. Just for example, I had no idea who either Jeff Atwood or Zed Shaw were until a few years ago, and they were probably two of the most well known programming bloggers in existence. Even with centralized link aggregators like HN and reddit, blog discovery has become haphazard and random with the decline of blogrolls and blogging as a dialogue, as opposed to the current practice of blogging as a monologue. Also, please don't assume that I don't want to read something just because it's different from the kind of blog I normally read. I'd love to read more from UX or front-end folks; I just don't know where to find that kind of thing!</p> <p><em>Last update: 2021-07</em></p> <h4 id="archive">Archive</h4> <p>Here are some blogs I've put into an archive section because they rarely or never update.</p> <h4 id="alex-clemmer-http-blog-nullspace-io"><a href="http://blog.nullspace.io/">Alex Clemmer</a></h4> <p><a href="http://blog.nullspace.io/building-search-engines.html">This post on why making a competitor to Google search is harder than it might seem is a post in classic Alex Clemmer style</a>. The post looks at a position that's commonly believed (web search isn't all that hard and someone should come up with a better Google) and explains why that's not an obviously correct position. That's also a common theme of his comments elsewhere, such as these comments on <a href="https://news.ycombinator.com/item?id=9598224">stack ranking at MS</a>, <a href="https://news.ycombinator.com/item?id=10416015">implementing POSIX on Windows</a>, <a href="https://news.ycombinator.com/item?id=10229434">the size of the Windows codebase</a>, <a href="https://news.ycombinator.com/item?id=8867387">Bond</a>, and <a href="https://news.ycombinator.com/item?id=9735791">Bing</a>.</p> <p>He's sort of a modern mini-MSFT, in that it's incisive commentary on MS and MS related ventures.</p> <h4 id="allison-kaptur-http-akaptur-com"><a href="http://akaptur.com/">Allison Kaptur</a></h4> <p>Explorations of various areas, often Python related, such as <a href="http://akaptur.com/blog/2013/11/15/introduction-to-the-python-interpreter/">this series on the Python interpreter</a> and <a href="http://akaptur.com/blog/2014/08/02/the-cpython-peephole-optimizer-and-you/">this series on the CPython peephole optimizer</a>. Also, thoughts on broader topics like <a href="http://akaptur.com/blog/2013/07/24/systematic-debugging/">debugging</a> and <a href="http://akaptur.com/blog/2015/10/10/effective-learning-strategies-for-programmers/">learning</a>.</p> <p>Often detailed, with inline code that's meant to be read and understood (with the help of exposition that's generally quite clear).</p> <h4 id="david-dalrymple-http-davidad-github-io"><a href="http://davidad.github.io/">David Dalrymple</a></h4> <p>A mix of things from <a href="http://davidad.github.io/blog/2014/02/18/kernel-from-scratch/">writing a 64-bit kernel from scratch shortly after learning assembly</a> to a <a href="http://davidad.github.io/blog/2014/03/12/the-operating-system-is-out-of-date/">high-level overview of computer systems</a>. Rarely updated, with few posts, but each post has a lot to think about.</p> <h4 id="epita-systems-lab-https-blog-lse-epita-fr"><a href="https://blog.lse.epita.fr/">EPITA Systems Lab</a></h4> <p>Low-level.
A good example of a relatively high-level post from this blog is <a href="https://blog.lse.epita.fr/articles/74-getting-back-determinism-in-the-lfh.html">this post on the low fragmentation heap in Windows</a>. Posts like <a href="https://blog.lse.epita.fr/articles/75-sstpinball.html">how to hack a pinball machine</a> and <a href="https://blog.lse.epita.fr/articles/77-lsepc-intro.html">how to design a 386 compatible dev board</a> are typical.</p> <p>Posts are often quite detailed, with schematic/circuit diagrams. This is relatively heavy reading and I try to have pen and paper handy when I'm reading this blog.</p> <h4 id="greg-wilson-http-neverworkintheory-org"><a href="http://neverworkintheory.org/">Greg Wilson</a></h4> <p>Write-ups of papers that (should) have an impact on how people write software, like <a href="http://neverworkintheory.org/2014/10/08/simple-testing-can-prevent-most-critical-failures.html">this paper on what causes failures in distributed systems</a> or <a href="http://neverworkintheory.org/2015/08/23/perception-productivity.html">this paper on what makes people feel productive</a>. Not updated much, but <a href="http://third-bit.com/index.html">Greg still blogs on his personal site</a>.</p> <p>The posts tend to be extended abstracts that tease you into reading the paper, rather than detailed explanations of the methodology and results.</p> <h4 id="gustavo-duarte-http-duartes-org-gustavo-blog"><a href="http://duartes.org/gustavo/blog/">Gustavo Duarte</a></h4> <p>Explanations of how <a href="http://duartes.org/gustavo/blog/post/what-does-an-idle-cpu-do/">Linux works, as well as other low-level topics</a>. This particular blog seems to be on hiatus, but &quot;0xAX&quot; seems to have picked up the slack with the <a href="https://0xax.gitbooks.io/linux-insides/content/index.html">linux-insides</a> project.</p> <p>If you've read Love's book on Linux, Duarte's explanations are similar, but tend to be more about the idea and less about the implementation. They're also heavier on providing diagrams and context. &quot;0xAX&quot; is a lot more focused on walking through the code than either Love or Duarte.</p> <h4 id="huon-wilson-http-huonw-github-io"><a href="http://huonw.github.io/">Huon Wilson</a></h4> <p>Explanations of various Rust-y things, from back when Huon was working on Rust. Not updated much anymore, but the content is still great for someone who's interested in technical tidbits related to Rust.</p> <h4 id="kamal-marhubi-http-kamalmarhubi-com"><a href="http://kamalmarhubi.com/">Kamal Marhubi</a></h4> <p>Technical explorations of various topics, with a systems-y bent. <a href="http://kamalmarhubi.com/blog/2015/11/17/kubernetes-from-the-ground-up-the-scheduler/">Kubernetes</a>. <a href="http://kamalmarhubi.com/blog/2015/11/21/using-strace-to-figure-out-how-git-push-over-ssh-works/">Git push</a>. <a href="http://kamalmarhubi.com/blog/2016/04/13/rust-nix-easier-unix-systems-programming-3/">Syscalls in Rust</a>. 
Also, <a href="http://kamalmarhubi.com/blog/2015/05/28/the-excellence-uniform-build-and-test-infrastructure/">some musings on programming in general</a>.</p> <p>The technical explorations often get into enough nitty gritty detail that this is something you probably want to sit down to read, as opposed to skim on your phone.</p> <h4 id="mary-rose-cook-http-maryrosecook-com-blog"><a href="http://maryrosecook.com/blog/">Mary Rose Cook</a></h4> <p><a href="http://maryrosecook.com/blog/post/git-from-the-inside-out">Lengthy and very-detailed explanations of technical topics</a>, <a href="http://maryrosecook.com/blog/post/little-lisp-interpreter">mixed in</a> with <a href="http://maryrosecook.com/blog/post/the-fibonacci-heap-ruins-my-life">a wide variety of other posts</a>.</p> <p>The selection of topics is eclectic, and explained at a level of detail such that you'll come away with a solid understanding of the topic. The explanations are usually fine grained enough that it's hard to miss what's going on, even if you're a beginner programmer.</p> <h4 id="rebecca-frankel">Rebecca Frankel</h4> <p>As far as I know, Rebecca doesn't have a programming blog, but if you look at her apparently off-the-cuff comments on other people's posts as a blog, it's one of the best written programming blogs out there. She used to be prolific on <a href="//danluu.com/open-social-networks/">Piaw's</a> <a href="//danluu.com/mit-stanford/">Buzz</a> (and probably elsewhere, although I don't know where), and you occasionally see comments elsewhere, like on <a href="http://steve-yegge.blogspot.com/2008/06/done-and-gets-things-smart.html">this Steve Yegge blog post about brilliant engineers</a><sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup>. I wish I could write like that.</p> <h4 id="russell-smith-http-qqrs-github-io"><a href="http://qqrs.github.io/">Russell Smith</a></h4> <p>Homemade electronics projects from <a href="http://qqrs.github.io/blog/2013/05/03/vim-on-a-mechanical-typewriter/">vim on a mechanical typewriter</a> to <a href="http://qqrs.github.io/blog/2015/08/21/proofing-spirits-with-homemade-electrobalance/">building an electrobalance to proof spirits</a>.</p> <p>Posts tend to have a fair bit of detail, down to diagrams explaining parts of circuits, but the posts aren't as detailed as specs. But there are usually links to resources that will teach you enough to reproduce the project, if you want.</p> <h4 id="rwt-http-www-realworldtech-com"><a href="http://www.realworldtech.com/">RWT</a></h4> <p>I find the archives to be fun reading for insight into <a href="http://www.realworldtech.com/alpha-ev8-wider/">what people were thinking about microprocessors and computer architecture</a> over the past two decades. It can be a bit depressing to see that <a href="http://www.realworldtech.com/benchmark-examiner/">the same benchmarking controversies we had 15 years ago</a> are being repeated today, <a href="http://www.realworldtech.com/fair-benchmarks/">sometimes with the same players</a>. 
If anything, I'd say that the <a href="https://twitter.com/shipilev/status/709982588673388544">average benchmark you see passed around today</a> is worse than what you would have seen 15 years ago, even though the industry as a whole has learned a lot about benchmarking since then.</p> <h4 id="walpurgusriot-https-web-archive-org-web-20140118064249-http-walpurgisriot-github-io-blog"><a href="https://web.archive.org/web/20140118064249/http://walpurgisriot.github.io/blog">walpurgisriot</a></h4> <p>The author of walpurgisriot seems to have abandoned the github account and moved on to another user name (and a squatter appears to have picked up her old account name), but this used to be a semi-frequently updated blog with a combination of short explorations on programming and thoughts on the industry. On pure quality of prose, this is one of the best tech blogs I've ever read; the technical content and thoughts on the industry are great as well.</p> <p><em>This post was inspired by the two posts Julia Evans has on blogs she reads and by <a href="https://www.ocf.berkeley.edu/~abhishek/chicmath.htm">the Chicago undergraduate mathematics bibliography</a>, which I've found to be the most useful set of book reviews I've ever encountered.</em></p> <p><em>Thanks to Bartłomiej Filipek, Sean Barrett, Michel Schniz, Neil Henning, and Lindsey Kuper for comments/discussion/corrections.</em></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R"><p>Quote follows below, since I can see from my analytics data that relatively few people click any individual link, and people seem especially unlikely to click a link to read a comment on a blog, even if the comment is great:</p> <blockquote> <p>The key here is &quot;principally,&quot; and that I am describing motivation, not self-evaluation. The question is, what's driving you? What gets you working? If its just trying to show that you're good, then you won't be. It has to be something else too, or it won't get you through the concentrated decade of training it takes to get to that level.</p> <p>Look at the history of the person we're all presuming Steve Yegge is talking about. He graduated (with honors) in 1990 and started at Google in 1999. So he worked a long time before he got to the level of Google's star. When I was at Google I hung out on Sunday afternoons with a similar superstar. Nobody else was reliably there on Sunday; but he always was, so I could count on having someone to talk to. On some Sundays he came to work even when he had unquestionably legitimate reasons for not feeling well, but he still came to work. Why didn't he go home like any normal person would? It wasn't that he was trying to prove himself; he'd done that long ago. What was driving him?</p> <p>The only way I can describe it is one word: fury. What was he doing every Sunday? He was reviewing various APIs that were being proposed as standards by more junior programmers, and he was always finding things wrong with them. What he would talk about, or rather, rage about, on these Sunday afternoons was always about some idiocy or another that someone was trying make standard, and what was wrong with it, how it had to be fixed up, etc, etc. He was always in a high dudgeon over it all.</p> <p>What made him come to work when he was feeling sick and dizzy and nobody, not even Larry and Sergey with their legendary impatience, not even them, I mean nobody would have thought less of him if he had just gone home &amp; gone to sleep?
He seemed to be driven, not by ambition, but by fear that if he stopped paying attention, something idiotically wrong (in his eyes) might get past him, and become the standard, and that was just unbearable, the thought made him so incoherently angry at the sheer wrongness of it, that he had to stay awake and prevent it from happening no matter how legitimately bad he was feeling at the time.</p> <p>It made me think of Paul Graham's comment: &quot;What do I mean by good people? One of the best tricks I learned during our startup was a rule for deciding who to hire. Could you describe the person as an animal?... I mean someone who takes their work a little too seriously; someone who does what they do so well that they pass right through professional and cross over into obsessive.</p> <p>What it means specifically depends on the job: a salesperson who just won't take no for an answer; a hacker who will stay up till 4:00 AM rather than go to bed leaving code with a bug in it; a PR person who will cold-call New York Times reporters on their cell phones; a graphic designer who feels physical pain when something is two millimeters out of place.&quot;</p> <p>I think a corollary of this characterization is that if you really want to be &quot;an animal,&quot; what you have cultivate in yourself is partly ambition, but it is partly also self-knowledge. As Paul Graham says, there are different kinds of animals. The obsessive graphic designer might be unconcerned about an API that is less than it could be, while the programming superstar might pass by, or create, a terrible graphic design without the slightest twinge of misgiving.</p> <p>Therefore, key question is: are you working on the thing you care about most? If its wrong, is it unbearable to you? Nothing but deep seated fury will propel you to the level of a superstar. Getting there hurts too much; mere desire to be good is not enough. If its not in you, its not in you. You have to be propelled by elemental wrath. Nothing less will do.</p> <p>Or it might be in you, but just not in this domain. You have to find what you care about, and not just what you care about, but what you care about violently: you can't fake it.</p> <p>(Also, if you do have it in you, you still have to choose your boss carefully. No matter how good you are, it may not be trivial to find someone you can work for. There's more to say here; but I'll have to leave it for another comment.)</p> <p>Another clarification of my assertion &quot;if you're wondering if you're good, then you're not&quot; should perhaps be said &quot;if you need reassurance from someone else that you're good, then you're not.&quot; One characteristic of these &quot;animals&quot; is that they are such obsessive perfectionists that their own internal standards so far outstrip anything that anyone else could hold them to, that no ordinary person (i.e. ordinary boss) can evaluate them. As Steve Yegge said, they don't go for interviews. They do evaluate each other -- at Google the superstars all reviewed each other's code, reportedly brutally -- but I don't think they cared about the judgments of anyone who wasn't in their circle or at their level.</p> <p>I agree with Steve Yegge's assertion that there are an enormously important (small) group of people who are just on another level, and ordinary smart hardworking people just aren't the same. 
Here's another way to explain why there should be a quantum jump -- perhaps I've been using this discussion to build up this idea: its the difference between people who are still trying to do well on a test administered by someone else, and the people who have found in themselves the ability to grade their own test, more carefully, with more obsessive perfectionism, than anyone else could possibly impose on them.</p> <p>School, for all it teaches, may have one bad lasting effect on people: it gives them the idea that good people get A's on tests, and better ones get A+'s on tests, and the very best get A++'s. Then you get the idea that you go out into the real world, and your boss is kind of super-professor, who takes over the grading of the test. Joel Spolsky is accepting that role, being boss as super-professor, grading his employees tests for them, telling them whether they are good.</p> <p>But the problem is that in the real world, the very most valuable, most effective people aren't the ones who are trying to get A+++'s on the test you give them. The very best people are the ones who can make up their own test with harder problems on it than you could ever think of, and you'd have to have studied for the same ten years they have to be able even to know how to grade their answers.</p> <p>That's a problem, incidentally, with the idea of a meritocracy. School gives you an idea of a ladder of merit that reaches to the top. But it can't reach all the way to the top, because someone has to measure the rungs. At the top you're not just being judged on how high you are on the ladder. You're also being judged on your ability to &quot;grade your own test&quot;; that is to say, your trustworthiness. People start asking whether you will enforce your own standards even if no one is imposing them on you. They have to! because at the top people get given jobs with the kind of responsibility where no one can possibly correct you if you screw up. I'm giving you an image of someone who is working himself sick, literally, trying grade everyone else's work. In the end there is only so much he can do, and he does want to go home and go to bed sometimes. That means he wants people under him who are not merely good, but can be trusted not to need to be graded. Somebody has to watch the watchers, and in the end, the watchers have to watch themselves.</p> </blockquote> <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> </ol> </div> Google SRE book google-sre-book/ Mon, 11 Apr 2016 01:00:58 -0700 google-sre-book/ <p><a href="https://www.amazon.com/gp/product/149192912X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=149192912X&amp;linkId=2a578b357abc8b995368a039dd517601">The book</a> starts with a story about a time Margaret Hamilton brought her young daughter with her to NASA, back in the days of the Apollo program. During a simulation mission, her daughter caused the mission to crash by pressing some keys that caused a prelaunch program to run during the simulated mission. 
Hamilton submitted a change request to add error checking code to prevent the error from happening again, but the request was rejected because the error case should never happen.</p> <p>On the next mission, Apollo 8, that exact error condition occurred and a potentially fatal problem that could have been prevented with a trivial check took NASA’s engineers 9 hours to resolve.</p> <p><em>This sounds familiar -- I’ve lost track of the number of dev post-mortems that have the same basic structure.</em></p> <p><em>This is an experiment in note-taking for me in two ways. First, I normally take pen and paper notes and then scan them in for posterity. Second, I normally don’t post my notes online, but I’ve been inspired to try this by <a href="http://scattered-thoughts.net/">Jamie Brandon’s notes on books he’s read</a>. My handwritten notes are a series of bullet points, which may not translate well into markdown. One issue is that my markdown renderer doesn’t handle more than one level of nesting, so things will get artificially flattened. There are probably more issues. Let’s find out what they are! In case it's not obvious, asides from me are in italics.</em></p> <h3 id="chapter-1-introduction">Chapter 1: Introduction</h3> <p><em>Everything in this chapter is covered in much more detail later.</em></p> <p>Two approaches to hiring people to manage system stability:</p> <h4 id="traditional-approach-sysadmins">Traditional approach: sysadmins</h4> <ul> <li>Assemble existing components and deploy to produce a service</li> <li>Respond to events and updates as they occur</li> <li>Grow team to absorb increased work as service grows</li> <li>Pros <ul> <li>Easy to implement because it’s standard</li> <li>Large talent pool to hire from</li> <li>Lots of available software</li> </ul></li> <li>Cons <ul> <li>Manual intervention for change management and event handling causes size of team to scale with load on system</li> <li>Ops is fundamentally at odds with dev, which can cause pathological resistance to changes, which causes similarly pathological response from devs, which reclassify “launches” as “incremental updates”, “flag flips”, etc.</li> </ul></li> </ul> <h4 id="google-s-approach-sres">Google’s approach: SREs</h4> <ul> <li>Have software engineers do operations</li> <li>Candidates should be able to pass or nearly pass normal dev hiring bar, and may have some additional skills that are rare among devs (e.g., L1 - L3 networking or UNIX system internals).</li> <li>Career progress comparable to dev career track</li> <li>Results <ul> <li>SREs would be bored by doing tasks by hand</li> <li>Have the skillset necessary to automate tasks</li> <li>Do the same work as an operations team, but with automation instead of manual labor</li> </ul></li> <li>To avoid manual labor trap that causes team size to scale with service load, Google places a 50% cap on the amount of “ops” work for SREs <ul> <li>Upper bound. Actual amount of ops work is expected to be much lower</li> </ul></li> <li>Pros <ul> <li>Cheaper to scale</li> <li>Circumvents devs/ops split</li> </ul></li> <li>Cons <ul> <li>Hard to hire for</li> <li>May be unorthodox in ways that require management support (e.g., product team may push back against decision to stop releases for the quarter because the error budget is depleted)</li> </ul></li> </ul> <p><em>I don’t really understand how this is an example of circumventing the dev/ops split. 
I can see how it’s true in one sense, but the example of stopping all releases because an error budget got hit doesn’t seem fundamentally different from the “sysadmin” example where teams push back against launches. It seems that SREs have more political capital to spend and that, in the specific examples given, the SREs might be more reasonable, but there’s no reason to think that sysadmins can’t be reasonable.</em></p> <h4 id="tenets-of-sre">Tenets of SRE</h4> <ul> <li>SRE team responsible for latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning</li> </ul> <h4 id="ensuring-a-durable-focus-on-engineering">Ensuring a durable focus on engineering</h4> <ul> <li>50% ops cap means that extra ops work is redirected to product teams on overflow</li> <li>Provides feedback mechanism to product teams as well as keeps load down</li> <li>Target max 2 events per 8-12 hour on-call shift</li> <li>Postmortems for all serious incidents, even if they didn’t trigger a page</li> <li>Blameless postmortems</li> </ul> <p><em>2 events per shift is the max, but what’s the average? How many on-call events are expected to get sent from the SRE team to the dev team per week?</em></p> <p><em>How do you get from a blameful postmortem culture to a blameless postmortem culture? Now that everyone knows that you should have blameless postmortems, <a href="//danluu.com/wat/">everyone will claim to do them. Sort of like having good testing and deployment practices</a>. I’ve been lucky to be on an on call rotation that’s never gotten paged, but when I talk to folks who joined recently and are on call, they have not so great stories of finger pointing, trash talk, and blame shifting. The fact that everyone knows you’re supposed to be blameless seems to make it harder to call out blamefulness, not easier.</em></p> <h4 id="move-fast-without-breaking-slo">Move fast without breaking SLO</h4> <ul> <li>Error budget. 100% is the wrong reliability target for basically everything</li> <li>Going from 5 9s to 100% reliability isn’t noticeable to most users and requires tremendous effort</li> <li>Set a goal that acknowledges the trade-off and leaves an error budget</li> <li>Error budget can be spent on anything: launching features, etc.</li> <li>Error budget allows for discussion about how phased rollouts and 1% experiments can maintain tolerable levels of errors</li> <li>Goal of SRE team isn’t “zero outages” -- SRE and product devs are incentive aligned to spend the error budget to get maximum feature velocity</li> </ul> <p><em>It’s not explicitly stated, but for teams that need to “move fast”, consistently coming in way under the error budget could be taken as a sign that the team is spending too much effort on reliability.</em></p> <p><em>I like this idea a lot, but when I discussed this with Jessica Kerr, she pushed back on this idea because maybe you’re just under your error budget because you got lucky and a single really bad event can wipe out your error budget for the next decade. Followup question: how can you be confident enough in your risk model that you can purposefully consume error budget to move faster without worrying that a downstream (in time) bad event will put you overbudget? 
Nat Welch (a former Google SRE) responded to this by saying that you can build confidence through simulated disasters and other testing.</em></p> <h4 id="monitoring">Monitoring</h4> <ul> <li>Monitoring should never require a human to interpret any part of the alerting domain</li> <li>Three valid kinds of monitoring output <ul> <li>Alerts: human needs to take action immediately</li> <li>Tickets: human needs to take action eventually</li> <li>Logging: no action needed</li> <li>Note that, for example, graphs are a type of log</li> </ul></li> </ul> <h4 id="emergency-response">Emergency Response</h4> <ul> <li>Reliability is a function of MTTF (mean-time-to-failure) and MTTR (mean-time-to-recovery)</li> <li>For evaluating responses, we care about MTTR</li> <li>Humans add latency</li> <li>Systems that don’t require humans to respond will have higher availability due to lower MTTR</li> <li>Having a “playbook” produces 3x lower MTTR <ul> <li>Having hero generalists who can respond to everything works, but having playbooks works better</li> </ul></li> </ul> <p><em>I personally agree, but boy do we like our on-call heroes. I wonder how we can foster a culture of documentation.</em></p> <h4 id="change-management">Change management</h4> <ul> <li>70% of outages due to changes in a live system. Mitigation: <ul> <li>Implement progressive rollouts</li> <li>Monitoring</li> <li>Rollback</li> </ul></li> <li>Remove humans from the loop, avoid standard human problems on repetitive tasks</li> </ul> <h4 id="demand-forecasting-and-capacity-planning-http-www-amazon-com-gp-product-0596518579-ref-as-li-tl-ie-utf8-camp-1789-creative-9325-creativeasin-0596518579-linkcode-as2-tag-abroaview-20-linkid-ek2pbwyshyk26giv">Demand forecasting and <a href="http://www.amazon.com/gp/product/0596518579/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0596518579&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=EK2PBWYSHYK26GIV">capacity planning</a></h4> <ul> <li>Straightforward, but a surprising number of teams/services don’t do it</li> </ul> <h4 id="provisioning">Provisioning</h4> <ul> <li>Adding capacity riskier than load shifting, since it often involves spinning up new instances/locations, making significant changes to existing systems (config files, load balancers, etc.)</li> <li>Expensive enough that it should be done only when necessary; must be done quickly <ul> <li>If you don’t know what you actually need and overprovision, that costs money</li> </ul></li> </ul> <h4 id="efficiency-and-performance">Efficiency and performance</h4> <ul> <li>Load slows down systems</li> <li>SREs provision to meet capacity target with a specific response time goal</li> <li>Efficiency == money</li> </ul> <h3 id="chapter-2-the-production-environment-at-google-from-the-viewpoint-of-an-sre">Chapter 2: The production environment at Google, from the viewpoint of an SRE</h3> <p><em>No notes on this chapter because I’m already pretty familiar with it. TODO: maybe go back and read this chapter in more detail.</em></p> <h3 id="chapter-3-embracing-risk">Chapter 3: Embracing risk</h3> <ul> <li>Ex: if a user is on a smartphone with 99% reliability, they can’t tell the difference between 99.99% and 99.999% reliability</li> </ul>
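<p><em>To make the 9s concrete, here's a rough back-of-the-envelope sketch (mine, not the book's; the function names and example numbers are made up) that turns an availability target into a time-based downtime budget and a request-based error budget:</em></p> <pre><code>def downtime_budget_minutes(availability, days=90):
    # Time-based view: minutes of full outage the target allows
    # over a window (default window: one quarter).
    return (1.0 - availability) * days * 24 * 60

def request_error_budget(availability, total_requests):
    # Request-based view (aggregate availability = successful / total):
    # how many failed requests the target allows.
    return int((1.0 - availability) * total_requests)

for target in (0.99, 0.999, 0.9999, 0.99999):
    print(target,
          round(downtime_budget_minutes(target), 1), 'min/quarter,',
          request_error_budget(target, 1_000_000_000), 'errors per 1B requests')
</code></pre> <p><em>The 99.99% row comes out to about 13 minutes per quarter, which matches the response-time bullet in the chapter 11 notes below.</em></p>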
<h4 id="managing-risk">Managing risk</h4> <ul> <li>Reliability isn’t linear in cost. It can easily cost 100x more to get one additional increment of reliability <ul> <li>Cost associated with redundant equipment</li> <li>Cost of building out features for reliability as opposed to “normal” features</li> <li>Goal: make systems reliable enough, but not too reliable!</li> </ul></li> </ul> <h4 id="measuring-service-risk">Measuring service risk</h4> <ul> <li>Standard practice: identify metric to represent property of system to optimize</li> <li>Possible metric = uptime / (uptime + downtime) <ul> <li>Problematic for a globally distributed service. What does uptime really mean?</li> </ul></li> <li>Aggregate availability = successful requests / total requests <ul> <li>Obv, not all requests are equal, but aggregate availability is an ok first order approximation</li> </ul></li> <li>Usually set quarterly targets</li> </ul> <h4 id="risk-tolerance-of-services">Risk tolerance of services</h4> <ul> <li>Usually not objectively obvious</li> <li>SREs work with product owners to translate business objectives into explicit objectives</li> </ul> <h4 id="identifying-risk-tolerance-of-consumer-services">Identifying risk tolerance of consumer services</h4> <p><em>TODO: maybe read this in detail on second pass</em></p> <h4 id="identifying-risk-tolerance-of-infrastructure-services">Identifying risk tolerance of infrastructure services</h4> <h5 id="target-availability">Target availability</h5> <ul> <li>Running ex: Bigtable <ul> <li>Some consumer services serve data directly from Bigtable -- need low latency and high reliability</li> <li>Some teams use Bigtable as a backing store for offline analysis -- care more about throughput than reliability</li> </ul></li> <li>Too expensive to meet all needs generically <ul> <li>Ex: Bigtable instance</li> <li>Low-latency Bigtable user wants low queue depth</li> <li>Throughput oriented Bigtable user wants moderate to high queue depth</li> <li>Success and failure are diametrically opposed in these two cases!</li> </ul></li> </ul> <h5 id="cost">Cost</h5> <ul> <li>Partition infra and offer different levels of service</li> <li>In addition to obv. benefits, allows service to externalize the cost of providing different levels of service (e.g., expect latency oriented service to be more expensive than throughput oriented service)</li> </ul> <h4 id="motivation-for-error-budgets">Motivation for error budgets</h4> <p><em>No notes on this because I already believe all of this.
Maybe go back and re-read this if involved in debate about this.</em></p> <h3 id="chapter-4-service-level-objectives">Chapter 4: Service level objectives</h3> <p><em>Note: skipping notes on terminology section.</em></p> <ul> <li>Ex: Chubby planned outages <ul> <li>Google found that Chubby was consistently over its SLO, and that global Chubby outages would cause unusually bad outages at Google</li> <li>Chubby was so reliable that teams were incorrectly assuming that it would never be down and failing to design systems that account for failures in Chubby</li> <li>Solution: take Chubby down globally when it’s too far above its SLO for a quarter to “show” teams that Chubby can go down</li> </ul></li> </ul> <h4 id="what-do-you-and-your-users-care-about">What do you and your users care about?</h4> <ul> <li>Too many indicators: hard to pay attention</li> <li>Too few indicators: might ignore important behavior</li> <li>Different classes of services should have different indicators <ul> <li>User-facing: availability, latency, throughput</li> <li>Storage: latency, availability, durability</li> <li>Big data: throughput, end-to-end latency</li> </ul></li> <li>All systems care about correctness</li> </ul> <h4 id="collecting-indicators">Collecting indicators</h4> <ul> <li>Can often do naturally from server, but client-side metrics sometimes needed.</li> </ul> <h4 id="aggregation">Aggregation</h4> <ul> <li>Use distributions and not averages</li> <li>User studies show that people usually prefer slower average with better tail latency</li> <li>Standardize on common defs, e.g., average over 1 minute, average over tasks in cluster, etc. <ul> <li>Can have exceptions, but having reasonable defaults makes things easier</li> </ul></li> </ul> <h4 id="choosing-targets">Choosing targets</h4> <ul> <li>Don’t pick target based on current performance <ul> <li>Current performance may require heroic effort</li> </ul></li> <li>Keep it simple</li> <li>Avoid absolutes <ul> <li>Unreasonable to talk about “infinite” scale or “always” available</li> </ul></li> <li>Minimize number of SLOs</li> <li>Perfection can wait <ul> <li>Can always redefine SLOs over time</li> </ul></li> <li>SLOs set expectations <ul> <li>Keep a safety margin (internal SLOs can be defined more loosely than external SLOs)</li> </ul></li> <li>Don’t overachieve <ul> <li>See Chubby example, above</li> <li>Another example is making sure that the system isn’t too fast under light loads</li> </ul></li> </ul> <h3 id="chapter-5-eliminating-toil">Chapter 5: Eliminating toil</h3> <p>Carla Geisser: &quot;If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.&quot;</p> <ul> <li>Def: Toil <ul> <li>Not just “work I don’t want to do”</li> <li>Manual</li> <li>Repetitive</li> <li>Automatable</li> <li>Tactical</li> <li>No enduring value</li> <li>O(n) with service growth</li> </ul></li> <li>In surveys, find 33% toil on average <ul> <li>Numbers can be as low as 0% and as high as 80%</li> <li>Toil &gt; 50% is a sign that the manager should spread toil load more evenly</li> </ul></li> <li>Is toil always bad? <ul> <li>Predictable and repetitive tasks can be calming</li> <li>Can produce a sense of accomplishment, can be low-risk / low-stress activities</li> </ul></li> </ul> <p><em>Section on why toil is bad. Skipping notetaking for that section.</em></p> <h3 id="chapter-6-monitoring-distributed-systems">Chapter 6: Monitoring distributed systems</h3> <ul> <li>Why monitor? 
<ul> <li>Analyze long-term trends</li> <li>Compare over time or do experiments</li> <li>Alerting</li> <li>Building dashboards</li> <li>Debugging</li> </ul></li> </ul> <p><em>As Alex Clemmer is wont to say, our problem isn’t that we move too slowly, it’s that we build the wrong thing. I wonder how we could get from where we are today to having enough instrumentation to be able to make informed decisions when building new systems.</em></p> <h4 id="setting-reasonable-expectations">Setting reasonable expectations</h4> <ul> <li>Monitoring is non-trivial</li> <li>10-12 person SRE team typically has 1-2 people building and maintaining monitoring</li> <li>Number has decreased over time due to improvements in tooling/libs/centralized monitoring infra</li> <li>General trend towards simpler/faster monitoring systems, with better tools for post hoc analysis</li> <li>Avoid “magic” systems</li> <li>Limited success with complex dependency hierarchies (e.g., “if DB slow, alert for DB, otherwise alert for website”). <ul> <li>Used mostly (only?) for very stable parts of system</li> </ul></li> <li>Rules that generate alerts for humans should be simple to understand and represent a clear failure (rough sketch below)</li> </ul> <p><em>Avoiding magic includes avoiding ML?</em></p> <ul> <li>Lots of white-box monitoring</li> <li>Some black-box monitoring for critical stuff</li> <li>Four golden signals <ul> <li>Latency</li> <li>Traffic</li> <li>Errors</li> <li>Saturation</li> </ul></li> </ul> <p><em>Interesting examples from Bigtable and Gmail from chapter not transcribed. A lot of information on the importance of keeping alerts simple also not transcribed.</em></p> <h4 id="the-long-run">The long run</h4> <ul> <li>There’s often a tension between long-run and short-run availability</li> <li>Can sometimes fix unreliable systems through heroic effort, but that’s a burnout risk and also a failure risk</li> <li>Taking a controlled hit in short-term reliability is usually the better trade</li> </ul>
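<p><em>As a rough illustration of the “simple rules that represent a clear failure” bullet above, here's a tiny sketch of an alert on one golden signal (this is mine, not anything from the book or Borgmon; the class name, threshold, and duration are made up; the minimum-duration idea is borrowed from the chapter 10 notes below):</em></p> <pre><code>from collections import deque

class ErrorRateAlert:
    # One simple, human-interpretable rule: fire when the error ratio
    # stays above a threshold for min_duration consecutive evaluations.
    # Illustrative only; not Borgmon's rule language.
    def __init__(self, threshold=0.01, min_duration=2):
        self.threshold = threshold
        self.min_duration = min_duration
        self.recent = deque(maxlen=min_duration)

    def evaluate(self, errors, total):
        ratio = errors / total if total else 0.0
        self.recent.append(ratio &gt; self.threshold)
        # Fire only if every recent evaluation breached the threshold.
        return len(self.recent) == self.min_duration and all(self.recent)

alert = ErrorRateAlert()
for errors, total in [(5, 1000), (30, 1000), (42, 1000)]:
    print(alert.evaluate(errors, total))   # False, False, True
</code></pre>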
<h3 id="chapter-7-evolution-of-automation-at-google">Chapter 7: Evolution of automation at Google</h3> <ul> <li>“Automation is a force multiplier, not a panacea”</li> <li>Value of automation <ul> <li>Consistency</li> <li>Extensibility</li> <li>MTTR</li> <li>Faster non-repair actions</li> <li>Time savings</li> </ul></li> </ul> <p><em>Multiple interesting case studies and explanations skipped in notes.</em></p> <h3 id="chapter-8-release-engineering">Chapter 8: Release engineering</h3> <ul> <li>This is a specific job function at Google</li> </ul> <h4 id="release-engineer-role">Release engineer role</h4> <ul> <li>Release engineers work with SWEs and SREs to define how software is released <ul> <li>Allows dev teams to focus on dev work</li> </ul></li> <li>Define best practices <ul> <li>Compiler flags, formats for build ID tags, etc.</li> </ul></li> <li>Releases automated</li> <li>Models vary between teams <ul> <li>Could be “push on green” and deploy every build</li> <li>Could be hourly builds and deploys</li> <li>etc.</li> </ul></li> <li>Hermetic builds <ul> <li>Building same rev number should always give identical results</li> <li>Self-contained -- this includes versioning everything down to the compiler used</li> <li>Can cherry-pick fixes against an old rev to fix production software</li> </ul></li> <li>Virtually all changes require code review</li> <li>Branching <ul> <li>All code in main branch</li> <li>Releases are branched off</li> <li>Fixes can go from master to branch</li> <li>Branches never merged back</li> </ul></li> <li>Testing <ul> <li>CI</li> <li>Release process creates an audit trail that runs tests and shows that tests passed</li> </ul></li> <li>Config management <ul> <li>Deceptively simple, <a href="//danluu.com/postmortem-lessons/">can cause instability</a></li> </ul></li> <li>Many possible schemes (all involve storing config in source control and having strict config review)</li> <li>Use mainline for config -- config maintained at head and applied immediately <ul> <li>Originally used for Borg (and pre-Borg systems)</li> <li>Binary releases and config changes decoupled!</li> </ul></li> <li>Include config files and binaries in same package <ul> <li>Simple</li> <li>Tightly couples binary and config -- ok for projects with few config files or where few configs change</li> </ul></li> <li>Package config into “configuration packages” <ul> <li>Same hermetic principle as for code</li> </ul></li> <li>Release engineering shouldn’t be an afterthought! <ul> <li>Budget resources at beginning of dev cycle</li> </ul></li> </ul> <h3 id="chapter-9-simplicity">Chapter 9: Simplicity</h3> <ul> <li>Stability vs. agility <ul> <li>Can make things stable by freezing -- need to balance the two</li> <li>Reliable systems can increase agility</li> <li>Reliable rollouts make it easier to link changes to bugs</li> </ul></li> <li>Virtue of boring!</li> <li><a href="https://en.wikipedia.org/wiki/No_Silver_Bullet">Essential vs. accidental complexity</a> <ul> <li>SREs should push back when accidental complexity is introduced</li> </ul></li> <li>Code is a liability <ul> <li>Remove dead code or other bloat</li> </ul></li> <li>Minimal APIs <ul> <li>Smaller APIs easier to test, more reliable</li> </ul></li> <li>Modularity <ul> <li>API versioning</li> <li>Same as code, where you’d avoid misc/util classes</li> </ul></li> <li>Releases <ul> <li>Small releases easier to measure</li> <li>Can’t tell what happened if we released 100 changes together</li> </ul></li> </ul> <h3 id="chapter-10-altering-from-time-series-data">Chapter 10: Alerting from time-series data</h3> <h4 id="borgmon">Borgmon</h4> <ul> <li>Similar-ish to <a href="https://prometheus.io/">Prometheus</a></li> <li>Common data format for logging</li> <li>Data used for both dashboards and alerts</li> <li>Formalized a legacy data format, “varz”, which allowed metrics to be viewed via HTTP (see the sketch after this list) <ul> <li>To view metrics manually, go to <a href="http://foo:80/varz">http://foo:80/varz</a></li> </ul></li> <li>Adding a metric only requires a single declaration in code <ul> <li>low user-cost to add new metric</li> </ul></li> <li>Borgmon fetches /varz from each target periodically <ul> <li>Also includes synthetic data like health check, if name was resolved, etc.</li> </ul></li> <li>Time series arena <ul> <li>Data stored in-memory, with checkpointing to disk</li> <li>Fixed sized allocation</li> <li>GC expires oldest entries when full</li> <li>conceptually a 2-d array with time on one axis and items on the other axis</li> <li>24 bytes for a data point -&gt; 1M unique time series for 12 hours at 1-minute intervals = 17 GB</li> </ul></li> <li>Borgmon rules <ul> <li>Algebraic expressions</li> <li>Compute time-series from other time-series</li> <li>Rules evaluated in parallel on a threadpool</li> </ul></li> <li>Counters vs.
gauges <ul> <li>Def: counters are non-decreasing</li> <li>Def: gauges can take any value</li> <li>Counters preferred to gauges because gauges can lose information depending on sampling interval</li> </ul></li> <li>Alerting <ul> <li>Borgmon rules can trigger alerts</li> <li>Have minimum duration to prevent “flapping”</li> <li>Usually set to two duration cycles so that missed collections don’t trigger an alert</li> </ul></li> <li>Scaling <ul> <li>Borgmon can take time-series data from other Borgmon (uses binary streaming protocol instead of the text-based varz protocol)</li> <li>Can have multiple tiers of filters</li> </ul></li> <li>Prober <ul> <li>Black-box monitoring that monitors what the user sees</li> <li>Can be queried with varz or directly send alerts to Alertmanager</li> </ul></li> <li>Configuration <ul> <li>Separation between definition of rules and targets being monitored</li> </ul></li> </ul>
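<p><em>Here's a toy sketch of the varz idea from the list above (mine, not Borgmon's; the metric names, port, and output format are made up): each task exposes its current metric values as plain text over HTTP, which is the part a collector would scrape periodically. The comments note the counter vs. gauge distinction:</em></p> <pre><code>import threading, time
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = {'requests_total': 0,      # counter: only ever increases
           'inflight_requests': 0}   # gauge: can go up or down

class VarzHandler(BaseHTTPRequestHandler):
    # Serve one 'name value' pair per line, roughly the shape of a
    # /varz or /metrics page.
    def do_GET(self):
        body = '\n'.join(f'{k} {v}' for k, v in METRICS.items()).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    HTTPServer(('', port), VarzHandler).serve_forever()

if __name__ == '__main__':
    threading.Thread(target=serve, daemon=True).start()
    METRICS['requests_total'] += 1   # instrumenting code is one line per metric
    time.sleep(1)
</code></pre>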
<h3 id="chapter-11-being-on-call">Chapter 11: Being on-call</h3> <ul> <li>Typical response time <ul> <li>5 min for user-facing or other time-critical tasks</li> <li>30 min for less time-sensitive stuff</li> </ul></li> <li>Response times linked to SLOs <ul> <li>Ex: 99.99% for a quarter is 13 minutes of downtime; clearly can’t have response time above 13 minutes</li> <li>Services with looser SLOs can have response times in the 10s of minutes (or more?)</li> </ul></li> <li>Primary vs secondary on-call <ul> <li>Work distribution varies by team</li> <li>In some, secondary can be backup for primary</li> <li>In others, secondary handles non-urgent / non-paging events, primary handles pages</li> </ul></li> <li>Balanced on-call <ul> <li>Def: quantity: percent of time on-call</li> <li>Def: quality: number of incidents that occur while on call</li> </ul></li> </ul> <p><em>This is great. We should do this. People sometimes get really rough on-call rotations a few times in a row and considering the infrequency of on-call rotations there’s no reason to expect that this should randomly balance out over the course of a year or two.</em></p> <ul> <li>Balance in quantity <ul> <li>&gt;= 50% of SRE time goes into engineering</li> <li>Of remainder, no more than 25% spent on-call</li> </ul></li> <li>Prefer multi-site teams <ul> <li>Night shifts are bad for health, multi-site teams allow elimination of night shifts</li> </ul></li> <li>Balance in quality <ul> <li>On average, dealing with an incident (incl root-cause analysis, remediation, writing postmortem, fixing bug, etc.) takes 6 hours.</li> <li>=&gt; shouldn’t have more than 2 incidents in a 12-hour on-call shift</li> <li>To stay within upper bound, want very flat distribution of pages, with median value of 0</li> </ul></li> <li>Compensation -- extra pay for being on-call (time-off or cash)</li> </ul> <h3 id="chapter-12-effective-troubleshooting">Chapter 12: Effective troubleshooting</h3> <p><em>No notes for this chapter.</em></p> <h3 id="chapter-13-emergency-response">Chapter 13: Emergency response</h3> <ul> <li>Test-induced emergency <ul> <li><a href="http://queue.acm.org/detail.cfm?id=2371516">SREs break systems to see what happens</a></li> </ul></li> <li>Ex: want to flush out hidden dependencies on a distributed MySQL database <ul> <li>Plan: block access to 1/100 of DBs</li> <li>Response: dependent services report that they’re unable to access key systems</li> <li>SRE response: SRE aborts exercise, tries to roll back permissions change</li> <li>Rollback attempt fails</li> <li>Attempt to restore access to replicas works</li> <li>Normal operation restored in 1 hour</li> <li>What went well: dependent teams escalated issues immediately, were able to restore access</li> <li>What we learned: had an insufficient understanding of the system and its interaction with other systems, failed to follow incident response that would have informed customers of outage, hadn’t tested rollback procedures in test env</li> </ul></li> <li>Change-induced emergency <ul> <li>Changes can cause failures!</li> </ul></li> <li>Ex: config change to abuse prevention infra pushed on Friday triggered crash-loop bug <ul> <li>Almost all externally facing systems depend on this, become unavailable</li> <li>Many internal systems also have dependency and become unavailable</li> <li>Alerts start firing within seconds</li> <li>Within 5 minutes of config push, engineer who pushed change rolled back change and services started recovering</li> <li>What went well: monitoring fired immediately, incident management worked well, out-of-band communications systems kept people up to date even though many systems were down, luck (engineer who pushed change was following real-time comms channels, which isn’t part of the release procedure)</li> <li>What we learned: push to canary didn’t trigger same issue because it didn’t hit a specific config keyword combination; push was considered low-risk and went through less stringent canary process, alerting was too noisy during outage</li> </ul></li> <li>Process-induced emergency</li> </ul> <p><em>No notes on process-induced example.</em></p> <h3 id="chapter-14-managing-incidents">Chapter 14: Managing incidents</h3> <p><em>This is an area where we seem to actually be pretty good. No notes on this chapter.</em></p> <h3 id="chapter-15-postmortem-culture-learning-from-failure">Chapter 15: Postmortem culture: learning from failure</h3> <p><em><a href="//danluu.com/postmortem-lessons/">I'm in strong agreement with most of this chapter</a>. No notes.</em></p> <h3 id="chapter-16-tracking-outages">Chapter 16: Tracking outages</h3> <ul> <li>Escalator: centralized system that tracks ACKs to alerts, notifies other people if necessary, etc.</li> <li>Outalator: gives time-interleaved view of notifications for multiple queues <ul> <li>Also saves related email and allows marking some messages as “important”, can collapse non-important messages, etc.</li> </ul></li> </ul> <p><em>Our version of Escalator seems fine.
We could really use something like Outalator, though.</em></p> <h3 id="chapter-17-testing-for-reliability">Chapter 17: Testing for reliability</h3> <p><em>Preaching to the choir. No notes on this section. We could really do a lot better here, though.</em></p> <h3 id="chapter-18-software-engineering-in-sre">Chapter 18: Software engineering in SRE</h3> <ul> <li>Ex: Auxon, capacity planning automation tool</li> <li>Background: traditional capacity planning cycle <ul> <li>1) collect demand forecasts (quarters to years in advance)</li> <li>2) Plan allocations</li> <li>3) Review plan</li> <li>4) Deploy and config resources</li> </ul></li> <li>Traditional approach cons <ul> <li>Many things can affect plan: increase in efficiency, increase in adoption rate, cluster delivery date slips, etc.</li> <li>Even small changes require rechecking allocation plan</li> <li>Large changes may require total rewrite of plan</li> <li>Labor intensive and error prone</li> </ul></li> <li>Google solution: intent-based capacity planning <ul> <li>Specify requirements, not implementation</li> <li>Encode requirements and autogenerate a capacity plan</li> <li>In addition to saving labor, solvers can do better than human generated solutions =&gt; cost savings</li> </ul></li> <li>Ladder of examples of increasingly intent based planning <ul> <li>1) Want 50 cores in clusters X, Y, and Z -- why those resources in those clusters?</li> <li>2) Want 50-core footprint in any 3 clusters in region -- why that many resources and why 3?</li> <li>3) Want to meet demand with N+2 redundancy -- why N+2?</li> <li>4) Want 5 9s of reliability. Could find, for example, that N+2 isn’t sufficient</li> </ul></li> <li>Found that greatest gains are from going to (3) <ul> <li>Some sophisticated services may go for (4)</li> </ul></li> <li>Putting constraints into tools allows tradeoffs to be consistent across fleet <ul> <li>As opposed to making individual ad hoc decisions</li> </ul></li> <li>Auxon inputs <ul> <li>Requirements (e.g., “service must be N+2 per continent”, “frontend servers no more than 50ms away from backend servers”</li> <li>Dependencies</li> <li>Budget priorities</li> <li>Performance data (how a service scales)</li> <li>Demand forecast data (note that services like Colossus have derived forecasts from dependent services)</li> <li>Resource supply &amp; pricing</li> </ul></li> <li>Inputs go into solver (mixed-integer or linear programming solver)</li> </ul> <p><em>No notes on why SRE software, how to spin up a group, etc. TODO: re-read back half of this chapter and take notes if it’s ever directly relevant for me.</em></p> <h3 id="chapter-19-load-balancing-at-the-frontend">Chapter 19: Load balancing at the frontend</h3> <p><em>No notes on this section. Seems pretty similar to what we have in terms of high-level goals, and the chapter doesn’t go into low-level details. It’s notable that they do [redacted] differently from us, though. 
For more info on lower-level details, there’s <a href="https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf">the Maglev paper</a>.</em></p> <h3 id="chapter-20-load-balancing-in-the-datacenter">Chapter 20: Load balancing in the datacenter</h3> <ul> <li>Flow control</li> <li>Need to avoid unhealthy tasks</li> <li>Naive flow control for unhealthy tasks <ul> <li>Track number of requests to a backend</li> <li>Treat backend as unhealthy when threshold is reached</li> <li>Cons: generally terrible</li> </ul></li> <li>Health-based flow control <ul> <li>Backend task can be in one of three states: {healthy, refusing connections, lame duck}</li> <li>Lame duck state can still take connections, but sends backpressure request to all clients</li> <li>Lame duck state simplifies clean shutdown</li> </ul></li> <li>Def: subsetting: limiting pool of backend tasks that a client task can interact with <ul> <li>Clients in RPC system maintain pool of connections to backends</li> <li>Using pool reduces latency compared to doing setup/teardown when needed</li> <li>Inactive connections are relatively cheap, but not free, even in “inactive” mode (reduced health checks, UDP instead of TCP, etc.)</li> </ul></li> <li>Choosing the correct subset <ul> <li>Typ: 20-100, choose based on workload</li> </ul></li> <li>Subset selection: random <ul> <li>Bad utilization</li> </ul></li> <li>Subset selection: round robin <ul> <li>Order is permuted; each round has its own permutation</li> </ul></li> <li>Load balancing <ul> <li>Subset selection is for connection balancing, but we still need to balance load</li> </ul></li> <li>Load balancing: round robin <ul> <li>In practice, observe 2x difference between most loaded and least loaded</li> <li>In practice, most expensive request can be 1000x more expensive than cheapest request</li> <li>In addition, there’s random unpredictable variation in requests</li> </ul></li> <li>Load balancing: least-loaded round robin <ul> <li>Exactly what it sounds like: round-robin among least loaded backends</li> <li>Load appears to be measured in terms of connection count; may not always be the best metric</li> <li>This is per client, not globally, so it’s possible to send requests to a backend with many requests from other clients</li> <li>In practice, for large services, find that most-loaded task uses twice as much CPU as least-loaded; similar to normal round robin</li> </ul></li> <li>Load balancing: weighted round robin <ul> <li>Same as above, but weight with other factors</li> <li>In practice, much better load distribution than least-loaded round robin (see the sketch below)</li> </ul></li> </ul> <p><em>I wonder what Heroku meant when they responded to <a href="http://genius.com/James-somers-herokus-ugly-secret-annotated">Rap Genius</a> by saying “after extensive research and experimentation, we have yet to find either a theoretical model or a practical implementation that beats the simplicity and robustness of random routing to web backends that can support multiple concurrent connections”.</em></p>
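<p><em>Here's a toy version of per-client subsetting plus weighted least-loaded picking (my sketch, not the algorithm from the chapter; the class and field names are made up). Load is approximated by active request count, which, as the notes above say, isn't always the best metric:</em></p> <pre><code>import random

class SubsetLoadBalancer:
    # Client-side picker: keep connections to a fixed subset of backends,
    # then send each request to the subset member with the lowest score
    # (active requests divided by a per-backend weight).
    def __init__(self, backends, subset_size=3, seed=0):
        rng = random.Random(seed)
        self.subset = rng.sample(backends, subset_size)
        self.active = {b: 0 for b in self.subset}
        self.weight = {b: 1.0 for b in self.subset}  # e.g. derived from utilization

    def pick(self):
        return min(self.subset, key=lambda b: self.active[b] / self.weight[b])

    def start(self, backend):
        self.active[backend] += 1

    def finish(self, backend):
        self.active[backend] -= 1

lb = SubsetLoadBalancer([f'task-{i}' for i in range(10)])
for _ in range(5):
    backend = lb.pick()
    lb.start(backend)
    print(backend, lb.active)
</code></pre>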
<h3 id="chapter-21-handling-overload">Chapter 21: Handling overload</h3> <ul> <li>Even with “good” load balancing, systems will become overloaded</li> <li>Typical strategy is to serve degraded responses, but under very high load that may not be possible</li> <li>Modeling capacity as QPS or as a function of requests (e.g., how many keys the requests read) is failure prone <ul> <li>These generally change slowly, but can change rapidly (e.g., because of a single checkin)</li> </ul></li> <li>Better solution: directly measure available resources</li> <li>CPU utilization is <em>usually</em> a good signal for provisioning <ul> <li>With GC, memory pressure turns into CPU utilization</li> <li>With other systems, can provision other resources such that CPU is likely to be limiting factor</li> <li>In cases where over-provisioning CPU is too expensive, take other resources into account</li> </ul></li> </ul> <p><em>How much does it cost to generally over-provision CPU like that?</em></p> <ul> <li>Client-side throttling <ul> <li>Backends start rejecting requests when customer hits quota</li> <li>Requests still use resources, even when rejected -- without throttling, backends can spend most of their resources on rejecting requests</li> </ul></li> <li>Criticality <ul> <li>Seems to be priority but with a different name?</li> <li>First-class notion in RPC system</li> <li>Client-side throttling keeps separate stats for each level of criticality</li> <li>By default, criticality is propagated through subsequent RPCs</li> </ul></li> <li>Handling overloaded errors <ul> <li>Shed load to other DCs if DC is overloaded</li> <li>Shed load to other backends if DC is ok but some backends are overloaded</li> </ul></li> <li>Clients retry when they get an overloaded response (see the sketch below) <ul> <li>Per-request retry budget (3)</li> <li>Per-client retry budget (10%)</li> <li>Failed retries from client cause “overloaded; don’t retry” response to be returned upstream</li> </ul></li> </ul> <p><em>Having a “don’t retry” response is “obvious”, but relatively rare in practice. A lot of real systems have a problem with failed retries causing more retries up the stack. This is especially true when crossing a hardware/software boundary (e.g., filesystem read causes many retries on DVD/SSD/spinning disk, fails, and then gets retried at the filesystem level), but seems to be generally true in pure software too.</em></p>
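<p><em>A sketch of what the client-side retry bullets above might look like (mine, not the book's; the 3-attempt cap and 10% budget are the numbers from the notes, everything else is made up), including an explicit “don't retry” outcome and the randomized exponential backoff recommended in the next chapter:</em></p> <pre><code>import random, time

class RetryingClient:
    # At most 3 attempts per request, retries capped at ~10% of total
    # requests seen by this client, randomized exponential backoff,
    # and an explicit 'do not retry' result to propagate upstream.
    def __init__(self, send, max_attempts=3, budget_ratio=0.1):
        self.send = send
        self.max_attempts = max_attempts
        self.budget_ratio = budget_ratio
        self.requests = 0
        self.retries = 0

    def call(self, req):
        self.requests += 1
        for attempt in range(self.max_attempts):
            if self.send(req):
                return 'ok'
            out_of_budget = self.retries &gt;= self.budget_ratio * self.requests
            if attempt + 1 == self.max_attempts or out_of_budget:
                return 'overloaded; do not retry'
            self.retries += 1
            # Randomized exponential backoff: 2^attempt scaled by jitter.
            time.sleep(random.uniform(0, 0.01 * (2 ** attempt)))

flaky_backend = lambda req: random.random() &gt; 0.5
client = RetryingClient(flaky_backend)
print([client.call(i) for i in range(5)])
</code></pre>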
<h3 id="chapter-22-addressing-cascading-failures">Chapter 22: Addressing cascading failures</h3> <ul> <li>Typical failure scenarios?</li> <li>Server overload</li> <li>Ex: have two servers <ul> <li>One gets overloaded, failing</li> <li>Other one now gets all traffic and also fails</li> </ul></li> <li>Resource exhaustion <ul> <li>CPU/memory/threads/file descriptors/etc.</li> </ul></li> <li>Ex: dependencies among resources <ul> <li>1) Java frontend has poorly tuned GC params</li> <li>2) Frontend runs out of CPU due to GC</li> <li>3) CPU exhaustion slows down requests</li> <li>4) Increased queue depth uses more RAM</li> <li>5) Fixed memory allocation for entire frontend means that less memory is available for caching</li> <li>6) Lower hit rate</li> <li>7) More requests into backend</li> <li>8) Backend runs out of CPU or threads</li> <li>9) Health checks fail, starting cascading failure</li> <li>Difficult to determine cause during outage</li> </ul></li> <li>Note: policies that avoid servers that serve errors can make things worse <ul> <li>fewer backends available, which get too many requests, which then become unavailable</li> </ul></li> <li>Preventing server overload <ul> <li>Load test! Must have realistic environment</li> <li>Serve degraded results</li> <li>Fail cheaply and early when overloaded</li> <li>Have higher-level systems reject requests (at reverse proxy, load balancer, and on task level)</li> <li>Perform capacity planning</li> </ul></li> <li>Queue management <ul> <li>Queues do nothing in steady state</li> <li>Queued reqs consume memory and increase latency</li> <li>If traffic is steady-ish, better to keep small queue size (say, 50% or less of thread pool size)</li> <li>Ex: Gmail uses queueless servers with failover when threads are full</li> <li>For bursty workloads, queue size should be function of #threads, time per req, size/freq of bursts</li> <li>See also, <a href="http://queue.acm.org/detail.cfm?id=2839461">adaptive LIFO and CoDel</a></li> </ul></li> <li>Graceful degradation <ul> <li>Note that it’s important to test graceful degradation path, maybe by running a small set of servers near overload regularly, since this path is rarely exercised under normal circumstances</li> <li>Best to keep simple and easy to understand</li> </ul></li> <li>Retries (sketch below) <ul> <li>Always use randomized exponential backoff</li> <li>See previous chapter on only retrying at a single level</li> <li>Consider having a server-wide retry budget</li> </ul></li> <li>Deadlines <ul> <li>Don’t do work where deadline has been missed (common theme for cascading failure)</li> <li>At each stage, check that deadline hasn’t been hit</li> <li>Deadlines should be propagated (e.g., even through RPCs)</li> </ul></li> <li>Bimodal latency <ul> <li>Ex: problem with long deadline</li> <li>Say frontend has 10 servers, 100 threads each (1k threads of total capacity)</li> <li>Normal operation: 1k QPS, reqs take 100ms =&gt; 100 worker threads occupied (1k QPS * .1s)</li> <li>Say 5% of operations don’t complete and there’s a 100s deadline</li> <li>That consumes 5k threads (50 QPS * 100s)</li> <li>Frontend oversubscribed by 5x. Success rate = 1k / (5k + 95) = 19.6% =&gt; 80.4% error rate</li> </ul></li> </ul> <p><em>Using deadlines instead of timeouts is great. We should really be more systematic about this.</em></p> <p><em>Not allowing systems to fill up with pointless zombie requests by setting reasonable deadlines is “obvious”, but a lot of real systems seem to have arbitrary timeouts at nice round human numbers (30s, 60s, 100s, etc.) instead of deadlines that are assigned with load/cascading failures in mind.</em></p>
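<p><em>A small sketch tying the retry and deadline bullets above together: retries use capped, randomized exponential backoff, and every attempt first checks the request's absolute deadline so that no work is done, and no retry is scheduled, once the deadline has passed. The function and parameter names are made up for illustration; the deadline is an absolute time that callers would propagate along with the request:</em></p> <pre><code>import random
import time

def call_with_backoff(attempt_fn, deadline, base=0.05, cap=2.0, max_attempts=5):
    # attempt_fn() makes one try and raises ConnectionError on failure.
    # deadline is an absolute time.time() value propagated from the caller.
    for attempt in range(max_attempts):
        if time.time() &gt;= deadline:
            raise TimeoutError('deadline exceeded; not attempting')
        try:
            return attempt_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential.
            sleep_for = random.uniform(0, min(cap, base * 2 ** attempt))
            if time.time() + sleep_for &gt;= deadline:
                raise TimeoutError('deadline exceeded; not retrying')
            time.sleep(sleep_for)
</code></pre>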
<ul> <li>Try to avoid intra-layer communication <ul> <li>Simpler, avoids possible cascading failure paths</li> </ul></li> <li>Testing for cascading failures <ul> <li>Load test components!</li> <li>Load testing both reveals the breaking point and ferrets out components that will totally fall over under load</li> <li>Make sure to test each component separately</li> <li>Test non-critical backends (e.g., make sure that spelling suggestions for search don’t impede the critical path)</li> </ul></li> <li>Immediate steps to address cascading failures <ul> <li>Increase resources</li> <li>Temporarily stop health check failures/deaths</li> <li>Restart servers (only if that would help -- e.g., in GC death spiral or deadlock)</li> <li>Drop traffic -- drastic, last resort</li> <li>Enter degraded mode -- requires having built this into service previously</li> <li>Eliminate batch load</li> <li>Eliminate bad traffic</li> </ul></li> </ul> <h3 id="chapter-23-distributed-consensus-for-reliability">Chapter 23: Distributed consensus for reliability</h3> <ul> <li>How do we agree on questions like… <ul> <li>Which process is the leader of a group of processes?</li> <li>What is the set of processes in a group?</li> <li>Has a message been successfully committed to a distributed queue?</li> <li>Does a process hold a particular lease?</li> <li>What’s the value in a datastore for a particular key?</li> </ul></li> <li>Ex1: split-brain <ul> <li>Service has replicated file servers in different racks</li> <li>Must avoid writing simultaneously to both file servers in a set to avoid data corruption</li> <li>Each pair of file servers has one leader &amp; one follower</li> <li>Servers monitor each other via heartbeats</li> <li>If one server can’t contact the other, it sends a STONITH (shoot the other node in the head)</li> <li>But what happens if the network is slow or packets get dropped?</li> <li>What happens if both servers issue STONITH?</li> </ul></li> </ul> <p><em>This reminds me of one of my favorite distributed database postmortems. The database is configured as a ring, where each node talks to and replicates data into a “neighborhood” of 5 servers. If some machines in the neighborhood go down, other servers join the neighborhood and data gets replicated appropriately.</em></p> <p><em>Sounds good, but in the case where a server goes bad and decides that no data exists and all of its neighbors are bad, it can return results faster than any of its neighbors, as well as tell its neighbors that they’re all bad. Because the bad server has no data it’s very fast and can report that its neighbors are bad faster than its neighbors can report that it’s bad. Whoops!</em></p> <ul> <li>Ex2: failover requires human intervention <ul> <li>A highly sharded DB has a primary for each shard, which replicates to a secondary in another DC</li> <li>External health checks decide if the primary should fail over to its secondary</li> <li>If the primary can’t see the secondary, it makes itself unavailable to avoid the problems from “Ex1”</li> <li>This increases operational load</li> <li>Problems are correlated and this is relatively likely to run into problems when people are busy with other issues</li> <li>If there’s a network issue, there’s no reason to think that a human will have a better view into the state of the world than machines in the system</li> </ul></li> <li>Ex3: faulty group-membership algorithms <ul> <li><em>What it sounds like.
No notes on this part</em></li> </ul></li> <li>Impossibility results <ul> <li><a href="https://en.wikipedia.org/wiki/CAP_theorem">CAP</a>: partitions can’t be avoided in real networks, so in practice you’re choosing between C and A</li> <li><a href="http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossibility/">FLP</a>: async distributed consensus can’t guarantee progress with unreliable network</li> </ul></li> </ul> <h4 id="paxos-http-research-microsoft-com-en-us-um-people-lamport-pubs-paxos-simple-pdf"><a href="http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf">Paxos</a></h4> <ul> <li>Sequence of proposals, which may or may not be accepted by the majority of processes <ul> <li>Not accepted =&gt; fails</li> <li>Sequence number per proposal, must be unique across system</li> </ul></li> <li>Proposal (toy sketch at the end of this chapter’s notes) <ul> <li>Proposer sends seq number to acceptors</li> <li>Acceptor agrees if it hasn’t seen a higher seq number</li> <li>Proposers can try again with higher seq number</li> <li>If proposer recvs agreement from majority, it commits by sending commit message with value</li> <li>Acceptors must journal to persistent storage when they accept</li> </ul></li> </ul> <h4 id="patterns">Patterns</h4> <ul> <li>Distributed consensus algorithms are a low-level primitive</li> <li>Reliable replicated state machines <ul> <li>Fundamental building block for data config/storage, locking, leader election, etc.</li> <li>See these papers: <a href="https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf">Schneider</a>, <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.174.8238&amp;rep=rep1&amp;type=pdf">Aguilera</a>, <a href="http://www.cnds.jhu.edu/pub/papers/psb_ladis_08.pdf">Amir &amp; Kirsch</a></li> </ul></li> <li>Reliable replicated data and config stores <ul> <li>Non distributed-consensus-based systems often use timestamps: problematic because clock synchrony can't be guaranteed</li> <li>See <a href="http://research.google.com/archive/spanner.html">Spanner paper</a> for an example of using distributed consensus</li> </ul></li> <li>Leader election <ul> <li>Equivalent to distributed consensus</li> <li>Where work of the leader can be performed by one process or sharded, the leader election pattern allows writing a distributed system as if it were a simple program</li> <li>Used by, for example, GFS and Colossus</li> </ul></li> <li>Distributed coordination and locking services <ul> <li>Barrier used, for example, in MapReduce to make sure that Map is finished before Reduce proceeds</li> </ul></li> <li>Distributed queues and messaging <ul> <li>Queues: can tolerate failures from worker nodes, but system needs to ensure that claimed tasks are processed</li> <li>Can use leases instead of removal from queue</li> <li>Using RSM means that system can continue processing even when queue goes down</li> </ul></li> <li>Performance <ul> <li>Conventional wisdom that consensus algorithms can't be used for high-throughput low-latency systems is false</li> <li>Distributed consensus at the core of many Google systems</li> <li>Scale makes this worse for Google than most other companies, but it still works</li> </ul></li> <li>Multi-Paxos <ul> <li>Strong leader process: unless a leader has not yet been elected or a failure occurs, only one round trip required to reach consensus</li> <li>Note that another process in the group can propose at any time</li> <li>Can ping pong back and forth and pseudo-livelock</li> <li>Not unique to Multi-Paxos</li> <li>Standard solutions are to elect a proposer process or use rotating proposer</li>
</ul></li> <li>Scaling read-heavy workloads <ul> <li>Ex: Photon allows reads from any replica</li> <li>Read from stale replica requires extra work, but doesn't produce incorrect results</li> <li>To guarantee reads are up to date, do one of the following:</li> <li>1) Perform a read-only consensus operation</li> <li>2) Read data from replica that's guaranteed to be most-up-to-date (stable leader can provide this guarantee)</li> <li>3) Use quorum leases</li> </ul></li> <li><a href="http://www.pdl.cmu.edu/PDL-FTP/associated/CMU-PDL-14-105_abs.shtml">Quorum leases</a> <ul> <li>Replicas can be granted lease over some (or all) data in the system</li> </ul></li> <li>Fast Paxos <ul> <li>Designed to be faster over WAN</li> <li>Each client can send <code>Propose</code> to each member of a group of acceptors directly, instead of through a leader</li> <li><a href="http://www.sysnet.ucsd.edu/sysnet/miscpapers/hotdep07.pdf">Not necessarily faster than classic Paxos</a> -- if RTT to acceptors is long, we've traded one message across slow link plus N in parallel across fast link for N across slow link</li> </ul></li> <li>Stable leaders <ul> <li>&quot;Almost all distributed consensus systems that have been designed with performance in mind use either the single stable leader pattern or a system of rotating leadership&quot;</li> </ul></li> </ul> <p><em>TODO: finish this chapter?</em></p>
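<p><em>To make the prepare/accept flow from the Paxos summary above concrete, here's a toy single-decree acceptor (my sketch for intuition, not anything from the book). It leaves out the proposer side, and a real acceptor must journal its promised/accepted state to persistent storage before replying, as the notes say:</em></p> <pre><code>class PaxosAcceptor:
    # Toy single-decree Paxos acceptor: no persistence, no networking.
    def __init__(self):
        self.promised_n = -1      # highest sequence number promised so far
        self.accepted_n = -1      # sequence number of the accepted proposal, if any
        self.accepted_value = None

    def prepare(self, n):
        # Agree only if we haven't already promised a higher sequence number,
        # and report back anything we've already accepted.
        if n &gt; self.promised_n:
            self.promised_n = n   # a real acceptor journals this before replying
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def accept(self, n, value):
        # Accept the commit unless a higher-numbered prepare arrived in the meantime.
        if n &gt;= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value  # journal before acknowledging
            return True
        return False
</code></pre> <p><em>A proposer that gets prepare agreement from a majority has to commit the value from the highest-numbered proposal any acceptor reported back, and only gets to use its own value if none was reported.</em></p>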
<h3 id="chapter-24-distributed-cron">Chapter 24: Distributed cron</h3> <p><em>TODO: go back and read in more detail, take notes.</em></p> <h3 id="chapter-25-data-processing-pipelines">Chapter 25: Data processing pipelines</h3> <ul> <li>Examples of this are MapReduce or Flume</li> <li>Convenient and easy to reason about the happy case, but fragile <ul> <li>Initial install is usually ok because worker sizing, chunking, parameters are carefully tuned</li> <li>Over time, load changes, causes problems</li> </ul></li> </ul> <h3 id="chapter-26-data-integrity">Chapter 26: Data integrity</h3> <ul> <li>Definition not necessarily obvious <ul> <li>If an interface bug causes Gmail to fail to display messages, that’s the same as the data being gone from the user’s standpoint</li> <li>99.99% uptime means 1 hour of downtime per year. Probably ok for most apps</li> <li>99.99% good bytes in a 2GB file means 200K corrupt. Probably not ok for most apps</li> </ul></li> <li>Backup is non-trivial <ul> <li>May have mixture of transactional and non-transactional backup and restore</li> <li>Different versions of business logic might be live at once</li> <li>If services are independently versioned, maybe have many combinations of versions</li> <li>Replicas aren’t sufficient -- replicas may sync corruption</li> </ul></li> <li>Study of 19 data recovery efforts at Google <ul> <li>Most common user-visible data loss caused by deletion or loss of referential integrity due to software bugs</li> <li>Hardest cases were low-grade corruption discovered weeks to months later</li> </ul></li> </ul> <h4 id="defense-in-depth">Defense in depth</h4> <ul> <li>First layer: soft deletion <ul> <li>Users should be able to delete their data</li> <li>But that means that users will be able to accidentally delete their data</li> <li>Also, account hijacking, etc.</li> <li>Accidental deletion can also happen due to bugs</li> <li>Soft deletion delays actual deletion for some period of time</li> </ul></li> <li>Second layer: backups <ul> <li>Need to figure out how much data it’s ok to lose during recovery, how long recovery can take, and how far back backups need to go</li> <li>Want backups to go back forever, since corruption can go unnoticed for months (or longer)</li> <li>But changes to code and schema can make recovery of older backups expensive</li> <li>Google usually has a 30 to 90 day window, depending on the service</li> </ul></li> <li>Third layer: early detection <ul> <li>Out-of-band integrity checks</li> <li>Hard to do this right!</li> <li>Correct changes can cause checkers to fail</li> <li>But loosening checks can cause failures to get missed</li> </ul></li> </ul> <p><em>No notes on the two interesting case studies covered.</em></p> <h3 id="chapter-27-reliable-product-launches-at-scale">Chapter 27: Reliable product launches at scale</h3> <p><em>No notes on this chapter in particular. A lot of this material is covered by or at least implied by material in other chapters. Probably worth at least looking at example checklist items and action items before thinking about launch strategy, though. Also see appendix E, launch coordination checklist.</em></p> <h3 id="chapters-28-32-various-chapters-on-management">Chapters 28-32: Various chapters on management</h3> <p><em>No notes on these.</em></p> <h3 id="notes-on-the-notes">Notes on the notes</h3> <p><em>I like this book a lot. If you care about building reliable systems, reading through this book and seeing what the teams around you don’t do seems like a good exercise. That being said, the book isn't perfect. The two big downsides for me stem from the same issue: this is one of those books that's a collection of chapters by different people. Some of the editors are better than others, which means that some of the chapters are clearer than others, and because the chapters seem designed to be readable as standalone chapters, there's a fair amount of redundancy in the book if you just read it straight through. Depending on how you plan to use the book, that can be a positive, but it's a negative to me. But even including the downsides, I'd say that this is the most valuable technical book I've read in the past year and I've covered probably 20% of the content in this set of notes.
If you really like these notes, you'll probably want to read the <a href="https://www.amazon.com/gp/product/149192912X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=149192912X&amp;linkId=2a578b357abc8b995368a039dd517601">full book</a>.</em></p> <p><em>If you found this set of notes way too dry, maybe try <a href="http://nostalgebraist.tumblr.com/post/142489665564/brazenautomaton-nostalgebraist-the-book-bad">this much more entertaining set of notes on a totally different book</a>. If you found this to only be slightly too dry, maybe try <a href="http://danluu.com/postmortem-lessons/">this set of notes on classes of errors commonly seen in postmortems</a>. In any case, <a href="https://twitter.com/danluu">I’d appreciate feedback on these notes</a>. Writing up notes is an experiment for me. If people find these useful, I'll try to write up notes on books I read more often. If not, I might try a different approach to writing up notes or some other kind of post entirely.</em></p> We only hire the trendiest programmer-moneyball/ Mon, 21 Mar 2016 00:23:44 -0700 programmer-moneyball/ <p>An acquaintance of mine, let’s call him Mike, is looking for work after getting laid off from a contract role at Microsoft, which has happened to a lot of people I know. Like me, Mike has 11 years in industry. Unlike me, he doesn't know a lot of folks at trendy companies, so I passed his resume around to some engineers I know at companies that are desperately hiring. My engineering friends thought Mike's resume was fine, but most recruiters rejected him in the resume screening phase.</p> <p>When I asked why he was getting rejected, the typical response I got was:</p> <ol> <li>Tech experience is in irrelevant tech</li> <li>&quot;Experience is too random, with payments, mobile, data analytics, and UX.&quot;</li> <li>Contractors are generally not the strongest technically</li> </ol> <p>This response is something from a recruiter that was relayed to me through an engineer; the engineer was incredulous at the response from the recruiter. Just so we have a name, let's call this company TrendCo. It's one of the thousands of companies that claims to have world class engineers, hire only the best, etc. This is one company in particular, but it's representative of a large class of companies and the responses Mike has gotten.</p> <p>Anyway, (1) is code for “Mike's a .NET dev, and we don't like people with Windows experience”.</p> <p>I'm familiar with TrendCo's tech stack, which multiple employees have told me is “a tire fire”. Their core systems top out under 1k QPS, which has caused them to go down under load. Mike has worked on systems that can handle multiple orders of magnitude more load, but his experience is, apparently, irrelevant.</p> <p>(2) is hard to make sense of. I've interviewed at TrendCo and one of the selling points is that it's a startup where you get to do a lot of different things. TrendCo almost exclusively hires generalists but Mike is, apparently, too general for them.</p> <p>(3), combined with (1), gets at what TrendCo's real complaint with Mike is. He's not their type. TrendCo's median employee is a recent graduate from one of maybe five “top” schools with 0-2 years of experience. 
They have a few experienced hires, but not many, and most of their experienced hires have something trendy on their resume, not a boring old company like Microsoft.</p> <p>Whether or not you think there's anything wrong with having a type and rejecting people who aren't your type, as <a href="https://news.ycombinator.com/item?id=11290662">Thomas Ptacek has observed</a>, if your type is the same type everyone else is competing for, “you are competing for talent with the wealthiest (or most overfunded) tech companies in the market”.</p> <p>If <a href="https://docs.google.com/spreadsheets/u/1/d/1UnLz40Our1Ids-O0sz26uPNCF6cQjwosrZQY4VLdflU/htmlview?pli=1&amp;sle=true#">you look at new grad hiring data</a>, it looks like FB is offering people with zero experience &gt; $100k/yr salary, $100k signing bonus, and $150k in RSUs, for an amortized total comp &gt; $160k/yr, including $240k in the first year. Google's package has &gt; $100k salary, a variable signing bonus in the $10k range, and $187k in RSUs. That comes in a bit lower than FB, but it's much higher than most companies that claim to only hire the best are willing to pay for a new grad. Keep in mind that <a href="startup-tradeoffs/#fn:C">compensation can go much higher for contested candidates</a>, and that <a href="https://news.ycombinator.com/item?id=11314449">compensation for experienced candidates is probably higher than you expect if you're not a hiring manager who's seen what competitive offers look like today</a>.</p> <p>By going after people with the most sought after qualifications, TrendCo has narrowed their options down to either paying out the nose for employees, or offering non-competitive compensation packages. TrendCo has chosen the latter option, which partially explains why they have, proportionally, so few senior devs -- the compensation delta increases as you get more senior, and you have to make a really compelling pitch to someone to get them to choose TrendCo when you're offering $150k/yr less than the competition. And as people get more experience, they're less likely to believe the part of the pitch that explains how much the stock options are worth.</p> <p>Just to be clear, I don't have anything against people with trendy backgrounds. I know a lot of these people who have impeccable interviewing skills and got 5-10 strong offers last time they looked for work. I've worked with someone like that: he was just out of school, his total comp package was north of $200k/yr, and he was worth every penny. But think about that for a minute. He had strong offers from six different companies, of which he was going to accept at most one. Including lunch and phone screens, the companies put in an average of eight hours apiece interviewing him. And because they wanted to hire him so much, the companies that were really serious spent an average of another five hours apiece of engineer time trying to convince him to take their offer. Because these companies had, on average, a ⅙ chance of hiring this person, they had to spend at least an expected (8+5) * 6 = 78 hours of engineer time<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">1</a></sup>. People with great backgrounds are, on average, pretty great, but they're really hard to hire.
It's much easier to hire people who are underrated, especially if you're not paying market rates.</p> <p>I've seen this hyperfocus on hiring people with trendy backgrounds from both sides of the table, and it's ridiculous from both sides.</p> <p>On the referring side of hiring, I tried to get a startup I was at to hire the most interesting and creative programmer I've ever met, who was tragically underemployed for years because of his low GPA in college. We declined to hire him and I was told that his low GPA meant that he couldn't be very smart. Years later, Google took a chance on him and he's been killing it since then. He actually convinced me to join Google, and <a href="//danluu.com/tech-discrimination/">at Google, I tried to hire one of the most productive programmers I know, who was promptly rejected by a recruiter for not being technical enough</a>.</p> <p>On the candidate side of hiring, I've experienced both being in demand and being almost unhireable. Because I did my undergrad at Wisconsin, which is one of the 25 schools that claims to be a top 10 cs/engineering school, I had recruiters beating down my door when I graduated. But that's silly -- that I attended Wisconsin wasn't anything about me; I just happened to grow up in the state of Wisconsin. If I grew up in Utah, I probably would have ended up going to school at Utah. When I've compared notes with folks who attended schools like Utah and Boise State, their education is basically the same as mine. Wisconsin's rank as an engineering school comes from having professors who do great research which is, at best, weakly correlated to <a href="//danluu.com/teach-debugging/">effectiveness at actually teaching undergrads</a>. Despite getting the same engineering education you could get at hundreds of other schools, I had a very easy time getting interviews and finding a great job.</p> <p>I spent 7.5 years in that great job, at <a href="https://www.amazon.com/gp/product/B01FSZU6FK/ref=as_li_qf_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01FSZU6FK&amp;linkId=d15e514c6ecefa224be8f05d4d5837e3">Centaur</a>. Centaur has a pretty strong reputation among hardware companies in Austin who've been around for a while, and I had an easy time shopping for local jobs at hardware companies. But I don't know of any software folks who've heard of Centaur, and as a result I couldn't get an interview at most software companies. There were even a couple of cases where I had really strong internal referrals and the recruiters still didn't want to talk to me, which I found funny and my friends found frustrating.</p> <p>When I could get interviews, they often went poorly. A typical rejection reason was something like “we process millions of transactions per day here and we really need someone with more relevant experience who can handle these things without ramping up”. And then Google took a chance on me and I was the second person on a project to get serious about deep learning performance, which was a 20%-time project until just before I joined. <a href="https://www.google.com/patents/US20160342889">We built the fastest deep learning system in the world</a>. 
From what I hear, they're now on the Nth generation of that project, but even the first generation thing we built had better per-rack performance and performance per dollar than any other production system out there for years (excluding follow-ons to that project, of course).</p> <p>While I was at Google I had recruiters pinging me about job opportunities all the time. And now that I'm at boring old Microsoft, I don't get nearly as many recruiters reaching out to me. I've been considering looking for work<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">2</a></sup> and I wonder how trendy I'll be if I do. Experience in irrelevant tech? Check! Random experience? Check! Contractor? Well, no. But two out of three ain't bad.</p> <p>My point here isn't anything about me. It's that here's this person<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">3</a></sup> who has wildly different levels of attractiveness to employers at various times, mostly due to superficial factors that don't have much to do with actual productivity. This is a really common story among people who end up at Google. If you hired them before they worked at Google, you might have gotten a great deal! But no one (except Google) was willing to take that chance. There's something to be said for paying more to get a known quantity, but a company like TrendCo that isn't willing to do that cripples its hiring pipeline by only going after people with trendy resumes, and if you wouldn't hire someone before they worked at Google and would after, the main thing you know is that the person is above average at whiteboard algorithms quizzes (or got lucky one day).</p> <p>I don't mean to pick on startups like TrendCo in particular. Boring old companies have their version of what a trendy background is, too. A friend of mine who's desperate to hire can't do anything with some of the resumes I pass his way because his group isn't allowed to hire anyone without a degree. Another person I know is in a similar situation because his group has a bright-line rule that causes them to reject people who aren't already employed.</p> <p>Not only are these decisions non-optimal for companies, they create a path dependence in employment outcomes that causes individual good (or bad) events to follow people around for decades. You can see similar effects in the literature on career earnings in a variety of fields<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">4</a></sup>.</p> <p><a href="https://news.ycombinator.com/item?id=7260087">Thomas Ptacek has this great line about how</a> “we interview people whose only prior work experience is &quot;Line of Business .NET Developer&quot;, and they end up showing us how to write exploits for elliptic curve partial nonce bias attacks that involve Fourier transforms and BKZ lattice reduction steps that take 6 hours to run.” If you work at a company that doesn't reject people out of hand for not being trendy, you'll hear lots of stories like this. Some of the best people I've worked with went to schools you've never heard of and worked at companies you've never heard of until they ended up at Google. 
Some are still at companies you've never heard of.</p> <p>If you read <a href="https://zachholman.com/talk/firing-people">Zach Holman</a>, you may recall that when he said that he was fired, someone responded with “If an employer has decided to fire you, then you've not only failed at your job, you've failed as a human being.” A lot of people treat employment status and credentials as measures of the inherent worth of individuals. But a large component of these markers of success, not to mention success itself, is luck.</p> <h3 id="solutions">Solutions?</h3> <p>I can understand why this happens. At an individual level, we're prone to the <a href="https://en.wikipedia.org/wiki/Fundamental_attribution_error">fundamental attribution error</a>. At an organizational level, fast growing organizations burn a large fraction of their time on interviews, and the obvious way to cut down on time spent interviewing is to only interview people with &quot;good&quot; qualifications. Unfortunately, that's counterproductive when you're chasing after the same tiny pool of people as everyone else.</p> <p>Here are the beginnings of some ideas. I'm open to better suggestions!</p> <h4 id="moneyball">Moneyball</h4> <p>Billy Beane and Paul DePodesta took the Oakland A's, a baseball franchise with nowhere near the budget of top teams, and created what was arguably the best team in baseball by finding and “hiring” players who were statistically underrated for their price. The thing I find really amazing about this is that they publicly talked about doing this, and then Michael Lewis wrote a book, titled <a href="https://www.amazon.com/gp/product/0393324818/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0393324818&amp;linkId=65d86d3a72b4c1ba73e8d3d52796eae1">Moneyball</a>, about them doing this. Despite the publicity, it took years for enough competitors to catch on enough that the A's strategy stopped giving them a very large edge.</p> <p>You can see the exact same thing in software hiring. Thomas Ptacek has been talking about how they hired unusually effective people at Matasano for at least half a decade, maybe more. Google bigwigs regularly talk about the hiring data they have and what hasn't worked. I believe that, years ago, they talked about how focusing on top schools wasn't effective and didn't turn up employees with better performance, but that doesn't stop TrendCo from focusing hiring efforts on top schools.</p> <h4 id="training-mentorship">Training / mentorship</h4> <p>You see a lot of talk about moneyball, but for some reason people are less excited about… trainingball? Practiceball? Whatever you want to call taking people who aren't “the best” and teaching them how to be “the best”.</p> <p>This is another one where it's easy to see the impact through the lens of sports, because there is so much good performance data. Since it's basketball season, if we look at college basketball, for example, we can identify a handful of programs that regularly take unremarkable inputs and produce good outputs. And that's against a field of competitors where every team is expected to coach and train their players.</p> <p>When it comes to tech companies, most of the competition isn't even trying.
At the median large company, you get a couple days of “orientation”, which is mostly legal mumbo jumbo and paperwork, and the occasional “training”, which is usually a set of videos and a set of multiple-choice questions that are offered up for compliance reasons, not to teach anyone anything. And you'll be assigned a mentor who, more likely than not, won't provide any actual mentorship. Startups tend to be even worse! It's not hard to do better than that.</p> <p>Considering how much money companies spend on <a href="https://news.ycombinator.com/item?id=11314449">hiring and retaining</a> &quot;the best&quot;, you'd expect them to spend at least a (non-zero) fraction on training. It's also quite strange that companies don't focus more on training and mentorship when trying to recruit. Specific things I've learned in specific roles have been tremendously valuable to me, but it's almost always either been a happy accident, or something I went out of my way to do. Most companies don't focus on this stuff. Sure, recruiters will tell you that &quot;you'll learn so much more here than at Google, which will make you more valuable&quot;, implying that it's worth the $150k/yr pay cut, but if you ask them what, specifically, they do to make a better learning environment than Google, they never have a good answer.</p> <h4 id="process-tools-culture">Process / tools / culture</h4> <p>I've worked at two companies that both have effectively infinite resources to spend on tooling. One of them, let's call them ToolCo, is really serious about tooling and invests heavily in tools. People describe tooling there with phrases like “magical”, “the best I've ever seen”, and “I can't believe this is even possible”. And I can see why. For example, if you want to build a project that's millions of lines of code, their build system will make that take somewhere between 5s and 20s (assuming you don't enable <a href="https://en.wikipedia.org/wiki/Interprocedural_optimization">LTO</a> or anything else that can't be parallelized)<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">5</a></sup>. In the course of a regular day at work you'll use multiple tools that seem magical because they're so far ahead of what's available in the outside world.</p> <p>The other company, let's call them ProdCo, <a href="http://yosefk.com/blog/people-can-read-their-managers-mind.html">pays lip service to tooling, but doesn't really value it</a>. People describing ProdCo tools use phrases like “world class bad software” and “I am 2x less productive than I've ever been anywhere else”, and “I can't believe this is even possible”. ProdCo has a paper on a new build system; their claimed numbers for speedup from parallelization/caching, onboarding time, and reliability, are at least two orders of magnitude worse than the equivalent at ToolCo. And, in my experience, the actual numbers are worse than the claims in the paper. In the course of a day of work at ProdCo, you'll use multiple tools that are multiple orders of magnitude worse than the equivalent at ToolCo in multiple dimensions. These kinds of things add up and can easily make a larger difference than “hiring only the best”.</p> <p>Processes and culture also matter. I once worked on a team that didn't use version control or have a bug tracker.
For every no-brainer item on the <a href="http://www.joelonsoftware.com/articles/fog0000000043.html">Joel test</a>, there are teams out there that make the wrong choice.</p> <p>Although I've only worked on one team that completely failed the Joel test (they scored a 1 out of 12), every team I've worked on has had glaring deficiencies that are technically trivial (but sometimes culturally difficult) to fix. When I was at Google, we had really bad communication problems between the two halves of our team that were in different locations. My fix was brain-dead simple: I started typing up meeting notes for all of our local meetings and discussions and taking questions from the remote team about things that surprised them in our notes. That's something anyone could have done, and it was a huge productivity improvement for the entire team. I've literally never found an environment where you can't massively improve productivity with something that trivial. Sometimes people don't agree (e.g., it took months to get the non-version-control-using-team to use version control), but that's a topic for another post.</p> <p>Programmers are woefully underutilized at most companies. What's the point of <a href="//danluu.com/wat/">hiring &quot;the best&quot; and then crippling them</a>? You can get better results by hiring undistinguished folks and setting them up for success, and <a href="https://twitter.com/patio11/status/706884144538648576">it's a lot cheaper</a>.</p> <h3 id="conclusion">Conclusion</h3> <p>When I started programming, I heard a lot about how programmers are down to earth, not like those elitist folks who have uniforms involving suits and ties. You can even wear t-shirts to work! But if you think programmers aren't elitist, try wearing a suit and tie to an interview sometime. You'll have to go above and beyond to prove that you're not a bad cultural fit. We like to think that we're different from all those industries that judge people based on appearance, but we do the same thing, only instead of saying that people are a bad fit because they don't wear ties, we say they're a bad fit because they do, and instead of saying people aren't smart enough because they don't have the right pedigree… wait, that's exactly the same.</p> <p><strong>See also: <a href="//danluu.com/hiring-lemons/">developer hiring and the market for lemons</a></strong></p> <p><small>Thanks to Kelley Eskridge, Laura Lindzey, John Hergenroeder, Kamal Marhubi, Julia Evans, Steven McCarthy, Lindsey Kuper, Leah Hanson, Darius Bacon, Pierre-Yves Baccou, Kyle Littler, Jorge Montero, Sierra Rotimi-Williams, and Mark Dominus for discussion/comments/corrections.</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:T"><p>This estimate is conservative. The math only works out to 78 hours if you assume that you never incorrectly reject a trendy candidate and that you don't have to interview candidates that you “correctly” fail to find good candidates. If you add in the extra time for those, the number becomes a lot larger. And if you're TrendCo, and you won't give senior ICs $200k/yr, let alone new grads, you probably need to multiply that number by at least a factor of 10 to account for the reduced probability that someone who's in high demand is going to take a huge pay cut to work for you.</p> <p>By the way, if you do some similar math you can see that the “no false positives” thing people talk about is bogus. The only way to reduce the risk of a false positive to zero is to not hire anyone. 
If you hire anyone, you're trading off the cost of firing a bad hire vs. the cost of spending engineering hours interviewing.</p> <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:P">I consider this to generally be a good practice, at least for folks like me who are relatively early in their careers. It's good to know what your options are, even if you don't exercise them. When I was at <a href="https://www.amazon.com/gp/product/B01FSZU6FK/ref=as_li_qf_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=B01FSZU6FK&amp;linkId=d15e514c6ecefa224be8f05d4d5837e3">Centaur</a>, I did a round of interviews about once a year and those interviews made it very clear that I was lucky to be at Centaur. I got a lot more responsibility and a wider variety of work than I could have gotten elsewhere, I didn't have to deal with as much nonsense, and I was pretty well paid. I still did the occasional interview, though, and you should too! If you're worried about wasting the time of the hiring company, when I was interviewing speculatively, I always made it very clear that I was happy in my job and unlikely to change jobs, and most companies are fine with that and still wanted to go through with interviewing. <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:M"><p>It's really not about me in particular. At the same time I couldn't get any company to talk to me, a friend of mine who's a much better programmer than me spent six months looking for work full time. He eventually got a job at Cloudflare, was half of the team that wrote their DNS, and is now one of the world's experts on DDoS mitigation for companies that don't have infinite resources. That guy wasn't even a networking person before he joined Cloudflare. He's a brilliant generalist who's created everything from a widely used JavaScript library to one of the coolest toy systems projects I've ever seen. He probably could have picked up whatever problem domain you're struggling with and knocked it out of the park. Oh, and between the blog posts he writes and the talks he gives, he's one of Cloudflare's most effective recruiters.</p> <p>Or Aphyr, one of the world's most respected distributed systems verification engineers, who <a href="https://twitter.com/aphyr/status/762042095330877440">failed to get responses to any of his job applications when he graduated from college less than a decade ago</a>.</p> <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:C"><p>I'm not going to do a literature review because there are just so many studies that link career earnings to external shocks, but I'll cite a result that I found to be interesting, <a href="http://citec.repec.org/d/eee/labeco/v_17_y_2010_i_2_p_303-316.html">Lisa Kahn's 2010 Labour Economics paper</a>.</p> <p>There have been a lot of studies that show, for some particular negative shock (like a recession), graduating into the negative shock reduces lifetime earnings. But most of those studies show that, over time, the effect gets smaller. When Kahn looked at national unemployment as a proxy for the state of the economy, she found the same thing. But when Kahn looked at <em>state level unemployment</em>, she found that the effect actually compounded over time.</p> <p>The overall evidence on what happens in the long run is equivocal. 
If you dig around, you'll find studies where earnings normalize after “only” 15 years, causing a large but effectively one-off loss in earnings, and studies where the effect gets worse over time. The results are mostly technically not contradictory because they look at different causes of economic distress when people get their first job, and it's possible that the differences in results are because the different circumstances don't generalize. But the “good” result is that it takes 15 years for earnings to normalize after a single bad setback. Even a very optimistic reading of the literature reveals that external events can and do have very large effects on people's careers. And if you want an estimate of the bound on the &quot;bad&quot; case, check out, for example, the <a href="http://www.nber.org/papers/w14278">Guiso, Sapienza, and Zingales paper that claims to link the productivity of a city today to whether or not that city had a bishop in the year 1000</a>.</p> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:B">During orientation, the back end of the build system was down so I tried building one of the starter tutorials on my local machine. I gave up after an hour when the build was 2% complete. I know someone who tried to build a real, large scale, production codebase on their local machine over a long weekend, and it was nowhere near done when they got back. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> </ol> </div> Harry Potter and the Methods of Rationality review by su3su2u1 su3su2u1/hpmor/ Tue, 01 Mar 2016 00:00:00 +0000 su3su2u1/hpmor/ <p><em>These are archived from the now defunct <a href="http://su3su2u1.tumblr.com/">su3su2u1 tumblr</a>. Since there was some controversy over su3su2u1's identity, I'll note that I am not su3su2u1 and that hosting this material is neither an endorsement nor a sign of agreement.</em></p> <h3 id="harry-potter-and-the-methods-of-rationality-full-review">Harry Potter and the Methods of Rationality full review</h3> <p>I opened up a bottle of delicious older-than-me scotch when Terry Pratchett died, and I’ve been enjoying it for much of this afternoon, so this will probably be a mess and cleaned up later.</p> <p>Out of 5 stars, I’d give HPMOR a 1.5. Now, to the review (this is almost certainly going to be long)</p> <h4 id="the-good">The good</h4> <p>HPMOR contains some legitimately clever reworkings of the canon books to fit with Yudkowsky’s modified world:</p> <p>A few examples- In HPMOR, the “interdict of Merlin” prevents wizards from writing down powerful spells, so Slytherin put the Basilisk in the chamber of secrets to pass on his magical lore. The prophecy “the dark lord will mark him as his own” was met when Voldemort gave Hariezer the same grade he himself had received.</p> <p>Yudkowsky is also well read, and the story is peppered with references to legitimately interesting science. If you google and research every reference, you’ll learn a lot. The problem is that most of the in-story references are incorrect, so if you don’t google around you are likely to pick up dozens of incorrect ideas.</p> <p>The writing style during action scenes is pretty good. It keeps the pace moving and brisk and can be genuinely fun to read.</p> <h4 id="the-bad">The bad</h4> <h4 id="stilted-repetitive-writing">Stilted, repetitive writing</h4> <p>A lot of this story involves conversations that read like ham-fisted attempts at manipulation, filled with overly stilted language.
Phrases like “Noble and Most Ancient House,” “General of Sunshine,” “General of Chaos,” etc. are peppered in over and over again. It’s just turgid. It smooths out when events are happening, but things are rarely happening.</p> <h4 id="bad-ideas">Bad ideas</h4> <p>HPMOR is full of ideas I find incredibly suspect- the only character trait worth anything in the story (both implicitly and explicitly) is intelligence, and the primary use of intelligence within the story is manipulation. This leads to cloying levels of a sort of nerd elitism. Ron and Hagrid are basically dismissed out of hand in this story (Ron explicitly as being useless, Hagrid implicitly so) because they aren’t intelligent enough, and Hariezer explicitly draws implicit NPC vs real-people distinctions.</p> <p>The world itself is constructed to back up these assertions- nothing in the wizarding world makes much sense, and characters often behave in silly ways (“like NPCs”) to be a foil for Hariezer.</p> <p>The most ridiculous example of this is that the wizarding world justice is based on two cornerstones- politicians decide guilt or innocence for all wizard crimes, and the system of blood debts. All of the former death eaters who were pardoned (for claiming to be imperius cursed) apparently owe a blood debt to Hariezer, and so as far as wizarding justice is concerned he is above the law. He uses this to his advantage at a trial for Hermione.</p> <h4 id="bad-pedagogy">Bad pedagogy</h4> <p>Hariezer routinely flubs the scientific concepts the reader is supposed to be learning. Almost all of the explicit in-story science references are incorrect, as well as being overly jargon-filled.</p> <p>Some of this might be on purpose- Hariezer is supposed to be only 11. However, this is terrible pedagogy. The reader’s guide to rationality is completely unreliable. Even weirder, the main antagonist, Voldemort, is also used as author mouthpiece several times. So the pedagogy is wrong at worst, and completely unreliable at best.</p> <p>And implicitly, the method Hariezer relies on for the majority of his problem solving is Aristotelian science. He looks at things, thinks real hard, and knows the answer. This is horrifyingly bad implicit pedagogy.</p> <h4 id="bad-plotting">Bad plotting</h4> <p>Over the course of the story, Hariezer moves from pro-active to no-active. At the start of the story he has a legitimate positive agenda- he wants to use science to uncover the secrets of magic. As the story develops, however, he completely loses sight of that goal, and he instead becomes just a passenger in the plot- he competes in Quirrell’s games and goes through school like any other student. When Voldemort starts including Hariezer in his plot, Hariezer floats along in a completely reactive way, etc.</p> <p>Not until Hermione dies, near the end of the story, does Hariezer pick up a positive goal again (ending death) and he does absolutely nothing to achieve it. He floats along reacting to everything, and Voldemort defeats death and revives Hermione with no real input from Hariezer at all.</p> <p>For a character who is supposed to be full of agency, he spends very little time exercising it in a proactive way.</p> <h4 id="nothing-has-consequences-boring">Nothing has consequences (boring!)</h4> <p>And this brings me to another problem with the plotting- nothing in this story has any consequences. Nothing that goes wrong has any lasting implications for the story at all, which makes all the events at hand ultimately boring.
Several examples- early in the story Hariezer uses his time turner to solve even the simplest problems. Snape is asking you questions about potions you don’t know? Time travel. Bullies are stealing a meaningless trinket? Time travel, etc. As a result of these rule violations, his time turner is locked down by Professor McGonagall. Despite this, Hariezer continues to use his time turner to solve all of his problems- the plot introduces another student willing to send a time turner message for a small amount of money via “slytherin mail”; it’s even totally anonymous.</p> <p>Another egregious example of this is Quirrell’s battle game- the prize for the battle game is handed out by Quirrell in chapter 35 or so, and there are several more battle games after the prize! The reader knows that it doesn’t at all matter who wins these games- the prize is already awarded! What’s the point? The reader knows the prize has been given out, why are they invested in the proceedings at all?</p> <p>When Hariezer becomes indebted to Luscious Malfoy, it never constrains him in any way. He goes into debt, Dumbledore tells him it’s bad, he does literally nothing to deal with the problem. Two weeks later, Hermione dies and the debt gets cancelled.</p> <p>When Hermione DIES Hariezer does nothing, and a few weeks later Voldemort brings her back. Nothing that happens ever matters.</p> <p>The closest thing to long term repercussions is Hariezer helping Bellatrix Black escape- but we literally never see Bellatrix after that.</p> <p>Hariezer never acts positively to fix his problems, he just bounces along whining about how humans need to defeat death until his problems get solved for him.</p> <h4 id="mystery-dramatic-irony-and-genre-savvy">Mystery, dramatic irony and genre savvy</h4> <p>If you’ve read the canon books, you know at all times what is happening in the story. Voldemort has possessed Quirrell, Hariezer is a horcrux, Quirrell wants the philosopher’s stone, etc. There are bits and pieces that are modified, but the shape of the story is exactly canon. So all the mystery is just dramatic irony.</p> <p>This is fine, as far as it goes, but there is a huge amount of tension because Hariezer is written as “genre savvy” and occasionally says things like “the hero of story such-and-such would do this” or “I understand mysterious prophecies from books.” The story is poking at cliches that the story wholeheartedly embraces. Supposedly Hariezer has read enough books just like this that dramatic irony like this shouldn’t happen; as the story points out many times, he should be just as informed as the reader.
AND YET…</p> <p>The author is practically screaming “wouldn’t it be lazy that Harry’s darkside is because he is a horcrux?” And yet, Harry’s darkside is because he is a horcrux.</p> <p>Even worse, the narration of the book takes lots of swipes at the canon plots while “borrowing” the plot of the books.</p> <h4 id="huge-tension-between-the-themes-lessons-and-the-setting">Huge tension between the themes/lessons and the setting</h4> <p>The major themes of this book are in major conflict with the setting throughout the story.</p> <p>One major theme is the need for secretive science to hide dangerous secrets- it’s echoed in the way Hariezer builds his “bayesian conspiracy,” reinforced by Hariezer and Quirrell’s attitudes toward nuclear weapons (and their explicit idea that people smart enough to build atomic weapons wouldn’t use them), and it’s reinforced at the end of the novel when Hariezer’s desire to dissolve some of the secrecy around magic is thwarted by a vow he took to not-end-the-world.</p> <p>Unfortunately, that same secrecy is portrayed as having stagnated the progress of the wizarding world, and preventing magic from spreading. That same secrecy might well be why the wizarding world hasn’t already ended death and made thousands of philosopher’s stones.</p> <p>Another major theme is fighting death/no-afterlife. But this is a fantasy story with magic. There are ghosts, a gate to the afterlife, a stone to talk to your dead loved ones, etc. The story tries to lampshade it a bit, but that fundamental tension doesn’t go away. Some readers even assumed that Hariezer was simply wrong about an afterlife in the story- because they felt the tension and used my point above (unreliable pedagogy) to put the blame on Hariezer. In the story, the character who actually ended death WAS ALSO THE ANTAGONIST. Hariezer’s attempts are portrayed AS SO DANGEROUS THEY COULD END THE WORLD.</p> <p>And finally- the major theme of this story is the supremacy of Bayesian reasoning. Unfortunately, as nostalgebraist pointed out explicitly, a world with magic is a world where your non-magic based Bayesian prior is worthless. Reasoning time and time again from that prior leads to snap conclusions unlikely to be right- and yet in the story this works time and time again. Once again, the world is fighting the theme of the story in obvious ways.</p> <h4 id="let-s-talk-about-hermoine">Let’s talk about Hermione</h4> <p>The most explicitly feminist arc in this story is the arc where Hermione starts SPHEW, a group dedicated to making more wizarding heroines. The group starts out successful, gets in over their heads, and Hariezer has to be called in to save the day (with the help of Quirrell).</p> <p>At the end of the arc, Hariezer and Dumbledore have a long conversation about whether or not they should have let Hermione and friends play their little bully fighting game- which feels a bit like retroactively removing the characters’ agency. Sure, the women got to play at their fantasy, but only at the whim of the real heroes.</p> <p>By the end of the story, Hermione is an indestructible part-unicorn/part-troll immortal. And what is she going to do with this power? Become Hariezer’s lab assistant, more or less. Be sent on quests by him.
It just feels like Hermione isn’t really allowed to grow into her own agency in a meaningful way.</p> <p>This isn’t to say that it’s intentional (pretty much the only character with real, proactive agency in this story is Quirrell) - but it does feel like women get the short end of the stick here.</p> <h4 id="sanderson-s-law-of-magic">Sanderson’s law of magic</h4> <p>So I’ve never read Sanderson, but someone pointed me to his first law of magic</p> <p>Sanderson’s First Law of Magics: An author’s ability to solve conflict with magic is DIRECTLY PROPORTIONAL to how well the reader understands said magic. The idea here is that if your magic is laid out with clear rules, the author should feel free to solve problems with it- if your magic is mysterious and vague like Gandalf’s, you shouldn’t solve all the conflict with magic, but if you lay out careful rules you can have the characters magic up the occasional solution. I’m not sure I buy into the rule fully, but it does make a good point- if the reader doesn’t understand your magic the solution might feel like it comes out of nowhere.</p> <p>Yudkowsky never clearly lays out most of the rules of magic, and yet still solves all his problems via magic (and magic mixed with science). We don’t know how brooms work, but apparently if you strap one to a rocket you can actually steer the rocket, you won’t fall off the thing, and you can go way faster than other broomsticks.</p> <p>This became especially problematic when he posted his final exam- lots of solutions were floated around, each of which relied on some previously ill-defined aspect of the magic. Yudkowsky’s own solution relied on previously ill-defined transfiguration.</p> <p>And when he isn’t solving problems like that, he is relying on the time turner over and over again. Swatting flies with flame throwers over and over again.</p> <p>Coupled with the world being written as “insane”, it just feels like lazy conflict resolution.</p> <h4 id="conclusion">Conclusion</h4> <p>A largely forgettable, overly long nerd power fantasy, with a bit of science (most of it wrong) and a lot of bad ideas. 1.5 stars.</p> <p><em>Individual chapter reviews below</em>.</p> <h3 id="hpmor-1">HPMOR 1</h3> <p>While at lunch, I dug into the first chapter of HPMOR. A few notes:</p> <p>This isn’t nearly as bad as I remember, the writing isn’t amazing but it’s serviceable. Either some editing has taken place in the last few years, or I’m less discerning than I once was.</p> <p>There is this strange bit, where Harry tries to defuse an argument his parents are having with:</p> <blockquote> <p>“Mum,” Harry said. “If you want to win this argument with Dad, look in chapter two of the first book of the Feynman Lectures on Physics. There’s a quote there about how philosophers say a great deal about what science absolutely requires, and it is all wrong, because the only rule in science is that the final arbiter is observation - that you just have to look at the world and report what you see.”</p> </blockquote> <p>This seems especially out of place, because no one is arguing about what science is.</p> <p>Otherwise, this is basically an ok little chapter. Harry and his father are skeptical magic could exist, so send a reply letter to Hogwarts asking for a professor to come and show them some magics.</p> <h3 id="hpmor-2-in-which-i-remember-why-i-hated-this">HPMOR 2: in which I remember why I hated this</h3> <p>This chapter had me rolling my eyes so hard that I now have quite the headache.
In this chapter, Professor McGonagall shows up and does some magic, first levitating Harry's father, and then turning into a cat. Upon seeing the first, Harry drops some Bayes, saying how anticlimactic it was 'to update on an event of infinitesimal probability'; upon seeing the second, Hariezer Yudotter greets us with this jargon dump:</p> <p>"You turned into a cat! A SMALL cat! You violated Conservation of Energy! That's not just an arbitrary rule, it's implied by the form of the quantum Hamiltonian! Rejecting it destroys unitarity and then you get FTL signalling!"</p> <p>First, this is obviously atrocious writing. Most readers will get nothing out of this horrific sentence. He even abbreviated faster-than-light as FTL, to keep the density of understandable words to a minimum.</p> <p>Second, this is horrible physics for the following reasons:</p> <ul> <li>the levitation already violated conservation of energy, which you found anticlimactic (fuck you, Hariezer)</li> <li>the deep area of physics concerned with conservation of energy is not quantum mechanics, it's thermodynamics. Hariezer should have had a jargon dump about perpetual motion machines. To see how levitation violates conservation of energy, imagine taking a generator like the Hoover dam and casting a spell to levitate all the water from the bottom of the dam back up to the top to close the loop. As long as you have a wizard to move the water, you can generate power forever. Exercise for the reader- devise a perpetual motion machine powered by shape changers (hint: imagine an elevator system of two carts hanging over a pulley. On one side, an elephant, on the other a man. Elephant goes down, man goes up. At the bottom, the elephant turns into a man and at the top the man turns into an elephant. What happens to the pulley over time?)</li> <li>the deeper area related to conservation of energy is not unitarity, as is implied in the quote. There is a really deep theorem in physics, due to Emmy Noether, that tells us that conservation of energy really means that physics is time-translationally invariant. This means there aren't special places in time, the laws tomorrow are basically the same as yesterday and today. (tangential aside- this is why we shouldn't worry about a lack of energy conservation at the big bang, if the beginning of time was a special point, no one would expect energy to be conserved there). Unitarity in quantum mechanics is basically a fancy way of saying probability is conserved. You CAN have unitarity without conservation of energy. Technical aside- it's easy to show that if the unitary operator is time-translation invariant, there is an operator that commutes with the unitary operator, usually called the hamiltonian (sketched just below the list). Without that assumption, we lose the hamiltonian but maintain unitarity.</li> <li>None of this has much to do at all with faster than light signalling, which would be the least of our concern if we had just discovered a source of infinite energy.</li> </ul>
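<p>To spell out that technical aside, here is a minimal sketch of the textbook relationship (this is just standard quantum mechanics and Stone's theorem, nothing specific to the fic):</p> <pre><code>\text{Time-translation invariance: } U(t_1)\,U(t_2) = U(t_1 + t_2)
\;\Longrightarrow\; U(t) = e^{-iHt} \text{ for a self-adjoint } H \text{ (the Hamiltonian)},
\text{ and } [H,\,U(t)] = 0, \text{ so } \langle H\rangle \text{ (the energy) is conserved.}

\text{Unitarity by itself only says } U^{\dagger}(t)\,U(t) = \mathbb{1}
\;\Longleftrightarrow\; \lVert U(t)\,\psi\rVert = \lVert\psi\rVert
\text{ (probability is conserved).}

\text{With a time-dependent } H(t), \text{ the evolution is still unitary,}
\text{ but there is no longer a conserved energy.}</code></pre>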
<p>I used to teach undergraduates, and I would often have some enterprising college freshman (who coincidentally was not doing well in basic mechanics) approach me to talk about why string theory was wrong. It always felt like talking to a physics madlibs book. This chapter let me relive those awkward moments.</p> <p>Sorry to belabor this point so much, but I think it sums up an issue that crops up from time to time in Yudkowsky's writing: when dabbling in a subject he doesn't have much grounding in, he ends up giving actual subject matter experts a headache.</p> <p>Summary of the chapter- McGonagall visits and does some magic, Harry is convinced magic is real, and they are off to go shop for Harry's books.</p> <h3 id="never-read-comments">Never Read Comments</h3> <p>I read the comments on an HPMOR chapter, which I recommend strongly against. I wish I could talk to several of the commentators, and gently talk them out of a poor financial decision.</p> <p>Poor, misguided random internet person- your donation to MIRI/LessWrong will not help save the world. Even if you grant all their (rather silly) assumptions, MIRI is a horribly unproductive research institute- in more than a decade, it has published fewer peer-reviewed papers than the average physics graduate student does while in grad school. The majority of money you donate to MIRI will go into the generation of blog posts and fan fiction. If you are fine with that, then go ahead and spend your money, but don't buy into the idea that this money will save the world.</p> <h3 id="hpmor-3-uneventful-inoffensive">HPMOR 3: uneventful, inoffensive</h3> <p>This chapter is worse than the previous chapters. As Hariezer (I realize this portmanteau isn't nearly as clever as I seem to think it is, but I will continue to use it) enters Diagon Alley, he remarks</p> <blockquote> <p>It was like walking through the magical items section of an Advanced Dungeons and Dragons rulebook (he didn't play the game, but he did enjoy reading the rulebooks).</p> </blockquote> <p>For reasons not entirely clear to me, the line filled me with rage.</p> <p>As they walk, McGonagall tells Hariezer about Voldemort, noting that other countries failed to come to Britain's aid. This prompts Hariezer to immediately misuse the idea of the Bystander Effect (an exercise left to the reader- do social psychological phenomena that apply to individuals also apply to collective entities, like countries? Are the social-psychological phenomena around failure to act in people likely to also explain failure to act as organizations?).</p> <p>That's basically it for this chapter. Uneventful chapter- slightly misused scientific stuff, a short walk through Diagon Alley, standard Voldemort stuff. The chapter ends with some very heavy-handed foreshadowing:</p> <blockquote> <p>(And somewhere in the back of his mind was a small, small note of confusion, a sense of something wrong about that story; and it should have been a part of Harry's art to notice that tiny note, but he was distracted. For it is a sad rule that whenever you are most in need of your art as a rationalist, that is when you are most likely to forget it.)</p> </blockquote> <p>If Harry had only attended more CFAR workshops…</p> <h3 id="hpmor-4-in-which-for-the-first-time-i-wanted-the-author-to-take-things-further">HPMOR 4: in which, for the first time, I wanted the author to take things further</h3> <p>So first, I actually like this chapter more than the previous few, because I think it's beginning to try to deliver on what I want in the story. And now, my bitching will commence:</p> <p>A recurring theme of the LessWrong sequences that I find somewhat frustrating is that (apart from the Bayesian Rationalist) the world is insane.
This same theme pops up in this MOR chapter, where the world is created insane by Yudkowsky, so that Hariezer can tell you why.</p> <p>Upon noticing the wizarding world uses coins of silver and gold, Hariezer asks about exchange rates, and asks the bank goblin how much it would cost to get a big chunk of silver turned into coins; the goblin says he'll check with his superiors, Hariezer asks him to estimate, and the estimate is that the fee is about 5% of the silver.</p> <p>This prompts Hariezer to realize that he could do the following:</p> <ol> <li>Take gold coins and buy silver with them in the muggle world</li> <li>bring the silver to Gringotts and have it turned into coins</li> <li>convert the silver coins to gold coins, ending up with more gold than you started with, and start the loop over until the muggle prices make it no longer profitable</li> </ol> <p>(of course, the in-story explanation is overly jargon-filled as usual)</p>
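<p>To make the arbitrage concrete, here is a quick sketch of the loop. The exchange rates and the muggle silver price below are made-up placeholders, not numbers from the story; the point is just that any round-trip multiplier above 1 compounds until prices move (or the goblins notice):</p> <pre><code>
# Hypothetical numbers, purely for illustration.
GALLEONS_PER_SICKLE = 1 / 17   # wizard exchange rate
SILVER_PER_GALLEON = 20        # sickle-weights of muggle silver one galleon's gold buys (assumed)
MINT_FEE = 0.05                # the 5% fee quoted by the goblin

def one_round_trip(galleons):
    """Gold to muggle silver to freshly minted sickles and back to gold."""
    silver = galleons * SILVER_PER_GALLEON
    sickles = silver * (1 - MINT_FEE)   # assume one sickle minted per sickle-weight of silver
    return sickles * GALLEONS_PER_SICKLE

galleons = 100.0
for trip in range(5):
    galleons = one_round_trip(galleons)
    print(f"after trip {trip + 1}: {galleons:.1f} galleons")
# With these placeholder rates each trip multiplies your gold by about 1.12,
# so the loop compounds until the muggle silver price catches up.
</code></pre>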
<p>This is somewhat interesting, and it's the first look at what I want in a story like this- the little details of the wizarding world that would never be covered in a children's story. Stross wrote a whole book exploring money/economics in a far future society (Neptune's Brood, it's only ok); there is a lot of fertile ground for Yudkowsky here.</p> <p>In a world where wizards can magic wood into gold, how do you keep counterfeiting at bay? Maybe the coins are made of special gold only goblins know how to find (maybe the goblin hordes hoard (wordplay!) this special gold like De Beers hoards diamonds).</p> <p>Maybe the goblins carefully magic money into and out of existence in order to maintain a currency peg. Maybe it's the perfect inflation- instead of relying on banks to disperse the coins, every now and then the coins in people's pockets just multiply at random.</p> <p>Instead, we get a silly, insane system (don't blame Rowling either- Yudkowsky is more than willing to go off book, AND the details of this simply aren't discussed, for good reason, in the genre Rowling wrote the books in), and rationalist Hariezer gets an easy 'win'. It's not a BAD section, but it feels lazy.</p> <p>And a brief note on the writing style- it's still oddly stilted, and I wonder how good it would be at explaining ideas to someone unfamiliar. For instance, Hariezer gets lost in thought, McGonagall says something, and Hariezer replies:</p> <blockquote> <p>"Hm?" Harry said, his mind elsewhere. "Hold on, I'm doing a Fermi calculation." "A what?" said Professor McGonagall, sounding somewhat alarmed. "It's a mathematical thing. Named after Enrico Fermi. A way of getting rough numbers quickly in your head…"</p> </blockquote> <p>Maybe it would feel less awkward for Hariezer to say "Hold on, I'm trying to estimate how much gold is in the vault." And then instead of saying "it's a math thing," we could follow Hariezer's thoughts as he carefully constructs his estimate (as it is, the estimate is crammed into a hard-to-read paragraph).</p> <p>It's a nitpick, sure, but the story thus far is loaded with such nits.</p> <p>Chapter summary- Harry goes to Gringotts, takes out money.</p> <h3 id="hpmor-5-in-which-the-author-assures-us-repeatedly-this-chapter-is-funny">HPMOR 5: in which the author assures us repeatedly this chapter is funny</h3> <p>This chapter is, again, mostly inoffensive, although there is a weird tonal shift. The bulk of this chapter is played broadly for laughs. There is actually a decent description of the fundamental attribution error, although it's introduced with this twerpy bit of dialogue:</p> <p>Harry looked up at the witch-lady's strict expression beneath her pointed hat, and sighed. "I suppose there's no chance that if I said fundamental attribution error you'd have any idea what that meant."</p> <p>This sort of thing seems like awkward pedagogy. If the reader doesn't know it, Hariezer is now exasperated with the reader as well as with whoever Yudkowsky is currently using as a foil.</p> <p>Now, the bulk of this chapter involves Hariezer being left alone to buy robes, where he meets and talks to Draco Malfoy. Hariezer, annoyed at having people say to him "OMG, YOU ARE HARRY POTTER!" upon meeting him, exclaims, upon learning Malfoy's name, "OMG, YOU ARE DRACO MALFOY!". Malfoy accepts this as a perfectly normal reaction to his imagined fame, and a mildly amusing conversation occurs. It's a fairly clever idea.</p> <p>Unfortunately, it's marred by the literary equivalent of a sitcom laugh track. Worried that the reader isn't sure if they should be laughing, Yudkowsky interjects phrases like these throughout:</p> <blockquote> <p>Draco's attendant emitted a sound like she was strangling but kept on with her work</p> <p>One of the assistants, the one who'd seemed to recognise Harry, made a muffled choking sound.</p> <p>One of Malkin's assistants had to turn away and face the wall.</p> <p>Madam Malkin looked back silently for four seconds, and then cracked up. She fell against the wall, wheezing out laughter, and that set off both of her assistants, one of whom fell to her hands and knees on the floor, giggling hysterically.</p> </blockquote> <p>The reader is constantly told that the workers in the shop find it so funny they can barely contain their laughter. It feels like the author constantly yelling GET IT YOU GUYS? THIS IS FUNNY!</p> <p>As far as the writing goes, the tonal shift to broad comedy feels a bit strange and happens with minimal warning (there is a brief conversation earlier in the chapter that's also played for a laugh), and everything is as stilted as it's always been. For example, when McGonagall walks into the robe shop in time to hear Malfoy utter some absurdities, Harry tells her</p> <blockquote> <p>"He was in a situational context where those actions made internal sense -"</p> </blockquote> <p>Luckily, Hariezer gets cut off before he starts explaining what a joke is.</p> <p>Chapter summary- Hariezer buys robes, talks to Malfoy.</p> <h3 id="hpmor-6-yud-lets-it-all-hang-out">HPMOR 6: Yud lets it all hang out</h3> <p>The introduction suggested that the story really gets moving after chapter 5. If this is an example of what "really moving" looks like, I fear I'll soon stop reading. Apart from my rant about chapter 2, things had been largely light and inoffensive up until this chapter. Here, I found myself largely recoiling. We shift from the broad comedy of the last chapter to a chapter filled with weirdly dark little rants.</p> <p>As should be obvious by now, I find the line between Eliezer and Harry to be pretty blurry (hence my annoying use of Hariezer). In this chapter, that line disappears completely as we get passages like this:</p> <blockquote> <p>Harry had always been frightened of ending up as one of those child prodigies that never amounted to anything and spent the rest of their lives boasting about how far ahead they'd been at age ten. But then most adult geniuses never amounted to anything either.
There were probably a thousand people as intelligent as Einstein for every actual Einstein in history. Because those other geniuses hadn't gotten their hands on the one thing you absolutely needed to achieve greatness. They'd never found an important problem.</p> </blockquote> <p>There are dozens of such passages that could be ripped directly from some of Hariezer's friendly AI writing and pasted right into MOR. It's a bit disconcerting, in part because it's forcing me to face just how much of Eliezer's other writing I've wasted time with.</p> <p>The chapter begins strongly enough: Hariezer starts doing some experiments with his magic pouch. If he asks for 115 gold coins, it comes, but not if he asks for 90+25 gold coins. He tries using other words for gold in other languages, etc. Unfortunately, it leads him to say this:</p> <blockquote> <p>"I just falsified every single hypothesis I had! How can it know that 'bag of 115 Galleons' is okay but not 'bag of 90 plus 25 Galleons'? It can count but it can't add? It can understand nouns, but not some noun phrases that mean the same thing?…The rules seem sorta consistent but they don't mean anything! I'm not even going to ask how a pouch ends up with voice recognition and natural language understanding when the best Artificial Intelligence programmers can't get the fastest supercomputers to do it after thirty-five years of hard work,"</p> </blockquote> <p>So here is the thing- it would be very easy to write a parser that behaves exactly like what Hariezer describes with his bag. You would just have a look-up table with lots of single words for gold in various languages. Nothing fancy at all. It's behaving oddly ENTIRELY BECAUSE IT'S NOT DOING NATURAL LANGUAGE. I hope we revisit the pouch in a later chapter to sort this out. I reiterate, it's stuff like this that (to me at least) was the whole premise of this story- flesh out the rules of this wacky universe.</p>
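<p>A toy sketch of that kind of parser, just to show how little machinery it takes (the word list is obviously made up, and the pouch's real rules are never spelled out; the point is that a lookup table "understands" single words and fails on phrases for free):</p> <pre><code>
# A dumb lookup-table "parser": an optional leading number plus a single
# known word for gold. No grammar, no arithmetic, no language understanding.
GOLD_WORDS = {"gold", "galleons", "oro", "aurum", "kin"}   # hypothetical list

def pouch_request(phrase):
    words = phrase.lower().split()
    if words and words[0].isdigit():          # "115 galleons" -> count of 115
        count, rest = int(words[0]), words[1:]
    else:                                     # bare "gold" -> one coin
        count, rest = 1, words
    if len(rest) == 1 and rest[0] in GOLD_WORDS:
        return count                          # dispense this many coins
    return None                               # anything fancier just fails

print(pouch_request("115 galleons"))          # 115
print(pouch_request("90 plus 25 galleons"))   # None: "plus" isn't in the table
print(pouch_request("aurum"))                 # 1
</code></pre>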
<p>Immediately after this, the story takes a truly bizarre turn. Hariezer spots a magic first aid kit, and wants to buy it. In order to be a foil for super-rationalist Harry, McGonagall then immediately becomes immensely stupid, and tries to dissuade him from purchasing it. Note, she doesn't persuade him by saying "Oh, there are magical first aid kits all over the school," or "there are wizards watching over the boy who lived who can heal you with spells if something happens" or anything sensible like that, she just starts saying he'd never need it.</p> <p>This leads Harry to a long description of the planning fallacy, and he says to counter it he always tries to assume the worst possible outcomes. (Note to Harry and the reader: the planning fallacy is a specific thing that occurs when people or organizations plan to accomplish a task. What Harry is trying to overcome is more correctly optimism bias.)</p> <p>This leads McGonagall to start lightly suggesting (apropos of nothing) that maybe Harry is an abused child. Hariezer responds with this tale:</p> <blockquote> <p>&quot;There'd been some muggings in our neighborhood, and my mother asked me to return a pan she'd borrowed to a neighbor two streets away, and I said I didn't want to because I might get mugged, and she said, 'Harry, don't say things like that!' Like thinking about it would make it happen, so if I didn't talk about it, I would be safe. I tried to explain why I wasn't reassured, and she made me carry over the pan anyway. I was too young to know how statistically unlikely it was for a mugger to target me, but I was old enough to know that not-thinking about something doesn't stop it from happening, so I was really scared." … I know it doesn't sound like much," Harry defended. "But it was just one of those critical life moments, you see? … That's when I realised that everyone who was supposed to protect me was actually crazy, and that they wouldn't listen to me no matter how much I begged them.</p> </blockquote> <p>So we are back to the world is insane, as filtered through this odd little story.</p> <p>Then McGonagall asks if Harry wants to buy an owl, and Harry says no, he'd be too worried he'd forget to feed it or something. Which prompts McGonagall AGAIN to suggest Harry had been abused, which leads Harry into an odd rant about how false accusations of child abuse ruin families (which is true, but seriously, is this the genre for this rant? What the fuck is happening with this chapter?). This ends up with McGonagall implying Harry must have been abused because he is so weird, and maybe someone cast a spell to wipe his memory of it (the spell comes up after Harry suggests repressed memories are BS pseudoscience, which again, is true, BUT WHY IS THIS HAPPENING IN THIS STORY?).</p> <p>Harry uses his 'rationalist art' (literally "Harry's rationalist skills begin to boot up again") to suggest an alternative explanation:</p> <blockquote> <p>&quot;I'm too smart, Professor. I've got nothing to say to normal children. Adults don't respect me enough to really talk to me. And frankly, even if they did, they wouldn't sound as smart as Richard Feynman, so I might as well read something Richard Feynman wrote instead. I'm isolated, Professor McGonagall. I've been isolated my whole life. Maybe that has some of the same effects as being locked in a cellar. And I'm too intelligent to look up to my parents the way that children are designed to do. My parents love me, but they don't feel obliged to respond to reason, and sometimes I feel like they're the children - children who won't listen and have absolute authority over my whole existence. I try not to be too bitter about it, but I also try to be honest with myself, so, yes, I'm bitter.</p> </blockquote> <p>After that weird back and forth, the chapter moves on: Harry goes and buys a wand, and then from conversation begins to suspect that Voldemort might still be alive. When McGonagall doesn't want to tell him more, "a terrible dark clarity descended over his mind, mapping out possible tactics and assessing their consequences with iron realism."</p> <p>This leads Hariezer to blackmail McGonagall- he won't tell people Voldemort is still alive if she tells him about the prophecy. It's another weird bit in a chapter absolutely brimming with weird bits.</p> <p>Finally they go to buy a trunk, but they are low on gold (note to the reader: here would have been an excellent example of the planning fallacy). But luckily Hariezer had taken extra from the vault. Rather than simply saying "oh, I brought some extra", he says:</p> <blockquote> <p>So - suppose I had a way to get more Galleons from my vault without us going back to Gringotts, but it involved me violating the role of an obedient child. Would I be able to trust you with that, even though you'd have to step outside your own role as Professor McGonagall to take advantage of it?</p> </blockquote> <p>So he logic-chops her into submission, or whatever, and they buy the trunk.</p> <p>This chapter for me was incredibly uncomfortable.
McGonagall behaves very strangely so she can act as a foil for all of Hariezer’s rants, and when the line between Hariezer and Eliezer fell away completely, it felt a bit oddly personal.</p> <p>Oh, right, there was also a conversation about the rule against underage magic</p> <p>&quot;Ah,” Harry said. “That sounds like a very sensible rule. I’m glad to see the wizarding world takes that sort of thing seriously.”</p> <p>I can’t help but draw parallels to the precautions Yud wants with AI.</p> <p>Summary: Harry finished buying school supplies(I hope).</p> <h3 id="hpmor-7-uncomfortable-elitism-and-rape-threats">HPMOR 7: Uncomfortable elitism, and rape threats</h3> <p>A brief warning: Like always I’m typing this thing on my phone, so strange spell-check driven typos almost certainly abound. However, I’m also pretty deep in my cups (one of the great privileges of leaving academia is that I can afford to drink Lagavulin more than half my age like its water. The downside is no longer get to teach and so must pour my wisdom out in the form of a critique of a terrible fan fiction, that all of one person is probably reading)</p> <p>This chapter took the weird tonal shift from the last chapter and just ran with it.</p> <p>We are finally heading toward Hogwarts,so the chapter opens with the classic platform 9 3/4 bit from the book. And then it takes an uncomfortable elitist tone: Harry asks Ron Weasley to call him “Mr. Spoo” so that he can remain incognito, and Ron, a bit confused says “Sure Harry.” That one slip up allows Hariezer to immediately peg Ron as an idiot. In the short conversation that follows he mentally thinks of Ron as stupid several times and then he tries to explain to Ron why Quidditch is a stupid game.</p> <p>It is my understanding from a (rather loose reading of the) books, that like cricket, quidditch games last weeks, EVEN MONTHS. In a game lasting literally weeks, one team could conceivably be up by 15 goals. In one of the books, I believe an important match went against the team that caught the snitch in one of the books. This is not to entirely defend quidditch, but it doesn’t HAVE to be an easy target. I think part of the ridicule that quidditch gets is that non-British/non-Indian audiences are perhaps not capable of appreciating that there are sports (cricket) that are played out over weeks that are very high scoring.</p> <p>Either way, the WAY that Hariezer attacks quidditch is at expense of Ron, and it feels like a nerd sneering at a jock for liking sports. But thats just the lead up to the cloying nerd-elitism. Draco comes over, Hariezer is quick to rekindle that budding friendship, and we get the following conversation about Ron:</p> <blockquote> <p>If you didn’t like him,” Draco said curiously, “why didn’t you just walk away?” &quot;Um… his mother helped me figure out how to get to this platform from the King’s Cross Station, so it was kind of hard to tell him to get lost. And it’s not that I hate this Ron guy,” Harry said, “I just, just…” Harry searched for words. &quot;Don’t see any reason for him to exist?&quot; offered Draco. &quot;Pretty much.</p> </blockquote> <p>Just cloying, uncomfortable levels of nerd-elitism.</p> <p>Now that Hariezer and Draco are paired back up, they can have lot of uncomfortable conversations. First, Draco shares something only slightly personal, which leads to this</p> <blockquote> <p>&quot;Why are you telling me that? It seems sort of… private…” Draco gave Harry a serious look. 
“One of my tutors once said that people form close friendships by knowing private things about each other, and the reason most people don’t make close friends is because they’re too embarrassed to share anything really important about themselves.” Draco turned his palms out invitingly. “Your turn?”</p> </blockquote> <p>Hariezer consider this a masterful use of the social psychology idea of reciprocity (which just says if you do something for someone, they’re likely to do it for you). Anyway, this exchange is just a lead up to this, which feels like shock value for no reason:</p> <blockquote> <p>&quot;Hey, Draco, you know what I bet is even better for becoming friends than exchanging secrets? Committing murder.&quot; &quot;I have a tutor who says that,&quot; Draco allowed. He reached inside his robes and scratched himself with an easy, natural motion. &quot;Who’ve you got in mind?&quot; Harry slammed The Quibbler down hard on the picnic table. “The guy who came up with this headline.” Draco groaned. “Not a guy. A girl. A ten-year-old girl, can you believe it? She went nuts after her mother died and her father, who owns this newspaper, is convinced that she’s a seer, so when he doesn’t know he asks Luna Lovegood and believes anything she says.” … Draco snarled. “She has some sort of perverse obsession about the Malfoys, too, and her father is politically opposed to us so he prints every word. As soon as I’m old enough I’m going to rape her.”</p> </blockquote> <p>So, Hariezer is joking about the murder (its made clear later), but WHAT THE FUCK IS HAPPENING? These escalating friendship-tests feel contrived, reciprocity is effective when you don’t make demands immediately, which is why when you get a free sample at the grocery store the person at the counter doesn’t say “did you like that? Buy this juice or we won’t be friends anymore.” This whole conversation feels ham fisted, Hariezer is consistently telling us about all the manipulative tricks they are both using. Its less a conversation and more people who just sat through a shitty marketing seminar trying to try out what they learned. WITH RAPE.</p> <p>After that, Draco has a whole spiel about how the legal system of the wizard world is in the pocket of the wealthy, like the Malfoys, which prompts Hariezer to tell us that only countries descended from the enlightenment have law-and-order (and I take it from comments that originally there was some racism somewhere in here that has since been edited out). Note: the wizarding world HAS LITERAL MAGIC TRUTH POTIONS, but we are to believe our enlightenment legal system works better? This seems like an odd, unnecessary narrative choice.</p> <p>Next, Hariezer tries to recruit Draco to the side of science with this:</p> <blockquote> <p>Science doesn’t work by waving wands and chanting spells, it works by knowing how the universe works on such a deep level that you know exactly what to do in order to make the universe do what you want. If magic is like casting Imperio on someone to make them do what you want, then science is like knowing them so well that you can convince them it was their own idea all along. It’s a lot more difficult than waving a wand, but it works when wands fail, just like if the Imperiusfailed you could still try persuading a person.</p> </blockquote> <p>I’m not sure why you’d use persuasion/marketing as a shiny metaphor for science, other than its the theme of this chapter. 
”If you know science you can manipulate people as if you were literally in control of them” seems like a broad, and mostly untrue claim. AND IT FOLLOWS IMMEDIATELY AFTER HARRY EXPLAINED THE MOON LANDING TO DRACO. Science can take you to the fucking moon, maybe thats enough.</p> <p>This chapter also introduces comed-tea, a somewhat clever pun drink. If you open a can, at some point you’ll do a spit-take before finishing it. I’m not sure the point of this new magical introduction, hopefully Hariezer gets around to exploring it (seriously, hopefully Hariezer begins to explore ANYTHING to do with the rules of magic. I’m 7 full chapters in and this fanfic has paid lip service to science without using it to explore magic at all).</p> <p>Chapter summary: Hariezer makes it to platform 9 3/4, ditches Ron as somehow sub-human. Has a conversation with Draco that is mostly framed as conversation-as-explicit manipulation between Hariezer and Draco, and its very ham-fisted, but luckily Hariezer assures us its actual masterful manipulation, saying things like this, repeatedly:</p> <blockquote> <p>And Harry couldn’t help but notice how clumsy, awkward, graceless his attempt at resisting manipulation / saving face / showing off had appeared compared to Draco.</p> </blockquote> <p>Homework for the interested reader: next time you are meeting someone new, share something embarrassingly personal and then ask them immediately to reciprocate, explicitly saying ‘it’ll make us good friends.’ See how that works out for you.</p> <p>WHAT DO PEOPLE SEE IN THIS? It wouldn’t be so bad, but we are clearly supposed to identify with Hariezer, who looks at Draco as someone he clearly wants on his side, and who instantly dismisses someone (with no “Bayesian updates” whatsoever as basically less than human). I’m honestly surprised that anyone read past this chapter. But I’m willing to trudge on, for posterity. Two more glasses of scotch, and then I start chapter 8. I’m likely to need alcohol to soldier on from here on out.</p> <p>Side note: I’ve consciously not mentioned all the “take over the world” Hariezer references, but there are probably 3 or 4 per chapter. They seem at first like bad jokes, but they keep getting revisited so much that I think Hariezer’s explicit goal is perhaps not curiosity driven (figure out the rules of magic), but instead power driven (find out the rules of magic in order to take over the world). He assures Draco he really is Ravenclaw, but if he were written with consistency maybe he wouldn’t need to be? Hariezer doesn’t ask questions (like I would imagine a Ravenclaw would), he gives answers. Thus far, he has consistently decided the wizarding world has nothing to teach him. Arithmancy books he finds only go up to trigonometry, etc. He certainly has shown only limited curiosity this far. Its unclear to me why a curiosity driven, scientist character would feel a strong desire to impress, manipulate Draco Malfoy, as written here. 
This is looking less like a love-song to science, and more a love-song to some weird variant of How to Win Friends and Influence People.</p> <h3 id="a-few-observations-regarding-hariezer-yudotter">A few Observations Regarding Hariezer Yudotter</h3> <p>After drunkenly reading chapters 8,9 and 10 last night (I’ll get to the posts soon, hopefully), I was flipping channels and somehow settled on an episode of that old TV show with Steve Urkel (bear with me, this will get relevant in a second).</p> <p>In the episode, the cool kid Eddie gets hustled at billiards, and Urkel comes in and saves the day because his knowledge of trigonometry and geometry makes him a master at the table.</p> <p>I think perhaps this is a common dream of the science fetishist- if only I knew ALL OF THE SCIENCE I would be unstoppable at everything. Hariezer Yudotter is a sort of wish fulfillment character of that dream. Hariezer isn’t motivated by curiosity at all really, he wants to grow his super-powers by learning more science. Its why we can go 10 fucking chapters without Yudotter really exploring much in the way of the science of magic (so far I count one lazy paragraph exploring what his pouch can do, in 10 chapters). Its why he constantly describes his project as “taking over the world.” And its frustrating, because this obviously isn’t a flaw to be overcome its part of Yudotter’s “awesomeness.”</p> <p>I have a phd in a science, and it has granted me these real world super-powers:</p> <ol> <li>I fix my own plumbing, do my own home repairs,etc.</li> <li>I made a robot of legos and a raspberry pi that plays connect 4 incredibly well (my robot sidekick, I guess)</li> <li>Via techniques I learned in the sorts of books that in fictional world Hariezer uses to be a master manipulator, I can optimize ads on webpages such that up to 3% of people will click on them (that is, seriously, the power of influence in reality. Not Hannibal Lector but instead advertisers trying to squeeze an extra tenth of a percent on conversions), for which companies sometime pay me</li> <li>If you give me a lot of data, I can make a computer find patterns in it, for which companies sometimes pay me.</li> </ol> <p>Thats basically it. Back when I worked in science, I spent nearly a decade of my life calculating various background processes related to finding a Higgs boson, and I helped design some software theorists now use to calculate new processes quickly. These are the sorts of projects scientists work on, and most days its hard work and total drudgery, and there is no obvious ‘instrumental utility’- BUT I REALLY WANTED TO KNOW IF THERE WAS A HIGGS FIELD.</p> <p>And thats why I think the Yudotter character doesn’t feel like a scientist- he wants to be stronger, more powerful, take over the world, but he doesn’t seem to care what the answers are. Its all well and good to be driven, but most importantly, you have to be curious.</p> <h3 id="hpmor-8-back-to-the-inoffensive-chapters-of-yesteryear">HPMOR 8: Back to the inoffensive chapters of yesteryear</h3> <p>And a dramatic tonal shift and we are back to a largely inoffensive chapter.</p> <p>There is another lesson in this chapter, this time the lesson is confirmation bias (though Yudkowsky/Hariezer refer to it as ‘positive bias’), but once again, the pedagogical choices are strange. 
As Hariezer winds into his lesson to Hermione, she thinks the following:</p> <blockquote> <p>She was starting to resent the boy’s oh-so-superior tone…but that was secondary to finding out what she’d done wrong.</p> </blockquote> <p>So Yudkowsky knows his Hariezer has a condescending tone, but he runs with it. So as a reader, if I already know the material I get to be on the side of truth, and righteousness and I can condescend to the simps with Hariezer, OR, I don’t know the material, and then Hermione is my stand in, and I have to swallow being condescended to in order to learn.</p> <p>Generally, its not a good idea when you want to teach someone something to immediately put them on the defensive- I’ve never stood in front of a class, or tutored someone by saying</p> <blockquote> <p>&quot;The sad thing is you probably did everything the book told you to do… unless you read the very, very best sort of books, they won’t quite teach you how to do science properly…</p> </blockquote> <p>And Yudkowsky knows enough that his tone is off-putting to point to it. So I wonder- is this story ACTUALLY teaching people things? Or is it just a way for people who already know some of the material to feel superior to Hariezer’s many foils? Do people go and read the sequences so that they can move from Hariezer-foil, to Hariezer’s point of view? (these are not rhetorical questions, if anyone has ideas on this).</p> <p>As for the rest of the chapter- its good to see Hermione merits as human, unlike Ron. There is a strange bit in the chapter where Neville asks a Gryffindor prefect to find his frog, and the prefect says no (why? what narrative purpose does this serve?).</p> <p>Chapter summary: Hariezer meets Neville and Hermione on the train to Hogwarts. Still no actual exploration of magic rules. None of the fun candy of the original story.</p> <h3 id="hpmor-9-and-10">HPMOR 9 and 10</h3> <p>EDIT: I made a drunken mistake in this one, <a href="http://thinkingornot.tumblr.com/post/104047979906/chapters-9-and-10">see this response</a>. I do think my original point still goes through because the hat responds to the attempted blackmail with:</p> <blockquote> <p>I know you won’t follow through on a threat to expose my nature, condemning this event to eternal repetition. It goes against the moral part of you too strongly, whatever the short-term needs of the part of you that wants to win the argument.</p> </blockquote> <p>So the hat doesn’t say “I don’t care about this,” the hat says “you won’t do it.” My point is, however, substantially weakened.</p> <p>END EDIT</p> <p>Alright, the lagavulin is flowing, and I’m once more equipped to pontificate.</p> <p>These chapters are really one chapter split in two. I’m going to use them to argue against Yudkowsky’s friendly AI concept a bit. There is this idea, called ‘orthgonality’ that says that an AIs goals can be completely independent of its intelligence. So you can say ‘increase happiness’ and this uber-optimizer can tile the entire universe with tiny molecular happy faces, because its brilliant at optimizing but incapable of evaluating its goals. Just setting the stage for the next chapter.</p> <p>In this chapter, Harry gets sorted. When the sorting hat hits his head, Harry wonders if its self-aware, which because of some not-really-explained magical hat property, instantly makes the hat self-aware. The hat finds being self aware uncomfortable, and Hariezer worries that he’ll destroy an intelligent being when the hat is removed. 
The hat assures us that the hat cares only for sorting children. As Hariezer notes</p> <blockquote> <p>It [the hat] was still imbued with only its own strange goals…</p> </blockquote> <p>Even still, Hariezer manages to blackmail the hat- he threatens to tell all the other kids to wonder if the hat is self-aware. The hat concedes to the demand.</p> <p>So how does becoming self-aware over and over effect the hat’s goals of sorting people? It doesn’t. The blackmail should fail. Yudkowsky imagines that the minute it became self-aware, the hat couldn’t help but pick up some new goals. Even Yudkowsky imagines that becoming self-aware will have some effects on your goals.</p> <p>This chapter also has some more weirdly personal seeming moments when the line between Yudkowsky’s other writing and HPMOR breaks down completely.</p> <p>Summary: Harry gets sorted into ravenclaw.</p> <p>I am immensely frustrated that I’m 10 chapters into this thing, and we still don’t have any experiments regarding the rules of magic.</p> <h3 id="hpmor-11">HPMOR 11</h3> <p>Chapter 11 is “omake.” This is a personal pet-peeve of mine, because I’m a crotchety old man at heart. The anime culture takes Japanese words, for which we have perfectly good english words, and snags them (kawaii/kawaisa is a big one). Omake is another one. I have nothing against Japanese (I’ve been known to speak it), just don’t like unnecessary loaner words in general. I know this is my failing, BUT I WANT SO BAD TO HOLD IT AGAINST THIS FANFIC.</p> <p>Either way, I’m skipping the extra content, because I can only take so much.</p> <h3 id="hpmor-12">HPMOR 12</h3> <p>Nothing much in this chapter. Dumbledore gives his post-dinner speech.</p> <p>Harry cracks open a can of comed-tea and does the requisite spit-take when Dumbledore starts his dinner speech with random nonsense. He considers casting a spell to make his sense of humor very specific, and then he can use comed-tea to take over the world.</p> <p>Chapter summary: dinner is served and eaten</p> <h3 id="hpmor-13-bill-and-teds-excellent-adventure">HPMOR 13: Bill and Teds Excellent Adventure</h3> <p>There is a scene in Bill and Ted’s Excellent Adventure, toward the end, where they realize their time machine gives them super-human powers. They need to escape a jail, so they agree to get the keys later and travel back in time and hide them, and suddenly there the keys are. After yelling to remember a trash can, they have a trash can to incapacitate a guard with,etc. They can do anything they want.</p> <p>Anyway, this chapter is that idea, but much longer. The exception is that we don’t know there has been a time machine (actually, I don’t KNOW for sure thats what it is, but the Bill and Ted fan in me says thats what happened this chapter, I won’t find out until next chapter. If I were a Bayesian rationalist, I would say that the odds ratio is pi*10^^^^3 in my favor. ).</p> <p>Hariezer wakes up and finds a note saying he is part of a game. Everywhere he looks, as if by magic he finds more notes, deducting various game “points,” and some portraits have messages for him. The notes lead him to a pack of bullies beating up some hufflepuffs, and pies myseriously appears for Hariezer to attack with. The final note tells him to go to Mcgonagall’s office, and the chapter ends.</p> <p>I assume next chapter, Hariezer will recieve his time machine and future Hariezer will use it to set up the “game” as a prank on past Hariezer. 
It's a clever enough chapter.</p> <p>This chapter was actually decent, but what the world really needs is Harry Potter/Bill and Ted's Excellent Adventure crossover fiction.</p> <h3 id="hpmor-14-lets-talk-about-computability">HPMOR 14: Let's talk about computability</h3> <p>This chapter has created something in my brain like Mr. Burns's Three Stooges Syndrome. So many things I want to talk about, I don't know where to start!</p> <p>First, I was slightly wrong about last chapter. It wasn't a time machine Hariezer used to accomplish the prank in the last chapter, it was a time machine AND an invisibility cloak. Bill and Ted did not lead me astray.</p> <p>On to the chapter- Hariezer gets a time machine (Hariezer lives 26-hour days, so he is given a time turner to correct his sleep schedule), which prompts this:</p> <blockquote> <p>Say, Professor McGonagall, did you know that time-reversed ordinary matter looks just like antimatter? Why yes it does! Did you know that one kilogram of antimatter encountering one kilogram of matter will annihilate in an explosion equivalent to 43 million tons of TNT? Do you realise that I myself weigh 41 kilograms and that the resulting blast would leave A GIANT SMOKING CRATER WHERE THERE USED TO BE SCOTLAND?</p> </blockquote> <p>Credit where credit is due- this is correct physics. In fact, it's completely possible (though a bit old-fashioned and unwieldy) to treat quantum field theory such that all anti-matter is simply normal matter moving backward in time. Here is an example, look at this diagram:</p> <p><img src="images/su3su2u1/gamma.png"></p> <p>If we imagine time moving from the bottom of the diagram toward the top, we see two electrons traveling forward in time, and exchanging a photon and changing directions.</p> <p>But now imagine time moves left to right in the diagram instead- what we see is one electron and one positron coming together and destroying each other, and then a new pair forming from the photon. BUT, we COULD say that what we are seeing is really an electron moving forward in time, and an electron moving backward in time. The point where they "disappear" is really the point where the forward moving electron changed directions and started moving backward in time.</p> <p>This is probably very confusing; if anyone wants a longer post about this, I could probably try for it sober. I need to belabor this though- the takeaway point I need you to know- the best theory we have of physics so far can be interpreted as having particles that change direction in time, AND HARIEZER KNOWS THIS AND CORRECTLY NOTES IT.</p>
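<p>The 43-million-ton number in that quote also checks out, if you want to verify it yourself. A quick back-of-the-envelope check (the only inputs are the 1 kg + 1 kg from the quote and standard constants):</p> <pre><code>
# E = m c^2: 1 kg of antimatter annihilating with 1 kg of matter
# converts 2 kg of rest mass entirely into energy.
c = 3.0e8                           # speed of light, m/s
energy_joules = 2.0 * c**2          # about 1.8e17 J
joules_per_megaton_tnt = 4.184e15
print(energy_joules / joules_per_megaton_tnt)   # ~43 megatons, i.e. "43 million tons of TNT"
</code></pre>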
<p>Why is the time-reversal point important? Because a paragraph later he says this:</p> <blockquote> <p>You know right up until this moment I had this awful suppressed thought somewhere in the back of my mind that the only remaining answer was that my whole universe was a computer simulation like in the book Simulacron 3 but now even that is ruled out because this little toy ISN'T TURING COMPUTABLE! A Turing machine could simulate going back into a defined moment of the past and computing a different future from there, an oracle machine could rely on the halting behavior of lower-order machines, but what you're saying is that reality somehow self-consistently computes in one sweep using information that hasn't… happened… yet..</p> </blockquote> <p>This is COMPLETE NONSENSE (this is also terrible pedagogy again: either you know what Turing computable means or you drown in jargon). For this discussion, Turing computable means 'capable of being calculated using a computer.' The best theory of physics we have (a theory Hariezer already knows about) allows the sort of thing that Hariezer is complaining about. Both quantum mechanics and quantum field theory are Turing computable. That's not to say Hariezer's time machine won't require you to change physics a bit- you definitely will have to, but it's almost certainly computable.</p> <p>Now computable does not mean QUICKLY computable (or even feasibly computable). The new universe might not be computable in polynomial time (quantum field theory may not be; at least one problem within it, the fermion sign problem, is not).</p> <p>I don't think the time machine makes P = NP either. Having a time machine will allow you to speed up computations (you could wait until a computation was done, and then send the answer back in time). However, Hariezer's time machine is limited: it can only be used to move back 6 hours total, and can only be used 3 times in a day, so I don't think it could generally solve an NP-complete problem in polynomial time (after your 6-hour head start is up, things proceed at the original scaling). If you don't know anything about computational complexity, I guess if I get enough asks I can explain it in another, non-Potter post.</p> <p>But my point here is- the author here is supposedly an AI theorist. How is he flubbing computability stuff? This should be bread and butter stuff.</p> <p>I have so much more to say about this chapter. Another post will happen soon.</p> <p>Edit: I wasn't getting the P = NP thing, but I get the argument now (thanks Nostalgebraist). The idea is that you say "I'm going to compute some NP problem and come back with the solution" and then ZIP, out pops another you from the time machine, and hands you a slip of paper with the answer on it. Now you have 6 hours to verify the calculation, and then zip back to give it to your former self.</p> <p>But any problem in NP is checkable in P, so any problem small enough to be checkable in 6 hours (which is a lot of problems, including much of NP) is now computable in no time at all. It's not a general P = NP, but it's much wider in applicability than I was imagining.</p>
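<p>To make the "checkable in P" step concrete with an example of my own choosing (subset sum, a standard NP-complete problem, not anything from the story): finding a satisfying subset may take exponential time, but checking the subset that future-you hands back through the time turner is just a sum and a membership check:</p> <pre><code>
# Subset sum: given a list of numbers and a target, is there a subset
# summing to the target? Finding one is NP-complete; verifying a claimed
# solution (the "slip of paper") takes polynomial time.
def verify_certificate(numbers, target, certificate):
    counts = {}
    for x in numbers:
        counts[x] = counts.get(x, 0) + 1
    for x in certificate:                 # every claimed element must come from the list
        if counts.get(x, 0) == 0:
            return False
        counts[x] -= 1
    return sum(certificate) == target     # and the subset must actually hit the target

numbers = [3, 34, 4, 12, 5, 2]
print(verify_certificate(numbers, 9, [4, 5]))   # True
print(verify_certificate(numbers, 9, [3, 7]))   # False: 7 isn't in the list
</code></pre>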
<h3 id="hpmor-14-continued-comed-tea-newcomb-s-problem-science">HPMOR 14 continued: Comed Tea, Newcomb's problem, science</h3> <p>One of the odd obsessions of LessWrong is an old decision theory problem called Newcomb's Paradox. It goes like this- a super intelligence that consistently predicts correctly challenges you to a game. There are two boxes, A and B. And you are allowed to take one or both boxes.</p> <p>Inside box A is $10, and inside box B the super intelligence has already put $10,000 IF AND ONLY IF it predicted you will only take box B. What box should you take?</p> <p>The reason this is a paradox is that one group of people (call them causal people) might decide that because the super intelligence ALREADY made its call, you might as well take both boxes. You can't alter the past prediction.</p> <p>Other people (call these LessWrongians) might say, 'well, the super intelligence is always right, so clearly if I take box B I'll get more money'. Yudkowsky himself had tried to formalize a decision theory that picks box B, which involves allowing causes to propagate backward in time.</p> <p>A third group of people (call them 'su3su2u1 ists') might say "this problem is ill-posed. The idea of the super-intelligence might well be incoherent, depending on your model of how decisions are made." Here is why- imagine human decisions can be quite noisy. For instance, what if I flip an unbiased coin to decide which box to take? Now the super-intelligence can only have had a 50/50 chance to successfully predict which box I'd take, which contradicts the premise.</p> <p>There is another simple way to show the problem is probably ill-posed. Imagine we take another super-intelligence of the same caliber as the first (call the first 1 and the second 2). 1 offers the same game to 2, and now 2 takes both boxes if it predicts that 1 put the money in box B. It takes only box B if 1 did not put the money in box B. Obviously, either intelligence 1 is wrong, or intelligence 2 is wrong, which contradicts the premise, so the idea must be inconsistent (note, you can turn any person into super-intelligence number 2 by making the boxes transparent).</p>
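<p>The coin-flip objection is easy to make concrete with a toy simulation (all of this is made up for illustration; "accuracy" below just means how often a simulated predictor's guess matches the player's actual choice):</p> <pre><code>
import random

def accuracy(n_games, player, predictor):
    """Fraction of games where the predictor guessed the player's choice."""
    correct = 0
    for _ in range(n_games):
        prediction = predictor()      # the predictor must commit first
        choice = player()
        correct += (prediction == choice)
    return correct / n_games

always_one_box = lambda: "one-box"
coin_flipper = lambda: random.choice(["one-box", "two-box"])

# Against a deterministic one-boxer, a predictor that just says "one-box" is perfect.
print(accuracy(100_000, always_one_box, lambda: "one-box"))   # ~1.0
# Against a coin flipper, any predictor that can't see the flip averages 50%.
print(accuracy(100_000, coin_flipper, lambda: "one-box"))     # ~0.5
</code></pre> <p>Which is the point above: the "always correct" premise quietly assumes the player's choice is predictable at all.</p>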
<p>Anyway, Yudkowsky has a pet decision theory he has tried to formalize that allows causes to propagate backward in time. He likes this approach because you can get the LessWrongian answer to Newcomb every time. The problem is, his formalism has all sorts of problems with inconsistency because of the issues I raised about the inconsistency of a perfect predictor.</p> <p>Why do I bring this up? Because Hariezer decides in this chapter that comed-tea MUST work by causing you to drink it right before something spit-take worthy happens. The tea predicts the humor, and then magics you into drinking it. Of course, he does no experiments to test this hypothesis at all (ironic that just a few chapters ago he lectured Hermione about only doing 1 experiment to test her idea).</p> <p>So unsurprisingly perhaps, the single most used magic item in the story thus far is a manifestation of Yudkowsky's favorite decision theory problem.</p> <p>And my final note from this chapter- Hariezer drops this on us, regarding the brain's ability to understand time travel:</p> <blockquote> <p>Now, for the first time, he was up against the prospect of a mystery that was threatening to be permanent…. it was entirely possible that his human mind never could understand, because his brain was made of old-fashioned linear-time neurons, and this had turned out to be an impoverished subset of reality.</p> </blockquote> <p>This seems to misunderstand the nature of mathematics and its relation to science. I can't visualize a 4-dimensional curved space, certainly not the way I visualize 2d and 3d objects. But that doesn't stop me from describing it and working with it as a mathematical object.</p> <p>Time is ALREADY very strange and impossible to visualize. But mathematics allows us to go beyond what our brain can visualize to create notations and languages that let us deal with anything we can formalize and that has consistent rules. It's amazingly powerful.</p> <p>I never thought I'd see Hariezer Yudotter, who just a few chapters back was claiming science could let us perfectly manipulate and control people (better than an Imperio curse, or whatever the spell is that lets you control people), argue that science/mathematics couldn't deal with non-linear time.</p> <p>I hope that this is a moment where in later chapters we see growth from Yudotter, and he revisits this last assumption. And I hope he does some experiments to test his comed-tea hypothesis. Right now it seems like experiments are things Hariezer asks people around him to do (so they can see things his way), but for him pure logic is good enough.</p> <p>Chapter summary: I drink three glasses of scotch. Hariezer gets a time machine.</p> <h3 id="hpmor-15-in-which-once-again-i-want-more-science">HPMOR 15: In which, once again, I want more science</h3> <p>I had a long post but the internet ate it earlier this week, so this is try 2. I apologize in advance, this blog post is mostly me speculating about some magi-science.</p> <p>This chapter begins the long-awaited lessons in magic. The topic of today's lesson consists primarily of one thing: don't transfigure common objects into food or drink.</p> <blockquote> <p>Mr. Potter, suppose a student Transfigured a block of wood into a cup of water, and you drank it. What do you imagine might happen to you when the Transfiguration wore off?" There was a pause. "Excuse me, I should not have asked that of you, Mr. Potter, I forgot that you are blessed with an unusually pessimistic imagination -" &quot;I'm fine,&quot; Harry said, swallowing hard. &quot;So the first answer is that I don't know," the Professor nodded approvingly, "but I imagine there might be… wood in my stomach, and in my bloodstream, and if any of that water had gotten absorbed into my body's tissues - would it be wood pulp or solid wood or…" Harry's grasp of magic failed him. He couldn't understand how wood mapped into water in the first place, so he couldn't understand what would happen after the water molecules were scrambled by ordinary thermal motions and the magic wore off and the mapping reversed.</p> </blockquote> <p>We get a similar warning regarding transfiguring things into any gases or liquids:</p> <blockquote> <p>You will absolutely never under any circumstances Transfigure anything into a liquid or a gas. No water, no air. Nothing like water, nothing like air. Even if it is not meant to drink. Liquid evaporates, little bits and pieces of it get into the air.</p> </blockquote> <p>Unfortunately, once again, I want the author to take it farther. Explore some actual science! What WOULD happen if that wood-water turned back into wood in your system?</p> <p>So let's take a long walk off a short speculative pier together, and try to guess what might happen. First, we'll assume magic absorbs any major energy differences and smooths over any issues at the time of transition. Otherwise, when you magic in large wood molecules in place of much smaller water molecules, there will suddenly be lots of energy from the molecules repelling each other (this is called a steric mismatch), which will likely cause all sorts of problems (like a person exploding).</p> <p>To even begin to answer, we have to pick a rule for the transition. Let's assume each water molecule turns into one "wood molecule" (wood is ill-defined on a molecular scale, it's made up of lots of shit. However, that shit is mostly long carbohydrate chains called polysaccharides.)</p> <p>So you'd drink the water, which gets absorbed pretty quickly by your body (any that's lingering in your gut unabsorbed will just turn into more fiber in your diet). After a while, it would spread through your body, be taken up by your cells, and then these very diffuse water molecules would turn into polysaccharides. Luckily for you, your body probably knows how to deal with this; polysaccharides are hanging out all over your cells anyway. Maybe somewhat surprisingly, you'd probably be fine.
I think for lots of organic material, swapping out one organic molecule with another is likely to not harm you much. Of course, if the thing you swap in is poison, thats another story.</p> <p>Now, I’ve cheated somewhat- I could pick another rule where you’d definitely die. Imagine swapping in a whole splinter of wood for each water molecule. You’d be shredded. The details of magic matter here, so maybe a future chapter will give us the info needed to revisit this.</p> <p>What if instead of wood, we started with something inorganic like gold? If the water molecules turn into elemental gold (and you don’t explode from steric mismatches mentioned above), you’d be fine as long as the gold didn’t ionize. Elemental gold is remarkably stable, and it takes quite a bit of gold to get any heavy metal poisoning from it.</p> <p>On the other hand, if it ionizes you’ll probably die. Gold salts (which split into ionic gold + other stuff in your system) have a semi-lethal doses (the dose that kills half of the people who take it) of just a few mg per kg, so a 70 kg person couldn’t survive more than 5g or so of the salt, which is even less ionic gold. So in this case, as soon as the spell wore off you’d start to be poisoned. After a few hours, you’d probably start showing signs of liver failure (jaundice, etc).</p> <p>Water chemistry/physics is hard, so I have no idea if the gold atoms will actually ionize. Larger gold crystals definitely would not, and they are investigating using gold nanoparticles for medicine, which are also mostly non-toxic. However, individual atoms might still ionize.</p> <p>What if we don’t drink the water? What if we just get near a liquid evaporating? Nothing much at all, as it turns out. Evaporation is a slow process, as is diffusion.</p> <p>Diffusion constants are usually a few centimeters^2 per second, and diffusion is a slow process that moves forward with the square root of time (to move twice as far it takes 4 times as much time).</p> <p>So even if the transformation into water lasts a full hour, a single water molecule that evaporates from the glass will travel less than 100 centimeters! So unless you are standing with your face very close to the glass, you are unlikely to encounter even a single evaporated molecule. Even with your face right near the glass, that one molecule will mostly likely just be breathed in and breathed right back out. You have a lot of anatomic dead-space in your lungs in which no exchange takes place, and the active area is optimized for picking up oxygen.</p> <p>So how about transfiguring things to a gas? What happens there? Once again, this will depend on how we choose the rules of magic. When you make the gas, does it come in at room temperature and pressure? If so, this sets the density. Then you can either bring in an equal volume of gas to the original object with very few molecules, or you bring in an equal number of molecules, with a very large density.</p> <p>At an equal number of molecules, you’ll get hundreds of liters of diffuse gas. Your lungs are only hold about 5 liters, so you are going to get a much smaller dose then you’d get from the water (a few percent at best), where all the molecules get taken up by your body. 
<p>If it’s equal volume to the original object, then there will be very few gas molecules in a small volume, and the diffusion argument applies- unless you get very near where you created the gas, you aren’t likely to breathe any in.</p> <p>Thus concludes a bit of speculative magi-science guesswork. Sorry if I bored you.</p> <p>Anyway- this chapter, I admit, intrigued me enough to spend some time thinking about what WOULD happen if something un-transfigured inside you. Not a bad chapter, really, but it again feels a tad lazy. We get some hazy worries about liquids evaporating (SCIENCE!) but no order-of-magnitude estimate about whether or not it matters (it does not, unless maybe you boiled the liquid you made). There are lots of scientific ideas the author could play with, but they just get set aside.</p> <p>As for the rest of the chapter, Hariezer gets shown up by Hermione, who is outperforming him and has already read her school books. A competition for grades is launched.</p> <h3 id="hpmor-16-shades-of-ender-s-game">HPMOR 16: Shades of Ender’s game</h3> <p>I apologize for the longish break from HPMOR, sometimes my real job calls.</p> <p>This chapter mostly comprises Hariezer’s first defense against the dark arts class. We meet the ultra-competent Quirrell (although I suppose, like in the original, it’s really the ultra-competent Voldemort) for the first time.</p> <p>The lessons open with a bit of surprising anti-academic sentiment- Quirrell gives a long speech about how you needn’t learn to defend yourself against anything specific in the wizarding world because you could either magically run away or just use the instant killing spell, so the entire “Ministry-mandated” course with its “useless” textbooks is unnecessary. Of course, this comes from the mouth of the ostensible bad guy, so it’s unclear how much we are supposed to be creeped out by this sentiment (though Hariezer applauds).</p> <p>After this, we get to the lesson. After teaching a light attack spell, Quirrell asks Hermione (who mastered it fastest) to attack another student. She refuses, so Quirrell moves on to Malfoy, who is quick to acquiesce by shooting Hermione.</p> <p>Then Quirrell puts Hariezer on the spot and things get sort of strange. When asked for unusual combat uses of everyday items, Hariezer comes up with a laundry list of outlandish ways to kill people, which leads Quirrell to observe that for Hariezer Yudotter nothing is defensive- he settles only for the destruction of his enemy. This feels very Ender’s Game (Hariezer WINS, and that makes him dangerous), and it’s sort of a silly moment.</p> <p>Chapter summary: weirdly anti-academic defense against the dark arts lesson. We once more get magic, but no rules of magic.</p> <h3 id="hpmor-17-introducing-dumbledore-and-some-retreads-on-old-ideas">HPMOR 17: Introducing Dumbledore, and some retreads on old ideas</h3> <p>This chapter opens with a little experiment with Hariezer trying to use the time turner to verify an NP-complete problem, as we discussed in a previous chapter section. Since it’s old ground, we won’t retread it.</p> <p>From here, we move on to the first broomstick lesson, which proceeds much like the book, only with shades of elitism.
Hariezer drops this nugget on us:</p> <blockquote> <p>There couldn’t possibly be anything he could master on the first try which would baffle Hermione, and if there was and it turned out to be broomstick riding instead of anything intellectual, Harry would just die.</p> </blockquote> <p>Which feels a bit like the complete dismissal of Ron earlier. So the anti-jock Hariezer, who wouldn’t be caught dead being good at broomsticking, doesn’t get involved in racing around to try to get Neville’s Remembrall; instead, the entire class ends up in a standoff, wands drawn. So Hariezer challenges the Slytherin who has it to a strange duel. Using his time turner in proper Bill and Ted fashion, he hides a decoy Remembrall and wins. It’s all old stuff at this point, and I’m starting to worry there is nothing new under the sun- more time turner, more Hariezer winning (in case we don’t get it, there is a conversation with McGonagall where Hariezer once more realizes he doesn’t even consider NOT winning).</p> <p>AND THEN we meet Dumbledore, who is written as a lazy man’s version of insane. He’ll say something insightful, drop a Lord of the Rings quote and then immediately do something batshit. One moment he is trying to explain that Harry can trust him, the next he is setting a chicken on fire (yes, this happens). In one baffling moment, he presents Hariezer with a big rock, and this exchange happens:</p> <blockquote> <p>So… why do I have to carry this rock exactly?” &quot;I can’t think of a reason, actually,&quot; said Dumbledore. &quot;…you can’t.&quot; Dumbledore nodded. “But just because I can’t think of a reason doesn’t mean there is no reason.” &quot;Okay,&quot; said Harry, &quot;I’m not even sure if I should be saying this, but that is simply not the correct way to deal with our admitted ignorance of how the universe works.&quot;</p> </blockquote> <p>Now, if someone gave you a large heavy rock and said “keep this on you, just in case,” how would you begin to tell them they’re wrong? Here is Hariezer’s approach:</p> <blockquote> <p>How can I put this formally… um… suppose you had a million boxes, and only one of the boxes contained a diamond. And you had a box full of diamond-detectors, and each diamond-detector always went off in the presence of a diamond, and went off half the time on boxes that didn’t have a diamond. If you ran twenty detectors over all the boxes, you’d have, on average, one false candidate and one true candidate left. And then it would just take one or two more detectors before you were left with the one true candidate. The point being that when there are lots of possible answers, most of the evidence you need goes into just locating the true hypothesis out of millions of possibilities - bringing it to your attention in the first place. The amount of evidence you need to judge between two or three plausible candidates is much smaller by comparison. So if you just jump ahead without evidence and promote one particular possibility to the focus of your attention, you’re skipping over most of the work.</p> </blockquote> <p>Thank God Hariezer was able to use his advanced reasoning skills to make an analogy with diamonds in boxes to explain WHY THE IDEA OF CARRYING A ROCK AROUND FOR NO REASON IS STUPID. This was the chapter’s rationality idea- seriously, it’s like Yudkowsky didn’t even try on this one.</p>
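<p>As an aside, the arithmetic in that diamond-detector quote does at least check out; here’s a quick check, using only the numbers from the quote (the code is just my own sketch):</p> <pre><code># a million boxes, one diamond; each detector fires on every diamond
# and on half of the diamond-less boxes
n_boxes = 1_000_000
false_positive_rate = 0.5
detectors = 20

expected_false = (n_boxes - 1) * false_positive_rate ** detectors
print(expected_false)   # ~0.95: on average about one false candidate survives,
                        # plus the one true diamond box, matching the quote
</code></pre>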
<p>Chapter summary: Hariezer sneers at broomstick riding, some (now standard) time turner hijinks, and Hariezer meets a more insane than wise Dumbledore.</p> <h3 id="hpmor-18-what">HPMOR 18: What?</h3> <p>I wanted to discuss the weird anti-university/school-system undercurrents of the last few chapters, but I started into chapter 18 and it broke my brain.</p> <p>This chapter is absolutely ludicrous. We meet Snape for the first time, and he behaves as you’d expect from the source material. He makes a sarcastic remark and asks Hariezer a bunch of questions Hariezer does not know the answer to.</p> <p>This leads to Hariezer flipping out:</p> <blockquote> <p>The class was utterly frozen. &quot;Detention for one month, Potter,&quot; Severus said, smiling even more broadly. &quot;I decline to recognize your authority as a teacher and I will not serve any detention you give.&quot; People stopped breathing. Severus’s smile vanished. “Then you will be -” his voice stopped short. &quot;Expelled, were you about to say?&quot; Harry, on the other hand, was now smiling thinly. &quot;But then you seemed to doubt your ability to carry out the threat, or fear the consequences if you did. I, on the other hand, neither doubt nor fear the prospect of finding a school with less abusive professors. Or perhaps I should hire private tutors, as is my accustomed practice, and be taught at my full learning speed. I have enough money in my vault. Something about bounties on a Dark Lord I defeated. But there are teachers at Hogwarts who I rather like, so I think it will be easier if I find some way to get rid of you instead.”</p> </blockquote> <p>Think about this- THE ONLY THINGS SNAPE HAS DONE are make a snide comment and ask Hariezer a series of questions he doesn’t know the answer to.</p> <p>The situation continues to escalate, until Hariezer locks himself in a closet and uses his invisibility cloak and time turner to escape the classroom.</p> <p>This leads to a meeting with the headmaster where Hariezer THREATENS TO START A NEWSPAPER CAMPAIGN AGAINST SNAPE (find a newspaper interested in the ‘some students think a professor is too hard on them; for instance, he asked Hariezer Yudotter 3 hard questions in a row’ story).</p> <p>AND EVERYONE TAKES THIS THREAT SERIOUSLY, AS IF IT COULD DO REAL HARM. HARIEZER REPEATEDLY SAYS HE IS PROTECTING STUDENTS FROM ABUSE. THEY TAKE THIS THREAT SERIOUSLY ENOUGH THAT HARIEZER NEGOTIATES A TRUCE WITH SNAPE AND DUMBLEDORE. Snape agrees to be less demanding of discipline, Hariezer agrees to apologize.</p> <p>Nowhere in this chapter does Hariezer consider that he deprived other students of the damn potions lesson. In his ruminations about why Snape keeps his job, he never considers that maybe Snape knows a lot about potions/is actually a good potions teacher.</p> <p>This whole chapter is basically a stupid power struggle that requires literally everyone in the chapter to behave in outrageously silly ways.
Hariezer throws a temper tantrum befitting a 2-year-old, and everyone else gives him his way.</p> <p>On the plus side, McGonagall locks down Hariezer’s time turner, so hopefully that device will stop making an appearance for a while; it’s been the “clever” solution to every problem for several chapters now.</p> <p>One more chapter this bad and I might have to abort the project.</p> <h3 id="hpmor-19-i-can-t-even-what">HPMOR 19: I CAN’T EVEN… WHAT?…</h3> <p>I… this… what…</p> <p>So there is a lot I COULD say here, about inconsistent characterization, ridiculously contrived events, etc. But fuck it- here is the key event of this chapter: Quirrell, it turns out, is quite the martial artist (because of course he is, who gives a fuck about genre consistency or unnecessary details, PILE IN MORE “AWESOME”). The lesson he claims to have learned from martial arts (at a mysterious dojo, because of course) that Hariezer needs to learn (as evidenced by his encounter with Snape) is how to lose.</p> <p>How does Quirrell teach Hariezer “how to lose”? He calls Hariezer to the front, insists Hariezer not defend himself, and then has a bunch of Slytherins beat the shit out of him.</p> <p>That’s right- a character who one fucking chapter ago couldn’t handle being asked three hard questions in a row (IT’S ABUSE, I’LL CALL THE PAPERS) submits to being literally beaten by a gang at a teacher’s suggestion.</p> <p>An 11-year-old kid, at a teacher’s suggestion, submits to getting beaten by a bunch of 16-year-olds. All of this is portrayed in a positive light.</p> <h3 id="ideas-around-chapter-19">Ideas around chapter 19</h3> <p>In light of the recent anon, I’m going to attempt to give the people (person?) what they want. Also, I went from not caring if people were reading this, to being a tiny bit anxious I’ll lose the audience I unexpectedly picked up. SELLING OUT.</p> <p>If we ignore the literal child abuse of the chapter, the core of the idea is still somewhat malignant. It’s true throughout that Hariezer DOES have a problem with “knowing how to lose,” but the way you learn to lose is by losing, not by being ordered to take a beating.</p> <p>Quirrell could have challenged Hariezer to a game of chess, he could have asked questions Hariezer didn’t know the answer to (as Snape did, which prompted the insane chapter 18), etc. But the problem is the author is so invested in Hariezer being the embodiment of awesome that even when he needs to lose for story purposes, to learn a lesson, Yudkowsky doesn’t want to let Hariezer actually lose at something. Instead he gets ordered to lose, and he isn’t ordered to lose at something in his wheelhouse, but in the “jock-stuff” repeatedly sneered at in the story (physical confrontation).</p> <h3 id="hpmor-20-why-is-this-chapter-called-bayes-theorem">HPMOR 20: why is this chapter called Bayes Theorem?</h3> <p>A return to what passes for “normal.” No child beating in this chapter, just a long, boring conversation.</p> <p>This chapter opens with Hariezer ruminating about how much taking that beating sure has changed his life. He knows how to lose now, he isn’t going to become a dark lord now! Quirrell quickly takes him down a peg:</p> <blockquote> <p>&quot;Mr. Potter,&quot; he said solemnly, with only a slight grin, &quot;a word of advice. There is such a thing as a performance which is too perfect. Real people who have just been beaten and humiliated for fifteen minutes do not stand up and graciously forgive their enemies.
It is the sort of thing you do when you’re trying to convince everyone you’re not Dark, not -“</p> </blockquote> <p>Hariezer protests, and we get:</p> <blockquote> <p>There is nothing you can do to convince me because I would know that was exactly what you were trying to do. And if we are to be even more precise, then while I suppose it is barely possible that perfectly good people exist even though I have never met one, it is nonetheless improbable that someone would be beaten for fifteen minutes and then stand up and feel a great surge of kindly forgiveness for his attackers. On the other hand it is less improbable that a young child would imagine this as the role to play in order to convince his teacher and classmates that he is not the next Dark Lord. The import of an act lies not in what that act resembles on the surface, Mr. Potter, but in the states of mind which make that act more or less probable</p> </blockquote> <p>How does Hariezer take this? Does he point out “if no evidence can sway your priors, your priors are too strong?” or some other bit of logic-chop Bayes-judo? Nope, he drops some nonsensical jargon:</p> <blockquote> <p>Harry blinked. He’d just had the dichotomy between the representativeness heuristic and the Bayesian definition of evidence explained to him by a wizard.</p> </blockquote> <p>Where is Quirrell using Bayesian evidence? He isn’t, he is neglecting all evidence because all evidence fits his hypothesis. Where does the representativeness heuristic come into play? It doesn’t.</p> <p>The representativeness heuristic is making estimates based on how typical of a class something is, e.g., show someone a picture of a stereotypical ‘nerd’ and ask “is this person more likely an English or a physics grad student?” The representativeness heuristic says “you should answer physics.” It’s a good rule of thumb that psychologists think is probably hardwired into us. It also leads to some well-known fallacies I won’t get into here.</p> <p>Quirrell is of course doing none of that- Quirrell has a hypothesis that fits anything Hariezer could do, so no amount of evidence will dissuade him.</p> <p>After this, Quirrell and Hariezer have a long talk about science (because of course Quirrell too has a fascination with space travel). This leads to some real Less Wrong stuff.</p> <p>Quirrell tells us that of course muggle scientists are dangerous because:</p> <blockquote> <p>There are gates you do not open, there are seals you do not breach! The fools who can’t resist meddling are killed by the lesser perils early on, and the survivors all know that there are secrets you do not share with anyone who lacks the intelligence and the discipline to discover them for themselves!</p> </blockquote> <p>And of course, Hariezer agrees:</p> <blockquote> <p>This was a rather different way of looking at things than Harry had grown up with. It had never occurred to him that nuclear physicists should have formed a conspiracy of silence to keep the secret of nuclear weapons from anyone not smart enough to be a nuclear physicist</p> </blockquote> <p>Which is a sort of weirdly elitist position- after all, lots of nuclear physicists are plenty dangerous. It’s not intelligence that makes you less likely to drop a bomb. But this fits the general Yudkowsky/AI fear- an open research community is less important than hiding dangerous secrets.
This isn’t necessarily the wrong position, but it’s a challenging one that merits actual discussion.</p> <p>Anyone who has done research can tell you how important the open flow of ideas is for progress. I’m of the opinion that the increasing privatization of science is actually slowing us down in a lot of ways by building silos around information. How much do we retard progress in order to keep dangerous ideas out of people’s hands? Who gets to decide what is dangerous? Who decides who gets let into “the conspiracy?” Intelligence alone is no guarantee someone won’t drop a bomb, despite how obvious it seems to Quirrell and Yudotter.</p> <p>After this digression about nuclear weapons, we learn from Quirrell that he snuck into NASA and enchanted the Pioneer gold plaque in a way that will “make it last a lot longer than it otherwise would.” It’s unclear to me what wear and tear Quirrell is protecting the plaque from. Hariezer suggests that Quirrell might have snuck a magic portrait or a ghost into the plaque, because nothing makes more sense than dooming an (at least semi-)sentient being to a near eternity of solitary confinement.</p> <p>Anyway, partway through this chapter, Dumbledore bursts in, angry that Quirrell had Hariezer beaten. Hariezer defends him, etc. The resolution is that it’s agreed Hariezer will start learning to protect himself from mind readers.</p> <p>Chapter summary- long, mostly boring conversation, peppered with some existential risk/we need to escape the planet rhetoric. It’s also called Bayes theorem despite that theorem making no appearance whatsoever.</p> <p>And a note on the really weird pedagogy- we now have Quirrell, who in the books is possessed by Voldemort, acting as a mouthpiece for the author. This seems like a bad choice, because at some point I assume there will be a reveal, and it will turn out the reader shouldn’t have trusted Quirrell.</p> <h3 id="hpmor-21-secretive-science">HPMOR 21: secretive science</h3> <p>So this chapter begins quite strangely- Hermione is worried that she is “bad” because she is enjoying being smarter than Hariezer. She then decides that she isn’t “bad”, it’s a budding romance. That’s the logic she uses. But because she won the book-reading contest against Hariezer (he doesn’t flip out, it must be because he learned “how to lose”), she gets to go on a date with him. The date is skipped over.</p> <p>Next we find Hariezer meeting Malfoy in a dark basement, discussing how they will go about doing science. Malfoy is written as uncharacteristically stupid, in order to be a foil once more for Hariezer, peppering the conversation with such gems as:</p> <blockquote> <p>Then I’ll figure out how to make the experimental test say the right answer!</p> <p>&quot;You can always make the answer come out your way,” said Draco. That had been practically the first thing his tutors had taught him. “It’s just a matter of finding the right arguments.”</p> </blockquote> <p>We get a lot of platitudes from Hariezer about how science humbles you before nature. But then we get the same ideas Quirrell suggested previously: because “science is dangerous”, they are going to run their research program as a conspiracy.</p> <blockquote> <p>&quot;As you say, we will establish our own Science, a magical Science, and that Science will have smarter traditions from the very start.” The voice grew hard.
“The knowledge I share with you will be taught alongside the disciplines of accepting truth, the level of this knowledge will be keyed to your progress in those disciplines, and you will share that knowledge with no one else who has not learned those disciplines. Do you accept this?”</p> </blockquote> <p>And the name of this secretive scienspiracy?</p> <blockquote> <p>And standing amid the dusty desks in an unused classroom in the dungeons of Hogwarts, the green-lit silhouette of Harry Potter spread his arms dramatically and said, “This day shall mark the dawn of… the Bayesian Conspiracy.”</p> </blockquote> <p>Of course. As I mentioned in the previous chapter, anyone who has done science knows that it’s a collaborative process that requires an open exchange of ideas.</p> <p>And see what I mean about the melding of ideas between Quirrell and Hariezer? It’s weird to use them both as author mouthpieces. The Bayesian Conspiracy is obviously an idea Yudkowsky is fond of, and here Hariezer gets the idea largely from Quirrell just one chapter back.</p> <h3 id="hpmor-22-science">HPMOR 22: science!</h3> <p>This chapter opens strongly enough. Hariezer decides that the entire wizarding world has probably been wrong about magic, and doesn’t know the first thing about it.</p> <p>Hermione disagrees, and while she doesn’t outright say “maybe you should read a magical theory book about how spells are created” (such a thing must exist), she is at least somewhat down that path.</p> <p>To test his ideas, Hariezer creates a single-blind test- he gets spells from a book, changes the words or the wrist motion or whatnot, and gets Hermione to cast them. Surprisingly, Hariezer is proven wrong by this little test. For once, the world isn’t written as insane as a foil for our intrepid hero.</p> <blockquote> <p>It seemed the universe actually did want you to say ‘Wingardium Leviosa’ and it wanted you to say it in a certain exact way and it didn’t care what you thought the pronunciation should be any more than it cared how you felt about gravity.</p> </blockquote> <p>There are a few anti-academic snipes, because it wouldn’t be HPMOR without a little snide swipe at academia:</p> <blockquote> <p>But if my books were worth a carp they would have given me the following important piece of advice…Don’t worry about designing an elaborate course of experiments that would make a grant proposal look impressive to a funding agency.</p> </blockquote> <p>Weird little potshots about academia (comments like “so many bad teachers, its like 8% as bad as Oxford,” “Harry was doing better in classes now, at least the classes he considered interesting”) have been peppered throughout the chapters since Hariezer arrived at Hogwarts. Oh academia, always trying to make you learn things that might be useful, even if they are a trifle boring. So full of bad teachers, etc. Just constant little comments attacking school and academia.</p> <p>Anyway, this chapter would be one of the strongest chapters, except there is a second half. In the second half, Hariezer partners with Draco to get to the bottom of wizarding blood purity.</p> <blockquote> <p>Harry Potter had asked how Draco would go about disproving the blood purist hypothesis that wizards couldn’t do the neat stuff now that they’d done eight centuries ago because they had interbred with Muggleborns and Squibs.</p> </blockquote> <p>Here is the thing about science: step 0 needs to be making sure you’re trying to explain a real phenomenon.
Hariezer knows this; he tells the story of N-rays earlier in the chapter, but completely fails to understand the point.</p> <p>Hariezer and Draco have decided, based on one anecdote (the founders of Hogwarts were the best wizards ever, supposedly), that wizards are weaker today than in the past. The first thing they should do is find out if wizards are actually getting weaker. After all, the two most dangerous dark wizards ever were both recent, Grindelwald and Voldemort. Dumbledore is no slouch. Even four students were able to make the Marauder’s Map just one generation before Harry. (Incidentally, this is exactly where neoreactionaries often go wrong- they assume things are getting worse without actually checking, and then create elaborate explanations for non-existent facts).</p> <p>Anyway, for the purposes of the story, I’m sure it’ll turn out that wizards are getting weaker, because Yudkowsky wrote it. But this would have been a great chance to teach an actually useful lesson, and it would make the N-ray story told earlier a useful example, and not a random factoid.</p> <p>Anyway, to explain the effect, they come up with a few obvious hypotheses:</p> <ol> <li>Magic itself is fading.</li> <li>Wizards are interbreeding with Muggles and Squibs.</li> <li>Knowledge to cast powerful spells is being lost.</li> <li>Wizards are eating the wrong foods as children, or something else besides blood is making them grow up weaker.</li> <li>Muggle technology is interfering with magic. (Since 800 years ago?)</li> <li>Stronger wizards are having fewer children. (Draco = only child? Check if 3 powerful wizards, Quirrell / Dumbledore / Dark Lord, had any children.)</li> </ol> <p>They miss some other obvious ones (there is a finite amount of magic power, so increasing populations = more wizards = less power per wizard, for instance. Try to come up with your own, it’s easy and fun).</p> <p>They come up with some ways to collect some evidence- find out what the first-year curriculum was throughout Hogwarts history, and do some wizard genealogy by talking to portraits.</p> <p>Still, finally some science, even if half of it was infuriating.</p> <h3 id="hpmor-23-wizarding-genetics-made-way-too-simple">HPMOR 23: wizarding genetics made (way too) simple</h3> <p>Alright, I need to preface this: I have the average particle physicist’s knowledge of biology (a few college courses, long ago mostly forgotten). That said, the Lagavulin is flowing, so I’m going to pontificate as if I’m obviously right; please reblog me with corrections if I am wrong.</p> <p>In this chapter, Hariezer and Draco are going to explore what I think of as the blood hypothesis- that wizardry is carried in the blood, and that intermarriage with non-magical types is diluting wizardry.</p> <p>Hariezer gives Draco a brief, serviceable enough description of DNA (more like pebbles than water). He lays out two models. In the first, there are lots of wizarding genes, and the more wizard genes you have, the more powerful the wizard you are. In this case, Hariezer reasons, as powerful wizards marry less powerful wizards, or non-magical types, the frequency of the magical variant of wizard genes in the general population becomes diluted. In this model, two squibs might rarely manage to have a wizard child, but they are likely to be weaker than wizard-born wizards. Call this model 1.</p> <p>The other model Hariezer lays out is that magic lies on a single recessive gene.
He reasons that squibs have one dominant, non-magical version and one recessive, magical version of the gene. So of kids born to two squibs, 1/4 will be wizards. In this version, you either have magic or you don’t, so if wizards married the non-magical, wizards themselves could become more rare, but the power of wizards wouldn’t be diluted. Call this model 2.</p> <p>The proper test between models 1 and 2, suggests Hariezer, is to look at the children born to two squibs. If about one-fourth of them are wizards, it’s evidence for model 2; otherwise, it’s evidence for model 1.</p> <p><strong>There is a huge problem with this. Do you see it? Here is a hint: what other predictions does model 2 make? While you are thinking about it, read on.</strong></p> <p>Before I answer the question, I want to point out that Hariezer ignores tons of other plausible models. Here is one I just made up. Imagine, for instance, a single gene that switches magic on and off, and a whole series of other genes that make you a better wizard. Maybe some double-jointed-wrist gene allows you to move your wand in unusually deft ways. Maybe some mouth-shape gene allows you to pronounce magical sounds no one else can. In this case, magical talent can be watered down as in model 1, and wizard inheritance could still look like Mendel would suggest, as in model 2.</p> <p><strong>Alright, below I’m going to answer my query above. Soon there will be no time to figure it out for yourself.</strong></p> <p>Squibs are, by definition, the non-wizard children of wizard parents. Hariezer’s model 2 predicts that squibs cannot exist. It is already empirically disproven.</p> <p>Hariezer, of course, does not notice this massive problem with his favored model, and Draco’s collected genealogy suggests about 6 out of 28 squib-born children were wizards, so he declares model 2 wins the test.</p> <p>Draco flips out, because now that he “knows” that magic isn’t being watered down by breeding he can’t join the Death Eaters and his whole life is ruined, etc. Hariezer is happy that Draco has “awakened as a scientist” (I hadn’t complained about the stilted language in a while, just reminding you that it’s still there), but Draco lashes out and casts a torture spell and locks Hariezer in the dungeon. After some failed escape attempts, he once again resorts to the time turner, because even now that it’s locked down, it’s the solution to every problem.</p> <p>One other thing of note- to investigate the hypothesis that really strong spells can’t be cast anymore, Hariezer tries to look up a strong spell and runs into “the interdict of Merlin”: strong spells can’t be written down, only passed from wizard to wizard.</p> <p>It’s looking marginally possible that it will turn out that this natural secrecy is exactly what’s killing off powerful magic- it’s not open, so ideas aren’t flourishing or being passed on. Hariezer will notice that and realize his “Bayesian Conspiracy” won’t be as effective as an open science culture, and I’ll have to take back all of my criticisms around secretive science (it will be a lesson Hariezer learns, and not an idea Hariezer endorses). It seems more likely, however, given the author’s existential risk concerns, that this interdict of Merlin will be endorsed.</p>
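<p>Before moving on, here’s a quick sketch of the genetics argument in this chapter. The 6-out-of-28 figure is from the story; the simple one-gene model and the code are just my own illustration of the two points above (the squib-cross prediction, and the prediction the story ignores):</p> <pre><code>from math import comb
from itertools import product

# model 2 as described: one gene, "M" = non-magic allele, "m" = magic allele
# wizards are mm, squibs are Mm, muggles are MM
def children(parent1, parent2):
    return ["".join(pair) for pair in product(parent1, parent2)]

squib_kids = children("Mm", "Mm")
wizard_kids = children("mm", "mm")

print(sum(set(kid) == {"m"} for kid in squib_kids) / len(squib_kids))
# 0.25: squib x squib gives 1/4 wizard children, the prediction Hariezer tests

print(all(set(kid) == {"m"} for kid in wizard_kids))
# True: wizard x wizard gives only wizard children, i.e. the model leaves
# no room for squibs to exist in the first place

# Draco's data: 6 wizards out of 28 squib-born children. Consistent with 1/4?
n, k, p = 28, 6, 0.25
pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
print(n * p, sum(pmf[: k + 1]))
# expectation is 7; the chance of seeing 6 or fewer is roughly 0.4, so the data
# really is consistent with the one-gene model -- the problem is the squib
# prediction above, not the arithmetic
</code></pre>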
<h3 id="some-more-notes-regarding-hpmor">Some more notes regarding HPMOR</h3> <p>There is a line in the movie Clueless (if you aren’t familiar, Clueless was an older generation’s Mean Girls) where a woman is described as a “Monet”- in that, like the painting, she looks good from afar but up close is a mess.</p> <p>So I’m now nearly 25 chapters into this thing, and I’m starting to think that HPMOR is this sort of a Monet- if you let yourself get carried along, it seems okay enough. It references a lot of things that a niche group of people, myself included, like (physics! computational complexity! genetics! psychology!). But as you stare at it more, you start noticing that it doesn’t actually hang together; it’s a complete mess.</p> <p>The hard science references are subtly wrong, and often aren’t actually explained in-story (just a jargon dump to say ‘look, here is a thing you like’).</p> <p>The social science stuff fares a bit better (it’s less wrong ::rimshot::), but even when its explanation is correct, its power is wildly exaggerated- conversations between Quirrell/Malfoy/Potter seem to follow scripts of the form:</p> <p>&quot;Here is an awesome manipulation I’m using against you&quot;</p> <p>&quot;My, that is an effective manipulation. You are a dangerous man&quot;</p> <p>&quot;I know, but I also know that you are only flattering me as an attempt to manipulate me.&quot;</p> <p>&quot;My, what an effective use of Bayesian evidence that is!&quot;</p> <p>Other characters get even worse treatment, either behaving nonsensically to prove how good Harry is at manipulation (as in the chapter where Harry tells off Snape and then tries to blackmail the school because Snape asked him questions he didn’t know the answers to), OR acting nonsensically so Harry can explain why it’s nonsensical (“Carry this rock around for no reason.” “That’s actually the fallacy of privileging the hypothesis.”) The social science/manipulation/marketing psychology stuff is just a flavoring for conversations.</p> <p>No important event in the story has hinged on any of this rationality- instead, basically every conflict thus far is resolved via the time turner.</p> <p>And if you strip all this out, all the wrongish science-jargon and the conversations that serve no purpose but to prove Malfoy/Quirrell/Harry are “awesome” by having them repeatedly think/tell each other how awesome they are, the story has no real structure. It’s just a series of poorly paced, disconnected events (if you strip out the “awesome” conversations, there are many chapters where nothing happens). There is no there there.</p> <h3 id="hpmor-24-evopsych-rorschach-test">HPMOR 24: evopsych Rorschach test</h3> <p>Evolutionary psychology is a field that famously has a pretty poor bullshit filter. Satoshi Kanazawa once published a series of articles claiming that beautiful people will have more female children (because beauty is more important for girls) and engineers/mathematicians will have more male children (because only men need the logic-brains). The only thing his papers proved was that he is bad at statistics (in fact, Kanazawa made an entire career out of being bad at statistics, such is the state of evo-psych).</p> <p>One of the core criticisms is that for any fact observed in the world, you can tell several different evolutionary stories, and there is no real way to tell which, if any, is actually true.
Because of this, when someone gives you an evopsych explanation for something, it’s often telling you more about what they believe than about science or the world (there are exceptions, but they are rare).</p> <p>So this chapter is a long, pretty much useless conversation between Draco and Hariezer about how they are manipulating each other and Dumbledore or whatever, but smack in the middle we get this rumination:</p> <blockquote> <p>In the beginning, before people had quite understood how evolution worked, they’d gone around thinking crazy ideas like human intelligence evolved so that we could invent better tools.</p> <p>The reason why this was crazy was that only one person in the tribe had to invent a tool, and then everyone else would use it…the person who invented something didn’t have much of a fitness advantage, didn’t have all that many more children than everyone else. [<strong>SU comment- could the inventor of an invention perhaps get to occupy a position of power within a tribe? Could that lead to them having more wealth and children?</strong>]</p> <p>It was a natural guess… A natural guess, but wrong.</p> <p>Before people had quite understood how evolution worked, they’d gone around thinking crazy ideas like the climate changed, and tribes had to migrate, and people had to become smarter in order to solve all the novel problems.</p> <p>But human beings had four times the brain size of a chimpanzee. 20% of a human’s metabolic energy went into feeding the brain. Humans were ridiculously smarter than any other species. That sort of thing didn’t happen because the environment stepped up the difficulty of its problems a little…. [<strong>SU challenge to the reader- save this climate change evolutionary argument with an ad-hoc justification</strong>]</p> <p>Ending up with that gigantic outsized brain must have taken some sort of runaway evolutionary process…And today’s scientists had a pretty good guess at what that runaway evolutionary process had been….</p> <p>[It was] Millions of years of hominids trying to outwit each other - an evolutionary arms race without limit - [that] had led to… increased mental capacity.</p> </blockquote> <p>What does his preferred explanation for the origin of intelligence (people evolved to outwit each other) say about the author?</p> <h3 id="hpmor-24-25-26-mangled-narratives">HPMOR 24/25/26: mangled narratives</h3> <p>This post is going to be entirely about the way the story is being told in this section of chapters. There is a big meatball of a terrible idea, but I’m getting sick of that low-hanging fruit, so I’ll only mention it briefly in passing.</p> <p>I’m a sucker for stories about con artists. In these stories, there is a tradition of breaking with the typical chronological order of storytelling- instead they show the end result of the grand plan first, followed by all the planning that went into it (or some variant of that). In that way, the audience gets to experience the climax first from the perspective of the mark, and then from the perspective of the clever grifters. Yudkowsky himself successfully employs this pattern in the first chapter with the time turner.</p> <p>In these chapters, however, this pattern is badly mangled.
The chapters are setting up an elaborate prank on Rita Skeeter (Draco warned Hariezer that Rita was asking questions during one of many long conversations), but jumbling the narrative accomplishes literally nothing.</p> <p>Here are the events, in the order laid out in the narrative:</p> <ol> <li><p>Hariezer tells Draco he didn’t tell on him about the torture, and borrows some money from him.</p></li> <li><p>(this is the terrible idea meatball) Using literally the exact same logic that Intelligent Design proponents use (and doing exactly 0 experiments), Hariezer decides while thinking over breakfast:</p></li> </ol> <blockquote> <p>Some intelligent engineer, then, had created the Source of Magic, and told it to pay attention to a particular DNA marker.</p> <p>The obvious next thought was that this had something to do with “Atlantis”.</p> </blockquote> <ol start="3"> <li><p>Hariezer meets with Dumbledore, refuses to tell on Draco, and says getting tortured is all part of his manipulation game.</p></li> <li><p>Fred and George Weasley meet with a mysterious man named Flume and tell him the-boy-who-lived needs the mysterious man’s help. There is a Rita Skeeter story mentioned that says Quirrell is secretly a Death Eater and is training Hariezer to be the next dark lord, a story Flume says was planted by the elder Malfoy.</p></li> <li><p>Quirrell tells Rita Skeeter he has no Dark Mark; Rita ignores him.</p></li> <li><p>Hariezer hires Fred and George (presumably with Malfoy’s money) to perpetrate a prank on Rita Skeeter- to convince her of something totally false.</p></li> <li><p>Hariezer has lunch with Quirrell and reads a newspaper story with the headline:</p></li> </ol> <blockquote> <p>HARRY POTTER SECRETLY BETROTHED TO GINEVRA WEASLEY</p> </blockquote> <p>This is the story the Weasleys apparently planted (the prank), and apparently there was a lot of supporting evidence or something, because Quirrell is incredulous it could be done. And then Quirrell, after speculating that Rita Skeeter could be capable of turning into a small animal, crushes a beetle.</p> <p>So what’s the problem with this narrative order? First, there is absolutely no payoff to jumbling the chronology. The prank is left until the end, and it’s exactly what we expected- a false story was planted in the newspaper. It doesn’t even seem like that big a deal- just a standard gossip column story (of course, Harry and Quirrell react like it’s a huge, impossible-to-have-done prank, to be sure the reader knows it’s hard).</p> <p>Second, most of the scenes are redundant; they contain no new information whatsoever and they are therefore boring- the event covered in 3 (talking with Dumbledore) is covered in full in 1 (telling Malfoy he didn’t tell on him to Dumbledore). The events of 6 (Hariezer hiring the Weasleys to prank for him) are completely covered in 4 (when the Weasleys meet with Flume, they tell him it’s for Hariezer). This chapter is twice as long as it should be, for no reason.</p> <p>Third, the actual prank is never shown from either the mark’s or the grifters’ perspective. It happens entirely off-stage, so to speak. We don’t see Rita Skeeter encountering all this amazing evidence about Hariezer’s betrothal and writing up her career-making article.
We don’t see Fred and George’s elaborate plan (although if I were a wizard and wanted to plant a false newspaper story, I’d just plant a false memory in a reporter).</p> <p>Which would have been more interesting: the actual con that happens off-stage, or the long conversations about nothing that happen in these chapters? These chapters are just an utter failure. The narrative decisions are nonsensical, and everything continues to be tell, tell, tell, never show.</p> <p>Also of note- Quirrell gives Hariezer Roger Bacon’s diary of magic, because of course that’s a thing that exists.</p> <h3 id="hpmor-27-answers-from-a-psychologist">HPMOR 27: answers from a psychologist</h3> <p>In order to answer last night’s science question, I spent today slaving on the streets, polling professionals for answers (i.e., I sent one email to an old college roommate who did a doctorate in experimental brain stuff). This will basically be a guest post.</p> <p>Here is the response:</p> <blockquote> <p>The first thing you need to know: this is called “the simulation theory of empathy.” Now that you have a magic google phrase, you can look up everything you’d want, or read on, my (not so) young padawan.</p> <p>You are correct that no one knows how empathy works, it’s too damn complicated, but what we can look at is motor control, and in motor control the smoking gun for simulation is mirror neurons. Rizzolatti and collaborators discovered that certain neurons in the macaque monkey’s inferior frontal gyrus that are related to the motor vocabulary activate not only when the monkey does a gesture, but also when it sees someone else doing that same gesture. So maybe, says Rizzolatti, the same neurons responsible for action are also responsible for understanding action (action-understanding). This would be big support for simulation explanations of understanding others. Unfortunately, it’s not the only explanation for action-understanding- it could be a simple priming effect, for instance. There are also other areas of the macaque brain (in particular the superior temporal sulcus) that aren’t involved in action, but do appear to have some role in action-understanding.</p> <p>It is not an overstatement to say that this discovery of mirror neurons caused the entire field to lose their collective shit. For some reason, motor explanations for brain phenomena are incredibly appealing to large portions of the field, and always have been. James Woods (the old dead behaviorist, not the awesome actor) had a theory that thought itself was related to the motor-neurons that control speech. It’s just something that the entire field is primed to lose their shit over. Some philosophers of the mind made all sorts of sweeping pronouncements (“mirror-neurons are responsible for the great leap forward in human evolution”, pretty sure that’s a direct quote).</p> <p>The problem is that the gold standard for monkey tests is to see what a lesion in that portion of the brain does. Near as anyone can tell, lesions in F5 (the portion of the inferior frontal gyrus where the mirror neurons are) do not impair action-understanding.</p> <p>The next, bigger problem for theories of human behavior is that there is no solid evidence of mirror neurons in humans. A bunch of fMRI studies showed a bit of activity in one region, and then meta-studies suggested not that region, maybe some other region, etc.
fMRI studies are tricky; Google “dead salmon fMRI”.</p> <p>But even if mirror neurons are involved in humans, there is really strong evidence they can’t be involved in action-understanding. The mirror proponents suggest speech is a strong trigger for the proposed human mirror neurons. For instance, in the speech system, we’ve known since Paul Broca (really old French guy) that lesions can destroy your ability to speak without destroying your ability to understand speech. This is a huge problem for models that link action-understanding to action: killing those neurons should destroy both.</p> <p>Also, the suggested human mirror neurons do not fire in response to pantomimed actions. And in autism spectrum disorders, action-understanding is often impaired with no impairment to action.</p> <p>So in summary, the simulation theory of empathy got a big resurgence after mirror neurons, but there is decently strong empirical evidence against a mirror-only theory of action-understanding in humans. That doesn’t mean mirror neurons have no role to play (though if they aren’t found in humans, it does mean they have no role to play), it just means that the brain is complicated. I think the statement you quoted to me would have been something you could read from a philosopher of mind in the late 80s or early 90s, but not something anyone involved in experiments would say. By the mid 2000s, a lot of that enthusiasm had petered out a bit. Then I left the field.</p> </blockquote> <p>So on this particular bit of science, it looks like Yudkowsky isn’t wrong, he is just presenting conjecture and hypothesis as settled science. Still, I learned something here; I’d never encountered this idea before. I’ll have an actual post about chapter 27 tomorrow.</p> <h3 id="hpmor-27-mostly-retreads">HPMOR 27: mostly retreads</h3> <p>This is another chapter where most of the action is stuff that has happened before; we are getting more and more retreads.</p> <p>The new bit is that Hariezer is learning to defend himself from mental attacks. The goal, apparently, is to perfectly simulate someone other than yourself; that way, the mind reader learns the wrong things. This leads into the full-throated endorsement of the simulation theory of empathy that was discussed by a professional in my earlier post. Credit where credit is due- this was an idea I’d never encountered before, and I do think HPMOR is good for some of that- if you don’t trust the presentation and Google ideas as they come up, you could learn quite a bit.</p> <p>We also find out Snape is a perfect mind-reader. This is an odd choice- in the original books Snape’s ability to block mind-reading was something of a metaphor for his character- you can’t know if you can trust him because he is so hard to read, his inscrutableness even fooled the greatest dark wizard ever, etc. It was, fundamentally, his cryptic dodginess that helped the cause, but it also fomented the distrust that some of the characters in the story felt toward him.</p> <p>Now for the retreads-</p> <p>More pointless bitching about Quidditch. Nothing was said here that wasn’t said in the earlier bitching about Quidditch.</p> <p>Snape enlists Hariezer’s help to fight anti-Slytherin bullies (for no real reason, near as I can tell); the bullies are fought once more with con-artist-style cleverness (much like in the earlier chapter with the time turner and invisibility cloak.
In this chapter, it’s just with an invisibility cloak).</p> <p>Snape rewards Hariezer’s rescue with a conversation about Hariezer’s parents, during which Hariezer decides his mother was shallow, which upsets Snape. It’s an odd moment, but the odd moments in HPMOR dialogue have piled up so high it’s almost not worth mentioning.</p> <p>And the chapter culminates when the bullied Slytherin tells Hariezer about Azkaban, pleading with Hariezer to save his parents. Of course, Hariezer can’t (something tells me he will in the near future), and we get this:</p> <blockquote> <p>&quot;Yeah,&quot; said the Boy-Who-Lived, &quot;that pretty much nails it. Every time someone cries out in prayer and I can’t answer, I feel guilty about not being God.&quot;</p> <p>…</p> <p>The solution, obviously, was to hurry up and become God.</p> </blockquote> <p>So another retread- Hariezer is once more making clear his motives aren’t curiosity, they are power. This was true after chapter 10, and it’s still true now.</p> <p>This is the only real action for several chapters now; unfortunately, all the action feels like it’s already happened before in other chapters.</p> <h3 id="hpmor-28-hacking-science-map-territory">HPMOR 28: hacking science/map-territory</h3> <p>Finally we get back to some attempts to do magi-science, but it’s again deeply frustrating. It’s more transfiguration- the only magic we have thus far explored, and it leads to a discussion of map vs. territory distinctions that is horribly mangled.</p> <p>At the opening of the chapter, instead of using science to explore magic, the new approach is to treat magic as a way to hack science itself. To that end, Hariezer tries (and fails) to transfigure something into “cure for Alzheimer’s,” and then tries (successfully) to transfigure a rope of carbon nanotubes. I guess the thought here is he can then give these things to scientists to study? It’s unclear, really.</p> <p>Frustrated with how useless this seems, Hermione makes this odd complaint:</p> <blockquote> <p>&quot;Anyway,&quot; Hermione said. Her voice shook. &quot;I don’t want to keep doing this. I don’t believe children can do things that grownups can’t, that’s only in stories.&quot;</p> </blockquote> <p>Poor Hermione- that’s the feeblest of objections, especially in a story where every character acts like they are in their late teens or twenties. It’s almost as if the author was looking for some knee-jerk complaint you could throw out that everyone would write off as silly on its face.</p> <p>So Hariezer decides he needs to do something adults can’t to appease Hermione. To do this, he decides to attack the known constraints on magic, starting with the idea that you can only transfigure a whole object, and not part of an object (a constraint I think was introduced just for this chapter?).</p> <p>So Hariezer reasons: things are made out of atoms. There isn’t REALLY a whole object there, so why can’t I do part of an object? This prompted me to wonder- if you do transform part of an object, what happens at the interface? Does this whole-object constraint have something to do with the interface? I mentioned in the chapter 15 section that magicking large molecules in place of water could cause steric mismatches (really just volume constraints) with huge energy differences, hence explosions. What happens at the micro level when you take some uniform crystalline solid and try to patch on some organic material like rubber at some boundary?
If you deform the (now rubbery) material, what happens when it changes back and the crystal spacing is now messed up? Could you partially transform something if you carefully worked out the interface?</p> <p>It will not surprise someone who has read this far that none of these questions are asked or answered.</p> <p>Instead, Hariezer thinks really hard about how atoms are real, and in the process we get ruminations on the map and the territory:</p> <blockquote> <p>But that was all in the map, the true territory wasn’t like that, reality itself had only a single level of organization, the quarks, it was a unified low-level process obeying mathematically simple rules.</p> </blockquote> <p>This seems innocuous enough, but a fundamental mistake is being made here. For better or for worse, physics is limited in what it can tell you about the territory; it can just provide you with more accurate maps. Often it provides you with multiple, equivalent maps for the same situation with no way to choose between them.</p> <p>For instance, quarks (and gluons) have this weird property- they are well-defined excitations at very high energies, but not at all well-defined at low energies, where bound states become the fundamental excitations. There is no such thing as a free quark at low energy. For some problems the quark map is useful; for many, many more problems the meson/hadron (proton, neutron, kaon, etc.) map is much more useful. The same theory at a different energy scale provides a radically different map (renormalization is a bitch, and a weak coupling becomes strong).</p> <p>Continuing in this vein, he keeps being unable to transform only part of an object, so he keeps trying different maps, making the same map/territory confusion, culminating in:</p> <blockquote> <p>If he wanted power, he had to abandon his humanity, and force his thoughts to conform to the true math of quantum mechanics.</p> <p>There were no particles, there were just clouds of amplitude in a multiparticle configuration space and what his brain fondly imagined to be an eraser was nothing except a gigantic factor in a wavefunction that happened to factorize,</p> </blockquote> <p>(Side note: for Hariezer it’s all about power, not about curiosity, as I’ve said dozens of times now. Also, I know as much physics as anyone, and I don’t think I’ve abandoned my humanity.)</p> <p>This is another example of the same problem I’m getting at above. There is no “true math of quantum mechanics.” In non-relativistic, textbook quantum mechanics, I can formulate one version of quantum mechanics in 3 space dimensions and 1 time dimension, and calculate things via path integrals. I can also build a large configuration space (Hilbert space) with 3 position and 3 momentum dimensions per particle (and one overall time dimension), and calculate things via operators on that space. These are different mathematical formulations, over different spaces, that are completely equivalent. Neither map is more appropriate than the other. Hariezer arbitrarily thinks of configuration space as the RIGHT one.</p> <p>This isn’t unique to quantum mechanics, most theories have several radically different formulations. Good old Newtonian mechanics has a formulation on the exact same configuration space Hariezer is thinking of.</p> <p>The big point here is that the same theory has different mathematical formulations. We don’t know which is “the territory”; we just have a bunch of different, but equivalent, maps.
Each map has its own strong suits, and it’s not clear that any one of them is the best way to think about all problems. Is quantum mechanics 3+1 dimensional (3 space, 1 time) or is it 6N+1 dimensional (3 position and 3 momentum dimensions per particle, plus 1 time dimension)? It’s both and neither (more appropriately, it’s just not a question that physics can answer for us).</p> <p>What Hariezer is doing here isn’t separating the map and the territory, it’s reifying one particular map (configuration space)!</p> <p>(Less important: I also find it amusing, in a physics-elitist sort of way (sorry for the condescension), that Yudkowsky picks non-relativistic quantum mechanics as the final, ultimate reality. Instead of describing or even mentioning quantum field theory, which is the most low-level theory we (we being science) know of, Yudkowsky picks non-relativistic quantum mechanics, the most low-level theory HE knows.)</p> <p>Anyway, despite obviously reifying a map, in-story it must be the “right” map, because suddenly he manages to transform part of an object, although he tells Hermione:</p> <blockquote> <p>Quantum mechanics wasn’t enough,” Harry said. “I had to go all the way down to timeless physics before it took.</p> </blockquote> <p>So this is more bad pedagogy: timeless physics isn’t even a map, it’s the idea of a map. No one has made a decent formulation of quantum mechanics without a specified time direction (technical aside: it’s very hard to impose unitarity sensibly if you are trying to make time emerge from your theory, instead of being inbuilt). It’s pretty far away from mainstream theory attempts, but here it’s presented as the ultimate idea in physics. It seems very odd to just toss in a somewhat obscure idea as the pinnacle of physics.</p> <p>Anyway, Hariezer shows Dumbledore and McGonagall his newfound ability to transfigure part of an object, and the chapter ends.</p> <h3 id="hpmor-29-not-much-here">HPMOR 29: not much here</h3> <p>Someone called Yudkowsky out on the questionable decision to include his pet theories as established science, so chapter 29 opens with this (why didn’t he stick this disclaimer on the chapters where the mistakes were made?):</p> <blockquote> <p>Science disclaimers: Luosha points out that the theory of empathy in Ch. 27 (you use your own brain to simulate others) isn’t quite a known scientific fact. The evidence so far points in that direction, but we haven’t analyzed the brain circuitry and proven it. Similarly, timeless formulations of quantum mechanics (alluded to in Ch. 28) are so elegant that I’d be shocked to find the final theory had time in it, but they’re not established yet either.</p> </blockquote> <p>He is still wrong about timeless formulations of quantum mechanics, though: they aren’t more elegant, they don’t exist.</p> <p>The rest of this chapter seems like it’s just setup for something coming later- Hariezer, Draco and Hermione are all named as heads of Quirrell’s armies and are all trying to manipulate each other. Some complaints from Hermione that broomstick riding is jock-like and stupid, old hat by now.</p> <p>There is, however, one exceptionally strange bit- apparently in this version of the world, the core plot of Prisoner of Azkaban (Scabbers the rat was really Peter Pettigrew) was just a delusion that a schizophrenic Weasley brother had. Just a stupid swipe at the original book for no real reason.</p> <h3 id="hpmor-30-31-credit-where-credit-is-due">HPMOR 30-31: credit where credit is due</h3> <p>So credit where credit is due -- these two chapters are pretty decent.
We finally get some action in a chapter, there is only one bit of wrongish science, and the overall moral of the episode is a good one.</p> <p>In these chapters, three teams, “armies” led by Draco, Hariezer and Hermione, compete in a mock-battle, Quirrell’s version of a team sport. The action is more or less competently written (despite things like having Neville yell “special attack”), and it’s more-or-less fun and quick to read. It feels a bit like a lighter-hearted version of the beginning competitions of Ender’s Game (which no doubt inspired these chapters).</p> <p>The overall “point” of the chapters is even pretty valuable- Hermione, who is written off as an idiot by both Draco and Hariezer, splits her army and has half attack Draco and half attack Hariezer. She is seemingly wiped out almost instantly. Draco and Hariezer then fight each other nearly to death, and out pops Hermione’s army- turns out they only faked defeat. Hermione wins, and we learn that unlike Draco and Hariezer, Hermione delegated and collaborated with the rest of her team to develop strategies to win the fight. There is a (very unexpected given the tone of everything thus far) lesson about teamwork and collaboration here.</p> <p>That said -- I still have nits to pick. Hariezer’s army is organized in quite possibly the dumbest possible way:</p> <blockquote> <p>Harry had divided the army into 6 squads of 4 soldiers each, each squad commanded by a Squad Suggester. All troops were under strict orders to disobey any orders they were given if it seemed like a good idea at the time, including that one… unless Harry or the Squad Suggester prefixed the order with “Merlin says”, in which case you were supposed to actually obey.</p> </blockquote> <p>This might seem like a good idea, but anyone who has played team sports can testify- there is a reason that you work out plays in advance, and generally have delineated roles. I assume the military has a chain of command for similar reasons, though I have never been a soldier. I was hoping to see this idea for a creatively-disorganized army bite Hariezer, but it does not. There seems to be no confusion at all over orders, etc. Basically, none of what you’d expect to happen from telling an army “do what you want, disobey all orders” happens.</p> <p>And it wouldn’t be HPMOR without potentially bad social science; here is today’s reference:</p> <blockquote> <p>There was a legendary episode in social psychology called the Robbers Cave experiment. It had been set up in the bewildered aftermath of World War II, with the intent of investigating the causes and remedies of conflicts between groups. The scientists had set up a summer camp for 22 boys from 22 different schools, selecting them to all be from stable middle-class families. The first phase of the experiment had been intended to investigate what it took to start a conflict between groups. The 22 boys had been divided into two groups of 11 -</p> <ul> <li>and this had been quite sufficient.</li> </ul> <p>The hostility had started from the moment the two groups had become aware of each others’ existences in the state park, insults being hurled on the first meeting.
They’d named themselves the Eagles and the Rattlers (they hadn’t needed names for themselves when they thought they were the only ones in the park) and had proceeded to develop contrasting group stereotypes, the Rattlers thinking of themselves as rough-and-tough and swearing heavily, the Eagles correspondingly deciding to think of themselves as upright-and-proper.</p> <p>The other part of the experiment had been testing how to resolve group conflicts. Bringing the boys together to watch fireworks hadn’t worked at all. They’d just shouted at each other and stayed apart. What had worked was warning them that there might be vandals in the park, and the two groups needing to work together to solve a failure of the park’s water system. A common task, a common enemy.</p> </blockquote> <p>Now, I readily admit to not having read the original Robbers Cave book, but I do have two textbooks that reference it, and Yudkowsky gets the overall shape of the study right, but fails to mention some important details. (If my books are wrong, and Yudkowsky is right, which seems highly unlikely given his track record, please let me know.)</p> <p>Both descriptions I have suggest the experiment had three stages, not two. The first stage was to build up the in-groups, then the second stage was to introduce them to each other and build conflict, and then the third stage was to try and resolve the conflict. In particular, this aside from Yudkowsky originally struck me as surprisingly insightful:</p> <blockquote> <p>They’d named themselves the Eagles and the Rattlers (they hadn’t needed names for themselves when they thought they were the only ones in the park)</p> </blockquote> <p>Unfortunately, it’s simply not true- during phase 1 the researchers asked the groups to come up with names for themselves, and let the social norms for the groups develop on their own. The “in-group” behavior developed before they met their rival groups.</p> <p>While tensions existed from the first meeting, real conflicts didn’t develop until the two groups competed in teams for valuable prizes.</p> <p>This stuff matters- Yudkowsky paints a picture of humans dividing so easily into tribes that simply setting two groups of boys loose in the same park will cause trouble. In reality, taking two groups of boys, encouraging them to develop group habits, group names, group customs, and then setting the groups to compete directly for scarce prizes (while researchers encourage the growth of conflicts) will cause conflicts. This isn’t just a subtlety.</p> <h3 id="hpmor-32-interlude">HPMOR 32: interlude</h3> <p>Chapter 32 is just a brief interlude, nothing here really; I just felt the need to put this in for completeness.</p> <h3 id="hpmor-33-it-worked-so-well-the-first-time-might-as-well-try-it-again">HPMOR 33: it worked so well the first time, might as well try it again</h3> <p>This chapter has left me incredibly frustrated. After a decent chapter, we get a terrible retread of the same thing. For me, this chapter failed so hard that I’m actually feeling sort of dejected; it undid any goodwill the previous battle chapter had built up.</p> <p>This section of chapters is basically a retread of the dueling armies just a brief section back. Unfortunately, this second battle section completely flubs a lot of the things that worked pretty well in the first battle section. There is a lot to talk about here that I think failed, so this might be long.</p> <p>There is an obvious huge pacing problem here.
The first battle game happens just a brief interlude before the second battle game. Instead of spreading this game out over the course of the Hogwarts school year (or at least putting a few of the other classroom episodes in between), these just get slammed together. First battle, one interlude, last battle. That means that a lot of the evolution of the game over time, how people are reacting to it, etc. is left as a tell rather than a show. A lot of this chapter is spent dealing with big changes to Hogwarts that have been developing as students get super-involved in this battle game, but we never see any of that.</p> <p>Imagine if Ender’s Game (a book fresh on my mind because of the incredibly specific references in this chapter) were structured so that you get the first battle game, and then a flash-forward to his final battle against the aliens, with Ender explaining all the strategy he learned over the rest of that year. This chapter is about as effective as that last Ender’s Game battle would be.</p> <p>The chapter opens with Dumbledore and McGonagall worried about the school-</p> <blockquote> <p>Students were wearing armbands with insignia of fire or smile or upraised hand, and hexing each other in the corridors.</p> </blockquote> <p>Loyalty to armies over house or school is tearing the school apart!</p> <p>But then we turn to the army generals- apparently the new rules of the game allowed soldiers in armies to turn traitor, and it’s caused the whole game to spiral out of control- Draco complains:</p> <blockquote> <p>You can’t possibly do any real plots with all this stuff going on. Last battle, one of my soldiers faked his own suicide.</p> </blockquote> <p>Hermione agrees; everyone is losing control of their armies because of all the traitors.</p> <p>&quot;But.. wait…&quot; I can hear you asking, &quot;how can that make sense? Loyalty to the armies is so absolute people are hexing each other in the corridors? But at the same time, almost all the students in the armies are turning traitor and plotting against their generals? Both of those can’t be true?&quot; I agree, you smart reader you, both of these things don’t work together. NOT ONLY IS YUDKOWSKY TELLING INSTEAD OF SHOWING, WE ARE BEING TOLD CONTRADICTORY THINGS. Yudkowsky wanted to be able to follow through on the Robbers Cave idea he developed earlier, but he also needed all these traitors for his plot, so he tried to run in both directions at once.</p> <p>That’s not the only problem with this chapter (it wouldn’t be HPMOR without misapplied science/math concepts)- it turns out Hermione is winning, so the only way for Draco and Hariezer to try to catch up is to temporarily team up, which leads to a long explanation where Hariezer explains the prisoner’s dilemma and Yudkowsky’s pet decision theory.</p> <p>Here is the big problem- in the classic prisoner’s dilemma:</p> <p>If my partner cooperates, I can either:</p> <p>-cooperate, in which case I spend a short time in jail, and my partner spends a short time in jail</p> <p>-defect, in which case I spend no time in jail, and my partner serves a long time in jail</p> <p>If my partner defects, I can either:</p> <p>-cooperate, in which case I spend a long time in jail, and my partner goes free</p> <p>-defect, in which case I spend a long time in jail, as does my partner.</p> <p>The key insight of the prisoner’s dilemma is that no matter what my partner does, defecting improves my situation.
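<p>To make the dominance argument concrete, here is a minimal sketch you can run yourself. The numeric payoffs are my own illustrative assumptions (higher means a better outcome), not anything from the chapter; the second table is a rough encoding of the Draco/Hariezer situation described below, where the dominance check fails.</p> <pre><code class="language-python">
# Minimal sketch: does "defect" strictly dominate "cooperate" for one player
# of a 2x2 game? Payoff numbers are illustrative assumptions only.

CLASSIC_PD = {
    # (my_move, partner_move): my_payoff
    ("cooperate", "cooperate"): 2,    # short jail term for both
    ("cooperate", "defect"):    0,    # I serve a long term, partner goes free
    ("defect",    "cooperate"): 3,    # I go free
    ("defect",    "defect"):    1,    # long term for both
}

# Rough utilities for the army-game situation (my guess at the ordering):
# both cooperate -> a real shot at first or second (best),
# defect on a cooperator -> guaranteed second,
# both defect -> scrap over second and third,
# cooperate against a defector -> guaranteed third.
ARMY_GAME = {
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"):    1,
    ("defect",    "cooperate"): 2,
    ("defect",    "defect"):    1.5,
}

def defect_dominates(payoffs):
    """True if defecting is strictly better no matter what the partner does."""
    return all(
        payoffs[("defect", other)] > payoffs[("cooperate", other)]
        for other in ("cooperate", "defect")
    )

print(defect_dominates(CLASSIC_PD))  # True: a genuine prisoner's dilemma
print(defect_dominates(ARMY_GAME))   # False: no dominant strategy here
</code></pre> <p>Any numbers respecting those orderings give the same verdict for the classic dilemma: defection wins regardless of what the partner does.</p>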
This leads to a dominant strategy where everyone defects, even though the both-defect outcome is worse than the both-cooperate outcome.</p> <p>In the situation between Draco and Hariezer:</p> <p>If Draco cooperates, Hariezer can either:</p> <p>-cooperate, in which case Hariezer and Draco both have a shot at getting first or second</p> <p>-defect, in which case Hariezer is guaranteed second, Draco guaranteed third place</p> <p>If Draco defects, Hariezer can either:</p> <p>-cooperate, in which case Hariezer is guaranteed third, and Draco gets second.</p> <p>-defect, in which case Hariezer and Draco are fighting it out for second and third.</p> <p>Can you see the difference here? If Draco is expected to cooperate, Hariezer has no incentive to defect- both cooperate is STRICTLY BETTER than the situation where Hariezer defects against Draco. This is not at all a prisoner’s dilemma, it’s just cooperating against a bigger threat. All the pontificating about decision theories that Hariezer does is just wasted breath, because no one is in a prisoner’s dilemma.</p> <p>So much for the pointless digression about the non-prisoner’s dilemma (seriously, this is getting absurd and frustrating- I’m hard pressed to find a single science reference in this whole thing that’s unambiguously applied correctly).</p> <p>After these preliminaries, the battle begins. Unlike the light-hearted, winking reference to Ender’s Game of the previous chapter, Yudkowsky feels the need to make it totally explicit- they fight in the lake, so that Hariezer can use exactly the stuff he learned from Ender’s Game to give him an edge. It turns the light homage of the last battle into just the setup for the beat-you-over-the-head reference this time. There is a benefit to subtlety, and to assuming your reader isn’t an idiot.</p> <p>Anyway, during the battle, everyone betrays everyone and the overall competition ends in a tie.</p> <h3 id="hpmor-34-35-aftermath-of-the-war-game">HPMOR 34-35: aftermath of the war game</h3> <p>These chapters contain a lot of speechifying, but in this case it actually fits, as a resolution to the battle game. It’s expected and isn’t overly long.</p> <p>The language, as throughout, is still horribly stilted, but I think I’m getting used to it (when Hariezer referred to Hermione as “General of Sunshine” I almost went right past it without a mental complaint). Basically, I’m likely to stop complaining about the stilted language but it’s still there, it’s always there.</p> <p>Angry at the ridiculousness of the traitors, Hermione and Draco insist that if Hariezer uses traitors in his army, they will team up and destroy him. He insists he will keep using them.</p> <p>Next, Quirrell gives a long speech about how much chaos the traitors were able to create, and makes an analogy to the Death Eaters. He insists that the only way to guard against such chaos is essentially fascism.</p> <p>Hariezer then speaks up, and says that you can do just as much damage in the hunt for traitors as traitors can do themselves, and stands up for a more open society. The themes of these speeches can be found in probably hundreds of books, but they work well enough here.</p> <p>Every army leader gets a wish; Draco and Hermione decide to wish for their houses to win the house cup. In an attempt to demonstrate his argument for truth, justice and the American way, Hariezer wishes for Quidditch to no longer contain the Snitch.
I guess nothing will rally students around an open society like one person fucking with the sport they love.</p> <p>We also find out that Dumbledore helped the tie happen by aiding in the plotting (the plot was “too complicated” for any student, according to Quirrell, so it must have been Dumbledore- apparently, ‘betray everyone to keep the score close’ is a genius master plan), but we are also introduced to a mysterious cloaked stranger who was also involved but wiped all memory of his passing.</p> <p>These are ok chapters, as HPMOR chapters go.</p> <h3 id="hariezer-s-arrogance">Hariezer's Arrogance</h3> <p>In response to some things that kai-skai has said, I started thinking about how we should view Hariezer’s arrogance. Should we view it as a character flaw? Something Hariezer will grow and overcome? I don’t think it’s being presented that way.</p> <p>My problems with the arrogance are several:</p> <p>-the author intends for Hariezer to be a teacher. He is supposed to be the master rationalist that the reader (and other characters) learn from. His arrogance makes that off-putting. If you aren’t familiar at all with the topics Hariezer happens to be discussing, you are being condescended to along with the characters in the story (although if you know the material you get to condescend to the simpletons along with Hariezer). It’s just a bad pedagogical choice. You don’t teach people by putting them on the defensive.</p> <p>-The arrogance is not presented by the author as a character flaw. In the story, it’s not a flaw to overcome, it’s part of what makes him “awesome.” His arrogance has not harmed him, he hasn’t felt the need to revisit it. When he thinks he knows better than everyone else, the story invariably proves him right. He hasn’t grown or been presented with a reason to grow. I would bet a great deal of money that Hariezer ends HPMOR exactly the same arrogant twerp he starts as.</p> <p>-This last one is a bit of a personal reaction. Hariezer gets a lot of science wrong (I think all of it is wrong, actually, up to where I am now), and is incredibly arrogant while doing so. I’ve taught a number of classes at the college level, and I’ve had a lot of confidently, arrogantly wrong students. Hariezer’s attitude and lack of knowledge repeatedly remind me of the worst students I ever had- smart kids too arrogant to learn (and these were physics classes, where wrong or right is totally objective).</p> <h3 id="hpmor-36-37-christmas-break-not-much-happens">HPMOR 36/37: Christmas break/not much happens</h3> <p>Like all the Harry Potter books, HPMOR includes a Christmas break. I note that a Christmas break would make a lot of sense toward the middle of the book, not less than a third of the way through. Like the original books, this is just a light bit of relaxation.</p> <p>Not a lot happens over break. Hariezer is a twerp who thinks his parents don’t respect him enough; they go to Hermione’s house for Christmas; Hariezer yells at Hermione’s parents for not respecting her intelligence enough; the parents say Hermione and Hariezer are like an old married couple (it would have been nice to see the little bonding moments in the earlier chapters). Quirrell visits Hariezer on Christmas Eve.</p> <h3 id="hpmor-38-a-cryptic-conversation-not-much-here">HPMOR 38: a cryptic conversation, not much here</h3> <p>This whole chapter is just a conversation between Malfoy and Hariezer.
It fits squarely into the “one party doesn’t really know what the conversation is about” mold, with Hariezer being the ignorant party. Malfoy is convinced Hariezer is working with someone other than Quirrell or Dumbledore.</p> <h3 id="hpmor-39-your-transhumanism-is-showing">HPMOR 39: your transhumanism is showing</h3> <p>This was a rough chapter, in which Hariezer and Dumbledore primarily have an argument about death. Hariezer takes up the transhumanist position. If you aren’t familiar with the transhumanist position on death, it’s basically that death is bad (duh!) and that the world is full of deathists who have convinced themselves that death is good. This usually leads into the idea that some technology will save us from death (nanotech, SENS, etc.), and even if they don’t, we can all just freeze our corpses to be reanimated when that whole death thing gets solved. I find this position somewhat childish, as I’ll try to get to.</p> <p>So, as a word of advice to future transhumanist authors who want to write literary screeds arguing against the evil deathists, FANTASY LITERATURE IS A UNIQUELY BAD CHOICE FOR ARGUING YOUR POINT. To be fair, Yudkowsky noticed this and lampshaded it: when Hariezer says there is no afterlife, Dumbledore argues back with:</p> <blockquote> <p>“How can you not believe it? ” said the Headmaster, looking completely flabbergasted. “Harry, you’re a wizard! You’ve seen ghosts! ” …And if not ghosts, then what of the Veil? What of the Resurrection Stone?”</p> </blockquote> <p>i.e. how can you not believe in an afterlife when there is a <em>literal gateway</em> to the fucking afterlife sitting in the Ministry of Magic basement. Hariezer attempts to argue his way out of this; we get this story, for instance:</p> <blockquote> <p>You know, when I got here, when I got off the train from King’s Cross…I wasn’t expecting ghosts. So when I saw them, Headmaster, I did something really dumb. I jumped to conclusions. I, I thought there was an afterlife… I thought I could meet my parents who died for me, and tell them that I’d heard about their sacrifice and that I’d begun to call them my mother and father -</p> <p>&quot;And then… asked Hermione and she said that they were just afterimages… And I should have known! I should have known without even having to ask! I shouldn’t have believed it even for all of thirty seconds!… And that was when I knew that my parents were really dead and gone forever and ever, that there wasn’t anything left of them, that I’d never get a chance to meet them and, and, and the other children thought I was crying because I was scared of ghosts</p> </blockquote> <p>So, first point- <em>this could have been a pretty powerful moment if Yudkowsky had actually structured the story to relate this WHEN HARIEZER FIRST MET A GHOST. Instead, the first we hear of it is this speech.</em> Again, tell, tell, tell, never show.</p> <p>Second point- what exactly does Hariezer assume is being “afterimaged?” Clearly some sort of personality, something not physical is surviving in the wizarding world after death. If fighting death is this important to Hariezer, why hasn’t he even attempted to study ghosts yet? (full disclosure, I am an atheist personally.
However, if I lived in a world WITH ACTUAL MAGIC, LITERAL GHOSTS, a stone that resurrects the dead, and a FUCKING GATEWAY TO THE AFTERLIFE, I might revisit that position).</p> <p>Here is Hariezer’s response to the gateway to the afterlife:</p> <blockquote> <p>That doesn’t even sound like an interesting fraud,” Harry said, his voice calmer now that there was nothing there to make him hope, or make him angry for having hopes dashed. “Someone built a stone archway, made a little black rippling surface between it that Vanished anything it touched, and enchanted it to whisper to people and hypnotize them.”</p> </blockquote> <p>Do you see how incurious Hariezer is? If someone told me there was a LITERAL GATEWAY TO THE AFTERLIFE I’d want to see it. I’d want to test it. Can we try to record and amplify the whispers? Are things being said?</p> <p>Why do they think it’s a gateway to the afterlife? Who built it? Minimally, this could have led to a chapter where Hariezer debunks wizarding spiritualists like a wizard-world Houdini. (Houdini spent a great deal of his time exposing mediums and psychics who ‘contacted the dead’ as frauds.) I’m pretty sure I would have even enjoyed a chapter like that.</p> <p>In the context of the wizarding world, there is all sorts of non-trivial evidence for an afterlife that simply doesn’t exist in the real world. It’s just a bad choice to present these ideas in the context of this story.</p> <p>Anyway, ignoring what a bad choice it is to argue against an afterlife in the context of fantasy fiction, let’s move on:</p> <p>Dumbledore presents some dumb arguments so that Hariezer can seem wise. Hariezer tells us death is the most frightening thing imaginable, it’s not good, etc. Basically, death is scary, no one should have to die. If we had all the time imaginable we would actually use it. Pretty standard stuff; Dumbledore drops the ball on presenting any real arguments.</p> <p>So I’ll take up Dumbledore’s side of the argument. I have some bad news for Hariezer’s philosophy. You are going to die. I’m going to die. Everyone is going to die. It sucks, and it’s unfortunate, sure, but there is no way around it. It’s not a choice! We aren’t CHOOSING death. Even if medicine can replace your body (which doesn’t seem likely in my lifetime), the sun will burn out some day. Even if we get away from the solar system, eventually we’ll run out of free energy in the universe.</p> <p>But you do have one choice regarding death- you can accept that you’ll die someday, or you can convince yourself there is some way out. Convince yourself that if you say the right prayers, or, in the Less Wrong case, work on the right decision theory to power an AI, you’ll get to live forever. Convince yourself that if you give a life insurance policy to the amateur biologists that run cryonics organizations, you’ll be reanimated.</p> <p>The problem with the second choice is that there is an opportunity cost- time spent praying or working on silly decision theories is time that you aren’t doing things that might matter to other humans. We accept death to be more productive in life. Stories about accepting death aren’t saying death is good; they are saying death is inevitable.</p> <p>Edit: I take back a bit about cognitive dissonance that was here.</p> <h3 id="hpmor-40-short-follow-up-to-39">HPMOR 40: short follow up to 39</h3> <p>Instead of Dumbledore’s views, in this chapter we get Quirrell’s view of death.
He agrees with Hariezer, unsurprisingly.</p> <h3 id="hpmor-41-another-round-of-quirrell-s-battle-game">HPMOR 41: another round of Quirrell's battle game</h3> <p>It seems odd that AFTER the culminating scene, the award being handed out, and the big fascist vs. freedom speechifying, we have yet another round of Quirrell’s battle game.</p> <p>Draco and Hermione are now working together against Hariezer. Through a series of circumstances, Draco has to drop Hermione off a roof to win.</p> <p>Edit: I also point out that we don’t actually get details of the battle this time; it opens with</p> <blockquote> <p>Only a single soldier now stood between them and Harry, a Slytherin boy named Samuel Clamons, whose hand was clenched white around his wand, held upward to sustain his Prismatic Wall.</p> </blockquote> <p>We then get a narrator summary of the battle that had led up to that moment. Again, tell, tell, tell, never show.</p> <h3 id="hpmor-42-is-there-a-point-to-this">HPMOR 42: is there a point to this?</h3> <p>Basically an extraneous chapter, but there is one strange detail at the end.</p> <p>So in this chapter, Hariezer is worried that it’s his fault that, in the battle last chapter, Hermione got dropped off a roof. Hermione agrees to forgive him as long as he lets Draco drop him off the same roof.</p> <p>He takes a potion to help him fall slowly and is dropped, but so many young girls try to summon him to their arms (yes, this IS what happens) that he ends up falling; luckily, Remus Lupin is there to catch him.</p> <p>Afterwards, Remus and Hariezer talk. Hariezer learns that his father was something of a bully. And, for some reason, that Peter Pettigrew and Sirius Black were lovers. Does anyone know what the point of making Pettigrew and Black lovers would be?</p> <h3 id="conversations">Conversations</h3> <p>My girlfriend: “What have you been working on over there?”</p> <p>Me: “Uhhhh… so…. there is this horrible Harry Potter fan fiction… you know, when people on the internet write more stories about Harry Potter? Yea, that. Anyway, this one is pretty terrible so I thought I’d read it and complain about it on the internet…. So I’m listening to me say this out loud and it sounds ridiculous, but.. well, it IS ridiculous… but…”</p> <h3 id="hpmor-43-46-subtle-metaphors">HPMOR 43-46: Subtle Metaphors</h3> <p>These chapters actually moved pretty decently. When Yudkowsky isn’t writing dialogue, his prose style can actually be pretty workmanlike. Nothing that would get you to stop and marvel at the wordplay, but it keeps the pace brisk and moving.</p> <p>Now, in JK Rowling’s original books, it always seemed to me that the dementors were a (not-so-subtle) nod to depression. They leave people wallowing in their worst memories, low energy, unable to remember the happy thoughts, etc.</p> <p>In HPMOR, however, Hariezer (after initially failing to summon a patronus) decides that the dementors really represent death. You see, in HPMOR, instead of reliving their saddest, most depressing memories, the characters just see a bunch of rotting corpses when the dementors get near.</p> <p>This does, of course, introduce new questions. What does it mean that the dementors guard Azkaban? Why don’t the prisoners instantly die?
Why doesn’t a dementor attack just flat-out kill you?</p> <p>Anyway, apparently the way to kill death is to just imagine that someday humans will defeat death, in appropriately Carl Sagan-esque language:</p> <blockquote> <p>The Earth, blazing blue and white with reflected sunlight as it hung in space, amid the black void and the brilliant points of light. It belonged there, within that image, because it was what gave everything else its meaning. The Earth was what made the stars significant, made them more than uncontrolled fusion reactions, because it was Earth that would someday colonize the galaxy, and fulfill the promise of the night sky.</p> <p>Would they still be plagued by Dementors, the children’s children’s children, the distant descendants of humankind as they strode from star to star? No. Of course not. The Dementors were only little nuisances, paling into nothingness in the light of that promise; not unkillable, not invincible, not even close.</p> </blockquote> <p>Once you know this, your patronus becomes a human, and kills the dementor. Get it? THE PATRONUS IS HUMANS (represented in this case by a human) and THE DEMENTOR IS DEATH. Humans defeat death. Very subtle.</p> <p>Another large block of chapters with no science.</p> <h3 id="hpmor-47-racism-is-bad">HPMOR 47: Racism is Bad</h3> <p>Nothing really objectionable here, just more conversations and plotting.</p> <p>Hariezer spends much of this chapter explaining to Draco that racism is bad, and that a lot of purebloods probably hate mudbloods because it gives them a chance to feel superior. Hariezer suggests these racist ideas are poisoning Slytherin.</p> <p>We also find out that Draco and his father seem to believe that Dumbledore burned Draco’s mother alive. This is clearly a departure from the original books. Hariezer agrees to take as an enemy whoever killed Draco’s mother. Feels like it’ll end up being more plots-within-plots stuff.</p> <p>Another chapter with no science explored. We do find out Hariezer speaks snake language.</p> <h3 id="hpmor-48-utilitarianism">HPMOR 48: Utilitarianism</h3> <p>This chapter is actually solid as far as these things go. After learning he can talk to snakes, Hariezer begins to wonder if all animals are sentient- after all, snakes can talk. This has obvious implications for meat eating.</p> <p>From there, he begins to wonder if plants might be sentient, in which case he wouldn’t be able to eat anything at all. This leads him to the library for research.</p> <p>He also introduces scope insensitivity and utilitarianism, even though neither is really required to explain his point to Hermione. Hermione asks why he is freaking out, and instead of answering “I don’t want to eat anything that thinks and talks,” he says stuff like</p> <blockquote> <p>&quot;Look, it’s a question of multiplication, okay? There’s a lot of plants in the world, if they’re not sentient then they’re not important, but if plants are people then they’ve got more moral weight than all the human beings in the world put together. Now, of course your brain doesn’t realize that on an intuitive level, but that’s because the brain can’t multiply. Like if you ask three separate groups of Canadian households how much they’ll pay to save two thousand, twenty thousand, or two hundred thousand birds from dying in oil ponds, the three groups will respectively state that they’re willing to pay seventy-eight, eighty-eight, and eighty dollars. No difference, in other words.
It’s called scope insensitivity.</p> </blockquote> <p>Is that really the best way to describe his thinking? Why say something in 10 words when several hundred will do? What does scope insensitivity have to do with the idea “I don’t want to eat things that talk and think?”</p> <p><strong>Everything below here is unrelated to HPMOR and has more to do with scope insensitivity as a concept:</strong></p> <p>Now, because I have taught undergraduates intro physics, I do wonder (and have in the past)- is Kahneman’s scope insensitivity related to the general innumeracy of most people? i.e. how many people who hear that question just mentally replace literally any number with “a big number”?</p> <p>The first time I taught undergraduates I was surprised to learn that most of the students had no ability to judge if their answers seemed plausible. I began adding a question “does this answer seem order of magnitude correct?” I’d also take off more points for answers that were the wrong order of magnitude, unless the student put a note saying something like “I know this is way too big, but I can’t find my mistake.”</p> <p>You could ask a question about a guy throwing a football, and answers would range from 1 meter/second all the way to 5000 meters/second. You could ask a question about how far someone can hit a baseball and answers would similarly range from a few meters to a few kilometers. No one would notice when answers were wildly wrong. Lest someone think this is a units problem (Americans aren’t used to metric units), even if I forced them to convert to miles per hour, miles, or feet, students couldn’t figure out if the numbers were the right order of magnitude.</p> <p>So I began to give a few short talks on what I thought of as basic numeracy. Create mental yardsticks (the distance from your apartment to campus might be around a few miles, the distance between this shitty college town and the nearest actual city might be around a few hundred miles, etc.). When you encounter unfamiliar problems, try to relate them back to familiar ones. Scale the parameters in equations so you have dimensionless quantities * yardsticks you understand. And after being explicitly taught, most of the students got better at understanding the size of numbers.</p> <p>Since I began working in the business world I’ve noticed that most people never develop that skill. Stick a number in a sentence and people just mentally run right over it; you might as well have inserted some Klingon phrases. Some of the better actuaries do have some nice numerical intuition, but a surprising number don’t. They can calculate, but they don’t understand what the calculations are really telling them, like Searle’s Chinese room but with numbers.</p> <p>In Kahneman’s scope neglect questions, there are big problems with innumeracy- if you ask people how much they’d spend on X where X is any charity that seems importantish, you are likely to get an answer of around $100. In some sense, it is scope neglect; in another sense, you just max out people’s generosity/spending cash really quickly.</p> <p>If you rephrase it to “how much should the government spend” you hit general innumeracy problems, and you also hit general innumeracy problems when you specify large, specific numbers of birds.</p> <p>I suspect Kahneman would have gotten different results had he asked varying questions such as: “what percentage of the federal government’s wildlife budget should be spent preventing disease for birds in your city?” vs.
&quot;what percentage of the federal government’s wildlife budget should be spent preventing disease for birds in your state?&quot; vs. &quot;what percentage of the federal government’s wildlife budget should be spent preventing disease for birds in the whole country?&quot; (I actually ran this experiment on a convenience sample of students in a 300-level physics class several years ago and got 5%, 8%, and 10% respectively, but the differences weren’t significant, though the trend was suggestive.)</p> <p>I suspect the problem isn’t that “brains can’t multiply” so much as “most people are never taught how to think about numbers.”</p> <p>If anyone knows of further literature on this, feel free to pass it my way.</p> <h3 id="hpmor-49-not-much-here">HPMOR 49: not much here</h3> <p>I thought I posted something about this last weekend, but I think tumblr ate it. So this will be particularly light. Hariezer notices that Quirrell <em>knows too much</em> (phrased as “his priors are too good”) but hasn’t yet put together what that implies about Quirrell.</p> <p>There is also (credit where credit is due) a clever working in of the second book into Yudkowsky’s world. The “Interdict of Merlin” Yudkowsky invented prevents wizards from writing spells down, so Slytherin’s basilisk was placed in Hogwarts to pass spells on to “the heir of Slytherin.” Voldemort learned those secrets and then killed the basilisk, so Hariezer has no shortcut to powerful spells.</p> <h3 id="hpmor-50-in-need-of-an-editor">HPMOR 50: In Need of an Editor</h3> <p>So this is basically a complete rehash again- it fits into the “Hariezer uses the time turner and the invisibility cloak to solve bullying” mold we’ve already seen a few times. The time turner + invisibility cloak is the solution to all problems, and when Yudkowsky needs a conflict, he throws in bullying. I think we’ve seen this exact conflict with this exact solution at least three other times.</p> <p>In this chapter, it’s Hermione being bullied; he protects her by creating an alibi with his time turner, dressing in an invisibility cloak, and whispering some wisdom in the bully’s ear. Because most bullies just need the wisdom of an eleven-year-old whispered into their ear.</p> <h3 id="hpmor-51-63-economy-of-language">HPMOR 51-63: Economy of Language</h3> <p>So this block of chapters is roughly the length of The Maltese Falcon or the first Harry Potter book, probably 2/3 the length of The Hobbit. This one relatively straightforward episode of this story is the length of The Maltese Falcon. Basically, the ratio of things-happening/words-written is terrible.</p> <p>This block amounts to a prison break- Quirrell tells Hariezer that Bellatrix Black was innocent, so they are going to break her out. It’s a weird section, given how Black escaped in the original novels (i.e. the dementors sided with the dark lord, so all he had to do was go to the dementors and say “you are on my side, please let out Bellatrix Black, and everyone else while you are at it.”)</p> <p>The plan is to have Hariezer use his patronus while Quirrell travels in snake form in his pouch. They’ll replace Bellatrix with a corpse, so everyone will just think she is dead. It becomes incredibly clear upon meeting Bellatrix that she wasn’t “innocent” at all, though she might be not guilty in the by-reason-of-insanity sense.</p> <p>This doesn’t faze Hariezer; they just keep moving forward with the plan, which goes awry pretty quickly when an auror stumbles on them. Quirrell tries to kill the auror.
Hariezer tries to block the killing spell and ends up knocking out Quirrell and turning his patronus off, and the plan goes to hell.</p> <p>To escape, Hariezer first scares the dementors off by threatening to blast them with his uber-patronus (even death is apparently scared of death in this story). Then Quirrell wakes up, and with Quirrell’s help he transfigures a hole in the wall, and transfigures a rocket, which he straps to his broomstick, and out they fly. The rocket goes so fast the aurors can’t keep up.</p> <p>It’s a decent bit of action in a story desperately in need of action, but it’s marred by excessive verbosity. We have huge expanses of Hariezer talking with Quirrell, Hariezer talking to himself, Hariezer thinking about dementors, etc. Instead of a tense, taut 50 pages we get a turgid 300.</p> <p>After they get to safety, Quirrell and Hariezer discuss the horror that is Azkaban. Quirrell tells Hariezer that only a democracy could produce such a torturous prison. A dark lord like Voldemort would have no use for it once he got bored:</p> <blockquote> <p>You know, Mr. Potter, if He-Who-Must-Not-Be-Named had come to rule over magical Britain, and built such a place as Azkaban, he would have built it because he enjoyed seeing his enemies suffer. And if instead he began to find their suffering distasteful, why, he would order Azkaban torn down the next day.</p> </blockquote> <p>Hariezer doesn’t take up the pro-democracy side, and only time will tell if he goes full-on reactionary like Quirrell by the end of our story. By the end of the chapter, Hariezer is ruminating on the Milgram experiment, although I don’t think it’s really applicable to the horror of Azkaban (it’s not like the dementors are “just following orders”- they live to kill).</p> <p>Hariezer then uses his time turner to go back to right before the prison breakout, the perfect alibi for the perfect crime.</p> <p>Dumbledore and McGonagall suspect Hariezer played a part in the escape, because of the use of the rocket. They ask Hariezer to use his time turner to send a message back in time (which he wouldn’t be able to do if he had already used his turner to hide his crime).</p> <p>Hariezer solves this through the time-turner-ex-machina of Quirrell knowing someone else with a time turner, because when Yudkowsky can’t solve a problem with a time turner, he solves it with two time turners.</p> <h3 id="hpmor-64-65-respite">HPMOR 64/65: respite</h3> <p>Chapter 64 is again “omake” so I didn’t read it.</p> <p>Chapter 65 appears to be a pit-stop before another long block of chapters. Hariezer is chafing that he has been confined to Hogwarts in order to protect him from the dark lord, so he and Quirrell are thinking of hiring a play-actor to pretend to be Voldemort, so that Quirrell can vanquish him.</p> <p>These were a brief respite between the huge 12-chapter block I just got through and <em>another giant 12-chapter block</em>. It’s looking like the science ideas are slowing down in these long chapter blocks, as the focus shifts to action. youzicha has suggested a lot of the rest will be Hariezer cleverly “hacking” his way out of situations, like the rocket in the previous 12-chapter block.
The sweet spot for me has been discussing the science presented in these chapters, so between the expected lack of science and the increasing length of chapter blocks, expect slower updates.</p> <h3 id="hpmor-66-77-absolutely-appallingly-awful">HPMOR 66-77: absolutely, appallingly awful</h3> <p>There is a general problem with fanfiction (although usually not in serial fiction, where things tend to stay a bit more focused for whatever reason), where the side/B-plots are written entirely in one pass instead of intertwined alongside the main plot. Instead of being a pleasant diversion, the side-plot piles up in one big chunk. This is one such side-plot.</p> <p>Also worth noting these chapters combine basically everything I dislike about HPMOR into one book-length bundle of horror. It was honest-to-god work to continue to power through this section. So this will be just a sketch of this awful block of chapters.</p> <p>We open with another superfluous round of the army game, in which nothing notable really happens other than some character named Daphne challenging Neville to “a most ancient duel” WHICH IS APPARENTLY A BATTLE WITH LIGHTSABERS. My eyes rolled so hard I almost had a migraine, and this was the first chapter of the block.</p> <p>After the battle, Hermione becomes concerned that women are underrepresented among heroes of the wizarding world, and starts a “Society for the Promotion of Heroic Equality for Witches” or SPHEW. They start with a protest in front of Dumbledore’s office and then decide to heroine it up and put an end to bullying. You see, in the HPMOR world, bullying isn’t a question of social dynamics, or ostracizing kids. Bullying is coordinated ambushes of kids in hallways by groups of older kids, and an opportunity for “leveling-up.” The way to fight bullies in this strange world is to engage in pitched wizard-battles in the hallways (having fought an actual bully in reality as a middle schooler, I can tell you that at least for me “fight back” doesn’t really solve the problem in any way). In this world, the victims of the bullying are barely mentioned and don’t have names.</p> <p>And of course, <em>the authority figures like McGonagall don’t even really show up during all of this</em>. Students are constantly attacking each other in the hallways and no one is doing anything about it. Because the way to make your characters seem “rational” is to make sure the entire world is insane.</p> <p>Things quickly escalate until 44 bullies get together to ambush the eight girls in SPHEW. A back-of-the-envelope calculation suggests Hogwarts has maybe 300 students. So we are to expect that nearly 15% of the student population are the sort of “get together and plot an ambush” bullies that maybe you find in 90s high school TV shows. Luckily, Hariezer had asked Quirrell to protect the girls, so a disguised Quirrell takes down the 44 bullies.</p> <p>We get a “lesson” (lesson in this context means ‘series of insanely terrible ideas’) on “heroic responsibility” in the form of Hariezer lecturing Hermione.</p> <blockquote> <p>The boy didn’t blink. “You could call it heroic responsibility, maybe,” Harry Potter said. “Not like the usual sort. It means that whatever happens, no matter what, it’s always your fault… Following the school rules isn’t an excuse, someone else being in charge isn’t an excuse, even trying your best isn’t an excuse.
There just aren’t any excuses, you’ve got to get the job done no matter what.”… Being a heroine means your job isn’t finished until you’ve done whatever it takes to protect the other girls, permanently.”</p> </blockquote> <p>You know a good way to solve bullying? Expel the bullies. You know who has the power to do that? McGonagall and Dumbledore. A school is a system and has procedures in place to deal with problems. The proper response is almost always “tell an authority figure you trust.” Being “rational” is knowing when to trust the system to do its job.</p> <p>In this case, Yudkowsky hasn’t even pulled his usual trick of writing the system as failing- no one even attempts to tell an authority figure about the bullying and no authority figure engages with it, besides Quirrell who engages by disguising himself and attacking students, and Snape who secretly (unknown even to SPHEW) directs SPHEW to where the bullies will be. The system of school discipline stops existing for this entire series of chapters.</p> <p>We get a final denouement between Hariezer and Dumbledore where the bullying situation is discussed by reference to Gandhi’s passive resistance in India, WW2 and Churchill, and the larger wizarding war, all of which feels largely overwrought because <em>it was bullying</em>. Big speeches about how Hermione has been put in danger, etc. ring empty because <em>it was bullying</em>. Yes, being bullied is traumatic (sometimes life-long traumatic), but it’s not WORLD WAR traumatic.</p> <p>I also can’t help but note the irony that the block’s action largely started with Hermione’s attempt to “self-actualize” by behaving more heroically, and ends with Dumbledore and Hariezer discussing whether it was the right thing to do to <em>let Hermione play her silly little game</em>.</p> <p>Terrible things in HPMOR:</p> <ul> <li>Lack of science: I have no wrong science to complain about, because these chapters have no science references at all really.</li> <li>the world/characters behave in silly ways as a foil for the characters: the authority figures don’t do anything to prevent the escalating bullying/conflict, aside from Snape and Quirrell who get actively involved. The bullying itself isn’t an actual social dynamic, it’s just general “conflict” to throw at the characters.</li> <li>Time turner/invisibility cloak solves all problems: in a slight twist, Snape gives a time turner to a random student and uses her to pass messages to SPHEW so they can find and attack bullies.</li> <li>Superfluous retreads of previous chapters: the army battle that starts it off; much of the bullying is a retread. There are several separate bully-fights in this block of chapters.</li> <li>Horrible pacing: this whole block of chapters is a B-plot roughly the length of an entire book.</li> <li>Stilted language: everyone refers to the magic lightsabers as “the most ancient blade” every time they reference them.</li> </ul> <h3 id="munchkinism">Munchkinism</h3> <p>I’ve been meaning to make a post like this for several weeks, since yxoque reminded me of the idea of the munchkin. jadagul mentioned me in a post today that reminded me I had never made it. Anyway:</p> <p>I grew up playing Dungeons and Dragons, which was always an extremely fun way to waste a middle school afternoon. The beauty of Dungeons and Dragons is that it provides structure for a group of kids to sit around and tell a shared story as a group.
The rules of the game are flexible, and one of the players acts as a living rule-interpreter to guide the action and keep the story flowing.</p> <p>Somehow, every Dungeons and Dragons community I’ve ever been part of (middle school, high school and college) had the same word for a particularly common failure mode of the game, and that word was munchkin, or munchkining (does anyone know if there was a gaming magazine that used this phrase?). The failure is simple - people get wrapped up in the letter of the rules, instead of the spirit, and start building the most powerful character possible instead of a character that makes sense as a role. Instead of story flow, the game gets bogged down in dice rolls and checks so that the munchkins can demonstrate how powerful they are. Particularly egregious munchkins have been known to cheat on their character creation rolls to boost all their abilities. With one particular group in high school, I witnessed one particularly hot-headed munchkin yell at everyone else playing the game when the dungeon master (the human rule interpreter) slightly modified a rule and ended up weakening the munchkin’s character.</p> <p>The frustrating thing about HPMOR is that Hariezer is designed, as yxoque pointed out, to be a munchkin- using science to exploit the rules of the magical world (which could be an interesting question), but because Yudkowsky is writing the rules of magic as he goes, Hariezer is essentially cheating at a game he is making up on the fly.</p> <p>All of the cleverness isn’t really cleverness- it’s easy to find loopholes in the rules you yourself create as you go, especially if you created them to have giant loopholes.</p> <p>In Azkaban, Hariezer uses science to escape by transfiguring himself a rocket. This only makes sense because for some unknown reason magic brooms aren’t as fast as rockets.</p> <p>In one of his army games, Hariezer uses gloves with gecko setae to climb down a wall, because for some reason broomsticks aren’t allowed. For some reason, there is no ‘grip a wall’ spell.</p> <p>Yudkowsky isn’t bound by the handful of constraints in Rowling’s world (where Dementors represent depression, not death); hell, he doesn’t even stick to his own constraints. In Hariezer’s escape from Azkaban, he violates literally the only constraint he had laid down (don’t transfigure objects into something you plan to burn).</p> <p>Every other problem in the story is solved by using the time turner as a deus ex machina. Even when plot constraints mean Hariezer’s time turner can’t be used, Yudkowsky just introduces another time turner rather than come up with a novel and clever solution for his characters.</p> <p>Hariezer’s plans in HPMOR work only because the other characters become temporarily dumb to accommodate his “rationality” and because the magic is written around the idea of him succeeding.</p> <h3 id="genre-savvy">&quot;Genre savvy&quot;</h3> <p>So a lot of people have asked me to take a look at the Yudkowsky writing guide, and I will eventually (first I have to finish HPMOR, which is taking forever because I’m incredibly bored with it, but I HAVE MADE A COMMITMENT- hopefully more HPMOR live blogging after Thanksgiving).</p> <p>But I did hit something that also applies to HPMOR, and a lot of other stories. Yudkowsky advocated that characters “have read the books you’ve read” so they can solve those problems. One of my anonymous askers used the phrase “genre savvy” for this- and Google led me to the TV Tropes page.
The problem with this idea is that as soon as you insert a genre savvy character, your themes shift, much like having a character break the fourth wall. Suddenly your story is about stories. Your story is now a commentary on the genre/genre conventions.</p> <p>Now, there are places where this can work fairly well- those Scream movies, for instance, were supposed to be (at least in part) ABOUT horror movies as much as they WERE horror movies. Similarly, every fan-fiction is (on some level) a commentary on the original works, so “genre savvy” fan fiction self-inserts aren’t nearly as bad an idea as they could be.</p> <p>HOWEVER (and this is really important)- MOST STORIES SHOULD NOT BE ABOUT STORIES IN THE ABSTRACT/GENRE/GENRE CONVENTIONS, and this means it is a terrible idea having characters that constantly approach things on a meta level (“this is like in this fiction book I read”). If you don’t have anything interesting to say about the actual genre conventions, then adding a genre savvy character is almost certainly going to do you more harm than good. If you are bored with a genre convention, you’ll almost certainly get more leverage out of subverting it (if you lead both the character AND the reader to expect a zig, and instead they get a zag, it can liven things up a bit) than by sticking in a genre-savvy character.</p> <p>Sticking in a genre-savvy character just says “look at this silly convention!” and then when that convention is used anyway, it just feels like the writer being a lazy hipster. Sure, your reader might get a brief burst of smugness (“he/she’s right, all those genre books ARE stupid! Look how smart I am!”), but you aren’t really moving your story forward. You are critiquing lazy conventions while also trying to use them.</p> <p>If you don’t like the conventions of a genre, don’t write in that genre, or subvert them to make things more interesting. Or simply refuse to use those conventions altogether; go your own way.</p> <h3 id="hpmor-78-action-without-consequences">HPMOR 78: action without consequences</h3> <p>A change of tactics- this chapter is part of another block of chapters, but I’m having trouble getting through it, so I’m going to write in installments chapter by chapter, instead of a dump on a 12-chapter block again.</p> <p>This chapter is another installment of Quirrell’s battle game. This time, the parents are in the stands, which becomes important when Hermione out-magics Draco.</p> <p>Afterwards, Draco is upset because his father saw him getting out-magicked by a mudblood. This causes Draco, in an effort to save face or get revenge or something, to send a note to lure Hermione to meet him alone. Then, cut to the next morning- Hermione is arrested for the attempted murder of Draco. So that’s it for the chapter summary.</p> <p>But I want to use this chapter to touch on something that has bothered me about this story- most of the action is totally without stakes or consequences for the characters. As readers we don’t care what happens. In the case of the Quirrell battle game, the prize for victory was already handed out at the Christmas break, none of the characters have anything on the line, and the story doesn’t really act like winning or losing has real consequences for anyone involved. A lot is happening, but it’s ultimately boring.</p> <p>The same thing happened in the anti-bullying chapters. Most of the characters being victimized lack names or personalities.
Hermione and team aren’t defending characters we care about and like; they are fighting the abstract concept of bullying (and the same is largely true of Hariezer’s forays into fighting bullies).</p> <p>Part of this is because of the obvious homage to Ender’s Game, without understanding that Ender’s Game was doing something very different- the whole point of Ender’s Game is that the series of games absolutely do feel low-stakes. Even when Ender kills another kid, it’s largely shrugged off as Ender continuing to win (which is the first sign something a bit deeper is happening). It’s supposed to feel game-y so the reader rides along with Ender and doesn’t viscerally notice the genocide happening. The contrast between the real-world stakes and the games being played is the point of the story. Where Ender’s Game failed for me is after the battles- we don’t feel Ender’s horror at learning what happened. Sure, Ender becomes Speaker for the Dead, but the book doesn’t make us feel Ender’s horror the same way we ride along with the game stuff. I think this is why so many people I know largely missed the point of the book and walked away with “War games are awesome!” (SCROLL DOWN FOR Fight Club FOOTNOTE THAT WAS MAKING THIS PARAGRAPH TOO LONG) But I digress- if your theme isn’t something to do with the connection between war and games and the way people perceive violence vs. games, etc., turning down the emotional stakes and the consequences for the characters makes your story feel like reading a video game play-by-play, which is horribly boring.</p> <p>If you cut out all the Quirrell game chapters after chapter 35, no one would notice- there is nothing at stake.</p> <p>ALSO- this chapter has an example of what I’ll call “DM munchkining,” i.e. it’s easy to munchkin when you write the rules. Hariezer is looking for powerful magic to aid him in battle, and starts reading up on potion making. He needs a way to make potions in the woods without magical ingredients, so he deduces by reading books that you don’t really need a magical ingredient; you get out of a potion ingredient what went into making it. So Hariezer makes a potion with acorns that gets back all the light that went into creating the acorn via photosynthesis. My point here is that this rule was created in this chapter entirely to be exploited by Hariezer in this battle. In a previous battle chapter, Hariezer exploits the fact that metal armor can block spells, a rule created specifically for that chapter to be exploited. <em>It’s not munchkining, it’s Calvinball.</em></p> <p>FOOTNOTE: This same problem happens with Fight Club. The tone of the movie builds up Tyler Durden as this awesome dude and the tone doesn’t shift when Ed Norton’s narrator character starts to realize how fucked everything is. So you end up with this movie that’s supposed to be satirical but no one notices. They rebel against a society they find dehumanizing, BY CREATING A SOCIETY WHERE THEY LITERALLY HAVE NO NAMES, but the tone is strong enough that people are like “GO PROJECT MAYHEM! WE SHOULD START A FIGHT CLUB!”</p> <h3 id="hpmor-79">HPMOR 79</h3> <p>This chapter continues on from 78. Hermione has been arrested for murder, but Hariezer now realizes in a sudden insight that she has been given a false memory.</p> <p>Hariezer also realizes this is how the Weasley twins planted Rita Skeeter’s false news story- they simply memory charmed Rita.
Of course, this opens up more questions than it solves- if false memory charming can be done with such precision, wouldn’t there be a rash of manipulations of this type? It’s such an obvious manipulation technique that chapters 24-26, the Fred and George “caper,” were written in a weirdly non-linear style just to make it seem more mysterious.</p> <p>Anyway, Hariezer tells the adults, who start investigating who might have memory charmed Hermione (you’d think wizard police would do some sort of investigation, but it’s HPMOR, so the world needs to be maximally silly as a foil to Hariezer).</p> <p>And then he has a discussion with the other kids who are bad-mouthing Hermione:</p> <blockquote> <p>Professor Quirrell isn’t here to explain to me how stupid people are, but I bet this time I can get it on my own. People do something dumb and get caught and are given Veritaserum. Not romantic master criminals, because they wouldn’t get caught, they would have learned Occlumency. Sad, pathetic, incompetent criminals get caught, and confess under Veritaserum, and they’re desperate to stay out of Azkaban so they say they were False-Memory-Charmed. Right? So your brain, by sheer Pavlovian association, links the idea of False Memory Charms to pathetic criminals with unbelievable excuses. You don’t have to consider the specific details, your brain just pattern-matches the hypothesis into a bucket of things you don’t believe, and you’re done. Just like my father thought that magical hypotheses could never be believed, because he’d heard so many stupid people talking about magic. Believing a hypothesis that involves False Memory Charms is low-status.</p> </blockquote> <p>This sort of thing bothers the hell out of me. Not only is cloying elitism creeping in, but in HPMOR as in the real world, arguments regarding “status” are just thinly disguised ad-hominems. True or not true, they aren’t really attacking an argument, just the people making them.</p> <p>After all, if we fall back on the “Bayesian conspiracy,” confessing to a crime/having a memory of the crime is (roughly) equal evidence for having done the crime and for having been false memory charmed, so all the action here is in the prior. CLAIMING a false memory charm is evidence of nothing at all (there’s a quick sketch of this arithmetic at the end of this section).</p> <p>So, if the base rate of false memory charms is so low that it’s laughable and “low status,” then the students are correctly using Bayesian reasoning.</p> <p>Hariezer might point out that they aren’t taking into account evidence about what sort of person Hermione is, but if the base rate of false memory charms is really so low, that is unlikely to matter much- after all, Hariezer doesn’t have any specific positive evidence she was false memory charmed, and she has been behaving strangely toward Draco for a while (which Hariezer suggests is a symptom of the way the perpetrator went about the false memory charm, but could just as easily be evidence she did it- the action is still in the prior).</p> <p>Similarly, his father didn’t believe in magic because it SHOULDN’T have been believed- until the story begins he has supposedly lived his whole life in our world- where magic is quite obviously not a real thing, regardless of “status.”</p> <p>OF COURSE- if the world were written as a non-silly place, the base rate for false memory charms would be through the roof and everyone would say “yea, she was probably false memory charmed! Who just blurts out a confession?” and the wizard cops would just do their job.</p>
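<p>(To make the “all the action is in the prior” arithmetic concrete, here is a minimal sketch in Python- the numbers and variable names are mine, not the story’s. The only point is that when the evidence is about equally likely under both hypotheses, the likelihood ratio is ~1 and the posterior odds are just whatever the prior odds were.)</p> <pre><code># Minimal sketch of the odds-form Bayes argument above (illustrative numbers only).

def posterior_odds(prior_odds, p_evidence_if_guilty, p_evidence_if_charmed):
    """Odds-form Bayes: posterior odds = prior odds * likelihood ratio."""
    likelihood_ratio = p_evidence_if_guilty / p_evidence_if_charmed
    return prior_odds * likelihood_ratio

# Prior odds of "actually guilty" vs "genuinely false-memory-charmed", assuming
# (as the students implicitly do) that real charmings are very rare: 1000 to 1.
prior = 1000.0

# The observed evidence -- a vivid memory of the crime plus a claim of having
# been charmed -- is about equally likely under either hypothesis.
print(posterior_odds(prior, 0.9, 0.9))  # 1000.0: the claim itself moved nothing
</code></pre> <p>Crank the prior- the base rate of genuine charmings- way up, as in the non-silly world described just above, and the same two lines of arithmetic point the other way.</p>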
<h3 id="remember-when">Remember when</h3> <p>Way back in chapter 20 something, Quirrell gave Hariezer Roger Bacon’s magic diary, and it was going to jump-start his investigation of the rules of magic? And then it was literally never mentioned again? The aptly named Chekhov’s Roger Bacon’s Magi-science Diary probably applies here.</p> <h3 id="hpmor-80">HPMOR 80</h3> <p>Apparently in the wizarding world, the way a trial is conducted involves a bunch of politicians voting on whether someone is guilty or innocent, so in this chapter the elder Malfoy uses his influence to convict Hermione. Not much to this chapter, really.</p> <p>BUT in some asides, we do get some flirting with neoreaction:</p> <blockquote> <p>At that podium stands an old man, with care-lined face and a silver beard that stretches down below his waist; this is Albus Percival Wulfric Brian Dumbledore… Karen Dutton bequeathed the Line to Albus Dumbledore… each wizard passing it to their chosen successor, back and back in unbroken chain to the day Merlin laid down his life. That (if you were wondering) is how the country of magical Britain managed to elect Cornelius Fudge for its Minister, and yet end up with Albus Dumbledore for its Chief Warlock. Not by law (for written law can be rewritten) but by most ancient tradition, the Wizengamot does not choose who shall preside over its follies. Since the day of Merlin’s sacrifice, the most important duty of any Chief Warlock has been to exercise the highest caution in their choice of people who are both good and able to discern good successors.</p> </blockquote> <p>And we get the PC/NPC distinction used by Hariezer to separate himself from the sheeple:</p> <blockquote> <p>The wealthy elites of magical Britain have collective force, but not individual agency; their goals are too alien and trivial for them to have personal roles in the tale. As of now, this present time, the boy neither likes nor dislikes the plum-colored robes, because his brain does not assign them enough agenthood to be the subjects of moral judgment. He is a PC, and they are wallpaper.</p> </blockquote> <p>Hermione is convicted and Hariezer is sad he couldn’t figure out something to do about it (he did try to threaten the elder Malfoy to no avail).</p> <h3 id="hpmor-81">HPMOR 81</h3> <p>Our last chapter ended with Hermione in peril- she was found guilty of the attempted murder of Draco! How will Hariezer get around this one?</p> <p>Luckily, the way the wizard world’s justice system works is fucking insane- being found guilty puts Hermione in the Malfoys’ “blood debt.” So Hariezer tells Malfoy:</p> <blockquote> <p>By the debt owed from House Malfoy to House Potter!…I’m surprised you’ve forgotten…surely it was a cruel and painful period of your life, laboring under the Imperius curse of He-Who-Must-Not-Be-Named, until you were freed of it by the efforts of House Potter. By my mother, Lily Potter, who died for it, and by my father, James Potter, who died for it, and by me, of course.</p> </blockquote> <p>So Hariezer wants the blood debt transferred to him so he can decide Hermione’s fate (what a convenient and ridiculous way to handle a system of law and order).</p> <p>But blood debts don’t transfer in this stupid world- instead, you also have to pay money. So Malfoy demands something like twice the money in Hariezer’s vault. Hariezer waffles a bit, but decides to pay. 
Because the demand is such a large sum, this will involve going into debt to the Malfoys.</p> <p>And then things get really stupid- Dumbledore says, as guardian of Hariezer’s vault, he won’t let the transaction happen.</p> <blockquote> <p>I’m - sorry, Harry - but this choice is not yours - for I am still the guardian of your vault.”</p> <p>“What? ” said Harry, too shocked to compose his reply.</p> <p>&quot;I cannot let you go into debt to Lucius Malfoy, Harry! I cannot! You do not know - you do not realize -&quot;</p> </blockquote> <p>So… here is a question- if Hariezer is going to go into a lot of debt to pay Malfoy, how does blocking his access to his money help avoid the debt? Wouldn’t Hariezer just take out a bigger loan from Malfoy?</p> <p>Anyway, despite super rationality, Hariezer doesn’t think through how stupid Dumbledore’s threat is. Hariezer instead threatens to destroy Azkaban if Dumbledore won’t let him pay Malfoy, so Dumbledore relents.</p> <p>Malfoy tries to weasel out of this nebulous blood debt arrangement because the rules of wizard justice change on the fly, but Hermione swears allegiance to House Potter and that prevents Malfoy’s weasel attempt.</p> <blockquote> <p>I acknowledge the debt, but the law does not strictly oblige me to accept it in cancellation,” said Lord Malfoy with a grim smile. “The girl is no part of House Potter; the debt I owe House Potter is no debt to her…</p> <p>And Hermione, without waiting for any further instructions, said, the words spilling out of her in a rush, “I swear service to the House of Potter, to obey its Master or Mistress, and stand at their right hand, and fight at their command, and follow where they go, until the day I die.”</p> </blockquote> <p>The implications here are obvious- if you saved all of magical Britain from a dark lord, and literally everyone owes you a “blood debt,” you are totally above the law. Hariezer should just steal the money he owes Malfoy from some other magical families.</p> <h3 id="hpmor-82">HPMOR 82</h3> <p>So the trial is wrapped up, but to finish off the section we get a long discussion between Dumbledore and Hariezer.</p> <p>First, credit where credit is due: there is an atypical subversion here- now it’s Dumbledore attempting to give a rationality lesson to Hariezer, and Hariezer agrees that he is right. It’s an attempt to mix up the formula a bit, and I appreciate it even if the rest of this chapter is profoundly stupid.</p> <p>So what is the rationality lesson here?</p> <blockquote> <p>“Yes,&quot; Harry said, &quot;I flinched away from the pain of losing all the money in my vault. But I did it! That’s what counts! And you -” The indignation that had faltered out of Harry’s voice returned. “You actually put a price on Hermione Granger’s life, and you put it below a hundred thousand Galleons!”</p> <p>&quot;Oh?&quot; the old wizard said softly. &quot;And what price do you put on her life, then? A million Galleons?&quot;</p> <p>&quot;Are you familiar with the economic concept of ‘replacement value’?&quot; The words were spilling from Harry’s lips almost faster than he could consider them. &quot;Hermione’s replacement value is infinite! There’s nowhere I can go to buy another one!”</p> <p>Now you’re just talking mathematical nonsense, said Slytherin. Ravenclaw, back me up here?</p> <p>&quot;Is Minerva’s life also of infinite worth?&quot; the old wizard said harshly. &quot;Would you sacrifice Minerva to save Hermione?&quot;</p> <p>&quot;Yes and yes,&quot; Harry snapped. 
&quot;That’s part of Professor McGonagall’s job and she knows it.&quot;</p> <p>&quot;Then Minerva’s value is not infinite,&quot; said the old wizard, &quot;for all that she is loved. There can only be one king upon a chessboard, Harry Potter, only one piece that you will sacrifice any other piece to save. And Hermione Granger is not that piece. Make no mistake, Harry Potter, this day you may well have lost your war.&quot;</p> </blockquote> <p>Basically, the lesson is this- you have to be willing to put a value on human life, even if it seems profane. It’s actually a good lesson and very important to learn. If everyone were more familiar with this, the semi-frequent GOVERNMENT HEALTHCARE IS DEATH PANELS panic would never happen. Although I’d add a caveat- anyone who has worked in healthcare does this so often that we start to make a mistake the other way (forgetting that underneath the numbers are actual people).</p> <p>Anyway, to justify the rationality lesson further we get a reference to some of Tetlock’s work (note: I’m unfamiliar with the work cited here, so I’m taking Yudkowsky at his word- if you’ve read the rest of my HPMOR stuff, you know this is dangerous).</p> <blockquote> <p>You’d already read about Philip Tetlock’s experiments on people asked to trade off a sacred value against a secular one, like a hospital administrator who has to choose between spending a million dollars on a liver to save a five-year-old, and spending the million dollars to buy other hospital equipment or pay physician salaries. And the subjects in the experiment became indignant and wanted to punish the hospital administrator for even thinking about the choice. Do you remember reading about that, Harry Potter? Do you remember thinking how very stupid that was, since if hospital equipment and doctor salaries didn’t also save lives, there would be no point in having hospitals or doctors? Should the hospital administrator have paid a billion pounds for that liver, even if it meant the hospital going bankrupt the next day?</p> </blockquote> <p>To bring it home, we find out that Voldemort captured Dumbledore’s brother and demanded ransom, and Mad-Eye counseled thusly:</p> <blockquote> <p>&quot;You ransom Aberforth, you lose the war,&quot; the man said sharply. &quot;That simple. One hundred thousand Galleons is nearly all we’ve got in the war-chest, and if you use it like this, it won’t be refilled. What’ll you do, try to convince the Potters to empty their vault like the Longbottoms already did? Voldie’s just going to kidnap someone else and make another demand. Alice, Minerva, anyone you care about, they’ll all be targets if you pay off the Death Eaters. That’s not the lesson you should be trying to teach them.&quot;</p> </blockquote> <p>So instead of ransoming Aberforth, he burned Lucius Malfoy’s wife alive (or at least convinced the Death Eaters that he did). That way they would think twice about targeting him.</p> <p>I think the rationality lesson is fine and dandy, just one problem- this situation is not at all like the hospital administrator in the example given. The problem here is that the idea of putting a price on a human life is only a useful concept in day-to-day reality where money has some real meaning. 
In an actual war, even one against a sort of weird guerrilla army of dark wizards money only becomes useful if you can exchange it for more resources, and in the wizard war resources means wizards.</p> <p>Ask yourself this- would a death eater target someone close to Dumbledore even if there was no possibility of ransom? OF COURSE THEY WOULD- the whole point is defeating Dumbledore, the person standing against them. Voldemort wouldn’t ask for ransom, because its a stupid thing to do- he would kill Aberforth and send pieces of him to Dumbledore by owl. This idea that ransoming makes targets of all of Dumbledore’s allies is just idiotic- they are already targets.</p> <p>Next, ask yourself this- does Voldemort have any use for money? Money is an abstract, useful because we can exchange for useful things. But its pretty apparent that Voldemort doesn’t really need money- he has no problem killing, taking and stealing. The parts of magical Britain that are willing to stand up to him won’t sell his death eaters goods at any price, and the rest are so scared they’ll give him anything for free.</p> <p>Meanwhile, Dumbledore is leading a dedicated resistance- basically a volunteer army. He doesn’t need to buy people’s time, they are giving it freely! Mad Eye himself notes that he could ask the Longbottoms or the Potters to empty their vaults and they would. What the resistance needs isn’t money, its people willing to fight. So in the context of this sort of war, and able fighting man like Aberforth is worth basically infinite money- money is common and useless and people willing to stand up to Voldemort are in extremely tight supply.</p> <p>It would have made a lot more sense to have Voldemort ask for prisoner exchange or something like that. Aberforth in exchange for Bellatrix Black. Then both sides would be trading value for value. But then the Tetlock reference wouldn’t be nearly as on-the-nose.</p> <p>At least this chapter makes clear the reason for the profoundly stupid wizard justice system and the utterly absurd blood-debt-punishment system. The whole idea was to arrange things so Hariezer could be asked to pay a ransom to Luscious Malfoy, so the reader can learn about Tetlock’s research/putting a price on lives,etc.</p> <p>At least I only have like 20 chapters of this thing left.</p> <h3 id="name-of-the-wind-bitching">Name of the Wind bitching</h3> <p>Whelp, Kvothe’s “awesomeness” has totally overwhelmed the narrative. Kvothe now has several “awesome” skills- he plays music so well that he was spontaneously given 7 talons (which is 7 times the standard Rothfuss unit for “a lot of money”). He plays music for money in a low-end tavern.</p> <p>He is a journeyman artificer, which means he can make magic lamps and what not and sell them for cash. He is brilliant enough that he could easily tutor students. He has two very wealthy friends who he could borrow from.</p> <p>AND YET he is constantly low on cash. To make this seem plausible, the book is weighed down by super long exposition in which Kvothe explains to the reader why all these obvious ways to make money aren’t working for him. When Kvothe isn’t explaining it to the reader directly, we cut to the framing story where Kvothe is explaining it to his two listeners. 
The book is chock-full of these paragraphs that are like “I know this is really stupid but here is why it actually makes sense.” Removing all this justification/exposition would probably cut the length of the book by at least 1/4.</p> <p>I could look past all of this if we were meeting other interesting characters at wizard school, but that isn’t happening. Kvothe has two good friends among the students, Wil and Sim. I’ve read countless paragraphs of banter between them and Kvothe, but I don’t know what they study, or really anything about them other than one has an accent.</p> <p>Another character Auri, a homeless girl who is that fantasy version of “mentally ill” that just makes people extra quirky, became friends with Kvothe off screen. Literally, we find out she exists after she has already listened to Kvothe playing music for days. She shows up for a scene then vanishes again for awhile.</p> <p>And we get a love interest who mostly just banters wittily with Kvothe and then vanishes. After pages of witty banter, Kvothe will then remind the reader he is shy around women (despite, you know, having just wittily bantered for pages, because that’s how characterization works in this novel).</p> <h3 id="hpmor-bitching">HPMOR bitching</h3> <p>Much like my previous name of the wind complaints, HPMOR is heavy with exposition- and for a similar reason. Hariezer is too “awesome” which leads to heavy-handed exposition (if for slightly different reasons than name of the wind).</p> <p>The standard rule of show,don’t tell implies that the best way to teach your audience something in a narrative is to have your characters learn from experience. Your characters need to make a mistake, or have something backfire. That way they can come out the other side stronger, having learned something. If you don’t trust your audience to have gotten the lesson, you can cap it off with some exposition outlining exactly what you want to learn, but the core of the lesson should be taught by the characters experience.</p> <p>But Yudkowsky inserted a Hariezer fully-equipped with the “methods of rationality.” So we get lots of episodes that set-up a conflict, and then Hariezer has a huge dump of exposition that explains why its not really a problem because rationality-judo, and the tension drains away. It would be far better to have Hariezer learn over time, so the audience can learn along with him.</p> <p>So Hariezer isn’t going to grow, he is just going to exposition dump most of his problems away. We can at least watch him explore the world, right? After all, Yudkowsky has put a “real scientist” into Hogwarts so we can finally see what material you actually learn at wizard school! All that academic stuff missing from the original novels! NOPE- we haven’t had a single class in the last 60 chapters. Hariezer isn’t even learning magic in a systematic way.</p> <p>I really, really don’t see what people see in this. The handful of chapters I found amusing feel like an eternity ago, it ran off the rails dozens of chapters ago! People sell the story as “using the scientific method in JK Rowling’s universe” but a more accurate description would be “it starts as using the scientific method in JK rowling’s universe, but stops doing that around chapter 25 or so. Then mostly its just about a war game, with some political maneuvering.”</p> <h3 id="hpmor-83-84">HPMOR 83-84</h3> <p>These are just rehashes of things we’ve already been presented with (so many words, so little content). 
The other students still think Hermione did it (although this is written in an awkward tell-rather-than-show style- Hariezer tells Hermione what is going on, rather than Hermione or the reader experiencing it). We get gems of cloying elitism like this:</p> <blockquote> <p>Hermione, you’ve told me a lot of times that I look down too much on other people. But if I expected too much of them - if I expected people to get things right - I really would hate them, then. Idealism aside, Hogwarts students don’t actually know enough cognitive science to take responsibility for how their own minds work. It’s not their fault they’re crazy.</p> </blockquote> <p>There is one bit of new info- as part of this investigation of the attempted murder of Draco, I guess Quirrell was investigated, and the aurors seem to think he is some missing wizard lord or something. This is totally superfluous- I assume we all know Quirrell is Voldemort. I’m hoping this doesn’t turn into a plot line.</p> <p>And finally, Quirrell tries to convince Hermione to leave and go to a wizard school where people don’t think she tried to kill someone. This is fine, but in part of it, Quirrell gives us this gem on being a hero:</p> <blockquote> <p>Long ago, long before your time or Harry Potter’s, there was a man who was hailed as a savior. The destined scion, such a one as anyone would recognize from tales, wielding justice and vengeance like twin wands against his dreadful nemesis…</p> <p>&quot;In all honesty…I still don’t understand it. They should have known that their lives depended on that man’s success. And yet it was as if they tried to do everything they could to make his life unpleasant. To throw every possible obstacle into his way. I was not naive, Miss Granger, I did not expect the power-holders to align themselves with me so quickly - not without something in it for themselves. But their power, too, was threatened; and so I was shocked how they seemed content to step back, and leave to that man all burdens of responsibility. They sneered at his performance, remarking among themselves how they would do better in his place, though they did not condescend to step forward.”</p> </blockquote> <p>So… the people seem mostly to rally around Dumbledore. He has a position of power and influence because of his dark-wizard vanquishing deeds. There aren’t a lot of indications people are actively attempting to make Dumbledore’s life unpleasant- he has the position he wants, turned down the position of Minister of Magic, etc. People are mostly in awe of Dumbledore.</p> <p>But there is some other hero, we are supposed to believe, who society mocked? I can’t help but draw parallels to Friendly AI research here…</p> <h3 id="hpmor-85">HPMOR 85</h3> <p>A return to my blogging obsession of old (which has been a slog for at least 20 chapters now, but if there is one thing that is true of all PhDs- we finish what we fucking start, even if it’s an awful idea).</p> <p>This chapter is actually not so bad, mostly Hariezer just reflecting on the difficulty of weighing his friends’ lives against “the cause,” as Dumbledore suggested he failed to do with Hermione in her trial a few chapters ago.</p> <p>There are some good bits. For instance, this interesting bit about bows and arrows in Australia:</p> <blockquote> <p>A year ago, Dad had gone to the Australian National University in Canberra for a conference where he’d been an invited speaker, and he’d taken Mum and Harry along. 
And they’d all visited the National Museum of Australia, because, it had turned out, there was basically nothing else to do in Canberra. The glass display cases had shown rock-throwers crafted by the Australian aborigines - like giant wooden shoehorns, they’d looked, but smoothed and carved and ornamented with painstaking care. In the 40,000 years since anatomically modern humans had migrated to Australia from Asia, nobody had invented the bow-and-arrow. It really made you appreciate how non-obvious was the idea of Progress.</p> </blockquote> <p>I always thought the fact that Australians (and a lot of small islanders) lost the bow and arrow (which is interesting! They had it and then they forgot about it!) was an interesting observation about the power of sharing ideas and the importance of large groups for creativity. Small, isolated populations seem to lose the ability to innovate. Granted, almost all of my knowledge about this comes from one anthropology course I only half remember.</p> <p>And of course there are always some sections that filled me with rage-</p> <blockquote> <p>Even though Muggle physics explicitly permitted possibilities like molecular nanotechnology or the Penrose process for extracting energy from black holes, most people filed that away in the same section of their brain that stored fairy tales and history books, well away from their personal realities:</p> </blockquote> <p>Molecular nanotechnology is just the term that sci-fi authors (and Eric Drexler) use for magic. And the nearest black hole is probably something like 2000 light years away. The reason people treat this stuff as far from their personal reality is exactly the same reason Yudkowsky treats it as far from his personal reality- IT IS. Black holes are neat, and GR is a ton of fun, but we aren’t going to be engineering with black holes in my lifetime.</p> <blockquote> <p>No surprise, then, that the wizarding world lived in a conceptual universe bounded - not by fundamental laws of magic that nobody even knew - but just by the surface rules of known Charms and enchantments…Even if Harry’s first guess had been mistaken, one way or another it was still inconceivable that the fundamental laws of the universe contained a special case for human lips shaping the phrase ‘Wingardium Leviosa’. …What were the ultimate possibilities of invention, if the underlying laws of the universe permitted an eleven-year-old with a stick to violate almost every constraint in the Muggle version of physics?</p> </blockquote> <p>You know what would be awesome? IF YOU GOT AROUND TO DOING SOME EXPERIMENTS AND EXPLORING THIS IDEA. The absolute essence of science is NOT asking these questions, it’s deciding to try to find out the fucking answers! You can’t be content to just wonder about things, you have to put the work in! Hariezer’s wonderment never gets past the stoned-college-kid wondering aloud and into ACTUAL exploration, and it’s getting really frustrating. YOU PROMISED ME YOU WERE GOING TO USE THE SCIENTIFIC METHOD TO LEARN THE SECRETS OF MAGIC. WAY BACK IN THE EARLY CHAPTERS.</p> <p>Anyway, towards the end of the ruminations, Fawkes visits Hariezer and basically offers to take him to Azkaban to try to take out the evil place. Hariezer (probably wisely) decides not to go. And the chapter ends.</p> <h3 id="hpmor-86">HPMOR 86</h3> <p>I just realized I have like 145 followers (HOLY SHIT!) and they probably came for the HPMOR thing. 
So I better keep the updates rolling!</p> <p>Anyway, this chapter is basically Hariezer and friends (Dumbledore, Snape, McGonagall, Mad-eye Moody) all trying to guess who might have been responsible for trying to frame Hermione. No real conclusions are drawn, not much to see here.</p> <p>A few notable things here- magic apparently works by the letter of the law, rather than the spirit:</p> <blockquote> <p>You say someone with the Dark Mark can’t reveal its secrets to anyone who doesn’t already know them. So to find out how the Dark Mark operates, write down every way you can imagine the Dark Mark might work, then watch Professor Snape try to tell each of those things to a confederate - maybe one who doesn’t know what the experiment is about - I’ll explain binary search later so that you can play Twenty Questions to narrow things down - and whatever he can’t say out loud is true. His silence would be something that behaves differently in the presence of true statements about the Mark, versus false statements, you see.</p> </blockquote> <p>Luckily, Voldemort thought of the test, thus freeing Snape to tell how the mark actually works:</p> <blockquote> <p>The Dark Lord was no fool, despite Potter’s delusions. The moment such a test is suspected, the Mark ceases to bind our tongues. Yet I could not hint at the possibility, but only wait for another to deduce it.</p> </blockquote> <p>Why not just make sure the Death Eaters don’t actually know the secrets of the mark? Seems like memory spells are everywhere already, and it would be way easier than this silly logic puzzle.</p> <p>Finding out the secrets of the dark mark prompts Hariezer to try a Bayesian estimate of whether Voldemort is actually dangerous. I repeat that for emphasis:</p> <blockquote> <p>Harry Potter, first-year at Hogwarts, who has only really succeeded at 1 thing in his learn-the-science-of-magic plan (partial transfiguration), and who knows he is not the most dangerous wizard at Hogwarts (Quirrell, Dumbledore), wonders whether Voldemort could possibly be a threat.</p> </blockquote> <p>Here are some of the things he considers:</p> <p>Harry had been to a convocation of the Wizengamot. He’d seen the laughable ‘security precautions’, if you could call them that, guarding the deepest levels of the Ministry of Magic. They didn’t even have the Thief’s Downfall which goblins used to wash away Polyjuice and Imperius Curses on people entering Gringotts … [if it] took you more than ten years to fail to overthrow the government of magical Britain, it meant you were stupid. But might they have some other precautions? Maybe they use some sort of secret precautions Harry himself doesn’t know about yet? Or might the wizards of the Wizengamot be pretty powerful in their own right?</p> <blockquote> <p>There were hypotheses where the Dark Lord was smart and the Order of the Phoenix didn’t just instantly die, but those hypotheses were more complicated and ought to get complexity penalties. After the complexity penalties of the further excuses were factored in, there would be a large likelihood ratio from the hypotheses ‘The Dark Lord is smart’ versus ‘The Dark Lord was stupid’ to the observation, ‘The Dark Lord did not instantly win the war’. That was probably worth a 10:1 likelihood ratio in favor of the Dark Lord being stupid… but maybe not 100:1. 
You couldn’t actually say that ‘The Dark Lord instantly wins’ had a probability of more than 99 percent, assuming the Dark Lord started out smart; the sum over all possible excuses would be more than .01.</p> </blockquote> <p>Dude, do you even Bayesian? Probability the dark mark still works if Voldemort is dead: ~0 (everyone who knows magic thinks that the mark still existing is proof he is still out there). Given that Voldemort is alive, probability he successfully completed some sort of immortality ritual: ~1. Probability someone who completed an immortality ritual knows more magic than (and therefore is a threat to) Hariezer Yudotter: ~1.</p> <p>So given that the dark mark is still around, Voldie is crazy dangerous, regardless of priors or base rates.</p> <p>It’s helpful to look at where the information is, instead of trying to estimate the probability Voldemort could have instantly killed some of the most powerful wizards on the fucking planet.</p> <p>Anyway….</p> <p>OH, another thing that happens- Hariezer challenges Mad-eye to a little mini duel. Guess how he solves the problem of winning against Mad-eye? Any ideas? What could he use? I’ll give you a hint- it rhymes with time turner. This story really should be called Harry Potter And The Method of Time Turners. Seriously- time turners solve basically all the problems in this book. Anyway, he goes to Flitwick, learns a bending spell, and then time turners back into the room to pop Moody.</p> <p>It’s not actually a bad scene- there is a bit of action and it moves pretty quickly. The problem is that the time turner solution is so damn boring at this point.</p> <p>Also, we find out in this chapter that everyone believes Quirrell is really somebody named David Monroe, whose family was killed by Voldemort and who was a leader during the war against Voldemort.</p> <p>So we have some possibilities-</p> <ol> <li><p>Voldemort was impersonating/half-invented the personality of David Monroe in order to play both sides during the war. After all, all of Monroe’s family was killed but him. Maybe all of Monroe’s family was killed, including him, and Voldemort started impersonating the dead guy. This could be a neat dynamic, I guess. Could “Voldemort” have been a Watchmen-style plan to unite magical Britain against a common threat that went awry for Monroe/Riddle? Quirrell really did get body snatched in this scenario. We could imagine an ending here where Monroe/Riddle are training Potter to be the leader of magical Britain that Monroe/Riddle wanted to be.</p></li> <li><p>Monroe was a real dude, Voldemort body-snatched him, and now you’ve got Monroe brain fighting Voldemort brain inside. For some reason, they are impersonating Quirrell?</p></li> </ol> <p>If it’s not the first scenario, I’m going to be sort of annoyed, because scenario 2 doesn’t provide us with much reason for the weird Monroe bit- you could just give Quirrell all of Monroe’s backstory.</p> <p>Anyway, 86 chapters in, I think this damn thing is going to clock in around 120 when all is said and done. ::sigh:: Time for a scotch.</p> <h3 id="hpmor-87-skinner-box-your-friends">HPMOR 87: skinner box your friends</h3> <p>Hariezer is worried Hermione will be uncomfortable around him after the trial. 
So what is his solution?</p> <blockquote> <p>&quot;I was going to give you more space,&quot; said Harry Potter, &quot;only I was reading up on Critch’s theories about hedonics and how to train your inner pigeon and how small immediate positive and negative feedbacks secretly control most of what we actually do, and it occurred to me that you might be avoiding me because seeing me made you think of things that felt like negative associations, and I really didn’t want to let that run any longer without doing something about it, so I got ahold of a bag of chocolates from the Weasley twins and I’m just going to give you one every time you see me as a positive reinforcement if that’s all right with you -&quot;</p> </blockquote> <p>Now, this idea of positive/negative reinforcement is an old one, and probably goes back to the psychologists associated with behaviorism (BF Skinner, Pavlov, etc.).</p> <p>The weird thing is, there is no “Critch” I can find associated with the behaviorists, or really any of the stuff attributed above. I also emailed my psych friend, who also has never heard of it (but “it’s not really my field at all”). I’m thinking there is like a 90% chance that Yudkowsky just invented a scientist here? Why not just say BF Skinner, or Pavlov here? WHAT IS GOING ON HERE?</p> <p>Anyway, Hermione and Hariezer are brainstorming ways to make money when they get into an argument because Hariezer has been sciencing with Draco:</p> <blockquote> <p>&quot;You were doing SCIENCE with him? &quot; &quot;Well -&quot; &quot;You were doing SCIENCE with him? You were supposed to be doing science with ME! &quot;</p> </blockquote> <p>Hermione, I get it. You wanted to figure out how magic works- you’ve got some curiosity about the world. And now you think Hariezer kept his program going, but cut you out of the big discoveries, and will leave you off the publications. But I’ve got news for you, girl: he hasn’t been doing science WITH ANYONE for like 60 chapters now. HE JUST FORGOT ABOUT IT.</p> <p>Anyway, this argument blows up, and Hariezer explains puberty:</p> <blockquote> <p>But even with all that weird magical stuff letting me be more adult than I should be, I haven’t gone through puberty yet and there’s no hormones in my bloodstream and my brain is physically incapable of falling in love with anyone. So I’m not in love with you! I couldn’t possibly be in love with you!</p> </blockquote> <p>And then he drops some evopsych:</p> <blockquote> <p>and besides I’ve been reading about evolutionary psychology, and, well, there are all these suggestions that one man and one woman living together happily ever afterward may be more the exception rather than the rule, and in hunter-gatherer tribes it was more often just staying together for two or three years to raise a child during its most vulnerable stages - and, I mean, considering how many people end up horribly unhappy in traditional marriages, it seems like it might be the sort of thing that needs some clever reworking - especially if we actually do solve immortality…</p> </blockquote> <p>To the story’s credit, this works about as well as you’d expect and Hermione storms off.</p> <p>I think the evopsych dropping could have been sort of funny if it were played more for laughs (Hariezer’s inept way of calming Hermione down), but here it just seems like a way to shoehorn this bit of evopsych into the story.</p> <p>The final scene in the chapter is played for laughs, with another student coming over after seeing Hermione storm off and saying “Witches! 
go figure, huh?”</p> <h3 id="hpmor-88-in-which-i-complain-about-a-lack-of-time-turners">HPMOR 88: in which I complain about a lack of time turners</h3> <p>The problem with solving every problem in your story with time turners is that it becomes incredibly conspicuous when you don’t solve a problem with time turners.</p> <p>In this chapter, the bit of canon from book 1 with the troll in the dungeon is introduced- someone comes running into the dining hall yelling troll. Luckily, Quirrell has the students well prepared:</p> <blockquote> <p>Without any hesitation, the Defense Professor swung smoothly on the Gryffindor table and clapped his hands with a sound like a floor cracking through. &quot;Michelle Morgan of House Gryffindor, second in command of Pinnini’s Army,&quot; the Defense Professor said calmly into the resulting quiet. &quot;Please advise your Head of House.&quot; Michelle Morgan climbed up onto her bench and spoke, the tiny witch sounding far more confident than Minerva remembered her being at the start of the year. “Students walking through the hallways would be spread out and impossible to defend. All students are to remain in the Great Hall and form a cluster in the center… not surrounded by tables, a troll would jump right over tables… with the perimeter defended by seventh-year students. From the armies only, no matter how good they are at duelling, so they don’t get in each other’s lines of fire.”</p> </blockquote> <p>So everyone will be safe from troll, but WAIT- Hariezer realizes Hermione is missing. What does he do? Does he commit himself to time turning himself a message telling him where Hermione is (to be fair, the time is noon, and the earliest he can reach with a time turner is 3pm. However he knows of another student who uses a time turner and is willing to send messages with it, from the post Azkaban escape. He also knows other powerful wizards use time turners, so he could ask one of them to pass the message,etc).</p> <p>I suspect we are approaching an important plot moment that time turnering would somehow break. Maybe we finally get a Quirrell reveal? Anywho, it’s jarring to not see Hariezer go immediately for the time turner. Instead he tries to enlist the aid of other students (and not ask if anyone has a time turner).</p> <p>Anyway, Hariezer decides they need to go look for her as fast as possible- but then</p> <blockquote> <p>The witch he’d named turned from where she’d been staring steadily out at the perimeter, her expression aghast for the one second before her face closed up. &quot;The Deputy Headmistress ordered us all to stay here, Mr. Potter.&quot; It took an effort for Harry to unclench his teeth. “Professor Quirrell didn’t say that and neither did you. Professor McGonagall isn’t a tactician, she didn’t think to check if we had missing students and she thought it was a good idea to start marching students through the hallways. 
But Professor McGonagall understands after her mistakes are pointed out to her, you saw how she listened to you and Professor Quirrell, and I’m certain that she wouldn’t want us to just ignore the fact that Hermione Granger is out there, alone -“</p> </blockquote> <p>So Hariezer flags this as:</p> <blockquote> <p>Harry’s brain flagged this as I’m talking to NPCs again and he spun on his heel and dashed back for the broomstick.</p> </blockquote> <p>Yes, Hariezer, in this world you are talking to NPCs- characters Yudkowsky wrote in, entirely to be stupid so that you can appear brilliant.</p> <p>Anyway, he rushes off with the Weasley twins to go find Hermione, and just as he finds her the chapter ends. I look forward to tuning in next time for the thrilling conclusion.</p> <h3 id="hpmor-89-grimdark">HPMOR 89: grimdark</h3> <p>There will be spoilers ahead. Although if you cared about spoilers, why are you reading this?</p> <p>So I thought the plot moment we were leading up to was a Quirrell reveal and I was dead wrong (a pun, because Hermione dies). By the time Hariezer arrives, Hermione has already been troll-smashed (should have used the time turner, bro).</p> <p>A brief battle scene ensues in which the Weasleys fail to be very effective, and Hariezer kills the troll by floating his father’s rock (which he has been wearing in a ring) into the troll’s mouth and then letting it go back to its original size, which pops the troll’s head.</p> <p>Hermione then utters her final words “not your fault” and then dies. Hariezer is obviously upset by this.</p> <p>Not a bad chapter really, even though it required a sort of “rationality failure” involving the time turners to get here. Normally I wouldn’t care about this sort of thing, but the fact that <em>basically every problem thus far was solved with time turners</em> makes it very hard to suspend my disbelief here. It feels a touch too much like characters are doing things just to make the plot happen (and not following their ‘natural’ actions).</p> <p>I fear the next ten chapters will be just reflections on this (instead of things happening).</p> <h3 id="hpmor-90-hariezer-s-lack-of-self-reflection">HPMOR 90: Hariezer's lack of self reflection</h3> <p>Brief note- it’s Mardi Gras, and I’m about as over served as I ever have been. I LIKE HOW OVER SERVED AS A PHRASE BLAMES THE BARTENDER AND NOT ME. THIS IS A THEME FOR THIS CHAPTER. Anyway, hopefully this will not lack my usual (non) eloquence.</p> <p>This chapter begins what appears to be a 9-part section on Hariezer trying to cope with the death of his friend.</p> <p>As the chapter opens, Hariezer cools Hermione’s body to try to preserve it. I guess that will slow decay, but probably not by enough to matter.</p> <p>And then Hariezer gets understandably mopey. Everyone is concerned he is withdrawing from the world, so McGonagall goes to talk to him and we get this bit:</p> <blockquote> <p>&quot;Nothing I could have done? &quot; Harry’s voice rose on the last word. &quot;Nothing I could have…Or if I’d just approached the whole problem from a different angle - <strong>if I’d looked for a student with a Time-Turner to send a message back in time</strong>…</p> </blockquote> <p>It’s the one in bold that is especially troubling, because the time turner is seriously what Hariezer always turns to (TURNS TO! GET IT! IT’S AN AWFUL PUN). When your character is defined by his munchkining ability to solve problems via time turner, and the one time he doesn’t go for the time turner a major plot point happens, it’s jarring to the reader. 
Almost as if characters are behaving entirely to make the plot happen…</p> <p>Anyway,</p> <blockquote> <p>She was aware now that tears were sliding down her cheeks, again. “Harry - Harry, you have to believe that this isn’t your fault!” &quot;Of course it’s my fault. There’s no one else here who could be responsible for anything.&quot; &quot;No! You-Know-Who killed Hermione!&quot; She was hardly aware of what she was saying, that she hadn’t screened the room against who might be listening. &quot;Not you! No matter what else you could’ve done, it’s not you who killed her, it was Voldemort! If you can’t believe that you’ll go mad, Harry!&quot; &quot;That’s not how responsibility works, Professor.&quot; Harry’s voice was patient, like he was explaining things to a child who was certain not to understand. He wasn’t looking at her anymore, just staring off at the wall to her right side. &quot;When you do a fault analysis, there’s no point in assigning fault to a part of the system you can’t change afterward</p> </blockquote> <p>So keep this in mind- Hariezer says it’s no use blaming anyone but himself, because he can’t change their actions. This seems like a silly NPC/PC distinction- no one can change their past actions, but everyone can learn how they could have improved things.</p> <blockquote> <p>&quot;All right, then,&quot; Harry said in a monotone. &quot;I tried to do the sensible thing, when I saw Hermione was missing and that none of the Professors knew. I asked for a seventh-year student to go with me on a broomstick and protect me while we looked for Hermione. I asked for help. I begged for help. And nobody helped me. Because you gave everyone an absolute order to stay in one place or they’d be expelled, no excuses…. So when something you didn’t foresee happened and it would’ve made perfect sense to send out a seventh-year student on a fast broom to look for Hermione Granger, the students knew you wouldn’t understand or forgive. They weren’t afraid of the troll, they were afraid of you. The discipline, the conformity, the cowardice that you instilled in them delayed me just long enough for Hermione to die. Not that I should’ve tried asking for help from normal people, of course, and I will change and be less stupid next time. But if I were dumb enough to allocate responsibility to someone who isn’t me, that’s what I’d say.&quot;</p> </blockquote> <p>What exactly does Hariezer think she should have said here? If a fire had broken out in the meal hall, does Hariezer think that everyone would have stayed in the cafeteria and burned to death out of fear of McGonagall? Also, it certainly sounds as if Hariezer has plenty of blame for people other than himself. ”I only blame me, but also you suck in the following ways…”</p> <blockquote> <p>But normal people don’t choose on the basis of consequences, they just play roles. There’s a picture in your head of a stern disciplinarian and you do whatever that picture would do, whether or not it makes any sense….People like you aren’t responsible for anything, people like me are, and when we fail there’s no one else to blame.”</p> </blockquote> <p>I AM THE ONLY PC, YOU ARE ALL NPC. I AM THE ONLY FULL HUMAN. TREMBLE BEFORE MY AGENTYNESS. I get that Hariezer is mourning, but is there any more condescending way to mourn? ”Everything is my fault because you aren’t all even fully human?” You are a fucking twerp, Hariezer, even when you mourn.</p> <blockquote> <p>His hand darted beneath his robes, brought forth the golden sphere that was the Ministry-issued protective shell of his Time Turner. 
He spoke in a dead, level voice without any emphasis. “This could’ve saved Hermione, if I’d been able to use it. But you thought it was your role to shut me down and get in my way.”</p> </blockquote> <p>No, Hariezer, you were told THERE WERE RULES and you violated them. You yourself have said that time travel can be dangerous, and you were using it because Snape asked questions you didn’t know the answer to, and really to solve any trivial problem. You broke the rules, and that got your time turner locked down right when you might have really wanted it. Total boy-who-cried-wolf situation, and yet it’s conspicuously absent from your discussion above- you blame yourself in lots of ways, but not in this way.</p> <blockquote> <p>Unable to speak, she brought forth her wand and did so, releasing the time-keyed enchantment she’d laced into the shell’s lock.</p> </blockquote> <p>The only lessons learned from this are other characters “updating towards” the idea that Hariezer Yudotter is always right. And he fails when other people have prevented his natural PC-based awesomeness.</p> <p>Anyway, McGonagall sends in the big guns (Quirrell) to try to talk to Hariezer, which leads Hariezer to say to him:</p> <blockquote> <p>The boy’s voice was razor-sharp. “I’d forgotten there was someone else in Hogwarts who could be responsible for things.”</p> </blockquote> <p>And later in the conversation:</p> <blockquote> <p>&quot;You did want to save her. You wanted it so strongly that you made some sort of actual effort. I suppose your mind, if not theirs, would be capable of that.&quot;</p> </blockquote> <p>So you see- it’s clearly not about assigning himself all the blame (because he can only change his own actions), it’s about separating the world into ‘real people’ and ‘NPCs.’ Only real people can get any blame for anything, everyone else is just window dressing. Maybe it’s a pet peeve, but I react in abhorrence to this “you aren’t even human enough to share some blame” schtick.</p> <p>8 more chapters in this fucking section.</p> <h3 id="hpmor-91">HPMOR 91</h3> <p>Total retread of the last chapter. Hariezer is still blaming himself, Snape tries to talk to him. They bring his parents in to try to talk to him. Nothing here really.</p> <h3 id="hpmor-92">HPMOR 92</h3> <p>Really, still nothing here. Quirrell is also concerned about Hariezer, but as before his concern seems less than totally genuine. I fear this arc is basically just a lot of retreads.</p> <h3 id="hpmor-93">HPMOR 93</h3> <p>Still very little going on in these chapters…</p> <p>So McGonagall completes the transformation she began two chapters ago, and realizes rules are for suckers and Hariezer is always right:</p> <blockquote> <p>&quot;I am ashamed,&quot; said Minerva McGonagall, &quot;of the events of this day. I am ashamed that there were only two of you. Ashamed of what I have done to Gryffindor. Of all the Houses, it should have been Gryffindor to help when Hermione Granger was in need, when Harry Potter called for the brave to aid him. It was true, a seventh-year could have held back a mountain troll while searching for Miss Granger. And you should have believed that the Head of House Gryffindor,&quot; her voice broke, &quot;would have believed in you. If you disobeyed her to do what was right, in events she had not foreseen. And the reason you did not believe this, is that I have never shown it to you. I did not believe in you. I did not believe in the virtues of Gryffindor itself. 
I tried to stamp out your defiance, instead of training your courage to wisdom.</p> </blockquote> <p>Maybe I’m projecting too much of canon McGonagall onto my reading of this one in this fanfic, but has she really been stamping out all defiance and being overly stern? Would any student really have believed they would have been expelled for trying to help find a missing student in a dire situation?</p> <p>Hariezer certainly wasn’t expelled (or punished in any way) for his experimenting with transfiguration/discovering partial transfiguration. He was punished for flaunting his time turner despite explicit instructions not to… But in a school for magic users, that is probably a necessity.</p> <p>Also, Hermione’s body has gone missing. I suspect Hariezer is cryonicsing it.</p> <h3 id="hpmor-94">HPMOR 94</h3> <p>This is the best chapter of this “reflect on what just happened” giant block of chapters, but that’s not saying much.</p> <p>Hariezer might not have taken Hermione’s body, but seems unconcerned that it’s missing (maybe he took it to freeze the brain, maybe Voldie took it to resurrect Hermione or brain upload her or something). That’s the only real thing of merit that happens in this chapter (a conversation between Dumbledore and Hariezer, a conversation between Neville and Hariezer).</p> <p>Hariezer has finally convinced himself that Voldemort is smart, which leads to this rumination:</p> <blockquote> <p>Okay, serious question. If the enemy is that smart, why the heck am I still alive? Is it seriously that hard to poison someone, are there Charms and Potions and bezoars which can cure me of literally anything that could be slipped into my breakfast? Would the wards record it, trace the magic of the murderer? Could my scar contain the fragment of soul that’s keeping the Dark Lord anchored to the world, so he doesn’t want to kill me? Instead he’s trying to drive off all my friends to weaken my spirit so he can take over my body? It’d explain the Parselmouth thing. The Sorting Hat might not be able to detect a lich-phylactery-thingy. Obvious problem 1, the Dark Lord is supposed to have made his lich-phylactery-thingy in 1943 by killing whatshername and framing Mr. Hagrid. Obvious problem 2, there’s no such thing as souls.</p> </blockquote> <p>All the readers are already on board this train, because they’ve read the canon novel, so I guess it’s nice that the “super rationalist” is considering it (although “Voldemort is smart, therefore I have a Voldemort fragment trying to possess me” is a huge leap. You didn’t even Bayes that shit, bro).</p> <p>But seriously, “there’s no such thing as souls?” SO DON’T CALL IT A SOUL, CALL IT A MAGIC RESURRECTION FRAGMENT. Are we really getting hung up on semantics?</p> <p>These chapters are intensely frustrating because any “rising action” in this story (we are nearing the conclusion after all) is blunted- after anything happens, we need 10 chapters for everyone to talk about everything and digest the events. The ratio of words/plot is ridiculously huge.</p> <p>We do maybe get a bit of self-reflection when Neville tries to blame himself for Hermione’s death:</p> <blockquote> <p>&quot;Wow,&quot; the empty air finally said. &quot;Wow. That puts a pretty different perspective on things, I have to say. I’m going to remember this the next time I feel an impulse to blame myself for something. 
Neville, the term in the literature for this is ‘egocentric bias’, it means that you experience everything about your own life but you don’t get to experience everything else that happens in the world. There was way, way more going on than you running in front of me. You’re going to spend weeks remembering that thing you did there for six seconds, I can tell, but nobody else is going to bother thinking about it. Other people spend a lot less time thinking about your past mistakes than you do, just because you’re not the center of their worlds. I guarantee to you that nobody except you has even considered blaming Neville Longbottom for what happened to Hermione. Not for a fraction of a second. You are being, if you will pardon the phrase, a silly-dilly. Now shut up and say goodbye.&quot;</p> </blockquote> <p>It would be nice for Hariezer to more explicitly use this to come to terms with his own grieving (instead of insisting on “heroic responsibility” for himself a few sections back, and also insisting it’s McGonagall’s fault for trying to enforce rules, and now insisting that blaming yourself is egocentric bias). I hope this is Hariezer realizing that he shouldn’t blame himself, and growing a bit, but fear this is Hariezer suggesting that Neville isn’t important enough to blame.</p> <p>Anyway, Hariezer insists that Neville leave for a while to help keep him safe.</p> <h3 id="hpmor-95">HPMOR 95</h3> <p>So the chapter opens with more incuriousness, which is the rest of the chapter in miniature:</p> <blockquote> <p>Harry had set the alarm upon his mechanical watch to tell him when it was lunchtime, since he couldn’t actually look at his wrist, being invisible and all that. It raised the question of how his eyeglasses worked while he was wearing the Cloak. For that matter the Law of the Excluded Middle seemed to imply that either the rhodopsin complexes in his retina were absorbing photons and transducing them to neural spikes, or alternatively, those photons were going straight through his body and out the other side, but not both. It really did seem increasingly likely that invisibility cloaks let you see outward while being invisible yourself because, on some fundamental level, that was how the caster had - not wanted - but implicitly believed - that invisibility should work.</p> </blockquote> <p>This would be an excellent fucking question to explore, maybe via some experiments. But no. I’ve totally given up on this story exploring the magic world in any detail at all. Anyway, Hariezer skips straight from “I wonder how this works” to “it must work this way, how could we exploit it”:</p> <blockquote> <p>Whereupon you had to wonder whether anyone had tried Confunding or Legilimizing someone into implicitly and matter-of-factly believing that Fixus Everythingus ought to be an easy first-year Charm, and then trying to invent it. Or maybe find a worthy Muggleborn in a country that didn’t identify Muggleborn children, and tell them some extensive lies, fake up a surrounding story and corresponding evidence, so that, from the very beginning, they’d have a different idea of what magic could do.</p> </blockquote> <p>This skips all the interesting hard work of science.</p> <p>The majority of the chapter is a long discussion between Quirrell and Hariezer where Quirrell tries to convince Hariezer not to try to raise the dead. It’s too dangerous, may end the universe, etc.</p> <p>Lots of discussion about how special Quirrell and Hariezer are because only they would even think to fight death, etc. 
It’s all a boring retread of ideas already explored in earlier chapters,etc.</p> <p>It reads a lot like any discussion of cryonics with a cryonics true believer:</p> <blockquote> <p>The Defense Professor’s voice was also rising. “The Transfiguration Professor is reading from a script, Mr. Potter! That script calls for her to mourn and grieve, that all may know how much she cared. Ordinary people react poorly if you suggest that they go off-script. As you already knew!”</p> </blockquote> <p>Also, it’s sloppy world building- do we really think no wizards in the HPMOR universe have spent time investigating death/spells to reverse aging/spells to deal with head injuries,etc?</p> <p>THERE IS A RESURRECTION STONE AND A LITERAL GATEWAY TO THE AFTERLIFE IN THE BASEMENT OF THE MINISTRY OF MAGIC. Maybe Hariezer’s FIRST FUCKING STOP if he wanted to investigate bringing back the dead SHOULD BE THAT GATE. Maybe some scientific experiments?</p> <p>It’s like the above incuriousness with the invisibility cloak (and the typical transhumanist approach to science)- assume all the problems are solved and imagine what the world be like, how dangerous that power might be. This is no way to explore a question. It’s not even producing a very interesting story.</p> <p>Quirrell assumes Hariezer might end the world even though he has shown 0 aptitude with any magic even approaching dangerous…</p> <h3 id="hpmor-96-more-of-the-same">HPMOR 96: more of the same</h3> <p>Remus takes Hariezer to Godric’s Hollow to try to cheer him up or whatever.</p> <p>Hariezer discovers the Potter’s family motto is apparently the passage from Corinthians:</p> <blockquote> <p>The Last Enemy to Be Destroyed is Death</p> </blockquote> <p>Hariezer is super glad that his family has a long history of trying to end death, and (at least) realizes that other wizards have tried. Of course, the idea of actually looking at their research doesn’t fucking occur to him because this story is very silly.</p> <p>We get this rumination from Hariezer on the Peverell’s ‘deathly hallows’ from the books:</p> <blockquote> <p>Hiding from Death’s shadow is not defeating Death itself. The Resurrection Stone couldn’t really bring anyone back. The Elder Wand couldn’t protect you from old age.</p> </blockquote> <p>HOW THE FUCK DO YOU KNOW THE RESURRECTION STONE CAN’T BRING ANYONE BACK? HAVE YOU EVEN SEEN IT?</p> <p>Step 1- assume that the resurrection stone doesn’t work because you can’t magically bring back the dead</p> <p>Step 2- decide you want to magically resurrect the dead</p> <p>Step 3- never revisit step 1.</p> <p>SCIENCE!</p> <p>GO INVESTIGATE THE DOORWAY TO THE AFTERLIFE! GO TALK TO PEOPLE ABOUT THE RESURRECTION STONE! DO SOME FUCKING RESEARCH! ”I’m going to resurrect the dead by thinking really hard about how much death sucks and doing nothing else.”</p> <h3 id="hpmor-97-plot-points-resolved-arbitrarily">HPMOR 97: plot points resolved arbitrarily</h3> <p>Next on the list to talk with Hariezer regarding Hermione’s death? The Malfoys who call Hariezer to Gringotts under the pretense of talking about Hariezer’s debt.</p> <p>On the way in he passes a goblin, which prompts this</p> <blockquote> <p>If I understand human nature correctly - and if I’m right that all the humanoid magical species are genetically human plus a heritable magical effect -</p> </blockquote> <p>How did you come to that conclusion Hariezer? What did you do to study it? Did you just make it up with no justification whatsoever? 
This story confuses science jargon for science.</p> <p>Anyway, Lucius is worried that he’ll be blamed for Hermione’s death (although given that it has already been established that the wizard court votes exactly as he wants it to, I’m not sure why he is worried about it), so he agrees to cancel Hariezer’s debt and return all his money if Hariezer swears Lucius didn’t have anything to do with the troll that killed Hermione.</p> <p>This makes very little sense- why would anyone listen to Hariezer on this? Hariezer doesn’t actually know that the Malfoys weren’t involved. If he is asked “how do you know?” he’ll have to say “I don’t.” If he Bayesed that shit, the Malfoys should be near the fucking top of the suspect list…</p> <p>Anyway, the Malfoys try to convince Hariezer that Dumbledore killed Hermione as some sort of multi-level plot.</p> <p>I’m so bored.</p> <h3 id="hpmor-98-this-block-is-nearly-over">HPMOR 98: this block is nearly over!</h3> <p>The agreement put in place in the previous chapter is enacted.</p> <p>Malfoy announces to Hogwarts that Hermione was innocent. Hariezer says there is no ill will between the Potters and the Malfoys. Why did we even need this fucking scene?</p> <p>Through backstage maneuvering by Hariezer and Malfoy, the Hogwarts board of governors enacts some rules for safety of students (travel in packs, work together, etc.). Why they needed the maneuvering I don’t know (just ask McGonagall to implement whatever rules you want. No effort required).</p> <p>Also, Neville was sent away from Hogwarts like... three chapters ago. But now he is in Hogwarts and stands up to read some of the rules? And Draco, who was closer to Hariezer, returns to Hogwarts? This makes no sense given Hariezer’s fear for his friends? ”No one is safe! Wait, I changed my mind even though nothing has happened.”</p> <p>There was also a surreal moment where the second worst thing I’ve ever read referenced the first:</p> <blockquote> <p>&quot;Remind me to buy you a copy of the Muggle novel Atlas Shrugged,&quot;</p> </blockquote> <h3 id="hpmor-99">HPMOR 99</h3> <p>This chapter is literally one sentence long. A unicorn died at Hogwarts. Why not just slap it into the previous chapter?</p> <h3 id="hpmor-100">HPMOR 100</h3> <p>Remember that mysterious bit about the unicorns dying? That merited a whole one-sentence chapter? Luckily, it’s totally resolved in this chapter.</p> <p>Borrowing a scene from canon, we have Draco and some Slytherin pals (working to fix the school) investigating the forest with Hagrid as part of a detention. This leads to a variant of an old game theory/CS joke:</p> <blockquote> <p>“Meself,” Hagrid continued, “I think we might ‘ave a Parisian hydra on our ‘ands. They’re no threat to a wizard, yeh’ve just got to keep holdin’ ‘em off long enough, and there’s no way yeh can lose. I mean literally no way yeh can lose so long’s yeh keep fightin’. Trouble is, against a Parisian hydra, most creatures give up long before. Takes a while to cut down all the heads, yeh see.” &quot;Bah,&quot; said the foreign boy. &quot;In Durmstrang we learn to fight Buchholz hydra. Unimaginably more tedious to fight! I mean literally, cannot imagine. First-years not believe us when we tell them winning is possible! 
Instructor must give second order, iterate until they comprehend.&quot;</p> </blockquote> <p>This time, it’s just Draco and friends in detention, no Hariezer/</p> <p>When Draco encounters the unicorn killer, all of a sudden Hariezer and aurors come riding in to save the day:</p> <blockquote> <p>After Captain Brodski had learned that Draco Malfoy was in the Forbidden Forest, seemingly in the company of Rubeus Hagrid, Brodski had begun inquiring to find out who had authorized this, and had still been unable to find out when Draco Malfoy had missed check-in. Despite Harry’s protests, the Auror Captain, who was authorized to know about Time-Turners, had refused to allow deployment to before the time of the missed check-in; there were standard procedures involving Time. But Brodski had given Harry written orders allowing him to go back and deploy an Auror trio to arrive one second after the missed check-in time.</p> </blockquote> <p>So… why does Hariezer come with the aurors? For what purpose? He is always talking about avoiding danger,etc so why ride into danger when the battle wizards will probably be enough?</p> <p>Anyway, we all know its Quirrell killing unicorns, so I’ll skip to the Hariezer/Quirrell interaction:</p> <blockquote> <p>The use of unicorn’s blood is too well-known.” &quot;I don’t know it,&quot; Harry said. &quot;I know you do not,&quot; the Defense Professor said sharply. &quot;Or you would not be pestering me about it. The power of unicorn’s blood is to preserve your life for a time, even if you are on the very verge of death.&quot;</p> </blockquote> <p>And then</p> <blockquote> <p>&quot;And why -&quot; Harry’s breath hitched again. &quot;Why isn’t unicorn’s blood standard in healer’s kits, then? To keep someone alive, even if they’re on the very verge of dying from their legs being eaten?&quot; &quot;Because there are permanent side effects,&quot; Professor Quirrell said quietly. &quot;Side effects? Side effects? What kind of side effect is medically worse than DEATH? &quot; Harry’s voice rose on the last word until he was shouting. &quot;Not everyone thinks the same way we do, Mr. Potter. Though, to be fair, the blood must come from a live unicorn and the unicorn must die in the drinking. Would I be here otherwise?&quot; Harry turned, stared at the surrounding trees. “Have a herd of unicorns at St. Mungos. Floo the patients there, or use portkeys.” &quot;Yes, that would work.&quot;</p> </blockquote> <p>So do you remember a few chapters back when Hariezer was worried about eating plants or animals that might be conscious (after he learned snake speech)?</p> <p>He knows literally nothing about unicorns here, nothing about what the side effects are,etc. I know lots of doctors who have living wills because they aren’t ok with the side effects of certain life-preserving treatments.</p> <p>This feels again like canon is fighting the transhumanist message the author wants to insert.</p> <h3 id="hpmor-101">HPMOR 101</h3> <p>Still in the woods, Hariezer encounters a centaur who tries to kill him, because he divines that Hariezer is going to make all the stars die.</p> <p>There are some standard anti-astrology arguments, which again seems to be fighting the actual situation because the centaurs successfully use astrology to divine things.</p> <p>We get this:</p> <blockquote> <p>&quot;Cometary orbits are also set thousands of years in advance so they shouldn’t correlate much to current events. 
And the light of the stars takes years to travel from the stars to Earth, and the stars don’t move much at all, not visibly. So the obvious hypothesis is that centaurs have a native magical talent for Divination which you just, well, project onto the night sky.&quot;</p> </blockquote> <p>There are so, so many other hypotheses, Hariezer. Maybe starlight has a magical component that waxes and wanes as stars align into different magical symbols, or some such. The HPMOR scientific method:</p> <p>observation -&gt; generate 1 hypothesis -&gt; assume you are right -&gt; it turns out that you are right.</p> <p>Quirrell saves Hariezer and I guess in the aftermath Filch and Hagrid both get sacked (we aren’t actually shown this, instead Dumbledore and Hariezer have a discussion about it, because why show when you can have characters talk about it! So much more interesting!)</p> <p>Anyway, Dumbledore is a bit sad about the loss of Filch and especially Hagrid, but Hariezer says:</p> <blockquote> <p>&quot;Your mistake,&quot; Harry said, looking down at his knees, feeling at least ten percent as exhausted as he’d ever been, &quot;is a cognitive bias we would call, in the trade, scope insensitivity. Failure to multiply. You’re thinking about how happy Mr. Hagrid would be when he heard the news. Consider the next ten years and a thousand students taking Magical Creatures and ten percent of them being scalded by Ashwinders. No one student would be hurt as much as Mr. Hagrid would be happy, but there’d be a hundred students being hurt and only one happy teacher.&quot;</p> </blockquote> <p>First, “in the trade”? Really?</p> <p>Anyway, Hariezer isn’t multiplying in the obvious tangible benefits of an enthusiastic teacher who really knows his shit regarding magical creatures. Yes, more students will be scalded, but it’s because there will be SUPER AWESOME LESSONS WHERE KIDS COULD BE SCALDED!</p> <p>On balance, I think Hariezer was right about Filch and Dumbledore was right about Hagrid.</p> <p>Anyway, that’s it for this chapter, it’s a standard “chapter where people do nothing but talk.”</p> <p>Harry Potter and the Methods of Expository Dialogue.</p> <h3 id="hpmor-102-open-borders-and-death-spells">HPMOR 102: open borders and death spells</h3> <p>Quirrell is still dying, Hariezer brings him a unicorn he turned into a stone.</p> <p>We learn how horcruxes work in this world:</p> <blockquote> <p>Only one who doess not believe in common liess will reasson further, ssee beneath obsscuration, realisse how to casst sspell. Required murder iss not ssacrificial ritual at all. Ssudden death ssometimes makess ghosst, if magic burssts and imprintss on nearby thing. Horcrux sspell channelss death-bursst through casster, createss your own ghosst insstead of victim’ss, imprintss ghosst in sspecial device. Ssecond victim pickss up horcrux device, device imprintss your memoriess into them. But only memoriess from time horcrux device wass made. You ssee flaw?”</p> </blockquote> <p>Wait, a ghost has all the memories of the person who died? Why isn’t Hariezer reading everything he can about how these imprints work? If the Horcrux can transfer ghost-like stuff into a person, could you return any ghost to a new body? I feel like Hariezer just says “I’m going to end death! Humanity should end death! I can’t believe no one is trying to end death!” But he isn’t actually doing anything about it himself.</p> <p>Also, if that is how a horcrux works WHY THE FUCK WOULD VOLDEMORT PUT ONE ON A PIONEER PROBE? 
The odds of that encountering people again are pretty much nill. At least we’ve learned horcruxes aren’t conscious- I had assumed Voldemort had condemned one of his copies to an eternity of isolation.</p> <p>We also learn that in HPMOR world</p> <blockquote> <p>There is a second level to the Killing Curse. Harry’s brain had solved the riddle instantly, in the moment of first hearing it; as though the knowledge had always been inside him, waiting to make itself known. Harry had read once, somewhere, that the opposite of happiness wasn’t sadness, but boredom; and the author had gone on to say that to find happiness in life you asked yourself not what would make you happy, but what would excite you. And by the same reasoning, hatred wasn’t the true opposite of love. Even hatred was a kind of respect that you could give to someone’s existence. If you cared about someone enough to prefer their dying to their living, it meant you were thinking about them. It had come up much earlier, before the Trial, in conversation with Hermione; when she’d said something about magical Britain being Prejudiced, with considerable and recent justification. And Harry had thought - but not said - that at least she’d been let into Hogwarts to be spat upon. Not like certain people living in certain countries, who were, it was said, as human as anyone else; who were said to be sapient beings, worth more than any mere unicorn. But who nonetheless wouldn’t be allowed to live in Muggle Britain. On that score, at least, no Muggle had the right to look a wizard in the eye. Magical Britain might discriminate against Muggleborns, but at least it allowed them inside so they could be spat upon in person. What is deadlier than hate, and flows without limit? &quot;Indifference,&quot; Harry whispered aloud, the secret of a spell he would never be able to cast; and kept striding toward the library to read anything he could find, anything at all, about the Philosopher’s Stone.</p> </blockquote> <p>So standard open borders stuff, not worth spending time with.</p> <p>But I want to talk about the magic here- apparently you can only cast the killing curse at people you hate, and toward people you are indifferent toward. So you can’t kill your loved ones! Big limitation!</p> <p>Also, Hariezer “99% of the fucking planet is NPCs” Yudotter isn’t indifferent to anyone? I call BS.</p> <h3 id="hpmor-103-very-punny">HPMOR 103: very punny</h3> <p>Credit where credit is due, this whole chapter sets up a pretty clever pun.</p> <p>The students take an exam, and then receive their final “battle magic” grades. Hermione is failed because she made the mistake of dying. Hariezer gets an exceeds expectations, which Quirrell informs Hariezer “It is the same grade… that I received in my own first year.”</p> <p>Get it? 
He marked him as an equal.</p> <h3 id="hpmor-104-plot-threads-hastily-tied-up-also-some-nonsense">HPMOR 104: plot threads hastily tied up/also some nonsense</h3> <p>So this chapter opens with a quidditch game, in an attempt to wrap up an earlier plot thread- Quirrell’s reward for his battle game (a reward given out back in chapter 34 or so, and literally never mentioned again until this chapter) was that Slytherin and Ravenclaw would tie for the House Cup and Hogwarts would stop playing quidditch with the snitch.</p> <p>Going into this game, Hufflepuff is in the lead for the House Cup “by something like five hundred points.” Quirrell is out of commission with his sickness, but the students have taken matters into their own hands- it appears the plan is just to not catch the snitch?</p> <blockquote> <p>It was at eight pm and six minutes, according to Harry’s watch, when Slytherin had just scored another 10 points bringing the score to 170-140, when Cedric Diggory leapt out of his seat and shouted “Those bastards!” &quot;Yeah!&quot; cried a young boy beside him, leaping to his own feet. &quot;Who do they think they are, scoring points?&quot; &quot;Not that!&quot; cried Cedric Diggory. &quot;They’re - they’re trying to steal the Cup from us! &quot; &quot;But we’re not in the running any more for -&quot; &quot;Not the Quidditch Cup! The House Cup!&quot;</p> </blockquote> <p>What? It’s totally unclear to me how this is supposed to work. In the books, as I remember it, points were awarded for winning quidditch games, NOT for simply scoring points within a quidditch game. Winning 500 to 500 will just result in some fixed amount of points going to the winner.</p> <p>Also, there appears to be a misunderstanding of quidditch:</p> <blockquote> <p>The game had started at six o’ clock in the afternoon. A typical game would have gone until seven or so, at which point it would have been time for dinner.</p> </blockquote> <p>No, as I recall, games go on for days, not one hour. I think the books mention a game lasting longer than a month. No one would be upset at a game where the snitch hasn’t been caught in a few hours.</p> <p>Basically, this whole thing feels really ill-conceived.</p> <p>Luckily, the chapter pivots away from the quidditch game pretty quickly: Hariezer gets a letter from himself.</p> <blockquote> <p>Beware the constellation, and help the watcher of stars and by the wise and the well-meaning. in the place that is prohibited and bloody stupid. Pass unseen by the life-eaters’ confederates, Six, and seven in a square,</p> </blockquote> <p>I note that Hariezer established way back when somewhere that he has a system in place to communicate with himself, with secret codes for his notes to make sure they really are for him. I’m too lazy to dig this back up, but I definitely remember reading it. Probably in chapter 13 with the time travel game?</p> <p>Anyway, apparently Hariezer has forgotten this (I hope this comes up and it’s not just a weird problem introduced for no reason?) because this turns out to be a decoy note from Quirrell to lure him to the forbidden corridor. 
After a whole bunch of people all show up at the forbidden corridor at the same time, and some chaos breaks out, Hariezer and Quirrell are the last men standing, which leads to this:</p> <blockquote> <p>An awful intuition had come over Harry, something separate from all the reasoning he’d done so far, an intuition that Harry couldn’t put into words; except that he and the Defense Professor were very much alike in certain ways, and faking a Time-Turned message was just the sort of creative method that Harry himself might have tried to bypass all of a target’s protections - … And Professor Quirrell had known a password that Bellatrix Black had thought identified the Dark Lord and his presence gave the Boy-Who-Lived a sense of doom and his magic interacted destructively with Harry’s and his favorite spell was Avada Kedavra and and and … Harry’s mouth was dry, even his lips were trembling with adrenaline, but he managed to speak. “Hello, Lord Voldemort.” Professor Quirrell inclined his head in acknowledgement, and said, “Hello, Tom Riddle.”</p> </blockquote> <p>We also indirectly find out that Quirrell killed Hermione (but we already knew that), although he did it by controlling professor Sprout (I guess to throw off the scent if he got caught?)</p> <p>Anyway, this pivotal plot moment seems to rely entirely on the fact that Hariezer forgot his own coded note system?</p> <h3 id="hpmor-105">HPMOR 105</h3> <p>So Quirrell gets Hariezer to cooperate with him, by threatening students, and offering to resurrect Hermione if he gets the philosopher’s stone</p> <blockquote> <p>And know this, I have taken hostages. I have already set in motion a spell that will kill hundreds of Hogwarts students, including many you called friends. I can stop that spell using the Stone, if I obtain it successfully. If I am interrupted before then, or if I choose not to stop the spell, hundreds of students will die.</p> </blockquote> <p>Hariezer does manage to extract a concession:</p> <blockquote> <p>Agreed,” hissed Professor Quirrell. “Help me, and you sshall have ansswerss to your quesstions, sso long ass they are about passt eventss, and not my planss for the future. I do not intend to raisse my hand or magic againsst you in future, sso long ass you do not raisse your hand or magic againsst me. Sshall kill none within sschool groundss for a week, unlesss I musst. Now promisse that you will not attempt to warn againsst me or esscape. Promisse to put forth your own besst efforts toward helping me to obtain the Sstone. And your girl-child friend sshall be revived by me, to true life and health; nor sshall me or mine ever sseek to harm her.” A twisted smile. “Promisse, boy, and the bargain will be sstruck.”</p> </blockquote> <p>So coming up we’ll get one of those chapters where the villain explains everything. Always a good sign when the villain does apparently nothing for 90 or so out of 100 chapters, and then explains the significance of everything at the very end.</p> <h3 id="hpmor-106">HPMOR 106</h3> <p>Not much happens here, Quirrell kills the 3 headed cerberus to get past the first puzzle. When Hariezer points out that might have alerted someone, Quirrell is all “eh, I fucked all the wards up.”</p> <p>So I guess more time to go before we get the villain monologue chapter.</p> <h3 id="hpmor-107">HPMOR 107</h3> <p>Still no villain monologue. 
Quirrell and Hariezer encounter the other puzzles from the book, and Quirrell blasts them to death with magic fire rather than actually solving them.</p> <p>However, Quirrell has some random reasons to not blast apart the potion room (he respects Snape or something, blah, blah). Anyway, apparently this means he’ll have to make a long and complicated potion, which will give Quirrell and Hariezer some time to talk.</p> <p>Side note: credit where credit is due, I again notice these chapters flow much better, and have a much smoother writing style. There is some wit in the way that Quirrell just hulk-smashes all the puzzles (although stopping at Snape’s puzzle seems like a contrived way to drop the monologue we know is coming next chapter or so into the story). When things are happening, HPMOR can be decently written.</p> <h3 id="hpmor-108-monologue">HPMOR 108: monologue</h3> <p>So we get the big “explain everything” monologue, and it’s kind of a letdown?</p> <p>The first secret we get- Hariezer is indeed a copy of Voldemort (which was just resolving some dramatic irony, we all knew this because we read the books). In a slight twist, we find out that he was intentionally a horcrux-</p> <blockquote> <p>It occurred to me how I might fulfill the Prophecy my own way, to my own benefit. I would mark the baby as my equal by casting the old horcrux spell in such fashion as to imprint my own spirit onto the baby’s blank slate… I would arrange with the other Tom Riddle that he should appear to vanquish me, and he would rule over the Britain he had saved. We would play the game against each other forever, keeping our lives interesting amid a world of fools.</p> </blockquote> <p>But apparently creating the horcrux created some sort of magic resonance and killed his body. But he had somehow made true-immortal horcruxes. Unfortunately, he had put them in stupid places like on the Pioneer probe or in volcanoes where people would never touch them, so he never managed to find a host (remember when I complained about that a few chapters back?).</p> <p>Hariezer does point out that Voldemort should have tested the new horcrux spell. He suggests Voldie failed to do so because he doesn’t think about doing nice things, but Voldie could have just horcruxed someone, killed them to test it, then destroyed the horcrux, then killed them for real. Not nice, pretty straightforward. It feels like this is going to be Voldemort’s weakness that gets exploited.</p> <p>We find out that the philosopher’s stone makes transfigurations permanent, which I guess is a minor twist on the traditional legend? Really, just a specific way of making it work- in the legends it can transmute metals, heal sickness, bring dead plants back to life, let you make homunculi, etc.</p> <p>In HPMOR, powerful magical artifacts can’t have been produced recently, because lore is lost or whatever, so we get a grimdark history of the stone, involving a Hogwarts student seducing Professor Baba Yaga to trick her into taking her virginity so she could steal the stone. Really incidental to anything. Anyway, Flamel, who stole the stone, is both man and woman and uses the stone to transmute back and forth, and apparently gave Dumbledore power to fight Grindelwald.</p> <p>Quirrell killed Hermione (duh) because</p> <blockquote> <p>I killed Miss Granger to improve your position relative to that of Lucius Malfoy, since my plans did not call for him to have so much leverage over you.</p> </blockquote> <p>I don’t think this actually makes much sense at all? 
It’s pretty clear Voldie plans to kill Hariezer as soon as this is over, so why should he care about Malfoy at all in this? I had admittedly assume he killed Hermione to help his dark side take over Hariezer or something.</p> <p>Apparently they raided Azkaban to find out where Black had hidden Quirrell’s wand.</p> <p>Also, as expected, Voldemort was both Monroe and Voldemort and was playing both sides in order to gain political power. He wanted to get political power because he was afraid muggles would destroy the world.</p> <p>Basically, every single reveal is basically what you’d expect from the books. Harry Potter and The Obvious Villain Monologue.</p> <p>The only open question is why the Hariezer-crux, given how that spell is supposed to work, didn’t have any of Voldemort’s memories up until that time? I expect we are supposed to chalk it up to “the spell didn’t quite work because of the resonance that blew everything up” or whatever.</p> <h3 id="hpmor-109-supreme-author-wank">HPMOR 109: supreme author wank</h3> <p>We get to the final mirror, and we get this bit of author wank:</p> <blockquote> <p>Upon a wall of metal in a place where no one had come for centuries, I found written the claim that some Atlanteans foresaw their world’s end, and sought to forge a device of great power to avert the inevitable catastrophe. If that device had been completed, the story claimed, it would have become an absolutely stable existence that could withstand the channeling of unlimited magic in order to grant wishes. And also - this was said to be the vastly harder task - the device would somehow avert the inevitable catastrophes any sane person would expect to follow from that premise. The aspect I found interesting was that, according to the tale writ upon those metal plates, the rest of Atlantis ignored this project and went upon their ways. It was sometimes praised as a noble public endeavor, but nearly all other Atlanteans found more important things to do on any given day than help. Even the Atlantean nobles ignored the prospect of somebody other than themselves obtaining unchallengeable power, which a less experienced cynic might expect to catch their attention. With relatively little support, the tiny handful of would-be makers of this device labored under working conditions that were not so much dramatically arduous, as pointlessly annoying. Eventually time ran out and Atlantis was destroyed with the device still far from complete. I recognise certain echoes of my own experience that one does not usually see invented in mere tales.”</p> </blockquote> <p>Get it? It’s friendly AI and we are all living in Atlantis! And Yud is bravely toiling away in obscurity to save us all! (Note: toiling in obscurity in this context means soliciting donations to continue running one of the least productive research organizations in existence.)</p> <p>Anyway, after this bit of wankery, Voldie and Hariezer return to the problem of how to get the stone. The answer turns out to be by confounding himself into thinking that he is Dumbledore wanting the stone back after Voldemort has been defeated.</p> <p>I point out that the book’s original condition where the way to get the stone was to not want the stone was vastly more clever. Don’t think of elephants, and all that.</p> <p>Anyway, after Voldemort gets the stone, Dumbledore shows up.</p> <h3 id="hpmor-110">HPMOR 110</h3> <p>Apparently Dumbledore was going to use the mirror as a trap to banish Voldemort. 
But when he saw Hariezer was with him, Dumbledore sacrificed himself to save Hariezer. So now Dumbledore is banished somewhere.</p> <h3 id="so-i-read-chapter-114">So I read chapter 114</h3> <p>And I kind of don’t get how Hariezer’s solution worked? He turned the ground near his wand into spider silk, but how did he get it around everyone’s neck? Why did he make it spider silk before turning it into nanowire?</p> <h3 id="hpmor-111">HPMOR 111</h3> <p>So in this chapter, Hariezer is stripped of wand, pouch and time turner, and Voldemort has the philosopher’s stone. It’s looking pretty bad for our hero.</p> <p>Voldemort walks Hariezer to an altar he has prepared,does some magic stuff, and a new shiny body appears for him. So now he has fully resurrected.</p> <p>Voldemort then brings back Hermione (Hariezer HAD stolen the remains). I note that I don’t think Hariezer has actually done anything in this entire story- his first agenda- study magic using science- was a complete bust, and then his second agenda was a big resolution to bring back Hermione, but Voldemort did it for him. Voldemort also crossed Hermione with a troll and unicorn (creating a class Trermionecorn) so she is now basically indestructible. Why didn’t Voldemort do this to his own body? No idea. Why would someone obsessed with fighting death and pissed as hell about how long it took him to reclaim his old body think to give Hermione wolverine-level indestructibility but not himself? Much like the letter Hariezer got to set this up, it’s always bad when characters have to behave out of character in order to set the plot up.</p> <p>Anyway, to bring Hermione back Voldie gives Hariezer his wand back so he can hit her with his super-patronus. So he now has a wand.</p> <p>Voldemort then asks Hariezer for Roger Bacon’s diary (which he turns into a Hermione horcrux), which prompts Hariezer to say</p> <blockquote> <p>I tried translating a little at the beginning, but it was going slowly -” Actually, it had been excruciatingly slow and Harry had found other priorities.</p> </blockquote> <p>Yep, doing experiments to discover magic, despite being the premise of the first 20ish chapters then immediately stopped being any priority at all. Luckily it shows up here as an excuse for Hariezer to get his pouch back (he retrieved it to get the diary).</p> <p>While Voldemort is distracted making the horcrux, Hariezer whips a gun out of the pouch and shoots Voldemort. Something tells me it didn’t work.</p> <h3 id="hpmor-112">HPMOR 112</h3> <p>Not surprisingly, the shots did nothing to Voldemort who apparently can create a wall of dirt faster than a bullet travels. Apparently it was a trick, because Voldemort had created some system where no Tom Riddle could kill any other Tom Riddle unless the first had attacked him. Somehow, Hariezer hadn’t been bound to it, and now neither of them are.</p> <p>I don’t know why Yudkowsky introduced this curse and then had it broken immediately? It would have been more interesting to force Hariezer to deal with Voldemort without taking away his immortality. 
Actually, given all the focus on memory charms in the story, it’s pretty clear that when he achieves ultimate victory Hariezer will do it with a memory charm- turning Tom Riddle into an immortal amnesiac or implanting other memories in him (so he is an immortal guy who works in a restaurant or something).</p> <p>In retaliation, Voldemort hits him with a spell that takes everything but his glasses and wand, so he is naked in a graveyard, and a bunch of death eaters teleport in (37 of them).</p> <h3 id="hpmor-113">HPMOR 113</h3> <p>This is a short chapter, Hariezer is still in peril. As Voldemort’s death eaters pop in, one of them tries to turn against him but Voldemort kills him dead.</p> <p>And then he forces Hariezer to swear an oath to not try to destroy the world- Voldemort’s plan is apparently to resurrect Hermione, force Hariezer to not destroy the world, and then to kill him. (I note that if he did things in a slightly different order, he’d have already won…)</p> <p>So this chapter ends with Hariezer surrounded by death eaters, all with wands pointed at him, ready to kill him, and we get this (sorry, it’s a long quote).</p> <blockquote> <p>This is your final exam. Your solution must at least allow Harry to evade immediate death, despite being naked, holding only his wand, facing 36 Death Eaters plus the fully resurrected Lord Voldemort. <em>12:01AM Pacific Time</em> (8:01AM UTC) on Tuesday, March 3rd, 2015, the story will continue to Ch. 121. Everyone who might want to help Harry thinks he is at a Quidditch game. he cannot develop wordless wandless Legilimency in the next 60 seconds. the Dark Lord’s utility function cannot be changed by talking to him. the Death Eaters will fire on him immediately. if Harry cannot reach his Time-Turner without Time-Turned help - then the Time-Turner will not come into play. Harry is allowed to attain his full potential as a rationalist, now in this moment or never, regardless of his previous flaws. if you are using the word ‘rational’ correctly, is just a needlessly fancy way of saying ‘the best solution’ or ‘the solution I like’ or ‘the solution I think we should use’, and you should usually say one of the latter instead. (We only need the word ‘rational’ to talk about ways of thinking, considered apart from any particular solutions.) if you know exactly what a smart mind would do, you must be at least that smart yourself….</p> </blockquote> <p>The issue here is that literally any solution, no matter how outlandish will work if you can design the rules to make the munchkin work.</p> <p>Hariezer could use partial transfiguration to make an atomic lattice under the earth with a few atoms of anti-matter under each death eater to blow up each of them.</p> <p>He could use the fact that Voldemort apparently cast broomstick spells on his legs to levitate Voldemort. And after levitating Voldemort away convince the death eaters he is even more powerful than Voldie.</p> <p>He could use the fact that Tom Riddle controls the dark mark to burn all the death eaters where they stand- Voldemort can’t kill Hariezer with magic because of the resonance, and a simple shield should stop bullets.</p> <p>He could partially transfigure a dead man’s switch of sorts, so that if he dies fireworks go up and the quidditch game gets disrupted- he knows it didn’t, so he knows he won’t die.</p> <p>In my own solution I posted earlier, he could talk his way out using logic-ratio-judo. 
Lots of options other than my original posted solution here- convince them that killing Hariezer makes the prophecy come true because of science reasons, etc.</p> <p>He could partially transfigure poison gas or knockout gas.</p> <p>Try to come up with your own outlandish solution. The thing is, all of these work because we don’t really know the rules of HPMOR magic- we have a rough outline, but there is still a lot of author flexibility. So there is that.</p> <h3 id="hpmor-114-solutions-to-the-exam">HPMOR 114: solutions to the exam</h3> <p>Hariezer transfigures part of his wand into carbon nanotubes that he makes into a net around each death eater’s head and around Voldemort’s hands.</p> <p>He uses it to decapitate all the death eaters (killing Malfoy’s father almost certainly) and take Voldemort’s hands off.</p> <p>Voldemort then charges at him but Hariezer hits him with a stunning spell.</p> <p>I note that this solution is as outlandish as any.</p> <p>It WAS foreshadowed in the very first chapter, but that doesn’t make it less outlandish.</p> <h3 id="hpmor-115-116-117">HPMOR 115-116-117</h3> <p>Throughout the story, whenever there is action we then have 5-10 chapters where everyone digests what happened. These chapters all fit in that vein.</p> <p>Hariezer rearranges the scene of his victory to put Voldemort’s hands around Hermione’s neck and then rigs a big explosion. He then time turners back to the quidditch game (why the hell did he have time left on his time turner?).</p> <p>Anyway, when he gets back to the game he does a whole “I CAN FEEL THE DARK LORD COMING” thing, and says that Hermione followed him back. I guess the reason for this plot is just to keep him out of it/explain how Hermione came back? You’d think he could have just hung around the scene of the crime, waited till he was discovered, and explained what happened?</p> <p>Then in chapter 117, McGonagall explains to the school what was found- Malfoy’s father is dead, Hermione is alive, etc.</p> <h3 id="so-after-the-big-hpmor-reveal">So after the big HPMOR reveal</h3> <p>It sort of feels like HPMOR is just the overly wordy gritty Potter reboot with some science stuff slapped onto the first 20 chapters or so.</p> <p>Like, Potter is still a horcrux, Voldemort still wanted to take over the wizard world and kill the muggles, etc.</p> <p>Even the anti-death themes have fallen flat because of “show, don’t tell”- Dumbledore was a “deathist” but he was on the side of not killing all the muggles, Voldemort actually defeated death but that position rides alongside the kill-everyone ethos, and Hariezer’s resolution to end death apparently was going to blow up the entire world. So the characters might argue the positions, but for reasons I don’t comprehend, actually following through is shown as a terrible idea.</p> <h3 id="i-read-the-rest-of-hpmor-113-puzzle">I read the rest of HPMOR/113 puzzle</h3> <p>I read HPMOR, and will put up chapter updates when I have time, but I wanted to put down my version of how Hariezer will get out of the predicament at the end of 113. I fear that if I put this down after the next chapter is released, and if it’s correct, people will say I looked ahead.</p> <p>Anyway, the way that this would normally be solved in HPMOR is simple: the time turner- Hariezer would resolve to go and find McGonagall or Bones or whoever when this was all over and tell them to time turner into this location and bring the heat. Or whatever. 
But that has been disallowed by the rules.</p> <p>But I think Yud is setting this up as a sort of “AI box” experiment, because he has his obsessions and they show up time and time again. So the solution is simply to convince Voldemort to let him go. How? In a version of Roko’s basilisk, he needs to convince Voldemort that they might be in a simulation- i.e. maybe they are still looking in the mirror. Hasn’t everything gone a little too well since they first looked in? Dumbledore was vanquished, bringing back Hermione was practically easy, every little thing has been going perfectly. Maybe the mirror is just simulating what he wants to see?</p> <p>So two ways to go from here- Hariezer is also looking in the mirror (and he has also gotten what he wanted, Hermione being brought back) so he might be able to test this just by wishing for something.</p> <p>Or, Hariezer can convince Voldemort that the only way to know for sure is for Voldemort to fail to get something he wants, and the last thing he wants is for Hariezer to die.</p> <h3 id="hpmor-118">HPMOR 118</h3> <p>The story is still in resolution mode, and I want to point out one thing that this story does right that the original books failed at- which is an actual resolution. The one big failure of the original Harry Potter books, in my mind, was that after Voldemort was defeated, we flashed immediately to the epilogue. No funeral for the departed (no chance to say goodbye to the departed Fred Weasley, etc.).</p> <p>Of course, in the HPMOR style, there is a huge resolution after literally every major event in the story, so it’s at least in part a stopped clock situation.</p> <p>This chapter is Quirrell’s funeral, which is mostly a student giving a long eulogy (remember, Hariezer dressed things up to make it look like Quirrell died fighting Voldemort, which is sort of true, but not the Quirrell anyone knew.)</p> <h3 id="hpmor-119">HPMOR 119</h3> <p>Still in resolution mode.</p> <p>Hariezer comes clean with (essentially) the Order of the Phoenix members, and tells them about how Dumbledore is trapped in the mirror. This leads to him receiving some letters Dumbledore left.</p> <p>We find out that Dumbledore has been acting to fulfill a certain prophecy that Hariezer plays a role in-</p> <blockquote> <p>Yet in your case, Harry, and in your case alone, the prophecies of your apocalypse have loopholes, though those loopholes be ever so slight. Always ‘he will end the world’, not ‘he will end life’.</p> </blockquote> <p>So I guess he’ll bring in the transhumanist future.</p> <p>Hariezer has also been given Dumbledore’s place, which Amelia Bones is annoyed at, so he makes her regent until he is old enough.</p> <p>We also get one last weird pointless rearrangement of the canon books- apparently Peter Pettigrew was one of those shape-shifting wizards and somehow got tricked into looking like Sirius Black. So the wrong person has been locked up in Azkaban. I don’t really “get” this whole Sirius/Peter were lovers/Sirius was evil/the plot of book 3 was a schizophrenic delusion running thread (also, Hariezer deduces, with only the evidence that there is a Black in prison and a dead Black among the death eaters, that Peter Pettigrew was a shapeshifter, that Peter imitated Black, and that Peter is the one in Azkaban.)</p> <p>And Hariezer puts a plan in place to open a hospital using the philosopher’s stone, so BAM, death is defeated, at least in the wizarding world. 
Unless it turns out the stone has limits or something.</p> <h3 id="hpmor-120">HPMOR 120</h3> <p>More resolution.</p> <p>Hariezer comes clean to Draco about how the death eaters died. (why did he go to the effort of the subterfuge, if he was going to come clean to everyone afterwards? It just added a weird layer to all this resolution).</p> <p>Draco is sad his parents are dead. BUT, surprise- as I predicted way back when, Dumbledore only faked Narcissa Malfoy’s death and put her in magic witness protection.</p> <h3 id="i-think-one-of-the-things-i-strongly-dislike-about-hpmor-is-that-there-doesn-t-seem-to-be-any-joy">I think one of the things I strongly dislike about HPMOR is that there doesn’t seem to be any joy...</h3> <p>I think one of the things I strongly dislike about HPMOR is that there doesn’t seem to be any joy purely in the discovery. People have fun playing the battle games, or fighting bullies with time turners, or generally being powerful, but no one seems to have fun just trying to figure things out.</p> <p>For some reason (the reason is that I have a fair amount of scotch in me actually), my brain keeps trying to put together an imprecise metaphor to old SNES rpgs- a friend of mine in grade school loved FF2, but he always went out of his way to find all the powerups and do all the side quests,etc. This meant he was always powerful enough to smash boss fights in one or two punches. And I always hated that- what is the fun in that? What is the challenge? When things got too easy, I started running from all the random encounters and stopped buying equipment so that the boss battles were more fun.</p> <p>And HPMOR feels like playing the game the first way- instead of working hard at the fun part (discovery), you get to just use Aristotle’s method (Harry Potter and Methods of Aristotelian Science) and slap an answer down. And that answer makes you more powerful- you can time turner all your problems away like shooing a misquito with a flamethrower, when a dementor shows up you get to destroy it just by thinking hard- no discovery required. The story repeatedly skips the fun part- the struggle, the learning, the discovery.</p> <h3 id="hpmor-121">HPMOR 121</h3> <p>Snape leaves hogwarts, thus completing an arc I don’t think I ever cared about.</p> <h3 id="hpmor-122-the-end-of-the-beginning">HPMOR 122: the end of the beginning</h3> <p>So unlike the canon books, the end of HPMOR sets it up more as an origin story than a finished adventure. After the canon books, we get the impression Harry, Ron and Hermione settled into peaceful wizard lives. After HPMOR, Hariezer has set up a magical think tank to solve the problem of friendly magic, with Hermione as his super-powered, indestructible lab assistant (tell me again how Hariezer isn't a self insert?) , and we get the impression the real work is just starting. He also has the idea to found CFAR:</p> <blockquote> <p>It would help if Muggles had classes for this sort of thing, but they didn't. 
Maybe Harry could recruit Daniel Kahneman, fake his death, rejuvenate him with the Stone, and put him in charge of inventing better training methods...</p> </blockquote> <p>We also learn that a more open society of idea sharing is an idea so destructive that Hariezer's vow not to end the world wouldn't let him do it:</p> <blockquote> <p>Harry could look back now on the Unbreakable Vow that he'd made, and guess that if not for that Vow, disaster might have already been set in motion yesterday when Harry had wanted to tear down the International Statute of Secrecy.</p> </blockquote> <p>So secretive magiscience led by Hariezer (with Hermione as super-powered &quot;Sparkling Unicorn Princess&quot; sidekick) will save the day, sometime in the future.</p> su3su2u1 physics tumblr archive su3su2u1/physics/ Tue, 01 Mar 2016 00:00:00 +0000 su3su2u1/physics/ <p>These are archived from the now defunct <a href="http://su3su2u1.tumblr.com/">su3su2u1 tumblr</a>.</p> <h3 id="a-roundabout-approach-to-quantum-mechanics">A Roundabout Approach to Quantum Mechanics</h3> <p>This will be the first post in what I hope will be a series that outlines some ideas from quantum mechanics. I will try to keep it light, and not overly math filled- which means I’m not really teaching you physics. I’m teaching you some flavor of the physics. I originally wrote here “you can’t expect to make ice cream just having tasted it,” but I think a better description might be “you can’t expect to make ice cream just having heard someone describe what it tastes like.” AND PLEASE, PLEASE PLEASE ask questions. I’m used to instant feedback on my (attempts at) teaching, so if readers aren’t getting anything out of this, I want to stop or change or something.</p> <p>Now, unfortunately I can’t start with quantum mechanics without talking about classical physics first. Most people think they know classical mechanics, having learned it on their mother’s knee, but there are so, so many ways to formulate classical physics, and most physics majors don’t see some really important ones (in particular Hamiltonian and Lagrangian mechanics) until after quantum mechanics. This is silly, but at the same time university is only 4 years. I can’t possibly teach you all of these huge topics, but I will need to rely on a few properties of particles and of light. And unlike intro Newtonian mechanics, I want to focus on paths. Instead of asking something like “a particle starts here with some velocity, where does it go?” I want to focus on “a particle starts here, and ends there. What path did it take?”</p> <p>So today we start with light, and a topic I rather love. Back in the day, before “nerd-sniping” several generations of mathematicians, Fermat was laying down a beautiful formulation of optics-</p> <p>Light always takes the path of least time</p> <p>I hear an objection “isn’t that just straight lines?” We have to combine this insight with the notion that light travels at different speeds in different materials. For instance, we know light slows down in water by a factor of about 1.3.</p> <p>So let’s look at a practical problem: you see a fish swimming in water (I apologize in advance for these diagrams):</p> <p><img src="images/su3su2u1/image_001.jpg"></p> <p>I drew the (hard to see) dotted straight line between your eye and the fish.</p> <p>But that isn’t what the light does- there is a path that saves the light some time. The light travels faster in air than in water, so it can travel further in the air, and take a shorter route in the water to the fish.</p> <p><img src="images/su3su2u1/image_002.jpg"></p> <p>This is a more realistic path for the light- it bends when it hits the water- it does this in order to take paths of least time between points in the water and points in the air. Exercise for the mathematical reader- you can work this out quantitatively and derive Snell’s law (the law of refraction) just from the principle of least time.</p>
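<p>If you’d rather see that numerically than do the calculus, here’s a rough sketch (the distances are completely made up, and the only physics in it is that light is about 1.3 times slower in water): just scan over where the ray crosses the surface, and keep the crossing point that minimizes the total travel time.</p>

```python
# Rough numerical version of the fish picture: an eye above the water, a fish
# below it, and a ray that crosses the surface at some point (x, 0).
# All distances are made up; water slows light by a factor of about 1.3.
import numpy as np

c_air = 1.0            # speed of light in air (arbitrary units)
c_water = c_air / 1.3  # light is ~1.3x slower in water

eye = (0.0, 1.0)       # eye 1 unit above the surface (the surface is y = 0)
fish = (2.0, -1.0)     # fish 2 units over and 1 unit below the surface

def travel_time(x):
    """Total travel time if the ray crosses the surface at (x, 0)."""
    d_air = np.hypot(x - eye[0], eye[1])      # eye -> crossing point
    d_water = np.hypot(fish[0] - x, fish[1])  # crossing point -> fish
    return d_air / c_air + d_water / c_water

# Scan a bunch of candidate crossing points and keep the fastest one.
xs = np.linspace(0.0, 2.0, 200_001)
best = xs[np.argmin(travel_time(xs))]

print(f"straight-line crossing: x = 1.000, time = {travel_time(1.0):.5f}")
print(f"least-time crossing:    x = {best:.3f}, time = {travel_time(best):.5f}")

# Sanity check against Snell's law: sin(angle in air) / sin(angle in water)
# should come out at about 1.3, the ratio of the speeds.
sin_air = (best - eye[0]) / np.hypot(best - eye[0], eye[1])
sin_water = (fish[0] - best) / np.hypot(fish[0] - best, fish[1])
print(f"sin(air) / sin(water) = {sin_air / sin_water:.3f}")
```

<p>The least-time crossing point is not the straight-line one- the ray bends at the surface, and the ratio of sines comes out at about 1.3, which is exactly Snell’s law.</p>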
<p>And one more realistic example: Lenses. How do they work?</p> <p><img src="images/su3su2u1/image_003.jpg"></p> <p>So that bit in the middle is the lens, and we are looking at light paths that leave 1 and travel to 2 (or vice versa, I guess).</p> <p>The lens is thickest in the middle, so the dotted line gets slowed the most. Path b is longer, but it spends less time in the lens- that means with careful attention to the shape of the lens we can make the time of path b equal to the time of the dotted path.</p> <p>Path a is the longest path, and just barely touches the lens, so it is barely slowed at all- it too can be made to take the same time as the dotted path (and path b).</p> <p>So if we design our lens carefully, all of the shortest-time paths that touch the lens end up focused back to one spot.</p> <p>So that’s the principle of least time for light. When I get around to posting on this again we’ll talk about particles.</p> <p>Now, these sorts of posts take some effort, so PLEASE PLEASE PLEASE tell me if you got something out of this.</p> <p>Edit: And if you didn’t get anything out of this, because it’s confusing, ask questions. Lots of questions, any questions you like.</p> <h3 id="more-classical-physics-of-paths">More classical physics of paths</h3> <p>So after some thought, these posts will probably be structured by first discussing light, and then turning to matter, topic by topic. It might not be the best structure, but it’s at least giving me something to organize my thoughts around.</p> <p>As in all physics posts, please ask questions. I don’t know my audience very well here, so any feedback is appreciated. Also, there is something of an uncertainty principle between clarity and accuracy. I can be really clear or really accurate, but never both. I’m hoping to walk the middle line here.</p> <p>Last time, I mentioned that geometric optics can be formulated by the simple principle that light takes the path of least time. This is a bit different from many of the physics theories you are used to- generally questions are phrased along the lines of “Alice throws a football from position x, with velocity v, where does Bob need to be to catch the football.” i.e. we start with an initial position and velocity. Path-based questions are usually “a particle starts at position x_i, t_i and ends at position x_f, t_f, what path did it take?”</p> <p>For classical non-relativistic mechanics, the path-based formulation is fairly simple: we construct a quantity called the “Lagrangian” which is defined by subtracting potential energy from kinetic energy (KE - PE). Recall that kinetic energy is 1/2 mv^2, where m is the mass of the particle and v is the velocity, and potential energy depends on the problem. If we add up the Lagrangian at every instant along a path we get a quantity called the action (S is the usual symbol for action, for some reason) and particles take the path of least action. 
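<p>In other words, if you chop the path into little time slices of length (\Delta t) and add up (KE - PE) for each slice, you’re computing something like</p> <p>[S \approx \sum_i \left( KE_i - PE_i \right) \Delta t ]</p>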
If you know calculus, we can put this as</p> <p>[S = \int \left( KE - PE \right) dt ]</p> <p>The action has units of energy*time, which will be important in a later post.</p> <p>Believe it or not, all of Newton’s laws are contained in this minimization principle. For instance, consider a particle moving with no outside influences (no potential energy). Such a particle has to minimize its v^2 over the path it takes.</p> <p>Any movement away from the straight line will cause an increase in the length of the path, so the particle will have to travel faster, on average, to arrive at its destination. We want to minimize v^2, so we can deduce right away the particle will take a straight line path.</p> <p>But what about its speed? Should a particle move very slowly to decrease v^2 as it travels, and then “step on the gas” near the end? Or travel at a constant speed? It’s easy to show that the minimum-action path is the constant-speed path (give it a try!). This gives us back Newton’s first law.</p> <p>You can also consider the case of a ball thrown straight up into the air. What path should it take? Now we have potential energy mgh (where h is the height, and g is a gravitational constant). But remember, we subtract the potential energy in the action- so the particle can lower its action by climbing higher.</p> <p>Along the path of least action in a gravitational field, the particle will move slowly at high h to spend more time where the Lagrangian is low, and will speed up as h decreases (it needs to have an average velocity large enough to get to its destination on time). If you know calculus of variations, you can calculate the required relationship, and you’ll find you get back exactly the Newtonian relationship (acceleration of the particle = g).</p> <p>Why bother with this formulation? It makes a lot of problems easier. Sometimes specifying all the forces is tricky (imagine a bead sliding on a metal hoop. The hoop constrains the bead to move along the circular hoop, so the forces are just whatever happens to be required to keep the bead from leaving the hoop. But the energy can be written very easily if we use the right coordinates). And with certain symmetries it’s a lot more elegant (a topic I’ll leave for another post).</p> <p>So to wrap up both posts- light takes the path of least time, particles take the path of least action. (One way to think about this is that light has a Lagrangian that is constant. This means that the only way to lower the action is to find the path that takes the least time). This is the takeaway point I need for later- in classical physics, particles take the path of least action.</p> <p>I feel like this is a lot more confusing than previous posts because it’s hard to calculate concrete examples. Please ask questions if you have them.</p>
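<p>Here is one concrete toy example you can play with numerically, if that helps (the numbers are made up and the discretization is crude- this is just a sketch of the idea): take the Newtonian parabola for a ball thrown straight up and caught at the same height, wiggle it while keeping the endpoints pinned, and watch what the action does.</p>

```python
# Rough numerical check that the Newtonian path minimizes the action
# S = (sum over time slices of KE - PE) * dt, for a ball thrown straight up.
import numpy as np

m, g, T = 1.0, 9.8, 2.0        # mass, gravity, total flight time (toy numbers)
t = np.linspace(0.0, T, 2001)  # time slices
dt = t[1] - t[0]

def action(h):
    """Add up (kinetic - potential) energy along the whole path h(t)."""
    v = np.gradient(h, dt)                    # velocity at each time slice
    lagrangian = 0.5 * m * v**2 - m * g * h   # KE - PE
    return np.sum(lagrangian) * dt

newton_path = 0.5 * g * t * (T - t)  # the usual parabola, with h(0) = h(T) = 0

# Wiggle the path by various amounts. The wiggle vanishes at the endpoints,
# so every candidate path still starts and ends at the same place and time.
wiggle = np.sin(np.pi * t / T)
for eps in (-1.0, -0.3, 0.0, 0.3, 1.0):
    print(f"wiggle size {eps:+.1f}:  S = {action(newton_path + eps * wiggle):.3f}")
```

<p>The smallest action shows up at zero wiggle, i.e. on the Newtonian parabola, which is the whole point.</p>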
<h3 id="semi-classical-light">Semi classical light</h3> <p>As always math will not render properly on tumblr dash, but will on the blog. This post contains the crux of this whole series of posts, so it’s really important to try to understand this argument.</p> <p>Recall from the first post I wrote that one particularly elegant formulation of geometric optics is Fermat’s principle:</p> <p>light takes the path of least time</p> <p>But, says a young experimentalist (pun very much intended!), look what happens when I shine light through two slits, I get a pattern like this:</p> <p><img src="images/su3su2u1/image_004.jpg"></p> <p>Light must be a wave.</p> <p>&quot;Wait, wait, wait!&quot; I can hear you saying. Why does this two slit thing mean that light is a wave?</p> <p>Let us talk about the key feature of waves- when waves come together they can combine in different ways:</p> <p>TODO: broken link.</p> <p>So when physicists want to represent waves, we need to take into account not just the height of the wave, but also the phase of the wave. The wave can be at “full hump” or “full trough” or anywhere in between.</p> <p>The technique we use is called “phasors” (not to be confused with phasers). We represent waves as little arrows, spinning around in a circle:</p> <p>TODO: broken link.</p> <p>The length of the arrow, A, is called the amplitude and represents the height of the wave. The angle, (\theta), represents the phase of the wave. (The mathematical sophisticates among us will recognize these as complex numbers of the form (Ae^{i\theta}).) With these arrows, we can capture all the add/subtract/partially-add features of waves:</p> <p>TODO: broken link.</p> <p>So how do we use this to explain the double slit experiment? First, we assume all the light that leaves the same source has the same amplitude. And the light has a characteristic period, T. It takes T seconds for the light to go from “full trough” back to “full trough” again.</p> <p>In our phasor diagram, this means we can represent the phase of our light after t seconds as:</p> <p>[\theta = \frac{2\pi t}{T} ]</p> <p>Note, we are taking the angle here in radians. 2 pi is a full circle. That way when t = T, we’ve gone a full circle.</p> <p>We also know that light travels at speed c (c being the “speed of light,” after all). So as light travels a path of length L, the time it traveled is easily calculated as (\frac{L}{c}).</p> <p>Now, let’s look at some possible paths:</p> <p><img src="images/su3su2u1/image_005.jpg"></p> <p>The light moves from the dot on the left, through the two slits, and arrives at the point X. Now, for the point X at the center of the screen, both paths will have equal lengths. This means the waves arrive with no difference in phase, and they add together. We expect a bright spot at the center of the screen (and we do get one).</p> <p>Now, let’s look at points further up the screen:</p> <p><img src="images/su3su2u1/image_006.jpg"></p> <p>As we move away from the center, the paths have different lengths, and we get a phase difference in the arriving light:</p> <p>[\theta_1 - \theta_2 = \frac{2\pi}{cT} \left(L_1 - L_2\right) ]</p> <p>So what happens? As we move up the wall, the length difference gets bigger and the phase difference increases. Every time the phase difference is an odd multiple of pi we get cancellation, and a dark spot. Every time it’s a multiple of 2 pi, we get a bright spot. This is exactly Young’s result.</p> <p>But wait a minute, I can hear a bright student piping up (we’ll call him Feynman, but it would be more appropriate to call him Huygens in this case). Feynman says “What if there were 3 slits?”</p> <p>Well, then we’d have to add up the phasors for 3 different slits. It’s more algebra, but when they all line up it’s a bright spot, when they all cancel it’s a dark spot, etc. We could even have places where two cancel out, and one doesn’t.</p> <p>&quot;But, what if I made a 4th hole?&quot; We add up four phasors. &quot;A 5th?&quot; We add up 5 phasors.</p> <p>&quot;What if I drilled infinite holes? Then the screen wouldn’t exist anymore! Shouldn’t we recover geometric optics then?&quot;</p> <p>Ah! Very clever! But we DO recover geometric optics. Think about what happens if we add up infinitely many paths. 
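<p>Before going on, here’s a crude numerical version of the argument that’s coming (toy numbers, nothing physical about them): add up a big pile of little arrows with completely random phases, and compare that to a pile of arrows whose phases come from sampling a smooth function near its minimum, where the phase barely changes from one arrow to the next.</p>

```python
# Crude numerical comparison: arrows with random phases vs arrows whose phases
# come from a smooth function near its minimum. Toy numbers only.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Case 1: phases scattered completely at random (like grabbing arbitrary paths).
random_phases = rng.uniform(0.0, 2.0 * np.pi, n)
random_sum = abs(np.sum(np.exp(1j * random_phases)))

# Case 2: phases from a function that flattens out near its minimum at s = 0,
# so many neighbouring arrows point in nearly the same direction.
s = np.linspace(-0.1, 0.1, n)   # a parameter labelling the paths
near_minimum_phases = 1000.0 * s**2
near_minimum_sum = abs(np.sum(np.exp(1j * near_minimum_phases)))

print(f"|sum| of {n} random-phase arrows: {random_sum:9.0f}")
print(f"|sum| of {n} near-minimum arrows: {near_minimum_sum:9.0f}")
print(f"(perfectly lined-up arrows would give {n})")
```

<p>The random-phase arrows mostly eat each other; the near-minimum arrows add up to something enormous by comparison.</p>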
We are essentially adding up infinitely many random phasors of the same amplitude:</p> <p><img src="images/su3su2u1/image_007.jpg"></p> <p>So we expect all these random paths to cancel out.</p> <p>But there is a huge exception.</p> <p>Those random angles are because when we grab an arbitrary path, the time light takes on that path is random.</p> <p>But what happens near a minimum? If we parameterize our random paths, near the minimum the graph of time-of-travel vs parameter looks like this:</p> <p><img src="images/su3su2u1/image_008.jpg"></p> <p>The graph gets flat near the minimum, so all those little Xs have roughly the same phase, which means all those phasors will add together. So the minimum path gets strongly reinforced, and all the other paths cancel out.</p> <p>So now we have one rule for light:</p> <p>To calculate how light moves forward in time, we add up the associated phasors for light traveling every possible path.</p> <p>BUT, when we have many, many paths we can make an approximation. With many, many paths the only one that doesn’t cancel out, the only one that matters, is the path of minimum time.</p> <h3 id="semi-classical-particles">Semi-classical particles</h3> <p>Recall from the previous post that we had improved our understanding of light. Light we suggested, was a wave which means</p> <p>Light takes all possible paths between two points, and the phase of the light depends on the time along the path light takes.</p> <p>Further, this means:</p> <p>In situations where there are many, many paths the contributions of almost all the paths cancel out. Only the path of least time contributes to the result.</p> <p>The astute reader can see where we are going. We already learned that classical particles take the path of least action, so we might guess at a new rule:</p> <p>Particles take all possible paths between two points, and the phase of the particle depends on the action along the path the particle takes.</p> <p>Recall from the previous post that the way we formalized this is that the phase of light could be calculated with the formula</p> <p>[\theta = \frac{2\pi}{T} t]</p> <p>We would like to make a similar formula for particles, but instead of time it must depend on the action, but will we do for the particle equivalent of the “period?” The simplest guess we might take is a constant. Lets call the constant h, planck’s constant (because thats what it is). It has to have the same units of action, which are energy*time.</p> <p>[\theta = \frac{2\pi}{h} * S]</p> <p>Its pretty common in physics to use a slightly different constant (\hbar = \frac{h}{2\pi} ) because it shows up so often.</p> <p>[\theta = \frac{S}{\hbar}]</p> <p>So we have this theory- maybe particles are really waves! We’ll just run a particle through a double slit and we’ll see a pattern just like the light!</p> <p>So we set up our double slit experiment, throw a particle at the screen, and blip. We pick up one point on the other side. Huh? I thought we’d get a wave. So we do the experiment over and over again, and this results</p> <p><img src="images/su3su2u1/image_009.jpg"></p> <p>So we do get the pattern we expected, but only built up over time. What do we make of this?</p> <p>Well, one things seems obvious- the outcome of a large number of experiments fits our prediction very well. So we can interpret the result of our rule as a probability instead of a traditional fully determined prediction. 
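</p> <p>(A quick toy illustration of that blip-by-blip buildup, again mine and not from the original post: the sketch below treats the two-slit intensity, the squared length of the summed phasors, as a probability distribution and draws individual hits from it. With a handful of hits you just see noise; with many, the fringes appear.)</p> <pre><code>import cmath, math, random

random.seed(0)
d, D, lam = 1e-5, 1.0, 5e-7                 # same made-up two-slit geometry as above

def weight(x):
    # two phasors, one per slit; squared length of their sum
    total = sum(cmath.exp(2j * math.pi * math.hypot(D, x - s) / lam)
                for s in (-d / 2, d / 2))
    return abs(total) ** 2

xs = [(i - 50) * 1e-3 for i in range(101)]  # detector positions across the screen
ws = [weight(x) for x in xs]

for n_hits in (20, 200, 20000):
    counts = [0] * len(xs)
    for i in random.choices(range(len(xs)), weights=ws, k=n_hits):
        counts[i] += 1
    bins = [sum(counts[5 * b:5 * b + 5]) for b in range(20)]   # coarse histogram
    top = max(bins)
    print(f'{n_hits:6d} hits  ' + ''.join(' .:-=#'[round(5 * c / top)] for c in bins))
</code></pre> <p>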
But probabilities have to be positive, so we’ll say the probability is proportional to the square of our amplitude.</p> <p>So lets rephrase our rule:</p> <p>To predict the probability that a particle will arrive at a point x at time t, we take a phasor for every possible path the particle can take, with a phase depending on the action along the path, and we add them all up. Squaring the amplitude gives us the probability.</p> <p>Now, believe it or not, this rule is exactly equivalent to the Schroedinger equation that some of us know and love, and pretty much everything you’ll find in an intro quantum book. Its just a different formulation. But you’ll note that I called it “semi-classical” in the title- thats because undergraduate quantum doesn’t really cover fully quantum systems, but thats a discussion for a later post.</p> <p>If you are familiar with Yudkowsky’s sequence on quantum mechanics or with an intro textbook, you might be used to thinking of quantum mechancis as blobs of amplitude in configuration space changing with time. In this formulation, our amplitudes are associated with paths through spacetime.</p> <p>When next I feel like writing again, we’ll talk a bit about how weird this path rule really is, and maybe some advantages to thinking in paths.</p> <h3 id="basic-special-relativity">Basic special relativity</h3> <p>LNo calculus or light required, special relativity using only algebra. Note- I’m basically typing up some lecture notes here, so this is mostly a sketch.</p> <p>This derivation is based on key principle that I believe Galileo first formulated-</p> <p>The laws of physics are the same in any inertial frame OR there is no way to detect absolute motion. Like all relativity derivations, this is going to involve a thought experiment. In our experiment we have a train that moves from one of train platform to the other. At the same time a toy airplane also flies from one of the platform to the other (originally, I had made a Planes,Trains and Automobiles joke here, but kids these days didn’t get the reference… ::sigh::)</p> <p>There are two events, event 1- everything starts at the left side of the platform. Event 2- everything arrives at the right side of the platform. The entire time the train is moving with a constant velocity v from the platform’s perspective (symmetry tells us this also means that the platform is moving with velocity v from the train’s perspective.)</p> <p>We’ll look at these two events from two different perspectives- the perspective of the platform and the perspective of the train. The goal is to figure out a set of equations that let us relate quantities between the different perspectives.</p> <p>HERE COMES A SHITTY DIAGRAM</p> <p><img src="images/su3su2u1/image_010.jpg"></p> <p>The dot is the toy plane, the box is the train. L is the length of the platform from its own perspective. l is the length of the train from it’s own perspective. T is the time it takes the train to cross the platform from the platform’s perspective. And t is the time the platform takes to cross the train from the train’s perspective.</p> <p>From the platform’s perspective, it’s easy to see the train has length l’ = L - vT. 
And the toy plane has speed w = L/T.</p> <p>From the train’s perspective, the platform has length L’ = l + vt and the toy plane has speed u = l/t.</p> <p>So, to summarize:</p> <p>Observer | Time passed between events | Length of train | Speed of plane</p> <p>Platform | T | l’ = L - vT | w = L/T</p> <p>Train | t | L’ = l + vt | u = l/t</p> <p>Now, we again exploit symmetry and our Galilean principle. By symmetry,</p> <p>l’/l = L’/L = R</p> <p>Now, by the Galilean principle, R can only depend on v. If it depended on anything else, we could detect absolute motion. We might be tempted to just assume R is 1, but that wouldn’t be very careful of us.</p> <p>So what we do is this- we want to write a formula for w in terms of u, v, and R (which depends only on v). This will tell us how to relate a velocity measured in the train’s frame to a velocity measured in the platform’s frame.</p> <p>I’ll skip the algebra, but you can use the relations above to work this out for yourself:</p> <p>w = (u+v)/(1+(1-R^2)u/v) = f(u,v)</p> <p>Here f is just a name for the function on the right.</p> <p>I WILL EDIT MORE IN, POSTING NOW SO I DON’T LOSE THIS TYPED UP STUFF.</p> <h3 id="more-special-relativity-and-paths">More Special Relativity and Paths</h3> <p>This won’t make much sense if you haven’t read my last post on relativity. Math won’t render on the tumblr dash; instead, go to the blog.</p> <p>Last time, we worked out formulas for length contraction (and I asked you to work out a formula for time dilation). But what would be more generally useful is a formula relating events between the different frames of reference. Our thought experiment had two events:</p> <p>Event 1- the back end of the train, the back end of the platform, and the toy are all at the same place.</p> <p>Event 2- the front of the train, the front end of the platform, and the toy are all at the same place.</p> <p>From the toy’s frame of reference, these events occur at the same place, so there is only a time difference between the two events. We’ll call that difference (\Delta\tau). We’ll always use this to mean “the time between events that occur at the same position” (only in one frame will the events occur in the same place), and it’s called proper time.</p> <p>Now, the toy sees the platform move with speed -w, and sees the length of the platform as RL. So the time between the events in the toy’s frame is just distance/speed:</p> <p>[\Delta\tau^2 = R^2L^2/w^2 = (1-\frac{w^2}{c^2})L^2/w^2 ]</p> <p>Now, we can manipulate the right hand side by noting that, from the platform’s perspective, L/w is the time between the two events, and the two events are separated by a distance L. We’ll call the time between the events in the platform’s frame of reference (\Delta t), and the distance between the events, L, we’ll call (\Delta x) in general.</p> <p>[\Delta\tau^2 = (1-\frac{w^2}{c^2})L^2/w^2 = \Delta t^2 - \Delta x^2/c^2 ]</p> <p>Note that the speed w has dropped out of the final version of the equation. This would be true for any frame: since proper time is unique (every frame has a different time measurement, but only one measures the proper time), we have a frame-independent measurement.</p> <p>Now, let’s relate this back to the idea of paths that I’ve discussed previously. One advantage of the path approach to mechanics is that if we can create an action that is invariant under special relativity, then the mechanics we get is also invariant. So one way we might do this is by building the action out of the proper time (remember, S is the action).
Note the negative sign- without it there is no minimum only a maximum.</p> <p>[ S \propto -\int d\tau = -C\int d\tau ]</p> <p>Now C has to have units of energy for the action to have the right units.</p> <p>Now, some sketchy physics math</p> <p>[ S = -C\int \sqrt(dt^2 - dx^2/c^2) = -C\int dt \sqrt(1-\frac{dx^2}{dt^2}/c^2) ]</p> <p>[S = -C\int dt \sqrt(1-v^2/c^2) ]</p> <p>So one last step is to note that the approximation we can make for \sqrt(1-v^2/c^2), if v is much smaller than c which is (1-1/2v^2/c^2)</p> <p>So all together, for small v</p> <p>[S = C\int dt (\frac{v^2}{2c^2} - 1)]</p> <p>So if we pick the constant C to be mc^2, then we get</p> <p>[S = \int dt (1/2 mv^2 - mc^2)]</p> <p>We recognize the first term as just the kinetic energy we had before! The second term is just a constant and so won’t effect where the minimum is. This gives us a new understanding of our path rule for particles- particles take the path of maximum proper time (it’s this understanding of mechanics that translates most easily to general relativity)</p> <h3 id="special-relativity-and-free-will">Special relativity and free will</h3> <p>Imagine right now, while you are debating whether or not to post something on tumblr, some aliens in the andromeda galaxy are sitting around a conference table discussing andromeda stuff.</p> <p>So what is the “space time distance” between you right now (deciding what to tumblrize) and those aliens?</p> <p>Well, the distance between andromeda and us is something like 2.5 million light years. So thats a “space time distance” tau (using our formula from last time) of 2.5 million years. So far, so good:</p> <p>Now, imagine an alien, running late to the andromeda meeting, is running in. He is running at maybe 1 meter/second. We know that for him lengths will contract and time will dilate. So for him, time on earth is actually later- using</p> <p>(\Delta \tau^2 = \Delta t^2 - \Delta x^2/ c^2)</p> <p>and using our formula for length contraction, we can calculate that according to our runner in andromeda the current time on Earth is about 9 days later then today.</p> <p>So simultaneous to the committee sitting around on andromeda, you are just now deciding what to tumblrize. According to the runner, it’s 9 days later and you’ve already posted whatever you are thinking about + dozens of other things.</p> <p>So how much free will do you really have about what you post? (This argument is originally due to Rietdijk and Putnam).</p> <h3 id="we-are-doing-taylor-series-in-calculus-and-it-s-really-boring-what-would-you-add-from-physics">We are doing Taylor series in calculus and it's really boring. What would you add from physics?</h3> <p>First, sorry I didn’t get to this for so long.</p> <p>Anyway, there is a phenomenon in physics where almost everything is modeled as a spring (simple harmonic motion is everywhere!). You can see this in discussion of resonances. Wave motion can be understood as springs coupled together,etc, and lots of system exhibit waves- when you speak the tiny air perurbations travel out like waves, same as throwing a pebble in a pond, or wiggling a jump rope. These are all very different systems, so why the hell do we see such similar behavior?</p> <p>Why would this be? Well, think of a system in equilibrium, and nudging it a tiny bit away from equilibrium. 
If the equilibrium is at some parameter a and the nudge is small (so (x - a = \epsilon)), we want to know what the energy (E(a+\epsilon)) looks like.</p> <p>Now, we can Taylor expand- but note that in equilibrium the energy is at a minimum, so the linear term in the Taylor expansion is 0:</p> <p>[E(a+\epsilon)= E(a) + \frac{1}{2}\frac{d^2E}{dx^2} \epsilon^2 + … ]</p> <p>Now, constants in a potential energy don’t matter, so the first term that matters is the one quadratic in the displacement- which is exactly a spring.</p> <p>So Taylor series -&gt; everything is a spring.</p> <h3 id="why-field-theory">Why field theory?</h3> <p>So far, we’ve learned in earlier posts in my quantum category that:</p> <ol> <li><p>Classical theories can be described in terms of paths, with rules where particles take the path of “least action.”</p></li> <li><p>We can turn a classical theory into a quantum one by having the particle take every path, with the phase from each path given by the action along the path (divided by hbar).</p></li> </ol> <p>We’ve also learned that we can make a classical theory comply with special relativity by picking a relativistic action (in particular, an action proportional to the “proper time”).</p> <p>So one obvious way to try to make a special relativistic quantum theory would be to start with a special relativistic action and do the same sum over paths we used before.</p> <p>You can do this- and it almost works! If you do the mathematical transition from our original, non-relativistic paths to standard, textbook quantum mechanics, you’d find that you get the Schroedinger equation (or, if you were more sophisticated, something called the Pauli equation that no one talks about, but which is basically the Schroedinger equation + the fact that electrons have spin).</p> <p>If you try to do it from a relativistic action, you get an equation called the Klein-Gordon equation (or, if you were more sophisticated, the Dirac equation). Unfortunately, this runs into trouble- there can be weird negative probabilities, and general weirdness in the solutions.</p> <p>So we have done something wrong- and the answer is that making the action invariant under special relativity isn’t enough.</p> <p>Let’s look at some paths:</p> <p><img src="images/su3su2u1/image_011.jpg"></p> <p>The dotted line in this picture represents the light cone- the boundary traced out by light leaving the starting point. All of the paths end at a point inside the light cone, but some of them wander outside it along the way. This leads to really strange situations; let’s look at one outside-the-light-cone path from two frames of reference:</p> <p><img src="images/su3su2u1/image_012.jpg"></p> <p>What we see is that a normal-looking path in the first frame (on the left) looks really strange in the second- because the order of events isn’t fixed for events outside the light cone, some frames of reference see the path as moving back in time.</p> <p>So immediately we see the problem. When we switched to the relativistic theory we weren’t including all the paths- to really include all the paths, we need to include paths that also (apparently) move back in time. This is very strange! Notice that if we run time forward, the X’ observer sees, at some points along the path, two particles (one moving back in time, one moving forward).</p> <p>Feynman’s genius was to demonstrate that we can think of these particles moving backward in time as anti-particles moving forward in time.
So the x’ observer can describe what they see as a particle/anti-particle pair rather than as anything moving backwards in time.</p> <p>So really our path set looks like this:</p> <p><img src="images/su3su2u1/image_013.jpg"></p> <p>Notice that not only do we have paths connecting the two points, we also have totally unrelated closed loops that start and end at the same point, unconnected to either endpoint- these are possible now!</p> <p>So to calculate a probability, we can’t just look at the paths connecting the points (x_0) and (x_1)! There can be weird loopy paths that never touch (x_0) or (x_1) that still matter! From Feynman’s perspective, particle/anti-particle pairs can form, travel for a while, and annihilate later.</p> <p>So as a bookkeeping device we introduce a field- at every point in space it has a value. To calculate the action of the field we can’t just look at paths- instead, we have to sum up the values of the field (and some of its derivatives) at every point in space.</p> <p>So our old action was a sum over time alone (S is the action, L is the Lagrangian):</p> <p>[S = \int dt\, L ]</p> <p>Our new action has to be a sum over space and time:</p> <p>[S = \int dt\, d^3x\, \mathcal{L} ]</p> <p>So now our Lagrangian (\mathcal{L}) is a Lagrangian density.</p> <p>And we can’t just restrict ourselves to paths- we have to add up every possible configuration of the field.</p> <p>So that’s why we need field theory to combine relativity with quantum mechanics. Next time, some implications.</p> <h3 id="field-theory-implications">Field theory implications</h3> <p>The first thing is that if we take the Feynman interpretation, our field theory doesn’t have a fixed particle number- depending on the weird loops in a configuration, it could have an almost arbitrary number of particles. So one way to phrase the problem with not including the backwards paths is that we need to allow the particle number to fluctuate.</p> <p>Also, I know some of you are thinking “what are these fields?” Well- that’s not so strange. Think of the electromagnetic fields. If you have no charges around, what are the solutions for the electromagnetic field? They are just light waves. Remember that certain special paths were the most important for the sum over all paths? Similarly, certain field configurations are the most important for the sum over configurations. Those are the solutions to the classical field theory.</p> <p>So if we start with EM field theory, with no charges, then the most important solutions are photons (the light waves). So we can outline levels of approximation:</p> <p>Sum over all configurations -&gt; (semi-classical) photons that travel all paths -&gt; (fully classical) particles that travel just the classical path.</p> <p>Similarly, with any particle:</p> <p>Sum over all configurations -&gt; (semi-classical) particles that travel all paths -&gt; (fully classical) particles that travel just the classical path.</p> <p>This is why most quantum mechanics classes really only cover wave mechanics and don’t ever get fully quantum mechanical.</p> <h3 id="planck-length-time">Planck length/time</h3> <p>Answering somervta's question: what is the significance of Planck units?</p> <p>Let’s start with an easier case where we have some intuition- the simple hydrogen atom (the go-to quantum mechanics problem). But instead of doing physics, let’s just do dimensional analysis- how big do we expect hydrogen energies to be?</p> <p>Actually, let’s start with something even simpler- what sort of distances do we expect a hydrogen atom to have? How big should its radius be?</p> <p>Well, first- what physics is involved?
I model the hydrogen atom as an electron moving in an electric field, and I expect I’ll need quantum mechanics, so I’ll need hbar (planck’s constant), e, the charge of the electron, coulomb’s constant (call it k), and the mass of the electron. Can I turn these into a length?</p> <p>Let’s give it a try- k*e^2 is an energy times a length. hbar is an energy * a time, so if we divide we can get hbar/(k*e^2) which has units of time/length. Multiply in by another hbar, and we get hbar^2/(k*e^2), which has units of mass * length. So divide by the mass of the electron, and we get a quantity hbar^2/(m*k*e^2).</p> <p>This has units of length, so we might guess that the important length scale for the hydrogen atom is our quantity (this has a value of about 53 picometers, which is about the right scale for atomic hyrdogen).</p> <p>We could also estimate the energy of the hydrogen atom by noting that</p> <p>Energy ~ k*e^2/r and use our scale for r.</p> <p>Energy ~ m*k^2*e^4/(hbar^2) ~27 eV.</p> <p>This is about twice as large as the actual ground state, but its definitely the right order of magnitude.</p> <p>Now what Planck noticed is that if you ask “what are the length scales of quantum gravity?” You end up with the constants G, c, and hbar. Turns out, you can make a length scale out of that (sqrt (hbar*G/c^3) ) So just like with hydrogen, we expect that gives us a characteristic length for where quantum effects might start to matter for gravity (or gravity effects might matter for quantum mechanics).</p> <p>The planck energy and planck mass, then, are similarly characteristic mass and energy scales.</p> <p>It’s sort of “how small do my lengths have to be before quantum gravity might matter?” But it’s just a guess, really. Planck energy is the energy you’d need to probe that sort of length scale (higher energies probe smaller lengths),etc.</p> <p>Does that answer your question?</p> <h3 id="more-physics-answers">More Physics Answers</h3> <p>Answering bgaesop's question:</p> <blockquote> <p>How is the whole dark matter/dark energy thing not just proof that the theory of universal gravitation is wrong?</p> </blockquote> <p>So let’s start with dark energy- the first thing to note is that dark energy isn’t really new, as an idea it goes back to Einstein’s cosmological constant. When the cosmological implications of general relativity were first being understood, Einstein hated that it looked like the universe couldn’t be stable. BUT then he noticed that his field equations weren’t totally general- he could add a term, a constant. When Hubble first noticed that the universe was expanding Einstein dropped the constant, but in the fully general equation it was always there. There has never been a good argument why it should be zero (though some theories (like super symmetry) were introduced in part to force the constant to 0, back when everyone thought it was 0).</p> <p>Dark energy really just means that constant has a non-zero value. Now, we don’t know why it should be non-zero. That’s a responsibility for a deeper theory- as far as GR goes it’s just some constant in the equation.</p> <p>As for dark matter, that’s more complicated. The original observations were that you couldn’t make galactic rotation curves work out correctly with just the observable matter. So some people said “maybe there is a new type of non-interacting matter” and other people said “let’s modify gravity! 
Changing the theory a bit could fix the curves, and the scale is so big you might not notice the modifications to the theory.”</p> <p>So we have two competing theories, and we need a good way to tell them apart. Some clever scientists got the idea to look at two galaxies that collided- the idea was the normal matter would smash together and get stuck at the center of the collision, but the dark matter would pass right through. So you would see two big blobs of dark matter moving away from each other (you can infer their presence from the way the heavy matter bends light, gravitational lensing), and a clump of visible matter in between. In the bullet cluster, we see exactly that.</p> <p>Now, you can still try to modify gravitation to match the results, but the theories you get start to look pretty bizarre, and I don’t think any modified theory has worked successfully (though the dark matter interpretation is pretty natural).</p> <h3 id="in-the-standard-model-what-are-the-fundamental-beables-things-that-exist-and-what-are-kinds-of-properties-do-they-have-that-is-not-how-much-mass-do-they-have-but-they-have-mass">In the standard model, what are the fundamental &quot;beables&quot; (things that exist) and what are kinds of properties do they have (that is, not &quot;how much mass do they have&quot; but &quot;they have mass&quot;)?</h3> <p>So this one is pretty tough, because I don’t think we know for sure exactly what the “beables” are (assuming you are using beable like Bell’s term).</p> <p>The issue is that field theory is formulated in terms of potentials- the fields that enter into the action are the electromagnetic potential, not the electromagnetic field. In classical electromagnetic theory, we might say the electromagnetic field is a beable (Bell’s example), but the potential is not.</p> <p>But in field theory we calculate everything in terms of potentials- and we consider certain states of the potential to be “photons.”</p> <p>At the electron level, we have a field configuration that is more general than the wavefunction - different configurations represent different combinations of wavefunctions (one configuration might represent a certain 3 particle wavefunction, another might represent a single particle wavefunction,etc).</p> <p>In Bohm type theories, the beables are the actual particle positions, and we could do something like that for field theory- assume the fields are just book keeping devices. This runs into problems though, because field configurations that don’t look much like particles are possible, and can have an impact on your theory. So you want to give some reality to the fields.</p> <p>Another issue is that the field configurations themselves aren’t unique- symmetries relate different field configurations so that very different configurations imply the same physical state.</p> <p>A lot of this goes back to the fact that we don’t have a realistic axiomatic field theory yet.</p> <p>But for concreteness sake, assume the fields are “real,” then you have fermion fields, which have a spin of 1/2, an electro-weak charge, a strong charge, and a coupling to the higgs field. 
These represent right or left handed electrons,muons,neutrinos,etc.</p> <p>You have gauge-fields (strong field, electro-weak field), these represent your force carrying boson (photons, W,Z bosons, gluons).</p> <p>And you have a Higgs field, which has a coupling to the electroweak field, and it has the property of being non-zero everywhere in space, and that constant value is called its vacuum expectation value.</p> <h3 id="what-s-the-straight-dope-on-dark-matter-candidates">What's the straight dope on dark matter candidates?</h3> <p>So, first off there are two types of potential dark matter. Hot dark matter, and cold dark matter. One obvious form of dark matter would be neutrinos- they only interact weakly and we know they exist! So this seems very obvious and promising until you work it out. Because neutrinos are so light (near massless), most of them will be traveling at very near the speed of light. This is “hot” dark matter and it doesn’t have the right properties.</p> <p>So what we really want is cold dark matter. I think astronomers have some ideas for normal baryonic dark matter (brown dwarfs or something). I don’t know as much about those.</p> <p>Particle physicists instead like to talk about what we call thermal relics. Way back in the early universe, when things were dense and hot, particles would be interconverting between various types (electron-positrons turning into quarks, turning into whatever). As the universe cooled, at some point the electro-weak force would split into the weak and electric force, and some of the weak particles would “freeze out.” We can calculate this and it turns out the density of hypothetical “weak force freeze out” particles would be really close to the density of dark matter. These are called thermal relics. So what we want are particles that interact via the weak force (so the thermal relics have the right density) and are heavier than neutrinos (so they aren’t too hot).</p> <h4 id="from-susy">From SUSY</h4> <p>It turns out it’s basically way too easy to create these sorts of models. There are lots of different super-symmetry models but all of them produce heavy “super partners” for every existing particle. So one thing you can do is assume super symmetry and then add one additional symmetry (they usually pick R-parity) the goal of the additional symmetry is to keep the lightest super partner from decaying. So usually the lightest partner is related to the weak force (generally its a partner to some combination of the Higgs, the Z bosons, and the photons. Since these all have the same quantum numbers they mix into different mass states). These are called neutralinos. Because they are superpartners to weakly interacting particles they will be weakly interacting, and they were forced to be stable by R parity. So BAM, dark matter candidate.</p> <p>Of course, we’ve never seen any super-partners,so…</p> <h4 id="from-guts">From GUTs</h4> <p>Other dark matter candidates can come from grand unified theories. The standard model is a bit strange- the Higgs field ties together two different particles to make the fermions (left handed electron + right handed electron, etc). The exception to this rule are neutrinos. Only left handed neutrinos exist, and their mass is Majorana.</p> <p>But some people have noticed that if you add a right handed neutrino, you can do some interesting things- the first is that with a right handed neutrino in every generation you can embed each generation very cleanly in SO(10). 
Without the extra neutrino, you can embed each generation in SU(5), but it’s a bit uglier. The SO(10) embedding has the added advantage that SO groups generally don’t have gauge anomalies.</p> <p>The other thing is that if this extra neutrino is heavy, you can explain why the other fermion masses are so light via a see-saw mechanism.</p> <p>Now, SO(10) predicts that this right handed neutrino doesn’t interact via the standard model forces, but because the gauge group is larger, we have a lot more forces/bosons from the broken GUT. These extra bosons almost always lead to trouble with proton decay, so you have to figure out some way to arrange things so that protons are stable, but you can still make enough sterile neutrinos in the early universe to account for dark matter. I think there is enough freedom to make this mostly work, although the newer LHC constraints probably make that a bit tougher.</p> <p>Obviously we’ve not seen any of the additional bosons of the GUT, or proton decay, etc.</p> <h4 id="from-axions">From Axions</h4> <p>(Note: the method of axion production is a bit different from that of the other thermal relics.)</p> <p>There is a genuine puzzle in the standard model’s QCD/SU(3) gauge theory. When the theory was first designed, physicists used the most general Lagrangian consistent with CP symmetry. But the weak force violates CP, so CP is clearly not a good symmetry. Why, then, don’t we need to include the CP violating term in QCD?</p> <p>So Peccei and Quinn were like “huh, maybe the term should be there, but look, we can add a new field that couples to the CP violating term, and then add some symmetries to force the field to near 0.” That would be fine, but the symmetry would have an associated goldstone boson, and we’d have spotted a massless particle.</p> <p>So you promote the global Peccei-Quinn symmetry to a gauge symmetry, and then the goldstone boson becomes massive, and you’ve saved the day. But you’ve got this leftover massive “axion” particle. So BAM, dark matter candidate.</p> <p>Like all the other dark matter candidates, this has problems. There are instanton solutions to QCD, and those would break the Peccei-Quinn symmetry. Try to fix it and you ruin the gauge symmetry (and so you’re back to a global symmetry and a massless, ruled-out axion). So it’s not an exact symmetry, and things get a little strained.</p> <p>So these are the large families I can think of offhand. You can combine the different ones (SUSY SU(5) GUT particles, etc).</p> <p>I realize this will be very hard to follow without much background, so if other people are interested, ask specific questions and I can try to clean up the specifics.</p> <p>Also, I have a gauge theory post for my quantum sequence that will be going up soon.</p> <h3 id="if-your-results-are-highly-counterintuitive">If your results are highly counterintuitive...</h3> <p>They are almost certainly wrong.</p> <p>Once, when I was a young, naive data scientist, I embarked on a project to look at individual claims handlers and how effective they were. How many claims did they manage to settle below the expected cost? How many claims were properly reserved? Basically, how well was risk managed?</p> <p>And I discovered something amazing! Several of the most junior people in the department were fantastic, nearly perfect on all metrics. Several of the most senior people had performance all over the map. They were significantly below average on most metrics! Most of the claims money was spent on these underperformers!
Big data had proven that the way a whole department of a company operated was lunacy!</p> <p>Not so fast. Anyone with any insurance experience (or half a brain, or less of an arrogant physics-is-the-best mentality) would have realized something right away- the kinds of claims handled by junior people are going to be different. Everything that a manager thought could be handled easily by someone fresh to the business went to the new guys. Simple cases, no headaches, assess the cost, pay the cost, done.</p> <p>Cases with lots of complications (maybe uncertain liability, weird accidents, etc.) went to the senior people. Of course their outcomes looked worse- more variance per claim makes the risk much harder to manage. I was the idiot here, misinterpreting my own results!</p> <p>A second example occurred at a health insurance company, where an employee I supervised thought he’d upended medicine when he discovered that a standard-of-care chemo regimen led to worse outcomes than a much less common/”lighter” alternative. Having learned from my first experience, I dug into the data with him, and we found out that the only cases where the less common alternative was used were cases where the cancer had been caught early and surgically removed while it was still localized.</p> <p>Since this experience, I’ve talked to startups looking to hire me, and startups looking for investment (and sometimes big-data companies looking to be hired by companies I work for), and I see this mistake over and over. “Look at this amazing counterintuitive big data result!”</p> <p>The latest was in a trade magazine, where some new company claimed that a strip-mall lawyer with 22 wins against some judge was necessarily better than a white-shoe law firm that won less often against the same judge. (Although in most companies I have worked for, if a case even got to trial, something had already gone wrong- everyone pushes for settlement. So judging by trial win record is silly for a second reason).</p> <h3 id="locality-fields-and-the-crown-jewel-of-modern-physics">Locality, fields and the crown jewel of modern physics</h3> <p>Apologies, this post is not finished. I will edit to replace the to-be-continued section soon.</p> <p>Last time, we talked about the need for a field theory, associating a mathematical field with every point in space. Today, we are going to talk about what our fields might look like. And we’ll find something surprising!</p> <p>I also want to emphasize locality, so to do that, let’s consider our spacetime as a lattice instead of the usual continuous space.</p> <p><img src="images/su3su2u1/image_014.jpg"></p> <p>So that is a lattice.
Now imagine that it’s 4 dimensional instead of 2 dimensional.</p> <p>Now, a field configuration involves putting one of our phasors at every point in space.</p> <p>So here is a field configuration:</p> <p><img src="images/su3su2u1/image_015.jpg"></p> <p>To make our action local (and thus consistent with special relativity) we insist that the action at one lattice point only depends on the field at that point, and on the fields of the neighboring points.</p> <p>We also need to make sure we keep the symmetry we know from earlier posts- we know that the amplitude of the phasor is what matters, and we have the symmetry to change the phase angle.</p> <p><img src="images/su3su2u1/image_016.jpg"></p> <p>Neighbors of the central point, indicated by dotted lines.</p> <p>We can compare neighboring points by subtracting (taking a derivative).</p> <p><img src="images/su3su2u1/image_017.jpg"></p> <p>Sorry that is blurry. . Middle phasor - left phasor = some other phasor.</p> <p>And the last thing we need to capture is the symmetry-remember that the angle of our phasor didn’t matter for predictions- the probabilities are all related to amplitudes (the length of the phasor). The simplest way to do this is to insist that we adjust the angle of all the phasors in the field, everywhere:</p> <p><img src="images/su3su2u1/image_018.jpg"></p> <p>Sorry for the shadow of my hand</p> <p>Anyway, this image shows a transformation of all the phasors. This works, but it seems weird- consider a configuration like this:</p> <p><img src="images/su3su2u1/image_019.jpg"></p> <p>This is two separate localized field configurations- we might interpret this as two particles. But should we really have to adjust the phase angle of the all the fields over by the right particle if we are doing experiments only on the left particle?</p> <p>Maybe what we really want is a local symmetry. A symmetry where we can rotate the phase angle of a phasor at any point individually (and all of them differently, if we like).</p> <p><em>To Be Continued</em></p> Sampling v. tracing perf-tracing/ Sun, 24 Jan 2016 00:00:00 +0000 perf-tracing/ <p>Perf is probably the most widely used general purpose performance debugging tool on Linux. There are multiple contenders for the #2 spot, and, like perf, they're sampling profilers. Sampling profilers are great. They tend to be easy-to-use and low-overhead compared to most alternatives. However, there are large classes of performance problems sampling profilers can't debug effectively, and those problems are becoming more important.</p> <p>For example, consider a Google search query. Below, we have a diagram of how a query is carried out. Each of the black boxes is a rack of machines and each line shows a remote procedure call (RPC) from one machine to another.</p> <p></p> <p><img src="images/intel-cat/search_query.png" width="640" height="352"></p> <p>The diagram shows a single search query coming in, which issues RPCs to over a hundred machines (shown in green), each of which delivers another set of requests to the next, lower level (shown in blue). Each request at that lower level also issues a set of RPCs, which aren't shown because there's too much going on to effectively visualize. At that last leaf level, the machines do 1ms-2ms of work, and respond with the result, which gets propagated and merged on the way back, until the search result is assembled. While that's happening, on any leaf machine, 20-100 other search queries will touch the same machine. 
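</p> <p>(To see why this fan-out makes the tail matter so much, here is a hedged back-of-envelope sketch; the 1% figure is an assumption for illustration, not a measured number. If each leaf independently has a 1-in-100 chance of responding slowly, a query that waits on all of its leaves is nearly guaranteed to hit at least one slow response once the fan-out reaches the hundreds.)</p> <pre><code># chance that a query fanning out to n leaves waits on at least one
# slower-than-99th-percentile leaf response (assuming independent leaves)
for n in (1, 10, 100, 1000, 2000):
    p_at_least_one_slow = 1 - 0.99 ** n
    print(f'{n:5d} leaves: {p_at_least_one_slow:7.2%} of queries see a slow leaf')
</code></pre> <p>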
A single query might touch a couple thousand machines to get its results. If we look at the latency distribution for RPCs, we'd expect that with that many RPCs, any particular query will see a 99%-ile worst case (tail) latency; <a href="http://research.google.com/pubs/pub40801.html">and much worse than mere 99%-ile, actually</a>.</p> <p>That latency translates directly into money. It's now well established that adding user latency reduces ad clicks, reduces the odds that a user will complete a transaction and buy something, reduces the odds that a user will come back later and become a repeat customer, etc. Over the past ten to fifteen years, the understanding that tail latency is an important factor in determining user latency, and that user latency translates directly to money, has trickled out from large companies like Google into the general consciousness. But debugging tools haven't kept up.</p> <p>Sampling profilers, the most common performance debugging tool, are notoriously bad at debugging problems caused by tail latency because they aggregate events into averages. But tail latency is, by definition, not average.</p> <p>For more on this, let's look at <a href="https://www.youtube.com/watch?v=QBu2Ae8-8LM">this wide ranging Dick Sites talk</a><sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">1</a></sup> which covers, among other things, <a href="http://www.pdl.cmu.edu/SDI/2015/slides/DatacenterComputers.pdf">the performance tracing framework that Dick and others have created at Google</a>. By capturing “every” event that happens, it lets us easily debug performance oddities that would otherwise be difficult to track down. We'll take a look at three different bugs to get an idea about the kinds of problems Google's tracing framework is useful for.</p> <p>First, we can look at another view of the search query we just saw above: given a top-level query that issues some number of RPCs, how long does it take to get responses?</p> <p><img src="images/perf-tracing/1_93.png" width="596" height="370"></p> <p>Time goes from left to right. Each row is one RPC, with the blue bar showing when the RPC was issued and when it finished. We can see that the first RPC is issued and returns before 93 other RPCs go out. When the last of those 93 RPCs is done, the search result is returned. We can see that two of the RPCs take substantially longer than the rest; the slowest RPC gates the result of the search query.</p> <p>To debug this problem, we want a couple things. Because the vast majority of RPCs in a slow query are normal, and only a couple are slow, we need something that does more than just show aggregates, like a sampling profiler would. We need something that will show us specifically what's going on in the slow RPCs. Furthermore, because weird performance events may be hard to reproduce, we want something that's cheap enough that we can run it all the time, allowing us to look at any particular case of bad performance in retrospect. In the talk, Dick Sites mentions having a budget of about 1% of CPU for the tracing framework they have.</p> <p>In addition, we want a tool that has time-granularity that's much shorter than the granularity of the thing we're debugging. Sampling profilers typically run at something like 1 kHz (1 ms between samples), which gives little insight into what happens in a one-time event, like an slow RPC that still executes in under 1ms. 
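</p> <p>(As a back-of-envelope sketch of the resolution problem, with illustrative numbers of my own: the expected number of samples that land inside a short, one-off event is just the sampling rate times the event duration, and at 1 kHz that expectation is below one sample for anything shorter than a millisecond.)</p> <pre><code># expected number of profiler samples that land inside a one-off event
event_us = 500                            # a 500 microsecond slow section (made up)
for rate_hz in (1_000, 100_000, 10_000_000):
    expected = rate_hz * event_us / 1_000_000
    print(f'{rate_hz:11,} Hz sampling: ~{expected:g} samples inside the event')
</code></pre> <p>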
There are tools that will display what looks like a trace from the output of a sampling profiler, but the resolution is so poor that these tools provide no insight into most performance problems. While it's possible to crank up the sampling rate on something like perf, you can't get as much resolution as we need for the problems we're going to look at.</p> <p>Getting back to the framework, to debug something like this, we might want to look at a much more zoomed in view. Here's an example with not much going on (just tcpdump and some packet processing with <a href="http://linux.die.net/man/2/recvmsg">recvmsg</a>), just to illustrate what we can see when we zoom in.</p> <p><img src="images/perf-tracing/tcpdump.png" width="756" height="369"></p> <p>The horizontal axis is time, and each row shows what a CPU is executing. The different colors indicate that different things are running. The really tall slices are kernel mode execution, the thin black line is the idle process, and the medium height slices are user mode execution. We can see that CPU0 is mostly handling incoming network traffic in a user mode process, with 18 switches into kernel mode. CPU1 is maybe half idle, with a lot of jumps into kernel mode, doing interrupt processing for tcpdump. CPU2 is almost totally idle, except for a brief chunk when a timer interrupt fires.</p> <p>What's happening is that every time a packet comes in, an interrupt is triggered to notify tcpdump about the packet. The packet is then delivered to the process that called <code>recvmsg</code> on CPU0. Note that running tcpdump isn't cheap, and it actually consumes 7% of a server if you turn it on when the server is running at full load. This only dumps network traffic, and it's already at 7x the budget we have for tracing everything! If we were to look at this in detail, we'd see that Linux's TCP/IP stack has a large instruction footprint, and workloads like tcpdump will consistently come in and wipe that out of the l1i and l2 caches.</p> <p>Anyway, now that we've seen a simple example of what it looks like when we zoom in on a trace, let's look at how we can debug the slow RPC we were looking at before.</p> <p><img src="images/perf-tracing/CPU_RPC_view.png" width="832" height="534"></p> <p>We have two views of a trace of one machine here. At the top, there's one row per CPU, and at the bottom there's one row per RPC. Looking at the top set, we can see that there are some bits where individual CPUs are idle, but that the CPUs are mostly quite busy. Looking at the bottom set, we can see parts of 40 different searches, most of which take around 50us, with the exception of a few that take much longer, like the one pinned between the red arrows.</p> <p><img src="images/perf-tracing/LOCK_THREAD_view.png" width="816" height="552"></p> <p>We can also look at a trace of the same timeframe by which locks are behind held and which threads are executing. The arcs between the threads and the locks show when a particular thread is blocked, waiting on a particular lock. If we look at this, we can see that the time spent waiting for locks is sometimes much longer than the time spent actually executing anything. The thread pinned between the arrows is the same thread that's executing that slow RPC. It's a little hard to see what's going on here, so let's focus on that single slow RPC.</p> <p><img src="images/perf-tracing/CPU_RPC_zoom.png" width="856" height="538"></p> <p>We can see that this RPC spends very little time executing and a lot of time waiting. 
We can also see that we'd have a pretty hard time trying to find the cause of the waiting with traditional performance measurement tools. <a href="http://stackoverflow.com/questions/24021967/finding-performance-issue-that-may-be-due-to-thread-locking-possibly">According to stackoverflow, you should use a sampling profiler</a>! But tools like OProfile are useless since they'll only tell us what's going on when our RPC is actively executing. What we really care about is what our thread is blocked on and why.</p> <p>Instead of following the advice from stackoverflow, let's look at the second view of this again.</p> <p><img src="images/perf-tracing/LOCK_THREAD_iso.png" width="802" height="536"></p> <p>We can see that, not only is this RPC spending most of its time waiting for locks, it's actually spending most of its time waiting for the same lock, with only a short chunk of execution time between the waiting. With this, we can look at the cause of the long wait for a lock. Additionally, if we zoom in on the period between waiting for the two locks, we can see something curious.</p> <p><img src="images/perf-tracing/super_zoom.png" width="705" height="438"></p> <p>It takes 50us for the thread to start executing after it gets scheduled. Note that the wait time is substantially longer than the execution time. The waiting is because <a href="http://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/">an affinity policy was set which will cause the scheduler to try to schedule the thread back to the same core</a> so that any data that's in the core's cache will still be there, giving you the best possible cache locality, which means that the thread will have to wait until the previously scheduled thread finishes. That makes intuitive sense, but if you consider, for example, a <a href="http://users.atw.hu/instlatx64/GenuineIntel00506E3_Skylake_NewMemLat.txt">2.2GHz Skylake</a>, the latencies to the l2 and l3 caches are 6.4ns and 21.2ns, respectively. Is it worth changing the affinity policy to speed this kind of thing up? You can't tell from this single trace, but with the tracing framework used to generate this data, you could do the math to figure out if you should change the policy.</p> <p>In the talk, Dick notes that, given the actual working set size, it would be worth waiting up to 10us to schedule on another CPU sharing the same l2 cache, and 100us to schedule on another CPU sharing the same l3 cache<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">2</a></sup>.</p> <p>Something else you can observe from this trace is that, if you care about a workload that resembles Google search, basically every standard benchmark out there is bad, and the standard technique of running <a href="//danluu.com/intel-cat/#spec">N copies of spec</a> is terrible. That's not a straw man. People still do that in academic papers today, and some chip companies use SPEC to benchmark their mobile devices!</p> <p>Anyway, that was one performance issue where we were able to see what was going on because of the ability to see a number of different things at the same time (CPU scheduling, thread scheduling, and locks). Let's look at a simpler single-threaded example on a single machine where a tracing framework is still beneficial:</p> <p><img src="images/perf-tracing/gmail.png" width="906" height="515"></p> <p>This is a trace from gmail, circa 2004. Each row shows the processing that it takes to handle one email.
Well, except for the last 5 rows; the last email shown takes so long to process that displaying all of the processing takes 5 rows of space. If we look at each of the normal emails, they all look approximately the same in terms of what colors (i.e., what functions) are called and how much time they take. The last one is different. It starts the same as all the others, but then all this other junk appears that only happens in the slow email.</p> <p>The email itself isn't the problem -- all of that extra junk is the processing that's done to reindex the words from the emails that had just come in, which was batched up across multiple emails. This picture caused the Gmail devs to move that batch work to another thread, reducing tail latency from 1800ms to 100ms. This is another performance bug that it would be very difficult to track down with standard profiling tools. I've often wondered why email almost always appears quickly when I send to gmail from gmail, and it sometimes takes minutes when I send work email from outlook to outlook. My guess is that a major cause is that it's much harder for the outlook devs to track down tail latency bugs like this than it is for the gmail devs to do the same thing.</p> <p>Let's look at one last performance bug before moving on to discussing what kind of visibility we need to track these down. This is a bit of a spoiler, but with this bug, it's going to be critical to see what the entire machine is doing at any given time.</p> <p><a name="histogram"></a></p> <p><img src="images/perf-tracing/disk_tail.png" width="765" height="501"></p> <p>This is a histogram of disk latencies on storage machines for a 64kB read, in ms. There are two sets of peaks in this graph. The ones that make sense, on the left in blue, and the ones that don't, on the right in red.</p> <p>Going from left to right on the peaks that make sense, first there's the peak at 0ms for things that are cached in RAM. Next, there's a peak at 3ms. That's way too fast for the 7200rpm disks we have to transfer 64kB; the time to get a random point under the head is already <code>(1/(7200/60)) / 2 s = 4ms</code>. That must be the time it takes to transfer something from the disk's cache over PCIe. The next peak, at near 25ms, is the time it takes to seek to a point and then read 64kB off the disk.</p> <p>Those numbers don't look so bad, but the 99%-ile latency is a whopping 696ms, and there are peaks at 250ms, 500ms, 750ms, 1000ms, etc. And these are all unreproducible -- if you go back and read a slow block again, or even replay the same sequence of reads, the slow reads are (usually) fast. That's weird! What could possibly cause delays that long? In the talk, Dick Sites says “each of you think of a guess, and you'll find you're all wrong”.</p> <p><img src="images/perf-tracing/13disks.png" width="852" height="150"></p> <p>That's a trace of thirteen disks in a machine. The blue blocks are reads, and the red blocks are writes. The black lines show the time from the initiation of a transaction by the CPU until the transaction is completed. There are some black lines without blocks because some of the transactions hit in a cache and don't require actual disk activity. 
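</p> <p>(The rotational-latency arithmetic quoted above is worth writing out; here is a trivial sketch of it, using only the nominal 7200rpm figure.)</p> <pre><code># the "3 ms is too fast for a real disk read" arithmetic from above
rpm = 7200
ms_per_revolution = 60_000 / rpm           # 8.3 ms for one full revolution
avg_rotational_ms = ms_per_revolution / 2  # ~4.2 ms for a random sector to reach the head
print(f'{ms_per_revolution:.1f} ms per revolution, '
      f'~{avg_rotational_ms:.1f} ms average rotational delay alone')
</code></pre> <p>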
If we wait for a period where we can see tail latency and zoom in a bit, we'll see this:</p> <p><img src="images/perf-tracing/13disks_tail.png" width="902" height="319"></p> <p>We can see that there's a period where things are normal, and then some kind of phase transition into a period where there are 250ms gaps (4) between periods of disk activity (5) on the machine <em>for all disks</em>. This goes on for nine minutes. And then there's a phase transition and disk latencies go back to normal. That it's machine wide and not disk specific is a huge clue.</p> <p>Using that information, Dick pinged various folks about what could possibly cause periodic delays that are a multiple of 250ms on an entire machine, and found out that the cause was kernel throttling of the CPU for processes that went beyond their usage quota. To enforce the quota, the kernel puts all of the relevant threads to sleep until the next multiple of a quarter second. When the quarter-second hand of the clock rolls around, it wakes up all the threads, and if those threads are still using too much CPU, the threads get put back to sleep for another quarter second. The phase change out of this mode happens when, by happenstance, there aren't too many requests in a quarter second interval and the kernel stops throttling the threads.</p> <p>After finding the cause, an engineer found that this was happening on 25% of disk servers at Google, for an average of half an hour a day, with periods of high latency as long as 23 hours. This had been happening for three years<sup class="footnote-ref" id="fnref:Y"><a rel="footnote" href="#fn:Y">3</a></sup>. Dick Sites says that fixing this bug paid for his salary for a decade. This is another bug where traditional sampling profilers would have had a hard time. The key insight was that the slowdowns were correlated and machine wide, which isn't something you can see in a profile.</p> <p>One question you might have is, is this because of some flaw in existing profilers, or can profilers provide enough information that you don't need to use tracing tools to track down rare, long-tail, performance bugs? I've been talking to Xi Yang about this, who had an ISCA 2015 <a href="http://research.microsoft.com/pubs/244803/preprint.pdf">paper</a> and <a href="https://github.com/yangxi/papers/blob/master/SHIM-ISCA-slides.pptx">talk</a> describing some of his work. He and his collaborators have done a lot more since publishing the paper, but the paper still contains great information on how far a profiling tool can be pushed. As Xi explains in his talk, one of the fundamental limits of a sampling profiler is how often you can sample.</p> <p><img src="images/perf-tracing/perf_sample_1k_100k_10m.png" width="616" height="424"></p> <p>This is a graph of the number of the number of executed instructions per clock (IPC) over time in <a href="https://en.wikipedia.org/wiki/Lucene">Lucene</a>, which is the core of Elasticsearch.</p> <p>At 1kHz, which is the default sampling interval for perf, you basically can't see that anything changes over time at all. At 100kHz, which is as fast as perf runs, you can tell something is going on, but not what. The 10MHz graph is labeled SHIM because that's the name of the tool presented in the paper. 
<p>If we look at the IPC in different methods, we can see that we're losing a lot of information at the slower sampling rates:</p> <p><img src="images/perf-tracing/perf_top_10_methods.png" width="813" height="372"></p> <p>These are the top 10 hottest methods in Lucene, ranked by execution time; these 10 methods account for 74% of the total execution time. With perf, it's hard to tell which methods have low IPC, i.e., which methods are spending time stalled. But with SHIM, we can clearly see that there's one method that spends a lot of time waiting, #4.</p> <p>In retrospect, there's nothing surprising about these graphs. We know from the Nyquist theorem that, <a href="https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem">to observe a signal with some frequency, X, we have to sample with a rate at least 2X</a>. There are a lot of performance-relevant events that happen at frequencies higher than 1kHz (e.g., <a href="https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states">CPU p-state changes</a>), so we should expect that we're unable to directly observe a lot of things that affect performance with perf or other traditional sampling profilers. If we care about microbenchmarks, we can get around this by sampling the same thing over and over again, but for rare or one-off events, it may be hard or impossible to do that.</p> <p>This raises a few questions:</p> <ol> <li>Why does perf sample so infrequently?</li> <li>How does SHIM get around the limitations of perf?</li> <li>Why are sampling profilers dominant?</li> </ol> <h4 id="1-why-does-perf-sample-so-infrequently">1. Why does perf sample so infrequently?</h4> <p>This comment from <a href="https://github.com/torvalds/linux/blob/7b648018f628eee73450b71dc68ebb3c3865465e/kernel/events/core.c#L274">events/core.c in the Linux kernel</a> explains the limit:</p> <blockquote> <p>perf samples are done in some very critical code paths (<a href="https://en.wikipedia.org/wiki/Non-maskable_interrupt">NMIs</a>). If they get too much CPU time, the system can lock up and not get any real work done.</p> </blockquote> <p>As we saw from the tcpdump trace in the Dick Sites talk, interrupts take a significant amount of time to get processed, which limits the rate at which you can sample with an interrupt-based sampling mechanism.</p> <h4 id="2-how-does-shim-get-around-the-limitations-of-perf">2. How does SHIM get around the limitations of perf?</h4> <p>Instead of having an interrupt come in periodically, like perf, SHIM instruments the runtime so that it periodically runs a code snippet that can squirrel away relevant information. In particular, the authors instrumented the Jikes RVM, which injects yield points into every method prologue, method epilogue, and loop back edge. At a high level, injecting a code snippet into every function prologue and epilogue sounds similar to what Dick Sites describes in his talk.</p>
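<p>To make that concrete, here's an illustrative sketch of the general idea; this isn't SHIM's actual implementation (SHIM does considerably more, and the names here are made up), but it shows how measurement code can run inline at yield points instead of in an interrupt handler:</p> <pre><code>/* Sketch only: the runtime/compiler inserts a call like this at method
 * prologues, epilogues, and loop back edges. Assumes x86-64 and gcc/clang. */
#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;x86intrin.h&gt;                   /* __rdtsc() */

#define SAMPLE_CAP (1u &lt;&lt; 16)

struct sample { uint64_t tsc; uint32_t method_id; };

static __thread struct sample *samples;  /* per-thread buffer */
static __thread uint32_t n_samples;

void profiler_thread_init(void)
{
    samples = malloc(SAMPLE_CAP * sizeof *samples);
    n_samples = 0;
}

static inline void yieldpoint_probe(uint32_t method_id)
{
    /* A real system would also read hardware performance counters and
     * rate-limit how often it records; this just grabs a timestamp and
     * the id of the method we're in. */
    if (samples &amp;&amp; n_samples &lt; SAMPLE_CAP) {
        samples[n_samples].tsc = __rdtsc();
        samples[n_samples].method_id = method_id;
        n_samples++;
    }
}
</code></pre>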
<p>The details are different, and I recommend both watching the Dick Sites talk and reading the Yang et al. paper if you're interested in performance measurement, but the fundamental similarity is that both of them decided that it's <a href="//danluu.com/new-cpu-features/#context-switches-syscalls">too expensive to have another thread break in and sample periodically</a>, so they both ended up injecting some kind of tracing code into the normal execution stream.</p> <p>It's worth noting that sampling, at any frequency, is going to miss waiting on (for example) software locks. Dick Sites's recommendation for this is to timestamp based on wall clock (not CPU clock), and then try to find the underlying causes of unusually long waits.</p> <h4 id="3-why-are-sampling-profilers-dominant">3. Why are sampling profilers dominant?</h4> <p>We've seen that Google's tracing framework allows us to debug performance problems that we'd never be able to catch with traditional sampling profilers, while also collecting the data that sampling profilers collect. From the outside, SHIM looks like a high-frequency sampling profiler, but it achieves that by acting like a tracing tool. Even perf is getting support for low-overhead tracing. Intel added <a href="https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing">hardware support for certain types of tracing in Broadwell and Skylake</a>, along with kernel support in 4.1 (with user mode support for perf coming in 4.3). If you're wondering how much overhead these tools have, Andi Kleen claims that the Intel tracing support in Linux has about a 5% overhead, and Dick Sites mentions in the talk that they have a budget of about 1% overhead.</p> <p>It's clear that state-of-the-art profilers are going to look a lot like tracing tools in the future, but if we look at the state of things today, the easiest options are all classical profilers. You can fire up a profiler like perf and it will tell you approximately how much time various methods are taking. With other basic tooling, you can tell what's consuming memory. Between those two numbers, you can solve the majority of performance issues. Building out something like Google's performance tracing framework is non-trivial, and cobbling together existing publicly available tools to trace performance problems is a rough experience. You can see one example of this when <a href="https://blog.cloudflare.com/the-story-of-one-latency-spike/">Marek Majkowski debugged a tail latency issue using SystemTap</a>.</p> <p>In <a href="http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html">Brendan Gregg's page on Linux tracers</a>, he says “[perf_events] can do many things, but if I had to recommend you learn just one [tool], it would be CPU profiling”. Tracing tools are cumbersome enough that his top recommendation on his page about tracing tools is to learn a profiling tool!</p> <h3 id="now-what">Now what?</h3> <p>If you want to use a tracing tool like the one we looked at today, your options are:</p> <ol> <li>Get a job at <a href="//danluu.com/startup-tradeoffs/">Google</a></li> <li>Build it yourself</li> <li>Cobble together what you need out of existing tools</li> </ol> <h4 id="1-get-a-job-at-google-danluu-com-startup-tradeoffs">1. Get a job at <a href="//danluu.com/startup-tradeoffs/">Google</a></h4> <p>I hear <a href="http://steve-yegge.blogspot.com/2008/03/get-that-job-at-google.html">Steve Yegge has good advice on how to do this</a>. If you go this route, try to attend orientation in Mountain View. They have the best orientation.</p> <h4 id="2-build-it-yourself">2. 
Build it yourself</h4> <p>If you look at <a href="http://research.microsoft.com/pubs/244803/preprint.pdf">the SHIM paper</a>, there's a lot of cleverness built in to get really fine-grained information while minimizing overhead. I think their approach is really neat, but considering the current state of things, you can get a pretty substantial improvement without much cleverness. Fundamentally, all you really need is some way to inject your tracing code at the appropriate points, some number of bits for a timestamp, plus a handful of bits to store the event.</p> <p>Say you want to trace transitions between user mode and kernel mode. The transitions between waiting and running will tell you what the thread was waiting on (e.g., disk, timer, IPI, etc.). There are maybe 200k transitions per second per core on a busy node. 200k events with a 1% overhead is 50ns per event per core. A cache miss is well over 100 cycles, so our budget is less than one cache miss per event, meaning that each record must fit within a fraction of a cache line. If we have 20 bits of timestamp (<code>RDTSC &gt;&gt; 8 bits</code>, giving ~100ns resolution and 100ms range) and 12 bits of event, that's 4 bytes, or 16 events per cache line. Each core has to have its own buffer to avoid cache contention. To map <code>RDTSC</code> times back to wall clock times, calling <code>gettimeofday</code> along with <code>RDTSC</code> at least every 100ms is sufficient.</p> <p>Now, say the machine is serving 2000 QPS. That's 20 99%-ile tail events per second and 2 99.9% tail events per second. Since those events are, by definition, unusually long, Dick Sites recommends a window of 30s to 120s to catch those events. If we have 4 bytes per event * 200k events per second * 40 cores, that's about 32MB/s of data. Writing to disk while we're logging is hopeless, so you'll want to store the entire log in memory while tracing, which will be in the range of 1GB to 4GB. That's probably fine for a typical machine in a datacenter, which will have between 128GB and 256GB of RAM. (There's a rough sketch of what such an event record might look like in C below.)</p> <p>My not-so-secret secret hope for this post is that someone will take this idea and implement it. That's already happened with at least one blog post idea I've thrown out there, and this seems at least as valuable.</p> <h4 id="3-cobble-together-what-you-need-out-of-existing-tools">3. Cobble together what you need out of existing tools</h4> <p>If you don't have a magical framework that solves all your problems, the tool you want is going to depend on the problem you're trying to solve.</p> <p>For figuring out why things are waiting, Brendan Gregg's write-up on <a href="http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html">off-CPU flame graphs is a pretty good start</a> if you don't have access to internal Google tools. For that matter, <a href="http://www.brendangregg.com/linuxperf.html">his entire site</a> is great if you're doing any kind of Linux performance analysis. There's info on Dtrace, ftrace, SystemTap, etc. Most tools you might use are covered, although <a href="https://github.com/jcsaezal/pmctrack">PMCTrack</a> is missing.</p> <p>The problem with all of these is that they're all much higher overhead than the things we've looked at today, so they can't be run in the background to catch and effectively replay any bug that comes along if you operate at scale. Yes, that includes dtrace, which I'm calling out in particular because any time you have one of these discussions, a dtrace troll will come along to say that dtrace has supported that for years. It's like the common lisp of trace tools, in terms of community trolling.</p>
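<p>To make the &quot;build it yourself&quot; option above a bit more concrete, here's a minimal sketch of what the 4-byte event record and per-core buffer might look like. The names are hypothetical and this assumes x86-64 with gcc or clang and a TSC in the low single-digit GHz range (so that <code>RDTSC &gt;&gt; 8</code> gives roughly 100ns resolution); it also ignores everything else a real implementation would need:</p> <pre><code>/* Sketch of the 4-byte trace record described above -- a toy, not a
 * production tracer. */
#include &lt;stdint.h&gt;
#include &lt;x86intrin.h&gt;                 /* __rdtsc() */

#define EVENTS_PER_CORE (1u &lt;&lt; 24)     /* 16M records * 4 bytes = 64MB per core */

struct core_trace {
    uint32_t buf[EVENTS_PER_CORE];     /* one of these per core, so cores don't
                                          fight over the same cache lines */
    uint32_t next;
} __attribute__((aligned(64)));

/* 20 bits of (RDTSC &gt;&gt; 8) gives ~100ns resolution and ~100ms of range on a
 * ~2.5GHz TSC; the remaining 12 bits hold an event code. */
static inline void trace_event(struct core_trace *t, uint32_t event_code)
{
    uint32_t ts = (uint32_t)(__rdtsc() &gt;&gt; 8) &amp; 0xFFFFF;   /* 20-bit timestamp */
    t-&gt;buf[t-&gt;next++ &amp; (EVENTS_PER_CORE - 1)] = (ts &lt;&lt; 12) | (event_code &amp; 0xFFF);
}
</code></pre> <p>With 40 cores, that's about 2.5GB of buffer space, which is in the range estimated above; a real version would also record a <code>gettimeofday</code>/<code>RDTSC</code> pair at least every 100ms, as described above, so the 20-bit timestamps can be mapped back to wall-clock time.</p>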
<p>Anyway, if you're on Windows, <a href="https://randomascii.wordpress.com/2015/04/14/uiforetw-windows-performance-made-easier/">Bruce Dawson's site</a> seems to be the closest analogue to Brendan Gregg's site. If that doesn't have enough detail, <a href="https://www.amazon.com/gp/product/0735684189/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;camp=1789&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0735684189&amp;linkId=3119782cd4e0ad47568c854ed17b219f">there's always the Windows Internals books</a>.</p> <p>This is a bit far afield, but for problems where you want an easy way to get CPU performance counters, <a href="https://github.com/RRZE-HPC/likwid">likwid</a> is nice. It has a much nicer interface than <code>perf stat</code>, lets you easily get stats for just selected functions, etc.</p> <p><small> Thanks to Nathan Kurz, Xi Yang, Leah Hanson, John Gossman, Dick Sites, Hari Angepat, and Dan Puttick for comments/corrections/discussion.</p> <p>P.S. Xi Yang, one of the authors of SHIM, is finishing up his PhD soon and is going to be looking for work. If you want to hire a performance wizard, <a href="https://yangxi.github.io/">he has a CV and resume here</a>. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:D">The talk is amazing and I recommend watching the talk instead of reading this post. I'm writing this up because I know if someone told me I should watch a talk instead of reading the summary, I wouldn't do it. Ok, fine. If you're like me, maybe you'd consider <a href="https://scholar.google.com/scholar?q=richard+l+sites">reading a couple of his papers</a> instead of reading this post. I once heard someone say that it's impossible to disagree with Dick's reasoning. You can disagree with his premises, but if you accept his premises and follow his argument, you have to agree with his conclusions. His presentation is impeccable and his logic is implacable. <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:O">This oversimplifies things a bit since, if some level of cache is bandwidth limited, spending bandwidth to move data between cores could slow down other operations more than this operation is sped up by not having to wait. But even that's oversimplified since it doesn't take into account the <a href="//danluu.com/datacenter-power/">extra power it takes</a> to move data from a higher level cache as opposed to accessing the local cache. But that's also oversimplified, as is everything in this post. Reality is really complicated, and the more detail we want the less effective sampling profilers are. <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> <li id="fn:Y"><p>This sounds like a long time, but if you ask around you'll hear other versions of this story at every company that creates systems complex beyond human understanding. I know of one chip project at Sun that was delayed for multiple years because they couldn't track down some persistent bugs. At Microsoft, they famously spent two years tracking down a scrolling smoothness bug on Vista. The bug was hard enough to reproduce that they set up screens in the hallways so that they could casually see when the bug struck their test boxes. 
One clue was that the bug only struck high-end boxes with video cards, not low-end boxes with integrated graphics, but that clue wasn't sufficient to find the bug.</p> <p>After quite a while, they called the Xbox team in to use their profiling expertise to set up a system that could capture the bug, and once they had the profiler set up it immediately became apparent what the cause was. This was back in the AGP days, when upstream bandwidth was something like 1/10th downstream bandwidth. When memory would fill up, textures would get ejected, and while doing so, the driver would lock the bus and prevent any other traffic from going through. That took long enough that the video card became unresponsive, resulting in janky scrolling.</p> <p>It's really common to hear stories of bugs that can take an unbounded amount of time to debug if the proper tools aren't available.</p> <a class="footnote-return" href="#fnref:Y"><sup>[return]</sup></a></li> </ol> </div> We saw some really bad Intel CPU bugs in 2015 and we should expect to see more in the future cpu-bugs/ Sun, 10 Jan 2016 00:00:00 +0000 cpu-bugs/ <p>2015 was a pretty good year for Intel. Their quarterly earnings reports exceeded expectations every quarter. They continue to be the only game in town for the serious server market, which continues to grow exponentially; from the earnings reports of the two largest cloud vendors, we can see that AWS and Azure grew by 80% and 100%, respectively. That growth has effectively offset the damage Intel has seen from the continued decline of the desktop market. For a while, it looked like <a href="//danluu.com/new-cpu-features/#update">cloud vendors might be able to avoid the Intel tax by moving their computation onto FPGAs</a>, but Intel bought one of the two serious FPGA vendors and, combined with their fab advantage, they look well positioned to dominate the high-end FPGA market the same way they've been dominating the high-end server CPU market. Also, their fine for anti-competitive practices turned out to be $1.45B, much less than the benefit they gained from their anti-competitive practices<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <p>Things haven't looked so great on the engineering/bugs side of things, though. We've seen a number of fairly serious CPU bugs and it looks like we should expect more in the future. I don't keep track of Intel bugs unless they're so serious that people I know are scrambling to get a patch in because of the potential impact, and I still heard about two severe bugs this year in the last quarter of the year alone. First, there was the <a href="http://xenbits.xen.org/xsa/advisory-156.html">bug found by Ben Serebrin and Jan Beulich</a>, which allowed a guest VM to fault in a way that would cause the CPU to hang in a microcode infinite loop, allowing any VM to DoS its host.</p> <p>Major cloud vendors were quite lucky that this bug was found by a Google engineer, and that Google decided to share its knowledge of the bug with its competitors before publicly disclosing. Black hats spend a lot of time trying to take down major services. I'm actually really impressed by both the persistence and the cleverness of the people who spend their time attacking the companies I work for. 
If, buried deep in our infrastructure, we have a bit of code running at <a href="https://en.wikipedia.org/wiki/Deferred_Procedure_Call">DPC</a> that's vulnerable to slowdown because of some kind of hash collision, someone will find and exploit that, even if it takes a long and obscure sequence of events to make it happen. If this CPU microcode hang had been found by one of these black hats, there would have been major carnage for most cloud hosted services at the most inconvenient possible time<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">2</a></sup>.</p> <p>Shortly after the Serebrin/Beulich bug was found, a group of people found that running <a href="http://www.mersenne.org/download/">prime95</a>, a commonly used tool for benchmarking and burn-in, causes <a href="http://mersenneforum.org/showthread.php?t=20714">their entire system to lock up</a>. <a href="https://communities.intel.com/thread/96157?start=15&amp;tstart=0">Intel's response to this was</a>:</p> <blockquote> <p>Intel has identified an issue that potentially affects the 6th Gen Intel® Core™ family of products. This issue only occurs under certain complex workload conditions, like those that may be encountered when running applications like Prime95. In those cases, the processor may hang or cause unpredictable system behavior.</p> </blockquote> <p>which reveals almost nothing about what's actually going on. If you look at their errata list, you'll find that this is typical, except that they normally won't even name the application that was used to trigger the bug. For example, <a href="http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf">one of the current errata lists</a> has entries like</p> <ul> <li>Certain Combinations of AVX Instructions May Cause Unpredictable System Behavior</li> <li>AVX Gather Instruction That Should Result in #DF May Cause Unexpected System Behavior</li> <li>Processor May Experience a Spurious LLC-Related Machine Check During Periods of High Activity</li> <li>Page Fault May Report Incorrect Fault Information</li> </ul> <p>As we've seen, “unexpected system behavior” can mean that we're completely screwed. Machine checks aren't great either -- they cause Windows to blue screen and Linux to kernel panic. An incorrect address on a page fault is potentially even worse than a mere crash, and if you dig through the list you can find a lot of other scary-sounding bugs.</p> <p>And keep in mind that the Intel errata list has the following disclaimer:</p> <blockquote> <p>Errata remain in the specification update throughout the product's lifecycle, or until a particular stepping is no longer commercially available. Under these circumstances, errata removed from the specification update are archived and available upon request.</p> </blockquote> <p>Once they stop manufacturing a stepping (the hardware equivalent of a point release), they reserve the right to remove the errata and you won't be able to find out what errata your older stepping has unless you're important enough to Intel.</p> <p>Anyway, back to 2015. We've seen at least two serious bugs in Intel CPUs in the last quarter<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">3</a></sup>, and it's almost certain there are more bugs lurking. 
Back when I worked at a company that produced Intel-compatible CPUs, we did a fair amount of testing and characterization of Intel CPUs; as someone fresh out of school who'd previously assumed that CPUs basically worked, I was surprised by how many bugs we were able to find. Even though I never worked on the characterization and competitive analysis side of things, I still personally found multiple Intel CPU bugs just in the normal course of doing my job, poking around to verify things that seemed non-obvious to me. Turns out things that seem non-obvious to me are sometimes also non-obvious to Intel engineers. As more services move to the cloud and the impact of system hang and reset vulnerabilities increases, we'll see more black hats investing time in finding CPU bugs. We should expect to see a lot more of these when people realize that it's much easier than it seems to find these bugs. There was a time when a CPU family might only have one bug per year, with serious bugs happening once every few years, or even once a decade, but we've moved past that. In part, that's because &quot;unpredictable system behavior&quot; has moved from being an annoying class of bugs that forces you to restart your computation to an attack vector that lets anyone with an AWS account attack random cloud-hosted services, but it's mostly because CPUs <a href="//danluu.com/new-cpu-features/">have gotten more complex, making them more difficult to test</a> and <a href="//danluu.com/cpu-backdoors/">audit</a> effectively, while Intel appears to be cutting back on validation effort. Ironically, we have hardware virtualization that's supposed to help us with security, but the virtualization is so complicated<sup class="footnote-ref" id="fnref:V"><a rel="footnote" href="#fn:V">4</a></sup> that the hardware virtualization implementation is likely to expose &quot;unpredictable system behavior&quot; bugs that wouldn't otherwise have existed. This isn't to say it's hopeless -- it's possible, in principle, to design CPUs such that a hang bug on one core doesn't crash the entire system. It's just that it's a fair amount of work to do that at every level (cache directories, the uncore, etc., would have to be modified to operate when a core is hung, as well as OS schedulers). No one's done the work because it hasn't previously seemed important.</p> <p>You'll often hear software folks say that these things don't matter because they can (sometimes) be patched. But, many devices will never get patched, which means that hardware security bugs will leave some devices vulnerable for their entire lifetime. And even if you don't care about consumers, serious bugs are very bad for CPU vendors. At a company I worked for, we once had a bug escape validation and get found after we shipped. One OEM wouldn't talk to us for something like five years after that, and other OEMs that continued working with us had to re-qualify their parts with our microcode patch and they made sure to let us know how expensive that was. Intel has enough weight that OEMs can't just walk away from them after a bug, but they don't have unlimited political capital and every serious bug uses up political capital, even if it can be patched.</p> <p>This isn't to say that we should try to get to zero bugs. There's always going to be a tradeoff between development speed and bug rate and the optimal point probably isn't zero bugs. But we're now regularly seeing severe bugs with security implications, which changes the tradeoff a lot. 
With something like the <a href="https://en.wikipedia.org/wiki/Pentium_FDIV_bug">FDIV bug</a> you can argue that it's statistically unlikely that any particular user who doesn't run numerical analysis code will be impacted, but security bugs are different. Attackers don't run random code, so you can't just say that it's unlikely that some condition will occur.</p> <h3 id="update">Update</h3> <p>After writing this, a person claiming to be an ex-Intel employee said &quot;even with your privileged access, you have no idea&quot; and a pseudo-anonymous commenter on reddit <a href="https://www.reddit.com/r/programming/comments/41k3yx/we_saw_some_really_bad_intel_cpu_bugs_in_2015_and/cz2zwag">made this comment</a>:</p> <blockquote> <p>As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.</p> <p>Why?</p> <p>Let me set the scene: It's late in 2013. Intel is frantic about losing the mobile CPU wars to ARM. Meetings with all the validation groups. Head honcho in charge of Validation says something to the effect of: &quot;We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times... we can't live forever in the shadow of the early 90's FDIV bug, we need to move on. Our competition is moving much faster than we are&quot; - I'm paraphrasing. Many of the engineers in the room could remember the FDIV bug and the ensuing problems caused for Intel 20 years prior. Many of us were aghast that someone highly placed would suggest we needed to cut corners in validation - that wasn't explicitly said, of course, but that was the implicit message. That meeting there in late 2013 signaled a sea change at Intel to many of us who were there. And it didn't seem like it was going to be a good kind of sea change. Some of us chose to get out while the getting was good. As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.</p> </blockquote> <p>I haven't been able to confirm this story from another source I personally know, although another anonymous commenter said &quot;I left INTC in mid 2013. From validation. This ... is accurate compared with my experience.&quot; Another anonymous person, someone I know, didn't hear that speech, but found that at around that time, &quot;velocity&quot; became a buzzword and management spent a lot of time talking about how Intel needs more &quot;velocity&quot; to compete with ARM, which appears to confirm the sentiment, if not the actual speech.</p> <p>I've also heard from formal methods people that, around the timeframe mentioned in the first comment, there was an exodus of formal verification folks. One story I've heard is that people left because they were worried about being made redundant. I'm told that, at the time, early retirement packages were being floated around and people strongly suspected layoffs. Another story I've heard is that things got really strange due to Intel's focus on the mobile battle with ARM, and people wanted to leave before things got even worse. 
But it's hard to say if this means anything, since Intel has been losing a lot of people to Apple because Apple offers better compensation packages and the promise of being less dysfunctional.</p> <p>I also got anonymous stories about bugs. One person who works in HPC told me that when they were shopping for Haswell parts, a little bird told them that they'd see drastically reduced performance on variants with greater than 12 cores. When they tried building out both 12-core and 16-core systems, they found that they got noticeably better performance on their 12-core systems across a wide variety of workloads. That's not better per-core performance -- that's better absolute performance. Adding 4 more cores reduced the performance on parallel workloads! That was true both in single-socket and two-socket benchmarks.</p> <p>There's also <a href="http://www.tomshardware.com/forum/id-2830772/skylake-build-randomly-freezing-crashing.html">a mysterious hang during idle/low-activity bug that Intel doesn't seem to have figured out yet</a>.</p> <p>And then there's <a href="https://bugzilla.kernel.org/show_bug.cgi?id=103351">this Broadwell bug that hangs Linux if you don't disable low-power states</a>.</p> <p>And of course Intel isn't the only company with bugs -- this <a href="https://lists.debian.org/debian-security/2016/03/msg00084.html">AMD bug found by Robert Swiecki</a> not only allows a VM to crash its host, it also allows a VM to take over the host.</p> <p>I doubt I've even heard of all the recent bugs and stories about verification/validation. Feel free to send other reports my way.</p> <h3 id="more-updates">More updates</h3> <p>A number of folks have noticed <a href="https://forum.synology.com/enu/viewtopic.php?f=7&amp;t=119727&amp;start=60">unusual failure rates in storage devices</a> and switches. This <a href="https://www.servethehome.com/intel-atom-c2000-series-bug-quiet/">appears to be related to an Intel Atom bug</a>. I find this interesting because the Atom is a relatively simple chip, and therefore a relatively simple chip to verify. When the first-gen Atom was released, folks at Intel seemed proud of how few internal spins of the chip were needed to ship a working production chip, something made possible by the simplicity of the chip. Modern Atoms are more complicated, but not <em>that</em> much more complicated.</p> <p>Intel Skylake and Kaby Lake have <a href="https://lists.debian.org/debian-devel/2017/06/msg00308.html">a hyperthreading bug that's so serious that Debian recommends that users disable hyperthreading to avoid the bug</a>, which can &quot;cause spurious errors, such as application and system misbehavior, data corruption, and data loss&quot;.</p> <p>On the AMD side, there might be <a href="https://community.amd.com/message/2796982">a bug that's as serious as any recent Intel CPU bug</a>. If you read that linked thread, you'll see an AMD representative asking people to disable SMT and OPCache Control and to change LLC settings to possibly mitigate or narrow down a serious crashing bug. 
On <a href="https://forums.gentoo.org/viewtopic-t-1061546-postdays-0-postorder-asc-start-150.html">another thread</a>, you can find someone reporting an #MC exception with &quot;u-op cache crc mismatch&quot;.</p> <p>Although AMD's response in the forum was that these were isolated issues, <a href="http://www.phoronix.com/scan.php?page=article&amp;item=ryzen-segv-continues">phoronix was able to reproduce crashes by running a stress test that consists of compiling a number of open source programs</a>. They report they were able to get 53 segfaults with one hour of attempted compilation.</p> <p>Some FreeBSD folks have also noticed seemingly unrelated crashes and have been able to get a reproduction by running code at a high address and then firing an interrupt. <a href="https://svnweb.freebsd.org/base?view=revision&amp;revision=321899">This can result in a hang or a crash</a>. The reason this appears to be unrelated to the first reported Ryzen issues is that this is easily reproducible with SMT disabled.</p> <p>Matt Dillon found an AMD bug triggered by DragonflyBSD, and <a href="http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20">committed a tiny patch to fix it</a>:</p> <blockquote> <p>There is a bug in Ryzen related to the kernel iretq'ing into a high user %rip address near the end of the user address space (top of user stack). This is a temporary workaround for the issue.</p> <p>The original %rip for sigtramp was 0x00007fffffffffe0. Moving it down to fa0 wasn't sufficient. Moving it down to f00 moved the bug from nearly instant to taking a few hours to reproduce. Moving it down to be0 it took a day to reproduce. Moving it down to 0x00007ffffffffba0 (this commit) survived the overnight test.</p> </blockquote> <h3 id="meltdown-spectre-https-meltdownattack-com-update"><a href="https://meltdownattack.com/">Meltdown / spectre</a> update</h3> <p>This is an interesting class of attack that takes advantage of speculative execution plus side channel attacks to leak privileged information into user processes. It seems that at least some of these attacks can be done from javascript in the browser.</p> <p>Regarding the comments in the first couple updates on Intel's attitude towards validation recently, <a href="https://news.ycombinator.com/item?id=16059635">another person claiming to be ex-Intel backs up the statements above</a>:</p> <blockquote> <p>As a former Intel employee this aligns closely with my experience. I didn't work in validation (actually joined as part of Altera) but velocity is an absolute buzzword and the senior management's approach to complex challenges is sheer panic. Slips in schedules are not tolerated at all - so problems in validation are an existential threat, your project can easily just be canned. Also, because of the size of the company the ways in which quality and completeness are 'acheived' is hugely bureaucratic and rarely reflect true engineering fundamentals.</p> </blockquote> <h3 id="2024-update">2024 update</h3> <p>We're approaching a decade since I wrote this post and the serious CPU bugs keep coming. For example, this recent one was found by <a href="https://www.radgametools.com/oodleintel.htm">RAD tools</a>:</p> <blockquote> <p>Intel Processor Instability Causing Oodle Decompression Failures</p> <p>We believe that this is a hardware problem which affects primarily Intel 13900K and 14900K processors, less likely 13700, 14700 and other related processors as well. 
Only a small fraction of those processors will exhibit this behavior. The problem seems to be caused by a combination of BIOS settings and the high clock rates and power usage of these processors, leading to system instability and unpredictable behavior under heavy load ... Any programs which heavily use the processor on many threads may cause crashes or unpredictable behavior. There have been crashes seen in RealBench, CineBench, Prime95, Handbrake, Visual Studio, and more. This problem can also show up as a GPU error message, such as spurious &quot;out of video memory&quot; errors, even though it is caused by the CPU.</p> </blockquote> <p>One can argue that this is a configuration bug, but from the standpoint of a typical user, all they observe is that their CPU is causing crashes. And, realistically, Intel knows that their CPUs are shipping into systems with these settings. The mitigation for this involves changing settings like &quot;SVID behavior&quot; → &quot;Intel fail safe&quot;, &quot;Long duration power limit&quot; → reduce to 125W if set higher (&quot;Processor Base Power&quot; on ARK), and &quot;Short duration power limit&quot; → reduce to 253W if set higher (for 13900/14900 CPUs; other CPUs have other limits; &quot;Maximum Turbo Power&quot; on ARK), etc.</p> <p>If they wanted their CPUs to not crash due to this issue, they could have and should have enforced these settings as well as some others. Instead, they left this up to the BIOS settings, and here we are.</p> <p>Historically, Intel was much more serious about verification, validation, and testing than AMD and we saw this in their output. At one point, when a lot of enthusiast sites were excited about AMD (in the K7 days), Google stopped using AMD and basically banned purchases of AMD CPUs because they were so buggy and had caused so many hard-to-debug problems. But, over time, the relative level of verification/validation/test effort Intel allocates has gone down and Intel seems to have nearly caught, or maybe caught, AMD in their rate of really serious bugs. Considering Intel's current market position, with very heavy pressure from AMD, ARM, and Nvidia, it seems unlikely that Intel will turn this around in the foreseeable future. Nvidia, historically, has been significantly buggier than AMD or Intel, so Intel still has quite a bit of room to run to become the most buggy major chip manufacturer. Considering that Nvidia is one of the biggest threats to Intel and how Intel responded to threats from other, then-buggier, manufacturers, it seems like we should expect an even higher rate of bad bugs in the coming decade.</p> <p>On the specific bug, there's tremendous pressure to operate more like a &quot;move fast and break things&quot; software company than a traditional, conservative, CPU manufacturer for multiple reasons. When you manufacture a CPU, how fast it will run ends up being somewhat random and there's no reliable way to tell how fast it will run other than testing it, so CPU companies run a set of tests on the CPU to see how fast it will go. This test time is actually fairly expensive, so there's a lot of work done to try to find the smallest set of tests possible that will correctly determine how fast the CPU can operate. 
One easy way to cut costs here is to just run fewer tests even if the smaller set of tests doesn't fully guarantee that the CPU can operate at the speed it's sold at.</p> <p>Another factor influencing this is that CPUs that are sold as nominally faster can sell for more, so there's also pressure to push the CPUs as close to their limits as possible. One way we can see that the margin here has, in general, decreased is by looking at how overclockable CPUs are. People are often happy with their overclocked CPU if they run a few tests, like prime95, stresstest, etc., and their part doesn't crash, but this isn't nearly enough to determine if the CPU can really run everything a user could throw at it. If you really try to seriously test a CPU (working at an Intel competitor, we would do this regularly), you'll find that Intel and other CPU companies have pushed the limit of how fast they claim their CPUs are relative to how fast they actually are, which sometimes results in CPUs being sold that have been pushed beyond their capabilities.</p> <p>On overclocking, as Fabian Giesen of RAD notes,</p> <blockquote> <p>This stuff is not sanctioned and will count as overclocking if you try to RMA it but it's sold as a major feature of the platform and review sites test with it on.</p> </blockquote> <p>Daniel Gibson replied with</p> <blockquote> <p>hmm on my mainboard (ASUS ROG Strix B550-A Gaming -clearly gaming hardware, but middle price range) I had to explicitly enable the XMP/EXPO profile for the DDR4-RAM to run at full speed - which is DDR4-3200, officially supported by the CPU (Ryzen 5950X). Otherwise it ran at DDR4-2400 speed, I think? Or was it 2133? I forgot, at least significantly lower</p> </blockquote> <p>To which Fabian noted</p> <blockquote> <p>Correct. Fun fact: turning on EXPO technically voids your warranty ... it's great; both the CPU and the RAM list it as supported but it's officially not.</p> <p>One might call it a racket, if one were inclined to such incisive language.</p> </blockquote> <p>Intel didn't use to officially unofficially support this kind of thing. And, more generally, CPU manufacturers were historically very hesitant to ship parts that had a non-negligible risk of crashes and data corruption when used as intended if they could avoid it, but more and more of these bugs keep happening. Some end up becoming quite public, like this, due to someone publishing a report about them like the RAD report above. And some get quietly reported to the CPU manufacturer by a huge company, often with some kind of NDA agreement, where the big company gets replacement CPUs and Intel or another manufacturer quietly ships firmware fixes to the issue. 
And it surely must be the case that some of these aren't really caught at all, unless you count the occasional data corruption or crash as being caught.</p> <h3 id="cpu-internals-series">CPU internals series</h3> <ul> <li><a href="//danluu.com/new-cpu-features/">New CPU features since the 80s</a></li> <li><a href="//danluu.com/branch-prediction/">A brief history of branch prediction</a></li> <li><a href="//danluu.com/integer-overflow/">The cost of branches and integer overflow checking in real code</a></li> <li><a href="cpu-bugs/">CPU bugs</a></li> <li><a href="//danluu.com/hardware-unforgiving/">Why CPU development is hard</a></li> <li><a href="//danluu.com/why-hardware-development-is-hard/">Verilog sucks, part 1</a></li> <li><a href="//danluu.com/pl-troll/">Verilog sucks, part 2</a></li> </ul> <p><small> Thanks to Leah Hanson, Jeff Ligouri, Derek Slager, Ralph Corderoy, Joe Wilder, Nate Martin, Hari Angepat, JonLuca De Caro, Jeff Fowler, and a number of anonymous tipsters for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">As with <a href="google-wage-fixing/">the Apple, Google, Adobe, etc., wage-fixing agreement</a>, legal systems are sending the clear message that businesses should engage in illegal and unethical behavior since they'll end up getting fined a small fraction of what they gain. This is the opposite of the Becker-ian policy that's applied to individuals, where sentences have gotten jacked up on the theory that, since many criminals aren't caught, the criminals that are caught should have severe punishments applied as a deterrence mechanism. The theory is that the criminals will rationally calculate the expected sentence from a crime, and weigh that against the expected value of a crime. If, for example, the odds of being caught are 1% and we increase the expected sentence from 6 months to 50 years, criminals will calculate that the expected sentence has changed from 2 days to 6 months, thereby reducing the effective value of the crime and causing a reduction in crime. We now have decades of evidence that the theory that long sentences will deter crime is either empirically false or that the effect is very small; turns out that people who impulse commit crimes don't deeply study sentencing guidelines before they commit crimes. Ironically, for white-collar corporate crimes where Becker's theory might more plausibly hold, Becker's theory isn't applied. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:C">Something I find curious is how non-linear the level of effort of the attacks is. Google, Microsoft, and Amazon face regular, persistent, attacks, and if they couldn't trivially mitigate <a href="http://status.linode.com/incidents/mmdbljlglnfd">the kind of unsophisticated attack that's been severely affecting Linode availability for weeks</a>, they wouldn't be able to stay in business. If you talk to people at various bay area unicorns, you'll find that a lot of them have accidentally DoS'd themselves when they hit an external API too hard during testing. In the time that it takes a sophisticated attacker to find a hole in Azure that will cause an hour of disruption across 1% of VMs, that same attacker could probably completely take down ten unicorns for a much longer period of time. And yet, these attackers are hyper focused on the most hardened targets. Why is that? 
<a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:A">The fault into microcode infinite loop also affects AMD processors, but basically no one runs a cloud on AMD chips. I'm pointing out Intel examples because Intel bugs have higher impact, not because Intel is buggier. Intel has a much better track record on bugs than AMD. IBM is the only major microprocessor company I know of that's been more serious about hardware verification than Intel, but if you have an IBM system running AIX, I could tell you some stories that will make your hair stand on end. Moreover, it's not clear how effective their verification groups can be since they've been losing experienced folks without being able to replace them for over a decade, but that's a topic for another post. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:V">See <a href="https://github.com/vishmohan/vmlaunch">this code</a> for a simple example of how to use Intel's API for this. The example is simplified, so much so that it's not really useful except as a learning aid, and it still turns out to be around 1000 lines of low-level code. <a class="footnote-return" href="#fnref:V"><sup>[return]</sup></a></li> </ol> </div> Normalization of deviance wat/ Tue, 29 Dec 2015 00:00:00 +0000 wat/ <p>Have you ever mentioned something that seems totally normal to you only to be greeted by surprise? Happens to me all the time when I describe something everyone at work thinks is normal. For some reason, my conversation partner's face morphs from pleasant smile to rictus of horror. Here are a few representative examples.</p> <p>There's the company that is perhaps the nicest place I've ever worked, combining the best parts of Valve and Netflix. The people are amazing and you're given near total freedom to do whatever you want. But as a side effect of the culture, they lose perhaps half of new hires in the first year, some voluntarily and some involuntarily. Totally normal, right? Here are a few more anecdotes that were considered totally normal by people in places I've worked. And often not just normal, but laudable.</p> <p>There's the company that's incredibly secretive about infrastructure. For example, there's the team that was afraid that, if they reported bugs to their hardware vendor, the bugs would get fixed and their competitors would be able to use the fixes. Solution: request the firmware and fix bugs themselves! More recently, I know a group of folks outside the company who tried to reproduce the algorithm in the paper the company published earlier this year. The group found that they couldn't reproduce the result, and that the algorithm in the paper resulted in an unusual level of instability; when asked about this, one of the authors responded “well, we have some tweaks that didn't make it into the paper” and declined to share the tweaks, i.e., the company purposely published an unreproducible result to avoid giving away the details, as is normal. This company enforces secrecy by having a strict policy of firing leakers. This is introduced at orientation with examples of people who got fired for leaking (e.g., the guy who leaked that a concert was going to happen inside a particular office), and by announcing firings for leaks at the company all hands. 
The result of those policies is that I know multiple people who are afraid to forward emails about things like updated info on health insurance to a spouse for fear of forwarding the wrong email and getting fired; instead, they use another computer to retype the email and pass it along, or take photos of the email on their phone.</p> <p>There's the office where I asked one day about the fact that I almost never saw two particular people in the same room together. I was told that they had a feud going back a decade, and that things had actually improved — for years, they literally couldn't be in the same room because one of the two would get too angry and do something regrettable, but things had now cooled to the point where the two could, occasionally, be found in the same wing of the office or even the same room. These weren't just random people, either. They were the two managers of the only two teams in the office.</p> <p>There's the company whose culture is so odd that, when I sat down to write a post about it, I found that I'd <a href="//danluu.com/startup-tradeoffs/#fn:N">not only written more than for any other single post, but more than all other posts combined</a> (which is well over 100k words now, the length of a moderately sized book). This is the same company where someone recently explained to me how great it is that, instead of using data to make decisions, we use political connections, and that the idea of making decisions based on data is a myth anyway; no one does that. This is also the company where all four of the things they told me to get me to join were false, and the job ended up being the one thing I specifically said I didn't want to do. When I joined this company, my team didn't use version control for months and it was a real fight to get everyone to use version control. Although I won that fight, I lost the fight to get people to run a build, let alone run tests, before checking in, so the build is broken multiple times per day. When I mentioned that I thought this was a problem for our productivity, I was told that it's fine because it affects everyone equally. Since the only thing that mattered was my stack-ranked productivity, I shouldn't care that it impacts the entire team; the fact that it's normal for everyone means that there's no cause for concern.</p> <p>There's the company that created multiple massive initiatives to recruit more women into engineering roles, <a href="//danluu.com/tech-discrimination/">where women still get rejected in recruiter screens for not being technical enough</a> after being asked questions like &quot;was your experience with algorithms or just coding?&quot;. I thought that my referral with a very strong recommendation would have prevented that, but it did not.</p> <p>There's the company where I worked on a four person effort with a multi-hundred million dollar budget and a billion dollar a year impact, where requests for things that cost hundreds of dollars routinely took months or were denied.</p> <p>You might wonder if I've just worked at places that are unusually screwed up. Sure, the companies are generally considered to be ok places to work and two of them are considered to be among the best places to work, but maybe I've just ended up at places that are overrated. 
But I have the same experience when I hear stories about how other companies work, even places with stellar engineering reputations, except that it's me that's shocked and my conversation partner who thinks their story is normal.</p> <p>There are the companies that use <a href="https://github.com/box/flaky">@flaky</a>, a group that includes the vast majority of Python-using SF Bay area unicorns. If you don't know what this is, this is a library that lets you add a Python decorator to those annoying flaky tests that sometimes pass and sometimes fail. When I asked multiple co-workers and former co-workers from three different companies what they thought this did, they all guessed that it re-runs the test multiple times and reports a failure if any of the runs fail. Close, but not quite. It's technically possible to use @flaky for that, but in practice it's used to re-run the test multiple times and report a pass if any of the runs pass. The company that created @flaky is effectively a storage infrastructure company, and the library is widely used at its biggest competitor.</p> <p>There's the company with a reputation for having great engineering practices that had 2 9s of reliability last time I checked, for reasons that are entirely predictable from their engineering practices. This is the second thing in a row that can't be deanonymized because multiple companies fit the description. Here, I'm not talking about companies trying to be the next reddit or twitter where it's, apparently, totally fine to have 1 9. I'm talking about companies that sell platforms that other companies rely on, where an outage will cause dependent companies to pause operations for the duration of the outage. Multiple companies that build infrastructure have practices that lead to 2 9s of reliability.</p> <p>As far as I can tell, what happens at a lot of these companies is that they started by concentrating almost totally on product growth. That's completely and totally reasonable, because companies are worth approximately zero when they're founded; they don't bother with things that protect them from losses, like good ops practices or actually having security, because there's nothing to lose (well, except for user data when the inevitable security breach happens, and if you talk to security folks at unicorns you'll know that these happen).</p> <p>The result is a culture where people are hyper-focused on growth and ignore risk. That culture tends to stick even after the company has grown to be worth well over a billion dollars, and the companies have something to lose. Anyone who comes into one of these companies from Google, Amazon, or another place with solid ops practices is shocked. Often, they try to fix things, and then leave when they can't make a dent.</p> <p>Google probably has the best ops and security practices of any tech company today. It's easy to say that you should take these things as seriously as Google does, but it's instructive to see how they got there. If you look at the codebase, you'll see that various services have names ending in z, as do a curiously large number of variables. I'm told that's because, once upon a time, someone wanted to add monitoring. It wouldn't really be secure to have <code>google.com/somename</code> expose monitoring data, so they added a z. <code>google.com/somenamez</code>. For security. At the company that is now the best in the world at security. 
They're now so good at security that multiple people I've talked to (all of whom joined after this happened) vehemently deny that this ever happened, even though the reasons they give don't really make sense (e.g., to avoid name collisions) and I have this from sources who were there at the time this happened.</p> <p>Google didn't go from adding z to the end of names to having the world's best security because someone gave a rousing speech or wrote a convincing essay. They did it after getting embarrassed a few times, which gave people who wanted to do things “right” the leverage to fix fundamental process issues. It's the same story at almost every company I know of that has good practices. Microsoft was a joke in the security world for years, until multiple disastrously bad exploits forced them to get serious about security. This makes it sound simple, but if you talk to people who were there at the time, the change was brutal. Despite a mandate from the top, there was vicious political pushback from people whose position was that the company got to where it was in 2003 without wasting time on practices like security. Why change what's worked?</p> <p>You can see this kind of thing in every industry. A classic example that tech folks often bring up is hand-washing by doctors and nurses. It's well known that germs exist, and that washing hands properly very strongly reduces the odds of transmitting germs and thereby significantly reduces hospital mortality rates. Despite that, trained doctors and nurses still often don't do it. Interventions are required. Signs reminding people to wash their hands save lives. But when people stand at hand-washing stations to require others walking by to wash their hands, even more lives are saved. People can ignore signs, but they can't ignore being forced to wash their hands.</p> <p>This mirrors a number of attempts at tech companies to introduce better practices. If you tell people they should do it, that helps a bit. If you enforce better practices via code review, that helps a lot.</p> <p>The data are clear that humans are really bad at taking the time to do things that are well understood to incontrovertibly reduce the risk of rare but catastrophic events. We will rationalize that taking shortcuts is the right, reasonable thing to do. There's a term for this: the normalization of deviance. It's well studied in a number of other contexts including healthcare, aviation, mechanical engineering, aerospace engineering, and civil engineering, but we don't see it discussed in the context of software. In fact, I've never seen the term used in the context of software.</p> <p>Is it possible to learn from others' mistakes instead of making every mistake ourselves? The state of the industry makes this sound unlikely, but let's give it a shot. <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2821100/">John Banja has a nice summary paper on the normalization of deviance in healthcare</a>, with lessons we can attempt to apply to software development. One thing to note is that, because Banja is concerned with patient outcomes, there's a close analogy to devops failure modes, but normalization of deviance also occurs in cultural contexts that are less directly analogous.</p> <p>The first section of the paper details a number of disasters, both in healthcare and elsewhere. 
Here's one typical example:</p> <blockquote> <p>A catastrophic negligence case that the author participated in as an expert witness involved an anesthesiologist's turning off a ventilator at the request of a surgeon who wanted to take an x-ray of the patient's abdomen (Banja, 2005, pp. 87-101). The ventilator was to be off for only a few seconds, but the anesthesiologist forgot to turn it back on, or thought he turned it back on but had not. The patient was without oxygen for a long enough time to cause her to experience global anoxia, which plunged her into a vegetative state. She never recovered, was disconnected from artificial ventilation 9 days later, and then died 2 days after that. It was later discovered that the anesthesia alarms and monitoring equipment in the operating room had been deliberately programmed to a “suspend indefinite” mode such that the anesthesiologist was not alerted to the ventilator problem. Tragically, the very instrumentality that was in place to prevent such a horror was disabled, possibly because the operating room staff found the constant beeping irritating and annoying.</p> </blockquote> <p>Turning off or ignoring notifications because there are too many of them and they're too annoying? An erroneous manual operation? This could be straight out of the post-mortem of more than a few companies I can think of, except that the result was a tragic death instead of the loss of millions of dollars. If you <a href="//danluu.com/postmortem-lessons/">read a lot of tech post-mortems</a>, every example in Banja's paper will feel familiar even though the details are different.</p> <p>The section concludes,</p> <blockquote> <p>What these disasters typically reveal is that the factors accounting for them usually had “<a href="http://www.jstor.org/stable/223506?seq=1#page_scan_tab_contents">long incubation periods, typified by rule violations, discrepant events that accumulated unnoticed, and cultural beliefs about hazards that together prevented interventions that might have staved off harmful outcomes</a>”. Furthermore, it is especially striking how multiple rule violations and lapses can coalesce so as to enable a disaster's occurrence.</p> </blockquote> <p>Once again, this could be from an article about technical failures. That makes the next section, on why these failures happen, seem worth checking out. The reasons given are:</p> <h4 id="the-rules-are-stupid-and-inefficient">The rules are stupid and inefficient</h4> <p>The example in the paper is about delivering medication to newborns. To prevent “drug diversion,” nurses were required to enter their password onto the computer to access the medication drawer, get the medication, and administer the correct amount. In order to ensure that the first nurse wasn't stealing drugs, if any drug remained, another nurse was supposed to observe the process, and then enter their password onto the computer to indicate they witnessed the drug being properly disposed of.</p> <p>That sounds familiar. How many technical postmortems start off with “someone skipped some steps because they're inefficient”, e.g., “the programmer force pushed a bad config or bad code because they were sure nothing could go wrong and skipped staging/testing”? The infamous <a href="https://azure.microsoft.com/en-us/blog/final-root-cause-analysis-and-improvement-areas-nov-18-azure-storage-service-interruption/">November 2014 Azure outage</a> happened for just that reason. 
At around the same time, a dev at one of Azure's competitors overrode the rule that you shouldn't push a config that fails tests because they knew that the config couldn't possibly be bad. When that caused the canary deploy to start failing, they overrode the rule that you can't deploy from canary into staging with a failure because they knew their config couldn't possibly be bad and so the failure must be from something else. That postmortem revealed that the config was technically correct, but exposed a bug in the underlying software; it was pure luck that the latent bug the config revealed wasn't as severe as the Azure bug.</p> <p>Humans are bad at reasoning about how failures cascade, so we implement bright line rules about when it's safe to deploy. But the same thing that makes it hard for us to reason about when it's safe to deploy makes the rules seem stupid and inefficient.</p> <h4 id="knowledge-is-imperfect-and-uneven">Knowledge is imperfect and uneven</h4> <p>People don't automatically know what should be normal, and when new people are onboarded, they can just as easily learn deviant processes that have become normalized as reasonable processes.</p> <p><a href="http://jvns.ca">Julia Evans</a> described to me how this happens:</p> <p><em>new person joins</em><br> <strong>new person</strong>: WTF WTF WTF WTF WTF<br> <strong>old hands</strong>: yeah we know we're concerned about it<br> <strong>new person</strong>: WTF WTF wTF wtf wtf w...<br> <em>new person gets used to it</em><br> <em>new person #2 joins</em><br> <strong>new person #2</strong>: WTF WTF WTF WTF<br> <strong>new person</strong>: yeah we know. we're concerned about it.<br></p> <p>The thing that's really insidious here is that people will really buy into the WTF idea, and they can spread it elsewhere for the duration of their career. Once, after doing some work <a href="//danluu.com/julialang/">on an open source project that's regularly broken and being told that it's normal to have a broken build, and that they were doing better than average</a>, I ran the numbers, found that project was basically worst in class, and wrote something about the idea that <a href="//danluu.com/broken-builds/">it's possible to have a build that nearly always passes with relatively low effort</a>. The most common comment I got in response was, &quot;Wow that guy must work with superstar programmers. But let's get real. We all break the build at least a few times a week&quot;, as if running tests (or for that matter, even attempting to compile) before checking code in requires superhuman abilities. But once people get convinced that some deviation is normal, they often get really invested in the idea.</p> <h4 id="i-m-breaking-the-rule-for-the-good-of-my-patient">I'm breaking the rule for the good of my patient</h4> <p>The example in the paper is of someone who breaks the rule that you should wear gloves when finding a vein. Their reasoning is that wearing gloves makes it harder to find a vein, which may result in their having to stick a baby with a needle multiple times. It's hard to argue against that. No one wants to cause a baby extra pain!</p> <p>The second worst outage I can think of occurred when someone noticed that a database service was experiencing slowness. They pushed a fix to the service, and in order to prevent the service degradation from spreading, they ignored the rule that you should do a proper, slow, staged deploy. Instead, they pushed the fix to all machines. It's hard to argue against that. 
No one wants their customers to have degraded service! Unfortunately, the fix exposed a bug that caused a global outage.</p> <h4 id="the-rules-don-t-apply-to-me-you-can-trust-me">The rules don't apply to me/You can trust me</h4> <blockquote> <p>most human beings perceive themselves as good and decent people, such that they can understand many of their rule violations as entirely rational and ethically acceptable responses to problematic situations. They understand themselves to be doing nothing wrong, and will be outraged and often fiercely defend themselves when confronted with evidence to the contrary.</p> </blockquote> <p>As companies grow up, they eventually have to impose security that prevents every employee from being able to access basically everything. And at most companies, when that happens, some people get really upset. “Don't you trust me? If you trust me, how come you're revoking my access to X, Y, and Z?”</p> <p>Facebook famously <a href="http://www.wsj.com/articles/SB10001424052702304898704577478483529665936">let all employees access everyone's profile</a> for a long time, and you can even find HN comments indicating that some recruiters would explicitly mention that as a perk of working for Facebook. And I can think of more than one well-regarded unicorn where everyone still has access to basically everything, even after their first or second bad security breach. It's hard to get the political capital to restrict people's access to what they believe they need, or are entitled, to know. A lot of trendy startups have core values like “trust” and “transparency” which make it difficult to argue against universal access.</p> <h4 id="workers-are-afraid-to-speak-up">Workers are afraid to speak up</h4> <p>There are people I simply don't give feedback to because I can't tell if they'd take it well or not, and once you say something, it's impossible to un-say it. In the paper, the author gives an example of a doctor with poor handwriting who gets mean when people ask him to clarify what he's written. As a result, people guess instead of asking.</p> <p>In most company cultures, people feel weird about giving feedback. Everyone has stories about a project that lingered on for months or years after it should have been terminated because no one was willing to offer explicit feedback. This is a problem even when cultures discourage meanness and encourage feedback: cultures of niceness seem to have as many issues around speaking up as cultures of meanness, if not more. In some places, people are afraid to speak up because they'll get attacked by someone mean. In others, they're afraid because they'll be branded as mean. It's a hard problem.</p> <h4 id="leadership-withholding-or-diluting-findings-on-problems">Leadership withholding or diluting findings on problems</h4> <p>In the paper, this is characterized by flaws and weaknesses being diluted as information flows up the chain of command. One example is how a supervisor might take sub-optimal actions to avoid looking bad to superiors.</p> <p>I was shocked the first time I saw this happen. I must have been half a year or a year out of school. I saw that we were doing something obviously non-optimal, and brought it up with the senior person in the group. He told me that he didn't disagree, but that if we did it my way and there was a failure, it would be really embarrassing. 
He acknowledged that my way reduced the chance of failure without making the technical consequences of failure worse, but it was more important that we not be embarrassed. Now that I've been working for a decade, I have a better understanding of how and why people play this game, but I still find it absurd.</p> <h4 id="solutions">Solutions</h4> <p>Let's say you notice that your company has a problem that I've heard people at most companies complain about: people get promoted for heroism and putting out fires, not for preventing fires; and people get promoted for shipping features, not for doing critical maintenance work and bug fixing. How do you change that?</p> <p>The simplest option is to just do the right thing yourself and ignore what's going on around you. That has some positive impact, but the scope of your impact is necessarily limited. Next, you can convince your team to do the right thing: I've done that a few times for practices I feel are really important and are sticky, so that I won't have to continue to expend effort on convincing people once things get moving.</p> <p>But if the incentives are aligned against you, it will require an ongoing and probably unsustainable effort to keep people doing the right thing. In that case, the problem becomes convincing someone to change the incentives, and then making sure the change works as designed. How to convince people is worth discussing, but long and messy enough that it's beyond the scope of this post. As for making the change work, I've seen many “obvious” mistakes repeated, both in places I've worked and those whose internal politics I know a lot about.</p> <p>Small companies have it easy. When I worked at a 100 person company, the hierarchy was individual contributor (IC) -&gt; team lead (TL) -&gt; CEO. That was it. The CEO had a very light touch, but if he wanted something to happen, it happened. Critically, he had a good idea of what everyone was up to and could basically adjust rewards in real-time. If you did something great for the company, there's a good chance you'd get a raise. Not in nine months when the next performance review cycle came up, but basically immediately. Not all small companies do that effectively, but with the right leadership, they can. That's impossible for large companies.</p> <p>At large company A (LCA), they had the problem we're discussing and a mandate came down to reward people better for doing critical but low-visibility grunt work. There were too many employees for the mandator to directly make all decisions about compensation and promotion, but the mandator could review survey data, spot check decisions, and provide feedback until things were normalized. My subjective perception is that the company never managed to achieve parity between boring maintenance work and shiny new projects, but got close enough that people who wanted to make sure things worked correctly didn't have to significantly damage their careers to do it.</p> <p>At large company B (LCB), ICs agreed that it's problematic to reward creating new features more richly than doing critical grunt work. When I talked to managers, they often agreed, too. But nevertheless, the people who get promoted are disproportionately those who ship shiny new things. I saw management attempt a number of cultural and process changes at LCB. Mostly, those took the form of pronouncements from people with fancy titles. 
For really important things, they might produce a video, and enforce compliance by making people take a multiple choice quiz after watching the video. The net effect I observed among other ICs was that people talked about how disconnected management was from the day-to-day life of ICs. But, for the same reasons that normalization of deviance occurs, that information seems to have no way to reach upper management.</p> <p>It's sort of funny that this ends up being a problem about incentives. As an industry, we spend a lot of time thinking about how to incentivize consumers to do what we want. But then we set up incentive systems that are generally agreed upon as incentivizing us to do the wrong things, and we do so via a combination of a game of telephone and cargo cult diffusion. Back when Microsoft was ascendant, we copied their interview process and asked brain-teaser interview questions. Now that Google is ascendant, we copy their interview process and ask algorithms questions. If you look around at trendy companies that are younger than Google, most of them basically copy their ranking/leveling system, with some minor tweaks. The good news is that, unlike many companies people previously copied, Google has put a lot of thought into most of their processes and made data driven decisions. The bad news is that Google is unique in a number of ways, which means that their reasoning often doesn't generalize, and that <a href="//danluu.com/why-ecc/">people often cargo cult practices long after they've become deprecated at Google</a>.</p> <p>This kind of diffusion happens for technical decisions, too. Stripe <a href="https://www.hakkalabs.co/articles/trouble-with-event-processing-try-the-method-that-stripe-has-mastered-with-mongodb">built a reliable message queue on top of Mongo</a>, so we build <a href="https://aphyr.com/posts/284-jepsen-mongodb">reliable</a> message <a href="http://stackoverflow.com/questions/9274777/mongodb-as-a-queue-service">queues</a> on top of <a href="https://aphyr.com/posts/322-jepsen-mongodb-stale-reads">Mongo</a><sup class="footnote-ref" id="fnref:J"><a rel="footnote" href="#fn:J">1</a></sup>. It's cargo cults all the way down<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">2</a></sup>.</p> <p>The paper has specific sub-sections on how to prevent normalization of deviance, which I recommend reading in full.</p> <ul> <li>Pay attention to weak signals</li> <li>Resist the urge to be unreasonably optimistic</li> <li>Teach employees how to conduct emotionally uncomfortable conversations</li> <li>System operators need to feel safe in speaking up</li> <li>Realize that oversight and monitoring are never-ending</li> </ul> <p>Let's look at how the first one of these, “pay attention to weak signals”, interacts with a single example, the “WTF WTF WTF” a new person gives off when they join the company.</p> <p>If a VP decides something is screwed up, people usually listen. It's a strong signal. And when people don't listen, the VP knows what levers to pull to make things happen. But when someone new comes in, they don't know what levers they can pull to make things happen or who they should talk to, almost by definition. They give out weak signals that are easily ignored. By the time they learn enough about the system to give out strong signals, they've acclimated.</p> <p>“Pay attention to weak signals” sure sounds like good advice, but how do we do it? Strong signals are few and far between, making them easy to pay attention to. Weak signals are abundant. 
How do we filter out the ones that aren't important? And how do we get an entire team or org to actually do it? These kinds of questions can't be answered in a generic way; this takes real thought. We mostly put this thought elsewhere. Startups spend a lot of time thinking about growth, and while they'll all tell you that they care a lot about engineering culture, revealed preference shows that they don't. With a few exceptions, big companies aren't much different. At LCB, I looked through the competitive analysis slide decks and they're amazing. They look at every last detail on hundreds of products to make sure that everything is as nice for users as possible, from onboarding to interop with competing products. If there's any single screen where things are more complex or confusing than any competitor's, people get upset and try to fix it. It's quite impressive. And then when LCB onboards employees in my org, a third of them are missing at least one of an alias/account, an office, or a computer, a condition which can persist for weeks or months. The competitive analysis slide decks talk about how important onboarding is because you only get one chance to make a first impression, and then employees are onboarded with the impression that the company couldn't care less about them and that it's normal for quotidian processes to be pervasively broken. LCB can't even get the basics of employee onboarding right, let alone really complex things like acculturation. This is understandable — external metrics like user growth or attrition are measurable, and targets like how to tell if you're acculturating people so that they don't ignore weak signals are softer and harder to determine, but that doesn't mean they're any less important. People write a lot about how things like using fancier languages or techniques like TDD or agile will make your teams more productive, but having a strong engineering culture is a much larger force multiplier.</p> <p><small> Thanks to Sophie Smithburg and Marc Brooker for introducing me to the term Normalization of Deviance, and Kelly Eskridge, Leah Hanson, Sophie Rapoport, Sophie Smithburg, Julia Evans, Dmitri Kalintsev, Ralph Corderoy, Jamie Brandon, Egor Neliuba, and Victor Felder for comments/corrections/discussion. </small></p> <p><link rel="prefetch" href="//danluu.com/startup-tradeoffs/"> <link rel="prefetch" href="//danluu.com/why-ecc/"> <link rel="prefetch" href="//danluu.com/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:J"><p>People seem to think I'm joking here. I can understand why, but try Googling <code>mongodb message queue</code>. You'll find statements like “replica sets in MongoDB work extremely well to allow automatic failover and redundancy”. Basically every company I know of that's done this and has anything resembling scale finds this to be non-optimal, to say the least, but you can't actually find blog posts or talks that discuss that. All you see are the posts and talks from when they first tried it and are in the honeymoon period. This is common with many technologies. You'll mostly find glowing recommendations in public even when, in private, people will tell you about all the problems. 
Today, if you do the search mentioned above, you'll get a ton of posts talking about how amazing it is to build a message queue on top of Mongo, this footnote, and maybe a couple of blog posts by Kyle Kingsbury depending on your exact search terms.</p> <p>If there were an acute failure, you might see a postmortem, but while we'll do postmortems for &quot;the site was down for 30 seconds&quot;, we rarely do postmortems for &quot;this takes 10x as much ops effort as the alternative and it's a death by a thousand papercuts&quot;, &quot;we architected this thing poorly and now it's very difficult to make changes that ought to be trivial&quot;, or &quot;a competitor of ours was able to accomplish the same thing with an order of magnitude less effort&quot;. I'll sometimes do informal postmortems by asking everyone involved oblique questions about what happened, but more for my own benefit than anything else, because I'm not sure people really want to hear the whole truth. This is especially sensitive if the effort has generated a round of promotions, which seems to be more common the more screwed up the project. The larger the project, the more visibility and promotions, even if the project could have been done with much less effort.</p> <a class="footnote-return" href="#fnref:J"><sup>[return]</sup></a></li> <li id="fn:P">I've spent a lot of time asking about why things are the way they are, both in areas where things are working well, and in areas where things are going badly. Where things are going badly, everyone has ideas. But where things are going well, as in the small company with the light-touch CEO mentioned above, almost no one has any idea why things work. It's magic. If you ask, people will literally tell you that it seems really similar to some other place they've worked, except that things are magically good instead of being terrible for reasons they don't understand. But it's not magic. It's hard work that very few people understand. Something I've seen multiple times is that, when a VP leaves, a company will become a substantially worse place to work, and it will slowly dawn on people that the VP was doing an amazing job at supporting not only their direct reports, but also making sure that everyone under them was having a good time. It's hard to see until it changes, but if you don't see anything obviously wrong, either you're not paying attention or someone or many someones have put a lot of work into making sure things run smoothly. <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> </ol> </div> Big companies v. startups startup-tradeoffs/ Thu, 17 Dec 2015 00:00:00 +0000 startup-tradeoffs/ <p>There's a meme that's been going around for a while now: you should join a startup because the money is better and the work is more technically interesting. <a href="http://paulgraham.com/wealth.html">Paul Graham says</a> that the best way to make money is to &quot;start or join a startup&quot;, which has been &quot;a reliable way to get rich for hundreds of years&quot;, and that you can &quot;compress a career's worth of earnings into a few years&quot;. Michael Arrington says that you'll <a href="https://www.jwz.org/blog/2011/11/watch-a-vc-use-my-name-to-sell-a-con/">become a part of history</a>. Joel Spolsky says that by joining a big company, you'll end up <a href="http://www.joelonsoftware.com/items/2008/05/01.html">playing foosball and begging people to look at your code</a>. 
Sam Altman says that if you join Microsoft, <a href="http://blog.samaltman.com/advice-for-ambitious-19-year-olds">you won't build interesting things and may not work with smart people</a>. They all claim that you'll learn more and have better options if you go work at a startup. Some of these links are a decade old now, but the same ideas are still circulating and those specific essays are still cited today.</p> <p></p> <p>Let's look at these points one by one.</p> <ol> <li>You'll earn much more money at a startup</li> <li>You won't do interesting work at a big company</li> <li>You'll learn more at a startup and have better options afterwards</li> </ol> <h3 id="1-earnings">1. Earnings</h3> <p>The numbers will vary depending on circumstances, but we can do a back of the envelope calculation and adjust for circumstances afterwards. Median income in the U.S. is <a href="https://en.wikipedia.org/wiki/Personal_income_in_the_United_States">about $30k/yr</a>. The somewhat bogus zeroth order lifetime earnings approximation I'll use is $30k * 40 = $1.2M. A new grad at Google/FB/Amazon with a lowball offer will have a total comp (salary + bonus + equity) of $130k/yr. <a href="http://www.glassdoor.com/Salary/Google-Senior-Software-Engineer-Salaries-E9079_D_KO7,31.htm">According to Glassdoor's current numbers</a>, someone who makes it to T5/senior at Google should have a total comp of around $250k/yr. These are fairly conservative numbers<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">1</a></sup>.</p> <p>Someone who's not particularly successful, but not particularly unsuccessful will probably make senior in five years<sup class="footnote-ref" id="fnref:5"><a rel="footnote" href="#fn:5">2</a></sup>. For our conservative baseline, let's assume that we'll never make it past senior, into the pay grades where compensation really skyrockets. We'd expect earnings (total comp including stock, but not benefits) to look something like:</p> <table> <thead> <tr> <th>Year</th> <th>Total Comp</th> <th>Cumulative</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>130k</td> <td>130k</td> </tr> <tr> <td>1</td> <td>160k</td> <td>290k</td> </tr> <tr> <td>2</td> <td>190k</td> <td>480k</td> </tr> <tr> <td>3</td> <td>220k</td> <td>700k</td> </tr> <tr> <td>4</td> <td>250k</td> <td>950k</td> </tr> <tr> <td>5</td> <td>250k</td> <td>1.2M</td> </tr> <tr> <td>…</td> <td>…</td> <td>…</td> </tr> <tr> <td>9</td> <td>250k</td> <td>2.2M</td> </tr> <tr> <td>…</td> <td>…</td> <td>…</td> </tr> <tr> <td>39</td> <td>250k</td> <td>9.7M</td> </tr> </tbody> </table> <p>Looks like it takes six years to gross a U.S. career's worth of income. If you want to adjust for the increased tax burden from earning a lot in a few years, add an extra year. Maybe add one to two more years if you decide to live in the bay or in NYC. If you decide not to retire, lifetime earnings for a 40 year career come in at almost $10M.</p> <p>One common, but false, objection to this is that your earnings will get eaten up by the cost of living in the bay area. Not only is this wrong, it's actually the opposite of correct. You can work at these companies from outside the bay area; most of these companies will pay you maybe 10% less if you work in a satellite office in a location where the cost of living is around the U.S. median, rather than at the headquarters in SV or Seattle (at least if you work in the US -- pay outside of the US is often much lower for reasons that don't really make sense to me). 
Market rate at smaller companies in these areas tends to be very low. When I interviewed in places like Portland and Madison, there was a 3x-5x difference between what most small companies were offering and what I could get at a big company in the same city. In places like Austin, where the market is a bit thicker, it was a 2x-3x difference. The difference in pay at 90%-ile companies is greater, not smaller, outside of the SF bay area.</p> <p>Another objection is that <a href="http://www.bls.gov/ooh/computer-and-information-technology/computer-programmers.htm">most programmers at most companies don't make this kind of money</a>. If, three or four years ago, you'd told me that there's a career track where it's totally normal to make $250k/yr after a few years, doing work that was fundamentally pretty similar to the work I was doing then, I'm not sure I would have believed it. No one I knew made that kind of money, except maybe the CEO of the company I was working at. Well him, and folks who went into medicine or finance.</p> <p>The only difference between then and now is that I took a job at a big company. When I took that job, the common story I heard at orientation was basically “I never thought I'd be able to get a job at Google, but a recruiter emailed me and I figured I might as well respond”. For some reason, women were especially likely to have that belief. Anyway, I've told that anecdote to multiple people who didn't think they could get a job at some trendy large company, who then ended up applying and getting in. And what you'll realize if you end up at a place like Google is that most of them are just normal programmers like you and me. If anything, I'd say that Google is, on average, less selective than the startup I worked at. When you only have to hire 100 people total, and half of them are folks you worked with as a technical fellow at one big company and then as an SVP at another one, you can afford to hire very slowly and be extremely selective. Big companies will hire more than 100 people per week, which means they can only be so selective.</p> <p>Despite the hype about how hard it is to get a job at Google/FB/wherever, your odds aren't that bad, and <a href="http://www.bloomberg.com/news/articles/2015-12-17/big-ipo-tiny-payout-for-many-startup-workers?trk=pulse-det-art_view_ext">they're certainly better than your odds striking it rich at a startup</a>, for which <a href="http://www.kalzumeus.com/2011/10/28/dont-call-yourself-a-programmer/">Patrick McKenzie has a handy cheatsheet</a>:</p> <blockquote> <p>Roll d100. (Not the right kind of geek? Sorry. rand(100) then.) <br> <strong>0~70</strong>: Your equity grant is worth nothing. <br> <strong>71~94</strong>: Your equity grant is worth a lump sum of money which makes you about as much money as you gave up working for the startup, instead of working for a megacorp at a higher salary with better benefits.<br> <strong>95~99</strong>: Your equity grant is a life changing amount of money. You won't feel rich — you're not the richest person you know, because many of the people you spent the last several years with are now richer than you by definition — but your family will never again give you grief for not having gone into $FAVORED_FIELD like a proper $YOUR_INGROUP.<br> <strong>100</strong>: You worked at the next Google, and are rich beyond the dreams of avarice. 
Congratulations.<br> Perceptive readers will note that 100 does not actually show up on a d100 or rand(100).</p> </blockquote> <p>For a more serious take that gives approximately the same results, <a href="https://80000hours.org/2014/05/how-much-do-y-combinator-founders-earn/">80000 hours finds that the average value of a YC founder after 5-9 years is $18M</a>. That sounds great! But there are a few things to keep in mind here. First, YC companies are unusually successful compared to the average startup. Second, in their analysis, 80000 hours notes that 80% of the money belongs to 0.5% of companies. Another 22% are worth enough that founder equity beats working for a big company, but that leaves 77.5% where that's not true.</p> <p>If you're an employee and not a founder, the numbers look a lot worse. If you're a very early employee you'd be quite lucky to get 1/10th as much equity as a founder. If we guess that 30% of YC startups fail before hiring their first employee, that puts the mean equity offering at $1.8M / .7 = $2.6M. That's low enough that for 5-9 years of work, you really need to be in the 0.5% for the payoff to be substantially better than working at a big company unless the startup is paying a very generous salary.</p> <p>There's a sense in which these numbers are too optimistic. Even if the company is successful and has a solid exit, there are plenty of things that can make your equity grant worthless. It's hard to get statistics on this, but anecdotally, this seems to be the common case in acquisitions.</p> <p>Moreover, the pitch that you'll only need to work for four years is usually untrue. To keep your <a href="//danluu.com/startup-options/">lottery ticket</a> until it pays out (or fizzles out), you'll probably have to stay longer. The most common form of equity at early stage startups is <a href="https://en.wikipedia.org/wiki/Incentive_stock_option">ISOs</a> that, by definition, expire at most 90 days after you leave. If you get in early, and leave after four years, you'll have to exercise your options if you want a chance at the lottery ticket paying off. If the company hasn't yet landed a large valuation, you might be able to get away with paying O(median US annual income) to exercise your options. If the company looks like a rocket ship and VCs are piling in, you'll have a massive tax bill, too, all for a lottery ticket.</p> <p>For example, say you joined company X early on and got options for 1% of the company when it was valued at $1M, so the cost of exercising all of your options is only $10k. Maybe you got lucky and four years later, the company is valued at $1B and your options have only been diluted to .5%. Great! For only $10k you can exercise your options and then sell the equity you get for $5M. Except that the company hasn't IPO'd yet, so if you exercise your options, you're stuck with a tax bill from making $5M, and by the time the company actually has an IPO, your stock could be worth anywhere from $0 to $LOTS. In some cases, you can sell your non-liquid equity for some fraction of its “value”, but my understanding is that it's getting more common for companies to add clauses that limit your ability to sell your equity before the company has an IPO. 
And even when your contract doesn't have a clause that prohibits you from selling your options on a secondary market, <a href="https://news.ycombinator.com/item?id=10705646">companies sometimes use backchannel communications to keep you from being able to sell your options</a>.</p> <p>Of course not every company is like this -- I hear that Dropbox has generously offered to buy out people's options at their current valuation for multiple years running and they now hand out RSUs instead of options, and Pinterest now gives people seven years to exercise their options after they leave -- but stories like that are uncommon enough that they're notable. The result is that people are incentivized to stay at most startups, even if they don't like the work anymore. From chatting with my friends at well regarded highly-valued startups, it sounds like many of them have a substantial fraction of zombie employees who are just mailing it in and waiting for a liquidity event. A common criticism of large companies is that they've got a lot of lifers who are mailing it in, but most large companies will let you leave any time after the first year and walk away with a pro-rated fraction of your equity package<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">3</a></sup>. It's startups where people are incentivized to stick around even if they don't care about the job.</p> <p>At a big company, we have a career's worth of income in six years with high probability once you get your foot in the door. This isn't quite as good as the claim that you'll be able to do that in three or four years at a startup, but the risk at a big company is very low once you land the job. In startup land, we have a lottery ticket that appears to have something like a 0.5% chance of paying off for very early employees. Startups might have had a substantially better expected value when Paul wrote about this in 2004, but big company compensation has increased much faster than compensation at the median startup. We're currently in the best job market the world has ever seen for programmers. That's likely to change at some point. The relative returns on going the startup route will probably look a lot better once things change, but for now, saving up some cash while big companies hand it out like candy doesn't seem like a bad idea.</p> <p>One additional thing to note is that it's possible to get the upside of working at a startup by working at a big company and investing in startups. As of this update (mid-2020), it's common for companies to raise seed rounds at valuations of ~$10M and take checks as small as $5k. This means, for $100k, you can get as much of the company as you'd get if you joined as a very early employee, perhaps even employee #1 if you're not already very senior or recognized in the industry. But the stock you get by investing has better terms than employee equity not even considering vesting, and since your investment doesn't need to vest and you get it immediately and you typically have to stay for four years for your employee equity to vest, you actually only need to invest $25k/yr to get the equity benefit of being a very early employee. Not only can you get better risk adjusted returns (by diversifying), you'll also have much more income if you work at a big company and invest $25k/yr than if you work at a startup.</p> <h3 id="2-interesting-work">2. Interesting work</h3> <p>We've established that big companies will pay you decently. But there's more to life than making money. 
After all, you spend 40 hours a week working (or more). How interesting is the work at big companies? Joel claimed that large companies don't solve interesting problems and that Google is <a href="http://www.joelonsoftware.com/items/2008/05/01.html">paying untenable salaries to kids with more ultimate frisbee experience than Python, whose main job will be to play foosball in the googleplex</a>, Sam Altman said something similar (but much more measured) about Microsoft, every third Michael O. Church comment is about how Google tricks a huge number of overqualified programmers into taking jobs that no one wants. Basically every advice thread on HN or reddit aimed at new grads will have multiple people chime in on how the experience you get at startups is better than the experience you'll get slaving away at a big company.</p> <p>The claim that big companies have boring work is too broad and absolute to even possibly be true. It depends on what kind of work you want to do. When I look at conferences where I find a high percentage of the papers compelling, the stuff I find to be the most interesting is pretty evenly split between big companies and academia, with the (very) occasional paper by a startup. For example, looking at ISCA this year, there's a 2:1 ratio of papers from academia to industry (and all of the industry papers are from big companies). But looking at the actual papers, a significant fraction of the academic papers are reproducing unpublished work that was done at big companies, sometimes multiple years ago. If I only look at the new work that I'm personally interested in, it's about a 1:1 ratio. There are some cases where a startup is working in the same area and not publishing, but that's quite rare and large companies do much more research that they don't publish. I'm just using papers as a proxy for having the kind of work I like. There are also plenty of areas where publishing isn't the norm, but large companies do the bulk of the cutting edge work.</p> <p>Of course YMMV here depending on what you want to do. I'm not really familiar with the landscape of front-end work, but it seems to me that big companies don't do the vast majority of the cutting edge non-academic work, the way they do with large scale systems. IIRC, there's an HN comment where Jonathan Tang describes how he created his own front-end work: he had the idea, told his manager about it, and got approval to make it happen. It's possible to do that kind of thing at a large company, but people often seem to have an easier time pursuing that kind of idea at a small company. And if your interest is in product, small companies seem like the better bet (though, once again, I'm pretty far removed from that area, so my knowledge is secondhand).</p> <p>But if you're interested in large systems, at both of my last two jobs, I've seen speculative research projects with 9 figure pilot budgets approved. In a pitch for one of the products, the pitch wasn't even that the project would make the company money. It was that a specific research area was important to the company, and that this infrastructure project would enable the company to move faster in that research area. Since the company is a $X billion dollar a year company, the project only needed to move the needle by a small percentage to be worth it. And so a research project whose goal was to speed up the progress of another research project was approved. 
Internally, this kind of thing is usually determined by politics, which some people will say makes it not worth it. But if you have a stomach for big company politics, startups simply don't have the resources to fund research problems that aren't core to their business. And many problems that would be hard problems at startups are curiosities at large companies.</p> <p>The flip side of this is that there are experiments that startups have a very easy time doing that established companies can't do. When I was at <a href="http://www.sigecom.org/">EC</a> a number of years ago, back when Facebook was still relatively young, the Google ad auction folks remarked to the FB folks that FB was doing the sort of experiments they'd do if they were small enough to do them, but they couldn't just change the structure of their ad auctions now that there was so much money flowing through their auctions. As with everything else we're discussing, there's a trade-off here and the real question is how to weight the various parts of the trade-off, not which side is better in all ways.</p> <p>The Michael O. Church claim is somewhat weaker: big companies have cool stuff to work on, but you won't be allowed to work on it until you've paid your dues working on boring problems. A milder phrasing of this is that getting to do interesting work is a matter of getting lucky and landing on an initial project you're interested in, but the key thing here is that most companies can give you a pretty good estimate about how lucky you're going to be. Google is notorious for its blind allocation process, and I know multiple people who ended up at MS because they had the choice between a great project at MS and blind allocation at Google, but even Google has changed this to some extent and it's not uncommon to be given multiple team options with an offer. In that sense, big companies aren't much different from startups. It's true that there are some startups that will basically only have jobs that are interesting to you (e.g., an early-stage distributed database startup if you're interested in building a distributed database). But at any startup that's bigger and less specialized, there's going to be work you're interested in and work you're not interested in, and it's going to be up to you to figure out if your offer lets you work on stuff you're interested in.</p> <p>Something to note is that if, per (1), you have the leverage to negotiate a good compensation package, you also have the leverage to negotiate for work that you want to do. We're in what is probably the best job market for programmers ever. That might change tomorrow, but until it changes, you have a lot of power to get work that you want.</p> <h3 id="3-learning-experience">3. Learning / Experience</h3> <p>What about the claim that experience at startups is more valuable? We don't have the data to do a rigorous quantitative comparison, but qualitatively, everything's on fire at startups, and you get a lot of breadth putting out fires, but you don't have the time to explore problems as deeply.</p> <p>I spent the first seven years of my career at a startup and I loved it. It was total chaos, which gave me the ability to work on a wide variety of different things and take on more responsibility than I would have gotten at a bigger company. 
I did everything from adding fault tolerance to an in-house distributed system to owning a quarter of a project that added ARM instructions to an x86 chip, creating both the fastest ARM chip at the time and the only chip capable of switching between ARM and x86 on the fly<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">4</a></sup>. That was a great learning experience.</p> <p>But I've had great learning experiences at big companies, too. At Google, my “starter” project was to join a previously one-person project, read the half finished design doc, provide feedback, and then start implementing. The impetus for the project was that people were worried that image recognition problems would require Google to double the number of machines it owns if a somewhat unlikely but not impossible scenario happened. That wasn't too much different from my startup experience, except for that bit about actually having a design doc, and that cutting infra costs could save billions a year instead of millions a year.</p> <p>Was that project a better or worse learning experience than the equivalent project at a startup? At a startup, the project probably would have continued to be a two-person show, and I would have learned all the things you learn when you bang out a project with not enough time and resources and do half the thing yourself. Instead, I ended up owning a fraction of the project and merely provided feedback on the rest, and it was purely a matter of luck (timing) that I had significant say on fleshing out the architecture. I definitely didn't get the same level of understanding I would have if I'd implemented half of it myself. On the other hand, the larger team meant that we actually had time to do things like design reviews and code reviews.</p> <p>If you care about impact, it's also easier to have a large absolute impact at a large company, due to the scale that big companies operate at. If I implemented what I'm doing now for a company the size of the startup I used to work for, it would have had an impact of maybe $10k/month. That's nothing to sneeze at, but it wouldn't have covered my salary. But the same thing at a big company is worth well over 1000x that. There are simply more opportunities to have high impact at large companies because they operate at a larger scale. The corollary to this is that startups are small enough that it's easier to have an impact on the company itself, even when the impact on the world is smaller in absolute terms. Nothing I do is make or break for a large company, but when I worked at a startup, it felt like what we did could change the odds of the company surviving.</p> <p>As far as having better options after having worked for a big company or having worked for a startup, if you want to work at startups, you'll probably have better options with experience at startups. If you want to work on the sorts of problems that are dominated by large companies, you're better off with more experience in those areas, at large companies. There's no right answer here.</p> <h3 id="conclusion">Conclusion</h3> <p>The compensation trade-off has changed a lot over time. When Paul Graham was writing in 2004, he used $80k/yr as a reasonable baseline for what “a good hacker” might make. Adjusting for inflation, <a href="http://data.bls.gov/cgi-bin/cpicalc.pl?cost1=80000&amp;year1=2004&amp;year2=2015">that's about $100k/yr now</a>. 
But the total comp for “a good hacker” is $250k+/yr, not even counting perks like free food and having really solid insurance. The trade-off has heavily tilted in favor of large companies.</p> <p>The interesting work trade-off has also changed a lot over time, but the change has been… bimodal. The existence of AWS and Azure means that ideas that would have taken millions of dollars in servers and operational expertise can be done with almost no fixed cost and low marginal costs. The scope of things you can do at an early-stage startup that were previously the domain of well funded companies is large and still growing. But at the same time, if you look at the work Google and MS are publishing at top systems conferences, startups are farther from being able to reproduce the scale-dependent work than ever before (and a lot of the most interesting work doesn't get published). Depending on what sort of work you're interested in, things might look relatively better or relatively worse at big companies.</p> <p>In any case, the reality is that the difference between types of companies is smaller than the differences between companies of the same type. That's true whether we're talking about startups vs. big companies or mobile gaming vs. biotech. This is recursive. The differences between different managers and teams at a company can easily be larger than the differences between companies. If someone tells you that you should work for a certain type of company, that advice is guaranteed to be wrong much of the time, whether that's a VC advocating that you should work for a startup or a Turing award winner telling you that you should work in a research lab.</p> <p>As for me, well, I don't know you and it doesn't matter to me whether you end up at a big company, a startup, or something in between. Whatever you decide, I hope you get to know your manager well enough to know that they have your back, your team well enough to know that you like working with them, and your project well enough to know that you find it interesting. Big companies have a class of dysfunction that's unusual at startups<sup class="footnote-ref" id="fnref:N"><a rel="footnote" href="#fn:N">5</a></sup> and startups have <a href="http://totalgarb.tumblr.com/tagged/startupbullshit">their own kinds of dysfunction</a>. You should figure out what the relevant tradeoffs are for you and what kind of dysfunction you want to sign up for.</p> <h3 id="related-advice-elsewhere">Related advice elsewhere</h3> <p><a href="//danluu.com/startup-options/">Myself on options vs. cash</a>.</p> <p><a href="https://web.archive.org/web/20141122200442/https://medium.com/@jocelyngoldfein/the-innovation-dead-end-ee25f8aa090f">Jocelyn Goldfein on big companies vs. small companies</a>.</p> <p><a href="http://www.kalzumeus.com/2011/10/28/dont-call-yourself-a-programmer/">Patrick McKenzie on providing business value vs. technical value</a>, with <a href="http://yosefk.com/blog/do-call-yourself-a-programmer-and-other-career-advice.html">a response from Yossi Kreinin</a>.</p> <p><a href="http://yosefk.com/blog/do-you-really-want-to-be-making-this-much-money-when-youre-50.html">Yossi Kreinin on passion vs. money</a>, and with a <a href="http://yosefk.com/blog/stock-options-a-balanced-approach.html">rebuttal to this post on regret minimization</a>.</p> <p><em>Update: The responses on this post have been quite divided. Folks at big companies usually agree, except that the numbers seem low to them, especially for new grads. 
This is true even for people who live in places which have a cost of living similar to the U.S. median. On the other hand, a lot of people vehemently maintain that the numbers in this post are basically impossible. A lot of people are really invested in the idea that they're making about as much as possible. If you've decided that making less money is the right trade-off for you, that's fine and I don't have any problem with that. But if you really think that you can't make that much money and you don't believe me, I recommend talking to one of the hundreds of thousands of engineers at one of the many large companies that pays well.</em></p> <p><em>Update 2: This post was originally written in 2015, when the $250k number would be conservative but not unreasonable for someone who's &quot;senior&quot; at Google or FB. If we look at the situation today in 2017, people <a href="https://twitter.com/danluu/status/942452212445405185">one entire band below that are regularly bringing in $250k</a> and a better estimate might be $300k or $350k. I'm probably more bearish on future dev compensation than most people, but things are looking pretty good for now and an event that wipes out big company dev compensation seems likely to do even worse things to the options packages for almost all existing startups.</em></p> <p><em>Update 3: Added note on being able to invest in startups in 2020. I didn't realize that this was possible without having a lot of wealth until around 2019.</em></p> <p><small> Thanks to Kelly Eskridge, Leah Hanson, Julia Evans, Alex Clemmer, Ben Kuhn, Malcolm Matalka, Nick Bergson-Shilcock, Joe Wilder, Nat Welch, Darius Bacon, Lindsey Kuper, Prabhakar Ragde, Pierre-Yves Baccou, David Turner, Oskar Thoren, Katerina Barone-Adesi, Scott Feeney, Ralph Corderoy, Ezekiel Benjamin Smithburg, @agentwaj, and Kyle Littler for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:C"><p>In particular, the Glassdoor numbers seem low for an average. I suspect that's because their average is weighed down by older numbers, while compensation has skyrocketed over the past seven years. The average numbers on Glassdoor don't even match the average numbers I heard from other people in my Midwestern satellite office in a large town two years ago, and the market has gone up sharply since then. More recently, on the upper end, I know someone fresh out of school who has a total comp of almost $250k/yr ($350k equity over four years, a $50k signing bonus, plus a generous salary). As is normal, he got a number of offers with varying compensation levels, and then Facebook came in and bid him up. The companies that are serious about competing for people matched the offers, and that was that. This included bids in Seattle and Austin that matched the bids in SV. If you're negotiating an offer, the thing that's critical isn't to be some kind of super genius. It's enough to be pretty good, know what the market is paying, and have multiple offers. This person was worth every penny, which is why he got his offers, but I know several people who are just as good who make half as much just because they only got a single offer and had no leverage.</p> <p>Anyway, the point of this footnote is just that the total comp for experienced engineers can go way above the numbers mentioned in the post. In the analysis that follows, keep in mind that I'm using conservative numbers and that an aggressive estimate for experienced engineers would be much higher. 
Just for example, at Google, senior is level 5 out of 11 on a scale that effectively starts at 3. At Microsoft, it's 63 on a weirdo scale that starts at 59 and goes to 70-something and then jumps up to 80 (or something like that, I always forget the details because the scale is so silly). Senior isn't a particularly high band, and people at senior often have total comp substantially greater than $250k/yr. Note that these numbers also don't include the above-market stock growth at trendy large companies in the past few years. If you've actually taken this deal, your RSUs have likely appreciated substantially.</p> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:5"><p>This depends on the company. It's true at places like Facebook and Google, which make a serious effort to retain people. It's nearly completely untrue at places like IBM, National Instruments (NI), and Epic Systems, which don't even try. And it's mostly untrue at places like Microsoft, which tries, but in the most backwards way possible.</p> <p>Microsoft (and other mid-tier companies) will give you an ok offer and match good offers from other companies. That by itself is already problematic since it incentivizes people who are interviewing at Microsoft to also interview elsewhere. But the worse issue is that they do the same when retaining employees. If you stay at Microsoft for a long time and aren't one of the few people on the fast track to &quot;partner&quot;, your pay is going to end up severely below market, sometimes by as much as a factor of two. When you realize that, and you interview elsewhere, Microsoft will match external offers, but after getting underpaid for years, by hundreds of thousands or millions of dollars (depending on how long you've been there), the promise of making market rate for a single year and then being underpaid for the foreseeable future doesn't seem very compelling. The incentive structure appears as if it were designed to cause people who are between average and outstanding to leave. I've seen this happen with multiple people and I know multiple others who are planning to leave for this exact reason. Their managers are always surprised when this happens, but they shouldn't be; it's eminently predictable.</p> <p>The IBM strategy actually makes a lot more sense to me than the Microsoft strategy. You can save a lot of money by paying people poorly. That makes sense. But why bother paying a lot to get people in the door and then incentivizing them to leave? While it's true that the very top people I work with are well compensated and seem happy about it, there aren't enough of those people that you can rely on them for everything.</p> <a class="footnote-return" href="#fnref:5"><sup>[return]</sup></a></li> <li id="fn:M">Some are better about this than others. Older companies, like MS, sometimes have yearly vesting, but a lot of younger companies, like Google, have much smoother vesting schedules once you get past the first year. And then there's Amazon, which backloads its offers, knowing that they have a high attrition rate and won't have to pay out much. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:S">Sadly, we ended up not releasing this for business reasons that came up later. 
<a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:N"><p>My very first interaction with an employee at big company X orientation was having that employee tell me that I couldn't get into orientation because I wasn't on the list. I had to ask how I could get on the list, and I was told that I'd need an email from my manager to get on the list. This was at around 7:30am because orientation starts at 7:30 and then runs for half a day for reasons no one seems to know (I've asked a lot of people, all the way up to VPs in HR). When I asked if I could just come back later in the day, I was told that if I couldn't get in within an hour I'd have to come back next week. I also asked if the fact that I was listed in some system as having a specific manager was evidence that I was supposed to be at orientation and was told that I had to be on the list. So I emailed my manager, but of course he didn't respond because who checks their email at 7:30am? Luckily, my manager had previously given me his number and told me to call if I ever needed anything, and being able to get into orientation and not have to show up at 7:30am again next week seemed like anything, so I gave him a call. Naturally, he asked to talk to the orientation gatekeeper; when I relayed that to the orientation guy, he told me that he couldn't talk on the phone -- you see, he can only accept emails and can't talk on the phone, not even just to clarify something. Five minutes into orientation, I was already flabbergasted. But, really, I should have considered myself lucky -- the other person who “wasn't on the list” didn't have his manager's phone number, and as far as I know, he had to come back the next week at 7:30am to get into orientation. I asked the orientation person how often this happens, and he told me “very rarely, only once or twice per week”.</p> <p>That experience was repeated approximately every half hour for the duration of orientation. I didn't get dropped from any other orientation stations, but when I asked, I found that every station had errors that dropped people regularly. My favorite was the station where someone was standing at the input queue, handing out a piece of paper. The piece of paper informed you that the machine at the station was going to give you an error with some instructions about what to do. Instead of following those instructions, you had to follow the instructions on the piece of paper when the error occurred.</p> <p>These kinds of experiences occupied basically my entire first week. Now that I'm past onboarding and onto the regular day-to-day, I have a surreal Kafka-esque experience a few times a week. And I've mostly figured out how to navigate the system (usually, knowing the right person and asking them to intervene solves the problem). What I find to be really funny isn't the actual experience, but that most people I talk to who've been here a while think that <a href="//danluu.com/wat/">it literally cannot be any other way</a> and that things could not possibly be improved. Curiously, people who have been here as long but are very senior tend to agree that the company has its share of big company dysfunction. I wish I had enough data on that to tell which way the causation runs (are people who are aware of the dysfunction more likely to last long enough to become very senior, or does being very senior give you a perspective that lets you see more dysfunction). 
Something that's even curiouser is that the company invests a fair amount of effort to give people the impression that things are as good as they could possibly be. At orientation, we got a version of history that made it sound as if the company had pioneered everything from the GUI to the web, with multiple claims that we have the best X in the world, even when X is pretty clearly mediocre. It's not clear to me what the company gets out of making sure that most employees don't understand what the downsides are in our own products and processes.</p> <p>Whatever the reason, the attitude that things couldn't possibly be improved isn't just limited to administrative issues. A friend of mine needed to find a function to do something that's a trivial one-liner on Linux, but that's considerably more involved on our OS. His first attempt was to use boost, but it turns out that the documentation for doing this on the OS we use is complicated enough that boost got this wrong and has had a bug in it for years. A couple of days and 72 lines of code later, he managed to figure out how to create a function to accomplish his goal. Since he wasn't sure if he was missing something, he forwarded the code review to two very senior engineers (one level below Distinguished Engineer). They weren't sure and forwarded it on to the CTO, who said that he didn't see a simpler way to accomplish the same thing in our OS with the APIs as they currently are.</p> <p>Later, my friend had a heated discussion with someone on the OS team, who maintained that the documentation on how to do this was very clear, and that it couldn't be clearer, nor could the API be any easier. This is despite this being so hard to do that boost has been wrong for seven years, and that two very senior engineers didn't feel confident enough to review the code and passed it up to a CTO.</p> <p>I'm going to stop here not because I'm out of incidents like this, but because a retelling of a half year of big company stories is longer than my blog. Not just longer than this post or any individual post, but longer than everything else on my blog combined, which is a bit over 100k words. Typical estimates for words per page vary between 250 and 1000, putting my rate of surreal experiences at somewhere between 100 and 400 pages every six months. I'm not sure this rate is inherently different from the rate you'd get <a href="http://totalgarb.tumblr.com/tagged/startupbullshit">at startups</a>, but there's a different flavor to the stories and you should have an idea of the flavor by this point.</p> <a class="footnote-return" href="#fnref:N"><sup>[return]</sup></a></li> </ol> </div> Files are hard file-consistency/ Sat, 12 Dec 2015 00:00:00 +0000 file-consistency/ <p>I haven't used a desktop email client in years. None of them could handle the volume of email I get without at least occasionally corrupting my mailbox. Pine, Eudora, and Outlook have all corrupted my inbox, forcing me to restore from backup. How is it that desktop mail clients are less reliable than gmail, even though my gmail account not only handles more email than I ever had on desktop clients, but also allows simultaneous access from multiple locations across the globe? Distributed systems have an unfair advantage, in that they can be robust against total disk failure in a way that desktop clients can't, but none of the file corruption issues I've had have been from total disk failure. 
Why has my experience with desktop applications been so bad?</p> <p></p> <p>Well, what sort of failures can occur? Crash consistency (maintaining consistent state even if there's a crash) is probably the easiest property to consider, since we can assume that everything, from the filesystem to the disk, works correctly; let's consider that first.</p> <h3 id="crash-consistency">Crash Consistency</h3> <p>Pillai et al. had a <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf">paper</a> and <a href="https://www.usenix.org/sites/default/files/conference/protected-files/osdi14_slides_pillai.pdf">presentation</a> at OSDI '14 on exactly how hard it is to save data without corruption or data loss.</p> <p>Let's look at a simple example of what it takes to save data in a way that's robust against a crash. Say we have a file that contains the text <code>a foo</code> and we want to update the file to contain <code>a bar</code>. The pwrite function looks like it's designed for this exact thing. It takes a file descriptor, what we want to write, a length, and an offset. So we might try</p> <pre><code>pwrite([file], "bar", 3, 2) // write 3 bytes at offset 2 </code></pre> <p>What happens? If nothing goes wrong, the file will contain <code>a bar</code>, but if there's a crash during the write, we could get <code>a boo</code>, <code>a far</code>, or any other combination. Note that you may want to consider this an example over sectors or blocks and not chars/bytes.</p> <p>If we want atomicity (so we either end up with <code>a foo</code> or <code>a bar</code> but nothing in between), one standard technique is to make a copy of the data we're about to change in an <a href="http://www.cburch.com/cs/340/reading/log/index.html">undo log</a> file, modify the “real” file, and then delete the log file. If a crash happens, we can recover from the log. We might write something like</p> <pre><code>creat(/dir/log); write(/dir/log, "2,3,foo", 7); pwrite(/dir/orig, "bar", 3, 2); unlink(/dir/log); </code></pre> <p>This should allow recovery from a crash without data corruption via the undo log, at least if we're using <code>ext3</code> and we made sure to mount our drive with <code>data=journal</code>. But we're out of luck if, like most people, we're using the default<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">1</a></sup> -- with the default <code>data=ordered</code>, the <code>write</code> and <code>pwrite</code> syscalls can be reordered, causing the write to <code>orig</code> to happen before the write to the log, which defeats the purpose of having a log. We can fix that.</p> <pre><code>creat(/dir/log); write(/dir/log, "2, 3, foo"); fsync(/dir/log); // don't allow write to be reordered past pwrite pwrite(/dir/orig, "bar", 3, 2); fsync(/dir/orig); unlink(/dir/log); </code></pre> <p>That should force things to occur in the correct order, at least if we're using ext3 with <code>data=journal</code> or <code>data=ordered</code>. If we're using <code>data=writeback</code>, a crash during the <code>write</code> or <code>fsync</code> to the log can leave <code>log</code> in a state where the filesize has been adjusted for the write, but the data hasn't been written, which means that the log will contain random garbage. This is because with <code>data=writeback</code>, metadata is <a href="https://en.wikipedia.org/wiki/Journaling_file_system">journaled</a>, but data operations aren't, which means that data operations (like writing data to a file) aren't ordered with respect to metadata operations (like adjusting the size of a file for a write).</p> <p>We can fix that by adding a checksum to the log file when creating it. If the contents of <code>log</code> don't contain a valid checksum, then we'll know that we ran into the situation described above.</p> <pre><code>creat(/dir/log); write(/dir/log, "2, 3, [checksum], foo"); // add checksum to log file fsync(/dir/log); pwrite(/dir/orig, "bar", 3, 2); fsync(/dir/orig); unlink(/dir/log); </code></pre> <p>That's safe, at least on current configurations of ext3. But it's legal for a filesystem to end up in a state where the log is never created unless we issue an fsync to the parent directory.</p> <pre><code>creat(/dir/log); write(/dir/log, "2, 3, [checksum], foo"); fsync(/dir/log); fsync(/dir); // fsync parent directory of log file pwrite(/dir/orig, "bar", 3, 2); fsync(/dir/orig); unlink(/dir/log); </code></pre> <p>That should prevent corruption on any Linux filesystem, but if we want to make sure that the file actually contains “bar”, we need another fsync at the end.</p> <pre><code>creat(/dir/log); write(/dir/log, "2, 3, [checksum], foo"); fsync(/dir/log); fsync(/dir); pwrite(/dir/orig, "bar", 3, 2); fsync(/dir/orig); unlink(/dir/log); fsync(/dir); </code></pre> <p>That results in consistent behavior and guarantees that our operation actually modifies the file after it's completed, as long as we assume that <code>fsync</code> actually flushes to disk. OS X and some versions of ext3 have an fsync that doesn't really flush to disk. OS X requires <code>fcntl(F_FULLFSYNC)</code> to flush to disk, and some versions of ext3 only flush to disk if the <a href="https://en.wikipedia.org/wiki/Inode">inode</a> changed (which would only happen at most once a second on writes to the same file, since the inode mtime has one second granularity), as an optimization.</p> <p>Even if we assume fsync issues a flush command to the disk, some disks ignore flush directives for the same reason fsync is gimped on OS X and some versions of ext3 -- to look better in benchmarks. Handling that is beyond the scope of this post, but the <a href="http://www.researchgate.net/profile/Vijay_Chidambaram/publication/220958003_Coerced_Cache_Eviction_and_discreet_mode_journaling_Dealing_with_misbehaving_disks/links/54d0f0190cf29ca811040c8a.pdf">Rajimwale et al. DSN '11 paper</a> and related work cover that issue.</p> 
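<p>To make the sequence above concrete, here's a rough sketch of that final protocol as actual syscalls. This isn't code from the paper or a vetted library -- it's a minimal illustration that assumes Linux, uses hypothetical names (<code>safe_overwrite</code>, <code>fsync_path</code>), substitutes FNV-1a for a real CRC, and leaves out the recovery path and most error handling:</p> <pre><code>/* Sketch of the undo-log protocol above: save the old bytes (plus a checksum)
 * in a log, fsync the log and its directory, overwrite the original file,
 * fsync it, then remove the log and fsync the directory again. */
#include &lt;fcntl.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

static uint32_t checksum(const char *p, size_t n)  /* FNV-1a; use a real CRC in practice */
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i &lt; n; i++) h = (h ^ (uint8_t)p[i]) * 16777619u;
    return h;
}

static int fsync_path(const char *path)            /* fsync a file or directory by path */
{
    int fd = open(path, O_RDONLY);
    if (fd &lt; 0) return -1;
    int r = fsync(fd);
    close(fd);
    return r;
}

/* Overwrite len bytes of orig at offset with buf, journaling the old bytes first. */
int safe_overwrite(const char *dir, const char *orig, const char *log,
                   const char *buf, size_t len, off_t offset)
{
    char old[4096];
    if (len &gt; sizeof(old)) return -1;
    int ofd = open(orig, O_RDWR);
    if (ofd &lt; 0) return -1;

    /* 1. Save the bytes we're about to clobber, plus a checksum, in the log. */
    if (pread(ofd, old, len, offset) != (ssize_t)len) return -1;
    int lfd = open(log, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (lfd &lt; 0) return -1;
    dprintf(lfd, "%lld,%zu,%u,", (long long)offset, len, (unsigned)checksum(old, len));
    write(lfd, old, len);
    if (fsync(lfd) != 0) return -1;      /* log contents are durable...            */
    close(lfd);
    if (fsync_path(dir) != 0) return -1; /* ...and so is the log's directory entry */

    /* 2. Now it's safe to modify the real file. */
    if (pwrite(ofd, buf, len, offset) != (ssize_t)len) return -1;
    if (fsync(ofd) != 0) return -1;
    close(ofd);

    /* 3. Discard the log and make the unlink durable. */
    unlink(log);
    return fsync_path(dir);
}
</code></pre> <p>Even this sketch isn't the whole story: on a restart, recovery code still has to read the log back, validate the checksum, and undo a partial overwrite, and, as just noted, the whole scheme rests on fsync actually flushing to disk.</p> 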
<h3 id="filesystem-semantics">Filesystem semantics</h3> <p>When the authors examined ext2, ext3, ext4, btrfs, and xfs, they found that there are substantial differences in how code has to be written to preserve consistency. They wrote a tool that collects block-level filesystem traces, and used that to determine which properties don't hold for specific filesystems. The authors are careful to note that they can only determine when properties don't hold -- if they don't find a violation of a property, that's not a guarantee that the property holds.</p> <p><img src="images/file-consistency/fs_properties.png" alt="Different filesystems have very different properties"></p> <p>Xs indicate that a property is violated. The atomicity properties are basically what you'd expect, e.g., no X for single sector overwrite means that writing a single sector is atomic. 
The authors note that the atomicity of single sector overwrite sometimes comes from a property of the disks they're using, and that running these filesystems on some disks won't give you single sector atomicity. The ordering properties are also pretty much what you'd expect from their names, e.g., an X in the “Overwrite -&gt; Any op” row means that an overwrite can be reordered with some operation.</p> <p>After they created a tool to test filesystem properties, they then created a tool to check if any applications rely on any potentially incorrect filesystem properties. Because invariants are application specific, the authors wrote checkers for each application tested.</p> <p><img src="images/file-consistency/program_bugs.png" alt="Everything is broken"></p> <p>The authors find issues with most of the applications tested, including things you'd really hope would work, like LevelDB, HDFS, Zookeeper, and git. In a talk, one of the authors noted that the developers of sqlite have a very deep understanding of these issues, but even that wasn't enough to prevent all bugs. That speaker also noted that version control systems were particularly bad about this, and that the developers had a pretty lax attitude that made it very easy for the authors to find a lot of issues in their tools. The most common class of error was incorrectly assuming ordering between syscalls. The next most common class of error was assuming that syscalls were atomic<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">2</a></sup>. These are fundamentally the same issues people run into when doing multithreaded programming. Correctly reasoning about re-ordering behavior and inserting barriers correctly is hard. But even though shared memory concurrency is considered a hard problem that requires great care, writing to files isn't treated the same way, even though it's actually harder in a number of ways.</p> <p>Something to note here is that while btrfs's semantics aren't inherently less reliable than ext3/ext4, many more applications corrupt data on top of btrfs because developers aren't used to coding against filesystems that allow directory operations to be reordered (ext2 is perhaps the most recent widely used filesystem that allowed that reordering). We'll probably see a similar level of bug exposure when people start using NVRAM drives that have byte-level atomicity. People almost always just run some tests to see if things work, rather than making sure they're coding against what's legal in a POSIX filesystem.</p> <p>Hardware memory ordering semantics are usually <a href="//danluu.com/new-cpu-features/#memory-concurrency">well documented</a> in a way that makes it simple to determine precisely which operations can be reordered with which other operations, and which operations are atomic. By contrast, here's <a href="http://man7.org/linux/man-pages/man5/ext4.5.html">the ext manpage</a> on its three data modes:</p> <blockquote> <p>journal: All data is committed into the journal prior to being written into the main filesystem.</p> <p>ordered: This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.</p> <p>writeback: Data ordering is not preserved – data may be written into the main filesystem after its metadata has been committed to the journal. <strong>This is rumoured to be</strong> the highest-throughput option. 
It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.</p> </blockquote> <p>The manpage literally refers to rumor. This is the level of documentation we have. If we look back at our example where we had to add an <code>fsync</code> between the <code>write(/dir/log, "2, 3, foo")</code> and <code>pwrite(/dir/orig, "bar", 3, 2)</code> to prevent reordering, I don't think the necessity of the <code>fsync</code> is obvious from the description in the manpage. If you look at the hardware memory ordering “manpage” above, it specifically defines the ordering semantics, and it certainly doesn't rely on rumor.</p> <p>This isn't to say that filesystem semantics aren't documented anywhere. Between <a href="http://lwn.net/">lwn</a> and LKML, it's possible to get a good picture of how things work. But digging through all of that is hard enough that it's still quite common <a href="http://austingroupbugs.net/view.php?id=672">for there to be long, uncertain discussions on how things work</a>. A lot of the information out there is wrong, and even when information was right at the time it was posted, it often goes out of date.</p> <p>When digging through archives, I've often seen a post from 2005 cited to back up the claim that OS X <code>fsync</code> is the same as Linux <code>fsync</code>, and that OS X <code>fcntl(F_FULLFSYNC)</code> is even safer than anything available on Linux. Even at the time, I don't think that was true for the 2.4 kernel, although it was true for the 2.6 kernel. But since 2008 or so, Linux 2.6 with ext3 will do a full flush to disk for each fsync (if the disk supports it, and the filesystem hasn't been specially configured with barriers off).</p> <p>Another issue is that you often also see exchanges <a href="http://lkml.iu.edu/hypermail/linux/kernel/0908.3/01481.html">like this one</a>:</p> <p><strong>Dev 1</strong>: Personally, I care about metadata consistency, and ext3 documentation suggests that journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there.<br> <strong>Dev 2</strong>: as the ext3 authors have stated many times over the years, you still need to run fsck periodically anyway.<br> <strong>Dev 1</strong>: Where is that documented?<br> <strong>Dev 2</strong>: linux-kernel mailing list archives.<br> <strong>Dev 3</strong>: Probably from some 6-8 years ago, in e-mail postings that I made.<br></p> <p>Where's this documented? Oh, in some mailing list post 6-8 years ago (which makes it 12-14 years from today). I don't mean to pick on filesystem devs. The fs devs whose posts I've read are quite polite compared to LKML's reputation; they generously spend a lot of their time responding to basic questions, and I'm impressed by how patient the expert fs devs are with askers. But it's hard for outsiders to trawl through a decade and a half of mailing list postings to figure out which ones are still valid and which ones have been obsoleted!</p> <p>In their OSDI 2014 talk, the authors of the paper we're discussing noted that when they reported bugs they'd found, developers would often respond “POSIX doesn't let filesystems do that”, without being able to point to any specific POSIX documentation to support their statement. If you've followed Kyle Kingsbury's Jepsen work, this may sound familiar, except devs respond with “filesystems don't do that” instead of “networks don't do that”. 
I think this is understandable, given how much misinformation is out there. Not being a filesystem dev myself, I'd be a bit surprised if I don't have at least one bug in this post.</p> <h3 id="filesystem-correctness">Filesystem correctness</h3> <p>We've already encountered a lot of complexity in saving data correctly, and this only scratches the surface of what's involved. So far, we've assumed that the disk works properly, or at least that the filesystem is able to detect when the disk has an error via <a href="https://en.wikipedia.org/wiki/S.M.A.R.T.">SMART</a> or some other kind of monitoring. I'd always figured that was the case until I started looking into it, but that assumption turns out to be completely wrong.</p> <p>The <a href="http://research.cs.wisc.edu/wind/Publications/iron-sosp05.pdf">Prabhakaran et al. SOSP 05 paper</a> examined how filesystems respond to disk errors in some detail. They created a fault injection layer that allowed them to inject disk faults and then ran things like <code>chdir</code>, <code>chroot</code>, <code>stat</code>, <code>open</code>, <code>write</code>, etc. to see what would happen.</p> <p>Between ext3, reiserfs, and NTFS, reiserfs is the best at handling errors and it seems to be the only filesystem where errors were treated as first class citizens during design. It's mostly consistent about propagating errors to the user on reads, and calling <code>panic</code> on write failures, which triggers a restart and recovery. This general policy allows the filesystem to gracefully handle read failure and avoid data corruption on write failures. However, the authors found a number of inconsistencies and bugs. For example, reiserfs doesn't correctly handle read errors on indirect blocks and leaks space, and a specific type of write failure doesn't prevent reiserfs from updating the journal and committing the transaction, which can result in data corruption.</p> <p>Reiserfs is the good case. The authors found that ext3 ignored write failures in most cases, and rendered the filesystem read-only in most cases for read failures. This seems like pretty much the opposite of the policy you'd want. Ignoring write failures can easily result in data corruption, and remounting the filesystem as read-only is a drastic overreaction if the read error was a transient error (transient errors are common). Additionally, ext3 did the least consistency checking of the three filesystems and was the most likely to not detect an error. In one presentation, one of the authors remarked that the ext3 code had lots of comments like “I really hope a write error doesn't happen here&quot; in places where errors weren't handled.</p> <p>NTFS is somewhere in between. The authors found that it has many consistency checks built in, and is pretty good about propagating errors to the user. However, like ext3, it ignores write failures.</p> <p>The paper has much more detail on the exact failure modes, but the details are mostly of historical interest as many of the bugs have been fixed.</p> <p>It would be really great to see an updated version of the paper, and in one presentation someone in the audience asked if there was more up to date information. The presenter replied that they'd be interested in knowing what things look like now, but that it's hard to do that kind of work in academia because grad students don't want to repeat work that's been done before, which is pretty reasonable given the incentives they face. 
Doing replications is a lot of work, often nearly as much work as the original paper, and replications usually give little to no academic credit. This is one of the many cases where the incentives align very poorly with producing real world impact.</p> <p>The <a href="http://usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al. FAST '08 paper</a> is another one it would be great to see replicated today. That paper follows up on the paper we just looked at, and examines the error handling code in different file systems, using a simple static analysis tool to find cases where errors are being thrown away. Being thrown away is defined very loosely in the paper -- code like the following</p> <pre><code>if (error) { printk("I have no idea how to handle this error\n"); } </code></pre> <p>is considered <em>not</em> throwing away the error. Errors are considered to be ignored if the execution flow of the program doesn't depend on the error code returned from a function that returns an error code.</p> 
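<p>To make that criterion concrete, here's a tiny, hypothetical example (not from the paper). A checker like theirs would flag the first function, where nothing depends on the return value of <code>fsync</code>, but not the second, which branches on the error even though, much like the <code>printk</code> example above, it doesn't really recover either:</p> <pre><code>#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

/* Flagged: fsync returns -1 on failure, but control flow here ignores it. */
void save_ignoring_error(int fd)
{
    fsync(fd);
}

/* Not flagged: the branch depends on the error code, even though this still
 * doesn't do anything useful about the failure. */
int save_checking_error(int fd)
{
    int err = fsync(fd);
    if (err != 0)
        fprintf(stderr, "fsync failed; I have no idea how to handle this\n");
    return err;
}

int main(void)
{
    save_ignoring_error(STDOUT_FILENO);
    return save_checking_error(STDOUT_FILENO) == 0 ? 0 : 1;
}
</code></pre> 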
<p>With that tool, they find that most filesystems drop a lot of error codes:</p> <table border="1"> <tr> <th></th> <th colspan="2">By % Broken</th> <th colspan="2">By Viol/Kloc</th> </tr> <tr> <th>Rank</th> <th>FS</th> <th>Frac.</th> <th>FS</th> <th>Viol/Kloc</th> </tr> <tr> <td>1</td> <td>IBM JFS</td> <td>24.4</td> <td>ext3</td> <td>7.2</td> </tr> <tr> <td>2</td> <td>ext3</td> <td>22.1</td> <td>IBM JFS</td> <td>5.6</td> </tr> <tr> <td>3</td> <td>JFFS v2</td> <td>15.7</td> <td>NFS Client</td> <td>3.6</td> </tr> <tr> <td>4</td> <td>NFS Client</td> <td>12.9</td> <td>VFS</td> <td>2.9</td> </tr> <tr> <td>5</td> <td>CIFS</td> <td>12.7</td> <td>JFFS v2</td> <td>2.2</td> </tr> <tr> <td>6</td> <td>MemMgmt</td> <td>11.4</td> <td>CIFS</td> <td>2.1</td> </tr> <tr> <td>7</td> <td>ReiserFS</td> <td>10.5</td> <td>MemMgmt</td> <td>2.0</td> </tr> <tr> <td>8</td> <td>VFS</td> <td>8.4</td> <td>ReiserFS</td> <td>1.8</td> </tr> <tr> <td>9</td> <td>NTFS</td> <td>8.1</td> <td>XFS</td> <td>1.4</td> </tr> <tr> <td>10</td> <td>XFS</td> <td>6.9</td> <td>NFS Server</td> <td>1.2</td> </tr> </table> <p>Comments they found next to ignored errors include: &quot;Should we pass any errors back?&quot;, &quot;Error, skip block and hope for the best.&quot;, &quot;There's no way of reporting error returned from ext3_mark_inode_dirty() to user space. So ignore it.&quot;, &quot;Note: todo: log error handler.&quot;, &quot;We can't do anything about an error here.&quot;, &quot;Just ignore errors at this point. There is nothing we can do except to try to keep going.&quot;, &quot;Retval ignored?&quot;, and &quot;Todo: handle failure.&quot;</p> <p>One thing to note is that in a lot of cases, ignoring an error is more of a symptom of an architectural issue than a bug per se (e.g., ext3 ignored write errors during checkpointing because it didn't have any kind of recovery mechanism). But even so, the authors of the papers found many real bugs.</p> <h3 id="error-recovery">Error recovery</h3> <p>Every widely used filesystem has bugs that will cause problems on error conditions, which brings up two questions: can recovery tools robustly fix errors, and how often do errors occur? The <a href="http://usenix.org/legacy/events/osdi08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al. OSDI 08 paper</a> looks at the first question and finds that fsck, a standard utility for checking and repairing file systems, “checks and repairs certain pointers in an incorrect order . . . the file system can even be unmountable after”.</p> <p>At this point, we know that it's quite hard to write files in a way that ensures their robustness even when the underlying filesystem is correct, that the underlying filesystem will have bugs, and that attempting to repair corruption to the filesystem may damage it further or destroy it. 
How often do errors happen?</p> <h3 id="error-frequency">Error frequency</h3> <p>The <a href="http://bnrg.eecs.berkeley.edu/~randy/Courses/CS294.F07/11.1.pdf">Bairavasundaram et al. SIGMETRICS '07 paper</a> found that, depending on the exact model, between 5% and 20% of disks would have at least one error over a two year period. Interestingly, many of these were isolated errors -- 38% of disks with errors had only a single error, and 80% had fewer than 50 errors. <a href="https://www.usenix.org/legacy/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/main.html">A follow-up study</a> looked at corruption and found that silent data corruption that was only detected by checksumming happened on .5% of disks per year, with one extremely bad model showing corruption on 4% of disks in a year.</p> <p>It's also worth noting that they found very high locality in error rates between disks on some models of disk. For example, there was one model of disk that had a very high error rate in one specific sector, making many forms of RAID nearly useless for redundancy.</p> <p>That's another study it would be nice to see replicated. <a href="https://www.backblaze.com/blog/hard-drive-reliability-q3-2015/">Most studies on disk focus on the failure rate of the entire disk</a>, but if what you're worried about is data corruption, errors in non-failed disks are more worrying than disk failure, which is easy to detect and mitigate.</p> <h3 id="conclusion">Conclusion</h3> <p>Files are hard. <a href="//danluu.com/butler-lampson-1999/#parallelism">Butler Lampson has remarked</a> that when they came up with threads, locks, and condition variables at PARC, they thought that they were creating a programming model that anyone could use, but that there's now decades of evidence that they were wrong. We've accumulated a lot of evidence that humans are very bad at reasoning about these kinds of problems, which are very similar to the problems you have when writing correct code to interact with current filesystems. Lampson suggests that the best known general purpose solution is to package up all of your parallelism into as small a box as possible and then have a wizard write the code in the box. Translated to filesystems, that's equivalent to saying that as an application developer, writing to files safely is hard enough that it should be done via some kind of library and/or database, not by directly making syscalls.</p> <p>Sqlite is quite good in terms of reliability if you want a good default. However, some people find it to be too heavyweight if all they want is a file-based abstraction. What they really want is a sort of polyfill for the file abstraction that works on top of all filesystems without having to understand the differences between different configurations (and even different versions) of each filesystem. Since that doesn't exist yet, when no existing library is sufficient, you need to checksum your data since you will get silent errors and corruption. The only questions are whether you detect the errors, and whether your record format destroys only a single record when corruption happens or loses the entire database. As far as I can tell, most desktop email client developers have chosen to go the route of destroying all of your email if corruption happens.</p> 
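<p>For a rough idea of what per-record checksumming can look like, here's a hypothetical record format where each record carries its own length and checksum, so a corrupted record can be detected and skipped without throwing away the rest of the file. The names and the FNV-1a stand-in for a real CRC are mine, not from any of the papers, and the fsync/undo-log machinery from earlier in the post is still needed on top of this:</p> <pre><code>/* Hypothetical framing: [u32 length][u32 checksum][payload] per record. */
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

static uint32_t fnv1a(const void *data, size_t n)   /* stand-in for a real CRC32 */
{
    const uint8_t *p = data;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i &lt; n; i++) h = (h ^ p[i]) * 16777619u;
    return h;
}

int write_record(FILE *f, const void *payload, uint32_t len)
{
    uint32_t hdr[2] = { len, fnv1a(payload, len) };
    if (fwrite(hdr, sizeof(hdr), 1, f) != 1) return -1;
    return fwrite(payload, len, 1, f) == 1 ? 0 : -1;
}

/* Returns the payload length, 0 at end of file, or -1 for a record that's
 * corrupt or too large -- the caller can skip it or fall back to a backup
 * instead of discarding the entire file. */
long read_record(FILE *f, void *buf, uint32_t max)
{
    uint32_t hdr[2];
    if (fread(hdr, sizeof(hdr), 1, f) != 1) return 0;
    if (hdr[0] == 0 || hdr[0] &gt; max) return -1;
    if (fread(buf, hdr[0], 1, f) != 1) return -1;
    return fnv1a(buf, hdr[0]) == hdr[1] ? (long)hdr[0] : -1;
}
</code></pre> <p>Whether you skip a bad record, restore it from a replica, or stop and alert is an application decision; the point is just that the blast radius of a flipped bit is one record rather than the whole database.</p> 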
<p>These studies also hammer home the point that <a href="//danluu.com/everything-is-broken/">conventional testing isn't sufficient</a>. There were multiple cases where the authors of a paper wrote a relatively simple tool and found a huge number of bugs. You don't need any deep computer science magic to write the tools. The error propagation checker from the paper that found a ton of bugs in filesystem error handling was 4k LOC. If you read the paper, you'll see that the authors observed that the tool had a very large number of shortcomings because of its simplicity, but despite those shortcomings, it was able to find a lot of real bugs. I wrote a vaguely similar tool at my last job to enforce some invariants, and it was literally two pages of code. It didn't even have a real parser (it just went line-by-line through files and did some regexp matching to detect the simple errors that it's possible to detect with just a state machine and regexes), but it found enough bugs that it paid for itself in development time the first time I ran it.</p> <p>Almost every software project I've seen has a lot of low-hanging testing fruit. Really basic <a href="//danluu.com/testing/">random testing</a>, <a href="//danluu.com/pl-troll/">static analysis</a>, and <a href="https://aphyr.com/tags/jepsen">fault injection</a> can pay for themselves in terms of dev time pretty much the first time you use them.</p> <h3 id="appendix">Appendix</h3> <p>I've probably covered less than 20% of the material in the papers I've referred to here. Here's a bit about some other neat things you can find in those papers, and others.</p> <p><a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf">Pillai et al., OSDI '14</a>: this paper goes into much more detail about what's required for crash consistency than this post does. It also gives a fair amount of detail about how exactly applications fail, including diagrams of traces that indicate what false assumptions are embedded in each trace.</p> <p><a href="http://static.usenix.org/legacy/events/fast12/tech/full_papers/Chidambaram.pdf">Chidambaram et al., FAST '12</a>: the same filesystem primitives are responsible for both consistency and ordering. The authors propose alternative primitives that separate these concerns, allowing better performance while maintaining safety.</p> <p><a href="https://www.researchgate.net/publication/220958003_Coerced_Cache_Eviction_and_Discreet_Mode_Journaling_Dealing_with_Misbehaving_Disks">Rajimwale et al. DSN '11</a>: you probably shouldn't use disks that ignore flush directives, but in case you do, here's a protocol that forces those disks to flush using normal filesystem operations. As you might expect, the performance for this is quite bad.</p> <p><a href="http://research.cs.wisc.edu/wind/Publications/iron-sosp05.pdf">Prabhakaran et al. SOSP '05</a>: This has a lot more detail on filesystem responses to errors than was covered in this post. The authors also discuss JFS, an IBM filesystem for AIX. Although it was designed for high reliability systems, it isn't particularly more reliable than the alternatives. 
Related material is covered further in <a href="http://research.cs.wisc.edu/adsl/Publications/pointer-dsn08.pdf">DSN '08</a>, <a href="http://research.cs.wisc.edu/adsl/Publications/trust-storagess06.pdf">StorageSS '06</a>, <a href="http://research.cs.wisc.edu/adsl/Publications/vmdep-dsn06.pdf">DSN '06</a>, <a href="http://research.cs.wisc.edu/adsl/Publications/parity-fast08.pdf">FAST '08</a>, and <a href="http://research.cs.wisc.edu/adsl/Publications/envyfs-usenix09.pdf">USENIX '09</a>, among others.</p> <p><a href="http://usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al. FAST '08</a> : Again, much more detail than is covered in this post on when errors get dropped, and how they wrote their tools. They also have some call graphs that give you one rough measure of the complexity involved in a filesystem. The XFS call graph is particularly messy, and one of the authors noted in a presentation that an XFS developer said that XFS was fun to work on since they took advantage of every possible optimization opportunity regardless of how messy it made things.</p> <p><a href="http://bnrg.eecs.berkeley.edu/~randy/Courses/CS294.F07/11.1.pdf">Bairavasundaram et al. SIGMETRICS '07</a>: There's a lot of information on disk error locality and disk error probability over time that isn't covered in this post. <a href="http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf">A followup paper in FAST08 has more details</a>.</p> <p><a href="http://usenix.org/legacy/events/osdi08/tech/full_papers/gunawi/gunawi_html/index.html">Gunawi et al. OSDI '08</a>: This paper has a lot more detail about when fsck doesn't work. In a presentation, one of the authors mentioned that fsck is the only program that's ever insulted him. Apparently, if you have a corrupt pointer that points to a superblock, fsck destroys the superblock (possibly rendering the disk unmountable), tells you something like &quot;you dummy, you must have run fsck on a mounted disk&quot;, and then gives up. In the paper, the authors reimplement basically all of fsck using a declarative model, and find that the declarative version is shorter, easier to understand, and much easier to extend, at the cost of being somewhat slower.</p> <p>Memory errors are beyond the scope of this post, but <a href="//danluu.com/why-ecc/">memory corruption</a> can cause disk corruption. This is especially annoying because memory corruption can cause you to take a checksum of bad data and write a bad checksum. It's also possible to corrupt in memory pointers, which often results in something very bad happening. See the <a href="http://research.cs.wisc.edu/adsl/Publications/zfs-corruption-fast10.pdf">Zhang et al. FAST '10 paper</a> for more on how ZFS is affected by that. There's a meme going around that ZFS is safe against memory corruption because it checksums, but that paper found that critical things held in memory aren't checksummed, and that memory errors can cause data corruption in real scenarios.</p> <p>The sqlite devs are serious about both <a href="https://www.sqlite.org/howtocorrupt.html">documentation</a> and <a href="https://www.sqlite.org/testing.html">testing</a>. If I wanted to write a reliable desktop application, I'd start by reading the sqlite docs and then talking to some of the core devs. 
If I wanted to write a reliable distributed application I'd start by getting a job at Google and then reading the design docs and <a href="//danluu.com/postmortem-lessons/">postmortems</a> for GFS, Colossus, Spanner, etc. J/k, but not really.</p> <p>We haven't looked at formal methods at all, but there have been a variety of attempts to formally verify properties of filesystems, such as <a href="https://sibylfs.github.io/">SibylFS</a>.</p> <p>This list isn't intended to be exhaustive. It's just a list of things I've read that I think are interesting.</p> <p><strong>Update</strong>: many people have read this post and suggested that, in the first file example, you should use the much simpler protocol of copying the file to be modified to a temp file, modifying the temp file, and then renaming the temp file to overwrite the original file. In fact, that's probably the most common comment I've gotten on this post. If you think this solves the problem, I'm going to ask you to pause for five seconds and consider the problems this might have.</p> <p>The main problems this has are:</p> <ul> <li>rename isn't atomic on crash. POSIX says that rename is atomic, but this only applies to normal operation, not to crashes.</li> <li>even if the technique worked, the performance is very poor</li> <li>how do you handle hardlinks?</li> <li>metadata can be lost; this can sometimes be preserved, under some filesystems, with ioctls, but now you have filesystem-specific code just for the non-crash case</li> <li>etc.</li> </ul> <p>The fact that so many people thought that this was a simple solution to the problem demonstrates that this problem is one that people are prone to underestimating, even when they're explicitly warned that people tend to underestimate this problem!</p> <p><strong><a href="//danluu.com/filesystem-errors/">This post reproduces some of the results from these papers on modern filesystems as of 2017</a>.</strong></p> <p><strong><a href="//danluu.com/deconstruct-files/">This talk (transcript) contains a number of newer results and discusses hardware issues in more detail</a>.</strong></p> <p><small> Thanks to Leah Hanson, Katerina Barone-Adesi, Jamie Brandon, Kamal Marhubi, Joe Wilder, David Turner, Benjamin Gilbert, Tom Murphy, Chris Ball, Joe Doliner, Alexy Romanov, Mindy Preston, Paul McJones, Evan Jones, and Jason Petersen for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:D">Turns out <a href="https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-ext3.html">some commercially supported distros</a> only support <code>data=ordered</code>. Oh, and when I said <code>data=ordered</code> was the default, that's only the case before 2.6.30. After 2.6.30, there's a config option, <code>CONFIG_EXT3_DEFAULTS_TO_ORDERED</code>. If that's not set, the default becomes <code>data=writeback</code>. <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:A"><p>Cases where overwrite atomicity is required were documented as known issues, and all such cases assumed single-block atomicity and not multi-block atomicity. By contrast, multiple applications (LevelDB, Mercurial, and HSQLDB) had bad data corruption bugs that came from assuming appends are atomic.</p> <p>That seems to be an indirect result of a commonly used update protocol, where modifications are logged via appends, and then logged data is written via overwrites. 
Application developers are careful to check for and handle errors in the actual data, but the errors in the log file are often overlooked.</p> <p>There are a number of other classes of errors discussed, and I recommend reading the paper for the details if you work on an application that writes files.</p> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> </ol> </div> Why use ECC? why-ecc/ Fri, 27 Nov 2015 00:00:00 +0000 why-ecc/ <p>Jeff Atwood, perhaps the most widely read programming blogger, has a post that makes <a rel="nofollow" href="http://blog.codinghorror.com/to-ecc-or-not-to-ecc/">a case against using ECC memory</a>. My read is that his major points are:</p> <ol> <li>Google didn't use ECC when they built their servers in 1999</li> <li>Most RAM errors are hard errors and not soft errors</li> <li>RAM errors are rare because hardware has improved</li> <li>If ECC were actually important, it would be used everywhere and not just servers. Paying for optional stuff like this is &quot;awfully enterprisey&quot;</li> </ol> <p></p> <p>Let's take a look at these arguments one by one:</p> <h2 id="1-google-didn-t-use-ecc-in-1999">1. Google didn't use ECC in 1999</h2> <p>Not too long after Google put these non-ECC machines into production, they realized this was a serious error and not worth the cost savings. If you think cargo culting what Google does is a good idea because it's Google, here are some things you might do:</p> <h4 id="a-put-your-servers-into-shipping-containers">A. Put your servers into shipping containers</h4> <p>Articles are still written today about what a great idea this is, even though this was an experiment at Google that was deemed unsuccessful. Turns out, even Google's experiments don't always succeed. In fact, their propensity for “moonshots” in the early days meant that they had more failed experiments than most companies. Copying their failed experiments isn't a particularly good strategy.</p> <h4 id="b-cause-fires-in-your-own-datacenters">B. Cause fires in your own datacenters</h4> <p>Part of the post talks about how awesome these servers are:</p> <blockquote> <p>Some people might look at these early Google servers and see an amateurish fire hazard. Not me. I see a prescient understanding of how inexpensive commodity hardware would shape today's internet. I felt right at home when I saw this server; it's exactly what I would have done in the same circumstances</p> </blockquote> <p>The last part of that is true. But the first part has a grain of truth, too. When Google started designing their own boards, one generation had a regrowth<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup> issue that caused a non-zero number of fires.</p> <p>BTW, if you click through to Jeff's post and look at the photo that the quote refers to, you'll see that the boards have a lot of flex in them. That caused problems and was fixed in the next generation. You can also observe that the cabling is quite messy, which also caused problems, and was also fixed in the next generation. There were other problems as well. <abbr title="When someone looks in the answer key and says, 'I would've come up with that', that's often plausible when their answer is perfect. 
But when they say that after seeing a specific imperfect answer, it's a bit less plausible that they'd reproduce the exact same mistakes">Jeff's argument here appears to be that, if he were there at the time, he would've seen the exact same opportunities that early Google engineers did, and since Google did this, it must've been the right thing even if it doesn't look like it. But a number of things that make it look like not the right thing actually made it not the right thing.</abbr></p> <h4 id="c-make-servers-that-injure-your-employees">C. Make servers that injure your employees</h4> <p>One generation of Google servers had infamously sharp edges, giving them the reputation of being made of “razor blades and hate”.</p> <h4 id="d-create-weather-in-your-datacenters">D. Create weather in your datacenters</h4> <p>From talking to folks at a lot of large tech companies, it seems that most of them have had a climate control issue resulting in clouds or fog in their datacenters. You might call this a clever plan by Google to reproduce Seattle weather so they can poach MS employees. Alternately, it might be a plan to create literal cloud computing. Or maybe not.</p> <p>Note that these are all things Google tried and then changed. Making mistakes and then fixing them is common in every successful engineering organization. If you're going to cargo cult an engineering practice, you should at least cargo cult current engineering practices, not <a href="//danluu.com/butler-lampson-1999/">something that was done in 1999</a>.</p> <p>When Google used servers without ECC back in 1999, they found a number of symptoms that were ultimately due to memory corruption, including a search index that returned effectively random results to queries. The actual failure mode here is instructive. I often hear that it's ok to ignore ECC on these machines because it's ok to have errors in individual results. But even when you can tolerate occasional errors, ignoring errors means that you're exposing yourself to total corruption, unless you've done a very careful analysis to make sure that a single error can only contaminate a single result. In research that's been done on filesystems, it's been repeatedly shown that despite making valiant attempts at creating systems that are robust against a single error, it's extremely hard to do so and basically every heavily tested filesystem can have a massive failure from a single error (<a href="//danluu.com/file-consistency/">see the output of Andrea and Remzi's research group at Wisconsin if you're curious about this</a>). I'm not knocking filesystem developers here. They're better at that kind of analysis than 99.9% of programmers. It's just that this problem has been repeatedly shown to be hard enough that humans cannot effectively reason about it, and automated tooling for this kind of analysis is still far from a push-button process. In their book on <a href="http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024">warehouse scale computing</a>, Google discusses error correction and detection, and ECC is cited as their slam dunk case for when it's obvious that you should use hardware error correction<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">2</a></sup>.</p> <p>Google has great infrastructure. From what I've heard of the infra at other large tech companies, Google's sounds like the best in the world. But that doesn't mean that you should copy everything they do. 
Even if you look at their good ideas, it doesn't make sense for most companies to copy them. They <a href="//danluu.com/intel-cat/">created a replacement for Linux's work stealing scheduler that uses both hardware run-time information and static traces to allow them to take advantage of new hardware in Intel's server processors that lets you dynamically partition caches between cores</a>. If used across their entire fleet, that could easily save Google more money in a week than stackexchange has spent on machines in their entire history. Does that mean you should copy Google? No, not unless you've already captured all the lower hanging fruit, which includes things like making sure that your core infrastructure is written in highly optimized C++, not Java or (god forbid) Ruby. And the thing is, for the vast majority of companies, writing in a language that imposes a 20x performance penalty is a totally reasonable decision.</p> <h2 id="2-most-ram-errors-are-hard-errors">2. Most RAM errors are hard errors</h2> <p>The case against ECC quotes <a href="http://selse.org//images/selse_2012/Papers/selse2012_submission_4.pdf">this section of a study on DRAM errors</a> (the bolding is Jeff's):</p> <blockquote> <p>Our study has several main findings. First, we find that approximately <strong>70% of DRAM faults are recurring (e.g., permanent) faults, while only 30% are transient faults.</strong> Second, we find that large multi-bit faults, such as faults that affects an entire row, column, or bank, constitute over 40% of all DRAM faults. Third, we find that almost 5% of DRAM failures affect board-level circuitry such as data (DQ) or strobe (DQS) wires. Finally, we find that chipkill functionality reduced the system failure rate from DRAM faults by 36x.</p> </blockquote> <p>This seems to betray a lack of understanding of the implications of this study, as this quote doesn't sound like an argument against ECC; it sounds like an argument for &quot;chipkill&quot;, a particular class of ECC. Putting that aside, Jeff's post points out that hard errors are twice as common as soft errors, and then mentions that they run memtest on their machines when they get them. First, a 2:1 ratio isn't so large that you can just ignore soft errors. Second the post implies that Jeff believes that hard errors are basically immutable and can't surface after some time, which is incorrect. You can think of electronics as wearing out just the same way mechanical devices wear out. The mechanisms are different, but the effects are similar. In fact, if you compare reliability analysis of chips vs. other kinds of reliability analysis, you'll find they often use the same families of distributions to model failures. And, if hard errors were immutable, they would generally get caught in testing by the manufacturer, who can catch errors much more easily than consumers can because they have hooks into circuits that let them test memory much more efficiently than you can do in your server or home computer. Third, Jeff's line of reasoning implies that ECC can't help with detection or correction of hard errors, which is not only incorrect but directly contradicted by the quote.</p> <p>So, how often are you going to run memtest on your machines to try to catch these hard errors, and how much data corruption are you willing to live with? One of the key uses of ECC is not to correct errors, but to signal errors so that hardware can be replaced before silent corruption occurs. 
No one's going to consent to shutting down everything on a machine every day to run memtest (that would be more expensive than just buying ECC memory), and even if you could convince people to do that, it won't catch as many errors as ECC will.</p> <p>When I worked at a company that owned about 1000 machines, we noticed that we were getting strange consistency check failures, and after maybe half a year we realized that the failures were more likely to happen on some machines than others. The failures were quite rare, maybe a couple times a week on average, so it took a substantial amount of time to accumulate the data, and more time for someone to realize what was going on. Without knowing the cause, analyzing the logs to figure out that the errors were caused by single bit flips (with high probability) was also non-trivial. We were lucky that, as a side effect of the process we used, the checksums were calculated in a separate process, on a different machine, at a different time, so that an error couldn't corrupt the result and propagate that corruption into the checksum. If you merely try to protect yourself with in-memory checksums, there's a good chance you'll perform a checksum operation on the already corrupted data and compute a valid checksum of bad data unless you're doing some really fancy stuff with calculations that carry their own checksums (and if you're that serious about error correction, you're probably using ECC regardless). Anyway, after completing the analysis, we found that memtest couldn't detect any problems, but that replacing the RAM on the bad machines caused a one to two order of magnitude reduction in error rate. Most services don't have the kind of checksumming we had; those services will simply silently write corrupt data to persistent storage and never notice problems until a customer complains.</p> <h2 id="3-due-to-advances-in-hardware-manufacturing-errors-are-very-rare">3. Due to advances in hardware manufacturing, errors are very rare</h2> <p>Jeff says</p> <blockquote> <p>I do seriously question whether ECC is as operationally critical as we have been led to believe [for servers], and I think the data shows modern, non-ECC RAM is already extremely reliable ... Modern commodity computer parts from reputable vendors are amazingly reliable. And their trends show from 2012 onward essential PC parts have gotten more reliable, not less. (I can also vouch for the improvement in SSD reliability as we have had zero server SSD failures in 3 years across our 12 servers with 24+ drives ...</p> </blockquote> <p>and quotes a study.</p> <p>The data in the post isn't sufficient to support this assertion. Note that since RAM usage has been increasing and continues to increase at a fast exponential rate, RAM failures would have to decrease at a greater exponential rate to actually reduce the incidence of data corruption. Furthermore, as chips continue to shrink, features get smaller, making the kind of wearout issues discussed in “2” more common. 
For example, at 20nm, a DRAM capacitor might hold something like 50 electrons, and that number will get smaller for next generation DRAM as things continue to shrink.</p> <p>The <a href="http://selse.org//images/selse_2012/Papers/selse2012_submission_4.pdf">2012 study that Atwood quoted</a> has this graph on corrected errors (a subset of all errors) on ten randomly selected failing nodes (6% of nodes had at least one failure):</p> <p><img src="images/why-ecc/one_month_ecc_errors.png"></p> <p>We're talking between 10 and 10k errors for a typical node that has a failure, and that's a cherry-picked study from a post that's arguing that you don't need ECC. Note that the nodes here only have 16GB of RAM, which is an order of magnitude less than modern servers often have, and that this was on an older process node that was less vulnerable to noise than we are now. For anyone who's used to dealing with reliability issues and just wants to know the FIT rate, the study finds a FIT rate of between 0.057 and 0.071 faults per Mbit (which, contra Atwood's assertion, is not a shockingly low number). If you take the most optimistic FIT rate, .057, and do the calculation for a server without much RAM (here, I'm using 128GB, since the servers I see nowadays typically have between 128GB and 1.5TB of RAM), you get an expected value of .057 * 1000 * 1000 * 8760 / 1000000000 = .5 faults per year per server (128GB is roughly a million Mbit, a FIT is one fault per billion device-hours, and there are 8760 hours in a year). Note that this is for faults, not errors. From the graph above, we can see that a fault can easily cause hundreds or thousands of errors per month. Another thing to note is that there are multiple nodes that don't have errors at the start of the study but develop errors later on. So, in fact, the cherry-picked study that Jeff links contradicts Jeff's claim about reliability.</p> <p>Sun/Oracle famously ran into this a number of decades ago. Transistors and DRAM capacitors were getting smaller, much as they are now, and memory usage and caches were growing, much as they are now. Between having smaller transistors that were less resilient to transient upset as well as more difficult to manufacture, and having more on-chip cache, the vast majority of server vendors decided to add ECC to their caches. Sun decided to save a few dollars and skip the ECC. The direct result was that a number of Sun customers reported sporadic data corruption. It took Sun multiple years to spin a new architecture with ECC cache, and Sun made customers sign an NDA to get replacement chips. Of course there's no way to cover up this sort of thing forever, and when it came up, Sun's reputation for producing reliable servers took a permanent hit, much like the time they tried to <a href="anon-benchmark/">cover up poor performance results by introducing a clause into their terms of service disallowing benchmarking</a>.</p> <p>Another thing to note here is that when you're paying for ECC, you're not just paying for ECC, you're paying for parts (CPUs, boards) that have been qual'd more thoroughly. You can easily see this with disk failure rates, and I've seen many people observe this in their own private datasets. In terms of public data, I believe Andrea and Remzi's group had a SIGMETRICS paper a few years back that showed that SATA drives were 4x more likely than SCSI drives to have disk read failures, and 10x more likely to have silent data corruption. This relationship held true even with drives from the same manufacturer. 
There's no particular reason to think that the SCSI interface should be more reliable than the SATA interface, but it's not about the interface. It's about buying a high-reliability server part vs. a consumer part. Maybe you don't care about disk reliability in particular because you checksum everything and can easily detect disk corruption, but there are some kinds of corruption that are harder to detect.</p> <p>[2024 update, almost a decade later]: looking at this retrospectively, we can see that Jeff's assertion that commodity parts are reliable (&quot;modern commodity computer parts from reputable vendors are amazingly reliable&quot;) is still not true. Looking at real-world user data from Firefox, <a href="https://fosstodon.org/@gabrielesvelto/112401643131904845">Gabriele Svelto estimated that approximately 10% to 20% of all Firefox crashes were due to memory corruption</a>. Various game companies that track this kind of thing also report that a significant fraction of user crashes appear to be due to data corruption, although I don't have an estimate from any of those companies handy. A more direct argument is that if you talk to folks at big companies that run a lot of ECC memory and look at the rate of ECC errors, there are quite a few errors detected by ECC memory despite ECC memory typically having a lower error rate than random non-ECC memory. This kind of argument is frequently made (the argument was detailed above a decade ago, and when I looked at this fairly recently while working at Twitter, there had been no revolution in memory technology that reduced the need for ECC relative to the rates discussed in papers a decade ago), but it often doesn't resonate with folks who say things like &quot;well, those bits probably didn't matter anyway&quot;, &quot;most memory ends up not getting read&quot;, etc. Looking at real-world crashes and noting that the amount of silent data corruption should be expected to be much higher than the rate of crashes seems to resonate with people who aren't excited by looking at raw FIT rates in datacenters.</p> <h2 id="4-if-ecc-were-actually-important-it-would-be-used-everywhere-and-not-just-servers">4. If ECC were actually important, it would be used everywhere and not just servers.</h2> <p><a href="cocktail-ideas/">One way to rephrase this is as a kind of cocktail party efficient markets hypothesis. This can't be important, because if it was, we would have it</a>. Of course this is incorrect and there are many things that would be beneficial to consumers that we don't have, such as <a href="car-safety/">cars that are designed to be safe instead of just getting the maximum score in crash tests</a>. Looking at this with respect to the server and consumer markets, this argument can be rephrased as “If this feature were actually important for servers, it would be used in non-servers”, which is incorrect. A primary driver of what's available in servers vs. non-servers is what can be added that buyers of servers will pay a lot for, to allow for price discrimination between server and non-server parts. This is actually one of the more obnoxious problems facing large cloud vendors — hardware vendors are able to jack up the price on parts that have server features because the features are much more valuable in server applications than in desktop applications. 
Most home users don't mind, giving hardware vendors a mechanism to extract more money out of people who buy servers while still providing cheap parts for consumers.</p> <p>Cloud vendors often have enough negotiating leverage to get parts at cost, but that only works where there's more than one viable vendor. Some of the few areas where there aren't any viable competitors include CPUs and GPUs. There have been a number of attempts by other CPU vendors to get into the server market, but each attempt so far has been fatally flawed in a way that made it obvious from an early stage that the attempt was doomed (and these are often 5 year projects, so that's a lot of time to spend on a doomed project). The Qualcomm effort has been getting a lot of hype, but when I talk to folks I know at Qualcomm they all tell me that the current chip is basically for practice, since Qualcomm needed to learn how to build a server chip from all the folks they poached from IBM, and that the next chip is the first chip that has any hope of being competitive. I have high hopes for Qualcomm as well as an ARM effort to build good server parts, but those efforts are still a ways away from bearing fruit.</p> <p>The near total unsuitability of current ARM (and POWER) options (not including hypothetical variants of Apple's impressive ARM chip) for most server workloads in terms of performance per TCO dollar is a bit of a tangent, so I'll leave that for another post, but the point is that Intel has the market power to make people pay extra for server features, and they do so. Additionally, some features are genuinely more important for servers than for mobile devices with a few GB of RAM and a power budget of a few watts that are expected to randomly crash and reboot periodically anyway.</p> <h2 id="conclusion">Conclusion</h2> <p>Should you buy ECC RAM? That depends. For servers, it's probably a good bet considering the cost, although it's hard to really do a cost/benefit analysis because it's really hard to figure out the cost of silent data corruption, or the cost of having some risk of burning half a year of developer time tracking down intermittent failures only to find that they were caused by using non-ECC memory.</p> <p>For normal desktop use, I'm pro-ECC, but if you don't have <a href="https://www.reddit.com/r/programming/comments/adoux/coding_horror_and_blogsstackoverflowcom/">regular backups</a> set up, doing backups probably has a better ROI than ECC. But once you have the absolute basics set up, there's a fairly strong case for ECC for consumer machines. For example, if you have backups without ECC, you can easily write corrupt data into your primary store and replicate that corrupt data into backup. But speaking more generally, big companies running datacenters are probably better set up to detect data corruption and more likely to have error correction at higher levels that allows them to recover from data corruption than consumers, so the case for consumers is arguably stronger than it is for servers, where the case is strong enough that it's generally considered a no-brainer. A major reason consumers don't generally use ECC isn't that it isn't worth it for them; it's that they just have no idea how to attribute crashes and data corruption when they happen. Once you start doing this, as Google and other large companies do, it's immediately obvious that ECC is worth the cost even when you have multiple levels of error correction operating at higher levels. 
This is analogous to <a href="deconstruct-files/">what we see with files</a>, where big tech companies write software for their datacenters that's much better at dealing with data corruption than the consumer software big tech companies write (and this is often true even within the same company). To the user, the cost of having their web app corrupt their data isn't all that different from when their desktop app corrupts their data; the difference is that when their web app corrupts data, it's clearer that it's the company's fault, which changes the incentives for companies.</p> <h3 id="appendix-security">Appendix: security</h3> <p>If you allow any sort of code execution, even sandboxed execution, there are attacks <a href="https://en.wikipedia.org/wiki/Row_hammer">like rowhammer</a> which can allow users to cause data corruption and there have been instances where this has allowed for privilege escalation. ECC doesn't completely mitigate the attack, but it makes it much harder.</p> <p><small> Thanks to Prabhakar Ragde, Tom Murphy, Jay Weisskopf, Leah Hanson, Joe Wilder, and Ralph Corderoy for discussion/comments/corrections. Also, thanks (or maybe anti-thanks) to Leah for convincing me that I should write up this off-the-cuff verbal comment as a blog post. Apologies for any errors, the lack of references, and the stilted prose; this is basically a transcription of half of a conversation and I haven't explained terms, provided references, or checked facts in the level of detail that I normally do. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R"><p>One of the funnier examples of this that I can think of, at least to me, is the magical self-healing fuse. Although there are many implementations, you can think of a fuse on a chip as basically a resistor. If you run some current through it, you should get a connection. If you run a lot of current through it, you'll heat up the resistor and eventually destroy it. This is commonly used to fuse off features on chips, or to do things like set the clock rate, with the idea being that once a fuse is blown, there's no way to unblow the fuse.</p> <p>Once upon a time, there was a semiconductor manufacturer that rushed their manufacturing process a bit and cut the tolerances a bit too fine in one particular process generation. After a few months (or years), the connection between the two ends of the fuse could regrow and cause the fuse to unblow. If you're lucky, the fuse will be something like the high-order bit of the clock multiplier, which will basically brick the chip if changed. If you're not lucky, it will be something that results in silent data corruption.</p> <p>I heard about problems in that particular process generation from that manufacturer from multiple people at different companies, so this wasn't an isolated thing. When I say this is funny, I mean that it's funny when you hear this story at a bar. It's maybe less funny when you discover, after a year of testing, that some of your chips are failing because their fuse settings are nonsensical, and you have to respin your chip and delay the release for 3 months. 
BTW, this fuse regrowth thing is another example of a class of error that can be mitigated with ECC.</p> <p>This is not the issue that Google had; I only mention this because a lot of people I talk to are surprised by the ways in which hardware can fail.</p> <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:P"><p>In case you don't want to dig through the whole book, most of the relevant passage is:</p> <p>In a system that can tolerate a number of failures at the software level, the minimum requirement made to the hardware layer is that its faults are always detected and reported to software in a timely enough manner as to allow the software infrastructure to contain it and take appropriate recovery actions. It is not necessarily required that hardware transparently corrects all faults. This does not mean that hardware for such systems should be designed without error correction capabilities. Whenever error correction functionality can be offered within a reasonable cost or complexity, it often pays to support it. It means that if hardware error correction would be exceedingly expensive, the system would have the option of using a less expensive version that provided detection capabilities only. Modern DRAM systems are a good example of a case in which powerful error correction can be provided at a very low additional cost. Relaxing the requirement that hardware errors be detected, however, would be much more difficult because it means that every software component would be burdened with the need to check its own correct execution. At one early point in its history, Google had to deal with servers that had DRAM lacking even parity checking. Producing a Web search index consists essentially of a very large shuffle/merge sort operation, using several machines over a long period. In 2000, one of the then monthly updates to Google's Web index failed prerelease checks when a subset of tested queries was found to return seemingly random documents. After some investigation a pattern was found in the new index files that corresponded to a bit being stuck at zero at a consistent place in the data structures; a bad side effect of streaming a lot of data through a faulty DRAM chip. Consistency checks were added to the index data structures to minimize the likelihood of this problem recurring, and no further problems of this nature were reported. Note, however, that this workaround did not guarantee 100% error detection in the indexing pass because not all memory positions were being checked—instructions, for example, were not. It worked because index data structures were so much larger than all other data involved in the computation, that having those self-checking data structures made it very likely that machines with defective DRAM would be identified and excluded from the cluster. The following machine generation at Google did include memory parity detection, and once the price of memory with ECC dropped to competitive levels, all subsequent generations have used ECC DRAM.</p> <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> </ol> </div> What's worked in Computer Science: 1999 v. 2015 butler-lampson-1999/ Mon, 23 Nov 2015 00:00:00 +0000 butler-lampson-1999/ <p>In 1999, Butler Lampson gave a talk about the <a href="http://research.microsoft.com/pubs/68591/computersystemsresearch.pdf">past and future of “computer systems research”</a>. 
Here are his opinions from 1999 on &quot;what worked&quot;.</p> <table> <thead> <tr> <th>Yes</th> <th>Maybe</th> <th>No</th> </tr> </thead> <tbody> <tr> <td>Virtual memory</td> <td>Parallelism</td> <td>Capabilities</td> </tr> <tr> <td>Address spaces</td> <td>RISC</td> <td>Fancy type systems</td> </tr> <tr> <td>Packet nets</td> <td>Garbage collection</td> <td>Functional programming</td> </tr> <tr> <td>Objects / subtypes</td> <td>Reuse</td> <td>Formal methods</td> </tr> <tr> <td>RDB and SQL</td> <td></td> <td>Software engineering</td> </tr> <tr> <td>Transactions</td> <td></td> <td>RPC</td> </tr> <tr> <td>Bitmaps and GUIs</td> <td></td> <td>Distributed computing</td> </tr> <tr> <td>Web</td> <td></td> <td>Security</td> </tr> <tr> <td>Algorithms</td> <td></td> <td></td> </tr> </tbody> </table> <p></p> <p><br> Basically everything that was a Yes in 1999 is still important today. Looking at the Maybe category, we have:</p> <h3 id="parallelism">Parallelism</h3> <p>This is, unfortunately, still a Maybe. Between the <a href="https://en.wikipedia.org/wiki/Dennard_scaling">end of Dennard scaling</a> and the continued demand for compute, chips now expose plenty of parallelism to the programmer. Concurrency has gotten much easier to deal with, but really extracting anything close to the full performance available isn't much easier than it was in 1999.</p> <p>In 2009, <a href="https://channel9.msdn.com/Shows/Going+Deep/E2E-Erik-Meijer-and-Butler-Lampson-Abstraction-Security-Embodiment">Erik Meijer and Butler Lampson talked about this</a>, and Lampson's comment was that when they came up with threading, locks, and condition variables at PARC, they thought they were creating something that programmers could use to take advantage of parallelism, but that they now have decades of evidence that they were wrong. Lampson further remarks that to do parallel programming, what you need to do is put all your parallelism into a little box and then have a wizard go write the code in that box. Not much has changed since 2009.</p> <p>Also, note that I'm using the same criteria to judge all of these. Whenever you say something doesn't work, someone will drop in and say that, no wait, here's a PhD that demonstrates that someone has once done this thing, or here are nine programs that demonstrate that Idris is, in fact, widely used in large scale production systems. I take Lampson's view, which is that if the vast majority of programmers are literally incapable of using a certain class of technologies, that class of technologies has probably not succeeded.</p> <p>On the parallelism front, Intel <a href="//danluu.com/intel-cat/">recently added features that make it easier to take advantage of trivial parallelism by co-scheduling multiple applications on the same machine without interference</a>, but outside of a couple big companies, no one's really taking advantage of this yet. They also recently added hardware support for transactional memory, but it's still not clear how much it helps with usability when designing large scale systems.</p> <h3 id="risc-danluu-com-risc-definition"><a href="//danluu.com/risc-definition/">RISC</a></h3> <p>If this was a Maybe in 1999, it's certainly a No now. In the 80s and 90s a lot of folks, probably the majority of folks, believed RISC was going to take over the world and x86 was doomed. In 1991, Apple, IBM, and Motorola got together to create PowerPC (PPC) chips that were going to demolish Intel in the consumer market. 
They opened the Somerset facility for chip design, and collected a lot of their best folks for what was going to be a world-changing effort. At the upper end of the market, DEC's Alpha chips were getting twice the performance of Intel's, and their threat to the workstation market was serious enough that Microsoft ported Windows NT to the Alpha. DEC started a project to do dynamic translation from x86 to Alpha; at the time the project started, the projected performance of x86 basically running in emulation on Alpha was substantially better than native x86 on Intel chips.</p> <p>In 1995, Intel released the Pentium Pro. At the time, it had better workstation integer performance than anything else out there, including much more expensive chips targeted at workstations, and its floating point performance was within a factor of 2 of high-end chips. That immediately destroyed the viability of the mainstream Apple/IBM/Moto PPC chips, and in 1998 IBM pulled out of the Somerset venture<sup class="footnote-ref" id="fnref:I"><a rel="footnote" href="#fn:I">1</a></sup> and everyone gave up on really trying to produce desktop class PPC chips. Apple continued to sell PPC chips for a while, but they had to cook up bogus benchmarks to make the chips look even remotely competitive. By the time DEC finished their dynamic translation efforts, x86 in translation was barely faster than native x86 in floating point code, and substantially slower in integer code. While that was a very impressive technical feat, it wasn't enough to convince people to switch from x86 to Alpha, which killed DEC's attempts to move into the low-end workstation and high-end PC market.</p> <p>In 1999, high-end workstations were still mostly RISC machines, and supercomputers were a mix of custom chips, RISC chips, and x86 chips. Today, Intel dominates the workstation market with x86, and the supercomputer market has also moved towards x86. Other than POWER, RISC ISAs were mostly wiped out (like PA-RISC) or managed to survive by moving to the low-margin embedded market (like MIPS), which wasn't profitable enough for Intel to pursue with any vigor. You can see a kind of instruction set arbitrage that MIPS and ARM have been able to take advantage of because of this. Cavium and ARM will sell you a network card that offloads a lot of processing to the NIC, with a bunch of cheap MIPS or ARM processors, respectively, on board. The low-end processors aren't inherently better at processing packets than Intel CPUs; they're just priced low enough that Intel won't compete on price because they don't want to cannibalize their higher margin chips with sales of lower margin chips. MIPS and ARM have no such concerns because MIPS flunked out of the high-end processor market and ARM has yet to get there. If the best thing you can say about RISC chips is that they manage to exist in areas where the profit margins are too low for Intel to care, that's not exactly great evidence of a RISC victory. That Intel ceded the low end of the market might seem ironic considering Intel's origins, but they've always been aggressive about moving upmarket (they did the same thing when they transitioned from DRAM to SRAM to flash, ceding the barely profitable DRAM market to their competitors).</p> <p>If there's any threat to x86, it's ARM, and it's their business model that's a threat, not their ISA. 
And as for their ISA, ARM's biggest inroads into mobile and personal computing came with ARMv7 and earlier ISAs, which aren't really more RISC-like than x86<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">2</a></sup>. In the areas in which they dominate, their &quot;modern&quot; RISC-y ISA, ARMv8, is hopeless and will continue to be hopeless for years, and they'll continue to dominate with their non-RISC ISAs.</p> <p>In retrospect, the reason RISC chips looked so good in the 80s was that you could fit a complete high-performance RISC microprocessor onto a single chip, which wasn't true of x86 chips at the time. But as we got more transistors, this mattered less.</p> <p>It's possible to nitpick RISC being a no by saying that modern processors translate x86 ops into RISC micro-ops internally, but if you listened to talk at the time, people thought that having an external RISC ISA would be so much lower overhead that RISC would win, which has clearly not happened. Moreover, modern chips also do micro-op fusion in order to fuse operations into decidedly un-RISC-y operations. A clean RISC ISA is a beautiful thing. I sometimes re-read Dick Sites's <a href="http://www.hpl.hp.com/hpjournal/dtj/vol4num4/vol4num4art1.pdf">explanation of the Alpha design</a> just to admire it, but it turns out beauty isn't really critical for the commercial success of an ISA.</p> <h3 id="garbage-collection">Garbage collection</h3> <p>This is a huge Yes now. Every language that's become successful since 1999 has GC and is designed for all normal users to use it to manage all memory. In five years, Rust or D might make that last sentence untrue, but even if that happens, GC will still be in the Yes category.</p> <h3 id="reuse">Reuse</h3> <p>Yes, I think, although I'm not 100% sure what Lampson was referring to here. Lampson said that reuse was a maybe because it sometimes works (for UNIX filters, OS, DB, browser) but was also flaky (for OLE/COM). There are now widely used substitutes for OLE; service oriented architectures also seem to fit his definition of reuse.</p> <p>Looking at the No category, we have:</p> <h3 id="capabilities">Capabilities</h3> <p>Yes. Widely used on mobile operating systems.</p> <h3 id="fancy-type-systems">Fancy type systems</h3> <p>It depends on what qualifies as a fancy type system, but if “fancy” means something at least as fancy as Scala or Haskell, this is a No. That's even true if you relax the standard to an ML-like type system. Boy, would I love to be able to do everyday programming in an ML (F# seems particularly nice to me), but we're pretty far from that.</p> <p>In 1999, C and C++ were mainstream, along with maybe Visual Basic and Pascal, with Java on the rise. And maybe Perl, but at the time most people thought of it as a scripting language, not something you'd use for &quot;real&quot; development. PHP, Python, Ruby, and JavaScript all existed, but were mostly used in small niches. Back then, Tcl was one of the most widely used scripting languages, and it wasn't exactly widely used. Now, PHP, Python, Ruby, and JavaScript are not only more mainstream than Tcl, but more mainstream than C and C++. C# is probably the only other language in the same league as those languages in terms of popularity, and Go looks like the only language that's growing fast enough to catch up in the foreseeable future. 
Since 1999, we have a bunch of dynamic languages, and a few languages with type systems that are specifically designed not to be fancy.</p> <p>Maybe I'll get to use F# for non-hobby projects in another 16 years, but things don't look promising.</p> <h3 id="functional-programming">Functional programming</h3> <p>I'd lean towards Maybe on this one, although this is arguably a No. Functional languages are still quite niche, but functional programming ideas are now mainstream, at least for the HN/reddit/twitter crowd.</p> <p>You might say that I'm being too generous to functional programming here because I have a soft spot for immutability. That's fair. In 1982, <a href="http://research.microsoft.com/en-us/um/people/simonpj/Papers/other-authors/morris-real-programming.pdf">James Morris wrote</a>:</p> <blockquote> <p>Functional languages are unnatural to use; but so are knives and forks, diplomatic protocols, double-entry bookkeeping, and a host of other things modern civilization has found useful. Any discipline is unnatural, in that it takes a while to master, and can break down in extreme situations. That is no reason to reject a particular discipline. The important question is whether functional programming is unnatural the way Haiku is unnatural or the way Karate is unnatural.</p> <p>Haiku is a rigid form of poetry in which each poem must have precisely three lines and seventeen syllables. As with poetry, writing a purely functional program often gives one a feeling of great aesthetic pleasure. It is often very enlightening to read or write such a program. These are undoubted benefits, but real programmers are more results-oriented and are not interested in laboring over a program that already works.</p> <p>They will not accept a language discipline unless it can be used to write programs to solve problems the first time -- just as Karate is occasionally used to deal with real problems as they present themselves. A person who has learned the discipline of Karate finds it directly applicable even in bar-room brawls where no one else knows Karate. Can the same be said of the functional programmer in today's computing environments? No.</p> </blockquote> <p>Many people would make the same case today. I don't agree, but that's a matter of opinion, not a matter of fact.</p> <h3 id="formal-methods">Formal methods</h3> <p>Maybe? Formal methods have had high impact in a few areas. Model checking is omnipresent in chip design. Microsoft's <a href="http://research.microsoft.com/pubs/70038/tr-2004-08.pdf">driver verification tool</a> has probably had more impact than all formal chip design tools combined, clang now has a fair amount of static analysis built in, and so on and so forth. But, formal methods are still quite niche, and the vast majority of developers don't apply formal methods.</p> <h3 id="software-engineering">Software engineering</h3> <p>No. In 1995, David Parnas gave a talk at ICSE (the premier software engineering conference) about the fact that even the ICSE papers that won their “most influential paper award” (including two of Parnas's papers) had <a href="//danluu.com/empirical-pl/">very little impact on industry</a>.</p> <p>Basically all of Parnas's criticisms are still true today. One of his suggestions, that there should be distinct conferences for researchers and for practitioners, has been taken up, but there's not much cross-pollination between academic conferences like ICSE and FSE and practitioner-focused conferences like StrangeLoop and PyCon.</p> <h3 id="rpc">RPC</h3> <p>Yes. 
In fact, RPCs are now so widely used that I've seen multiple &quot;RPCs considered harmful&quot; talks.</p> <h3 id="distributed-systems">Distributed systems</h3> <p>Yes. These are so ubiquitous that startups with zero distributed systems expertise regularly use distributed systems provided by Amazon or Microsoft, and it's totally fine. The systems aren't perfect and there are some infamous downtime incidents, but if you compare the bit error rate of random storage from 1999 to something like EBS or Azure Blob Storage, distributed systems don't look so bad.</p> <h3 id="security">Security</h3> <p>Maybe? As with formal methods, a handful of projects with very high real world impact get a lot of mileage out of security research. But security still isn't a first class concern for most programmers.</p> <h3 id="conclusion">Conclusion</h3> <p>What's worked in computer systems research?</p> <table> <thead> <tr> <th>Topic</th> <th>1999</th> <th>2015</th> </tr> </thead> <tbody> <tr> <td>Virtual memory</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Address spaces</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Packet nets</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Objects / subtypes</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>RDB and SQL</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Transactions</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Bitmaps and GUIs</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Web</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Algorithms</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Parallelism</td> <td>Maybe</td> <td>Maybe</td> </tr> <tr> <td>RISC</td> <td>Maybe</td> <td>No</td> </tr> <tr> <td>Garbage collection</td> <td>Maybe</td> <td>Yes</td> </tr> <tr> <td>Reuse</td> <td>Maybe</td> <td>Yes</td> </tr> <tr> <td>Capabilities</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Fancy type systems</td> <td>No</td> <td>No</td> </tr> <tr> <td>Functional programming</td> <td>No</td> <td>Maybe</td> </tr> <tr> <td>Formal methods</td> <td>No</td> <td>Maybe</td> </tr> <tr> <td>Software engineering</td> <td>No</td> <td>No</td> </tr> <tr> <td>RPC</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Distributed computing</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Security</td> <td>No</td> <td>Maybe</td> </tr> </tbody> </table> <p><br> Not only is every Yes from 1999 still Yes today, eight of the Maybes and Nos were upgraded, and only one was downgraded. And on top of that, there are a lot of topics like neural networks that weren't even worth adding to the list as a No but are an unambiguous Yes today.</p> <p>In 1999, I was taking the SATs and applying to colleges. Today, I'm not really all that far into my career, and the landscape has changed substantially; many previously impractical academic topics are now widely used in industry. I probably have twice again as much time until the end of my career and <a href="//danluu.com/infinite-disk/">things are changing faster now than they were in 1999</a>. After reviewing Lampson's 1999 talk, I'm much more optimistic about research areas that haven't yielded much real-world impact (yet), like capability based computing and fancy type systems. It seems basically impossible to predict what areas will become valuable over the next thirty years.</p> <h3 id="correction">Correction</h3> <p>This post originally had Capabilities as a No in 2015. 
In retrospect, I think that was a mistake and it should have been a Yes due to use on mobile.</p> <p><small> Thanks to Seth Holloway, Leah Hanson, Ian Whitlock, Lindsey Kuper, Chris Ball, Steven McCarthy, Joe Wilder, David Wragg, Sophia Wisdom, and Alex Clemmer for comments/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:I">I know a fair number of folks who were relocated to Somerset from the east coast by IBM, because they later ended up working at a company I worked at. It's interesting to me that software companies don't have the same kind of power over employees, and can't just insist that employees move to a new facility they're creating in some arbitrary location. <a class="footnote-return" href="#fnref:I"><sup>[return]</sup></a></li> <li id="fn:A">I once worked for a company that implemented both x86 and ARM decoders (I'm guessing it was the first company to do so for desktop class chips), and we found that our ARM decoder was physically larger and more complex than our x86 decoder. From talking to other people who've also implemented both ARM and x86 frontends, this doesn't seem to be unusual for high performance implementations. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> </ol> </div> Infinite disk infinite-disk/ Sun, 01 Nov 2015 00:00:00 +0000 infinite-disk/ <p>Hardware performance “obviously” affects software performance and affects how software is optimized. For example, the fact that caches are multiple orders of magnitude faster than RAM means that <a href="http://suif.stanford.edu/papers/lam-asplos91.pdf">blocked array accesses</a> give better performance than repeatedly striding through an array.</p> <p>Something that's occasionally overlooked is that hardware performance also has profound implications for system design and architecture. <a href="https://gist.github.com/jboner/2841832">Let's look at this table of latencies that's been passed around since 2012</a>:</p> <p></p> <pre><code>Operation                               Latency (ns)       (ms)
L1 cache reference                               0.5 ns
Branch mispredict                                  5 ns
L2 cache reference                                 7 ns
Mutex lock/unlock                                 25 ns
Main memory reference                            100 ns
Compress 1K bytes with Zippy                   3,000 ns
Send 1K bytes over 1 Gbps network             10,000 ns    0.01 ms
Read 4K randomly from SSD                    150,000 ns    0.15 ms
Read 1 MB sequentially from memory           250,000 ns    0.25 ms
Round trip within same datacenter            500,000 ns    0.5  ms
Read 1 MB sequentially from SSD            1,000,000 ns    1    ms
Disk seek                                 10,000,000 ns   10    ms
Read 1 MB sequentially from disk          20,000,000 ns   20    ms
Send packet CA-&gt;Netherlands-&gt;CA    150,000,000 ns  150    ms
</code></pre> <p>Consider the latency of a disk seek (10ms) vs. the latency of a round-trip within the same datacenter (.5ms). The round-trip latency is so much lower than the seek time of a disk that we can disaggregate storage and distribute it anywhere in the datacenter without noticeable performance degradation, giving applications the appearance of having infinite disk space without any appreciable change in performance. This fact was behind the rise of distributed filesystems like GFS within the datacenter over the past two decades, and various network attached storage schemes long before.</p> <p>However, doing the same thing on a 2012-era commodity network with SSDs doesn't work. The time to read a page on an SSD is 150us, vs. a 500us round-trip time on the network. That's still a noticeable performance improvement over spinning metal disk, but it's over 4x slower than local SSD.</p> <p>But here we are in 2015. Things have changed. 
Disks have gotten substantially faster. Enterprise NVRAM drives can do a 4k random read in around 15us, an order of magnitude faster than 2012 SSDs. Networks have improved even more. It's now relatively common to employ a low-latency user-mode networking stack, which drives round-trip latencies for a 4k transfer down to 10s of microseconds. That's fast enough to disaggregate SSD and give applications access to infinite SSD. It's not quite fast enough to disaggregate high-end NVRAM, but <a href="https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet">RDMA</a> can handle that.</p> <p><img src="images/infinite-disk/rdma_vs_tcp.png" width="651" height="147"></p> <p>RDMA drives latencies down another order of magnitude, putting network latencies below NVRAM access latencies by enough that we can disaggregate NVRAM. Note that these numbers are for an unloaded network with no congestion -- these numbers will get substantially worse under load, but they're illustrative of what's possible. This isn't exactly new technology: HPC folks have been using RDMA over InfiniBand for years, but InfiniBand networks are expensive enough that they haven't seen a lot of uptake in datacenters. Something that's new in the past few years is the ability to run RDMA over Ethernet. This turns out to be non-trivial; both <a href="http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p523.pdf">Microsoft</a> and <a href="http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p537.pdf">Google</a> have papers in this year's SIGCOMM on how to do this without running into the numerous problems that occur when trying to scale this beyond a couple nodes. But it's possible, and we're approaching the point where companies that aren't ridiculously large are going to be able to deploy this technology at scale<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup>.</p> <p>However, while it's easy to say that we should use disaggregated disk because the ratio of network latency to disk latency has changed, it's not as easy as just taking any old system and throwing it on a fast network. If we take a 2005-era distributed filesystem or distributed database and throw it on top of a fast network, it won't really take advantage of the network. That 2005 system is going to have assumptions like the idea that it's fine for an operation to take 500ns, because how much can 500ns matter? But it matters a lot when your round-trip network latency is only a few times more than that and applications written in a higher-latency era are often full of &quot;careless&quot; operations that burn hundreds of nanoseconds at a time. Worse yet, designs that are optimal at higher latencies create overhead as latency decreases. For example, with 1ms latency, adding local caching is a huge win and 2005-era high-performance distributed applications will often rely heavily on local caching. But when latency drops below 1us, the caching that was a huge win in 2005 is often not just pointless, but actually counter-productive overhead.</p> <p>Latency hasn't just gone down in the datacenter. Today, I get about 2ms to 3ms latency to YouTube. YouTube, Netflix, and a lot of other services put a very large number of boxes close to consumers to provide high-bandwidth low-latency connections. A side effect of this is that any company that owns one of these services has the capability of providing consumers with infinite disk that's only slightly slower than normal disk. 
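</p> <p>To make the latency comparison concrete, here's a small sketch that just adds a network round trip to each local access time and reports the slowdown. The 2012 figures come from the table above; the 2015 round-trip figures (tens of microseconds for user-mode networking, single-digit microseconds for RDMA) and the 2-3ms consumer edge latency are the rough numbers quoted in the text, so treat them as illustrative rather than measured:</p> <pre><code># All latencies in microseconds; approximate figures from the discussion above.
cases = [
    ("2012 disk seek, datacenter RTT",          10000,  500),
    ("2012 SSD 4KB read, datacenter RTT",         150,  500),
    ("2015 SSD 4KB read, user-mode networking",   150,   20),   # assumed ~20us RTT
    ("2015 NVRAM 4KB read, RDMA",                  15,    5),   # assumed ~5us RTT
    ("consumer disk via a 2-3ms edge RTT",      10000, 2500),
]
for name, local_us, rtt_us in cases:
    remote_us = local_us + rtt_us
    print(f"{name}: local {local_us}us, remote {remote_us}us, {remote_us / local_us:.1f}x slower")
</code></pre> <p>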
There are a variety of reasons this hasn't happened yet, but it's basically inevitable that this will eventually happen. If you look at what major cloud providers are paying for storage, their <a href="https://en.wikipedia.org/wiki/Cost_of_goods_sold">COGS</a> of providing safely replicated storage is or will become lower than the retail cost to me of un-backed-up unreplicated local disk on my home machine.</p> <p>It might seem odd that cloud storage can be cheaper than local storage, but large cloud vendors have a lot of leverage. The price for the median component they buy that isn't an Intel CPU or an Nvidia GPU is staggeringly low compared to the retail price. Furthermore, most people don't access the vast majority of their files most of the time. If you look at the throughput of large HDs nowadays, it's not even really possible to do so quickly. A <a href="http://hdd.userbenchmark.com/Toshiba-DT01ACA300-3TB/Rating/2735">typical consumer 3TB HD</a> has an average throughput of 155MB/s, making the time to read the entire drive 3e12 / 155e6 seconds = 1.9e4 seconds = 5 hours and 22 minutes (sketched in code below). And people don't even access their disks at all most of the time! And when they do, their access patterns result in much lower throughput than you get when reading the entire disk linearly. This means that the vast majority of disaggregated storage can live in cheap cold storage. For a neat example of this, <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-balakrishnan.pdf">the Balakrishnan et al. Pelican OSDI 2014 paper</a> demonstrates that if you build out cold storage racks such that only 8% of the disk can be accessed at any given time, you can get a substantial cost savings. A tiny fraction of storage will have to live at the edge, for the same reason that a tiny fraction of YouTube videos are cached at the edge. In some sense, the economics are worse than for YouTube, since any particular chunk of data is much less likely to be shared, but at the rate that edge compute/storage is scaling up, that's unlikely to be a serious objection in a decade.</p> <p>The most common counterargument to disaggregated disk, both inside and outside of the datacenter, is bandwidth costs. But bandwidth costs have been declining exponentially for decades and continue to do so. Since 1995, we've seen datacenter NIC speeds go from 10Mb to 40Gb, with 50Gb and 100Gb just around the corner. This increase has been so rapid that, outside of huge companies, almost no one has re-architected their applications to properly take advantage of the available bandwidth. Most applications can't saturate a 10Gb NIC, let alone a 40Gb NIC. There's literally more bandwidth than people know what to do with. The situation outside the datacenter hasn't evolved quite as quickly, but even so, I'm paying $60/month for 100Mb, and if the trend of the last two decades continues, we should see another 50x increase in bandwidth per dollar over the next decade. It's not clear if the cost structure makes cloud-provided disaggregated disk for consumers viable today, but the current trends of implacably decreasing bandwidth cost mean that it's inevitable within the next five years.</p> <p>One thing to be careful about is that <a href="http://www.researchgate.net/profile/David_Rogers2/publication/265543548_Aggregation_and_disaggregation_techniques_and_methodology_in_optimization/links/54357d8e0cf2643ab986792e.pdf">just because we can disaggregate something, it doesn't mean that we should</a>. 
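</p> <p>Here's the drive-throughput arithmetic from above as a quick sketch. It computes a best case, since real access patterns get much less than full sequential throughput:</p> <pre><code># Time to read an entire 3TB consumer drive at its ~155MB/s average sequential rate.
capacity_bytes = 3e12
throughput_bytes_per_sec = 155e6
seconds = capacity_bytes / throughput_bytes_per_sec
print(f"{seconds:.0f} seconds, about {seconds / 3600:.1f} hours")  # roughly 19355 s, about 5.4 hours
</code></pre> <p>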
There was a <a href="http://web.eecs.umich.edu/~twenisch/papers/hpca12-disagg.pdf">fascinating paper by Lim et al. at HPCA 2012 on disaggregated RAM</a> where they built out disaggregated RAM by connecting RAM through the backplane. While we have the technology to do this, which has the dual advantages of allowing us to provision RAM at a lower per-unit cost and also getting better utilization out of provisioned RAM, this doesn't seem to provide a performance per dollar savings at an acceptable level of performance, at least so far<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">2</a></sup>.</p> <p>The change in relative performance of different components causes fundamental changes in how applications should be designed. It's not sufficient to just profile our applications and eliminate the hot spots. To get good performance (or good performance per dollar), we sometimes have to step back, re-examine our assumptions, and rewrite our systems. There's a lot of talk about how hardware improvements are slowing down, which usually refers to improvements in CPU performance. That's true, but there are plenty of other areas that are undergoing rapid change, which means that applications that care about either performance or cost efficiency need to change. GPUs, hardware accelerators, storage, and networking are all evolving more rapidly than ever.</p> <h4 id="update">Update</h4> <p>Microsoft seems to disagree with me on this one. OneDrive has been moving in the opposite direction. They got rid of infinite disk, lowered quotas for non-infinite storage tiers, and changed their sync model in a way that makes this less natural. I spent maybe an hour writing this post. They probably have a team of Harvard MBAs who've spent 100x that much time discussing the move away from infinite disk. I wonder what I'm missing here. Average utilization was 5GB per user, which is practically free. A few users had a lot of data, but if someone uploads, say, 100TB, you can put most of that on tape. Access times on tape are glacial -- seconds for the arm to get the cartridge and put it in the right place, and tens of seconds to seek to the right place on the tape. But someone who uploads 100TB is basically using it as archival storage anyway, and you can mask most of that latency for the most common use cases (uploading libraries of movies or other media). If the first part of the file doesn't live on tape, and the user starts playing a movie that lives on tape, the movie can easily play for a couple minutes off of warmer storage while the tape access gets queued up. You might say that it's not worth it to spend the time it would take to build a system like that (perhaps two engineers working for six months), but you're already going to want a system that can mask the latency to disk-based cold storage for large files. Adding another tier on top of that isn't much additional work.</p> <h4 id="update-2">Update 2</h4> <p>It's happening. In April 2016, Dropbox announced that they're offering &quot;Dropbox Infinite&quot;, which lets you access your entire Dropbox regardless of the amount of local disk you have available. The inevitable trend happened, although I'm a bit surprised that it wasn't Google that did it first since they have better edge infrastructure and almost certainly pay less for storage. 
In retrospect, maybe that's not surprising, though -- Google, Microsoft, and Amazon all treat providing user-friendly storage as a second class citizen, while Dropbox is all-in on user friendliness.</p> <p><small> Thanks to Leah Hanson, bbrazil, Kamal Marhubi, mjn, Steve Reinhardt, Joe Wilder, and Jesse Luehrs for comments/corrections/additions that resulted in edits to this. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R">If you notice that when you try to reproduce the Google result, you get instability, you're not alone. The paper leaves out the special sauce required to reproduce the result. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:S"><p>If your goal is to get better utilization, the poor man's solution today is to give applications access to unused RAM via RDMA on a best effort basis, in a way that's vaguely kinda sorta analogous to <a href="//danluu.com/intel-cat/">Google's Heracles work</a>. You might say, wait a second: you could make that same argument for disk, but in fact the cheapest way to build out disk is to build out very dense storage blades full of disks, not to just use RDMA to access the normal disks attached to standard server blades; why shouldn't that be true for RAM? For an example of what it looks like when disks, I/O, and RAM are underprovisioned compared to CPUs, <a href="http://www.wired.com/2012/10/data-center-servers/">see this article where a Mozilla employee claims that it's fine to have 6% CPU utilization because those machines are busy doing I/O</a>. Sure, it's fine, if you don't mind paying for CPUs you're not using instead of building out blades that have the correct ratio of CPUs to storage, but those idle CPUs aren't free.</p> <p>If the ratio of RAM to CPU we needed were analogous to the ratio of disk to CPU that we need, it might be cheaper to disaggregate RAM. But, while <a href="http://web.eecs.umich.edu/~twenisch/papers/isca09-disaggregate.pdf">the need for RAM is growing faster than the need for compute</a>, we're still not yet at the point where datacenters have a large number of cores sitting idle due to lack of RAM, the same way we would have cores sitting idle due to lack of disk if we used standard server blades for storage. A Xeon-EX can handle 1.5TB of RAM per socket. It's common to put two sockets in a 1/2U blade nowadays, and for the vast majority of workloads, it would be pretty unusual to try to cram more than 6TB of RAM into the 4 sockets you can comfortably fit into 1U.</p> <p>That being said, the issue of disaggregated RAM is still an open question, and <a href="https://www.usenix.org/conference/srecon15europe/program/presentation/hoffman">some folks</a> are a lot more confident about its near-term viability <a href="http://arxiv.org/abs/1503.01416">than others</a>.</p> <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> </ol> </div> Why Intel added cache partitioning intel-cat/ Sun, 04 Oct 2015 00:00:00 +0000 intel-cat/ <p><a href="http://csl.stanford.edu/~christos/publications/2015.heracles.isca.pdf">Typical server utilization is between 10% and 50%. Google has demonstrated 90% utilization <em>without impacting latency SLAs</em></a>. <a href="https://what-if.xkcd.com/63/">Xkcd estimated that Google owns 2 million machines</a>. If you estimate an amortized total cost of <a href="http://www.morganclaypool.com/doi/pdf/10.2200/S00516ED2V01Y201306CAC024">$4k per machine per year</a>, that's $8 billion per year. 
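</p> <p>As a quick sanity check on that estimate, here's the arithmetic in code, using the 2 million machine guess and the $4k/machine/year amortized cost cited above, plus the 10%-50% typical utilization figure quoted above. The idle-capacity numbers at the end are a rough extension of the argument and ignore the fact that some costs scale with load rather than with machine count:</p> <pre><code>machines = 2_000_000           # xkcd's estimate for Google
cost_per_machine_year = 4_000  # amortized dollars per machine per year
fleet_cost = machines * cost_per_machine_year
print(f"fleet cost: ${fleet_cost / 1e9:.0f}B per year")  # $8B per year

# If 90% utilization is achievable, a fleet running at 10-50% is doing work
# that could, to first order, be done by a much smaller fleet.
for utilization in (0.10, 0.50):
    idle_fraction = 1 - utilization / 0.90
    print(f"at {utilization:.0%} utilization, roughly ${fleet_cost * idle_fraction / 1e9:.1f}B/year of machine cost is spare capacity relative to a 90% target")
</code></pre> <p>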
With numbers like that, even small improvements have a large impact, and this isn't a small improvement.</p> <p></p> <p>How is it possible to get 2x to 9x better utilization on the same hardware? The low end of those typical utilization numbers comes from having a service with variable demand and fixed machine allocations. Say you have 100 machines dedicated to Jenkins. Those machines might be very busy when devs are active, but they might also have 2% utilization at 3am. Dynamic allocation (switching the machines to other work when they're not needed) can get a typical latency-sensitive service up to somewhere in the 30%-70% range. To do better than that across a wide variety of latency-sensitive workloads with tight SLAs, we need some way to schedule low priority work on the same machines, without affecting the latency of the high priority work.</p> <p>It's not obvious that this is possible. If both high and low priority workloads need to monopolize some shared resources like the last-level cache (LLC), memory bandwidth, disk bandwidth, or network bandwidth, then we're out of luck. With the exception of some specialized services, it's rare to max out disk or network. But what about caches and memory? It turns out that <a href="http://www.industry-academia.org/download/ASPLOS12_Clearing_the_Clouds.pdf">Ferdman et al.</a> looked at this back in 2012 and found that typical server workloads don't benefit from having more than 4MB - 6MB of LLC, despite modern server chips having much larger caches.</p> <p><img src="images/intel-cat/server_llc.png" width="321" height="208"></p> <p>For this graph, scale-out workloads are things like distributed key-value stores, MapReduce-like computations, web search, web serving, etc. SPECint(mcf) is a traditional workstation benchmark. “server” is old school server benchmarks like SPECweb and TPC. We can see that going from 4MB to 11MB of LLC has a small effect on typical datacenter workloads, but a significant effect on this traditional workstation benchmark.</p> <p>Datacenter workloads operate on such large data sets that it's often impossible to fit the dataset in RAM on a single machine, let alone in cache, making a larger LLC not particularly useful. This result was confirmed by <a href="http://www.eecs.harvard.edu/~skanev/papers/isca15wsc.pdf">Kanev et al.'s ISCA 2015 paper where they looked at workloads at Google</a>. They also showed that memory bandwidth utilization is, on average, quite low.</p> <p><img src="images/intel-cat/low_bandwidth_utilization.png" width="342" height="175"></p> <p>You might think that the low bandwidth utilization is because the workloads are compute bound and don't have many memory accesses. However, when the authors looked at what the cores were doing, they found that a lot of time was spent stalled, waiting for cache/memory.</p> <p><img src="images/intel-cat/cache_stalls.png" width="328" height="169"></p> <p>Each row is a Google workload. When running these typical workloads, cores spend somewhere between 46% and 61% of their time blocked on cache/memory. It's curious that we have low cache hit rates, a lot of time stalled on cache/memory, and low bandwidth utilization. This is suggestive of workloads spending a lot of time waiting on memory accesses that have some kind of dependencies that prevent them from being executed independently.</p> <p>LLCs for high-end server chips are between 12MB and 30MB, even though we only need 4MB to get 90% of the performance, and the 90%-ile utilization of bandwidth is 31%. 
This seems like a waste of resources. We have a lot of resources sitting idle, or not being used effectively. The good news is that, since we get such low utilization out of the shared resources on our chips, we should be able to schedule multiple tasks on one machine without degrading performance.</p> <p>Great! What happens when we schedule multiple tasks on one machine? The <a href="http://csl.stanford.edu/~christos/publications/2015.heracles.isca.pdf">Lo et al. Heracles paper at ISCA this year explores this in great detail</a>. The goal of Heracles is to get better utilization on machines by co-locating multiple tasks on the same machine.</p> <p><img src="images/intel-cat/heracles_interference.png" width="665" height="343"></p> <p>The figure above shows three latency sensitive (LC) workloads with strict SLAs. websearch is the query serving service in Google search, ml_cluster is real-time text clustering, and memkeyval is a key-value store analogous to memcached. The values are latencies as a percent of maximum allowed by the SLA. The columns indicate the load on the service, and the rows indicate different types of interference. LLC, DRAM, and Network are exactly what they sound like: custom tasks designed to compete only for that resource. HyperThread means that the interfering task is a spinloop running in the other hyperthread on the same core (running in the same hyperthread isn't even considered since OS context switches are too expensive). CPU power is a task that's designed to use a lot of power and induce thermal throttling. Brain is deep learning. All of the interference tasks are run in a container with low priority.</p> <p>There's a lot going on in this figure, but we can immediately see that the best effort (BE) task we'd like to schedule can't co-exist with any of the LC tasks when only container priorities are used -- all of the <code>brain</code> rows are red, and even at low utilization (the leftmost columns), latency is way above 100% of the SLA latency. It's also clear that the different LC tasks have different profiles and can handle different types of interference. For example, websearch and ml_cluster are neither network nor compute intensive, so they can handle network and power interference well. However, since memkeyval is both network and compute intensive, it can't handle either network or power interference. The paper goes into a lot more detail about what you can infer from the details of the table. I find this to be one of the most interesting parts of the paper; I'm going to skip over it, but I recommend reading the paper if you're interested in this kind of thing.</p> <p>A simplifying assumption the authors make is that these types of interference are basically independent. This means that independent mechanisms that isolate the LC task from “too much” of each individual type of resource should be sufficient to prevent overall interference. That is, we can set some cap for each type of resource usage, and just stay below each cap. However, this assumption isn't exactly true -- for example, the authors show this figure that relates the LLC cache size to the number of cores allocated to an LC task.</p> <p><img src="images/intel-cat/llc_vs_cores.png" width="277" height="238"></p> <p>The vertical axis is the max load the LC task can handle before violating its SLA when allocated some specific LLC and number of cores. 
We can see that it's possible to trade off cache vs cores, which means that we can actually go above a resource cap in one dimension and maintain our SLA by using less of another resource. In the general case, we might also be able to trade off other resources. However, the assumption that we can deal with each resource independently reduces a complex optimization problem to something that's relatively straightforward.</p> <p>Now, let's look at each type of shared resource interference and how Heracles allocates resources to prevent SLA-violating interference.</p> <h4 id="core">Core</h4> <p>Pinning the LC and BE tasks to different cores is sufficient to prevent same-core context switching interference and hyperthreading interference. For this, Heracles used <a href="https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt">cpuset</a>. Cpuset allows you to limit a process (and its children) to only run on a limited set of CPUs.</p> <h4 id="network">Network</h4> <p>On the local machines, Heracles used <a href="https://wiki.archlinux.org/index.php/Advanced_traffic_control">qdisc</a> to enforce quotas. For more on cpuset, qdisc, and other quota/partitioning mechanisms, <a href="https://lwn.net/Articles/604609/">this LWN series on cgroups by Neil Brown is a good place to start</a>. Cgroups are used by a lot of widely used software now (Docker, Kubernetes, Mesos, etc.); they're probably worth learning about even if you don't care about this particular application.</p> <h4 id="power">Power</h4> <p>Heracles uses <a href="https://01.org/blogs/tlcounts/2014/running-average-power-limit-%E2%80%93-rapl">Intel's running average power limit</a> to estimate power. This is a feature on Sandy Bridge (2011) and newer processors that uses some on-chip monitoring hardware to estimate power usage fairly precisely. <a href="http://www.cs.binghamton.edu/~millerti/cs680r/papers/DVFS/SystemLevelAnalysis.pdf">Per-core dynamic voltage and frequency scaling</a> is used to limit power usage by specific cores to keep them from going over budget.</p> <h4 id="cache">Cache</h4> <p>The previous isolation mechanisms have been around for a while, but this one is new to Broadwell chips (released in 2015). The problem here is that if the BE task needs 1MB of LLC and the LC task needs 4MB of LLC, a single large allocation from the BE task will scribble all over the LLC, which is shared, wiping out the 4MB of cached data the LC task needs.</p> <p>Intel's “<a href="http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf">Cache Allocation Technology</a>” (CAT) allows the LLC to limit which cores can access different parts of the cache. Since we often want to pin performance sensitive tasks to cores anyway, this allows us to divide up the cache on a per-task basis.</p> <p><img src="images/intel-cat/CAT_LLC.png" width="431" height="157"></p> <p>Intel's <a href="http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/cache-allocation-technology-white-paper.pdf">April 2015 whitepaper on what they call Cache Allocation Technology (CAT)</a> has some simple benchmarks comparing CAT vs. no-CAT. 
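</p> <p>The core-isolation step from the &quot;Core&quot; section above is easy to sketch. This isn't how Heracles itself is implemented (it manages cpuset cgroups); it's just a minimal illustration of the idea using Linux's sched_setaffinity, with a hypothetical core split and assuming a machine with at least 16 cores. The cache partitioning is configured separately, via CAT's model-specific registers (or, on newer kernels, the resctrl filesystem):</p> <pre><code>import os

LC_CORES = set(range(0, 14))  # hypothetical: latency-critical task gets cores 0-13
BE_CORES = {14, 15}           # hypothetical: best-effort task gets cores 14-15

# Pin the current process as if it were the best-effort task; a real
# controller would apply LC_CORES and BE_CORES to the two tasks' pids.
os.sched_setaffinity(0, BE_CORES)
print(os.sched_getaffinity(0))  # the set of cores this process may now run on
</code></pre> <p>Coming back to the CAT benchmarks in the whitepaper mentioned above:</p> <p>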
<p>In the whitepaper's example, they measure the latency to respond to PCIe interrupts while another application has heavy CPU-to-memory traffic, with CAT on and off.</p> <table> <thead> <tr> <th>Condition</th> <th>Min</th> <th>Max</th> <th>Avg</th> </tr> </thead> <tbody> <tr> <td>no CAT</td> <td>1.66</td> <td>30.91</td> <td>4.53</td> </tr> <tr> <td>CAT</td> <td>1.22</td> <td>20.98</td> <td>1.62</td> </tr> </tbody> </table> <p>With CAT, average latency is 36% of latency without CAT. Tail latency doesn't improve as much, but there's also a substantial improvement there. That's interesting, but to me the more interesting question is how effective this is on real workloads, which we'll see when we put all of these mechanisms together.</p> <p>Another use of CAT that I'm not going to discuss at all is to prevent timing attacks, like <a href="https://eprint.iacr.org/2015/898.pdf">this attack, which can recover RSA keys across VMs via LLC interference</a>.</p> <h4 id="dram-bandwidth">DRAM bandwidth</h4> <p>Broadwell and newer Intel chips have memory bandwidth monitoring, but no control mechanism. To work around this, Heracles drops the number of cores allocated to the BE task if it's interfering with the LC task by using too much bandwidth. The coarse grained monitoring and control for this is inefficient in a number of ways that are detailed in the paper, but this still works despite the inefficiencies. However, having per-core bandwidth limiting would give better results with less effort.</p> <h4 id="putting-it-all-together">Putting it all together</h4> <p>This graph shows the effective utilization of LC websearch with other BE tasks scheduled with enough slack that the SLA for websearch isn't violated.</p> <p><img src="images/new-cpu-features/google_heracles.png" width="338" height="209"></p> <p>From barroom conversations with folks at other companies, the baseline (in red) here already looks pretty good: 80% utilization during peak times with a 7 hour trough when utilization is below 50%. With Heracles, the <em>worst case</em> utilization is 80%, and the average is 90%. This is amazing.</p> <p>Note that effective utilization can be greater than 100% since it's measured as throughput for the LC task on a single machine at 100% load plus throughput for the BE task on a single machine at 100% load. For example, if one task needs 100% of the DRAM bandwidth and 0% of the network bandwidth, and the other task needs the opposite, the two tasks would be able to co-locate on the same machine and achieve 200% effective utilization.</p> <p>In the real world, we might “only” get 90% average utilization out of a system like Heracles. Recalling our operating cost estimate of $4 billion for a large company, if the company already had a quite-good average utilization of 75%, using <a href="http://www.morganclaypool.com/doi/pdf/10.2200/S00516ED2V01Y201306CAC024">a standard model for datacenter operating costs</a>, we'd expect 15% more throughput per dollar, or $600 million in free compute. From talking to smaller companies that are on their way to becoming large (companies that spend in the range of $10 million to $100 million a year on compute), they often have utilization that's in the 20% range. 
Using the same total cost model again, they'd expect to get a 300% increase in compute per dollar, or $30 million to $300 million a year in free compute, depending on their size<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <h4 id="other-observations">Other observations</h4> <p>All of the papers we've looked at have a lot of interesting gems. I'm not going to go into all of them here, but there are a few that jumped out at me.</p> <h5 id="arm-atom-servers">ARM / Atom servers</h5> <p>It's been known for a long time that datacenter machines spend approximately half their time stalled, waiting on memory. In addition, the average number of instructions per clock that server chips are able to execute on real workloads is quite low.</p> <p><img src="images/intel-cat/server_ipc.png" width="335" height="207"></p> <p>The top rows (with horizontal bars) are internal Google workloads and the bottom rows (with green dots) are workstation benchmarks from SPEC, a standard benchmark suite. We can see that Google workloads are lucky to average 0.5 instructions per clock. We also previously saw that these workloads cause cores to be stalled on memory at least half the time.</p> <p>Despite spending most of their time waiting for memory and averaging something like half an instruction per clock cycle, high-end server chips do much better than Atom or ARM chips on real workloads (Reddi et al., ToCS 2011). This sounds a bit paradoxical -- if chips are just waiting on memory, why should you need a high-performance chip? A tiny ARM chip can wait just as effectively. In fact, it might even be better at waiting since having <a href="http://yosefk.com/blog/amdahls-law-in-reverse-the-wimpy-core-advantage.html">more cores waiting means it can use more bandwidth</a>. But it turns out that servers also spend a lot of their time exploiting instruction-level parallelism, executing multiple instructions at the same time.</p> <p><img src="images/intel-cat/server_ilp.png" width="162" height="154"></p> <p>This is a graph of how many execution units are busy at the same time. Almost a third of the time is spent with 3+ execution units busy. In between long stalls waiting on memory, high-end chips are able to get more computation done and start waiting for the next stall earlier. Something else that's curious is that server workloads have much higher instruction cache miss rates than traditional workstation workloads.</p> <h5 id="code-and-data-prioritization-technology">Code and Data Prioritization Technology</h5> <p><img src="images/intel-cat/icache_miss.png" width="332" height="191"></p> <p>Once again, the top rows (with horizontal bars) are internal Google workloads and the bottom rows (with green dots) are workstation benchmarks from SPEC, a standard benchmark suite. The authors attribute this increase in instruction misses to two factors. First, that it's normal to deploy large binaries (100MB) that overwhelm instruction caches. And second, that instructions have to compete with much larger data streams for space in the cache, which causes a lot of instructions to get evicted.</p> <p>In order to address this problem, Intel introduced what they call “Code and Data Prioritization Technology” (CDP). This is an extension of CAT that lets you separately limit which subsets of the LLC instructions and data can occupy. Since it's targeted at the last-level cache, it doesn't directly address the graph above, which shows L2 cache miss rates. 
However, the cost of an L2 cache miss that hits in the LLC is something like <a href="http://users.atw.hu/instlatx64/GenuineIntel00306D4_Broadwell2_NewMemLat.txt">26ns on Broadwell vs. 86ns for an L2 miss that also misses the LLC and has to go to main memory</a>, which is a substantial difference.</p> <p>Kanev et al. propose going a step further and having a split icache/dcache hierarchy. This isn't exactly a radical idea -- l1 caches are already split, so why not everything else? My guess is that Intel and other major chip vendors have simulation results showing that this doesn't improve performance per dollar, but who knows? Maybe we'll see split L2 caches soon.</p> <h5 id="spec">SPEC</h5> <p>A more general observation is that SPEC is basically irrelevant as a benchmark now. It's somewhat dated as a workstation benchmark, and simply completely inappropriate as a benchmark for servers, office machines, gaming machines, dumb terminals, laptops, and mobile devices<sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">2</a></sup>. The market for which SPEC is designed is getting smaller every year, and SPEC hasn't even been really representative of that market for at least a decade. And yet, among chip folks, it's still the most widely used benchmark around.</p> <p><img src="images/intel-cat/search_query.png" width="640" height="352"></p> <p>This is what a search query looks like at Google. A query comes in, a wide fanout set of RPCs are issued to a set of machines (the first row). Each of those machines also does a set of RPCs (the second row), those do more RPCs (the third row), and there's a fourth row that's not shown because the graph has so much going on that it looks like noise. This is one quite normal type of workload for a datacenter, and there's nothing in SPEC that looks like this.</p> <p>There are a lot more fun tidbits in all of these papers, and I recommend reading them if you thought anything in this post was interesting. <strong>If you liked this post, you'll probably also like <a href="https://www.youtube.com/watch?v=QBu2Ae8-8LM">this talk by Dick Sites on various performance and profiling related topics</a>, <a href="//danluu.com/clwb-pcommit/">this post on Intel's new CLWB and PCOMMIT instructions</a>, and <a href="//danluu.com/new-cpu-features/">this post on other &quot;new&quot; CPU features</a></strong>.</p> <p><small> Thanks to Leah Hanson, David Kanter, Joe Wilder, Nico Erfurth, and Jason Davies for comments/corrections on this. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">I often hear people ask, why is company X so big? You could do that with 1/10th as many engineers! That's often <a href="http://scraps.benkuhn.net/2015/09/02/darkmatter.html">not true</a>. But even when it's true, it's usually the case that doing so would leave a lot of money on the table. As companies scale up, smaller and smaller optimizations are worthwhile. For a company with enough scale, something a small startup wouldn't spend 10 minutes on can pay itself back tenfold even if it takes a team of five people a year. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:M">When I did my last set of interviews, I asked a number of mobile chip vendors how they measure things I care about on my phone, like responsiveness. Do they have a full end-to-end test with a fake finger and a camera that lets you see the actual response time to a click? Or maybe they have some tracing framework that can fake a click to see the response time? 
As far as I can tell, no one except Apple has a handle on this at all, which might explain why a two generation old iPhone smokes my state of the art Android phone in actual tasks, even though the Android phone crushes workstation benchmarks like SPEC, and benchmarks of what people did in the early 80s, like Dhrystone (both of which are used by multiple mobile processor vendors). I don't know if I can convince anyone who doesn't already believe this, but choosing good benchmarks is extremely important. I use an Android phone because I got it for free. The next time I buy a phone, I'm buying one that does tasks I actually do quickly, not one that runs academic benchmarks well. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> </ol> </div> Slowlock limplock/ Wed, 30 Sep 2015 00:00:00 +0000 limplock/ <p>Every once in a while, you hear a story like “there was a case of a 1-Gbps NIC card on a machine that suddenly was transmitting only at 1 Kbps, which then caused a chain reaction upstream in such a way that the performance of the entire workload of a 100-node cluster was crawling at a snail's pace, effectively making the system unavailable for all practical purposes”. The stories are interesting and the postmortems are fun to read, but it's not really clear how vulnerable systems are to this kind of failure or how prevalent these failures are.</p> <p>The situation reminds me of distributed systems failures before <a href="https://aphyr.com/tags/jepsen">Jepsen</a>. There are lots of anecdotal horror stories, but a common response to those is “works for me”, even when talking about systems that are now known to be fantastically broken. A handful of companies that are really serious about correctness have good tests and metrics, but they mostly don't talk about them publicly, and the general public has no easy way of figuring out if the systems they're running are sound.</p> <p></p> <p>Thanh Do et al. have tried to look at this systematically -- <a href="http://ucare.cs.uchicago.edu/pdf/socc13-limplock.pdf">what's the effect of hardware that's been crippled but not killed</a>, and <a href="http://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf">how often does this happen in practice</a>? It turns out that a lot of commonly used systems aren't robust against “limping” hardware, but that these types of failures are rare (at least until you have unreasonably large scale).</p> <p>The effect of a single slow node can be <a href="http://pages.cs.wisc.edu/~thanhdo/pdf/talk-socc-limplock.pdf">quite dramatic</a>:</p> <p><img src="images/limpware/fb_slow_nic.png" alt="The effect of a single slow NIC on an entire cluster" width="484" height="328"></p> <p>The job completion rate slowed down from 172 jobs per hour to 1 job per hour, effectively killing the entire cluster. Facebook has mechanisms to deal with dead machines, but they apparently didn't have any way to deal with slow machines at the time.</p> <p>When Do et al. looked at widely used open source software (HDFS, Hadoop, ZooKeeper, Cassandra, and HBase), they found similar problems.</p> <p><img src="images/limpware/limpware.png" width="617" height="362"></p> <p>Each subgraph is a different failure condition. F is HDFS, H is Hadoop, Z is ZooKeeper, C is Cassandra, and B is HBase. The leftmost (white) bar is the baseline no-failure case. Going to the right, the next bar is a crash, and the subsequent bars are results with a single piece of hardware degraded but not crashed (further right means slower). 
In most (but not all) cases, having degraded hardware affected performance a lot more than having failed hardware. Note that these graphs are all log scale; going up one increment is a 10x difference in performance!</p> <p>Curiously, a failed disk can cause some operations to speed up. That's because there are operations that have less replication overhead if a replica fails. It seems a bit weird to me that there isn't more overhead, because the system has to both find a replacement replica and replicate data, but what do I know?</p> <p>Anyway, why is a slow node so much worse than a dead node? The authors define three failure modes and explain what causes each one. There's operation limplock, when an operation is slow because some subpart of the operation is slow (e.g., a disk read is slow because the disk is degraded), node limplock, when a node is slow even for seemingly unrelated operations (e.g., a read from RAM is slow because a disk is degraded), and cluster limplock, where the entire cluster is slow (e.g., a single degraded disk makes an entire 1000 machine cluster slow).</p> <p>How do these happen?</p> <h4 id="operation-limplock">Operation Limplock</h4> <p>This one is the simplest. If you try to read from disk, and your disk is slow, your disk read will be slow. In the real world, we'll see this when operations have a single point of failure, and when monitoring is designed to handle total failure and not degraded performance. For example, an HBase access to a region goes through the server responsible for that region. The data is replicated on HDFS, but this doesn't help you if the node that owns the data is limping. Speaking of HDFS, it has a 60 second timeout and reads are done in 64KB chunks, which means your reads can slow down to roughly 1KB/s (64KB per 60 seconds) before HDFS will fail over to a healthy node.</p> <h4 id="node-limplock">Node Limplock</h4> <p>How can it be the case that (for example) a slow disk causes memory reads to be slow? Looking at HDFS again, it uses a thread pool. If every thread is busy very slowly completing a disk read, memory reads will block until a thread gets free.</p> <p>This isn't only an issue when using limited thread pools or other bounded abstractions -- the reality is that machines have finite resources, and unbounded abstractions will run into machine limits if they aren't carefully designed to avoid the possibility. For example, ZooKeeper keeps a queue of operations, and a slow follower can cause the leader's queue to exhaust physical memory.</p> <h4 id="cluster-limplock">Cluster Limplock</h4> <p>An entire cluster can easily become unhealthy if it relies on a single primary and the primary is limping. Cascading failures can also cause this -- the first graph, where a cluster goes from completing 172 jobs an hour to 1 job an hour, is actually a Facebook workload on Hadoop. The thing that's surprising to me here is that Hadoop is supposed to be tail tolerant -- individual slow tasks aren't supposed to have a large impact on the completion of the entire job. So what happened? Unhealthy nodes infect healthy nodes and eventually lock up the whole cluster.</p> <p><img src="images/limpware/cascade.png" alt="An unhealthy node infects an entire cluster" width="211" height="147"></p> <p>Hadoop's tail tolerance comes from kicking off speculative computation when results are coming in slowly -- in particular, when stragglers come in unusually slowly compared to other results.</p>
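<p>Concretely, the rule Hadoop uses (described in more detail below) amounts to something like the following sketch. This isn't Hadoop's actual code; the task representation and names are made up for illustration:</p> <pre><code># Speculate on a task if it has run for at least a minute and its progress
# score is at least 0.2 below the average for tasks of the same category.
SPECULATIVE_LAG = 0.2
MIN_RUNTIME_SECS = 60

def progress_score(task):
    if task["kind"] == "map":
        return task["bytes_read"] / task["bytes_total"]
    # reduce tasks: the copy, sort, and reduce phases each contribute a third
    return (task["copy_frac"] + task["sort_frac"] + task["reduce_frac"]) / 3.0

def should_speculate(task, tasks_in_same_category):
    avg = sum(progress_score(t) for t in tasks_in_same_category) / len(tasks_in_same_category)
    return (task["runtime_secs"] &gt;= MIN_RUNTIME_SECS
            and progress_score(task) &lt; avg - SPECULATIVE_LAG)
</code></pre> <p>Note that the comparison is against the average progress of the task's own category, which is what matters in the failure described next.</p>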
<p>This rule works fine when a reduce node is limping (subgraph H2), but when a <a href="http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf">map node</a> limps (subgraph H1), it can slow down all reducers in the same job, which defeats Hadoop's tail-tolerance mechanisms.</p> <p><img src="images/limpware/hadoop_deadlock.png" alt="A single bad map node effectively deadlocks hadoop" width="176" height="62"></p> <p>To see why, we have to look at <a href="https://www.usenix.org/legacy/event/osdi08/tech/full_papers/zaharia/zaharia_html/index.html">Hadoop's speculation algorithm</a>. Each task has a progress score, which is a number between 0 and 1 (inclusive). For a map, the score is the fraction of input data read. For a reduce, each of three phases (copying data from mappers, sorting, and reducing) gets 1/3 of the score. A speculative copy of a task will get run if the task has run for at least one minute and has a progress score that's less than the average for its category minus 0.2.</p> <p>In case H1, the map node's NIC is limping, so the map phase completes normally since results end up written to local disk. But when reduce nodes try to fetch data from the limping map node, they all stall, pulling down the average score for the category, which prevents speculative tasks from being run. Looking at the big picture, each Hadoop node has a limited number of map and reduce slots. If those fill up with limping tasks, the entire node will lock up. Since Hadoop isn't designed to avoid cascading failures, this eventually causes the entire cluster to lock up.</p> <p>One thing I find interesting is that this exact cause of failures was <a href="http://research.google.com/archive/mapreduce.html">described in the original MapReduce paper</a>, published in 2004. They even explicitly called out slow disk and network as causes of stragglers, which motivated their speculative execution algorithm. However, they didn't provide the details of the algorithm. The open source clone of MapReduce, Hadoop, attempted to avoid the same problem. Hadoop was initially released in 2008. Five years later, when the paper we're reading was published, its built-in mechanism for straggler detection not only failed to prevent multiple types of stragglers, it also failed to prevent stragglers from effectively deadlocking the entire cluster.</p> <h3 id="conclusion">Conclusion</h3> <p>I'm not going to go into details of how each system fared under testing. That's detailed quite nicely in the paper, which I recommend reading if you're curious. To summarize, Cassandra does quite well, whereas HDFS, Hadoop, and HBase don't.</p> <p>Cassandra seems to do well for two reasons. First, <a href="https://issues.apache.org/jira/browse/CASSANDRA-488">this patch from 2009</a> prevents queue overflows from infecting healthy nodes, which eliminates a major failure mode that causes cluster-wide failures in other systems. 
Second, the architecture used (<a href="http://www.eecs.harvard.edu/~mdw/proj/seda/">SEDA</a>) decouples different types of operations, which lets good operations continue to execute even when some operations are limping.</p> <p>My big questions after reading this paper are: how often do these kinds of failures happen, how can we catch them before they hit production, and shouldn't reasonable metrics/reporting catch this sort of thing anyway?</p> <p>For the answer to the first question, many of the same authors also have a paper where they looked at <a href="http://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf">3000 failures in Cassandra, Flume, HDFS, and ZooKeeper</a> and determined which failures were hardware related and what the hardware failure was.</p> <p><img src="images/limpware/hardware_failure_causes.png" width="147" height="104"></p> <p>14 cases of degraded performance vs. 410 other hardware failures. In their sample, that's 3% of failures; rare, but not so rare that we can ignore the issue.</p> <p>If we can't ignore these kinds of errors, how can we catch them before they go into production? The paper uses the <a href="http://www.emulab.net/">Emulab testbed</a>, which is really cool. Unfortunately, the Emulab page reads “Emulab is a public facility, available without charge to most researchers worldwide. If you are unsure if you qualify for use, please see our policies document, or ask us. If you think you qualify, you can apply to start a new project.”. That's understandable, but that means it's probably not a great solution for most of us.</p> <p>The vast majority of limping hardware is due to network or disk slowness. Why couldn't a modified version of Jepsen, or something like it, simulate disk or network slowness? A naive implementation wouldn't get anywhere near the precision of Emulab, but since we're talking about order of magnitude slowdowns, having 10% (or even 2x) variance should be ok for testing the robustness of systems against degraded hardware. There are a number of ways you could imagine that working. For example, to simulate a slow network on Linux, you could try throttling <a href="http://linux-ip.net/articles/Traffic-Control-HOWTO/">via qdisc</a>, <a href="https://github.com/majek/fluxcapacitor">hooking syscalls via ptrace</a>, etc. For a slow CPU, you can rate-limit via cgroups and cpu.shares, or just map the process to UC memory (or maybe WT or WC if that's a bit too slow), and so on and so forth for disk and other failure modes.</p> <p>That leaves my last question: shouldn't systems already catch these sorts of failures even if they're not concerned about them in particular? As we saw above, systems with cripplingly slow hardware are rare enough that we can just treat them as dead without significantly impacting our total compute resources. And systems with crippled hardware can be detected pretty straightforwardly. Moreover, multi-tenant systems have to do <a href="//danluu.com/new-cpu-features/#partitioning">continuous monitoring of their own performance to get good utilization</a> anyway.</p> <p>So why should we care about designing systems that are robust against limping hardware? One part of the answer is defense in depth. Of course we should have monitoring, but we should also have systems that are robust when our monitoring fails, as it inevitably will. Another part of the answer is that by making systems more tolerant to limping hardware, we'll also make them more tolerant to interference from other workloads in a multi-tenant environment. 
That last bit is a somewhat speculative empirical question -- it's possible that it's more efficient to design systems that aren't particularly robust against interference from competing work on the same machine, while using <a href="//danluu.com/intel-cat/">better partitioning</a> to avoid interference.</p> <p><br> <br> <small> Thanks to Leah Hanson, Hari Angepat, Laura Lindzey, Julia Evans, and James M. Lee for comments/discussion. </small></p> Steve Yegge's prediction record yegge-predictions/ Mon, 31 Aug 2015 00:00:00 +0000 yegge-predictions/ <p></p> <p>I try to avoid making predictions<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. It's a no-win proposition: if you're right, <a href="https://en.wikipedia.org/wiki/Hindsight_bias">hindsight bias</a> makes it look like you're pointing out the obvious. And most predictions are wrong. Every once in a while when someone does a review of predictions from pundits, they're almost always wrong at least as much as you'd expect from random chance, and then hindsight bias makes each prediction look hilariously bad.</p> <p>But, occasionally, you run into someone who makes pretty solid non-obvious predictions. I was re-reading some of Steve Yegge's old stuff and it turns out that he's one of those people.</p> <p>His most famous prediction is probably <a href="http://steve-yegge.blogspot.com/2007/02/next-big-language.html">the rise of JavaScript</a>. This now seems incredibly obvious in hindsight, so much so that the future laid out in Gary Bernhardt's <a href="https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript">Birth and Death of JavaScript</a> seems at least a little plausible. But you can see how non-obvious Steve's prediction was at the time by reading both the comments on his blog, and comments from HN, reddit, and the other usual suspects.</p> <p>Steve was also crazy-brave enough to <a href="https://sites.google.com/site/steveyegge2/ten-predictions">post ten predictions about the future</a> in 2004. He says “Most of them are probably wrong. The point of the exercise is the exercise itself, not in what results.”, but the predictions are actually pretty reasonable.</p> <h4 id="prediction-1-xml-databases-will-surpass-relational-databases-in-popularity-by-2011">Prediction #1: XML databases will surpass relational databases in popularity by 2011</h4> <p>2011 might have been slightly too early and JSON isn't exactly XML, but NoSQL databases have done really well for pretty much the reason given in the prediction, “Nobody likes to do O/R mapping; everyone just wants a solution.”. Sure, <a href="https://www.mongodb.com/presentations/theres-monster-my-closet-architecture-mongodb-powered-event-processing-system">Mongo may lose your data, but it's easy to set up and use</a>.</p> <h4 id="prediction-2-someone-will-make-a-lot-of-money-by-hosting-open-source-web-applications">Prediction #2: Someone will make a lot of money by hosting open-source web applications</h4> <p>This depends on what you mean by “a lot”, but this seems basically correct.</p> <blockquote> <p>We're rapidly entering the age of hosted web services, and big companies are taking advantage of their scalable infrastructure to host data and computing for companies without that expertise.</p> </blockquote> <p>For reasons that seem baffling in retrospect, Amazon understood this long before any of its major competitors and was able to get a huge head start on everybody else. 
Azure didn't get started until 2009, and Google didn't get serious about public cloud hosting until even later.</p> <p>Now that everyone's realized what Steve predicted in 2004, it seems like every company is trying to spin up a public cloud offering, but the market is really competitive and hiring has become extremely difficult. Despite giving out a large number of offers an integer multiple above market rates, Alibaba still hasn't managed to put together a team that's been able to assemble a competitive public cloud, and companies that are trying to get into the game now without as much cash to burn as Alibaba are having an even harder time.</p> <blockquote> <p>For both bug databases and source-control systems, the obstacle to outsourcing them is trust. I think most companies would love it if they didn't have to pay someone to administer Bugzilla, Subversion, Twiki, etc. Heck, they'd probably like someone to outsource their email, too.</p> </blockquote> <p>A lot of companies have moved both issue tracking and source-control to GitHub or one of its competitors, and even more have moved if you just count source-control. Hosting your own email is also a thing of the past for all but the most paranoid (or most bogged down in legal compliance issues).</p> <h4 id="prediction-3-multi-threaded-programming-will-fall-out-of-favor-by-2012">Prediction #3: Multi-threaded programming will fall out of favor by 2012</h4> <p>Hard to say if this is right or not. Depends on who you ask. This seems basically right for applications that don't need the absolute best levels for performance, though.</p> <blockquote> <p>In the past, oh, 20 years since they invented threads, lots of new, safer models have arrived on the scene. Since 98% of programmers consider safety to be unmanly, the alternative models (e.g. CSP, fork/join tasks and lightweight threads, coroutines, Erlang-style message-passing, and other event-based programming models) have largely been ignored by the masses, including me.</p> </blockquote> <p>Shared memory concurrency is still where it's at for really high performance programs, but Go has popularized CSP; actors and futures are both “popular” on the JVM; etc.</p> <h4 id="prediction-4-java-s-market-share-on-the-jvm-will-drop-below-50-by-2010">Prediction #4: Java's &quot;market share&quot; on the JVM will drop below 50% by 2010</h4> <p>I don't think this was right in 2010, or even now, although we're moving in the right direction. There's a massive amount of dark matter -- programmers who do business logic and don't blog or give talks -- that makes this prediction unlikely to come true in the near future.</p> <p>It's impossible to accurately measure market share, but basically every language ranking you can find will put <a href="http://redmonk.com/sogrady/2015/07/01/language-rankings-6-15/">Java in the top 3, with Scala and Clojure not even in the top 10</a>. Given the near power-law distribution of measured language usage, Java must still be above 90% share (and that's probably a gross underestimate).</p> <h4 id="prediction-5-lisp-will-be-in-the-top-10-most-popular-programming-languages-by-2010">Prediction #5: Lisp will be in the top 10 most popular programming languages by 2010</h4> <p>Not even close. Depending on how you measure this, Clojure might be in the top 20 (it is if you believe the Redmonk rankings), but it's hard to see it making it into the top 10 in this decade. As with the previous prediction, there's just way too much inertia here. 
Breaking into the top 10 means joining the ranks of Java, JS, PHP, Python, Ruby, C, C++, and C#. Clojure just isn't boring enough. C# was able to sneak in by pretending to be boring, but Clojure's got no hope of doing that and there isn't really another Dylan on the horizon.</p> <h4 id="prediction-6-a-new-internet-community-hangout-will-appear-one-that-you-and-i-will-frequent">Prediction #6: A new internet community-hangout will appear. One that you and I will frequent</h4> <p>This seems basically right, at least for most values of “you”.</p> <blockquote> <p>Wikis, newsgroups, mailing lists, bulletin boards, forums, commentable blogs — they're all bullshit. Home pages are bullshit. People want to socialize, and create content, and compete lightly with each other at different things, and learn things, and be entertained: all in the same place, all from their couch. Whoever solves this — i.e. whoever creates AOL for real people, or whatever the heck this thing turns out to be — is going to be really, really rich.</p> </blockquote> <p>Facebook was founded the year that was written. Zuckerberg is indeed really, really rich.</p> <h4 id="prediction-7-the-mobile-wireless-handheld-market-is-still-at-least-5-years-out">Prediction #7: The mobile/wireless/handheld market is still at least 5 years out</h4> <p>Five years from Steve's prediction would have been 2009. Although the iPhone was released in 2007, it was a while before sales really took off. In 2009, the majority of phones were feature phones, and Android was barely off the ground.</p> <p><img src="images/yegge-predictions/phone_sales.jpeg" alt="Symbian is in the lead until Q4 2010!" width="640" height="434"></p> <p>Note that this graph only runs until 2013; if you graph things <a href="https://en.wikipedia.org/wiki/Mobile_operating_system">up to 2015</a> on a linear scale, sales are so low in 2009 that you basically can't even see what's going on.</p> <h4 id="prediction-8-someday-i-will-voluntarily-pay-google-for-one-of-their-services">Prediction #8: Someday I will voluntarily pay Google for one of their services</h4> <p>It's hard to tell if this is correct (Steve, feel free to <a href="https://twitter.com/danluu">let me know</a>), but it seems true in spirit. Google has more and more services that they charge for, and they're even experimenting with <a href="https://www.google.com/contributor/welcome/">letting people pay to avoid seeing ads</a>.</p> <h4 id="prediction-9-apple-s-laptop-sales-will-exceed-those-of-hp-compaq-ibm-dell-and-gateway-combined-by-2010">Prediction #9: Apple's laptop sales will exceed those of HP/Compaq, IBM, Dell and Gateway combined by 2010</h4> <p>If you include tablets, Apple hit #1 in the market by 2010, but I don't think they do better than all of the old workhorses combined. Again, this seems to underestimate the effect of dark matter, in this case, people buying laptops for boring reasons, e.g., corporate buyers and normal folks who want something under Apple's price range.</p> <h4 id="prediction-10-in-five-years-time-most-programmers-will-still-be-average">Prediction #10: In five years' time, most programmers will still be average</h4> <p>More of a throwaway witticism than a prediction, but sure.</p> <p>That's a pretty good set of predictions for 2004. 
With the exception of the bit about Lisp, all of the predictions seem directionally correct; the misses are mostly caused by underestimating the sheer amount of inertia it takes for a young/new solution to take over.</p> <p>Steve also has a number of posts that aren't explicitly about predictions that, nevertheless, make pretty solid predictions about how things are today, written way back in 2004. There's <a href="https://sites.google.com/site/steveyegge2/its-not-software">It's Not Software</a>, which was years ahead of its time about how people write “software”, how writing server apps is really different from writing shrinkwrap software in a way that obsoletes a lot of previously solid advice, like Joel's dictum against rewrites, as well as how service oriented architectures look; <a href="https://sites.google.com/site/steveyegge2/google-at-delphi">the Google at Delphi</a> (again from 2004) correctly predicts the importance of ML and AI as well as Google's very heavy investment in ML; an <a href="https://web.archive.org/web/20060814161212/http://sztywny.titaniumhosting.com/2006/07/23/stiff-asks-great-programmers-answers/">old interview where he predicts</a> &quot;web application programming is gradually going to become the most important client-side programming out there. I think it will mostly obsolete all other client-side toolkits: GTK, Java Swing/SWT, Qt, and of course all the platform-specific ones like Cocoa and Win32/MFC/&quot;; etc. A number of Steve's internal Google blog posts also make interesting predictions, but AFAIK those are confidential. Of course all these things seem obvious in retrospect, but that's just part of Steve's plan to pass as a normal human being.</p> <p>In <a href="https://plus.google.com/110981030061712822816/posts/AaygmbzVeRq">a relatively recent post</a>, Steve throws Jeff Bezos under the bus, exposing him as one of a number of “hyper-intelligent aliens with a tangential interest in human affairs”. While the crowd focuses on Jeff, Steve is able to sneak out the back. But we're onto you, Steve.</p> <p><small> Thanks to Leah Hanson, Chris Ball, Mindy Preston, and Paul Gross for comments/corrections. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1"><p>When asked about a past prediction of his, Peter Thiel commented that writing is dangerous and mentioned that a professor once told him that writing a book is more dangerous than having a child -- you can always disown a child, but there's nothing you can do to disown a book.</p> <p>The only prediction I can recall publicly making is that I've been on <a href="https://news.ycombinator.com/item?id=3164452">the</a> <a href="https://news.ycombinator.com/item?id=4954170">record</a> for at least five years saying that, despite the hype, ARM isn't going to completely crush Intel in the near future, but that seems so obvious that it's not even worth calling it a prediction. Then again, this was a minority opinion up until pretty recently, so maybe it's not that obvious.</p> <p>I've also correctly predicted the failure of a number of chip startups, but since the vast majority of startups fail, that's expected. Predicting successes is much more interesting, and my record there is decidedly mixed. Based purely on who was involved, I thought that SiByte, Alchemy, and PA Semi were good bets. 
Of those, SiByte was a solid success, Alchemy didn't work out, and PA Semi was maybe break-even.</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> Reading postmortems postmortem-lessons/ Thu, 20 Aug 2015 00:00:00 +0000 postmortem-lessons/ <p>I love reading postmortems. They're educational, but unlike most educational docs, they tell an entertaining story. I've spent a decent chunk of time reading postmortems at both Google and Microsoft. I haven't done any kind of formal analysis on the most common causes of bad failures (yet), but there are a handful of postmortem patterns that I keep seeing over and over again.</p> <p></p> <h3 id="error-handling">Error Handling</h3> <p>Proper error handling code is hard. Bugs in error handling code are a major cause of <em>bad</em> problems. This means that the probability of having sequential bugs, where an error causes buggy error handling code to run, isn't just the independent probabilities of the individual errors multiplied. It's common to have cascading failures cause a serious outage. There's a sense in which this is obvious -- error handling is generally regarded as being hard. If I mention this to people they'll tell me how obvious it is that a disproportionate number of serious postmortems come out of bad error handling and cascading failures where errors are repeatedly not handled correctly. But despite this being “obvious”, it's not so obvious that sufficient test and static analysis effort are devoted to making sure that error handling works.</p> <p>For more on this, Ding Yuan et al. have a great paper and talk: <a href="https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan">Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems</a>. The paper is basically what it says on the tin. The authors define a critical failure as something that can take down a whole cluster or cause data corruption, and then look at a couple hundred bugs in Cassandra, HBase, HDFS, MapReduce, and Redis, to find 48 critical failures. They then look at the causes of those failures and find that most bugs were due to bad error handling. 92% of those failures are actually from errors that are handled incorrectly.</p> <p><img src="images/postmortem-lessons/osdi_error.png" alt="Graphic of previous paragraph" width="530" height="142"></p> <p>Drilling down further, 25% of bugs are from simply ignoring an error, 8% are from catching the wrong exception, 2% are from incomplete TODOs, and another 23% are &quot;easily detectable&quot;, which are defined as cases where “the error handling logic of a non-fatal error was so wrong that any statement coverage testing or more careful code reviews by the developers would have caught the bugs”. By the way, this is one reason I don't mind Go style error handling, despite the common complaint that the error checking code is cluttering up the main code path. If you care about building robust systems, the error checking code is the main code!</p> <p><a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf">The full paper</a> has a lot of gems that I mostly won't describe here. For example, they explain the unreasonable effectiveness of <a href="https://aphyr.com/tags/jepsen">Jepsen</a> (98% of critical failures can be reproduced in a 3 node cluster).</p>
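<p>To make the largest bucket above -- errors that are simply ignored -- concrete, here's the shape of bug the authors are counting, sketched in Python with made-up function names (this isn't code from any of the studied systems):</p> <pre><code>def flush_to_disk(f, data):
    try:
        f.write(data)
        f.flush()
    except OSError:
        pass          # error 'handled' by ignoring it; data loss shows up much later

def flush_to_disk_checked(f, data):
    try:
        f.write(data)
        f.flush()
    except OSError as e:
        # treat the error path as the main code: fail loudly instead of silently
        raise RuntimeError("flush failed, refusing to ack the write") from e
</code></pre> <p>The first version looks harmless in review and will pass any test that never forces the write to fail; the second at least turns quiet data loss into a loud error.</p>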
<p>The authors also dig into what percentage of failures are non-deterministic (26% of their sample), as well as the causes of non-determinism, and create a static analysis tool that can catch many common error-caused failures.</p> <h3 id="configuration">Configuration</h3> <p>Configuration bugs, not code bugs, are the most common cause I've seen of really bad outages. When I looked at publicly available postmortems, searching for “global outage postmortem” turned up results where about 50% of the outages were caused by configuration changes. Publicly available postmortems aren't a representative sample of all outages, but a random sampling of postmortem databases also reveals that config changes are responsible for a disproportionate fraction of extremely bad outages. As with error handling, I'm often told that it's obvious that config changes are scary, but it's not so obvious that most companies test and stage config changes like they do code changes.</p> <p>Except in extreme emergencies, risky code changes are basically never simultaneously pushed out to all machines because of the risk of taking down a service company-wide. But it seems that every company has to learn the hard way that seemingly benign config changes can also cause a company-wide service outage. For example, this was the cause of the infamous November 2014 Azure outage. I don't mean to pick on MS here; their major competitors have also had serious outages for similar reasons, and they've all put processes into place to reduce the risk of that sort of outage happening again.</p> <p>I don't mean to pick on large cloud companies, either. If anything, the situation there is better than at most startups, even very well funded ones. Most of the “unicorn” startups that I know of don't have a proper testing/staging environment that lets them test risky config changes. I can understand why -- it's often hard to set up a good QA environment that mirrors prod well enough that config changes can get tested, and like driving without a seatbelt, nothing bad happens the vast majority of the time. If I had to make my own seatbelt before driving my car, I might not drive with a seatbelt either. Then again, if driving without a seatbelt were as scary as making config changes, I might consider it.</p> <p><a href="http://research.microsoft.com/en-us/um/people/gray/papers/TandemTR85.7_WhyDoComputersStop.pdf">Back in 1985, Jim Gray observed</a> that &quot;operator actions, system configuration, and system maintenance was the main source of failures -- 42%&quot;. Since then, there have been a variety of studies that have found similar results. For example, <a href="http://asrabkin.bitbucket.org/papers/software12.pdf">Rabkin and Katz</a> found the following causes for failures:</p> <p><img src="images/postmortem-lessons/rabkin_katz.png" alt="Causes in decreasing order: misconfig, bug, operational, system, user, install, hardware" width="303" height="234"></p> <h3 id="hardware">Hardware</h3> <p>Basically every part of a machine can fail. Many components can also cause data corruption, often at rates that are much higher than advertised. For example, <a href="http://research.google.com/pubs/pub35162.html">Schroeder, Pinheiro, and Weber</a> found DRAM error rates were more than an order of magnitude worse than advertised. The number of silent errors is staggering, and this actually caused problems for Google back before they switched to ECC RAM. 
Even with error detecting hardware, things can go wrong; relying on <a href="http://noahdavids.org/self_published/CRC_and_checksum.html">Ethernet checksums to protect against errors is unsafe</a> and I've personally seen malformed packets get passed through as valid packets. At scale, you can run into more undetected errors than you expect, if you expect hardware checks to catch hardware data corruption.</p> <p>Failover from bad components can also fail. <a href="https://aws.amazon.com/message/67457/">This AWS failure tells a typical story</a>. Despite taking reasonable sounding measures to regularly test the generator power failover process, a substantial fraction of AWS East went down when a storm took out power and a set of backup generators failed to correctly provide power when loaded.</p> <h3 id="humans">Humans</h3> <p>This section should probably be called process error and not human error since I consider having humans in a position where they can accidentally cause a catastrophic failure to be a process bug. It's generally accepted that, if you're running large scale systems, you have to have systems that are robust to hardware failures. If you do the math on how often machines die, it's obvious that systems that aren't robust to hardware failure cannot be reliable. But humans are even more error prone than machines. Don't get me wrong, I like humans. Some of my best friends are human. But if you repeatedly put a human in a position where they can cause a catastrophic failure, you'll eventually get a catastrophe. And yet, the following pattern is still quite common:</p> <blockquote> <p>Oh, we're about to do a risky thing! Ok, let's have humans be VERY CAREFUL about executing the risky operation. Oops! We now have a global outage.</p> </blockquote> <p>Postmortems that start with “Because this was a high risk operation, foobar high risk protocol was used” are ubiquitous enough that I now think of extra human-operated steps that are done to mitigate human risk as an ops smell. Some common protocols are having multiple people watch or confirm the operation, or having ops people standing by in case of disaster. Those are reasonable things to do, and they mitigate risk to some extent, but in many postmortems I've read, automation could have reduced the risk a lot more or removed it entirely. There are a lot of cases where the outage happened because a human was expected to flawlessly execute a series of instructions and failed to do so. That's exactly the kind of thing that programs are good at! In other cases, a human is expected to perform manual error checking. That's sometimes harder to automate, and a less obvious win (since a human might catch an error case that the program misses), but in most cases I've seen it's still a net win to automate that sort of thing.</p> <p><img src="images/postmortem-lessons/idc2013_downtime.png" alt="Causes in decreasing order: human error, system failure, out of IPs, natural disaster"></p> <p>In an IDC survey, respondents voted human error as the most troublesome cause of problems in the datacenter.</p> <p>One thing I find interesting is how underrepresented human error seems to be in public postmortems. As far as I can tell, Google and MS both have substantially more automation than most companies, so I'd expect their postmortem databases to contain proportionally fewer human error caused outages than I see in public postmortems, but in fact it's the opposite. 
My guess is that's because companies are less likely to write up public postmortems when the root cause was human error enabled by risky manual procedures. A prima facie plausible alternate reason is that improved technology actually increases the fraction of problems caused by humans, which is true in some industries, like flying. I suspect that's not the case here due to the sheer number of manual operations done at a lot of companies, but there's no way to tell for sure without getting access to the postmortem databases at multiple companies. If any company wants to enable this analysis (and others) to be done (possibly anonymized), please get in touch.</p> <h3 id="monitoring-alerting">Monitoring / Alerting</h3> <p>The lack of proper monitoring is never the sole cause of a problem, but it's often a serious contributing factor. As is the case for human errors, these seem underrepresented in public postmortems. When I talk to folks at other companies about their worst near disasters, a large fraction of them come from not having the right sort of alerting set up. They're often saved from having a disaster bad enough to require a public postmortem by some sort of ops heroism, but heroism isn't a scalable solution.</p> <p>Sometimes, those near disasters are caused by subtle coding bugs, which is understandable. But more often, it's due to blatant process bugs, like not having a clear escalation path for an entire class of failures, causing the wrong team to debug an issue for half a day, or not having a backup on-call, causing a system to lose or corrupt data for hours when (inevitably) the primary on-call person doesn't notice that something's going wrong.</p> <p>The <a href="https://en.wikipedia.org/wiki/Northeast_blackout_of_2003">Northeast blackout of 2003</a> is a great example of this. It could have been a minor outage, or even just a minor service degradation, but (among other things) a series of missed alerts caused it to become one of the worst power outages ever.</p> <h3 id="not-a-conclusion">Not a Conclusion</h3> <p>This is where the conclusion's supposed to be, but I'd really like to do some serious data analysis before writing some kind of conclusion or call to action. What should I look for? What other major classes of common errors should I consider? These aren't rhetorical questions and I'm genuinely interested in hearing about other categories I should think about. <a href="https://twitter.com/danluu">Feel free to ping me here</a>. I'm also trying to collect <a href="https://github.com/danluu/post-mortems">public postmortems here</a>.</p> <p>One day, I'll get around to the serious analysis, but even without going through and classifying thousands of postmortems, I'll probably do a few things differently as a result of having read a bunch of these. I'll spend relatively more time during my code reviews on errors and error handling code, and relatively less time on the happy path. I'll also spend more time checking for and trying to convince people to fix “obvious” process bugs.</p> <p>One of the things I find curious about these failure modes is that when I talked about what I found with other folks, at least one person told me that each process issue I found was obvious. But these “obvious” things still cause a lot of failures. In one case, someone told me that what I was telling them was obvious at pretty much the same time their company was having a global outage of a multi-billion dollar service, caused by the exact thing we were talking about. 
Just because something is obvious doesn't mean it's being done.</p> <h3 id="elsewhere">Elsewhere</h3> <p>Richard Cook's <a href="http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf">How Complex Systems Fail</a> takes a more general approach; his work inspired <a href="http://www.amazon.com/gp/product/0312430000/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0312430000&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=2VKSNUT3M4W7KUMR">The Checklist Manifesto</a>, which has saved lives.</p> <p>Allspaw and Robbins' <a href="http://www.amazon.com/gp/product/1449377440/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1449377440&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=U7MG5A2J6OLQG627">Web Operations: Keeping the Data on Time</a> talks about this sort of thing in the context of web apps. Allspaw also has a nice post about <a href="http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/">some related literature from other fields</a>.</p> <p>In areas that are a bit closer to what I'm used to, there's a long history of studying the causes of failures. Some highlights include <a href="http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf">Jim Gray's Why Do Computers Stop and What Can Be Done About It?</a> (1985), <a href="https://www.usenix.org/legacy/event/usits03/tech/full_papers/oppenheimer/oppenheimer_html/">Oppenheimer et al.'s Why Do Internet Services Fail, and What Can Be Done About It?</a> (2003), <a href="Understanding and Dealing with Operator Mistakes in Internet Services">Nagaraja et al.'s Understanding and Dealing with Operator Mistakes in Internet Services</a> (2004), part of <a href="http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024">Barroso et al.'s The Datacenter as a Computer</a> (2009), <a href="http://asrabkin.bitbucket.org/papers/software12.pdf">Rabkin and Katz's How Hadoop Clusters Break</a> (2013), and <a href="http://cseweb.ucsd.edu/~tixu/papers/sosp13.pdf">Xu et al.'s Do Not Blame Users for Misconfigurations</a> (2013).</p> <p>There's also a long history of trying to understand aircraft reliability, and the <a href="https://en.wikipedia.org/wiki/Reliability-centered_maintenance">story of how processes have changed over the decades</a> is fascinating, although I'm not sure how to generalize those lessons.</p> <p>Just as an aside, I find it interesting how hard it's been to eke out extra uptime and reliability. In 1974, <a href="http://web.eecs.umich.edu/~prabal/teaching/eecs582-w13/readings/ritchie74unix.pdf">Ritchie and Thompson</a> wrote about a system &quot;costing as little as $40,000&quot; with 98% uptime. A decade later, Jim Gray used 99.6% uptime as a reasonably good benchmark. We can do much better than that now, but the level of complexity required to do it is staggering.</p> <h4 id="acknowledgments">Acknowledgments</h4> <p><small> Thanks to Leah Hanson, Anonymous, Marek Majkowski, Nat Welch, Joe Wilder, and Julia Hansbrough for providing comments on a draft of this. Anonymous, if you prefer to not be anonymous, send me a message on Zulip. For anyone keeping score, that's three folks from Google, one person from Cloudflare, and one anonymous commenter. I'm always open to comments/criticism, but I'd be especially interested in comments from folks who work at companies with less scale. Do my impressions generalize?</p> <p>Thanks to gwern and Dan Reif for taking me up on this and finding some bugs in this post. 
</small></p> Slashdot and Sourceforge slashdot-sourceforge/ Sun, 31 May 2015 00:00:00 +0000 slashdot-sourceforge/ <p></p> <p>If you've followed any tech news aggregator in the past week (the week of the 24th of May, 2015), you've probably seen the story about how <a href="https://plus.google.com/+gimp/posts/cxhB1PScFpe">SourceForge</a> is taking over admin accounts for existing projects and injecting adware in installers for packages like GIMP. For anyone not following the story, SourceForge has a long history of adware laden installers, but they used to be opt-in. It appears that the process is now mandatory for many projects.</p> <p>People have been wary of SourceForge ever since they added a feature to allow projects to opt-in to adware bundling, but you could at least claim that projects are doing it by choice. But now that SourceForge is clearly being malicious, they've wiped out all of the user trust that was built up over sixteen years of operating. No clueful person is going to ever download something from SourceForge again. If search engines start penalizing SourceForge for distributing adware, they won't even get traffic from people who haven't seen this story, wiping out basically all of their value.</p> <p>Whenever I hear about a story like this, I'm amazed at how quickly it's possible to destroy user trust, and how much easier it is to destroy a brand than to create one. In that vein, it's funny to see Slashdot (which is owned by the same company as SourceForge) also attempting to destroy their own brand. They're the only major tech news aggregator which hasn't had a story on this, and that's because they've <a href="http://www.reddit.com/r/programming/comments/37xbzt/goodbye_sourceforge/crqpnzo">buried every story</a> that someone submits. This has prompted people to start submitting comments about this on other stories.</p> <p><img src="images/slashdot-sourceforge/slashdot_sourceforge.png" alt="A highly upvoted comment about SourceForge on Slashdot" width="733" height="475"></p> <p>I find this to be pretty incredible. How is it possible that someone, somewhere, thinks that censoring SourceForge's adware bundling on Slashdot is a net positive for Slashdot Media, the holding company that owns Slashdot and SourceForge? A quick search on either Google or Google News shows that the story has already made it to a number of major tech publications, making the value of suppressing the story nearly zero in the best case. And in the worst case, this censorship will create another <a href="http://yro.slashdot.org/story/06/04/20/1538256/growing-censorship-concerns-at-digg">Digg moment</a><sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>, where readers stop trusting the moderators and move on to sites that aren't as heavily censored. There's basically no upside here and a substantial downside risk.</p> <p>I can see why DHI, the holding company that owns Slashdot Media, would want to do something. Their last earnings report indicated that Slashdot Media isn't doing well, and the last thing they need is bad publicity driving people away from Slashdot:</p> <blockquote> <p>Corporate &amp; Other segment revenues decreased 6% to $4.5 million for the quarter ended March 31, 2015, reflecting a decline in certain revenue streams at Slashdot Media.</p> </blockquote> <p>Compare that to their post-acquisition revenue from Q4 2012, which is the first quarter after DHI purchased Slashdot Media:</p> <blockquote> <p>Revenues totaled $52.7 . . . 
including $4.7 million from the Slashdot Media acquisition</p> </blockquote> <p>“Corporate &amp; Other” seems to encompass more than just Slashdot Media. And despite that, as well as milking SourceForge for all of the short-term revenue they can get, all of “Corporate &amp; Other” is doing worse than Slashdot Media alone in 2012<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>. Their original stated plan for SourceForge and Slashdot was <a href="http://www.theverge.com/2012/9/18/3351970/dice-holdings-geeknet-slashdot-careers-news">&quot;to keep them pretty much the same as they are [because we] are very sensitive to not disrupting how users use them . . .&quot;</a>, but it didn't take long for them to realize that wasn't working; here's a snippet from their 2013 earnings report:</p> <blockquote> <p>advertising revenue has declined over the past year and there is no improvement expected in the future financial performance of Slashdot Media's underlying advertising business. Therefore, $7.2 million of intangible assets and $6.3 million of goodwill related to Slashdot Media were reduced to zero.</p> </blockquote> <p>I believe it was shortly afterwards that SourceForge started experimenting with adware/malware bundlers for projects that opted in, which somehow led us to where we are today.</p> <p>I can understand the desire to do something to help Slashdot Media, but it's hard to see how permanently damaging Slashdot's reputation is going to help. As far as I can tell, they've fallen back to this classic syllogism: “We must do something. This is something. We must do this.”</p> <p><em>Update:</em> The Sourceforge/GIMP story is now on Slashdot, the week after it appeared everywhere else and a day after this was written, with a note about how the editor just got back from the weekend to people &quot;freaking out that we're 'burying' this story&quot;, playing things down to make it sound like this would have been posted if it weren't the weekend. That's not a very convincing excuse when tens of stories were posted by various editors, including the one who ended up making the Sourceforge/GIMP post, since the Sourceforge/GIMP story broke last Wednesday. The &quot;weekend&quot; excuse seems especially flimsy since, when the Sourceforge/nmap story broke on the next Wednesday and Slashdot was under strict scrutiny for the previous delay, they were able to publish that story almost immediately on the same day, despite it having been the start of the &quot;weekend&quot; the last time a story broke on a Wednesday. Moreover, the Slashdot story is <a href="http://www.reddit.com/r/programming/comments/386c75/sourceforge_locked_in_projects_of_fleeing_users/crstuof">very careful</a> to use terms like &quot;modified binary&quot; and &quot;present third party offers&quot; instead of &quot;malware&quot; or &quot;adware&quot;.</p> <p>Of course this could all just be an innocent misunderstanding, and I doubt we'll ever have enough information to know for sure either way. But Slashdot's posted excuse certainly isn't very confidence inspiring.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">Ironically, if you follow the link, you'll see that Slashdot's founder, CmdrTaco, is against “content getting removed for being critical of sponsors”. It's not that Slashdot wasn't biased back then; Slashdot used to be notorious for their pro-Linux, pro-open source, anti-MS, anti-commercial bias.
If you read through the comments in that link, you'll see that a lot of people lost their voting abilities after upvoting a viewpoint that runs against Slashdot's inherent bias. But it's Slashdot's bias that makes the omission of this story so remarkable. This is exactly the kind of thing Slashdot readers and moderators normally make hay about. But CmdrTaco has been gone for years, as has the old Slashdot. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">If you want to compare YoY results, Slashdot Media pulled in $4M in Q1 2013. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> </ol> </div> The googlebot monopoly googlebot-monopoly/ Wed, 27 May 2015 00:00:00 +0000 googlebot-monopoly/ <p>TIL that Bell Labs and a whole lot of other websites block archive.org, not to mention most search engines. Turns out I have <a href="https://github.com/danluu/debugging-stories/issues/3">a broken website link</a> in a GitHub repo, caused by the deletion of an old webpage. When I tried to pull the original from archive.org, I found that it's not available because Bell Labs blocks the archive.org crawler in their robots.txt:</p> <p></p> <pre><code>User-agent: Googlebot
User-agent: msnbot
User-agent: LSgsa-crawler
Disallow: /RealAudio/
Disallow: /bl-traces/
Disallow: /fast-os/
Disallow: /hidden/
Disallow: /historic/
Disallow: /incoming/
Disallow: /inferno/
Disallow: /magic/
Disallow: /netlib.depend/
Disallow: /netlib/
Disallow: /p9trace/
Disallow: /plan9/sources/
Disallow: /sources/
Disallow: /tmp/
Disallow: /tripwire/
Visit-time: 0700-1200
Request-rate: 1/5
Crawl-delay: 5

User-agent: *
Disallow: /
</code></pre> <p>In fact, Bell Labs not only blocks the Internet Archive bot, it blocks all bots except for Googlebot, msnbot, and their own corporate bot. And msnbot was superseded by bingbot <a href="http://en.wikipedia.org/wiki/Msnbot">five years ago</a>!</p> <p>A quick search using a term that's only found at Bell Labs<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">1</a></sup>, e.g., “This is a start at making available some of the material from the Tenth Edition Research Unix manual.”, reveals that bing indexes the page; either bingbot follows some msnbot rules, or msnbot still runs independently and indexes sites like Bell Labs, which ban bingbot but not msnbot. Luckily, in this case, a lot of search engines (like Yahoo and DDG) use Bing results, so Bell Labs hasn't disappeared from the non-Google internet, but you're out of luck <a href="http://en.wikipedia.org/wiki/Yandex#cite_note-13">if you're one of the 55% of Russians who use Yandex</a>.</p> <p>And all that is a relatively good case, where one non-Google crawler is allowed to operate. It's not uncommon to see robots.txt files that ban everything but Googlebot. Running <a href="http://blog.nullspace.io/building-search-engines.html">a competing search engine</a> and preventing a Google monopoly is hard enough without having sites ban non-Google bots. We don't need to make it even harder, nor do we need to accidentally<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup> ban the Internet Archive bot.</p>
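<p>If you're curious what a robots.txt like the one above actually allows, you can check it mechanically with Python's standard library. Below is a minimal sketch: the rules are an abridged copy of the Bell Labs file above, and the user agent strings for the non-Google crawlers are my assumptions about how those bots commonly identify themselves, not something taken from the file or from the crawlers' documentation.</p> <pre><code>import urllib.robotparser

# An abridged version of the Bell Labs rules quoted above,
# pasted in as a list of lines for illustration.
rules = """
User-agent: Googlebot
User-agent: msnbot
User-agent: LSgsa-crawler
Disallow: /tmp/
Disallow: /plan9/sources/

User-agent: *
Disallow: /
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# "ia_archiver" and "DuckDuckBot" are assumed names for the Internet
# Archive's and DuckDuckGo's crawlers; swap in whatever strings the
# crawlers you care about actually send.
for bot in ["Googlebot", "msnbot", "bingbot", "ia_archiver", "DuckDuckBot"]:
    print(bot, parser.can_fetch(bot, "/plan9/"))
</code></pre> <p>Running the equivalent of this against your own robots.txt is a cheap way to notice when a rule written to tame one crawler has quietly banned everyone else.</p>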
<p><small> P.S. While you're checking that your robots.txt doesn't ban everyone but Google, consider looking at your CPUID checks to make sure that you're using feature flags <a href="https://twitter.com/danluu/status/1203450367515615233">instead of banning everyone but Intel and AMD</a>.</p> <p>BTW, I do think there can be legitimate reasons to block crawlers, including archive.org, but I don't think that the common default many web devs have, of blocking everything but googlebot, is really intended to block competing search engines as well as archive.org. </small></p> <p>2021 Update: since this post was first published, archive.org started ignoring robots.txt and now archives pages even on sites that block it. I've heard that some competing search engines do the same thing, so this misuse of robots.txt, where sites ban everything but googlebot, is slowly making robots.txt effectively useless, much like browsers identify themselves as every browser in user-agent strings to work around sites that incorrectly block browsers they don't think are compatible.</p> <p>A related thing is that sites will sometimes ban competing search engines, like Bing, in a fit of pique, which they wouldn't do to Google since Google provides too much traffic for them to be able to get away with that, e.g., <a href="https://twitter.com/danluu/status/981992814824378369">Discourse banned Bing because they were upset that Bing was crawling Discourse at 0.46 QPS</a>.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:A">At least until this page gets indexed. Google has a turnaround time of minutes to hours on updates to this page, which I find pretty amazing. I actually find that more impressive than seeing stuff on CNN reflected in seconds to minutes. Of course search engines are going to want to update CNN in real time. But a blog like mine? If they're crawling a niche site like mine every hour, they must also be crawling millions or tens of millions of other sites on an hourly basis and updating their index appropriately. Either that or they pull updates off of RSS, but even that requires millions or tens of millions of index updates per hour for sites with my level of traffic. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:2">I don't object, in principle, to a robots.txt that prevents archive.org from archiving sites -- although the common opinion among programmers seems to be that it's a sin to block archive.org, I believe it's fine to do that if you don't want old versions of your site floating around. But it should be an informed decision, not an accident. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> </ol> </div> A defense of boring languages boring-languages/ Mon, 25 May 2015 00:00:00 +0000 boring-languages/ <p>Boring languages are underrated. Many appear to be rated quite highly, at least if you look at market share. But even so, they're underrated. Despite the popularity of Dan McKinley's <a href="https://mcfunley.com/choose-boring-technology">&quot;choose boring technology&quot;</a> essay, boring languages are widely panned.
People who use them are too (e.g., they're a target of essays by Paul Graham and Joel Spolsky, <a href="https://news.ycombinator.com/item?id=2379259">and other people have picked up a similar attitude</a>).</p> <p>A commonly used pitch for interesting languages goes something like &quot;Sure, you can get by with writing blub for boring work, which almost all programmers do, but if you did interesting work, then you'd want to use an interesting language&quot;. My feeling is that this has it backwards. When I'm doing boring work that's basically bottlenecked on the speed at which I can write boilerplate, it feels much nicer to use an interesting language (like F#), which lets me cut down on the amount of time spent writing boilerplate. But when I'm doing interesting work, the boilerplate is a rounding error and I don't mind using a boring language like Java, even if that means a huge fraction of the code I'm writing is boilerplate.</p> <p>Another common pitch, similar to the above, is that learning interesting languages will teach you new ways to think that will make you a much more effective programmer<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>. I can't speak for anyone else, but I found that line of reasoning compelling when I was early in my career and learned ACL2 (a Lisp), Forth, F#, etc.; enough of it stuck that I still love F#. But, despite taking the advice that &quot;learning a wide variety of languages that support different programming paradigms will change how you think&quot; seriously, my experience has been that the things I've learned mostly let me crank through boilerplate more efficiently. While that's pretty great when I have a boilerplate-constrained problem, when I have a hard problem, I spend so little time on that kind of stuff that the skills I learned from writing a wide variety of languages don't really help me; instead, what helps me is having domain knowledge that gives me a good lever with which I can solve the hard problem. This explains something I'd wondered about when I finished grad school and arrived in the real world: why is it that the programmers who build the systems I find most impressive typically have deep domain knowledge rather than interesting language knowledge?</p> <p>Another perspective on this is Willie Sutton's response when asked why he robbed banks, &quot;because that's where the money is&quot;. Why do I work in boring languages? Because that's what the people I want to work with use, and what the systems I want to work on are written in. The vast majority of the systems I'm interested in are written in boring languages. Although that technically doesn't imply that the vast majority of people I want to work with primarily use and have their language expertise in boring languages, that also turns out to be the case in practice. That means that, for greenfield work, it's also likely that the best choice will be a boring language. I think F# is great, but I wouldn't choose it over working with the people I want to work with on the problems that I want to work on.</p> <p>If I look at the list of things I'm personally impressed with (things like Spanner, BigTable, Colossus, etc.), it's basically all C++, with almost all of the knockoffs in Java. When I think for a minute, the list of software written in C, C++, and Java is really pretty long.
Among the transitive closure of things I use and the libraries and infrastructure used by those things, those three languages are ahead by a country mile, with PHP, Ruby, and Python rounding out the top 6. Javascript should be in there somewhere if I throw in front-end stuff, but it's so ubiquitous that making a list seems a bit pointless.</p> <p>Below are some lists of software written in boring languages. These lists are long enough that I’m going to break them down into some arbitrary sublists. As is often the case, these aren’t really nice orthogonal categories and should be tags, but here we are. In the lists below, apps are categorized under “Backend” based on the main language used on the backend of a webapp. The other categories are pretty straightforward, even if their definitions a bit idiosyncratic and perhaps overly broad.</p> <h2 id="c">C</h2> <h3 id="operating-systems">Operating Systems</h3> <p>Linux, including variants like KindleOS<br> BSD<br> Darwin (with C++)<br> <a href="http://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs">Plan 9</a><br> Windows (kernel in C, with some C++ elsewhere)</p> <h3 id="platforms-infrastructure">Platforms/Infrastructure</h3> <p><a href="http://en.wikipedia.org/wiki/Memcached">Memcached</a><br> <a href="https://www.sqlite.org/index.html">SQLite</a><br> <a href="http://en.wikipedia.org/wiki/Nginx">nginx</a><br> <a href="http://en.wikipedia.org/wiki/Apache_HTTP_Server">Apache</a><br> <a href="http://en.wikipedia.org/wiki/IBM_DB2">DB2</a><br> <a href="http://en.wikipedia.org/wiki/PostgreSQL">PostgreSQL</a><br> <a href="https://aphyr.com/posts/307-call-me-maybe-redis-redux">Redis</a><br> <a href="http://en.wikipedia.org/wiki/Varnish_%28software%29">Varnish</a><br> <a href="http://en.wikipedia.org/wiki/HAProxy">HAProxy</a> AWS Lambda workers (with most of the surrounding infrastructure written in Java), according to @jayachdee</p> <h3 id="desktop-apps">Desktop Apps</h3> <p>git<br> Gimp (with perl)<br> VLC<br> Qemu<br> OpenGL<br> <a href="http://en.wikipedia.org/wiki/FFmpeg">FFmpeg</a><br> Most GNU userland tools<br> Most BSD userland tools<br> <a href="http://lcamtuf.coredump.cx/afl/">AFL</a><br> Emacs<br> Vim</p> <h2 id="c-1">C++</h2> <h3 id="operating-systems-1">Operating Systems</h3> <p>BeOS/<a href="http://en.wikipedia.org/wiki/Haiku_%28operating_system%29">Haiku</a></p> <h3 id="platforms-infrastructure-1">Platforms/Infrastructure</h3> <p><a href="http://research.google.com/archive/gfs.html">GFS</a><br> <a href="http://www.wired.com/2012/07/google-colossus/">Colossus</a><br> <a href="http://en.wikipedia.org/wiki/Ceph_%28software%29">Ceph</a><br> <a href="http://research.google.com/pubs/pub36632.html">Dremel</a><br> <a href="http://research.google.com/archive/chubby.html">Chubby</a><br> <a href="http://research.google.com/archive/bigtable.html">BigTable</a><br> <a href="http://research.google.com/archive/spanner.html">Spanner</a><br> <a href="http://en.wikipedia.org/wiki/MySQL">MySQL</a><br> <a href="http://en.wikipedia.org/wiki/%C3%98MQ">ZeroMQ</a><br> <a href="https://github.com/scylladb/scylla">ScyllaDB</a><br> <a href="https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads">MongoDB</a><br> <a href="http://en.wikipedia.org/wiki/Apache_Mesos">Mesos</a><br> <a href="http://en.wikipedia.org/wiki/Java_virtual_machine">JVM</a><br> <a href="http://en.wikipedia.org/wiki/.NET_Framework">.NET</a></p> <h3 id="backend-apps">Backend Apps</h3> <p>Google Search<br> PayPal<br> Figma (<a 
href="https://www.figma.com/blog/building-a-professional-design-tool-on-the-web/">front-end written in C++ and cross-compiled to JS</a>)</p> <h3 id="desktop-apps-1">Desktop Apps</h3> <p>Chrome<br> MS Office<br> LibreOffice (with Java)<br> Evernote (originally in C#, converted to C++)<br> Firefox<br> Opera<br> Visual Studio (with C#)<br> Photoshop, Illustrator, InDesign, etc.<br> gcc<br> llvm/clang<br> Winamp<br> <a href="https://github.com/z3prover/z3/wiki">Z3</a><br> Most AAA games<br> Most pro audio and video production apps</p> <h3 id="elsewhere">Elsewhere</h3> <p>Also see <a href="http://www.stroustrup.com/applications.html">this list</a> and <a href="https://isocpp.org/wiki/faq/big-picture#who-uses-cpp">some of the links here</a>.</p> <h2 id="java">Java</h2> <h3 id="platforms-infrastructure-2">Platforms/Infrastructure</h3> <p><a href="http://en.wikipedia.org/wiki/Apache_Hadoop">Hadoop</a><br> <a href="http://www.aosabook.org/en/hdfs.html">HDFS</a><br> <a href="http://en.wikipedia.org/wiki/Apache_ZooKeeper">Zookeeper</a><br> <a href="http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html">Presto</a><br> <a href="https://aphyr.com/posts/294-call-me-maybe-cassandra/">Cassandra</a><br> <a href="https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0">Elasticsearch</a><br> <a href="https://en.wikipedia.org/wiki/Lucene">Lucene</a><br> <a href="http://en.wikipedia.org/wiki/Apache_Tomcat">Tomcat</a><br> <a href="http://en.wikipedia.org/wiki/Jetty_(web_server)">Jetty</a></p> <h3 id="backend-apps-1">Backend Apps</h3> <p>Gmail<br> LinkedIn<br> <a href="http://www.theregister.co.uk/2007/11/12/ebay_glitches/">Ebay</a><br> <a href="http://silvaetechnologies.eu/blg/50/the-majority-of-netflix-services-are-built-on-java">Most of Netflix</a><br> A large fraction of Amazon services</p> <h3 id="desktop-apps-2">Desktop Apps</h3> <p><a href="http://en.wikipedia.org/wiki/IBM_VisualAge">Eclipse</a><br> JetBrains IDEs<br> SmartGit<br> <a href="http://en.wikipedia.org/wiki/Minecraft">Minecraft</a></p> <h1 id="vhdl-verilog">VHDL/Verilog</h1> <p>I'm not even going to make a list because basically every major microprocessor, NIC, switch, etc. is made in either VHDL or Verilog. For existing projects, you might say that this is because you have a large team that's familiar with some boring language, but I've worked on greenfield hardware/software co-design for deep learning and networking virtualization, both with teams that are hired from scratch for the project, and we still used Verilog, despite one of the teams having one of the larger collections of bluespec proficient hardware engineers anywhere outside of Arvind's group at MIT.</p> <p><a href="https://twitter.com/danluu">Please suggest</a> other software that you think belongs on this list; it doesn't have to be software that I personally use. Also, does anyone know what EC2, S3, and Redshift are written in? I suspect C++, but I couldn't find a solid citation for that. This post was last updated 2021-08.</p> <h2 id="appendix-meta">Appendix: meta</h2> <p>One thing I find interesting is that, in personal conversations with people, the vast majority of experienced developers I know think that most mainstream languages are basically fine, modulo performance constraints, and this is even more true among people who've built systems that are really impressive to me. Online discussion of what someone might want to learn is very different, with learning interesting/fancy languages being generally high up on people's lists. 
When I talk to new programmers, they're often pretty influenced by this (e.g., at Recurse Center, before ML became trendy, learning fancy languages was the most popular way people tried to become better as a programmer, and I'd say that's now #2 behind ML). While I think learning a fancy language does work for some people, I'd say that's overrated in that there are many other techniques that seem to click with at least the same proportion of people who try it that are much less popular.</p> <p>A question I have is, why is online discussion about this topic so one-sided while the discussions I've had in real life are so oppositely one-sided. Of course, neither people who are loud on the internet nor people I personally know are representative samples of programmers, but I still find it interesting.</p> <p><small> Thanks to Leah Hanson, James Porter, Waldemar Q, Nat Welch, Arjun Sreedharan, Rafa Escalante, @matt_dz, Bartlomiej Filipek, Josiah Irwin, @jayachdee, Larry Ogrondek, Miodrag Milic, Presto, Matt Godbolt, Leah Hanson, Noah Haasis, Lifan Zeng, @chozu@fedi.absturztau.be, and Josiah Irwin for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">a variant of this argument goes beyond teaching you techniques and says that the languages you know determine what you think via the Sapir-Whorf hypothesis. I don't personally find this compelling since, when I'm solving hard problems, I don't think about things in a programming language. YMMV if you think in a programming language, but I think of an abstract solution and then translate the solution to a language, so having another language in my toolbox can, at most, help me think of better translations and save on translation. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> </ol> </div> Advantages of monorepos monorepo/ Sun, 17 May 2015 00:00:00 +0000 monorepo/ <p>Here's a conversation I keep having:</p> <blockquote> <p><strong>Someone</strong>: Did you hear that Facebook/Google uses a giant monorepo? WTF!<br> <strong>Me</strong>: Yeah! It's really convenient, don't you think?<br> <strong>Someone</strong>: That's THE MOST RIDICULOUS THING I've ever heard. Don't FB and Google know what a terrible idea it is to put all your code in a single repo?<br> <strong>Me</strong>: I think engineers at FB and Google are probably familiar with using smaller repos (doesn't <a href="http://en.wikipedia.org/wiki/Junio_Hamano">Junio Hamano</a> work at Google?), and they still prefer a single huge repo for [reasons].<br> <strong>Someone</strong>: Oh that does sound pretty nice. I still think it's weird but I could see why someone would want that.<br></p> </blockquote> <p>“[reasons]” is pretty long, so I'm writing this down in order to avoid repeating the same conversation over and over again.</p> <p></p> <h3 id="simplified-organization">Simplified organization</h3> <p>With multiple repos, you typically either have one project per repo, or an umbrella of related projects per repo, but that forces you to define what a “project” is for your particular team or company, and it sometimes forces you to split and merge repos for reasons that are pure overhead. For example, having to split a project because it's too big or has too much history for your VCS is not optimal.</p> <p>With a monorepo, projects can be organized and grouped together in whatever way you find to be most logically consistent, and not just because your version control system forces you to organize things in a particular way. 
Using a single repo also reduces overhead from managing dependencies.</p> <p>A side effect of the simplified organization is that it's easier to navigate projects. The monorepos I've used let you essentially navigate as if everything is on a networked file system, re-using the idiom that's used to navigate within projects. Multi repo setups usually have two separate levels of navigation -- the filesystem idiom that's used inside projects, and then a meta-level for navigating between projects.</p> <p>A side effect of that side effect is that, with monorepos, it's often the case that it's very easy to get a dev environment set up to run builds and tests. If you expect to be able to navigate between projects with the equivalent of <code>cd</code>, you also expect to be able to do <code>cd; make</code>. Since it seems weird for that to not work, it usually works, and whatever tooling effort is necessary to make it work gets done<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup>. While it's technically possible to get that kind of ease in multiple repos, it's not as natural, which means that the necessary work isn't done as often.</p> <h3 id="simplified-dependencies">Simplified dependencies</h3> <p>This probably goes without saying, but with multiple repos, you need to have some way of specifying and versioning dependencies between them. That sounds like it ought to be straightforward, but in practice, most solutions are cumbersome and involve a lot of overhead.</p> <p>With a monorepo, it's easy to have one universal version number for all projects. Since atomic cross-project commits are possible (though these tend to split into many parts for practical reasons at large companies), the repository can always be in a consistent state -- at commit #X, all project builds should work. Dependencies still need to be specified in the build system, but whether that's a make Makefiles or bazel BUILD files, those can be checked into version control like everything else. And since there's just one version number, the Makefiles or BUILD files or whatever you choose don't need to specify version numbers.</p> <h3 id="tooling">Tooling</h3> <p>The simplification of navigation and dependencies makes it much easier to write tools. Instead of having tools that must understand relationships between repositories, as well as the nature of files within repositories, tools basically just need to be able to read files (including some file format that specifies dependencies between units within the repo).</p> <p>This sounds like a trivial thing but, <a href="https://github.com/chrisvana/repobuild/wiki/Motivation">take this example by Christopher Van Arsdale</a> on how easy builds can become:</p> <blockquote> <p><a href="http://bazel.io/">The build system inside of Google</a> makes it incredibly easy to build software using large modular blocks of code. You want a crawler? Add a few lines here. You need an RSS parser? Add a few more lines. A large distributed, fault tolerant datastore? Sure, add a few more lines. These are building blocks and services that are shared by many projects, and easy to integrate. … This sort of Lego-like development process does not happen as cleanly in the open source world. … As a result of this state of affairs (more speculation), there is a complexity barrier in open source that has not changed significantly in the last few years. 
This creates a gap between what is easily obtainable at a company like Google versus a[n] open sourced project.</p> </blockquote> <p>The system that Van Arsdale is referring to is so convenient that, before it was open sourced, ex-Google engineers at <a href="https://facebook.github.io/buck/">Facebook</a> and <a href="https://pantsbuild.github.io/">Twitter</a> wrote their own versions of bazel in order to get the same benefits.</p> <p>It's theoretically possible to create a build system that makes building anything, with any dependencies, simple without having a monorepo, but it's more effort, enough effort that I've never seen a system that does it seamlessly. Maven and sbt are pretty nice, in a way, but it's not uncommon to lose a lot of time tracking down and fixing version dependency issues. Systems like rbenv and virtualenv try to sidestep the problem, but they result in a proliferation of development environments. Using a monorepo where HEAD always points to a consistent and valid version removes the problem of tracking multiple repo versions entirely<sup class="footnote-ref" id="fnref:V"><a rel="footnote" href="#fn:V">2</a></sup>.</p> <p>Build systems aren't the only thing that benefits from running on a monorepo. Just for example, static analysis can run across project boundaries without any extra work. Many other things, like cross-project integration testing and <a href="https://github.com/google/codesearch">code search</a>, are also greatly simplified.</p> <h3 id="cross-project-changes">Cross-project changes</h3> <p>With lots of repos, making cross-repo changes is painful. It typically involves tedious manual coordination across each repo or hack-y scripts. And even if the scripts work, there's the overhead of correctly updating cross-repo version dependencies. Refactoring an API that's used across tens of active internal projects will probably take a good chunk of a day. Refactoring an API that's used across thousands of active internal projects is hopeless.</p> <p>With a monorepo, you just <a href="http://research.google.com/pubs/pub41342.html">refactor the API and all of its callers</a> in one commit. That's not always trivial, but it's much easier than it would be with lots of small repos. I've seen APIs with thousands of usages across hundreds of projects get refactored and with a monorepo setup it's so easy that no one even thinks twice.</p> <p>Most people now consider it absurd to use a version control system like CVS, RCS, or ClearCase, where it's impossible to do a single atomic commit across multiple files, forcing people to either manually look at timestamps and commit messages or keep meta information around to determine if some particular set of cross-file changes are “really” atomic. SVN, hg, git, etc. solve the problem of atomic cross-file changes; monorepos solve the same problem across projects.</p> <p>This isn't just useful for large-scale API refactorings. David Turner, who worked on Twitter's migration from many repos to a monorepo, gives this example of a small cross-cutting change and the overhead of having to do releases for those:</p> <blockquote> <p>I needed to update [Project A], but to do that, I needed my colleague to fix one of its dependencies, [Project B]. The colleague, in turn, needed to fix [Project C]. If I had had to wait for C to do a release, and then B, before I could fix and deploy A, I might still be waiting.
But since everything's in one repo, my colleague could make his change and commit, and then I could immediately make my change.</p> <p>I guess I could do that if everything were linked by git versions, but my colleague would still have had to do two commits. And there's always the temptation to just pick a version and &quot;stabilize&quot; (meaning, stagnate). That's fine if you just have one project, but when you have a web of projects with interdependencies, it's not so good.</p> <p>[In the other direction,] Forcing <em>dependees</em> to update is actually another benefit of a monorepo.</p> </blockquote> <p>It's not just that making cross-project changes is easier, tracking them is easier, too. To do the equivalent of <code>git bisect</code> across multiple repos, you must be disciplined about using another tool to track meta information, and most projects simply don't do that. Even if they do, you now have two really different tools where one would have sufficed.</p> <p>Ironically, there's a sense in which this benefit decreases as the company gets larger. At Twitter, which isn't exactly small, David Turner got a lot of value out of being able to ship cross-project changes. But at a Google-sized company, large commits can be large enough that it makes sense to split them into many smaller commits for a variety of reasons, which necessitates tooling that can effectively split up large conceptually atomic changes into many non-atomic commits.</p> <h3 id="mercurial-and-git-are-awesome-it-s-true">Mercurial and git are awesome; it's true</h3> <p>The most common response I've gotten to these points is that switching to either git or hg from either CVS or SVN is a huge productivity win. That's true. But a lot of that is because git and hg are superior in multiple respects (e.g., better merging), not because having small repos is better per se.</p> <p>In fact, Twitter has been patching git and <a href="https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/">Facebook has been patching Mercurial</a> in order to support giant monorepos.</p> <h3 id="downsides">Downsides</h3> <p>Of course, there are downsides to using a monorepo. I'm not going to discuss them because the downsides are already widely discussed. Monorepos aren't strictly superior to manyrepos. They're not strictly worse, either. My point isn't that you should definitely switch to a monorepo; it's merely that using a monorepo isn't totally unreasonable, that folks at places like Google, Facebook, Twitter, Digital Ocean, and Etsy might have good reasons for preferring a monorepo over hundreds or thousands or tens of thousands of smaller repos.</p> <h3 id="other-discussion">Other discussion</h3> <p><a href="http://gregoryszorc.com/blog/2014/09/09/on-monolithic-repositories/">Gregory</a> <a href="http://gregoryszorc.com/blog/2015/02/17/lost-productivity-due-to-non-unified-repositories/">Szorc</a>. <a href="https://developers.facebooklive.com/videos/561/big-code-developer-infrastructure-at-facebook-s-scale">Face</a><a href="https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/">book</a>. <a href="http://bitquabit.com/post/unorthodocs-abandon-your-dvcs-and-return-to-sanity/">Benjamin Pollack</a> (one of the co-creators of Kiln). <a href="https://qafoo.com/resources/presentations/froscon_2015/monorepos.html">Benjamin Eberlei</a>. <a href="http://blog.rocketpoweredjetpants.com/2015/04/monorepo-one-source-code-repository-to.html">Simon Stewart</a>. 
<a href="https://www.digitalocean.com/company/blog/taming-your-go-dependencies/">Digital Ocean</a>. <a href="http://www.infoq.com/presentations/Development-at-Google">Goo</a><a href="https://www.youtube.com/watch?v=W71BTkUbdqE">gle</a>. <a href="http://git-merge.com/videos/scaling-git-at-twitter-wilhelm-bierbaum.html">Twitter</a>. <a href="http://www.reddit.com/r/programming/comments/1unehr/scaling_mercurial_at_facebook/cek9nkq">thedufer</a>. <a href="http://paulhammant.com/categories.html#trunk_based_development">Paul Hammant</a>.</p> <p><small> Thanks to Kamal Marhubi, David Turner, Leah Hanson, Mindy Preston, Chris Ball, Daniel Espeset, Joe Wilder, Nicolas Grilly, Giovanni Gherdovich, Paul Hammant, Juho Snellman, and Simon Thulbourn for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R">This was even true at a hardware company I worked at which created a monorepo by versioning things in RCS over NFS. Of course you can't let people live edit files in the central repository so someone wrote a number of scripts that basically turned this into perforce. I don't recommend this system, but even with an incredibly hacktastic monorepo, you still get a lot of the upsides of a monorepo. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:V">At least as long as you have some mechanism for <a href="https://news.ycombinator.com/item?id=10604168">vendoring upstream dependencies</a>. While this works great for Google because Google writes a large fraction of the code it relies on, and has enough employees that tossing all external dependencies into the monorepo has a low cost amortized across all employees, I could imagine this advantage being too expensive to take advantage of for smaller companies. <a class="footnote-return" href="#fnref:V"><sup>[return]</sup></a></li> </ol> </div> We used to build steel mills near cheap power. Now that's where we build datacenters datacenter-power/ Mon, 04 May 2015 00:00:00 +0000 datacenter-power/ <p>Why are people so concerned with hardware power consumption nowadays? Some common answers to this question are that <a href="http://www.vox.com/2015/3/9/8178213/apple-macbook-all-batteries">power is critically important for phones, tablets, and laptops</a> and that <a href="ftp://ftp.cs.utexas.edu/pub/dburger/papers/ISCA11.pdf">we can put more silicon on a modern chip than we can effectively use</a>. In 2001 <a href="http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;arnumber=912412&amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D912412">Patrick Gelsinger observed that if scaling continued at then-current rates</a>, chips would have the power density of a nuclear reactor by 2005, a rocket nozzle by 2010, and the surface of the sun by 2015, implying that power density couldn't continue on its then-current path. Although this was already fairly obvious at the time, now that it's 2015, we can be extra sure that power density didn't continue to grow at unbounded rates. Anyway, the importance of portables and scaling limits are both valid and important reasons, but since they're widely discussed, I'm going to talk about an underrated reason.</p> <p>People often focus on the portable market because it's cannibalizing desktop market, but that's not the only growth market -- servers are also becoming more important than desktops, and power is really important for servers. 
To see why power is important for servers, let's look at some calculations about what it costs to run a datacenter from <a href="http://www.amazon.com/gp/product/012383872X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=012383872X&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=Y6Z2OBCUCR72ALEB">Hennessy &amp; Patterson</a>.</p> <p></p> <p>One of the issues is that you pay for power multiple times. Some power is lost at the substation, although we might not have to pay for that directly. Then we lose more storing energy in a UPS. The figure below shows a 6% loss there, but smaller-scale datacenters can easily lose twice that. After that, we lose more power stepping down the power to a voltage that a server can accept. That's over a 10% loss for a setup that's pretty efficient.</p> <p>After that, we lose more power in the server's power supply, stepping down the voltage to levels that are useful inside a computer, which is often about another 10% loss (not pictured in the figure below).</p> <p><img src="images/datacenter-power/power_conversion.png" alt="Power conversion figure from Hennessy & Patterson, which reproduced the figure from Hamilton" width="640" height="383"></p> <p>And then once we get the power into servers, it gets turned into waste heat. To keep the servers from melting, we have to pay for power to cool them. <a href="http://www.morganclaypool.com/doi/abs/10.2200/S00193ED1V01Y200905CAC006">Barroso and Holzle</a> estimated that 30%-50% of the power drawn by a datacenter is used for chillers, and that an additional 10%-20% is for the <a href="http://www.dchuddle.com/2011/crac-v-crah/">CRAC</a> (air circulation). That means for every watt of power used in the server, we pay for another 1-2 watts of support power.</p> <p>And to actually get all this power, we have to pay for the infrastructure required to get the power into and throughout the datacenter. Hennessy &amp; Patterson estimate that of the $90M cost of an example datacenter (just the facilities -- not the servers), 82% is associated with power and cooling<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">1</a></sup>. The servers in the datacenter are estimated to only cost $70M. It's not fair to compare those numbers directly since servers need to get replaced more often than datacenters; once you take into account the cost over the entire lifetime of the datacenter, the amortized cost of power and cooling comes out to be 33% of the total cost, when servers have a 3 year lifetime and infrastructure has a 10-15 year lifetime.</p> <p>If we look at all the costs, the breakdown is:</p> <p><style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}td.l {text-align:left;}</style> <table> <tr> <th>category</th><th>%</th></tr> <tr> <td>server machines</td><td>53</td></tr> <tr> <td>power &amp; cooling infra</td><td>20</td></tr> <tr> <td>power use</td><td>13</td></tr> <tr> <td>networking</td><td>8</td></tr> <tr> <td>other infra</td><td>4</td></tr> <tr> <td>humans</td><td>2</td></tr> </table></p> <p>Power use and people are the cost of operating the datacenter (OPEX), whereas server machines, networking, power &amp; cooling infra, and other infra are capital expenditures that are amortized across the lifetime of the datacenter (CAPEX).</p>
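<p>To make the &quot;pay for power multiple times&quot; point concrete, here's a back-of-the-envelope sketch using the rough figures above. The percentages are the illustrative numbers quoted above, not measurements, and the sketch deliberately ignores details like where in the distribution chain the cooling power is metered:</p> <pre><code># Rough figures from the discussion above; illustrative, not measured.
upstream_loss = 0.11   # UPS plus voltage step-down ("over a 10% loss")
psu_loss = 0.10        # the server's own power supply, roughly another 10%

# Fraction of each purchased watt that actually reaches the server's components.
delivered = (1 - upstream_loss) * (1 - psu_loss)

# Per the Barroso and Holzle estimate quoted above, every watt dissipated in a
# server needs roughly 1-2 extra watts for chillers and air handling.
for support_watts in (1.0, 2.0):
    purchased = 1.0 / delivered + support_watts
    print(f"~{delivered:.0%} of a purchased watt reaches the server; "
          f"with {support_watts:.0f} W of support power, you buy "
          f"~{purchased:.2f} W per watt of useful server power")
</code></pre> <p>With these rough numbers, you end up buying somewhere between two and three watts for every watt that does useful work inside a server, which is part of why shaving watts at the chip level is worth so much at datacenter scale.</p>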
<a href="http://www.datacenterknowledge.com/archives/2010/05/19/microsoft-building-new-data-center-in-quincy/">We used to build steel mills near cheap sources of power</a>, but <a href="http://www.datacenterknowledge.com/archives/2010/05/19/microsoft-building-new-data-center-in-quincy/">now that's where we build datacenters</a>. As companies start considering the full cost of applications, we're seeing a lot more power optimized solutions<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">2</a></sup>. Unfortunately, this is really hard. On the software side, with the exceptions of toy microbenchmark examples, <a href="http://arcade.cs.columbia.edu/energy-oopsla14.pdf">best practices for writing power efficient code still aren't well understood</a>. On the hardware side, Intel recently released a new generation of chips with significantly improved performance per watt that doesn't have much better absolute performance than the previous generation. On the hardware accelerator front, some large companies are building dedicated power-efficient hardware for specific computations. But with existing tools, hardware accelerators are costly enough that dedicated hardware only makes sense for the largest companies. There isn't an easy answer to this problem.</p> <p><small> If you liked this post, you'd probably like chapter 6 of <a href="http://www.amazon.com/gp/product/012383872X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=012383872X&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=Y6Z2OBCUCR72ALEB">Hennessy &amp; Patterson</a>, which walks through not only the cost of power, but a number of related back of the envelope calculations relating to datacenter performance and cost.</p> <p>Apologies for the quickly scribbled down post. I jotted this down shortly before signing an NDA for an interview where I expected to learn some related information and I wanted to make sure I had my thoughts written down before there was any possibility of being contaminated with information that's under NDA.</p> <p>Thanks to Justin Blank for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:E"><p>Although this figure is widely cited, I'm unsure about the original source. This is probably the most suspicious figure in this entire post. Hennessy &amp; Patterson cite “Hamilton 2010”, which appears to be a reference to <a href="http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_GenomicsCloud20100608.pdf">this presentation</a>. That presentation doesn't make the source of the number obvious, although <a href="http://perspectives.mvdirona.com/2008/11/cost-of-power-in-large-scale-data-centers/">this post by Hamilton does cite a reference for that figure</a>, but the citation points to <a href="http://blogs.msdn.com/b/the_power_of_software/archive/2008/09/19/intense-computing-or-in-tents-computing.aspx">this post</a>, which seems to be about putting datacenters in tents, not the fraction of infrastructure that's dedicated to power and cooling.</p> <p>Some other works, <a href="http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/mar09/Reed.pdf">such as this one</a> cite <a href="http://www.electronics-cooling.com/2007/02/in-the-data-center-power-and-cooling-costs-more-than-the-it-equipment-it-supports/">this article</a>. 
However, that article doesn't directly state 82% anywhere, and it makes a number of estimates that the authors acknowledge are very rough, with qualifiers like “While, admittedly, the authors state that there is a large error band around this equation, it is very useful in capturing the magnitude of infrastructure cost.”</p> <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> <li id="fn:1">That being said, power isn't everything -- Reddi et al. looked at replacing conventional chips with low-power chips for a real workload (MS Bing) and found that while they got an improvement in power use per query, tail latency increased significantly, especially when servers were heavily loaded. Since Bing has a mechanism that causes query-related computations to terminate early if latency thresholds are hit, the result is both higher latency and degraded search quality. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> Reading citations is easier than most people think dunning-kruger/ Sun, 29 Mar 2015 00:00:00 +0000 dunning-kruger/ <p>It's really common to see claims that some meme is backed by “studies” or “science”. But when I look at the actual studies, it usually turns out that the data are opposed to the claim. Here are the last few instances of this that I've run across. </p> <h2 id="dunning-kruger">Dunning-Kruger</h2> <p>A pop-sci version of Dunning-Kruger, the most common one I see cited, is that, the less someone knows about a subject, the more they think they know. Another pop-sci version is that people who know little about something overestimate their expertise because their lack of knowledge fools them into thinking that they know more than they do. The actual claim Dunning and Kruger make is much weaker than the first pop-sci claim and, IMO, the evidence is weaker than the second claim. The original paper isn't much longer than most of the incorrect pop-sci treatments of the paper, and we can get a pretty good idea of the claims by looking at the four figures included in the paper. In the graphs below, “perceived ability” is a subjective self-rating, and “actual ability” is the result of a test.</p> <p><img src="images/dunning-kruger/dunning_1.png" alt="Dunning-Kruger graph" width="333" height="301"> <img src="images/dunning-kruger/dunning_2.png" alt="Dunning-Kruger graph" width="334" height="302"> <img src="images/dunning-kruger/dunning_3.png" alt="Dunning-Kruger graph" width="335" height="304"> <img src="images/dunning-kruger/dunning_4.png" alt="Dunning-Kruger graph" width="331" height="299"></p> <p>In two of the four cases, there's an obvious positive correlation between perceived skill and actual skill, which is the opposite of the first pop-sci conception of Dunning-Kruger that we discussed. As for the second, we can see that people at the top end also don't rate themselves correctly, so the explanation that Dunning-Kruger's results come from people who don't know much about a subject being fooled by their own ignorance (an easy interpretation to have of the study, given its title, <cite>Unskilled and Unaware of It: How Difficulties in Recognizing One's Own Incompetence Lead to Inflated Self-Assessments</cite>) is insufficient, because that doesn't explain why people at the top of the charts have what appears to be, at least under the conditions of the study, a symmetrically incorrect guess about their skill level.
One could argue that there's a completely different effect that just happens to cause the same, roughly linear, slope in perceived ability that people who are &quot;unskilled and unaware of it&quot; have. But, if there's any plausible simpler explanation, then positing a completely different effect seems overly complicated without additional evidence (which, if any exists, is not provided in the paper)<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">1</a></sup>.</p> <p>A plausible explanation of why perceived skill is compressed, especially at the low end, is that few people want to rate themselves as below average or as the absolute best, shrinking the scale but keeping a roughly linear fit. The crossing point of the scales is above the median, indicating that people, on average, overestimate themselves, but that's not surprising given the population tested (more on this later). In the other two cases, the correlation is very close to zero. It could be that the effect is different for different tasks, or it could be just that the sample size is small and that the differences between the different tasks are noise. It could also be that the effect comes from the specific population sampled (students at Cornell, who are probably actually above average in many respects). If you look up Dunning-Kruger on Wikipedia, it claims that a replication of Dunning-Kruger on East Asians shows the opposite result (perceived skill is lower than actual skill, and the greater the skill, the greater the difference), and that the effect is possibly just an artifact of American culture, but the citation is actually a link to an editorial which mentions a meta-analysis on East Asian confidence, so that might be another example of a false citation. Or maybe it's just a link to the wrong source. In any case, the effect certainly isn't that the more people know, the less they think they know.</p>
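<p>The compressed-self-rating explanation is easy to sanity check with a toy simulation. In the sketch below, everyone's self-rating is pulled toward &quot;a bit above average&quot; and only weakly tracks their actual score; the offset, slope, and noise level are made-up numbers chosen for illustration, not values fit to Dunning and Kruger's data:</p> <pre><code>import random
import statistics

random.seed(0)

# Actual ability: percentile score, uniform over the population.
# Perceived ability: pulled toward "a bit above average" (the 60th
# percentile), with a weak dependence on actual ability plus noise.
# The 60, 0.3, and 12 are made-up illustrative parameters.
people = []
for _ in range(10_000):
    actual = random.uniform(0, 100)
    perceived = 60 + 0.3 * (actual - 50) + random.gauss(0, 12)
    perceived = min(100, max(0, perceived))
    people.append((actual, perceived))

# Bin by actual score into quartiles, as in the paper's figures.
people.sort()
quartiles = [people[i * 2500:(i + 1) * 2500] for i in range(4)]
for name, group in zip(["bottom", "2nd", "3rd", "top"], quartiles):
    actual_mean = statistics.mean(a for a, _ in group)
    perceived_mean = statistics.mean(p for _, p in group)
    print(f"{name:6} quartile: actual {actual_mean:5.1f}, "
          f"perceived {perceived_mean:5.1f}")
</code></pre> <p>Binned this way, the bottom quartile appears to overestimate itself and the top quartile appears to underestimate itself, even though every simulated person is applying the exact same mediocre self-assessment rule, i.e., you get the shape of the published graphs without any special failure of self-insight at the bottom.</p>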
<h2 id="income-happiness">Income &amp; Happiness</h2> <p>It's become common knowledge that money doesn't make people happy. As of this writing, a Google search for <code>happiness income</code> returns a knowledge card claiming that making more than $75k/year has no impact on happiness. Other top search results claim the happiness ceiling occurs at $10k/year, $30k/year, $40k/year and $75k/year.</p> <p><img src="images/dunning-kruger/google_happiness.png" alt="Google knowledge card says that $75k should be enough for anyone" width="603" height="479"></p> <p>Not only is that wrong, the wrongness <a href="https://web.archive.org/web/20160610202901/http://www.brookings.edu/~/media/research/files/papers/2013/04/subjective%20well%20being%20income/subjective%20well%20being%20income.pdf">is robust across every country studied, too</a>.</p> <p><img src="images/dunning-kruger/happiness_within_country.png" alt="People with more income are happier" width="629" height="506"></p> <p>That happiness is correlated with income doesn't come from cherry-picking one study. That result holds across five iterations of the World Values Survey (1981-1984, 1989-1993, 1994-1999, 2000-2004, and 2005-2009), three iterations of the Pew Global Attitudes Survey (2002, 2007, 2010), five iterations of the International Social Survey Program (1991, 1998, 2001, 2007, 2008), and a large-scale Gallup survey.</p> <p>The graph above has income on a log scale; if you pick a country and graph the results on a linear scale, you get <a href="https://gravityandlevity.wordpress.com/2013/05/06/how-we-measure-our-happiness/">something like this</a>.</p> <p><img src="images/dunning-kruger/happiness_income_linear.gif" alt="Best fit log to happiness vs. income" width="576" height="335"></p> <p>As with all graphs of a log function, it looks like the graph is about to level off, which results in interpretations like the following:</p> <p><img src="images/dunning-kruger/bad_log_1.png" alt="Distorted log graph" width="291" height="393"></p> <p>That's an actual graph from an article that claims that income doesn't make people happy. These vaguely log-like graphs that level off are really common. If you want to see more of these, try an <a href="https://duckduckgo.com/?q=happiness+income&amp;iax=1&amp;ia=images">image search for “happiness income”</a>. My favorite is the one where people who make enough money literally hit the top of the scale. Apparently, there's a dollar value which not only makes you happy, it makes you as happy as it is possible for humans to be.</p> <p>As with Dunning-Kruger, you can look at the graphs in the papers to see what's going on. It's a little easier to see why people would pass along the wrong story here, since it's easy to misinterpret the data when it's plotted against a linear scale, but it's still pretty easy to see what's going on by taking a peek at the actual studies.</p>
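<p>The &quot;it looks like it's leveling off&quot; illusion is easy to reproduce with a few numbers. In the sketch below, satisfaction is a made-up linear-in-log(income) function (the coefficients are arbitrary); the point is just that each doubling of income buys the same increment no matter how far out you go, even though a plot against a linear income axis looks nearly flat past the first few points:</p> <pre><code>import math

# Hypothetical satisfaction = a + b * log2(income); a and b are made up.
a, b = 2.0, 0.5

def satisfaction(income):
    return a + b * math.log2(income)

prev = None
for income in [10_000, 20_000, 40_000, 80_000, 160_000, 320_000]:
    s = satisfaction(income)
    delta = "" if prev is None else f"  (+{s - prev:.2f} vs. previous)"
    print(f"${income:9,}: {s:.2f}{delta}")
    prev = s
</code></pre> <p>Plot those points against income on a linear axis and the curve looks like it's flattening out; plot them against log(income) and they're a straight line, which is what the within-country charts above show.</p>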
<h2 id="hedonic-adaptation-happiness">Hedonic Adaptation &amp; Happiness</h2> <p>The idea that people bounce back from setbacks (as well as positive events) and return to a fixed level of happiness entered the popular consciousness after <a href="http://www.amazon.com/gp/product/1400077427/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1400077427&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=3ERCGEMU57L2FDF7">Daniel Gilbert wrote about it in a popular book</a>.</p> <p>But even without looking at the literature on adaptation to adverse events, the previous section on wealth should cast some doubt on this. If people rebound from both bad events and good, how is it that making more money causes people to be happier?</p> <p>Turns out, the idea that people adapt to negative events and return to their previous set-point is a myth. Although the exact effects vary depending on the bad event, disability<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">2</a></sup>, divorce<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">3</a></sup>, loss of a partner<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">4</a></sup>, and unemployment<sup class="footnote-ref" id="fnref:4"><a rel="footnote" href="#fn:4">5</a></sup> all have long-term negative effects on happiness. Unemployment is the one event that can be undone relatively easily, but the effects persist even after people become reemployed. I'm only citing four studies here, but <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3289759/">a meta-analysis of the literature</a> shows that the results are robust across existing studies.</p> <p>The same thing applies to positive events. While it's “common knowledge” that winning the lottery doesn't make people happier, <a href="http://www.nytimes.com/2014/05/27/science/how-to-win-the-lottery-happily.html">it turns out that isn't true, either</a>.</p> <p>In both cases, early cross-sectional results indicated that it's plausible that extreme events, like winning the lottery or becoming disabled, don't have long-term effects on happiness. But the longitudinal studies that follow individuals and measure the happiness of the same person over time as events happen show the opposite result -- events do, in fact, affect happiness. For the most part, these aren't new results (some of the initial results predate Daniel Gilbert's book), but the older results based on less rigorous studies continue to propagate faster than the corrections.</p> <h2 id="chess-position-memorization">Chess position memorization</h2> <p>I frequently see citations claiming that, while experts can memorize chess positions better than non-experts, the advantage completely goes away when positions are randomized. When people refer to a specific citation, it's generally Chase and Simon's 1973 paper Perception in Chess, a &quot;classic&quot; which has been cited a whopping 7449 times in the literature, which says:</p> <blockquote> <p>De Groot did, however, find an intriguing difference between masters and weaker players in his short-term memory experiments. Masters showed a remarkable ability to reconstruct a chess position almost perfectly after viewing it for only 5 sec. There was a sharp dropoff in this ability for players below the master level. This result could not be attributed to the masters’ generally superior memory ability, for when chess positions were constructed by placing the same numbers of pieces randomly on the board, the masters could then do no better in reconstructing them than weaker players. Hence, the masters appear to be constrained by the same severe short-term memory limits as everyone else (Miller, 1956), and their superior performance with &quot;meaningful&quot; positions must lie in their ability to perceive structure in such positions and encode them in chunks. Specifically, if a chess master can remember the location of 20 or more pieces on the board, but has space for only about five chunks in short-term memory, then each chunk must be composed of four or five pieces, organized in a single relational structure.</p> </blockquote> <p>The paper then runs an experiment which &quot;proves&quot; that master-level players actually do worse than beginners when memorizing random mid-game positions even though they do much better memorizing real mid-game positions (and, in end-game positions, they do about the same as beginners when positions are randomized). Unfortunately, the paper used an absurdly small sample size of one chess player at each skill level.</p> <p>A quick search indicates that this result does not reproduce with larger sample sizes, e.g., Gobet and Simon, in &quot;Recall of rapidly presented random chess positions is a function of skill&quot;, say:</p> <blockquote> <p>A widely cited result asserts that experts’ superiority over novices in recalling meaningful material from their domain of expertise vanishes when they are confronted with random material.
A review of recent chess experiments in which random positions served as control material (presentation time between 3 and 10 sec) shows, however, that strong players generally maintain some superiority over weak players even with random positions, although the relative difference between skill levels is much smaller than with game positions. The implications of this finding for expertise in chess are discussed and the question of the recall of random material in other domains is raised.</p> </blockquote> <p>They find this scales with skill level and, e.g., for &quot;real&quot; positions, 2350+ ELO players memorized ~2.2x the number of correct pieces that 1600-2000 ELO players did, but the difference was ~1.6x for random positions (these ratios are from eyeballing a graph and may be a bit off). 1.6x is smaller than 2.2x, but it's certainly not the claimed 1.0.</p> <p>I've also seen this result cited to claim that it applies to other fields, but in a quick search of applying this result to other fields, results either show something similar (a smaller but still observable difference on randomized positions) or don't reproduce, e.g., McKeithen did this for programmers and found that, on trying to memorize programs, on &quot;normal&quot; program experts were ~2.5x better than beginners on the first trial and 3x better by the 6th trial, whereas on the &quot;scrambled&quot; program, experts were 3x better on the first trial and progressed to being only ~1.5x better by the 6th trial. Despite this result contradicting Chase and Simon, I've seen people cite this result to claim the same thing as Chase and Simon, presumably from people who didn't read what McKeithen actually wrote.</p> <h2 id="type-systems">Type Systems</h2> <p>Unfortunately, false claims about studies and evidence aren't limited to pop-sci memes; they're everywhere in both software and hardware development. For example, see this comment from a Scala/FP &quot;thought leader&quot;:</p> <p><img src="images/empirical-pl/pl_godwin.png" alt="Tweet claiming that any doubt that type systems are helpful is equivalent to being an anti-vaxxer" width="524" height="214"></p> <p>I see something like this at least once a week. I'm picking this example not because it's particularly egregious, but because it's typical. If you follow a few of the big time FP proponents on twitter, you'll see regularly claims that there's very strong empirical evidence and extensive studies backing up the effectiveness of type systems.</p> <p>However, <a href="empirical-pl/">a review of the empirical evidence</a> shows that the evidence is mostly incomplete, and that it's equivocal where it's not incomplete. Of all the false memes, I find this one to be the hardest to understand. In the other cases, I can see a plausible mechanism by which results could be misinterpreted. “Relationship is weaker than expected” can turn into “relationship is opposite of expected”, log can look a lot like an asymptotic function, and preliminary results using inferior methods can spread faster than better conducted follow-up studies. But I'm not sure what the connection between the evidence and beliefs are in this case.</p> <h2 id="is-this-preventable">Is this preventable?</h2> <p>I can see why false memes might spread quickly, even when they directly contradict reliable sources. Reading papers sounds like a lot of work. It sometimes is. But it's often not. Reading a pure math paper is usually a lot of work. 
Reading an empirical paper to determine if the methodology is sound can be a lot of work. For example, biostatistics and econometrics papers tend to apply completely different methods, and it's a lot of work to get familiar enough with the set of methods used in any particular field to understand precisely when they're applicable and what holes they have. But reading empirical papers just to see what claims they make is usually pretty easy.</p> <p>If you read the abstract and conclusion, and then skim the paper for interesting bits (graphs, tables, telling flaws in the methodology, etc.), that's enough to see if popular claims about the paper are true in most cases. In my ideal world, you could get that out of just reading the abstract, but it's not uncommon for papers to make claims in the abstract that are much stronger than the claims made in the body of the paper, so you need to at least skim the paper.</p> <p>Maybe I'm being naive here, but I think a major reason behind false memes is that checking sources sounds much harder and more intimidating than it actually is. A striking example of this is when Quartz published its article on how there isn't a gender gap in tech salaries, <a href="gender-gap/">which cited multiple sources that showed the exact opposite</a>. Twitter was abuzz with people proclaiming that the gender gap has disappeared. When I published <a href="gender-gap/">a post which did nothing but quote the actual cited studies</a>, many of the same people then proclaimed that their original proclamation was mistaken. It's great that they were willing to tweet a correction<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">6</a></sup>, but as far as I can tell no one actually went and read the source data, even though the graphs and tables make it immediately obvious that the author of the original Quartz article was pushing an agenda, not even with cherry picked citations, but citations that showed the opposite of their thesis.</p> <p>Unfortunately, it's in the best interests of non-altruistic people who do read studies to make it seem like reading studies is difficult. For example, when I talked to the founder of a widely used pay-walled site that reviews evidence on supplements and nutrition, he claimed that it was ridiculous to think that &quot;normal people&quot; could interpret studies correctly and that experts are needed to read and summarize studies for the masses. But he's just a serial entrepreneur who realized that you can make a lot of money by reading studies and summarizing the results! A more general example is how people sometimes try to maintain an authoritative air by saying that you need certain credentials or markers of prestige to really read or interpret studies.</p> <p>There are certainly fields where you need some background to properly interpret a study, but even then, the amount of knowledge that a degree contains is quite small and can be picked up by anyone. For example, excluding lab work (none of which contained critical knowledge for interpreting results), I was within a small constant factor of spending one hour of time per credit hour in school. At that conversion rate, an engineering degree from my alma mater costs a bit more than 100 hours and almost all non-engineering degrees land at less than 40 hours, with a large amount of overlap between them because a lot of degrees will require the same classes (e.g., calculus). 
Gatekeeping reading and interpreting a study based on whether or not someone has a credential like a degree is absurd when someone can spend a week's worth of time gaining the knowledge that a degree offers.</p> <p><strong>If you liked this post, you'll probably enjoy <a href="discontinuities/">this post on odd discontinuities</a>, <a href="tech-discrimination/">this post on how the effect of markets on discrimination is more nuanced than it's usually made out to be</a>, and <a href="percentile-latency/">this other post discussing some common misconceptions</a>.</strong></p> <h3 id="2021-update">2021 update</h3> <p>In retrospect, I think the mystery of the &quot;type systems&quot; example is simple: it's a different kind of fake citation than the others. In the first three examples, a clever, contrarian, but actually wrong idea got passed around. This makes sense because people love clever, contrarian ideas and don't care very much if they're wrong, so clever, contrarian ideas relatively frequently become viral relative to their correctness.</p> <p>For the type systems example, it's just that people commonly fabricate evidence and then appeal to authority to support their position. In the post, I was confused because I couldn't see how anyone could look at the evidence and then make the claims that type system advocates do but, after reading thousands of discussions from people advocating for their pet tool/language/practice, I can see that it was naive of me to think that these advocates would even consider looking for evidence as opposed to just pretending that evidence exists without ever having looked.</p> <p><small> Thanks to Leah Hanson, Lindsey Kuper, Jay Weisskopf, Joe Wilder, Scott Feeney, Noah Ennis, Myk Pono, Heath Borders, Nate Clark, and Mateusz Konieczny for comments/corrections/discussion.</p> <p>BTW, if you're going to send me a note to tell me that I'm obviously wrong, please make sure that I'm actually wrong. In general, I get great feedback and I've learned a lot from the feedback that I've gotten, but the feedback I've gotten on this post has been unusually poor. Many people have suggested that the studies I've referenced have been debunked by some other study I clearly haven't read, but in every case so far, I've already read the other study. </small></p> <p><link rel="prefetch" href="empirical-pl/"> <link rel="prefetch" href="gender-gap/"> <link rel="prefetch" href="discontinuities/"> <link rel="prefetch" href="tech-discrimination/"> <link rel="prefetch" href="percentile-latency/"></p> <div class="footnotes"> <hr /> <ol> <li id="fn:D">Dunning and Kruger claim, without what I'd consider strong evidence, that this is because people who perform well overestimate how well other people perform. While that may be true, one could also say that the explanation for people who are &quot;unskilled&quot; is that they underestimate how well other people perform. &quot;Phase 2&quot; attempts to establish that's not the case, but I don't find the argument convincing for a number of reasons. To pick one example, at the end of the section, they say &quot;Despite seeing the superior performances of their peers, bottom-quartile participants continued to hold the mistaken impression that they had performed just fine.&quot;, but we don't know that the participants believed that they performed fine; we just know what their perceived percentile is. 
It's possible to believe that you're performing poorly while also being in a high percentile (<a href="p95-skill/">and I frequently have this belief for activities I haven't seriously practiced or studied</a>, which seems likely to be the case for the participants of the Dunning-Kruger study who scored poorly on tasks with respect to those tasks). <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:1"><p><a href="http://www.ncbi.nlm.nih.gov/pubmed/17469954">Long-term disability is associated with lasting changes in subjective well-being: evidence from two nationally representative longitudinal studies.</a></p> <blockquote> <p>Hedonic adaptation refers to the process by which individuals return to baseline levels of happiness following a change in life circumstances. Two nationally representative panel studies (Study 1: N = 39,987; Study 2: N = 27,406) were used to investigate the extent of adaptation that occurs following the onset of a long-term disability. In Study 1, 679 participants who acquired a disability were followed for an average of 7.18 years before and 7.39 years after onset of the disability. In Study 2, 272 participants were followed for an average of 3.48 years before and 5.31 years after onset. Disability was associated with moderate to large drops in happiness (effect sizes ranged from 0.40 to 1.27 standard deviations), followed by little adaptation over time.</p> </blockquote> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2"><p><a href="http://www.ncbi.nlm.nih.gov/pubmed/16313658">Time does not heal all wounds</a></p> <blockquote> <p>Cross-sectional studies show that divorced people report lower levels of life satisfaction than do married people. However, such studies cannot determine whether satisfaction actually changes following divorce. In the current study, data from an 18-year panel study of more than 30,000 Germans were used to examine reaction and adaptation to divorce. Results show that satisfaction drops as one approaches divorce and then gradually rebounds over time. However, the return to baseline is not complete. In addition, prospective analyses show that people who will divorce are less happy than those who stay married, even before either group gets married. Thus, the association between divorce and life satisfaction is due to both preexisting differences and lasting changes following the event.</p> </blockquote> <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:3"><p><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.119.9139&amp;rep=rep1&amp;type=pdf">Reexamining adaptation and the set point model of happiness: Reactions to changes in marital status.</a></p> <blockquote> <p>According to adaptation theory, individuals react to events but quickly adapt back to baseline levels of subjective well-being. To test this idea, the authors used data from a 15-year longitudinal study of over 24,000 individuals to examine the effects of marital transitions on life satisfaction. On average, individuals reacted to events and then adapted back toward baseline levels. However, there were substantial individual differences in this tendency. Individuals who initially reacted strongly were still far from baseline years later, and many people exhibited trajectories that were in the opposite direction to that predicted by adaptation theory. 
Thus, marital transitions can be associated with long-lasting changes in satisfaction, but these changes can be overlooked when only average trends are examined.</p> </blockquote> <a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li> <li id="fn:4"><p><a href="https://ideas.repec.org/p/bru/bruedp/02-16.html">Unemployment Alters the Set-Point for Life Satisfaction</a></p> <blockquote> <p>According to set-point theories of subjective well-being, people react to events but then return to baseline levels of happiness and satisfaction over time. We tested this idea by examining reaction and adaptation to unemployment in a 15-year longitudinal study of more than 24,000 individuals living in Germany. In accordance with set-point theories, individuals reacted strongly to unemployment and then shifted back toward their baseline levels of life satisfaction. However, on average, individuals did not completely return to their former levels of satisfaction, even after they became reemployed. Furthermore, contrary to expectations from adaptation theories, people who had experienced unemployment in the past did not react any less negatively to a new bout of unemployment than did people who had not been previously unemployed. These results suggest that although life satisfaction is moderately stable over time, life events can have a strong influence on long-term levels of subjective well-being.</p> </blockquote> <a class="footnote-return" href="#fnref:4"><sup>[return]</sup></a></li> <li id="fn:L"><p>One thing I think it's interesting to look at is how you can see the opinions of people who are cagey about revealing their true opinions in which links they share. For example, Scott Alexander and Tyler Cowen both linked to the bogus gender gap article as something interesting to read and tend to link to things that have the same view.</p> <p>If you naively read their writing, it appears as if they're impartially looking at evidence about how the world works, which they then share with people. But when you observe that they regularly share evidence that supports one narrative, regardless of quality, and don't share evidence that supports the opposite narrative, it would appear that they have a strong opinion on the issue that they reveal via what they link to.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> </ol> </div> Given that we spend little on testing, how should we test software? testing/ Tue, 10 Mar 2015 00:00:00 +0000 testing/ <p>I've been reading a lot about software testing, lately. Coming from a hardware background (CPUs and hardware accelerators), it's interesting how different software testing is. Bugs in software are much easier to fix, so it makes sense to spend a lot less effort spent on testing. Because less effort is spent on testing, methodologies differ; software testing is biased away from methods with high fixed costs, towards methods with high variable costs. But that doesn't explain all of the differences, or even most of the differences. Most of the differences come from a cultural <a href="http://en.wikipedia.org/wiki/Path_dependence">path dependence</a>, which shows how non-optimally test effort is allocated in both hardware and software.</p> <p></p> <p>I don't really know anything about software testing, but here are some notes from what I've seen at Google, on a few open source projects, and in a handful of papers and demos. 
Since I'm looking at software, I'm going to avoid talking about how hardware testing isn't optimal, but I find that interesting, too.</p> <h3 id="manual-test-generation">Manual Test Generation</h3> <p>From what I've seen, most test effort on most software projects comes from handwritten tests. On the hardware projects I know of, writing tests by hand consumed somewhere between 1% and 25% of the test effort and was responsible for a much smaller percentage of the actual bugs found. Manual testing is considered ok for sanity checking, and sometimes ok for really dirty corner cases, but it's not scalable and too inefficient to rely on.</p> <p>It's true that there's some software that's difficult to do automated testing on, but the software projects I've worked on have relied pretty much totally on manual testing despite being in areas that are among the easiest to test with automated testing. As far as I can tell, that's not because someone did a calculation of the tradeoffs and decided that manual testing was the way to go; it's because it didn't occur to people that there were alternatives to manual testing.</p> <p>So, what are the alternatives?</p> <h3 id="random-test-generation">Random Test Generation</h3> <p>The good news is that random testing is easy to implement. You can spend <a href="https://github.com/danluu/Fuzz.jl">an hour implementing a random test generator and find tens of bugs</a>, or you can spend more time and find <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=jsfunfuzz">thousands of bugs</a>.</p> <p>You can start with something that's almost totally random and generates incredibly dumb tests. As you spend more time on it, you can add constraints and generate smarter random tests that find more complex bugs. Some good examples of this are <a href="http://www.squarefree.com/2007/08/02/introducing-jsfunfuzz/">jsfunfuzz</a>, which started out relatively simple and gained smarts as time went on, and Jepsen, which originally checked some relatively simple constraints and can now check linearizability.</p> <p>While you can generate random tests pretty easily, it still takes some time to write a powerful framework or collection of functions. Luckily, this space is well covered by existing frameworks.</p> <h3 id="random-test-generation-framework">Random Test Generation, Framework</h3> <p>Here's an example of how simple it is to write a JavaScript test using Scott Feeney's <a href="https://github.com/graue/gentest">gentest</a>, taken from the gentest readme.</p> <p>You want to test something like</p> <pre><code>function add(x, y) { return x + y; } </code></pre> <p>To check that addition commutes, you'd write</p> <pre><code>var t = gentest.types;

forAll([t.int, t.int], 'addition is commutative', function(x, y) {
  return add(x, y) === add(y, x);
});
</code></pre> <p>Instead of checking the values by hand, or writing the code to generate the values, the framework handles that and generates tests for you when you specify the constraints. QuickCheck-like generative test frameworks tend to be simple enough that they're no harder to learn how to use than any other unit test or mocking framework.</p> <p>You'll sometimes hear objections about how random testing can only find shallow bugs because random tests are too dumb to find really complex bugs. For one thing, that assumes that you don't specify constraints that allow the random generator to generate intricate test cases. 
But even then, <a href="https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan">this paper</a> analyzed production failures in distributed systems, looking for &quot;critical&quot; bugs, bugs that either took down the entire cluster or caused data corruption, and found that 58% could be caught with very simple tests. Turns out, generating “shallow” random tests is enough to catch most production bugs. And that's on projects that are unusually serious about testing and static analysis, projects that have much better test coverage than the average project.</p> <p>A specific example of the effectiveness of naive random testing is the story John Hughes tells <a href="http://www.cse.chalmers.se/edu/year/2012/course/DIT848/files/13-GL-QuickCheck.pdf">in this talk</a>. It starts out when some people came to him with a problem.</p> <blockquote> <p>We know there is a lurking bug somewhere in the dets code. We have got 'bad object' and 'premature eof' every other month the last year. We have not been able to track the bug down since the dets files is repaired automatically next time it is opened.</p> </blockquote> <p><img src="images/testing/hughes_bug.png" alt="Stack: Application on top of Mnesia on top of Dets on top of File system" width="293" height="487"></p> <p>An application that ran on top of Mnesia, a distributed database, was somehow causing errors a layer below the database. There were some guesses as to the cause. Based on when they'd seen the failures, maybe something to do with rehashing something or other in files that are bigger than 1GB? But after more than a month of effort, no one was really sure what was going on.</p> <p>In less than a day, with QuickCheck, they found five bugs. After fixing those bugs, they never saw the problem again. Each of the five bugs was reproducible on a database with one record, with at most five function calls. It is very common for bugs that have complex real-world manifestations to be reproducible with really simple test cases, if you know where to look.</p> <p>In terms of developer time, using some kind of framework that generates random tests is a huge win over manually writing tests in a lot of circumstances, and it's so trivially easy to try out that there's basically no reason not to do it. The ROI of using more advanced techniques may or may not be worth the extra investment to learn how to implement and use them.</p> <p>While dumb random testing works really well in a lot of cases, it has limits. Not all bugs are shallow. I know of a hardware company that's very good at finding deep bugs by having people with years or decades of domain knowledge write custom test generators, which then run on N-thousand machines. That works pretty well, but it requires a lot of test effort, much more than makes sense for almost any software.</p> <p>The other option is to build more smarts into the program doing the test generation. There are a ridiculously large number of papers on how to do that, but very few of those papers have turned into practical, robust software tools. The sort of simple coverage-based test generation used in AFL doesn't have that many papers on it, but it seems to be effective.</p> <h3 id="random-test-generation-coverage-based">Random Test Generation, Coverage Based</h3> <p>If you're using an existing framework, coverage-based testing isn't much harder than using any other sort of random testing. In theory, at least. 
There are often a lot of knobs you can turn to adjust different settings, as well as other complexity.</p> <p>If you're writing a framework, there are a lot of decisions. Chief among them are what coverage metric to use and how to use that coverage metric to drive test generation.</p> <p>For the first choice, which coverage metric, there are coverage metrics that are tractable, but too simplistic, like function coverage, or line coverage (a.k.a. <a href="http://en.wikipedia.org/wiki/Basic_block">basic block</a> coverage). It's easy to track those, but it's also easy to get 100% coverage while missing very serious bugs. And then there are metrics that are great, but intractable, like state coverage or path coverage. Without some kind of magic to collapse equivalent paths or states together, it's impossible to track those for non-trivial programs.</p> <p>For now, let's assume we're not going to use magic, and use some kind of approximation instead. Coming up with good approximations that work in practice often takes a lot of trial and error. Luckily, Michal Zalewski has experimented with a wide variety of different strategies for <a href="http://lcamtuf.coredump.cx/afl/">AFL</a>, a testing tool that instruments code with some coverage metrics that allow the tool to generate smart tests.</p> <p><a href="http://lcamtuf.coredump.cx/afl/technical_details.txt">AFL does the following</a>. Each branch gets something like the following injected, which approximates tracking edges between basic blocks, i.e., which branches are taken and how many times:</p> <pre><code>cur_location = &lt;UNIQUE_COMPILE_TIME_RANDOM_CONSTANT&gt;;
shared_mem[prev_location ^ cur_location]++;
prev_location = cur_location &gt;&gt; 1;
</code></pre> <p>shared_mem happens to be a 64kB array in AFL, but the size is arbitrary.</p> <p>The non-lossy version of this would be to have <code>shared_mem</code> be a map of <code>(prev_location, cur_location) -&gt; int</code>, and increment that. That would track how often each edge (prev_location, cur_location) is taken in the basic block graph.</p> <p>Using a fixed sized array and xor'ing prev_location and cur_location provides lossy compression. To keep from getting too much noise out of trivial changes, for example, running a loop 1200 times vs. 1201 times, AFL only considers a bucket to have changed when it crosses one of the following boundaries: <code>1, 2, 3, 4, 8, 16, 32, or 128</code>. That's one of the two things that AFL tracks to determine coverage.</p> <p>The other is a global set of all (prev_location, cur_location) tuples, which makes it easy to quickly determine if a tuple/transition is new.</p> <p>Roughly speaking, AFL keeps a queue of “interesting” test cases it's found and generates mutations of things in the queue to test. If something changes the coverage stat, it gets added to the queue. There's also some logic to avoid adding test cases that are too slow, and to remove test cases that are relatively uninteresting.</p> <p>AFL is about 13k lines of code, so there's clearly a lot more to it than that, but, conceptually, it's pretty simple. Zalewski explains why he's kept AFL so simple <a href="http://lcamtuf.coredump.cx/afl/related_work.txt">here</a>. 
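Before getting to his comments, here's a minimal, illustrative sketch in Go of the coverage bookkeeping described above (this is not AFL's actual code; the names <code>onBranch</code>, <code>bucket</code>, and <code>seenEdges</code> are invented for this example):</p> <pre><code>package fuzzsketch

// Illustrative sketch of AFL-style edge coverage bookkeeping, not AFL's
// actual implementation.

const mapSize = 1 &lt;&lt; 16 // 64kB, matching the array size mentioned above

var (
	sharedMem    [mapSize]byte       // lossy per-edge hit counts
	prevLocation uint16              // previous branch constant, pre-shifted
	seenEdges    = map[uint32]bool{} // global set of (prev, cur) tuples ever seen
)

// onBranch stands in for the instrumentation injected at each branch;
// cur is that branch's unique compile-time random constant.
func onBranch(cur uint16) {
	sharedMem[prevLocation^cur]++
	seenEdges[uint32(prevLocation)&lt;&lt;16|uint32(cur)] = true
	prevLocation = cur &gt;&gt; 1
}

// bucket collapses a raw hit count into the coarse buckets described above,
// so that, e.g., running a loop 1200 times vs. 1201 times doesn't register
// as new coverage.
func bucket(n byte) byte {
	switch {
	case n &lt;= 3:
		return n
	case n &lt; 8:
		return 4
	case n &lt; 16:
		return 8
	case n &lt; 32:
		return 16
	case n &lt; 128:
		return 32
	default:
		return 128
	}
}
</code></pre> <p>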
Zalewski's comments are short enough that they're worth reading in their entirety if you're at all interested, but I'll excerpt a few bits anyway.</p> <blockquote> <p>In the past six years or so, I've also seen a fair number of academic papers that dealt with smart fuzzing (focusing chiefly on symbolic execution) and a couple papers that discussed proof-of-concept application of genetic algorithms. I'm unconvinced how practical most of these experiments were … Effortlessly getting comparable results [from AFL] with state-of-the-art symbolic execution in equally complex software still seems fairly unlikely, and hasn't been demonstrated in practice so far.</p> </blockquote> <h3 id="test-generation-other-smarts">Test Generation, Other Smarts</h3> <p>While Zalewski is right that it's hard to write a robust and generalizable tool that uses more intelligence, it's possible to get a lot of mileage out of domain specific tools. For example, <a href="http://db.cs.berkeley.edu/papers/dbtest12-bloom.pdf">BloomUnit</a>, a test framework for distributed systems, helps you test non-deterministic systems by generating a subset of valid orderings, using a SAT solver to avoid generating equivalent re-orderings. The authors don't provide benchmark results the same way Zalewski does with AFL, but even without benchmarks it's at least plausible that a SAT solver can be productively applied to test case generation. If nothing else, distributed system tests are often slow enough that you can do a lot of work without severely impacting test throughput.</p> <p>Zalewski says “If your instrumentation makes it 10x more likely to find a bug, but runs 100x slower, your users [are] getting a bad deal.”, which is a great point -- gains in test smartness have to be balanced against losses in test throughput, but if you're testing with something like Jepsen, where your program under test actually runs on multiple machines that have to communicate with each other, the test is going to be slow enough that you can spend a lot of computation generating smarter tests before getting a 10x or 100x slowdown.</p> <p>This same effect makes it difficult to port smart hardware test frameworks to software. It's not unusual for a “short” hardware test to take minutes, and for a long test to take hours or days. As a result, spending a massive amount of computation to generate more efficient tests is worth it, but naively porting a smart hardware test framework<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup> to software is a recipe for overly clever inefficiency.</p> <h3 id="why-not-coverage-based-unit-testing">Why Not Coverage-Based Unit Testing?</h3> <p>QuickCheck and the tens or hundreds of QuickCheck clones are pretty effective for random unit testing, and AFL is really amazing at coverage-based pseudo-random end-to-end test generation to find crashes and security holes. How come there isn't a tool that does coverage-based unit testing?</p> <p>I often assume that if there isn't an implementation of a straightforward idea, there must be some reason, like maybe it's much harder than it sounds, but <a href="http://www.somerandomidiot.com/">Mindy</a> convinced me that there's often no reason something hasn't been done before, so I tried making the simplest possible toy implementation.</p> <p>Before I looked at AFL's internals, I created this really dumb function to test. 
The function takes an array of arbitrary length as input and is supposed to return a non-zero int.</p> <pre><code>// Checks that a number has its bottom 16 bits set
func some_filter(x int) bool {
	for i := 0; i &lt; 16; i = i + 1 {
		if !(x&amp;1 == 1) {
			return false
		}
		x &gt;&gt;= 1
	}
	return true
}

// Takes an array and returns a non-zero int
func dut(a []int) int {
	if len(a) != 4 {
		return 1
	}
	if some_filter(a[0]) {
		if some_filter(a[1]) {
			if some_filter(a[2]) {
				if some_filter(a[3]) {
					return 0 // A bug! We failed to return non-zero!
				}
				return 2
			}
			return 3
		}
		return 4
	}
	return 5
}
</code></pre> <p>dut stands for device under test, a commonly used term in the hardware world. This code is deliberately contrived to make it easy for a coverage based test generator to make progress. Since the code does as little work as possible per branch and per loop iteration, the coverage metric changes every time we do a bit of additional work<sup class="footnote-ref" id="fnref:G"><a rel="footnote" href="#fn:G">2</a></sup>. It turns out that <a href="http://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thin-air.html">a lot of software acts like this, despite not being deliberately built this way</a>.</p> <p>Random testing is going to have a hard time finding cases where <code>dut</code> incorrectly returns 0. Even if you set the correct array length, a total of 64 bits have to be set to particular values, so there's a 1 in 2^64 chance of any particular random input hitting the failure.</p> <p>But a test generator that uses something like AFL's fuzzing algorithm hits this case almost immediately. Turns out, with reasonable initial inputs, it even finds a failing test case before it really does any coverage-guided test generation because the heuristics AFL uses for generating random tests generate an input that covers this case.</p> <p>That brings up the question of why QuickCheck and most of its clones don't use heuristics to generate random numbers. The QuickCheck paper mentions that it uses random testing because it's nearly as good as <a href="http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;arnumber=5010257&amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5010257">partition testing</a> and much easier to implement. That may be true, but it doesn't mean that generating some values using simple heuristics can't generate better results with the same amount of effort. Since Zalewski has <a href="http://lcamtuf.blogspot.com/2014/08/binary-fuzzing-strategies-what-works.html">already done the work of figuring out, empirically, what heuristics are likely to exercise more code paths</a>, it seems like a waste to ignore that and just generate totally random values.</p> <p>Whether or not it's worth it to use coverage guided generation is a bit iffier; it doesn't prove anything that a toy coverage-based unit testing prototype can find a bug in a contrived function that's amenable to coverage based testing. But that wasn't the point. The point was to see if there was some huge barrier that should prevent people from doing coverage-driven unit testing. As far as I can tell, there isn't.</p> <p>It helps that the implementation of golang is very well commented and has good facilities for manipulating go code, which makes it really easy to modify its coverage tools to generate whatever coverage metrics you want, but most languages have some kind of coverage tools that can be hacked up to provide the appropriate coverage metrics so it shouldn't be too painful for any mature language. 
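To give a feel for the shape of the thing, here's one way the core loop of such a tool could look; this is an illustrative sketch, not the toy prototype described above, and <code>runWithCoverage</code> and <code>mutate</code> are hypothetical helpers standing in for hacked-up coverage tooling and AFL-style mutation heuristics:</p> <pre><code>package covsketch

import "math/rand"

// runWithCoverage is a hypothetical hook: it runs the property on one input
// and reports whether the property held, plus a fingerprint of the coverage
// that run produced.
type runWithCoverage func(input []int) (ok bool, coverage string)

// mutate produces a variant of an existing input: flip a bit, grow the
// slice, etc. Real tools use much richer mutation heuristics.
func mutate(rng *rand.Rand, in []int) []int {
	out := append([]int(nil), in...)
	if len(out) == 0 || rng.Intn(4) == 0 {
		return append(out, rng.Int())
	}
	out[rng.Intn(len(out))] ^= 1 &lt;&lt; uint(rng.Intn(63))
	return out
}

// fuzz keeps a queue of inputs that produced new coverage, mutates them,
// and returns a failing input if it finds one.
func fuzz(run runWithCoverage, seeds [][]int, iterations int) ([]int, bool) {
	rng := rand.New(rand.NewSource(0))
	queue := append([][]int(nil), seeds...)
	seen := map[string]bool{}
	for i := 0; i &lt; iterations &amp;&amp; len(queue) &gt; 0; i++ {
		input := mutate(rng, queue[rng.Intn(len(queue))])
		ok, cov := run(input)
		if !ok {
			return input, true // found a counterexample
		}
		if !seen[cov] { // new coverage: keep this input around to mutate later
			seen[cov] = true
			queue = append(queue, input)
		}
	}
	return nil, false
}
</code></pre> <p>For comparison, a plain QuickCheck-style check of the same kind of property, something like <code>quick.Check(func(a []int) bool { return dut(a) != 0 }, nil)</code> with Go's testing/quick package, draws each input independently at random, which is why it has essentially no chance of hitting the 1 in 2^64 case above. 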
And once you've got the coverage numbers, generating coverage-guided tests isn't much harder than generating random QuickCheck like tests. There are some cases where it's pretty difficult to generate good coverage-guided tests, like when generating functions to test a function that uses higher-order functions, but even in those cases you're no worse off than you would be with a QuickCheck clone<sup class="footnote-ref" id="fnref:Q"><a rel="footnote" href="#fn:Q">3</a></sup>.</p> <h3 id="test-time">Test Time</h3> <p>It's possible to run software tests much more quickly than hardware tests. One side effect of that is that it's common to see people proclaim that all tests should run in time bound X, and you're doing it wrong if they don't. I've heard various values of X from 100ms to 5 minutes. Regardless of the validity of those kinds of statements, a side effect of that attitude is that people often think that running a test generator for a few hours is A LOT OF TESTING. I overheard one comment about how a particular random test tool had found basically all the bugs it could find because, after a bunch of bug fixes, it had been run for a few hours without finding any additional bugs.</p> <p>And then you have hardware companies, which will dedicate thousands of machines to generating and running tests. That probably doesn't make sense for a software company, but considering the relative cost of a single machine compared to the cost of a developer, it's almost certainly worth dedicating at least one machine to generating and running tests. And for companies with their own machines, or dedicated cloud instances, generating tests on idle machines is pretty much free.</p> <h3 id="attitude">Attitude</h3> <p>In &quot;Lessons Learned in Software Testing&quot;, the authors mention that QA shouldn't be expected to find all bugs and that QA shouldn't have veto power over releases because it's impossible to catch most important bugs, and thinking that QA will do so leads to sloppiness. That's a pretty common attitude on the software teams I've seen. But on hardware teams, it's expected that all “bad” bugs will be caught before the final release and QA will shoot down a release if it's been inadequately tested. Despite that, devs are pretty serious about making things testable by avoiding unnecessary complexity. If a bad bug ever escapes (e.g., the Pentium FDIV bug or the Haswell STM bug), there's a post-mortem to figure out how the test process could have gone so wrong that a significant bug escaped.</p> <p>It's hard to say how much of the difference in bug count between hardware and software is attitude, and how much is due to the difference in the amount of effort expended on testing, but I think attitude is a significant factor, in addition to the difference in resources.</p> <p>It affects everything<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">4</a></sup>, down to what level of tests people write. There's a lot of focus on unit testing in software. In hardware, people use the term unit testing, but it usually refers to what would be called an integration test in software. It's considered too hard to thoroughly test every unit; it's much less total effort to test “units” that lie on clean API boundaries (which can be internal or external), so that's where test effort is concentrated.</p> <p>This also drives test generation. If you accept that bad bugs will occur frequently, manually writing tests is ok. 
But if your goal is to never release a chip with a bad bug, there's no way to do that when writing tests by hand, so you'll rely on some combination of random testing, manual testing for tricky edge cases, and formal methods. If you then decide that you don't have the resources to avoid bad bugs all the time, and you have to scale things back, you'll be left with the most efficient bug finding methods, which isn't going to leave a lot of room for writing tests by hand.</p> <h3 id="conclusion">Conclusion</h3> <p>A lot of projects could benefit from more automated testing. Basically every language has a QuickCheck-like framework available, but most projects that are amenable to QuickCheck still rely on manual tests. For all but the tiniest companies, dedicating at least one machine for that kind of testing is probably worth it.</p> <p>I think QuickCheck-like frameworks could benefit from using a coverage driven approach. It's certainly easy to implement for functions that take arrays of ints, but that's also pretty much the easiest possible case for something that uses AFL-like test generation (other than, maybe, an array of bytes). It's possible that this is much harder than I think, but if so, I don't see why.</p> <p>My background is primarily in hardware, so I could be totally wrong! If you have a software testing background, I'd be really interested in <a href="https://twitter.com/danluu">hearing what you think</a>. Also, I haven't talked about the vast majority of the topics that testing covers. For example, figuring out what should be tested is really important! So is figuring out how where nasty bugs might be hiding, and having a good regression test setup. But those are pretty similar between hardware and software, so there's not much to compare and contrast.</p> <h3 id="resources">Resources</h3> <p><a href="http://www.exampler.com/testing-com/writings/coverage.pdf">Brian Marick on code coverage, and how it can be misused</a>.</p> <blockquote> <p>If a part of your test suite is weak in a way that coverage can detect, it's likely also weak in a way coverage can't detect.</p> </blockquote> <p>I'm used to bugs being thought of in the same way -- if a test generator takes a month to catch a bug in an area, there are probably other subtle bugs in the same area, and more work needs to be done on the generator to flush them out.</p> <p><a href="http://www.amazon.com/gp/product/0122007514/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0122007514&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=DNSHYGZA2USRTHCU">Lessons Learned in Software Testing: A Context-Driven Approach, by Kaner, Bach, &amp; Pettichord</a>. This book is too long to excerpt, but I find it interesting because it reflects a lot of conventional wisdom.</p> <p><a href="http://lcamtuf.coredump.cx/afl/technical_details.txt">AFL whitepaper</a>, <a href="http://lcamtuf.coredump.cx/afl/historical_notes.txt">AFL historical notes</a>, and <a href="http://lcamtuf.coredump.cx/afl/releases/afl-latest.tgz">AFL code tarball</a>. All of it is really readable. One of the reasons I spent so much time looking at AFL is because of how nicely documented it is. 
Another reason is, of course, that it's been very effective at finding bugs on a wide variety of projects.</p> <p><em>Update: Dmitry Vyukov's <a href="https://speakerdeck.com/filosottile/automated-testing-with-go-fuzz">Go-fuzz</a>, which looks like it was started a month after this post was written, uses the approach from the proof of concept in this post of combining the sort of logic seen in AFL with a QuickCheck-like framework, and has been shown to be quite effective. I believe David R. MacIver is also planning to use this approach in the next version of <a href="https://hypothesis.readthedocs.org/en/latest/">hypothesis</a>.</em></p> <p>And here's some testing related stuff of mine: <a href="//danluu.com/everything-is-broken/">everything is broken</a>, <a href="//danluu.com/broken-builds/">builds are broken</a>, <a href="//danluu.com/julialang/">julia is broken</a>, and <a href="//danluu.com/new-cpu-features/">automated bug finding using analytics</a>.</p> <h3 id="terminology">Terminology</h3> <p>I use the term random testing a lot, in a way that I'm used to using it among hardware folks. I probably mean something broader than what most software folks mean when they say random testing. For example, <a href="http://www.sqlite.org/testing.html">here's how sqlite describes their testing</a>. There's one section on fuzz (random) testing, but it's much smaller than the sections on, say, I/O error testing or OOM testing. But as a hardware person, I'd also put I/O error testing or OOM testing under random testing because I'd expect to use randomly generated tests to test those.</p> <h4 id="acknowledgments">Acknowledgments</h4> <p><small> I've gotten great feedback from a lot of software folks! Thanks to Leah Hanson, Mindy Preston, Allison Kaptur, Lindsey Kuper, Jamie Brandon, John Regehr, David Wragg, and Scott Feeney for providing comments/discussion/feedback. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S"><p>This footnote is a total tangent about a particular hardware test framework! You may want to skip this!</p> <p>SixthSense does a really good job of generating smart tests. It takes as input, some unit or collection of units (with assertions), some checks on the outputs, and some constraints on the inputs. If you don't give it any constraints, it assumes that any input is legal.</p> <p>Then it runs for a while. For units with<em>out</em> “too much” state, it will either find a bug or tell you that it formally proved that there are no bugs. For units with “too much” state, it's still pretty good at finding bugs, using some combination of random simulation and exhaustive search.</p> <p><img src="images/testing/sixth_sense_search.png" alt="Combination of exhaustive search and random execution" width="559" height="151"></p> <p>It can issue formal proofs for units with way too much state to brute force. How does it reduce the state space and determine what it's covered?</p> <p>I basically don't know. There are at least <a href="http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2989">thirty-seven papers on SixthSense</a>. 
Apparently, it uses <a href="http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2990">a combination of combinational rewriting, sequential redundancy removal, min-area retiming, sequential rewriting, input reparameterization, localization, target enlargement, state-transition folding, isomoprhic property decomposition, unfolding, semi-formal search, symbolic simulation, SAT solving with BDDs, induction, interpolation, etc.</a>.</p> <p>My understanding is that SixthSense has had a multi-person team working on it for over a decade. Considering the amount of effort IBM puts into finding hardware bugs, investing tens or hundreds of person years to create a tool like SixthSense is an obvious win for them, but it's not really clear that it makes sense for any software company to make the same investment.</p> <p>Furthermore, SixthSense is really slow by software test standards. Because of the massive overhead involved in simulating hardware, SixthSense actually runs faster than a lot of simple hardware tests normally would, but running SixthSense on a single unit can easily take longer than it takes to run all of the tests on most software projects.</p> <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:G">Among other things, it uses nested <code>if</code> statements instead of <code>&amp;&amp;</code> because go's coverage tool doesn't create separate coverage points for <code>&amp;&amp;</code> and <code>||</code>. <a class="footnote-return" href="#fnref:G"><sup>[return]</sup></a></li> <li id="fn:Q">Ok, you're slightly worse off due to the overhead of generating and looking at coverage stats, but that's pretty small for most non-trivial programs. <a class="footnote-return" href="#fnref:Q"><sup>[return]</sup></a></li> <li id="fn:A"><p>This is another long, skippable, footnote. This difference in attitude also changes how people try to write correct software. I've had &quot;<a href="//danluu.com/tests-v-reason/">testing is hopelessly inadequate….(it) can be used very effectively to show the presence of bugs but never to show their absence.</a>&quot; quoted at me tens of times by software folks, along with an argument that we have to reason our way out of having bugs. But the attitude of most hardware folks is that while the back half of that statement is true, testing (and, to some extent, formal verification) is the least bad way to assure yourself that something is probably free of bad bugs.</p> <p>This is even true not just on a macro level, but on a micro level. When I interned at Micron in 2003, I worked on flash memory. I read &quot;the green book&quot;, and the handful of papers that were new enough that they weren't in the green book. After all that reading, it was pretty obvious that we (humans) didn't understand all of the mechanisms behind the operation and failure modes of flash memory. There were plausible theories about the details of the exact mechanisms, but proving all of them was still an open problem. Even one single bit of flash memory was beyond human understanding. And yet, we still managed to build reliable flash devices, despite building them out of incompletely understood bits, each of which would eventually fail due to some kind of random (in the quantum sense) mechanism.</p> <p>It's pretty common for engineering to advance faster than human understanding of the underlying physics. 
When you work with devices that aren't understood and assembly them to create products that are too complex for any human to understand or for any known technique to formally verify, there's no choice but to rely on testing. With software, people often have the impression that it's possible to avoid relying on testing because it's possible to just understand the whole thing.</p> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> </ol> </div> What happens when you load a URL? navigate-url/ Sat, 07 Mar 2015 00:00:00 +0000 navigate-url/ <p>I've been hearing this question a lot lately, and when I do, it reminds me how much I don't know. Here are some questions this question brings to mind.</p> <p></p> <ol> <li>How does a keyboard work? Why can’t you press an arbitrary combination of three keys at once, except on fancy gaming keyboards? That implies something about how key presses are detected/encoded.</li> <li>How are keys debounced? Is there some analog logic, or is there a microcontroller in the keyboard that does this, or what? How do membrane switches work?</li> <li>How is the OS notified of the keypress? I could probably answer this for a 286, but nowadays it's somehow done through x2APIC, right? How does that work?</li> <li>Also, USB, PS/2, and AT keyboards are different, somehow? How does USB work? And what about laptop keyboards? Is that just a USB connection?</li> <li>How does a USB connector work? You have this connection that can handle 10Gb/s. That surely won't work if there's any gap at all between the physical doodads that are being connected. How do people design connectors that can withstand tens of thousands of insertions and still maintain their tolerances?</li> <li>How does the OS tell the program something happened? How does it know which program to talk to?</li> <li>How does the browser know to try to load a webpage? I guess it sees an &quot;http://&quot; or just assumes that anything with no prefix is a URL?</li> <li>Assume we don't have the webpage cached, so we have to do DNS queries and stuff.</li> <li>How does DNS work? How does DNS caching work? Let's assume it isn't cached at anywhere nearby and we have to go find some far away DNS server.</li> <li>TCP? We establish a connection? Do we do that for DNS or does it have to be UDP?</li> <li>How does the OS decide if an outgoing connection should be allowed? What if there's a software firewall? How does that work?</li> <li>For TCP, without TLS/SSL, we can just do slow-start followed by some standard congestion protocol, right? Is there some deeper complexity there?</li> <li>One level down, how does a network card work?</li> <li>For what matter, how does the network card know what to do? Is there a memory region we write to that the network card can see or does it just monitor bus transactions directly?</li> <li>Ok, say there's a memory region. How does that work? How do we write memory?</li> <li>Some things happen in the CPU/SoC! This is one of the few areas where I know something, so, I'll skip over that. A signal eventually comes out on some pins. What's that signal? Nowadays, people use DDR3, but we didn't always use that protocol. Presumably DDR3 lets us go faster than DDR2, which was faster than DDR, and so on, but why?</li> <li>And then the signal eventually goes into a DRAM module. 
As with the CPU, I'm going to mostly ignore what's going on inside, but I'm curious if DRAM modules still either trench capacitors or stacked capacitors, or has this technology moved on?</li> <li>Going back to our network card, what happens when the signal goes out on the wire? Why do you need a cat5 and not a cat3 cable for 100Mb Ethernet? Is that purely a signal integrity thing or do the cables actually have different wiring?</li> <li>One level below that the wires are surely long enough that they can act like transmission lines / waveguides. How is termination handled? Is twisted pair sufficient to prevent inductive coupling or is there more fancy stuff going on?</li> <li>Say we have a local Ethernet connection to a cable modem. How do cable modems work? Isn't cable somehow multiplexed between different customers? How is it possible to get so much bandwidth through a single coax cable?</li> <li>Going back up a level, the cable connection eventually gets to the ISP. How does the ISP know where to route things? How does internet routing work? Some bits in the header decide the route? How do routing tables get adjusted?</li> <li>Also, the 8.8.8.8 DNS stuff is anycast, right? How is that different from routing &quot;normal&quot; traffic? Ditto for anything served from a Cloudflare CDN. What do they need to do to prevent route flapping and other badness?</li> <li>What makes anycast hard enough to do that very few companies use it?</li> <li>IIRC, the Stanford/Coursera algorithms course mentioned that it's basically a distributed Bellman-Ford calculation. But what prevents someone from putting bogus routes up?</li> <li>If we can figure out where to go our packets go from our ISP through some edge router, some core routers, another edge router, and then go through their network to get into the “meat” of a datacenter.</li> <li>What's the difference between core and edge routers?</li> <li>At some point, our connection ends up going into fiber. How does that happen?</li> <li>There must be some kind of laser. What kind? How is the signal modulated? Is it WDM or TDM? Is it single-mode or multi-mode fiber?</li> <li>If it's WDM, how is it muxed/demuxed? It would be pretty weird to have a prism in free space, right? This is the kind of thing an AWG could do. Is that what's actually used?</li> <li>There must be repeaters between links. How do repeaters work? Do they just boost the signal or do they decode it first to avoid propagating noise? If the latter, there must be DCF between repeaters.</li> <li>Something that just boosts the signal is the simplest case. How does an EDFA work? Is it basically just running current through doped fiber, or is there something deeper going on there?</li> <li>Below that level, there's the question of how standard single mode fiber and DCF work.</li> <li>Why do we need DCF, anyway? I guess it's cheaper to have a combination of standard fiber and DCF than to have fiber with very low dispersion. Why is that?</li> <li>How does fiber even work? I mean, ok, it's probably a waveguide that uses different dielectrics to keep the light contained, but what's the difference between good fiber and bad fiber?</li> <li>For example, hasn't fiber changed over the past couple decades to severely reduce PMD? How is that possible? 
Is that just more precise manufacturing, or is there something else involved?</li> <li>Before PMD became a problem and was solved, there were decades of work that went into increasing fiber bandwidth, vaguely analogous to the way there were decades of work that went into increasing processor performance but also completely different. What was that work and what were the blockers that work was clearing? You'd have to actually know a good deal about fiber engineering to answer this, and I don't.</li> <li>Going back up a few levels, we go into a datacenter. What's up there? Our packets go through a switching network to TOR to machine? What's a likely switch topology? <a href="https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/">Facebook's</a> isn't quite something straight out of <a href="http://www.amazon.com/gp/product/0122007514/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0122007514&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=DNSHYGZA2USRTHCU">Dally and Towles</a>, but it's the kind of thing you could imagine building with that kind of knowledge. It hasn't been long enough since FB published their topology for people to copy them, but is the idea obvious enough that you'd expect it to be independently &quot;copied&quot;?</li> <li>Wait, is that even right? Should we expect a DNS server to sit somewhere in some datacenter?</li> <li>In any case, after all this, our DNS query resolves to an IP. We establish a connection, and then what?</li> <li>HTTP GET? How are HTTP 1.0 and 1.1 different? 2.0?</li> <li>And then we get some files back and the browser has to render them somehow. There's a request for the HTML and also for the CSS and js, and separate requests for images? This must be complicated, since browsers are complicated. I don't have any idea of the complexity of this, so there must be a lot I'm missing.</li> <li>After the browser renders something, how does it get to the GPU and what does the GPU do?</li> <li>For 2d graphics, we probably just notify the OS of... something. How does that work?</li> <li>And how does the OS talk to the GPU? Is there some memory mapped region where you can just paint pixels, or is it more complicated than that?</li> <li>How does an LCD display work? How does the connection between the monitor and the GPU work?</li> <li>VGA is probably the simplest possibility. How does that work?</li> <li>If it's a static site, I guess we're done?</li> <li>But if the site has ads, isn't that stuff pretty complicated? How do targeted ads and ad auctions work? A bunch of stuff somehow happens in maybe 200ms?</li> </ol> <p>Where can I get answers to this stuff<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">1</a></sup>? That's not a rhetorical question! <a href="https://twitter.com/danluu">I'm really interested in hearing about other resources</a>!</p> <p><a href="https://github.com/alex/what-happens-when">Alex Gaynor set up a GitHub repo that attempts to answer this entire question</a>. It answers some of the questions, and has answers to some questions it didn't even occur to me to ask, but it's missing answers to the vast majority of these questions.</p> <p>For high-level answers, here's <a href="http://www.html5rocks.com/en/tutorials/internals/howbrowserswork/">Tali Garsiel and Paul Irish on how a browser works</a> and <a href="http://pyvideo.org/video/1677/how-the-internet-works">Jessica McKellar on how the Internet works</a>. 
For how a simple OS does things, <a href="http://pdos.csail.mit.edu/6.828/2014/xv6.html">Xv6</a> has good explanations. For how Linux works, <a href="http://duartes.org/gustavo/blog/">Gustavo Duarte has a series of explanations here</a><a href="http://www.linusakesson.net/programming/tty/index.php">For TTYs, this article by Linus Akesson is a nice supplement</a> to Duarte's blog.</p> <p>One level down from that, <a href="http://www.jmarshall.com/easy/http/">James Marshall has a concise explanation of HTTP 1.0 and 1.1</a>, and <a href="http://www.sans.org/reading-room/whitepapers/protocols/ssl-tls-beginners-guide-1029">SANS has an old but readable guide on SSL and TLS</a>. <a href="https://url.spec.whatwg.org/">This isn't exactly smooth prose, but this spec for URLs explains in great detail what a URL is</a>.</p> <p>Going down another level, <a href="https://technet.microsoft.com/en-us/library/cc786128.aspx">MS TechNet has an explanation of TCP</a>, which also includes a short explanation of UDP.</p> <p>One more level down, <a href="http://www.informit.com/articles/printerfriendly/21320">Kyle Cassidy has a quick primer on Ethernet</a>, <a href="http://arstechnica.com/gadgets/2011/07/ethernet-how-does-it-work/">Iljitsch van Beijnum has a lengthier explanation with more history</a>, and <a href="http://www.ciscopress.com/articles/printerfriendly/357103">Matthew J Castelli has an explanation of LAN switches</a>. And then we have <a href="http://support.usr.com/support/6000/6000-ug/two.html">DOCSIS and cable modems</a>. <a href="http://www.olson-technology.com/AppNotes/long-haul-communications-systems.pdf">This gives a quick sketch of how long haul fiber is set up</a>, but there must be a better explanation out there somewhere. And here's <a href="//danluu.com/new-cpu-features/">a quick sketch of modern CPUs</a>. <a href="http://www.waitingforfriday.com/index.php/C64_VICE_Front-End">For an answer to the keyboard specific questions, Simon Inns explains keypress decoding and why you can't press an arbitrary combination of keys on a keyboard</a>.</p> <p>Down one more level, <a href="http://electriciantraining.tpub.com/14182/">this explains how wires work</a>, <a href="https://www.nanog.org/meetings/nanog57/presentations/Monday/mon.tutorial.Steenbergen.Optical.39.pdf">Richard A. Steenbergen explains fiber</a>, and <a href="http://www.amazon.com/gp/product/0201543931/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0201543931&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=MXV5K7IJXWXJD446">Pierret</a> <a href="http://www.amazon.com/gp/product/013061792X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=013061792X&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=N7YLCANVDPI3M35R">explains</a> transistors.</p> <p>P.S. As an interview question, this is pretty much the antithesis of the <a href="http://sockpuppet.org/blog/2015/03/06/the-hiring-post/">tptacek strategy</a>. From what I've seen, my guess is that tptacek-style interviews are much better filters than open ended questions like this.</p> <p><small></p> <p>Thanks to Marek Majkowski, Allison Kaptur, Mindy Preston, Julia Evans, Marie Clemessy, and Gordon P. Hemsley for providing answers and links to resources with answers! 
Also, thanks to Julia Evans and Sumana Harihareswara for convincing me to turn these questions into a blog post.</p> <p></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:E">I mostly don't have questions about stuff that happens inside a PC listed, but I'm pretty curious about how modern high-speed busses work and how high-speed chips deal with the massive inductance they must have to deal with getting signals to and from the chip. <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> </ol> </div> Goodhearting IQ, cholesterol, and tail latency percentile-latency/ Thu, 05 Mar 2015 00:00:00 +0000 percentile-latency/ <p>Most real-world problems are big enough that you can't just head for the end goal, you have to break them down into smaller parts and set up intermediate goals. For that matter, most games are that way too. “Win” is too big a goal in chess, so you might have a subgoal like <a href="http://www.chesstactics.org/index.php?Type=page&amp;Action=none&amp;From=2,1,1,1">don't get forked</a>. While creating subgoals makes intractable problems tractable, it also creates the problem of determining the relative priority of different subgoals and whether or not a subgoal is relevant to the ultimate goal at all. In chess, there are libraries worth of books written on just that.</p> <p>And chess is really simple compared to a lot of real world problems. 64 squares. 32 pieces. Pretty much any analog problem you can think of contains more state than chess, and so do a lot of discrete problems. Chess is also relatively simple because you can directly measure whether or not you succeeded (won). Many real-world problems have the additional problem of not being able to measure your goal directly.</p> <p></p> <h2 id="iq-early-childhood-education">IQ &amp; Early Childhood Education</h2> <p>In 1962, what's now known as the Perry Preschool Study started in Ypsilanti, a blue-collar town near Detroit. It was a randomized trial, resulting in students getting either no preschool or two years of free preschool. After two years, students in the preschool group showed a 15 point bump in IQ scores; other early education studies showed similar results.</p> <p>In the 60s, these promising early results spurred the creation of Head Start, a large scale preschool program designed to help economically disadvantaged children. Initial results from Head Start were also promising; children in the program got a 10 point IQ boost.</p> <p>The next set of results was disappointing. By age 10, the difference in test scores and IQ between the trial and control groups wasn't statistically significant. The much larger scale Head Start study showed similar results; the authors of the <a href="http://eric.ed.gov/?id=ED036321">first major analysis of Head Start</a> concluded that</p> <blockquote> <p>(1) Summer programs are ineffective in producing lasting gains in affective and cognitive development, (2) full-year programs are ineffective in aiding <a href="http://link.springer.com/referenceworkentry/10.1007%2F978-1-4419-1698-3_1718">affective development</a> and only marginally effective in producing lasting cognitive gains, (3) all Head Start children are still considerably below national norms on tests of language development and scholastic achievement, while school readiness at grade one approaches the national norm, and (4) parents of Head Start children voiced strong approval of the program. 
Thus, while full-year Head Start is somewhat superior to summer Head Start, neither could be described as satisfactory.</p> </blockquote> <p>Education in the U.S. isn't cheap, and these early negative results caused calls for reductions in funding and even the abolishment of the program. Turns out, it's quite difficult to cut funding for a program designed to help disadvantaged children, and the program lives on despite repeated calls to cripple or kill the program.</p> <p>Well after the initial calls to shut down Head Start, long-term results started coming in from the Perry preschool study. As adults, people in the experimental (preschool) group were less likely to have been arrested, less likely to have spent time in prison, and more likely to have graduated from high school. Unfortunately, due to methodological problems in the study design, it's not 100% clear where these effects come from. Although the goal was to do a randomized trial, the experimental design necessitated home visits for the experimental group. As a result, children in the experimental group whose mothers were employed swapped groups with children in the control group whose mothers were unemployed. The positive effects on the preschool group could have been caused by having at-home mothers. Since the Head Start studies weren't randomized and using <a href="https://www.aeaweb.org/articles.php?doi=10.1257/jep.15.4.69">instrumental variables</a> (IVs) to tease out causation in “natural experiments” didn't become trendy until relatively recently, it took a long time to get plausible causal results from Head Start.</p> <p>The goal of analyses with an instrumental variable is to extract causation, the same way you'd be able to in a randomized trial. A classic example is determining the effect of putting kids into school a year earlier or later. Some kids naturally start school a year earlier or later, but there are all sorts of factors that can cause that happen, which means that a correlation between an increased likelihood of playing college sports in kids who started school a year later could just as easily be from the other factors that caused kids to start a year later as it could be from actually starting school a year later.</p> <p>However, date of birth can be used as an instrumental variable that isn't correlated with those other factors. For each school district, there's an arbitrary cutoff that causes kids on one side of the cutoff to start school a year later than kids on the other side. With the not-unreasonable assumption that being born one day later doesn't cause kids to be better athletes in college, you can see if starting school a year later seems to have a causal effect on the probability of playing sports in college.</p> <p>Now, back to Head Start. One IV analysis used a funding discontinuity across counties to generate a quasi experiment. The idea is that there are discrete jumps in the level of Head Start funding across regions that are caused by variations in a continuous variable, which gives you something like a randomized trial. Moving 20 feet across the county line doesn't change much about kids or families, but it moves kids into an area with a significant change in Head Start funding.</p> <p>The results of other IV analyses on Head Start are similar. 
Improvements in test scores faded out over time, but there were significant long-term effects on graduation rate (high school and college), crime rate, health outcomes, and other variables that are more important than test scores.</p> <p>There's no single piece of incredibly convincing evidence. The randomized trial has methodological problems, and IV analyses nearly always leave some lingering questions, but the weight of the evidence indicates that even though scores on standardized tests, including IQ tests, aren't improved by early education programs, people's lives are substantially improved by early education programs. However, if you look at the <a href="http://query.nytimes.com/gst/abstract.html?res=9F07E2DE1F39E63ABC4C52DFB2668382679EDE">early commentary on programs like Head Start</a>, there's no acknowledgment that intermediate targets like IQ scores might not perfectly correlate with life outcomes. Instead you see declarations like <a href="http://archives.chicagotribune.com/1969/04/14/page/62/article/head-starts-effect-is-nil-study-shows">“poor children have been so badly damaged in infancy by their lower-class environment that Head Start cannot make much difference”</a>.</p> <p>The funny thing about all this is that it's well known that IQ doesn't correlate perfectly to outcomes. In the range of environments that you see in typical U.S. families, to correlation to outcomes you might actually care about has an r value in the range of .3 to .4. That's incredibly strong for something in the social sciences, but even that incredibly strong statement is a statement IQ isn't responsible for &quot;most&quot; of the effect on real outcomes, even ignoring possible confounding factors.</p> <h2 id="cholesterol-myocardial-infarction">Cholesterol &amp; Myocardial Infarction</h2> <p>There's a long history of population studies showing a correlation between cholesterol levels and an increased risk of heart attack. A number of early studies found that lifestyle interventions that made cholesterol levels more favorable also decreased heart attack risk. And then statins were invented. Compared to older drugs, statins make cholesterol levels dramatically better and have a large effect on risk of heart attack.</p> <p>Prior to the invention of statins, the standard intervention was a combination of diet and pre-statin drugs. There's a lot of literature on this; <a href="http://www.nejm.org/doi/full/10.1056/NEJM199010183231606">here's one typical review</a> that finds, in randomized trials, a combination of dietary changes and drugs has a modest effect on both cholesterol levels and heart attack risk.</p> <p>Given that narrative, it certainly sounds reasonable to try to develop new drugs that improve cholesterol levels, but when Pfizer spent $800 million doing exactly that, developing torcetrapib, they found that they created a drug which <a href="http://en.wikipedia.org/wiki/Torcetrapib">substantially increased heart attack risk despite improving cholesterol levels</a>. Hoffman-La Roche's attempt fared a bit better because it <a href="http://en.wikipedia.org/wiki/Dalcetrapib">improved cholesterol without killing anyone, but it still failed to decrease heart attack risk</a>. <a href="http://en.wikipedia.org/?title=Ezetimibe/simvastatin">Merck</a> and <a href="http://pipeline.corante.com/archives/2010/03/15/tricors_troubles.php">Tricor</a> have also had the same problem.</p> <p>What happened? 
Some interventions that affected cholesterol levels also affected real health outcomes, prompting people to develop drugs that affect cholesterol. But it turns out that improving cholesterol isn't an inherent good, and like many intermediate targets, it's possible to improve <a href="http://pipeline.corante.com/archives/2010/03/15/tricors_troubles.php">without affecting the end goal</a>.</p> <h2 id="99-ile-latency-latency">99%-ile Latency &amp; Latency</h2> <p>It's pretty common to see latency measurements and benchmarks nowadays. It's well understood that poor latency in applications costs you money, as it causes people to stop using the application. It's also well understood that average latency (mean, median, or mode), by itself, isn't a great metric. It's common to use 99%-ile, 99.9%-ile, 99.99%-ile, etc., in order to capture some information about the distribution and make sure that bad cases aren't too bad.</p> <p>What happens when you use the 99%-iles as intermediate targets? If you require 99%-ile latency to be under 0.5 millisec and 99.99% to be under 5 millisecond you might get a latency distribution that looks something like this.</p> <p><img src="images/percentile-latency/tene_latency.png" alt="Latency graph with kinks at 99%-ile, 99.9%-ile, and 99.99%-ile" width="399" height="294"></p> <p>This is a graph of an actual application that Gil Tene has been showing off in <a href="https://www.youtube.com/watch?v=9MKY4KypBzg">his talks about latency</a>. If you specify goals in terms of 99%-ile, 99.9%-ile, and 99.99%-ile, you'll optimize your system to barely hit those goals. Those optimizations will often push other latencies around, resulting in a funny looking distribution that has kinks at those points, with latency that's often nearly as bad as possible everywhere else.</p> <p>It's is a bit odd, but there's nothing sinister about this. If you try a series of optimizations while doing nothing but looking at three numbers, you'll choose optimizations that improve those three numbers, even if they make the rest of the distribution much worse. In this case, latency rapidly degrades above the 99.99%-ile because the people optimizing literally had no idea how much worse they were making the 99.991%-ile when making changes. It's like the video game solving AI that presses pause before its character is about to get killed, because <a href="http://www.cs.cmu.edu/~tom7/mario/">pausing the game prevents its health from decreasing</a>. If you have very narrow optimization goals, and your measurements don't give you any visibility into anything else, everything but your optimization goals is going to get thrown out the window.</p> <p>Since the end goal is usually to improve the user experience and not just optimize three specific points on the distribution, targeting a few points instead of using some kind of weighted integral can easily cause anti-optimizations that degrade the actual user experience, while producing great slideware.</p> <p>In addition to the problem of optimizing just the 99%-ile to the detriment of everything else, there's the question of how to measure the 99%-ile. 
One method of measuring latency, used by multiple commonly used benchmarking frameworks, is to do something equivalent to</p> <pre><code>for (int i = 0; i &lt; NUM; ++i) { auto a = get_time(); do_operation(); auto b = get_time(); measurements[i] = b - a; } </code></pre> <p>If you optimize the 99%-ile of that measurement, you're optimizing the 99%-ile for when all of your users get together and decide to use your app sequentially, coordinating so that no one requests anything until the previous user is finished.</p> <p>Consider a contrived case where you measure for 20 seconds. For the first 10 seconds, each response takes 1ms. For the 2nd 10 seconds, the system is stalled, so the last request takes 10 seconds, resulting in 10,000 measurements of 1ms and 1 measurement of 10s. With these measurements, the 99%-ile is 1ms, as is the 99.9%-ile, for the matter. Everything looks great!</p> <p>But if you consider a “real” system where users just submit requests, uniformly at random, the 75%-ile latency should be &gt;= 5 seconds because if any query comes during the 2nd half, it will get jammed up, for an average of 5 seconds and as much as 10 seconds, in addition to whatever queuing happens because requests get stuck behind other requests.</p> <p>If this example sounds contrived, it is; if you'd prefer a real world example, <a href="http://psy-lob-saw.blogspot.com/2015/03/fixing-ycsb-coordinated-omission.html">see this post by Nitsan Wakart</a>, which finds shows how YCSB (Yahoo Cloud Serving Benchmark) has this problem, and how different the distributions look before and after the fix.</p> <p><img src="images/percentile-latency/ycsb_coordinated.png" alt="Order of magnitude latency differences between YCSB's measurement and the truth" width="640" height="247"></p> <p>The red line is YCSB's claimed latency. The blue line is what the latency looks like after Wakart fixed the coordination problem. There's more than an order of magnitude difference between the original YCSB measurement and Wakart's corrected version.</p> <p>It's important to not only consider the whole distribution, to make make sure you're measuring a distribution that's relevant. Real users, which can be anything from a human clicking something on a web app, to an app that's waiting for an RPC, aren't going to coordinate to make sure they don't submit overlapping requests; they're not even going to obey a uniform random distribution.</p> <h2 id="conclusion">Conclusion</h2> <p>This is the point in a blog post where you're supposed to get the one weird trick that solves your problem. But the only trick is that there is no trick, that you have to constantly check that your map is somehow connected to the territory<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>.</p> <h3 id="resources">Resources</h3> <p><a href="http://aspe.hhs.gov/daltcp/reports/headstar.htm">1990 HHS Report on Head Start</a>. <a href="http://aspe.hhs.gov/daltcp/reports/headstar.htm">2012 Review of Evidence on Head Start</a>.</p> <p><a href="https://www.aeaweb.org/articles.php?doi=10.1257/jep.15.4.69">A short article on instrumental variables</a>. 
<a href="http://www.amazon.com/gp/product/0691120358/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0691120358&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=6WFT6E7PJBSVW47N">A book on econometrics and instrumental variables</a>.</p> <p><a href="https://www.youtube.com/watch?v=XmImGiVuJno">Aysylu Greenberg video on benchmarking pitfalls</a>; it's not latency specific, but it covers a wide variety of common errors. <a href="https://www.youtube.com/watch?v=9MKY4KypBzg">Gil Tene video on latency</a>; covers many more topics than this post. <a href="http://psy-lob-saw.blogspot.com/2015/02/hdrhistogram-better-latency-capture.html">Nitsan Wakart on measuring latency</a>; has code examples and links to libraries.</p> <p><small></p> <h3 id="acknowledgments">Acknowledgments</h3> <p>Thanks to Leah Hanson for extensive comments on this, and to Scott Feeney and Kyle Littler for comments that resulted in minor edits.</p> <p></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">Unless you're in school and your professor likes to give problems where the answers are nice, simple, numbers, maybe the weird trick is that you know you're off track if you get an intermediate answer with a <code>170/23</code> in front of it. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> </ol> </div> AI doesn't have to be very good to displace humans customer-service/ Sun, 15 Feb 2015 00:00:00 +0000 customer-service/ <p>There's an ongoing debate over whether &quot;AI&quot; will ever be good enough to displace humans and, if so, when it will happen. In this debate, the optimists tend to focus on how much AI is improving and the pessimists point to all the ways AI isn't as good as an ideal human being. I think this misses two very important factors.</p> <p>One, is that jobs that are on the potential chopping block, such as first-line customer service, customer service for industries that are either low margin or don't care about the customer, etc., tend to be filled by apathetic humans in a poorly designed system, and <a href="p95-skill/">humans aren't even very good at simple tasks</a> they <a href="bad-decisions/">care a lot about</a>. When we're apathetic, we're absolutely terrible; it's not going to take a nearly-omniscient sci-fi level AI to perform at least somewhat comparably.</p> <p>Two, companies are going to replace humans with AI in many roles even if AI is significantly worse as if the AI is much cheaper. One place this has already happened (though perhaps this software is too basic to be considered an AI) is with phone trees. Phone trees are absolutely terrible compared to the humans they replaced, but they're also orders of magnitude cheaper. Although there are high-margin high-touch companies that won't put you through a phone tree, at most companies, for a customer looking for customer service, a huge number of work hours have been replaced by phone trees, and were then replaced again by phone trees with poor AI voice recognition that I find worse than old school touch pad phone trees. It's not a great experience, and may get much worse when AI automates even more of the process.</p> <p>But on the other hand, here's a not-too-atypical customer service interaction I had last week with a human who was significantly worse than a mediocre AI. I scheduled an appointment for an MRI. The MRI is for a jaw problem which makes it painful to talk. 
I was hoping that the scheduling would be easy, so I wouldn't have to spend a lot of time talking on the phone. But, as is often the case when dealing with bureaucracy, it wasn't easy.</p> <p>Here are the steps it took.</p> <p></p> <ol> <li>Have jaw pain.</li> <li>See dentist. Get referral for MRI when dentist determines that it's likely to be a joint problem.</li> <li>Dentist gets referral form from UW Health, faxes it to them according to the instructions on the form, and emails me a copy of the referral.</li> <li>Call UW Health.</li> <li>UW Health tries to schedule me for an MRI of my pituitary.</li> <li>Ask them to make sure there isn't an error.</li> <li>UW Health looks again and realizes that's a referral for something else. They can't find anything for me.</li> <li>Ask UW Health to call dentist to work it out. UW Health claims they cannot make phone calls.</li> <li>Talk to dentist again. Ask dentist to fax form again.</li> <li>Call UW Health again. Ask them to check again.</li> <li>UW Health says form is illegally filled out.</li> <li>Ask them to call dentist to work it out, again.</li> <li>UW Health says that's impossible.</li> <li>Ask why.</li> <li>UW Health says, “for legal reasons”.</li> <li>Realize that's probably a vague and unfounded fear of HIPAA regulations. Try asking again nicely for them to call my dentist, using different phrasing.</li> <li>UW Health agrees to call dentist. Hangs up.</li> <li>Look at referral, realize that it's actually impossible for someone outside of UW Health (like my dentist) to fill out the form legally given the instructions on the form.</li> <li>Talk to dentist again.</li> <li>Dentist agrees form is impossible, talks to UW Health to figure things out.</li> <li>Call UW Health to see if they got the form.</li> <li>UW Health acknowledges receipt of valid referral.</li> <li>Ask to schedule earliest possible appointment.</li> <li>UW Health isn't sure they can accept referrals from dentists. Goes to check.</li> <li>UW Health determines it is possible to accept a referral from a dentist.</li> <li>UW Health suggests a time on 2/17.</li> <li>I point out that I probably can't make it because of a conflicting appointment, also with UW Health, which I know about because I can see it on my profile with I log into the UW Health online system.</li> <li>UW Health suggests a time on 2/18.</li> <li>I point out another conflict that is in the UW Health system.</li> <li>UW Health starts looking for times on later dates.</li> <li>I ask if there are any other times available on 2/17.</li> <li>UW Health notices that there are other times available on 2/17 and schedules me later on 2/17.</li> </ol> <p>I present this not because it's a bad case, but because it's a representative one<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">1</a></sup>. In this case, my dentist's office was happy to do whatever was necessary to resolve things, but UW Health refused to talk to them without repeated suggestions that talking to my dentist would be the easiest way to resolve things. Even then, I'm not sure it helped much. This isn't even all that bad, since I was able to convince the intransigent party to cooperate. The bad cases are when both parties refuse to talk to each other and both claim that the situation can only be resolved when the other party contacts them, resulting in a deadlock. The good cases are when both parties are willing to talk to each other and work out whatever problems are necessary. 
Having a non-AI phone tree or web app that exposes simple scheduling would be far superior to the human customer service experience here. An AI chatbot that's a light wrapper around the API a web app would use would be worse than being able to use a normal website, but still better than human customer service. An AI chatbot that's more than a just a light wrapper would blow away the humans who do this job for UW Health.</p> <p>The case against using computers instead of humans is that computers are bad at handling error conditions, can't adapt to unusual situations, and behave according to mechanical rules, which can often generate ridiculous outcomes, but that's precisely the situation we're in right now with humans. It already feels like dealing with a computer program. Not a modern computer program, but a compiler from the 80s that tells you that there's at least one error, with no other diagnostic information.</p> <p>UW Health sent a form with impossible instructions to my dentist. That's not great, but it's understandable; mistakes happen. However, when they got the form back and it wasn't correctly filled out, instead of contacting my dentist they just threw it away. Just like an 80s compiler. Error! The second time around, they told me that the form was incorrectly filled out. Error! There was a human on the other end who could have noted that the form was impossible to fill out. But like an 80s compiler, they stopped at the first error and gave it no further thought. This eventually got resolved, but the error messages I got along the way were much worse than I'd expect from a modern program. Clang (and even gcc) give me much better error messages than I got here.</p> <p>Of course, as we saw with healthcare.gov, outsourcing interaction to computers doesn't guarantee good results. <a href="tech-discrimination/">There are some claims that market solutions will automatically fix any problem</a>, but those <a href="https://news.ycombinator.com/item?id=9051049">claims</a> <a href="https://news.ycombinator.com/item?id=5523992">don't</a> <a href="https://news.ycombinator.com/item?id=5208988">always</a> <a href="https://news.ycombinator.com/item?id=453507">work</a> <a href="https://news.ycombinator.com/item?id=4216898">out</a>.</p> <p><img id="dentist" src="images/customer-service/seeking_googler.png" alt="Seeking AdSense Googler. Need AdSense help. My emails remain unanswered. Are you the special Googler who will help?" width="273" height="148"></p> <p>That's an ad someone was running for a few months on Facebook in order to try to find a human at Google to help them because every conventional technique they had at their disposal failed. Google has perhaps the most advanced ML in the world, they're as market driven as any other public company, and they've mostly tried to automate away service jobs like first-level support because support doesn't scale. As a result, the most reliable methods of getting support at Google are</p> <ol> <li>Be famous enough that a blog post or tweet will get enough attention to garner a response.</li> <li>Work at Google or know someone who works at Google and is willing to not only file an internal bug, but to drive it to make sure it gets handled.</li> </ol> <p>If you don't have direct access to one of these methods, running an ad is actually a pretty reasonable solution. 
(1) and (2) don't always work, but they're more effective than not being famous and hoping a blog post will hit HN, or <a href="https://news.ycombinator.com/item?id=526688">being a paying customer</a>. The point here isn't to rag on Google, it's just that automated customer service solutions aren't infallible, even when you've got an AI that can beat the strongest go player in the world and multiple buildings full of people applying that same technology to practical problems.</p> <p>While replacing humans with computers doesn't always create a great experience, good computer based systems for things like scheduling and referrals can already be much better than the average human at a bureaucratic institution<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">2</a></sup>. With the right setup, a computer-based system can be better at escalating thorny problems to someone who's capable of solving them than a human-based system. And computers will only get better at this. There will be bugs. And there will be bad systems. But there are already bugs in human systems. And there are already bad human systems.</p> <p>I'm not sure if, in my lifetime, technology will advance to the point where computers can be as good as helpful humans in a well designed system. But we're already at the point where computers can be as helpful as apathetic humans in a poorly designed system, which describes a significant fraction of service jobs.</p> <h3 id="2023-update">2023 update</h3> <p>When ChatGPT was released in 2022, the debate described above in 2015 happened again, with the same arguments on both sides. People are once again saying that AI (this time, ChatGPT and LLMs) can't replace humans because a great human is better than ChatGPT. They'll often pick a couple examples of ChatGPT saying something extremely silly, &quot;hallucinating&quot;, but if you <a href="https://mastodon.social/@danluu/109458271351706514.">ask a human to explain something, even a world-class expert, they often hallucinate a totally fake explanation as well</a></p> <p>Many people on the pessimist side argued that it would be decades before LLMs can replace humans for the exact reasons we noted were false in 2015. Everyone made this argument <a href="https://mastodon.social/@danluu/109579276227658596">after multiple industries had massive cuts in the number of humans they need to employ due to pre-LLM &quot;AI&quot; automation</a> and many of these people even made this argument after companies had already laid people off and replaced people with LLMs. <a href="https://mastodon.social/@danluu/109458264830891897">I commented on this at the time, using the same reasoning I used in this 2015 post</a> before realizing that <a href="https://mastodon.social/@danluu/109458901726543386">I'd already written down this line of reasoning in 2015</a>. But, cut me some slack; I'm just a human, not a computer, so I have a fallible memory.</p> <p>Now that it's been a year ChatGPT was released, the AI pessimists who argued that LLMs would displace human jobs for a very long time have been proven even more wrong by layoff after layoff where customer service orgs were cut to the bone and mostly replaced by AI, AI customer service seems quite poor, just like human customer service. But human customer service isn't improving, while AI customer service is. 
For example, here are some recent customer service interactions I had as a result of bringing my car in to get the oil changed, rotate the tires, and do a third thing (long story).</p> <ol> <li>I call my local tire shop and oil change place<sup class="footnote-ref" id="fnref:O"><a rel="footnote" href="#fn:O">3</a></sup> and ask if they can do the three things I want with my car</li> <li>They say yes</li> <li>I ask if I can just drop by or if I need to make an appointment</li> <li>They say yes, I can just drop by to get the world done</li> <li>I ask if I can talk to the service manager directly to get some more info</li> <li>After being transferred to the service manager, I describe what I want again and ask when I can come in</li> <li>They say that will take a lot of time and I'll need to make an appointment. They can get me in next week. If I listened to the first guy, I would've had a completely pointless one-hour round trip drive since they couldn't, in fact, do the work I wanted as a drop-in</li> <li>A week later, I bring the car in and talk to someone at the desk, who asks me what I need done</li> <li>I describe what I need and notice that he only writes down about 1/3 of what I said, so I follow up and</li> <li>ask what oil they're going to use</li> <li>The guy says &quot;we'll use the right oil&quot;</li> <li>I tell him that I want 0W-20 synthetic because my car has a service bulletin indicating that this is recommended, which is different from the label on the car, so could they please note this.</li> <li>The guy repeats &quot;we'll use the right oil&quot;.</li> <li>(12) again, with slightly different phrasing</li> <li>(13) again, with slightly different phrasing</li> <li>(12) again, with slightly different phrasing</li> <li>The guy says, &quot;it's all in the computer, the computer has the right oil&quot;.</li> <li>I ask him what oil the computer says to use</li> <li>Annoyed, the guy walks over to the computer and pull up my car, telling me that my car should use 5W-30</li> <li>I tell him that's not right for my vehicle due to the service bulletin and I want 0W-20 synthetic</li> <li>The guy, looking shocked, says &quot;Oh&quot;, and then looks at the computer and says &quot;oh, it says we can also use 0W-20&quot;</li> <li>The guy writes down 0W-20 on the sheet for my car</li> <li>I leave, expecting that the third thing I asked for won't be done or won't completely be done since it wasn't really written down</li> <li>The next day, I pick up my car and they fully didn't do the third thing.</li> </ol> <p>Overall, how does an LLM compare? It's probably significantly better than this dude, who acted like an archetypical stoner who doesn't want to be there and doesn't want to do anything, and the LLM will be cheaper as well. However, the LLM will be worse than a web interface that lets me book the exact work I want and write a note to the tech who's doing the work. 
For better or for worse, I don't think my local tire / oil change place is going to give me a nice web interface that lets me book the exact work I want any time soon, so this guy is going to be replaced by an LLM and not a simple web app.</p> <h3 id="elsewhere">Elsewhere</h3> <ul> <li><a href="https://srconstantin.github.io/2019/02/25/humans-who-are-not-concentrating.html">Sarah Constantin's, 2019: Humans Who Are Not Concentrating Are Not General Intelligences</a></li> <li><a href="https://bsky.app/profile/vickiboykis.com/post/3kewppylblx2g">Vicki Boykis, 2023:</a>: Probably 40-50% of LLM use-cases would be made redundant if just 10-20% of the top sites on the internet had better internal search engines. <ul> <li><a href="https://bsky.app/profile/hillelwayne.com/post/3kex4cjtjeu2v">Hillel Wayne in response</a>: 45-55% if they had bulk searching</li> </ul></li> </ul> <p><small> Thanks to Leah Hanson and Josiah Irwin for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:R">Representative of my experience in Madison, anyway. The absolute worst case of this I encountered in Austin isn't even as bad as the median case I've seen in Madison. YMMV. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:L"><p>I wonder if a deranged version of the <a href="http://en.wikipedia.org/wiki/Law_of_one_price">law of one price</a> applies, the law of one level of customer service. However good or bad an organization is at customer service, they will create or purchase automated solutions that are equally good or bad.</p> <p>At Costco, the checkout clerks move fast and are helpful, so you don't have much reason to use the automated checkout. But then the self-checkout machines tend to be well-designed; they're physically laid out to reduce the time it takes to feed a large volume of stuff through them, and they rarely get confused and deadlock, so there's not much reason not to use them. At a number of other grocery chains, the checkout clerks are apathetic and move slowly, and will make mistakes unless you remind them of what's happening. It makes sense to use self-checkout at those places, except that the self-checkout machines aren't designed particularly well and are often configured so that they often get confused and require intervention from an overloaded checkout clerk.</p> <p>The same thing seems to happen with automated phone trees, as well as both of the examples above. Local Health has an online system to automate customer service, but they went with Epic as the provider, and as a result it's even worse than dealing with their phone support. And it's possible to get a human on the line if you're a customer on some Google products, but that human <a href="http://successfulsoftware.net/2015/03/04/google-bans-hyperlinks/">is often no more helpful</a> <a href="https://news.ycombinator.com/item?id=3803568">than the automated system</a> <a href="http://zoekeating.tumblr.com/post/108898194009/what-should-i-do-about-youtube">you'd otherwise deal with</a>.</p> <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:O">BTW, this isn't a knock against my local tire shop. I used my local tire shop because they're actually above average! 
I've also tried the local dealership, which is fine but super expensive, and a widely recommended independent Volvo specialist, which was much worse — they did sloppy work and missed important issues and were sloppy elsewhere as well; they literally forgot to order parts for the work they were going to do (a mistake an AI probably wouldn't have made), so I had to come back another day to finish the work on my car! <a class="footnote-return" href="#fnref:O"><sup>[return]</sup></a></li> </ol> </div> CPU backdoors cpu-backdoors/ Tue, 03 Feb 2015 00:00:00 +0000 cpu-backdoors/ <p>It's generally accepted that any piece of software could be compromised with a backdoor. Prominent examples include the <a href="http://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootkit_scandal">Sony/BMG installer</a>, which had a backdoor built-in to allow Sony to keep users from copying the CD, which also allowed malicious third-parties to take over any machine with the software installed; the <a href="http://redmine.replicant.us/projects/replicant/wiki/SamsungGalaxyBackdoor">Samsung Galaxy</a>, which has a backdoor that allowed the modem to access the device's filesystem, which also allows anyone running <a href="https://www.youtube.com/watch?v=RXqQioV_bpo">a fake base station to access files on the device</a>; <a href="http://www.cypherspace.org/adam/hacks/lotus-nsa-key.html">Lotus Notes</a>, which had a backdoor which allowed encryption to be defeated; and <a href="https://forums.lenovo.com/t5/Lenovo-P-Y-and-Z-series/Lenovo-Pre-instaling-adware-spam-Superfish-powerd-by/td-p/1726839">Lenovo laptops</a>, which pushed all web traffic through a proxy (including HTTPS, via a trusted root certificate) in order to push ads, which allowed anyone with the correct key (which was distributed on every laptop) to intercept HTTPS traffic.</p> <p>Despite sightings of backdoors in <a href="http://www.cl.cam.ac.uk/~sps32/sec_news.html#Assurance">FPGAs</a> and <a href="https://github.com/elvanderb/TCP-32764">networking gear</a>, whenever someone brings up the possibility of CPU backdoors, it's still common for people to claim that it's <a href="https://news.ycombinator.com/item?id=6147767">impossible</a>. I'm not going to claim that CPU backdoors exist, but I will claim that the implementation is easy, if you've got the right access.</p> <p></p> <p>Let's say you wanted to make a backdoor. How would you do it? There are three parts to this: what could a backdoored CPU do, how could the backdoor be accessed, and what kind of compromise would be required to install the backdoor?</p> <p>Starting with the first item, what does the backdoor do? There are a lot of possibilities. The simplest is to allow privilege escalation: make the CPU to <a href="http://en.wikipedia.org/wiki/Protection_ring">transition from ring3 to ring0 or SMM</a>, giving the running process kernel-level privileges. Since it's the CPU that's doing it, this can punch through both hardware and software virtualization. There are a lot of subtler or more invasive things you could do, but privilege escalation is both simple enough and powerful enough that I'm not going to discuss the other options.</p> <p>Now that you know what you want the backdoor to do, how should it get triggered? Ideally, it will be something that no one will run across by accident, or even by brute force, while looking for backdoors. 
Even with that limitation, the state space of possible triggers is huge.</p> <p>Let's look at a particular instruction, <code>fyl2x</code><sup class="footnote-ref" id="fnref:I"><a rel="footnote" href="#fn:I">1</a></sup>. Under normal operation, it takes two floating point registers as input, giving you <code>2*80=160</code> bits to hide a trigger in. If you trigger the backdoor off of a specific pair of values, that's probably safe against random discovery. If you're really worried about someone stumbling across the backdoor by accident, or brute forcing a suspected backdoor, you can check more than the two normal input registers (after all, you've got control of the CPU).</p> <p>This trigger is nice and simple, but the downside is that hitting the trigger probably requires executing native code since you're unlikely to get chrome or Firefox to emit an <code>fyl2x</code> instruction. You could try to work around that by triggering off an instruction you can easily get a JavaScript engine to emit (like an <code>fadd</code>). The problem with that is that if you patch an add instruction and add some checks to it, it will become noticeably slower (although, if you can edit the hardware, you should be able to do it with no overhead). It might be possible to create something hard to detect that's triggerable through JavaScript by patching a <a href="http://www.csc.depauw.edu/~bhoward/asmtut/asmtut7.html">rep string</a> instruction and doing some stuff to set up the appropriate “key” followed by a block copy, or maybe <code>idiv</code>. Alternately, if you've managed to get a copy of the design, you can probably figure out a way to use debug logic triggers<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">2</a></sup> or performance counters to set off a backdoor when some arbitrary JavaScript gets run.</p> <p>Alright, now you've got a backdoor. How do you insert the backdoor? In software, you'd either edit the source or the <a href="//danluu.com/edit-binary/">binary</a>. In hardware, if you have access to the source, you can edit it as easily as you can in software. The hardware equivalent of recompiling the source, creating physical chips, has tremendously high fixed costs; if you're trying to get your changes into the source, you'll want to either compromise the design<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">3</a></sup> and insert your edits before everything is sent off to get manufactured, or compromise the manufacturing process and sneak in your edits at the last second<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">4</a></sup>.</p> <p>If that sounds too hard, you could try compromising the patch mechanism. Most modern CPUs come with a built-in patch mechanism to allow bug fixes after the fact. It's likely that the CPU you're using has been patched, possibly from day one, and possibly as part of a firmware update. The details of the patch mechanism for your CPU are a closely guarded secret. It's likely that the CPU has a public key etched into it, and that it will only accept a patch that's been signed by the right private key.</p> <p>Is this actually happening? I have no idea. Could it be happening? Absolutely. What are the odds? Well, the primary challenge is non-technical, so I'm not the right person to ask about that. 
If I had to guess, I'd say no, if for no other reason than the <a href="https://github.com/elvanderb/TCP-32764">ease of subverting other equipment</a>.</p> <p><small> I haven't discussed how to make a backdoor that's hard to detect even if someone has access to software you've used to trigger a backdoor. That's harder, but it should be possible once chips start coming with built-in <a href="http://en.wikipedia.org/wiki/Trusted_Platform_Module">TPMs</a>.</p> <p><strong>If you liked this post, you'll probably enjoy <a href="//danluu.com/cpu-bugs/">this post on CPU bugs</a> and might be interested in <a href="//danluu.com/new-cpu-features/">this post about new CPU features over the past 35 years</a>.</strong></p> <h4 id="updates">Updates</h4> <p>See <a href="https://twitter.com/danluu/status/562962211782815746">this twitter thread</a> for much more discussion, some of which is summarized below.</p> <p>I'm not going to provide individual attributions because there are too many comments, but here's a summary of comments from @hackerfantastic, Arrigo Triulzi, David Kanter, @solardiz, @4Dgifts, Alfredo Ortega, Marsh Ray, and Russ Cox. Mistakes are my own, of course.</p> <p><a href="http://www.realworldtech.com/forum/?threadid=35566&amp;curpostid=35566">AMD's K7 and K8 had their microcode patch mechanisms compromised</a>, allowing for the sort of attacks mentioned in this post. Turns out, AMD didn't encrypt updates or validate them with a checksum, which lets you easily modify updates until you get one that does what you want.</p> <p><a href="http://es.slideshare.net/ortegaalfredo/deep-submicronbackdoorsortegasyscan2014slides">Here's an example of a backdoor that was created for demonstration purposes</a>, by Alfredo Ortega.</p> <p>For folks without a hardware background, <a href="http://media.ccc.de/browse/congress/2013/30C3_-_5443_-_en_-_saal_g_-_201312281830_-_introduction_to_processor_design_-_byterazor.html#video">this talk on how to implement a CPU in VHDL is nice, and it has a section on how to implement a backdoor</a>.</p> <p>Is it possible to backdoor RDRAND by providing bad random results? Yes. I mentioned that in my first draft of this post, but I got rid of it since my impression was that people don't trust RDRAND and mix the results other sources of entropy. That doesn't make a backdoor useless, but it significantly reduces the value.</p> <p>Would it be possible to store and dump AES-NI keys? It's probably infeasible to sneak flash memory onto a chip without anyone noticing, but modern chips have logic analyzer facilities that let you store and dump data. However, access to those is through some secret mechanism and it's not clear how you'd even get access to binaries that would let you reverse engineer their operation. That's in stark contrast to the K8 reverse engineering, which was possible because microcode patches get included in firmware updates.</p> <p>It would be possible to check instruction prefixes for the trigger. x86 lets you put redundant (and contradictory) instruction prefixes on instructions. Which prefixes get used are well defined, so you can add as many prefixes as you want without causing problems (up to the prefix length limit). 
The issues with this are that it's probably hard to do without sacrificing performance with a microcode patch, the limited number of prefixes and the length limit mean that your effective key size is relatively small if you don't track state across multiple instructions, and that you can only generate the trigger with native code.</p> <p>As far as anyone knows, this is all speculative, and no one has seen an actual CPU backdoor being used in the wild.</p> <h4 id="acknowledgments">Acknowledgments</h4> <p>Thanks to Leah Hanson for extensive comments, to Aleksey Shipilev and Joe Wilder for suggestions/corrections, and to the many participants in the twitter discussion linked to above. Also, thanks to Markus Siemens for noticing that a bug in some RSS readers was causing problems, and for providing the workaround. That's not really specific to this post, but it happened to come up here. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:I"><p>This choice of instruction is somewhat, but not completely, arbitrary. You'll probably want an instruction that's both slow and microcoded, to make it easy to patch with a microcode patch without causing a huge performance hit. The rest of this footnote is about what it means for an instruction to be microcoded. It's quite long and not in the critical path of this post, so you might want to skip it.</p> <p>The distinction between a microcoded instruction and one that's implemented in hardware is, itself, somewhat arbitrary. CPUs have an instruction set they implement, which you can think of as a public API. Internally, they can execute a different instruction set, which you can think of as a private API.</p> <p>On modern Intel chips, instructions that turn into four (or fewer) uops (private API calls) are translated into uops directly by the decoder. Instructions that result in more uops (anywhere from five to hundreds or possibly thousands) are decoded via a microcode engine that reads uops out of a small ROM or RAM on the CPU. Why four and not five? That's a result of some tradeoffs, not some fundamental truth. The terminology for this isn't standardized, but the folks I know would say that an instruction is “microcoded” if its decode is handled by the microcode engine and that it's “implemented in hardware” if its decode is handled by the standard decoder. The microcode engine is sort of its own CPU, since it has to be able to handle things like reading and writing from temporary registers that aren't architecturally visible, reading and writing from internal RAM for instructions that need more than just a few registers of scratch space, conditional microcode branches that change which microcode the microcode engine fetches and decodes, etc.</p> <p>Implementation details vary (and tend to be secret). But whatever the implementation, you can think of the microcode engine as something that loads a RAM with microcode when the CPU starts up, which then fetches and decodes microcoded instructions out of that RAM. It's easy to modify what microcode gets executed by changing what gets loaded on boot via a microcode patch.</p> <p>For quicker turnaround while debugging, it's somewhere between plausible and likely that Intel also has a mechanism that lets them force non-microcoded instructions to execute out of the microcode RAM in order to allow them to be patched with a microcode patch. 
But even if that's not the case, compromising the microcode patch mechanism and modifying a single microcoded instruction should be sufficient to install a backdoor.</p> <a class="footnote-return" href="#fnref:I"><sup>[return]</sup></a></li> <li id="fn:D">For the most part, these aren't publicly documented, but you can get a high-level overview of what kind of debug triggers Intel was building into their chips a couple generators ago starting at page 128 of <a href="http://click.intel.com/the-tick-tock-beat-of-microprocessor-development-at-intel.html">Intel Technology Journal, Volume 4, Issue 3</a>. <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:A">For the past couple years, there's been a debate over whether or not major corporations have been compromised and whether such a thing is even possible. During the cold war, government agencies on all sides were compromised at various levels for extended periods of time, despite having access to countermeasures not available to any corporations today (not hiring citizens of foreign countries, &quot;enhanced interrogation techniques&quot;, etc.). I'm not sure that we'll ever know if companies are being compromised, but it would certainly be easier to compromise a present-day corporation than it was to compromise government agencies during the cold war, and that was eminently doable. Compromising a company enough to get the key to the microcode patch is trivial compared to what was done during the cold war. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:E"><p>This is another really long footnote about minutia! In particular, it's about the manufacturing process. You might want to skip it! If you don't, don't say I didn't warn you.</p> <p>It turns out that editing chips before manufacturing is fully complete is relatively easy, by design. To explain why, we'll have to look at how chips are made.</p> <p><img src="images/cpu-backdoors/intel_22nm.png" alt="Cross section of Intel chip, 22nm process" width="420" height="440"></p> <p>When you look at a <a href="http://www.intel.com/content/dam/www/public/us/en/documents/presentation/silicon-technology-leadership-presentation.pdf">cross-section of a chip</a>, you see that silicon gates are at the bottom, forming logical primitives like <a href="http://www.nand2tetris.org/">nand gates</a>, with a series of metal layers above (labeled M1 through M8), forming wires that connect different gates. A cartoon model of the manufacturing process is that chips are built from the bottom up, one layer a time, where each layer is created by depositing some material and then etching part of it away using a mask, in a process that's analogous to <a href="http://en.wikipedia.org/wiki/Lithography">lithographic printing</a>. The non-cartoon version involves a lot of complexity -- <a href="https://www.youtube.com/watch?v=NGFhc8R_uO4">Todd Fernendez estimates that it takes about 500 steps to create the layers below “M1”</a>. Additionally, the level of precision needed is high enough that the light used to etch causes enough wear in the equipment that it wears out. You probably don't normally think about lenses wearing out due to light passing through them, but at the level of precision required for each of the hundreds of steps required to make a transistor, it's a serious problem. If that sounds surprising to you, you're not alone. 
An <a href="http://www.itrs.net/">ITRS roadmap</a> from the 90s predicted that by 2016, we'd be at almost 30GHz (higher is better) on a 9nm process (smaller is better), with chips consuming almost 300 watts. Instead, 5 GHz is considered pretty fast, and anyone who isn't Intel will be lucky to get high-yield production on a 14nm process by the start of 2016. Making chips is harder than anyone guessed it would be.</p> <p>A modern chip has enough layers that it takes about three months to make one, from start to finish. This makes bugs very bad news since a bug fix that requires a change to one of the bottom layers takes three months to manufacture. In order to reduce the turnaround time on bug fixes, it's typical to scatter unused logic gates around the silicon, to allow small bug fixes to be done with an edit to a few layers that are near the top. Since chips are made in a manufacturing line process, at any point in time, there are batches of partially complete chips. If you only need to edit one of the top metal layers, you can apply the edit to a partially finished chip, cutting the turnaround time down from months to weeks.</p> <p>Since chips are designed to allow easy edits, someone with access to the design before the chip is manufactured (such as the manufacturer) can make major changes with relatively small edits. I suspect that if you were to make this comment to anyone at a major CPU company, they'd tell you it's impossible to do this without them noticing because it would get caught in characterization or when they were trying to find speed paths or something similar. One would hope, but <a href="http://www.cl.cam.ac.uk/~sps32/sec_news.html#Assurance">actual hardware devices have shipped with backdoors</a>, and either no one noticed, or they were complicit.</p> <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> </ol> </div> Blog monetization blog-ads/ Sat, 24 Jan 2015 00:00:00 +0000 blog-ads/ <p>Does it make sense for me to run ads on my blog? I've been thinking about this lately, since Carbon Ads contacted me about putting an ad up. What are the pros and cons? This isn't a rhetorical question. I'm genuinely <a href="https://twitter.com/danluu">interested in what you think</a>.</p> <p></p> <h3 id="pros">Pros</h3> <h4 id="money">Money</h4> <p>Hey, who couldn't use more money? And it's basically free money. Well, except for the all of the downsides.</p> <h4 id="data">Data</h4> <p>There's lots of studies on the impact of ads on site usage and behavior. But as with any sort of benchmarking, it's not really clear how or if that generalizes to other sites if you don't have a deep understanding of the domain, and I have almost no understanding of the domain. If I run some ads and do some A/B testing I'll get to see what the effect is on my site, which would be neat.</p> <h3 id="cons">Cons</h3> <h4 id="money-1">Money</h4> <p>It's not enough money to make a living off of, and it's never going to be. When Carbon contacted me, they asked me how much traffic I got in the past 30 days. At the time, Google Analytics showed 118k sessions, 94k users, 143k page views. Cloudflare tends to show about 20% higher traffic since 20% of people block Google Analytics, but those 20% plus more probably block ads, so the &quot;real&quot; numbers aren't helpful here. I told them that, but I also told them that those numbers were pretty unusual and that I'd expect to average much less traffic.</p> <p>How much money is that worth? 
I don't know if the CPM (cost per thousand impressions) numbers they gave me are confidential, so I'll just use a current standard figure of $1 CPM. If my traffic continued at that rate, that would be $143/month, or $1,700/year. Ok, that's not too bad.</p> <p><img src="images/blog-ads/ga_traffic.png" alt="Distribution of traffic on this blog. About 500k hits total." width="640" height="142"></p> <p>But let's look at the traffic since I started this blog. I didn't add analytics until after a post of mine got passed around on HN and reddit, so this isn't all of my traffic, but it's close.</p> <p>For one thing, the 143k hits over a 30-day period seems like a fluke. I've never had a calendar month with that much traffic. I just happen to have a traffic distribution which turned up a bunch of traffic over a specific 30-day period.</p> <p>Also, if I stop blogging, as I did from April to October, my traffic level drops to pretty much zero. And even if I keep blogging, it's not really clear what my “natural” traffic level is. Is the level before I paused my blogging the normal level or the level after? Either way, $143/month seems like a good guess for an upper bound. I might exceed that, but I doubt it.</p> <p>For a hard upper bound, let's look at one of the most widely read programming blogs, Coding Horror. Jeff Atwood is nice enough to make his traffic stats available. Thanks Jeff!</p> <p><img src="images/blog-ads/ch_traffic.png" alt="Distribution of traffic on Coding Horror. 1.7M hits in a month at its peak." width="592" height="252"></p> <p>He got 1.7M hits in his best month, and 1.25M wouldn't be a bad month for him, even when he was blogging regularly. With today's CPM rates, that's $1.7k/month at his peak and $1.25k/month for a normal month.</p> <p>But Jeff Atwood blogs about general interest programming topics, like <a href="http://blog.codinghorror.com/the-future-of-markdown/">Markdown</a> and <a href="http://blog.codinghorror.com/why-ruby/">Ruby</a> and I blog about obscure stuff, like <a href="//danluu.com/clwb-pcommit/">why Intel might want to add new instructions to speed up non-volatile storage</a> with the occasional <a href="//danluu.com/empirical-pl/">literature review</a> for variety. There's no way I can get as much traffic as someone who blogs about more general interest topics; I'd be surprised if I could even get within a factor of 2, so $600/month seems like a hard and probably unreachable upper bound for sustainable income.</p> <p>That's not bad. After taxes, that would have approximately covered my rent when I lived in Austin, and could have covered rent + utilities and other expenses if I'd had a roommate. But the wildly optimistic success rate is that you barely cover rent when the programming job market is hot enough that mid-level positions at big companies pay out total compensation that's 8x-9x the median income in the U.S. That's not good.</p> <p>Worse yet, this is getting worse over time. CPM is down something like 5x since the 90s, and continues to decline. Meanwhile, the percentage of people using ad blockers continues to increase.</p> <p>Premium ads can get well over an order of magnitude higher CPM and <a href="http://daringfireball.net/feeds/sponsors/">sponsorships can fetch an ever better return</a>, so the picture might not be quite as bleak as I'm making it out to be. But to get premium ads you need to appeal to specific advertisers. What advertisers are interested in an audience that's mostly programmers with an interest in low-level shenanigans? 
I don't know, and I doubt it's worth the effort to find out unless I can get to Jeff Atwood levels of traffic, which I find unlikely.</p> <h5 id="a-tangent-on-alexa-rankings">A Tangent on Alexa Rankings</h5> <p>What's up with Alexa? Why do so many people use it as a gold standard? In theory, it's supposed to show how popular a site was over the past three months. According to Alexa, Coding Horror is ranked at 22k and I'm at 162k. My understanding is that traffic is more than linear in rank so you'd expect Coding Horror to have substantially more than 7x the traffic that I do. But if you compare Jeff's stats to mine over the past three months (Oct 21 - Jan 21), statcounter claims he's had 78k hits compared to my 298k hits. Even if you assume that traffic is merely linear in Alexa rank, that's a 28x difference in relative traffic between the direct measurement and Alexa's estimate.</p> <p>I'm not claiming that my blog is more popular in any meaningful sense -- if Jeff posted as often as I did in the past three months, I'm sure he'd have at least 10x more traffic than me. But given that Jeff now spends most of his time on non-blogging activities and that his traffic is at the level it's at when he rarely blogs, the Alexa ranks for our sites seem way off.</p> <p>Moreover, the Alexa sub-metrics are inconsistent and nonsensical. Take this graph on the relative proportion of users who use this site from home, school, or work.</p> <p><img src="images/blog-ads/alexa_below_average.png" alt="Relatively below average, at everything!" width="249" height="138"></p> <p>It's below average in every category, which should be impossible for a relative ranking like this. But even mathematical impossibility doesn't stop Alexa!</p> <h4 id="traffic">Traffic</h4> <p>Ads reduce traffic. How much depends both on the site and the ads. I might do a literature review some other time, but for now I'm just going to link to this single result by <a href="http://www.decisionsciencenews.com/2015/01/02/annoying-animated-ads-may-cost-worth-websites/">Daniel G. Goldstein, Siddharth Suri, R. Preston McAfee, Matthew Ekstrand-Abueg, and Fernando Diaz</a> that attempts to quantify the cost.</p> <p>My point isn't that some specific study applies to adding a single ad to my site, but that it's well known that adding ads reduces traffic and has some effect on long-term user behavior, which has some cost.</p> <p>It's relatively easy to quantify the cost if you're looking at something like the study above, which compares “annoying” ads to “good” ads to see what the cost of the “annoying” ads are. It's harder to quantify for a personal blog where the baseline benefit is non-monetary.</p> <p>What do I get out of this blog, anyway? The main benefits I can see are that I've met and regularly correspond with some great people I wouldn't have otherwise met, that I often get good feedback on my ideas, and that every once in a while someone pings me about a job that sounds interesting because they saw a relevant post of mine.</p> <p>I doubt I can effectively estimate the amount of traffic I'll lose, and even if I could, I doubt I could figure out the relationship between that and the value I get out of blogging. My gut says that the value is “a lot” and that the monetary payoff is probably “not a lot”, but it's not clear what that means at the margin.</p> <h4 id="incentives">Incentives</h4> <p>People are influenced by money, even when they don't notice it. I'm people. 
I might do something to get more revenue, even though the dollar amount is small and I wouldn't consciously spend a lot of effort of optimizing things to get an extra $5/month.</p> <p>What would that mean here? Maybe I'd write more blog posts? When I experimented with blurting out blog posts more frequently, with less editing, I got uniformly positive feedback, so maybe being incentivized to write more wouldn't be so bad. But I always worry about unconscious bias and I wonder what other effects running ads might have on me.</p> <h4 id="privacy">Privacy</h4> <p>Ad networks can track people through ads. My impression is that people are mostly concerned with really big companies that have enough information that they could deanonymize people if they were so inclined, like Google and Facebook, but some people are probably also concerned about smaller ad networks like Carbon. Just as an aside, I'm curious if companies that attempt to do lots of tracking, like Tapad and MediaMath actually have more data on people than better known companies like Yahoo and eBay. I doubt that kind of data is publicly available, though.</p> <h4 id="paypal">Paypal</h4> <p>This is specific to Carbon, but they pay out through PayPal, which is notorious for freezing funds for six months if you get enough money that you'd actually want the money, and for pseudo-randomly <a href="http://www.jacquesmattheij.com/PayPal+robbed+my+bank+account">draining your bank account due to clerical errors</a>. I've managed to avoid hooking my PayPal account up to my bank account so far, but I'll have to either do that or get money out through an intermediary if I end up making enough money that I want to withdraw it.</p> <h3 id="conclusion">Conclusion</h3> <p>Is running ads worth it? I don't know. If I had to guess, I'd say no. I'm going to try it anyway because I'm curious what the data looks like, and I'm not going to get to see any data if I don't try something, but it's not like that data will tell me whether or not it was worth it.</p> <p>At best, I'll be able to see a difference in click-through rates on my blog with and without ads. This blog mostly spreads through word of mouth, so what I really want to see is the difference in the rate at which the blog gets shared with other people, but I don't see a good way to do that. I could try globally enabling or disabling ads for months at a time, but the variance between months is so high that I don't know that I'd get good data out of that even if I did it for years.</p> <p><i>Thanks to Anja Boskovic for comments/corrections/discussion.</i></p> <h4 id="update">Update</h4> <p>After running an ads for a while, it looks like about 40% of my traffic uses an ad blocker (whereas about 17% of my traffic blocks Google Analytics). I'm not sure if I should be surprised that the number is so high or that it's so low. On the one hand, 40% is a lot! On the other hand, despite complaints that ad blockers slow down browsers, my experience has been that web pages load a lot faster when I'm blocking ads using the right ad blocker and I don't see any reason not to use an blocker. I'd expect that most of my traffic comes from programmers, who all know that ad blocking is possible.</p> <p>There's the argument that ad blocking is piracy and/or stealing, but I've never heard a convincing case made. 
If anything, I think that some of the people who make that argument step over the line, as when ars technica blocked people who used ad blockers, and then backed off and merely exhorted people to disable ad blocking for their site. I think most people would agree that directly exhorting people to click on ads and commit click fraud is unethical; asking people to disable ad blocking is a difference in degree, not in kind. People who use ad blockers are much less likely to click on ads, so having them disable ad blockers to generate impressions that are unlikely to convert strikes me as pretty similar to having people who aren't interested in the product generate clicks.</p> <p>Anyway, I ended up removing this ad after they failed to send a payment after the first payment. AdSense is rumored to wait until just before payment before cutting people off, to get as many impressions as possible for free, but AdSense at least notifies you about it. Carbon just stopped paying without saying anything, while still running the ad. I could probably ask someone at Carbon or BuySellAds about it, but considering how little the ad is worth, it's not really worth the hassle of doing that.</p> <h4 id="update-2">Update 2</h4> <p>It's been almost two years since I said that I'd never get enough traffic for blogging to be able to cover my living expenses. It turns out that's not true! My reasoning was that I mostly tend to blog about low-level technical topics, which can't possibly generate enough traffic to generate &quot;real&quot; ad revenue. That reason is still as valid as ever, but my blogging is now approximately half low-level technical stuff, and half general-interest topics for programmers.</p> <p><img src="images/blog-ads/2016-month.png" alt="Traffic for one month on this blog in 2016. Roughly 3.1M hits." width="1856" height="878"></p> <p>Here's a graph of my traffic for the past 30 days (as of October 25th, 2016). Since this is Cloudflare's graph of requests, this would wildly overestimate traffic for most sites, because each image and CSS file is one request. However, since the vast majority of my traffic goes to pages with no external CSS and no images, this is pretty close to my actual level of traffic. 15% of the requests are images, and 10% is RSS (which I won't count because the rate of RSS hits is hard to correlation to the rate of actual people reading). But that means that 75% of the traffic appears to be &quot;real&quot;, which puts the traffic into this site at roughly 2.3M hits per month. At a typical $1 ad CPM, that's $2.3k/month, which could cover my share of household expenses.</p> <p>Additionally, when I look at blogs that really try to monetize their traffic, they tend to monetize at a much better rate. For example, <a href="https://web.archive.org/web/20190220011804/https://slatestarcodex.com/advertise/">Slate Star Codex charges $1250 for 6 months of ads and appears to be running 8 ads</a>, for a total of $20k/yr. The author claims to get &quot;10,000 to 20,000 impressions per day&quot;, or roughly 450k hits per month. I get about 5x that much traffic. If we scale that linearly, that might be $100k/yr instead of $20k/yr. One thing that I find interesting is that the ads on Slate Star Codex don't get blocked by my ad blocker. It seems like that's because the author isn't part of some giant advertising program, and ad blockers don't go out of their way to block every set of single-site custom ads out there. 
I'm using Slate Star Codex as an example because I think it's not super ad optimized because I doubt I would optimize my ads much if I ran ads.</p> <p>This is getting to the point where it seems a bit unreasonable not to run ads (I doubt the non-direct value I get out of this blog can consistently exceed $100k/yr). I probably &quot;should&quot; run ads, but I don't think the revenue I get from something like AdSense or Carbon is really worth it, and it seems like a hassle to run my own ad program the way Slate Star Codex does. It seems totally irrational to leave $90k/yr on the table because &quot;it seems like a hassle&quot;, but here we are. I went back and added affiliate code to all of my Amazon links, but if I'm estimating Amazon's payouts correctly, that will amount to less than $100/month.</p> <p>I don't think it's necessarily more irrational than behavior I see from other people -- I regularly talk to people who leave $200k/yr or more on the table by working for startups instead of large companies, and that seems like a reasonable preference to me. They make &quot;enough&quot; money and like things the way they are. What's wrong with that? So why can't not running ads be a reasonable preference? It still feels pretty unreasonable to me, though! A few people have suggested crowdfunding, but the top earning programmers have at least an order of magnitude more exposure than I do and make an order of magnitude less than I could on ads (folks like Casey Muratori, ESR, and eevee are pulling in around $1000/month).</p> <h4 id="update-3">Update 3</h4> <p>I'm now trying <a href="https://www.patreon.com/danluu">donations via Patreon</a>. I suspect this won't work, but I'd be happy to be wrong!</p> What's new in CPUs since the 80s? new-cpu-features/ Sun, 11 Jan 2015 00:00:00 +0000 new-cpu-features/ <p>This is a response to the following question from David Albert:</p> <blockquote> <p>My mental model of CPUs is stuck in the 1980s: basically boxes that do arithmetic, logic, bit twiddling and shifting, and loading and storing things in memory. I'm vaguely aware of various newer developments like vector instructions (SIMD) and the idea that newer CPUs have support for virtualization (though I have no idea what that means in practice).</p> <p>What cool developments have I been missing? What can today's CPU do that last year's CPU couldn't? How about a CPU from two years ago, five years ago, or ten years ago? The things I'm most interested in are things that programmers have to manually take advantage of (or programming environments have to be redesigned to take advantage of) in order to use and as a result might not be using yet. I think this excludes things like Hyper-threading/SMT, but I'm not honestly sure. I'm also interested in things that CPUs can't do yet but will be able to do in the near future.</p> </blockquote> <p></p> <p>Everything below refers to x86 and linux, unless otherwise indicated. History has a tendency to repeat itself, and a lot of things that were new to x86 were old hat to supercomputing, mainframe, and workstation folks.</p> <h3 id="the-present">The Present</h3> <h4 id="miscellania">Miscellania</h4> <p>For one thing, chips have wider registers and can address more memory. In the 80s, you might have used an 8-bit CPU, but now you almost certainly have a <a href="http://courses.cs.washington.edu/courses/csep590/06au/projects/history-64-bit.pdf">64-bit CPU</a> in your machine. 
I'm not going to talk about this too much, since I assume you're familiar with programming a 64-bit machine. In addition to providing more address space, 64-bit mode provides more registers and more consistent floating point results (via the avoidance of pseudo-randomly getting 80-bit precision for 32 and 64 bit operations via x87 floating point). Other things that you're very likely to be using that were introduced to x86 since the early 80s include paging / virtual memory, pipelining, and floating point.</p> <h4 id="esoterica">Esoterica</h4> <p>I'm also going to avoid discussing things that are now irrelevant (like A20M) and things that will only affect your life if you're writing drivers, BIOS code, doing security audits, or other unusually low-level stuff (like APIC/x2APIC, SMM, NX, or <a href="https://eprint.iacr.org/2016/086">SGX</a>).</p> <h4 id="memory-caches">Memory / Caches</h4> <p>Of the remaining topics, the one that's most likely to have a real effect on day-to-day programming is how memory works. My first computer was a 286. On that machine, a memory access might take a few cycles. A few years back, I used a Pentium 4 system where a memory access took more than 400 cycles. Processors have sped up a lot more than memory. The solution to the problem of having relatively slow memory has been to add caching, which provides fast access to frequently used data, and prefetching, which preloads data into caches if the access pattern is predictable.</p> <p>A few cycles vs. 400+ cycles sounds really bad; that's well over 100x slower. But if I write a dumb loop that reads and operates on a large block of 64-bit (8-byte) values, the CPU is smart enough to prefetch the correct data before I need it, which lets me process at about <a href="//danluu.com/assembly-intrinsics/">22 GB/s</a> on my 3GHz processor. A calculation that can consume 8 bytes every cycle at 3GHz only works out to 24GB/s, so getting 22GB/s isn't so bad. We're losing something like 8% performance by having to go to main memory, not 100x.</p> <p>As a first-order approximation, using predictable memory access patterns and operating on chunks of data that are smaller than your CPU cache will get you most of the benefit of modern caches. If you want to squeeze out as much performance as possible, <a href="http://people.freebsd.org/~lstewart/articles/cpumemory.pdf">this document</a> is a good starting point. After digesting that 100 page PDF, you'll want to familiarize yourself with the microarchitecture and memory subsystem of the system you're optimizing for, and learn how to profile the performance of your application with something like <a href="https://code.google.com/p/likwid/">likwid</a>.</p> <h4 id="tlbs">TLBs</h4> <p>There are lots of little caches on the chip for all sorts of things, not just main memory. You don't need to know about the decoded instruction cache and other funny little caches unless you're really going all out on micro-optimizations. The big exception is the TLBs, which are caches for virtual memory lookups (done via a 4-level page table structure on x86). Even if the page tables were in the l1-data cache, that would be 4 cycles per lookup, or 16 cycles to do an entire virtual address lookup each time around. That's totally unacceptable for something that's required for all user-mode memory accesses, so there are small, fast, caches for virtual address lookups.</p> <p>Because the first level TLB cache has to be fast, it's severely limited in size (perhaps 64 entries on a modern chip). 
If you use 4k pages, that limits the amount of memory you can address without incurring a TLB miss. x86 also supports 2MB and 1GB pages; some applications will benefit a lot from using larger page sizes. It's something worth looking into if you've got a long-running application that uses a lot of memory.</p> <p>Also, first-level caches are usually limited by the page size times <a href="//danluu.com/3c-conflict/">the associativity of the cache</a>. If the cache is smaller than that, the bits used to index into the cache are the same regardless if whether you're looking at the virtual address or the physical address, so you don't have to do a virtual to physical translation before indexing into the cache. If the cache is larger than that, you have to first do a TLB lookup to index into the cache (which will cost at least one extra cycle), or build a virtually indexed cache (which is possible, but adds complexity and coupling to software). You can see this limit in modern chips. Haswell has an 8-way associative cache and 4kB pages. Its l1 data cache is <code>8 * 4kB = 32kB</code>.</p> <h4 id="out-of-order-execution-serialization">Out of Order Execution / Serialization</h4> <p>For a couple decades now, x86 chips have been able to speculatively execute and re-order execution (to avoid blocking on a single stalled resource). This sometimes results in <a href="https://docs.google.com/document/d/18gs0bkEwQ5cO8pMXT_MsOa8Xey4NEavXq-OvtdUXKck/pub">odd performance hiccups</a>. But x86 is pretty strict in requiring that, for a single CPU, externally visible state, like registers and memory, must be updated as if everything were executed in order. The implementation of this involves making sure that, for any pair of instructions with a dependency, those instructions execute in the correct order with respect to each other.</p> <p>That restriction that things look like they executed in order means that, for the most part, you can ignore the existence of OoO execution unless you're trying to eke out the best possible performance. The major exceptions are when you need to make sure something not only looks like it executed in order externally, but actually executed in order internally.</p> <p>An example of when you might care would be if you're trying to measure the execution time of a sequence of instructions using <code>rdtsc</code>. <code>rdtsc</code> reads a hidden internal counter and puts the result into <code>edx</code> and <code>eax</code>, externally visible registers.</p> <p>Say we do something like</p> <pre><code>foo rdtsc bar mov %eax, [%ebx] baz </code></pre> <p>where foo, bar, and baz don't touch <code>eax</code>, <code>edx</code>, or <code>[%ebx]</code>. The mov that follows the rdtsc will write the value of <code>eax</code> to some location in memory, and because <code>eax</code> is an externally visible register, the CPU will guarantee that the <code>mov</code> doesn't execute until after <code>rdtsc</code> has executed, so that everything looks like it happened in order.</p> <p>However, since there isn't an explicit dependency between the <code>rdtsc</code> and either <code>foo</code> or <code>bar</code>, the <code>rdtsc</code> could execute before <code>foo</code>, between <code>foo</code> and <code>bar</code>, or after <code>bar</code>. It could even be the case that <code>baz</code> executes before the <code>rdtsc</code>, as long as <code>baz</code> doesn't affect the move instruction in any way. 
There are some circumstances where that would be fine, but it's not fine if the <code>rdtsc</code> is there to measure the execution time of <code>foo</code>.</p> <p>To precisely order the <code>rdtsc</code> with respect to other instructions, we need to an instruction that serializes execution. Precise details on how exactly to do that <a href="http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf">are provided in this document by Intel</a>.</p> <h4 id="memory-concurrency">Memory / Concurrency</h4> <p>In addition to the ordering restrictions above, which imply that loads and stores to the same location can't be reordered with respect to each other, x86 loads and stores have some other restrictions. In particular, for a single CPU, stores are never reordered with other stores, and stores are never reordered with earlier loads, regardless of whether or not they're to the same location.</p> <p>However, loads can be reordered with earlier stores. For example, if you write</p> <pre><code>mov 1, [%esp] mov [%ebx], %eax </code></pre> <p>it can be executed as if you wrote</p> <pre><code>mov [%ebx], %eax mov 1, [%esp] </code></pre> <p>But the converse isn't true — if you write the latter, it can never be executed as if you wrote the former.</p> <p>You could force the first example to execute as written by inserting a serializing instruction. But that requires the CPU to serialize all instructions. But that's slow, since it effectively forces the CPU to wait until all instructions before the serializing instruction are done before executing anything after the serializing instruction. There's also an <code>mfence</code> instruction that only serializes loads and stores, if you only care about load/store ordering.</p> <p>I'm not going to discuss the other memory fences, lfence and sfence, <a href="https://stackoverflow.com/q/20316124/334816">but you can read more about them here</a>.</p> <p>We've looked at single core ordering, where loads and stores are mostly ordered, but there's also multi-core ordering. The above restrictions all apply; if core0 is observing core1, it will see that all of the single core rules apply to core1's loads and stores. However, if core0 and core1 interact, there's no guarantee that their interaction is ordered.</p> <p>For example, say that core0 and core 1 start with <code>eax</code> and <code>edx</code> set to 0, and core 0 executes</p> <pre><code>mov 1, [_foo] mov [_foo], %eax mov [_bar], %edx </code></pre> <p>while core1 executes</p> <pre><code>mov 1, [_bar] mov [_bar], %eax mov [_foo], %edx </code></pre> <p>For both cores, <code>eax</code> has to be <code>1</code> because of the within-core dependency between the first instruction and the second instruction. However, it's possible for <code>edx</code> to be <code>0</code> in both cores because line 3 of core0 can execute before core0 sees anything from core1, and visa versa.</p> <p>That covers memory barriers, which serialize memory accesses within a core. Since stores are required to be seen in a consistent order across cores, they can, they also have an effect on cross-core concurrency, but it's pretty difficult to reason about that kind of thing correctly. <a href="http://yarchive.net/comp/linux/locking.html">Linus has this to say on using memory barriers instead of locking</a>:</p> <blockquote> <p>The <em>real</em> cost of not locking also often ends up being the inevitable bugs. 
Doing clever things with memory barriers is almost always a bug waiting to happen. It's just <em>really</em> hard to wrap your head around all the things that can happen on ten different architectures with different memory ordering, and a single missing barrier. … The fact is, any time anybody makes up a new locking mechanism, THEY ALWAYS GET IT WRONG. Don't do it.</p> </blockquote> <p>And it turns out that on modern x86 CPUs, using locking to implement concurrency primitives is <a href="https://blogs.oracle.com/dave/resource/NHM-Pipeline-Blog-V2.txt">often cheaper than using memory barriers</a>, so let's look at locks.</p> <p>If we set <code>_foo</code> to 0 and have two threads that both execute <code>incl (_foo)</code> 10000 times each, incrementing the same location with a single instruction 20000 times, is guaranteed not to exceed 20000, but it could (theoretically) be as low as 2. If it's not obvious why the theoretical minimum is 2 and not 10000, figuring that out is a good exercise. If it is obvious, my bonus exercise for you is, can any reasonable CPU implementation get that result, or is that some silly thing the spec allows that will never happen? There isn't enough information in this post to answer the bonus question, but I believe I've linked to enough information.</p> <p>We can try this with a simple code snippet</p> <pre><code>#include &lt;stdlib.h&gt; #include &lt;thread&gt; #define NUM_ITERS 10000 #define NUM_THREADS 2 int counter = 0; int *p_counter = &amp;counter; void asm_inc() { int *p_counter = &amp;counter; for (int i = 0; i &lt; NUM_ITERS; ++i) { __asm__(&quot;incl (%0) \n\t&quot; : : &quot;r&quot; (p_counter)); } } int main () { std::thread t[NUM_THREADS]; for (int i = 0; i &lt; NUM_THREADS; ++i) { t[i] = std::thread(asm_inc); } for (int i = 0; i &lt; NUM_THREADS; ++i) { t[i].join(); } printf(&quot;Counter value: %i\n&quot;, counter); return 0; } </code></pre> <p>Compiling the above with <code>clang++ -std=c++11 -pthread</code>, I get the following distribution of results on two of my machines:</p> <p><img src="images/new-cpu-features/write-contention.png" alt="Different distributions of non-determinism on Haswell and Sandy Bridge" width="640" height="480"></p> <p>Not only do the results vary between runs, the distribution of results is different on different machines. We never hit the theoretical minimum of 2, or for that matter, anything below 10000, but there's some chance of getting a final result anywhere between 10000 and 20000.</p> <p>Even though <code>incl</code> is a single instruction, it's not guaranteed to be atomic. Internally, <code>incl</code> is implemented as a load followed by an add followed by an store. It's possible for an increment on cpu0 to sneak in and execute between the load and the store on cpu1 and visa versa.</p> <p>The solution Intel has for this is the <code>lock</code> prefix, which can be added to a handful of instructions to make them atomic. If we take the above code and turn <code>incl</code> into <code>lock incl</code>, the resulting output is always 20000.</p> <p>So, that's how we make a single instruction atomic. To make a sequence atomic, we can use <code>xchg</code> or <code>cmpxchg</code>, which are always locked as compare-and-swap primitives. 
I won't go into detail about how that works, but see <a href="http://davidad.github.io/blog/2014/03/23/concurrency-primitives-in-intel-64-assembly/">this article by David Dalrymple</a> if you're curious..</p> <p>In addition to making a memory transaction atomic, locks are globally ordered with respect to each other, and loads and stores aren't re-ordered with respect to locks.</p> <p>For a rigorous model of memory ordering, <a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.pdf">see the x86 TSO doc</a>.</p> <p>All of this discussion has been how about how concurrency works in hardware. Although there are limitations on what x86 will re-order, compilers don't necessarily have those same limitations. In C or C++, you'll need to insert the appropriate primitives to make sure the compiler doesn't re-order anything. <a href="http://yarchive.net/comp/linux/memory_barriers.html">As Linus points out here</a>, if you have code like</p> <pre><code>local_cpu_lock = 1; // .. do something critical .. local_cpu_lock = 0; </code></pre> <p>the compiler has no idea that <code>local_cpu_lock = 0</code> can't be pushed into the middle of the critical section. Compiler barriers are distinct from CPU memory barriers. Since the x86 memory model is relatively strict, some compiler barriers are no-ops at the hardware level that tell the compiler not to re-order things. If you're using a language that's higher level than microcode, assembly, C, or C++, your compiler probably handles this for you without any kind of annotation.</p> <h4 id="memory-porting">Memory / Porting</h4> <p>If you're porting code to other architectures, it's important to note that x86 has one of the strongest memory models of any architecture you're likely to encounter nowadays. If you write code that just works without thinking it through and port it to architectures that have weaker guarantees (PPC, ARM, or Alpha), you'll almost certainly have bugs.</p> <p>Consider this example:</p> <pre><code>Initial ----- x = 1; y = 0; p = &amp;x; CPU1 CPU2 ---- ---- i = *p; y = 1; MB; p = &amp;y; </code></pre> <p><code>MB</code> is a memory barrier. On an Alpha 21264 system, this can result in <code>i = 0</code>.</p> <p>Kourosh Gharachorloo explains how:</p> <blockquote> <p>CPU2 does y=1 which causes an &quot;invalidate y&quot; to be sent to CPU1. This invalidate goes into the incoming &quot;probe queue&quot; of CPU1; as you will see, the problem arises because this invalidate could theoretically sit in the probe queue without doing an MB on CPU1. The invalidate is acknowledged right away at this point (i.e., you don't wait for it to actually invalidate the copy in CPU1's cache before sending the acknowledgment). Therefore, CPU2 can go through its MB. And it proceeds to do the write to p. Now CPU1 proceeds to read p. The reply for read p is allowed to bypass the probe queue on CPU1 on its incoming path (this allows replies/data to get back to the 21264 quickly without needing to wait for previous incoming probes to be serviced). Now, CPU1 can derefence p to read the old value of y that is sitting in its cache (the invalidate y in CPU1's probe queue is still sitting there).</p> <p>How does an MB on CPU1 fix this? The 21264 flushes its incoming probe queue (i.e., services any pending messages in there) at every MB. Hence, after the read of p, you do an MB which pulls in the invalidate to y for sure. 
And you can no longer see the old cached value for y.</p> <p>Even though the above scenario is theoretically possible, the chances of observing a problem due to it are extremely minute. The reason is that even if you setup the caching properly, CPU1 will likely have ample opportunity to service the messages (i.e., invalidate) in its probe queue before it receives the data reply for &quot;read p&quot;. Nonetheless, if you get into a situation where you have placed many things in CPU1's probe queue ahead of the invalidate to y, then it is possible that the reply to p comes back and bypasses this invalidate. It would be difficult for you to set up the scenario though and actually observe the anomaly.</p> </blockquote> <p>This is long enough without my talking about other architectures so I won't go into detail, but if you're wondering why anyone would create a spec that allows this kind of optimization, consider that before rising fab costs crushed DEC, their chips were so fast that they could <a href="https://www.usenix.org/legacy/publications/library/proceedings/usenix-nt97/full_papers/chernoff/chernoff.pdf">run industry standard x86 benchmarks of real workloads in emulation faster than x86 chips could run the same benchmarks natively</a>. For more explanation of why the most RISC-y architecture of the time made the decisions it did, see <a href="dick-sites-alpha-axp-architecture.pdf">this paper on the motivations behind the Alpha architecture</a>.</p> <p>BTW, this is a major reason I'm skeptical of the Mill architecture. Putting aside arguments about whether or not they'll live up to their performance claims, being technically excellent isn't, in and of itself, a business model.</p> <h4 id="memory-non-temporal-stores-write-combine-memory">Memory / Non-Temporal Stores / Write-Combine Memory</h4> <p>The set of restrictions outlined in the previous section apply to cacheable (i.e., “write-back” or WB) memory. That, itself, was new at one time. Before that, there was only uncacheable (UC) memory.</p> <p>One of the interesting things about UC memory is that all loads and stores are expected to go out to the bus. That's perfectly reasonable in a processor with no cache and little to no on-board buffering. A result of that is that devices that have access to memory can rely on all accesses to UC memory regions creating separate bus transactions, in order (because some devices will use a memory read or write as as trigger to do something). That worked great in 1982, but it's not so great if you have a video card that just wants to snarf down whatever the latest update is. If multiple writes happen to the same UC location (or different bytes of the same word), the CPU is required to issue a separate bus transaction for each write, even though a video card doesn't really care about seeing each intervening result.</p> <p>The solution to that was to create a memory type called write combine (WC). WC is a kind of eventually consistent UC. Writes have to eventually make it to memory, but they can be buffered internally. WC memory also has weaker ordering guarantees than UC.</p> <p>For the most part, you don't have to deal with this unless you're talking directly with devices. The one exception are “non-temporal” load and store operations. These make particular loads and stores act like they're to WC memory, even if the address is in a memory region that's marked WB.</p> <p>This is useful if you don't want to pollute your caches with something. 
This is often useful if you're doing <a href="http://blogs.fau.de/hager/archives/2103">some kind of streaming calculation</a> where you know you're not going to use a particular piece of data more than once.</p> <h4 id="memory-numa">Memory / NUMA</h4> <p>Non-uniform memory access, where memory latencies and bandwidth are different for different processors, is so common that we mostly don't talk about NUMA or ccNUMA anymore because they're so common that it's assumed to be the default.</p> <p>The takeaway here is that threads that share memory should be on the same socket, and a memory-mapped I/O heavy thread should make sure it's on the socket that's closest to the I/O device it's talking to.</p> <p>I've mostly avoided explaining the <em>why</em> behind things because that would make this post at least an order of magnitude longer than it's going to be. But I'll give a vastly oversimplified explanation of why we have NUMA systems, partially because it's a self-contained thing that's relatively easy to explain and partially to demonstrate how long the why is compared to the what.</p> <p>Once upon a time, there was just memory. Then CPUs got fast enough relative to memory that people wanted to add a cache. It's bad news if the cache is inconsistent with the backing store (memory), so the cache has to keep some information about what it's holding on to so it knows if/when it needs to write things to the backing store.</p> <p>That's not too bad, but once you get 2 cores with their own caches, it gets a little more complicated. To maintain the same programming model as the no-cache case, the caches have to be consistent with each other and with the backing store. Because existing load/store instructions have nothing in their API that allows them to say <code>sorry! this load failed because some other CPU is holding onto the address you want</code>, the simplest thing was to have every CPU send a message out onto the bus every time it wanted to load or store something. We've already got this memory bus that both CPUs are connected to, so we just require that other CPUs respond with the data (and invalidate the appropriate cache line) if they have a modified version of the data in their cache.</p> <p>That works ok. Most of the time, each CPU only touches data the other CPU doesn't care about, so there's some wasted bus traffic. But it's not too bad because once a CPU puts out a message saying <code>Hi! I'm going to take this address and modify the data</code>, it can assume it completely owns that address until some other CPU asks for it, which will probably won't happen. And instead of doing things on a single memory address, we can operate on cache lines that have, say, 64 bytes. So, the overall overhead is pretty low.</p> <p>It still works ok for 4 CPUs, although the overhead is a bit worse. But this thing where each CPU has to respond to every other CPU's fails to scale much beyond 4 CPUs, both because the bus gets saturated and because the caches will get saturated (the physical size/cost of a cache is <code>O(n^2)</code> in the number of simultaneous reads and write supported, and the speed is inversely correlated to the size).</p> <p>A “simple” solution to this problem is to have a single centralized directory that keeps track of all the information, instead of doing N-way peer-to-peer broadcast. 
Since we're packing 2-16 cores on a chip now anyway, it's pretty natural to have a single directory per chip (socket) that tracks the state of the caches for every core on a chip.</p> <p>This only solves the problem for each chip, and we need some way for the chips to talk to each other. Unfortunately, while we were scaling these systems up the bus speeds got fast enough that it's really difficult to drive a signal far enough to connect up a bunch of chips and memory all on one bus, even for small systems. The simplest solution to that is to have each socket own a region of memory, so every socket doesn't need to be connected to every part of memory. This also avoids the complexity of needed a higher level directory of directories, since it's clear which directory owns any particular piece of memory.</p> <p>The disadvantage of this is that if you're sitting in one socket and want some memory owned by another socket, you have a significant performance penalty. For simplicity, most “small” (&lt; 128 core) systems use ring-like busses, so the performance penalty isn't just the direct latency/bandwidth penalty you pay for walking through a bunch of extra hops to get to memory, it also uses up a finite resource (the ring-like bus) and slows down other cross-socket accesses.</p> <p>In theory, the OS handles this transparently, but it's <a href="https://www.usenix.org/conference/atc14/technical-sessions/presentation/gaud">often inefficient</a>.</p> <h4 id="context-switches-syscalls">Context Switches / Syscalls</h4> <p>Here, syscall refers to a linux system call, not the <code>SYSCALL</code> or <code>SYSENTER</code> x86 instructions.</p> <p>A side effect of all the caching that modern cores have is that <a href="http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html">context switches are expensive</a>, which causes syscalls to be expensive. Livio Soares and Michael Stumm discuss the cost in great detail <a href="http://www.cs.cmu.edu/~chensm/Big_Data_reading_group/papers/flexsc-osdi10.pdf">in their paper</a>. I'm going to use a few of their figures, below. Here's a graph of how many instructions per clock (IPC) a Core i7 achieves on Xalan, a sub-benchmark from SPEC CPU.</p> <p><img src="images/new-cpu-features/flexsc_ipc_recovery.png" alt="Long tail of overhead from a syscall. 14,000 cycles." width="268" height="147"></p> <p>14,000 cycles after a syscall, code is still not quite running at full speed.</p> <p>Here's a table of the footprint of a few different syscalls, both the direct cost (in instructions and cycles), and the indirect cost (from the number of cache and TLB evictions).</p> <p><img src="images/new-cpu-features/flexsc_cost.png" alt="Cost of stat, pread, pwrite, open+close, mmap+munmap, and open+write+close" width="502" height="108"></p> <p>Some of these syscalls cause 40+ TLB evictions! For a chip with a 64-entry d-TLB, that nearly wipes out the TLB. The cache evictions aren't free, either.</p> <p>The high cost of syscalls is the reason people have switched to using batched versions of syscalls for high-performance code (e.g., <code>epoll</code>, or <code>recvmmsg</code>) and the reason that people who need <a href="//danluu.com/clwb-pcommit/">very high performance I/O</a> often use user space I/O stacks. 
More generally, the cost of context switches is why high-performance code is often thread-per-core (or even single threaded on a pinned thread) and not thread-per-logical-task.</p> <p>This high cost was also the driver behind <a href="http://www.linuxjournal.com/content/creating-vdso-colonels-other-chicken">vDSO, which turns some simple syscalls that don't require any kind of privilege escalation into simple user space library calls</a>.</p> <h4 id="simd">SIMD</h4> <p>Basically all modern x86 CPUs support SSE, 128-bit wide vector registers and instructions. Since it's common to want to do the same operation multiple times, Intel added instructions that will let you operate on a 128-bit chunk of data as 2 64-bit chunks, 4 32-bit chunks, 8 16-bit chunks, etc. ARM supports the same thing with a different name (NEON), and the instructions supported are pretty similar.</p> <p>It's pretty common to get a 2x-4x speedup from using SIMD instructions; it's definitely worth looking into if you've got a computationally heavy workload.</p> <p>Compilers are good enough at recognizing common patterns that can be vectorized that simple code, like the following, will automatically use vector instructions with modern compilers</p> <pre><code>for (int i = 0; i &lt; n; ++i) { sum += a[i]; } </code></pre> <p>But compilers will <a href="//danluu.com/assembly-intrinsics/">often produce non-optimal code</a> if you don't write the assembly by hand, especially for SIMD code, so you'll want to look at the disassembly and check for compiler optimization bugs if you really care about getting the best possible performance.</p> <h4 id="power-management">Power Management</h4> <p>There are a lot of fancy power management feature on modern CPUs that <a href="//danluu.com/datacenter-power/">optimize power usage</a> in different scenarios. The result of these is that “race to idle”, completing work as fast as possible and then letting the CPU go back to sleep is the most power efficient way to work.</p> <p>There's been a lot of work that's shown that specific microoptmizations can benefit power consumption, but <a href="http://arcade.cs.columbia.edu/energy-oopsla14.pdf">applying those microoptimizations on real workloads often results in smaller than expected benefits</a>.</p> <h4 id="gpu-gpgpu">GPU / GPGPU</h4> <p>I'm even less qualified to talk about this than I am about the rest of this stuff. Luckily, Cliff Burdick volunteered to write a section on GPUs, so here it is.</p> <p>Prior to the mid-2000's, Graphical Processing Units (GPUs) were restricted to an API that allowed only a very limited amount of control of the hardware. As the libraries became more flexible, programmers began using the processors for more general-purpose tasks, such as linear algebra routines. The parallel architecture of the GPU could work on large chunks of a matrix by launching hundreds of simultaneous threads. However, the code had to use traditional graphics APIs and was still limited in how much of the hardware it could control. Nvidia and ATI took notice and released frameworks that allowed the user to access more of the hardware with an API familiar with people outside of the graphics industry. The libraries gained popularity, and today GPUs are widely used for high-performance computing (HPC) alongside CPUs.</p> <p>Compared to CPUs, the hardware on GPUs have a few major differences, outlined below:</p> <h5 id="processors">Processors</h5> <p>At the top level, a GPU processor contains one or many streaming multiprocessors (SMs). 
Each streaming multiprocessor on a modern GPU typically contains over 100 floating point units, or what are typically referred to as cores in the GPU world. Each core is typically clocked around 800MHz, although, like CPUs, processors with higher clock rates but fewer cores are also available. GPU processors lack many features of their CPU counterparts, including large caches and branch prediction. Between the layers of cores, SMs, and the overall processor, communicating becomes increasingly slower. For this reason, problems that perform well on GPUs are typically highly-parallel, but have some amount of data that can be shared between a small number of threads. We'll get into why this is in the memory section below.</p> <h5 id="memory">Memory</h5> <p>Memory on modern GPU is broken up into 3 main categories: global memory, shared memory, and registers. Global memory is the GDDR memory that's advertised on the box of the GPU and is typically around 2-12GB in size, and has a throughput of 300-400GB/s. Global memory can be accessed by all threads across all SMs on the processor, and is also the slowest type of memory on the card. Shared memory is, as the name says, memory that's shared between all threads within the same SM. It is usually at least twice as fast as global memory, but is not accessible between threads on different SMs. Registers are much like registers on a CPU in that they are the fastest way to access data on a GPU, but they are local per thread and the data is not visible to any other running thread. Both shared memory and global memory have very strict rules on how they can be accessed, with severe performance penalties for not following them. To reach the throughputs mentioned above, memory accesses must be completely coalesced between threads within the same thread group. Similar to a CPU reading into a single cache line, GPUs have cache lines sized so that a single access can serve all threads in a group if aligned properly. However, in the worst case where all threads in a group access memory in a different cache line, a separate memory read will be required for each thread. This usually means that most of the data in the cache line is not used by the thread, and the usable throughput of the memory goes down. A similar rule applies to shared memory as well, with a couple exceptions that we won't cover here.</p> <h5 id="threading-model">Threading Model</h5> <p>GPU threads run in a SIMT (Single Instruction Multiple Thread) fashion, and each thread runs in a group with a pre-defined size in the hardware (typically 32). That last part has many implications; every thread in that group must be working on the same instruction at the same time. If any of the threads in a group need to take a divergent path (an if statement, for example) of code from the others, all threads not part of the branch suspend execution until the branch is complete. As a trivial example:</p> <pre><code>if (threadId &lt; 5) { // Do something } // Do More </code></pre> <p>In the code above, this branch would cause 27 of our 32 threads in the group to suspend execution until the branch is complete. You can imagine if many groups of threads all run this code, the overall performance will take a large hit while most of the cores sit idle. Only when an entire group of threads is stalled is the hardware allowed to swap in another group to run on those cores.</p> <h5 id="interfaces">Interfaces</h5> <p>Modern GPUs must have a CPU to copy data to and from CPU and GPU memory, and to launch and code on the GPU. 
At the highest throughput, a PCIe 3.0 bus with 16 lanes can achieves rates of about 13-14GB/s. This may sound high, but when compared to the memory speeds residing on the GPU itself, they're over an order of magnitude slower. In fact, as GPUs get more powerful, the PCIe bus is increasingly becoming a bottleneck. To see any of the performance benefits the GPU has over a CPU, the GPU must be loaded with a large amount of work so that the time the GPU takes to run the job is significantly higher than the time it takes to copy the data to and from.</p> <p>Newer GPUs have features to launch work dynamically in GPU code without returning to the CPU, but it's fairly limited in its use at this point.</p> <h5 id="gpu-conclusion">GPU Conclusion</h5> <p>Because of the major architectural differences between CPUs and GPUs, it's hard to imagine either one replacing the other completely. In fact, a GPU complements a CPU well for parallel work and allows the CPU to work independently on other tasks as the GPU is running. AMD is attempting to merge the two technologies with their &quot;Heterogeneous System Architecture&quot; (HSA), but taking existing CPU code and determining how to split it between the CPU and GPU portion of the processor will be a big challenge not only for the processor, but for compilers as well.</p> <h4 id="virtualization">Virtualization</h4> <p>Since you mentioned virtualization, I'll talk about it a bit, but Intel's implementation of virtualization instructions generally isn't something you need to think about unless you're writing very low-level code that directly deals with virtualization.</p> <p>Dealing with that stuff is pretty messy, as you can see <a href="https://github.com/vishmohan/vmlaunch">from this code</a>. Setting stuff up to use Intel's VT instructions to launch a VM guest is about 1000 lines of low-level code, even for the very simple case shown there.</p> <h4 id="virtual-memory">Virtual Memory</h4> <p>If you look at Vish's VT code, you'll notice that there's a decent chunk of code dedicated to page tables / virtual memory. That's another “new” feature that you don't have to worry about unless you're writing an OS or other low-level systems code. Using virtual memory is much simpler than using segmented memory, but that's not relevant nowadays so I'll just leave it at that.</p> <h4 id="smt-hyper-threading">SMT / Hyper-threading</h4> <p>Since you brought it up, I'll also mention SMT. As you said, this is mostly transparent for programmers. A typical speedup for enabling SMT on a single core is around 25%. That's good for overall throughput, but it means that each thread might only get 60% of its original performance. For applications where you care a lot about single-threaded performance, you might be better off disabling SMT. It depends a lot on the workload, though, and as with any other changes, you should run some benchmarks on your exact workload to see what works best.</p> <p>One side effect of all this complexity that's been added to chips (and software) is that performance is a lot less predictable than it used to be; the relative importance of benchmarking your exact workload on the specific hardware it's going to run on has gone up.</p> <p>Just for example, people often point to benchmarks from the <a href="http://benchmarksgame.alioth.debian.org/">Computer Languages Benchmarks Game</a> as evidence that one language is faster than another. 
I've tried reproducing the results myself, and on my mobile Haswell (as opposed to the server Kentsfield that's used in the results), I get results that are different by as much as 2x (in relative speed). Running the same benchmark on the same machine, Nathan Kurz recently pointed me to an example where <code>gcc -O3</code> is 25% slower than <code>gcc -O2</code>. <a href="http://www-plan.cs.colorado.edu/klipto/mytkowicz-asplos09.pdf">Changing the linking order on C++ programs can cause a 15% performance change</a>. Benchmarking is a hard problem.</p> <h4 id="branches">Branches</h4> <p>Old school conventional wisdom is that branches are expensive, and should be avoided at all (or most) costs. On a Haswell, the branch misprediction penalty is 14 cycles. Branch mispredict rates depend on the workload. Using <code>perf stat</code> on a few different things (bzip2, top, mysqld, regenerating my blog), I get branch mispredict rates of between 0.5% and 4%. If we say that a correctly predicted branch costs 1 cycle, that's an average cost of between <code>.995 * 1 + .005 * 14 = 1.065 cycles</code> to <code>.96 * 1 + .04 * 14 = 1.52 cycles</code>. That's not so bad.</p> <p>This actually overstates the penalty since about 1995, since Intel added conditional move instructions that allow you to conditionally move data without a branch. This instruction was memorably <a href="http://yarchive.net/comp/linux/cmov.html">panned by Linus</a>, which has given it a bad reputation, but it's fairly common to <a href="https://github.com/logicchains/LPATHBench/issues/53#issuecomment-68160081">get significant speedups using cmov compared to branches</a></p> <p>A real-world example of the cost of extra branches are enabling integer overflow checks. When using <a href="//danluu.com/integer-overflow/">bzip2 to compress a particular file, that increases the number of instructions by about 30% (with all of the increase coming from extra branch instructions), which results in a 1% performance hit</a>.</p> <p>Unpredictable branches are bad, but most branches are predictable. Ignoring the cost of branches until your profiler tells you that you have a hot spot is pretty reasonable nowadays. CPUs have gotten a lot better at executing poorly optimized code over the past decade, and compilers are getting better at optimizing code, which makes optimizing branches a poor use of time unless you're trying to squeeze out the absolute best possible performance out of some code.</p> <p>If it turns out that's what you need to do, you're likely to be better off using <a href="http://en.wikipedia.org/wiki/Profile-guided_optimization">profile-guided optimization</a> than trying to screw with this stuff by hand.</p> <p>If you really must do this by hand, there are compiler directives you can use to say whether a particular branch is likely to be taken or not. Modern CPUs ignore branch hint instructions, but they can help the compiler lay out code better.</p> <h4 id="alignment">Alignment</h4> <p>Old school conventional wisdom is that you should pad out structs and make sure things are aligned. But on a Haswell chip, the mis-alignment for almost any single-threaded thing you can think of that doesn't cross a page boundary is zero. There are some cases where it can make a difference, but in general, this is another type of optimization that's mostly irrelevant because CPUs have gotten so much better at executing bad code. 
It's also mildly harmful in cases where it increases the memory footprint for no benefit.</p> <p>Also, <a href="//danluu.com/3c-conflict/">don't make things page aligned or otherwise aligned to large boundaries or you'll destroy the performance of your caches</a>.</p> <h4 id="self-modifying-code">Self-modifying code</h4> <p>Here's another optimization that doesn't really make sense anymore. Using self-modifying code to decrease code size or increase performance used to make sense, but because modern chips split their L1 instruction and data caches, modifying running code requires expensive communication between a chip's L1 caches.</p> <h3 id="the-future">The Future</h3> <p>Here are some possible changes, from least speculative to most speculative.</p> <h4 id="partitioning">Partitioning</h4> <p>It's now obvious that more and more compute is moving into large datacenters. Sometimes this involves running on VMs, sometimes it involves running in some kind of container, and sometimes it involves running bare metal, but in any case, individual machines are often multiplexed to run a wide variety of workloads. Ideally, you'd be able to schedule best-effort workloads to soak up stranded resources without affecting latency-sensitive workloads with an SLA. It turns out that you can actually do this with some <a href="//danluu.com/intel-cat/">relatively straightforward hardware changes</a>.</p> <p><img src="images/new-cpu-features/google_heracles.png" alt="90% overall machine utilization" width="338" height="209"></p> <p><a href="http://csl.stanford.edu/~christos/publications/2015.heracles.isca.pdf">David Lo et al. were able to show that you can get about 90% machine utilization without impacting latency SLAs</a> if caches can be partitioned such that best-effort workloads don't impact latency-sensitive workloads. The solid red line is the load on a normal Google web search cluster, and the dashed green line is what you get with the appropriate optimizations. From bar-room conversations, my impression is that the solid red line is actually already better (higher) than most of Google's competitors are able to do. If you compare the optimized 90% utilization to typical server utilization of 10% to 90%, that's a massive difference in cost per unit of work compared to running a naive, unoptimized setup. With substantial hardware effort, Google was able to avoid interference, but additional isolation features could allow this to be done at higher efficiency with less effort.</p> <h4 id="transactional-memory-and-hardware-lock-elision">Transactional Memory and Hardware Lock Elision</h4> <p>IBM already has these features in their POWER chips. Intel made an attempt to add these to Haswell, but they're disabled because of a bug. In general, modern CPUs are quite complex <a href="//danluu.com/cpu-bugs/">and we should expect to see many more bugs than we used to</a>.</p> <p>Transactional memory support is what it sounds like: hardware support for transactions. This comes via three new instructions, <code>xbegin</code>, <code>xend</code>, and <code>xabort</code>.</p> <p><code>xbegin</code> starts a new transaction. A conflict (or an <code>xabort</code>) causes the architectural state of the processor (including memory) to get rolled back to the state it was in just prior to the <code>xbegin</code>. If you're using transactional memory via library or language support, this should be transparent to you.
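If you're curious what the raw interface looks like, here's a minimal sketch (my own example, untested on real TSX hardware) using the RTM intrinsics that GCC and other compilers expose for these instructions via <code>immintrin.h</code> (compiled with <code>-mrtm</code>); the standard idiom is to try a transaction and fall back to an ordinary lock if it aborts:</p> <pre><code>#include &lt;immintrin.h&gt;  // _xbegin / _xend / _xabort; compile with -mrtm
#include &lt;atomic&gt;

static std::atomic&lt;bool&gt; fallback_locked{false};
static long counter;

static void lock_fallback() {
    while (fallback_locked.exchange(true, std::memory_order_acquire)) {
        // spin until the lock is free
    }
}

static void unlock_fallback() {
    fallback_locked.store(false, std::memory_order_release);
}

void increment() {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Standard lock-elision idiom: read the fallback lock inside the
        // transaction so that anyone who takes it aborts us.
        if (fallback_locked.load(std::memory_order_relaxed)) {
            _xabort(0xff);
        }
        counter++;
        _xend();  // commit; a conflict would have rolled us back instead
    } else {
        // Aborted (conflict, capacity, interrupt, lock held, ...): take the lock.
        lock_fallback();
        counter++;
        unlock_fallback();
    }
}
</code></pre> <p>The fallback path is what makes this usable in practice, since the hardware never guarantees that any given transaction will eventually commit.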
If you're implementing the library support, you'll have to figure out how to convert this hardware support, with its limited hardware buffer sizes, to something that will handle arbitrary transactions.</p> <p>I'm not going to discuss Hardware Lock Elision except to say that, under the hood, it's implemented with mechanisms very similar to those used to implement transactional memory, and that it's designed to speed up lock-based code. If you want to take advantage of HLE, see <a href="http://mcg.cs.tau.ac.il/papers/amir-levy-msc.pdf">this doc</a>.</p> <h4 id="fast-i-o">Fast I/O</h4> <p>I/O bandwidth is going up and I/O latencies are going down, both for storage and for networking. The problem is that I/O is normally done via syscalls. As we've seen, the relative overhead of syscalls has been going up. For both storage and networking, the answer is to move to <a href="http://research.microsoft.com/pubs/198366/Asplos2012_MonetaD.pdf">user mode I/O stacks</a> (putting everything in kernel mode would work, too, but that's a harder sell). On the storage side, that's mostly still a weirdo research thing, but HPC and HFT folks have been doing that in networking for a while. And by a while, I don't mean a few months. Here's <a href="http://web.stanford.edu/group/comparch/papers/huggahalli05.pdf">a paper from 2005 that talks about the networking stuff I'm going to discuss, as well as some stuff I'm not going to discuss (DCA)</a>.</p> <p>This is finally trickling into the non-supercomputing world. MS has been advertising Azure instances with InfiniBand networking and virtualized RDMA for over a year, Cloudflare <a href="http://blog.cloudflare.com/a-tour-inside-cloudflares-latest-generation-servers/">has talked about using Solarflare NICs to get the same capability</a>, etc. Eventually, we're going to see SoCs with fast Ethernet onboard, and unless that's limited to Xeon-type devices, it's going to trickle down into all devices. Competition among ARM device makers will probably cause at least one of them to put fast Ethernet on their commodity SoCs, which may force Intel's hand.</p> <p>That RDMA bit is significant; it lets you bypass the CPU completely and have the NIC respond to remote requests. A couple months ago, I worked through the Stanford/Coursera Mining Massive Data Sets class. During one of the first lectures, they provide an example of a “typical” datacenter setup with 1Gb top-of-rack switches. That's not unreasonable for processing “massive” data if you're doing kernel TCP through non-RDMA NICs, since you can floor an entire core trying to push 1Gb/s through Linux's TCP stack. But with Azure, MS talks about getting 40Gb out of a single machine; that's one machine getting 40x the bandwidth of what you might expect out of an entire rack. They also mention sub-2us latencies, which are multiple orders of magnitude lower than what you can get out of kernel TCP. This isn't exactly a new idea. <a href="http://www.scs.stanford.edu/~rumble/papers/latency_hotos11.pdf">This paper from 2011</a> predicts everything that's happened on the network side so far, along with some things that are still a ways off.</p> <p><a href="https://www.youtube.com/watch?v=8Kyoj3bKepY&amp;feature=youtu.be&amp;t=20m8s">This MS talk discusses how you can take advantage of this kind of bandwidth and latency</a> for network storage. A concrete example that doesn't require clicking through to a link is Amazon's EBS. It lets you use an “elastic” disk of arbitrary size on any of your AWS nodes.
Since a spinning metal disk seek has higher latency than an RPC over kernel TCP, you can get effectively unlimited storage pretty much transparently. For example, say you can get <a href="http://www.scs.stanford.edu/~rumble/papers/latency_hotos11.pdf">100us (.1ms) latency out of your network</a>, and your disk seek time is 8ms. That makes a remote disk access 8.1ms instead of 8ms, which isn't that much overhead. That doesn't work so well with SSDs, though, since you can get 20 us (.02ms) <a href="http://www.anandtech.com/show/8104/intel-ssd-dc-p3700-review-the-pcie-ssd-transition-begins-with-nvme/3">out of an SSD</a>. But RDMA latency is low enough that a transparent EBS-like layer is possible for SSDs.</p> <p>So that's networked I/O. The performance benefit might be even bigger on the disk side, if/when next generation storage technologies that are faster than flash start getting deployed. The performance delta is so large that <a href="//danluu.com/clwb-pcommit/">Intel is adding new instructions to keep up with next generation low-latency storage technology</a>. Depending on who you ask, that stuff has been a few years away for a decade or two; this is more iffy than the networking stuff. But even with flash, people are showing off devices that can get down into the single microsecond range for latency, which is a substantial improvement.</p> <h4 id="hardware-acceleration">Hardware Acceleration</h4> <p>Like fast networked I/O, this is already here in some niches. <a href="http://www.deshawresearch.com/publications.html">DESRES has been doing ASICs to get 100x-1000x speedup in computational chemistry for years</a>. <a href="http://research.microsoft.com/apps/pubs/default.aspx?id=212001">Microsoft has talked about speeding up search with FPGAs</a>. People have been <a href="http://web.eecs.umich.edu/~twenisch/papers/isca13.pdf">looking into accelerating memcached and similar systems for a while</a>, <a href="http://csl.stanford.edu/~christos/publications/2014.hwkvs.nvmw.slides.pdf">researchers from Toshiba and Stanford demonstrated a real implementation a while back</a>, and I recently saw a pre-print out of Berkeley on the same thing. There are multiple companies making Bitcoin mining ASICs. That's also true for <a href="http://www.neuflow.org/">other</a> <a href="http://www.artificiallearning.com/products/">application</a> <a href="https://github.com/cambridgehackers/">areas</a>.</p> <p>It seems like we should see more of this as it gets harder to get power/performance gains out of CPUs. You might consider this a dodge of your question, if you think of programming as being a software-oriented endeavor, but another way to look at it is that what it means to program something will change. In the future, it might mean designing hardware like an FPGA or ASIC in combination with writing software.</p> <h5 id="update">Update</h5> <p>Now that it's 2016, one year after this post was originally published, we can see that companies are investing in hardware accelerators. In addition to its previous work on FPGA-accelerated search, Microsoft <a href="https://www.youtube.com/watch?v=RffHFIhg5Sc&amp;feature=youtu.be&amp;t=23m30s">has announced that it's using FPGAs to accelerate networking</a>. Google has been closed-mouthed about infrastructure, as is typical for them, but if you look at the initial release of Tensorflow, you can see snippets of code that clearly reference FPGAs, such as:</p> <pre><code>enum class PlatformKind {
  kInvalid,
  kCuda,
  kOpenCL,
  kOpenCLAltera,  // Altera FPGA OpenCL platform.
                  // See documentation: go/fpgaopencl
                  // (StreamExecutor integration)
  kHost,
  kMock,
  kSize,
};
</code></pre> <p>and</p> <pre><code>string PlatformKindString(PlatformKind kind) {
  switch (kind) {
    case PlatformKind::kCuda:
      return &quot;CUDA&quot;;
    case PlatformKind::kOpenCL:
      return &quot;OpenCL&quot;;
    case PlatformKind::kOpenCLAltera:
      return &quot;OpenCL+Altera&quot;;
    case PlatformKind::kHost:
      return &quot;Host&quot;;
    case PlatformKind::kMock:
      return &quot;Mock&quot;;
    default:
      return port::StrCat(&quot;InvalidPlatformKind(&quot;, static_cast&lt;int&gt;(kind), &quot;)&quot;);
  }
}
</code></pre> <p>As of this writing, Google doesn't return any results for <code>+google +kOpenCLAltera</code>, so it doesn't appear that this has been widely observed. If you're not familiar with Altera OpenCL and you work at Google, you can try the internal go link suggested in the comment, <code>go/fpgaopencl</code>. If, like me, you don't work at Google, well, <a href="https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html">there's Altera's docs here</a>. The basic idea is that you can take OpenCL code, the same kind of thing you might run on a GPU, and run it on an FPGA instead, and from the comment, it seems like Google has some kind of setup that lets you stream data in and out of nodes with FPGAs.</p> <p>That FPGA-specific code was removed in ddd4aaf5286de24ba70402ee0ec8b836d3aed8c7, which has a commit message that starts with “TensorFlow: upstream changes to git.” and then has a list of internal Google commits that are being upstreamed, along with a description of each internal commit. Curiously, there's nothing about removing FPGA support even though that seems like it's a major enough thing that you'd expect it to be described, unless it was purposely redacted. Amazon has also been quite secretive about their infrastructure plans, but you can make reasonable guesses there by looking at the hardware people they've been vacuuming up. A couple other companies are also betting pretty heavily on hardware accelerators, but since I learned about that through private conversations (as opposed to accidentally published public source code or other public information), I'll leave you to guess which companies.</p> <h4 id="dark-silicon-socs">Dark Silicon / SoCs</h4> <p>One funny side effect of the way transistor scaling has turned out is that we can pack a ton of transistors on a chip, but they generate so much heat that the average transistor can't switch most of the time if you don't want your chip to melt.</p> <p>A result of this is that it makes more sense to include dedicated hardware that isn't used a lot of the time. For one thing, this means we get all sorts of specialized instructions like the <a href="http://www.felixcloutier.com/x86/PCMPESTRI.html">PCMP</a> and <a href="http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf">ADX</a> instructions. But it also means that we're getting chips with entire devices integrated that would have previously lived off-chip. That includes things like GPUs and (for mobile devices) radios.</p> <p>In combination with the hardware acceleration trend, it also means that it makes more sense for companies to design their own chips, or at least parts of their own chips. Apple has gotten a lot of mileage out of acquiring PA Semi: first by adding little custom accelerators to bog-standard ARM architectures, and then by adding custom accelerators to their own custom architecture.
Due to a combination of the right custom hardware plus well-thought-out benchmarking and system design, the iPhone 4 is slightly more responsive than my flagship Android phone, which is multiple years newer and has a much faster processor as well as more RAM.</p> <p>Amazon has picked up a decent chunk of the old Calxeda team and is hiring enough to create a good-sized hardware design team. Facebook has picked up a small handful of ARM SoC folks and is partnering with Qualcomm on something-or-other. <a href="http://www.realworldtech.com/forum/?threadid=146066&amp;curpostid=146227">Linus is on record as saying we're going to see more dedicated hardware all over the place</a>. And so on and so forth.</p> <h3 id="conclusion">Conclusion</h3> <p>x86 chips have picked up a lot of new features and whiz-bang gadgets. For the most part, you don't have to know what they are to take advantage of them. As a first-order approximation, making your code predictable and keeping memory locality in mind works pretty well. The really low-level stuff is usually hidden by libraries or drivers, and compilers will try to take care of the rest of it. The exceptions are if you're writing really low-level code, in which case <a href="https://github.com/vishmohan/vmlaunch/blob/master/vmlaunch_simple.c">the world has gotten a lot messier</a>, or if you're trying to get the absolute best possible performance out of your code, in which case <a href="https://docs.google.com/document/d/18gs0bkEwQ5cO8pMXT_MsOa8Xey4NEavXq-OvtdUXKck/pub">the world has gotten a lot weirder</a>.</p> <p>Also, things will happen in the future. But most predictions are wrong, so who knows?</p> <h3 id="resources">Resources</h3> <p><a href="https://www.youtube.com/watch?v=hgcNM-6wr34">This is a talk by Matt Godbolt</a> that covers a lot of the implementation details that I don't get into. To go down one more level of detail, see <a href="http://www.amazon.com/gp/product/1478607831/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1478607831&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=HXBNFFJ3CIXMWUZP">Modern Processor Design, by Shen and Lipasti</a>. Despite the date listed on Amazon (2013), the book is pretty old, but it's still the best book I've found on processor internals. It describes, in good detail, what you need to implement to make a P6-era high-performance CPU. It also derives theoretical performance limits given different sets of assumptions and talks about a lot of different engineering tradeoffs, with explanations of the reasoning behind a lot of them.</p> <p>For one level deeper of &quot;why&quot;, you'll probably need to look at a VLSI text, which will explain how devices and interconnect scale and how that affects circuit design, which in turn affects architecture. I really like <a href="http://www.amazon.com/gp/product/B008VIXPI2/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=B008VIXPI2&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=YOJVAWVH5XTIF6LN">Weste &amp; Harris</a> because they have clear explanations and good exercises with solutions that you can find online, but if you're not going to work the problems pretty much any VLSI text will do. To go one level deeper still into the &quot;why&quot; of things, you'll want a solid state devices text and something that explains how transmission lines and interconnect work.
For devices, I really like <a href="http://www.amazon.com/gp/product/0201543931/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0201543931&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=MXV5K7IJXWXJD446">Pierret's</a> <a href="http://www.amazon.com/gp/product/013061792X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=013061792X&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=N7YLCANVDPI3M35R">books</a>. I got introduced to the E-mag stuff through <a href="http://www.amazon.com/gp/product/8126515252/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=8126515252&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=XZUQXDRDJVABGITB">Ramo, Whinnery &amp; Van Duzer</a>, but <a href="http://www.amazon.com/gp/product/3319078054/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=3319078054&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=GTQSYCVTMIFZKSPZ">Ida</a> is a better intro text.</p> <p>For specifics about current generation CPUs and specific optimization techniques, <a href="http://www.agner.org/optimize/">see Agner Fog's site</a>. For something on <a href="//danluu.com/perf-tracing/">optimization tools from the future, see this post</a>. <a href="http://www.akkadia.org/drepper/cpumemory.pdf">What Every Programmer Should Know About Memory</a> is also good background knowledge. Those docs cover a lot of important material, but if you're writing in a higher-level language there <a href="http://mrale.ph/blog/2015/01/11/whats-up-with-monomorphism.html">are a lot of other things you need to keep in mind</a>. For more on Intel CPU history, Xiao-Feng Li <a href="https://people.apache.org/~xli/presentations/history_Intel_CPU.pdf">has a nice overview</a>.</p> <p>For something a bit off the wall, see <a href="//danluu.com/cpu-backdoors/">this post on the possibility of <b>CPU backdoors</b></a>. For something less off the wall, see <a href="//danluu.com/cpu-bugs/">this post on how the complexity we have in modern CPUs enables <b>all sorts of exciting bugs</b></a>.</p> <p>For more benchmarks on locking, see <a href="http://shipilev.net/blog/2014/on-the-fence-with-dependencies/">this post by Aleksey Shipilev</a>, <a href="http://www.pvk.ca/Blog/2015/01/13/lock-free-mutual-exclusion/">this post by Paul Khuong</a>, as well as their archives.</p> <p>For general benchmarking, last year's <a href="https://www.youtube.com/watch?v=XmImGiVuJno">Strange Loop benchmarking talk by Aysylu Greenberg</a> is a nice intro to common gotchas. For something more advanced but more specific, <a href="https://www.youtube.com/watch?v=9MKY4KypBzg">Gil Tene's talk on latency</a> is great.</p> <p>For historical computing that predates everything I've mentioned by quite some time, see <a href="http://www.amazon.com/gp/product/0262523930/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0262523930&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=NB4TSSXVLME77X4B">IBM's Early Computers</a> and <a href="http://ygdes.com/CDC/DesignOfAComputer_CDC6600.pdf">Design of a Computer</a>, which describes the design of the CDC 6600.
<a href="http://www.amazon.com/gp/product/1558605398/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1558605398&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=MCCNUSAAP4B6NZAP">Readings in Computer Architecture</a> is also good for seeing where a lot of these ideas originally came from.</p> <p>Sorry, this list is pretty incomplete. Suggestions welcome!</p> <p><small></p> <h4 id="tiny-disclaimer">Tiny Disclaimer</h4> <p>I have no doubt that I'm leaving stuff out. <a href="https://twitter.com/danluu/">Let me know</a> if I'm leaving out anything you think is important and I'll update this. I've tried to keep things as simple as possible while still capturing the flavor of what's going on, but I'm sure that there are some cases where I'm oversimplifying, and some things that I just completely forgot to mention. And of course basically every generalization I've made is wrong if you're being <em>really</em> precise. Even just picking at my first couple sentences, A20M isn't always and everywhere irrelevant (I've probably spent about 0.2% of my career dealing with it), x86-64 isn't strictly superior to x86 (on one workload I had to deal with, the performance benefit from the extra registers was more than canceled out by the cost of the longer instructions; it's pretty rare that the instruction stream and icache misses are the long pole for a workload, but it happens), etc. The biggest offender is probably in my NUMA explanation, since it is actually possible for P6 busses to respond with a defer or retry to a request. It's reasonable to avoid using a similar mechanism to enforce coherency but I couldn't think of a reasonable explanation of why that didn't involve multiple levels of explanations. I'm really not kidding when I say that pretty much every generalization falls apart if you dig deep enough. Every abstraction I'm talking about is leaky. I've tried to include links to docs that go at least one level deeper, but I'm probably missing some areas.</p> <h4 id="acknowledgments">Acknowledgments</h4> <p>Thanks to Leah Hanson and Nathan Kurz for comments that results in major edits, and to Nicholas Radcliffe, Stefan Kanthak, Garret Reid, Matt Godbolt, Nikos Patsiouras, Aleksey Shipilev, and Oscar R Moll for comments that resulted in minor edits, and to David Albert for allowing me to quote him and also for some interesting follow-up questions when we talked about this a while back. Also, thanks for Cliff Burdick for writing the section on GPUs and for Hari Angepat for spotting the Google kOpenCLAltera code in TensorFlow. </small></p> <hr /> A review of the Julia language julialang/ Sun, 28 Dec 2014 00:00:00 +0000 julialang/ <p>Here's a language that gives near-C performance that feels like Python or Ruby with optional type annotations (that you can feed to <a href="https://github.com/tonyhffong/Lint.jl">one of two</a> <a href="https://github.com/astrieanna/TypeCheck.jl">static analysis tools</a>) that has good support for macros plus decent-ish support for FP, plus a lot more. What's not to like? I'm mostly not going to talk about how great Julia is, though, because you can find plenty of blog posts that do that all over the internet.</p> <p>The last time I used Julia (around Oct. 2014), I ran into two new (to me) bugs involving bogus exceptions when processing Unicode strings. To work around those, I used a try/catch, but of course that runs into a non-deterministic bug I've found with try/catch. 
I also hit a bug where a function returned a completely wrong result if you passed it an argument of the wrong type instead of throwing a &quot;no method&quot; error. I spent half an hour writing a throwaway script and ran into four bugs in the core language.</p> <p>The second to last time I used Julia, I ran into too many bugs to list; the worst of them made generating plots take 30 seconds per plot, which caused me to switch to R/ggplot2 for plotting. First, there was <a href="https://github.com/dcjones/Gadfly.jl/issues/462">this bug where plotting dates didn't work</a>. When I worked around that I ran into a regression that caused plotting to <a href="https://github.com/JuliaLang/julia/issues/8631">break large parts of the core language</a>, so that data manipulation had to be done before plotting. That would have been fine if I knew exactly what I wanted, but for exploratory data analysis I want to plot some data, do something with the data, and then plot it again. Doing that required restarting the REPL for each new plot. That would have been fine, except that it takes 22 seconds to load Gadfly on my 1.7GHz Haswell (timed by using <code>time</code> on a file that loads Gadfly and does no work), plus another 10-ish seconds to load the other packages I was using, turning my plotting workflow into: restart REPL, wait 30 seconds, make a change, make a plot, look at a plot, repeat.</p> <p>It's not unusual to run into bugs when using a young language, but Julia has more than its share of bugs for something at its level of maturity. If you look at the test process, that's basically inevitable.</p> <p>As far as I can tell, FactCheck is the most commonly used thing resembling a modern test framework, and it's barely used. Until quite recently, it was unmaintained and broken, but even now the vast majority of tests are written using <code>@test</code>, which is basically an assert. It's theoretically possible to write good tests by having a file full of test code and asserts. But in practice, anyone who's doing that isn't serious about testing and isn't going to write good tests.</p> <p>Not only are existing tests not very good, most things aren't tested at all. You might point out that the coverage stats for a lot of packages aren't so bad, but last time I looked, there was a bug in the coverage tool that caused it to only aggregate coverage statistics for functions with non-zero coverage. That is to say, code in untested functions doesn't count towards the coverage stats! That, plus the weak notion of test coverage that's used (line coverage<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">1</a></sup>), makes the coverage stats unhelpful for determining if packages are well tested.</p> <p>The lack of testing doesn't just mean that you run into regression bugs. Features just disappear at random, too. When the REPL got rewritten, a lot of existing shortcut keys and other features stopped working. As far as I can tell, that wasn't because anyone wanted it to work differently. It was because there's no way to re-write something that isn't tested without losing functionality.</p> <p>Something that goes hand-in-hand with the level of testing on most Julia packages (and the language itself) is the lack of a good story for error handling. Although you can easily use <code>Nullable</code> (the Julia equivalent of <code>Some/None</code>) or error codes in Julia, the most common idiom is to use exceptions.
And if you use things in <code>Base</code>, like arrays or <code>/</code>, you're stuck with exceptions. I'm <a href="http://blogs.msdn.com/b/oldnewthing/archive/2005/01/14/352949.aspx">not a fan</a>, but that's fine -- plenty of reliable software uses exceptions for error handling.</p> <p>The problem is that because the niche Julia occupies doesn't care<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">2</a></sup> about error handling, it's extremely difficult to write a robust Julia program. When you're writing smaller scripts, you often want to “fail-fast” to make debugging easier, but for some programs, you want the program to do something reasonable, keep running, and maybe log the error. It's hard to write a robust program, even for this weak definition of robust. There are problems at multiple levels. For the sake of space, I'll just list two.</p> <p>If I'm writing something I'd like to be robust, I really want function documentation to include all exceptions the function might throw. Not only do the Julia docs not have that, it's common to call some function and get a random exception that has to do with an implementation detail and nothing to do with the API. Everything I've written that actually has to be reliable has been exception free, so maybe that's normal when people use exceptions? Seems pretty weird to me, though.</p> <p>Another problem is that catching exceptions doesn't work (sometimes, at random). I ran into one bug where using exceptions caused code to be incorrectly optimized out. You might say that's not fair because it was <a href="https://github.com/danluu/Fuzz.jl">caught using a fuzzer</a>, and fuzzers are supposed to find bugs, but the fuzzer wasn't fuzzing exceptions or even expressions. The implementation of the fuzzer just happens to involve eval'ing function calls, in a loop, with a <code>try/catch</code> to handle exceptions. Turns out, if you do that, the function might not get called. This isn't a case of using a fuzzer to generate billions of tests, one of which failed. This was a case of trying one thing, and that one thing failed. That bug is now fixed, but there's still a nasty bug that causes exceptions to sometimes fail to be caught by <code>catch</code>, which is pretty bad news if you're putting something in a <code>try/catch</code> block because you don't want an exception to trickle up to the top level and kill your program.</p> <p>When I grepped through <code>Base</code> to find instances of actually catching an exception and doing something based on the particular exception, I could only find a single one. Now, it's me scanning grep output in less, so I might have missed some instances, but it isn't common, and grepping through common packages finds a similar ratio of error handling code to other code. Julia folks don't care about error handling, so it's buggy and incomplete. I once asked about this and was told that it didn't matter that exceptions didn't work because you shouldn't use exceptions anyway -- you should use Erlang-style error handling where you kill the entire process on an error and build transactionally robust systems that can survive having random processes killed. Putting aside the difficulty of that in a language that doesn't have Erlang's support for that kind of thing, you can easily spin up a million processes in Erlang. In Julia, if you load just one or two commonly used packages, firing up a single new instance of Julia can easily take half a minute or a minute.
Spinning up a million independent instances at 30 seconds apiece would take roughly a year (closer to two years if startup takes a minute).</p> <p>Since we're broadly on the topic of APIs, error conditions aren't the only place where the <code>Base</code> API leaves something to be desired. Conventions are inconsistent in many ways, from function naming to the order of arguments. Some methods on collections take the collection as the first argument and some don't (e.g., replace takes the string first and the regex second, whereas match takes the regex first and the string second).</p> <p>More generally, <code>Base</code> APIs outside of the niche Julia targets often don't make sense. There are too many examples to list them all, but consider this one: the UDP interface throws an exception on a partial packet. This is really strange and also unhelpful. <a href="https://github.com/JuliaLang/julia/pull/5697">Multiple people pointed that out on this issue</a> but the devs decided to throw the exception anyway. The Julia implementers have great intuition when it comes to linear algebra and other areas they're familiar with. But they're only human and their intuition isn't so great in areas they're not familiar with. The problem is that they go with their intuition anyway, even in the face of comments about how that might not be the best idea.</p> <p>Another thing that's an issue for me is that I'm not in the audience the package manager was designed for. It's backed by git in a clever way that lets people do all sorts of things I never do. The result of all that is that it needs to do <code>git status</code> on each package when I run <code>Pkg.status()</code>, which makes it horribly slow; most other <code>Pkg</code> operations I care about are also slow for a similar reason.</p> <p>That might be ok if it had the feature I most wanted, which is the ability to specify exact versions of packages and have multiple, conflicting versions of packages installed<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">3</a></sup>. Because of all the regressions in the core language libraries and in packages, I often need to use an old version of some package to make some function actually work, which can require old versions of its dependencies. There's no non-hacky way to do this.</p> <p>Since I'm talking about issues where I care a lot more than the core devs, there's also benchmarking. The website shows off some impressive-sounding speedup numbers over other languages. But they're all benchmarks that are pretty far from real workloads. Even if you have a strong background in workload characterization and systems architecture (computer architecture, not software architecture), it's difficult to generalize from microbenchmark numbers to performance on anything resembling a real workload. From what I've heard, performance optimization of Julia is done from a larger set of similar benchmarks, which has problems for all of the same reasons. Julia is actually pretty fast, but this sort of ad hoc benchmarking basically guarantees that performance is being left on the table. Moreover, the benchmarks are written in a way that stacks the deck against other languages. People from other language communities often get rebuffed when they submit PRs to rewrite the benchmarks in their languages idiomatically.
The Julia website claims that &quot;all of the benchmarks are written to test the performance of specific algorithms, expressed in a reasonable idiom&quot;, and that making adjustments that are idiomatic for specific languages would be unfair. However, if you look at the Julia code, you'll notice that they're written to avoid doing any of a number of things that would crater performance. If you follow the mailing list, you'll see that there are quite a few intuitive ways to write Julia code that has very bad performance. The Julia benchmarks avoid those pitfalls, but the code for other languages isn't written with anywhere near that care; in fact, it's just the opposite.</p> <p>I've just listed a bunch of issues with Julia. I believe the canonical response for complaints about an open source project is, why don't you fix the bugs yourself, you entitled brat? Well, I tried that. For one thing, <a href="//danluu.com/everything-is-broken/#julia">there are so many bugs that I often don't file bugs, let alone fix them, because it's too much of an interruption</a>. But the bigger issue is the barriers to new contributors. I spent a few person-days fixing bugs (mostly debugging, not writing code) and that was almost enough to get me into the top 40 on GitHub's list of contributors. My point isn't that I contributed a lot. It's that I didn't, and that still put me right below the top 40.</p> <p>There's lots of friction that keeps people from contributing to Julia. The build is often broken or has failing tests. When I <a href="//danluu.com/broken-builds/">polled Travis CI stats</a> for languages on GitHub, Julia was basically tied for last in uptime. This isn't just a statistical curiosity: the first time I tried to fix something, the build was non-deterministically broken for the better part of a week because someone checked bad code directly into master without review. I spent maybe a week fixing a few things and then took a break. The next time I came back to fix something, tests were failing for a day because of another bad check-in and I gave up on the idea of fixing bugs. That tests fail so often is even worse than it sounds when you take into account the poor test coverage. And even when the build is &quot;working&quot;, it uses <a href="http://aegis.sourceforge.net/auug97.pdf">recursive makefiles</a>, and often fails with a message telling you that you need to run <code>make clean</code> and build again, which takes half an hour. When you do so, it often fails with a message telling you that you need to <code>make clean all</code> and build again, which takes an hour. And then there's some chance that will fail and you'll have to manually clean out <code>deps</code> and build again, which takes even longer. And that's the good case! The bad case is when the build fails non-deterministically. These are well-known problems that occur when using recursive make, described in <a href="http://aegis.sourceforge.net/auug97.pdf">Recursive Make Considered Harmful</a> circa 1997.</p> <p>And that's not even the biggest barrier to contributing to core Julia. The biggest barrier is that the vast majority of the core code is written with no markers of intent (comments, meaningful variable names, asserts, meaningful function names, explanations of short variable or function names, design docs, etc.). There's a tax on debugging and fixing bugs deep in core Julia because of all this.
I happen to know one of the Julia core contributors (presently listed as the #2 contributor by GitHub's ranking), and when I asked him about some of the more obtuse functions I was digging around in, he couldn't figure them out either. His suggestion was to ask the mailing list, but for the really obscure code in the core codebase, there are perhaps one to three people who actually understand the code, and if they're too busy to respond, you're out of luck.</p> <p>I don't mind spending my spare time working for free to fix other people's bugs. In fact, I do quite a bit of that and it turns out I often enjoy it. But I'm too old and crotchety to spend my leisure time deciphering code that even the core developers can't figure out because it's too obscure.</p> <p>None of this is to say that Julia is bad, but the concerns of the core team are pretty different from my concerns. This is the point in a complain-y blog post where you're supposed to suggest an alternative or make a call to action, but I don't know that either makes sense here. The purely technical problems, like slow load times or the package manager, are being fixed or will be fixed, so there's not much to say there. As for process problems, like not writing tests, not writing internal documentation, and checking unreviewed and sometimes breaking changes directly into master, well, that's “easy”<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">4</a></sup> to fix by adding a code review process that forces people to write tests and documentation for code, but that's not free.</p> <p>A small team of <a href="http://lwn.net/2000/0824/a/esr-sharing.php3">highly talented</a> developers who can basically hold all of the code in their collective heads can make great progress while eschewing anything that isn't just straight coding, at the cost of making it more difficult for other people to contribute. Is that worth it? It's hard to say. If you have to slow down <a href="https://github.com/JeffBezanson">Jeff</a>, <a href="https://github.com/Keno">Keno</a>, and the other super productive core contributors and all you get out of it is a couple of bums like me, that's probably not worth it. If you get a thousand people like me, that's probably worth it. The reality is in the ambiguous region in the middle, where it might or might not be worth it. The calculation is complicated by the fact that most of the benefit comes in the long run, whereas the costs are disproportionately paid in the short run. I once had <a href="http://morrow.ece.wisc.edu/">an engineering professor</a> who claimed that the answer to every engineering question is &quot;it depends&quot;. What should Julia do? It depends.</p> <h3 id="2022-update">2022 Update</h3> <p>This post originally mentioned how friendly the Julia community is, but I removed that since it didn't seem accurate in light of the responses. Many people were highly supportive, such as this Julia core developer:</p> <p><img src="images/julialang/julia_community.png"></p> <p>However, a number of people had some pretty nasty responses, and I don't think it's accurate to call a community friendly when the response is mostly positive but comes with a significant fraction of nasty responses; it doesn't take a lot of nastiness to make a group seem unfriendly.
Also, sentiment about this post has gotten more negative over time as <a href="https://yosefk.com/blog/people-can-read-their-managers-mind.html">communities tend to take their direction from the top</a> and a couple of the Julia co-creators have consistently been quite negative about this post.</p> <p>Now, onto the extent to which these issues have been fixed. The initial response from the co-founders was that the issues aren't really real and the post is badly mistaken. Over time, as some of the issues had some work done on them, the response changed to being that this post is out of date and the issues were all fixed, e.g., <a href="https://news.ycombinator.com/item?id=11073642">here's a response from one of the co-creators of Julia in 2016</a>:</p> <blockquote> <p>The main valid complaints in Dan's post were:</p> <ol> <li><p>Insufficient testing &amp; coverage. Code coverage is now at 84% of base Julia, from somewhere around 50% at the time he wrote this post. While you can always have more tests (and that is happening), I certainly don't think that this is a major complaint at this point.</p></li> <li><p>Package issues. Julia now has package precompilation so package loading is pretty fast. The package manager itself was rewritten to use libgit2, which has made it much faster, especially on Windows where shelling out is painfully slow.</p></li> <li><p>Travis uptime. This is much better. There was a specific mystery issue going on when Dan wrote that post. That issue has been fixed. We also do Windows CI on AppVeyor these days.</p></li> <li><p>Documentation of Julia internals. Given the quite comprehensive developer docs that now exist, it's hard to consider this unaddressed: <a href="http://julia.readthedocs.org/en/latest/devdocs/julia/">http://julia.readthedocs.org/en/latest/devdocs/julia/</a></p></li> </ol> <p>So the legitimate issues raised in that blog post are fixed.</p> </blockquote> <p>The top response to that is:</p> <blockquote> <blockquote> <p>The main <i>valid</i> complaints [...] the <i>legitimate</i> issues raised [...]</p> </blockquote> <p>This is a really passive-aggressive weaselly phrasing. I’d recommend reconsidering this type of tone in public discussion responses.</p> <p>Instead of suggesting that the other complaints were invalid or illegitimate, you could just not mention them at all, or at least use nicer language in brushing them aside. E.g. “... 
the main <i>actionable</i> complaints...” or “the main <i>technical</i> complaints ...”</p> </blockquote> <p>Putting aside issues of tone, I would say that the main issue from the post, the core team's attitude towards correctness, is both a legitimate issue and one that's unfixed, as we'll see when we look at how the specific issues mentioned as fixed are also unfixed.</p> <p>On correctness: if the correctness issues were fixed, we wouldn't continue to see showstopping bugs in Julia. But I have a couple of friends who continued to use Julia for years until they got fed up with correctness issues, and they sent me quite a few serious bugs that they personally ran into well after the 2016 comment about correctness being fixed, such as <a href="https://github.com/JuliaStats/Distributions.jl/issues/1241">getting an incorrect result when sampling from a distribution</a>, <a href="https://github.com/JuliaStats/StatsBase.jl/issues/642">sampling from an array produces incorrect results</a>, <a href="https://github.com/JuliaLang/julia/issues/39183">the product function, i.e., multiplication, produces incorrect results</a>, <a href="https://github.com/JuliaStats/Statistics.jl/issues/119">quantile produces incorrect results</a>, <a href="https://github.com/JuliaStats/Statistics.jl/issues/22">mean produces incorrect results</a>, <a href="https://github.com/JuliaLang/julia/issues/39379">incorrect array indexing</a>, <a href="https://github.com/JuliaLang/julia/issues/49422">divide produces incorrect results</a>, <a href="https://github.com/JuliaLang/julia/issues/49422">converting from float to int produces incorrect results</a>, <a href="https://github.com/JuliaStats/Statistics.jl/issues/144">quantile produces incorrect results (again)</a>, <a href="https://github.com/JuliaStats/Statistics.jl/issues/140">mean produces incorrect results (again)</a>, etc.</p> <p>There has been a continued flow of very serious bugs in Julia, and numerous other people have noted that they've run into serious bugs, such as <a href="https://kidger.site/thoughts/jax-vs-julia/">here</a>:</p> <blockquote> <p>I remember all too un-fondly a time in which one of my Julia models was failing to train. I spent multiple months on-and-off trying to get it working, trying every trick I could think of.</p> <p>Eventually – eventually! – I found the error: <b>Julia/Flux/Zygote was returning incorrect gradients</b>. After having spent so much energy wrestling with points 1 and 2 above, this was the point where I simply gave up. Two hours of development work later, I had the model successfully training… in PyTorch.</p> </blockquote> <p>And <a href="https://discourse.julialang.org/t/state-of-machine-learning-in-julia/74385/13">here</a></p> <blockquote> <p>I have been bit by incorrect gradient bugs in Zygote/ReverseDiff.jl. This cost me weeks of my life and has thoroughly shaken my confidence in the entire Julia AD landscape. [...] In all my years of working with PyTorch/TF/JAX I have not once encountered an incorrect gradient bug.</p> </blockquote> <p>And <a href="https://discourse.julialang.org/t/state-of-machine-learning-in-julia/74385/21">here</a></p> <blockquote> <p>Since I started working with Julia, I’ve had two bugs with Zygote which have slowed my work by several months. On a positive note, this has forced me to plunge into the code and learn a lot about the libraries I’m using.
But I’m finding myself in a situation where this is becoming too much, and I need to spend a lot of time debugging code instead of doing climate research.</p> </blockquote> <p>Despite this continued flow of bugs, public responses from the co-creators of Julia as well as a number of core community members generally claim, as they did for this post, that the issues will be fixed very soon (e.g., see <a href="https://news.ycombinator.com/item?id=31263516">the comments here by some core devs on a recent post, saying that all of the issues are being addressed and will be fixed soon</a>, or this <a href="https://news.ycombinator.com/item?id=22138071">2020 comment about how there were serious correctness issues in 2016 but things are now good</a>, etc.).</p> <p>Instead of taking the correctness issues or other issues seriously, the developers make statements like <a href="images/julialang/julia-testing-seriously.jpg">the following comments from a co-creator of Julia, passed to me by a friend of mine as my friend ran into yet another showstopping bug</a>:</p> <blockquote> <p>takes that Julia doesn't take testing seriously... I don't get it. the amount of time and energy we spend on testing the bejeezus out of everything. I literally don't know any other open source project as thoroughly end-to-end tested.</p> </blockquote> <p>The general claim is that, not only has Julia fixed its correctness issues, it's as good as it gets for correctness.</p> <p>On the package issues, the claim was that package load times were fixed by 2016. But this continues to be a major complaint of the people I know who use Julia, e.g., Jamie Brandon switched away from using Julia in 2022 because it took two minutes for his CSV parsing pipeline to run, where most of the time was package loading. Another example is that, in 2020, <a href="https://news.ycombinator.com/item?id=24746057">on a benchmark where the Julia developers bragged that Julia is very fast at the curious workload of loading the same CSV over and over again (in a loop, not by running a script repeatedly) compared to R</a>, some people noted that this was unrealistic due to Julia's very long package load times, saying that it takes 2 seconds to open the CSV package and then 104 seconds to load a plotting library. In 2022, in response to comments that package loading is painfully slow, a Julia developer responds to each issue saying each one will be fixed; on package loading, they say</p> <blockquote> <p>We're getting close to native code caching, and more: <a href="https://discourse.julialang.org/t/precompile-why/78770/8">https://discourse.julialang.org/t/precompile-why/78770/8</a>. As you'll also read, the difficulty is due to important tradeoffs Julia made with composability and aggressive specialization...but it's not fundamental and can be surmounted. Yes there's been some pain, but in the end hopefully we'll have something approximating the best of both worlds.</p> </blockquote> <p>It's curious that these problems could exist in 2020 and 2022 after a co-creator of Julia claimed, in 2016, that the package load time problems were fixed. But this is the general pattern of Julia PR that we see. For any particular criticism, the response is one of: the criticism is illegitimate, it will be fixed soon, or, when the criticism is more than a year old, it was already fixed. But we can see by looking at responses over time that the issues that are &quot;already fixed&quot; or &quot;will be fixed soon&quot; are, in fact, not fixed many years after claims that they were fixed.
It's true that there is progress on the issues, but it wasn't really fair to say that package load time issues were fixed and &quot;package loading is pretty fast&quot; when it takes nearly two minutes to load a CSV and use a standard plotting library (an equivalent to ggplot2) to generate a plot in Julia. And likewise for correctness issues when there's still a steady stream of issues in core libraries, Julia itself, and libraries that are named as part of the magic that makes Julia great (e.g., <a href="https://news.ycombinator.com/item?id=29684214">autodiff is frequently named as a huge advantage of Julia when it comes to features</a>, but then <a href="https://news.ycombinator.com/item?id=31268796">when it comes to bugs, those bugs don't count because they're not in Julia itself</a>; that last comment, of course, has a comment from a Julia developer noting that all of the issues will be addressed soon).</p> <p>There's a sleight of hand here where the reflexive response from a number of the co-creators as well as core developers of Julia is to brush off any particular issue with a comment that sounds plausible if read on HN or Twitter by <a href="https://twitter.com/danluu/status/1524949021252411392">someone who doesn't know people who've used Julia</a>. This makes for good PR since, with an emerging language like Julia, most potential users won't have real connections who've used it seriously and the reflexive comments sound plausible if you don't look into them.</p> <p>I use the word reflexive here because it seems that some co-creators of Julia respond to any criticism with a rebuttal, such as <a href="https://twitter.com/danluu/status/1526256307736412160">here</a>, where a core developer responds to a post about showstopping bugs by saying that having bugs is actually good, and <a href="https://twitter.com/Viral_B_Shah/status/1315425480548462592">here</a>, where in response to my noting that some people had commented that they were tired of misleading benchmarking practices by Julia developers, a co-creator of Julia drops in to say &quot;I would like to let it be known for the record that I do not agree with your statements about Julia in this thread.&quot; But my statements in the thread were merely that there existed comments like <a href="https://news.ycombinator.com/item?id=24748582">https://news.ycombinator.com/item?id=24748582</a>. It's quite nonsensical to state, for the record, a disagreement that those kinds of comments exist because they clearly do exist.</p> <p>Another example of a reflexive response is <a href="https://news.ycombinator.com/item?id=31401814">this 2022 thread</a>, where someone who tried Julia but stopped using it for serious work after running into one too many bugs that took weeks to debug suggests that the Julia ecosystem needs a rewrite because the attitude and culture in the community results in a large number of correctness issues. A core Julia developer &quot;rebuts&quot; the comment by saying that things are re-written all the time and gives examples of things that were re-written for performance reasons. Performance re-writes are, famously, a great way to introduce bugs, making the &quot;rebuttal&quot; actually a kind of anti-rebuttal.
But, as is typical for many core Julia developers, the person saw that there was an issue (not enough re-writes) and reflexively responded with a denial: that there are enough re-writes.</p> <p>These reflexive responses are pretty obviously bogus if you spend a bit of time reading them and looking at the historical context, but this kind of &quot;deny deny deny&quot; response is generally highly effective PR and has been effective for Julia, so it's understandable that it's done. For example, <a href="https://news.ycombinator.com/item?id=22138071">on this 2020 comment that belies the 2016 comment about correctness being fixed by saying that there were serious issues in 2016 but things are &quot;now&quot; good in 2020</a>, someone responds &quot;Thank you, this is very heartening.&quot;, since it relieves them of their concern that there are still issues. Of course, you can see basically the same discussion play out in 2022, but people reading the discussion in 2022 generally won't go back to see that this same discussion happened in 2020, 2016, 2013, etc.</p> <p>On the build uptime, the claim is that the issue causing uptime issues was fixed, but my comment there was on the attitude of brushing off the issue for an extended period of time with &quot;works on my machine&quot;. As we can see from the examples above, the meta-issue of brushing off issues continued.</p> <p>On the last issue that was claimed to be legitimate, which was also claimed to be fixed, documentation, this is still a common complaint from the community, e.g., <a href="https://news.ycombinator.com/item?id=17730094">here in 2018</a>, 2 years after it was claimed that documentation was fixed in 2016, <a href="https://news.ycombinator.com/item?id=20589167">here in 2019</a>, <a href="https://news.ycombinator.com/item?id=31398705">here in 2022</a>, etc. In a much lengthier complaint, one person notes</p> <blockquote> <p>The biggest issue, and one they seem unwilling to really address, is that actually using the type system to do anything cool requires you to rely entirely on documentation which may or may not exist (or be up-to-date).</p> </blockquote> <p>And another echoes this sentiment with</p> <blockquote> <p>This is truly an important issue.</p> </blockquote> <p>Of course, <a href="https://news.ycombinator.com/item?id=17741797">there's a response saying this will be fixed soon</a>, as is generally the case. And yet, you can still find people complaining about the documentation.</p> <p>If you go back and read discussions on Julia correctness issues, three more common defenses are that everything has bugs, bugs are quickly fixed, and testing is actually great because X is well tested. You can see examples of &quot;everything has bugs&quot; <a href="https://groups.google.com/g/julia-users/c/GyH8nhExY9I/m/0BznqtD5VJgJ">here in 2014</a> as well as <a href="https://news.ycombinator.com/item?id=31397499">here in 2022</a> (and in between as well, of course), as if all non-zero bug rates are the same, even though a number of developers have noted that they stopped using Julia for work and switched to other ecosystems because, while everything has bugs, all non-zero numbers are, of course, not the same.
Bugs getting fixed quickly is sometimes not true (e.g., many of the bugs linked in this post have been open for quite a while and are still open) and is also <a href="https://twitter.com/danluu/status/1270987200545435648">a classic defense that's used to distract from the issue of practices that directly lead to the creation of an unusually large number of new bugs</a>. As noted in a number of links, above, it can take weeks or months to debug correctness issues since many of the correctness issues are of the form &quot;silently return incorrect results&quot; and, as noted above, I ran into a bug where exceptions non-deterministically weren't caught when they should have been. It may be true that, in some cases, these sorts of bugs are quickly fixed when found, but those issues still cost users a lot of time to track down. We saw an example of &quot;testing is actually great because X is well tested&quot; <a href="https://news.ycombinator.com/item?id=31401814">above</a>. If you'd like a more recent example, <a href="https://news.ycombinator.com/item?id=31397499">here's one from 2022</a> where, in response to someone saying that they ran into more correctness bugs in Julia than in any other ecosystem they've used in their decades of programming, a core Julia dev responds by saying that a number of things are very well tested in Julia, such as libuv, as if testing some components well is a talisman that can be wielded against bugs in other components. This is obviously absurd, in that it's like saying that a building with an open door can't be insecure because it also has very sturdy walls, but it's a common defense used by core Julia developers. And, of course, there's also just straight-up FUD about writing about Julia. For example, in 2022, on Yuri Vishnevsky's post on Julia bugs, a co-creator of Julia said <a href="https://news.ycombinator.com/item?id=31881164">&quot;Yuri's criticism was not that Julia has correctness bugs as a language, but that certain libraries when composed with common operations had bugs (many of which are now addressed).&quot;</a>. This is, of course, completely untrue. In conversations with Yuri, he noted to me that he specifically included examples of core language and core library bugs because those happened so frequently, and it was frustrating that core Julia people pretended those didn't exist and that their FUD seemed to work since people would often respond as if those comments were true. As mentioned above, this kind of flat denial of simple matters of fact is highly effective, so it's understandable that people employ it but, personally, it's not to my taste.</p> <p>To be clear, I don't inherently have a problem with software being buggy. As I've mentioned, <a href="https://twitter.com/danluu/status/1487232684363370496">I think move fast and break things can be a good value</a> because it clearly states that velocity is more valued than correctness. Comments from the creators of Julia as well as core developers broadcast that Julia is not just highly reliable and correct, but actually world class (&quot;the amount of time and energy we spend on testing the bejeezus out of everything. I literally don't know any other open source project as thoroughly end-to-end tested.&quot;, etc.).
But, by revealed preference, we can see that Julia's values are &quot;move fast and break things&quot;.</p> <h3 id="appendix-blog-posts-on-julia">Appendix: blog posts on Julia</h3> <ul> <li>2014: <a href="julialang/">this post</a></li> <li>2016: <a href="https://www.zverovich.net/2016/05/13/giving-up-on-julia.html">Victor Zverovich</a> <ul> <li>Julia brags about high performance in unrepresentative microbenchmarks but often has poor performance in practice</li> <li>Complex codebase leading to many bugs</li> </ul></li> <li>2022: <a href="https://weissmann.pm/julialang/">Volker Weissman</a> <ul> <li>Poor documentation</li> <li>Unclear / confusing error messages</li> <li>Benchmarks claim good performance but benchmarks are of unrealistic workloads and performance is often poor in practice</li> </ul></li> <li>2022: <a href="https://kidger.site/thoughts/jax-vs-julia/">Patrick Kidger</a> comparison of Julia to JAX and PyTorch <ul> <li>Poor documentation</li> <li>Correctness issues in widely relied on, important, libraries</li> <li>Inscrutable error messages</li> <li>Poor code quality, leading to bugs and other issues</li> </ul></li> <li>2022: <a href="https://yuri.is/not-julia/">Yuri Vishnevsky</a> <ul> <li>Many very serious correctness bugs in both the language runtime and core libraries that are heavily relied on</li> <li>Culture / attitude has persistently caused a large number of bugs, &quot;Julia and its packages have the highest rate of serious correctness bugs of any programming system I’ve used, and I started programming with Visual Basic 6 in the mid-2000s&quot; <ul> <li>Stream of serious bugs is in stark contrast to comments from core Julia developers and Julia co-creators saying that Julia is very solid and has great correctness properties</li> </ul></li> </ul></li> </ul> <p><small> Thanks (or anti-thanks) to Leah Hanson for pestering me to write this for the past few months. It's not the kind of thing I'd normally write, but the concerns here got repeatedly brushed off when I brought them up in private. For example, when I brought up testing, I was told that Julia is better tested than most projects. While that's true in some technical sense (the median project on GitHub probably has zero tests, so any non-zero number of tests is above average), I didn't find that to be a meaningful rebuttal (as opposed to a reply that Julia is still expected to be mostly untested because it's in an alpha state). After getting a similar response on a wide array of topics I stopped using Julia. Normally that would be that, but Leah really wanted these concerns to stop getting ignored, so I wrote this up.</p> <p>Also, thanks to Leah Hanson, Julia Evans, Joe Wilder, Eddie V, David Andrzejewski, and Yuri Vishnevsky for comments/corrections/discussion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:L">What I mean here is that you can have lots of bugs pop up despite having 100% line coverage. It's not that line coverage is bad, but that it's not sufficient, not even close. And because it's not sufficient, it's a pretty bad sign when you not only don't have 100% line coverage, you don't even have 100% function coverage. <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:R">I'm going to use the word care a few times, and when I do I mean something specific. When I say care, I mean that in the colloquial <a href="http://en.wikipedia.org/wiki/Revealed_preference">revealed preference</a> sense of the word. 
There's another sense of the word, in which everyone cares about testing and error handling, the same way every politician cares about family values. But that kind of caring isn't linked to what I care about, which involves concrete actions. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:T">It's technically possible to have multiple versions installed, but the process is a total hack. <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:E">By &quot;easy&quot;, I mean extremely hard. Technical fixes can be easy, but process and cultural fixes are almost always hard. <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> </ol> </div> Integer overflow checking cost integer-overflow/ Wed, 17 Dec 2014 00:00:00 +0000 integer-overflow/ <p>How much overhead should we expect from enabling integer overflow checks? Using a compiler flag or built-in intrinsics, we should be able to do the check with a conditional branch that branches based on the overflow flag that <code>add</code> and <code>sub</code> set. Code that looks like</p> <pre><code>add %esi, %edi
</code></pre> <p>should turn into something like</p> <pre><code>add %esi, %edi
jo  &lt;handle_overflow&gt;
</code></pre> <p>Assuming that branch is always correctly predicted (which should be the case for most code), the costs of the branch are the cost of executing that correctly predicted not-taken branch, the pollution the branch causes in the branch history table, and the cost of decoding the branch (on x86, <code>jo</code> and <code>jno</code> don't fuse with <code>add</code> or <code>sub</code>, which means that on the fast path, the branch will take up one of the 4 opcodes that can come from the decoded instruction cache per cycle). That's probably less than a 2x penalty per <code>add</code> or <code>sub</code> on front-end limited code in the worst case (which might happen in a tightly optimized loop, but should be rare in general), plus some nebulous penalty from branch history pollution which is really difficult to measure in microbenchmarks. Overall, we can use 2x as a pessimistic guess for the total penalty.</p> <p>2x sounds like a lot, but how much time do applications spend adding and subtracting? If we look at the most commonly used benchmark of “workstation” integer workloads, <a href="http://www.spec.org/cpu2006/">SPECint</a>, the composition is maybe 40% load/store ops, 10% branches, and 50% other operations. Of the 50% “other” operations, maybe 30% of those are integer add/sub ops. If we guesstimate that load/store ops are 10x as expensive as add/sub ops, and other ops are as expensive as add/sub, a 2x penalty on add/sub should result in a <code>(40*10+10+50+15) / (40*10+10+50) = 3%</code> penalty. That the penalty for a branch is 2x, that add/sub ops are only 10x faster than load/store ops, and that add/sub ops aren't faster than other &quot;other&quot; ops are all pessimistic assumptions, so this estimate should be on the high end for most workloads.</p> <p>John Regehr, who's done serious analysis on integer overflow checks, estimates that the penalty should be <a href="http://blog.regehr.org/archives/1154">about 5%</a>, which is in the same ballpark as our napkin sketch estimate.</p> <p>A SPEC license costs $800, so let's benchmark bzip2 (which is a component of SPECint) instead of paying $800 for SPECint. Compiling bzip2 with <code>clang -O3</code> vs.
<code>clang -O3 -fsanitize=signed-integer-overflow,unsigned-integer-overflow</code> (which prints out a warning on overflow) vs. <code>-fsanitize-undefined-trap-on-error</code> with undefined overflow checks (which causes a crash on an undefined overflow), we get the following results on compressing and decompressing 1GB of code and binaries that happened to be lying around on my machine.</p> <table> <thead> <tr> <th>options</th> <th>zip (s)</th> <th>unzip (s)</th> <th>zip (ratio)</th> <th>unzip (ratio)</th> </tr> </thead> <tbody> <tr> <td>normal</td> <td>93</td> <td>45</td> <td>1.0</td> <td>1.0</td> </tr> <tr> <td>fsan</td> <td>119</td> <td>49</td> <td>1.28</td> <td>1.09</td> </tr> <tr> <td>fsan ud</td> <td>94</td> <td>45</td> <td>1.01</td> <td>1.00</td> </tr> </tbody> </table> <p>In the table, ratio is the relative ratio of the run times, not the compression ratio. The difference between <code>fsan ud, unzip</code> and <code>normal, unzip</code> isn't actually 0, but it rounds to 0 if we measure in whole seconds. If we enable good error messages, decompression doesn't slow down all that much (45s v. 49s), but compression is a lot slower (93s v. 119s). The penalty for integer overflow checking is 28% for compression and 9% for decompression if we print out nice diagnostics, but almost nothing if we don't. How is that possible? Bzip2 normally has a couple of unsigned integer overflows. If I patch the code to remove those, so that the diagnostic printing code path is never executed, the checks still cause a large performance hit.</p> <p>Let's check out the penalty when we just do some adds with something like</p> <pre><code>for (int i = 0; i &lt; n; ++i) {
  sum += a[i];
}
</code></pre> <p>On my machine (a 3.4 GHz Sandy Bridge), this turns out to be about 6x slower with <code>-fsanitize=signed-integer-overflow,unsigned-integer-overflow</code>. Looking at the disassembly, the normal version uses SSE adds, whereas the fsanitize version uses normal adds. Ok, 6x sounds plausible for unchecked SSE adds v. checked adds.</p> <p>But if I try different permutations of the same loop that don't allow the compiler to emit SSE instructions for the unchecked version, I still get a 4x-6x performance penalty for versions compiled with fsanitize. Since there are a lot of different optimizations in play, including loop unrolling, let's take a look at a simple function that does a single add to get a better idea of what's going on.</p> <p>Here's the disassembly for a function that adds two ints, first compiled with <code>-O3</code> and then compiled with <code>-O3 -fsanitize=signed-integer-overflow,unsigned-integer-overflow</code>.</p> <pre><code>0000000000400530 &lt;single_add&gt;:
  400530: 01 f7     add    %esi,%edi
  400532: 89 f8     mov    %edi,%eax
  400534: c3        retq
</code></pre> <p>The compiler does a reasonable job on the <code>-O3</code> version. Per the standard AMD64 calling convention, the arguments are passed in via the <code>esi</code> and <code>edi</code> registers, and passed out via the <code>eax</code> register.
There's some overhead over an inlined <code>add</code> instruction because we have to move the result to <code>eax</code> and then return from the function call, but considering that it's a function call, it's a totally reasonable implementation.</p> <pre><code>000000000041df90 &lt;single_add&gt;:
  41df90: 53                   push   %rbx
  41df91: 89 fb                mov    %edi,%ebx
  41df93: 01 f3                add    %esi,%ebx
  41df95: 70 04                jo     41df9b &lt;single_add+0xb&gt;
  41df97: 89 d8                mov    %ebx,%eax
  41df99: 5b                   pop    %rbx
  41df9a: c3                   retq
  41df9b: 89 f8                mov    %edi,%eax
  41df9d: 89 f1                mov    %esi,%ecx
  41df9f: bf a0 89 62 00       mov    $0x6289a0,%edi
  41dfa4: 48 89 c6             mov    %rax,%rsi
  41dfa7: 48 89 ca             mov    %rcx,%rdx
  41dfaa: e8 91 13 00 00       callq  41f340 &lt;__ubsan_handle_add_overflow&gt;
  41dfaf: eb e6                jmp    41df97 &lt;single_add+0x7&gt;
</code></pre> <p>The compiler does not do a reasonable job on the <code>-O3 -fsanitize=signed-integer-overflow,unsigned-integer-overflow</code> version. Optimization wizard Nathan Kurz had this to say about clang's output:</p> <blockquote> <p>That's awful (although not atypical) compiler generated code. For some reason the compiler decided that it wanted to use %ebx as the destination of the add. Once it did this, it has to do the rest. The question would be why it didn't use a scratch register, why it felt it needed to do the move at all, and what can be done to prevent it from doing so in the future. As you probably know, %ebx is a 'callee save' register, meaning that it must have the same value when the function returns --- thus the push and pop. Had the compiler just done the add without the additional mov, leaving the input in %edi/%esi as it was passed (and as done in the non-checked version), this wouldn't be necessary. I'd guess that it's a residue of some earlier optimization pass, but somehow the ghost of %ebx remained.</p> </blockquote> <p>However, adding <code>-fsanitize-undefined-trap-on-error</code> changes this to</p> <pre><code>0000000000400530 &lt;single_add&gt;:
  400530: 01 f7     add    %esi,%edi
  400532: 70 03     jo     400537 &lt;single_add+0x7&gt;
  400534: 89 f8     mov    %edi,%eax
  400536: c3        retq
  400537: 0f 0b     ud2
</code></pre> <p>Although this is a tiny, contrived example, we can see a variety of mis-optimizations in other code compiled with options that allow fsanitize to print out diagnostics.</p> <p>While a better C compiler could do better, in theory, gcc 4.8.2 doesn't do better than clang 3.4 here. For one thing, gcc's <code>-ftrapv</code> only checks signed overflow. Worse yet, it doesn't work, <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35412">and this bug on ftrapv has been open since 2008</a>. Despite doing fewer checks and not doing them correctly, gcc's <code>-ftrapv</code> slows things down about as much as clang's <code>-fsanitize=signed-integer-overflow,unsigned-integer-overflow</code> on bzip2, and substantially more than <code>-fsanitize=signed-integer-overflow</code>.</p> <p>Summing up, integer overflow checks ought to cost a few percent on typical integer-heavy workloads, and they do, as long as you don't want nice error messages. The current mechanism that produces nice error messages somehow causes optimizations to get screwed up in a lot of cases<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">1</a></sup>.</p> <h3 id="update">Update</h3> <p>On clang 3.8.0 and after, and gcc 5 and after, register allocation seems to work as expected (although you may need to pass <code>-fno-sanitize-recover</code>).
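</p> <p>As an aside (this isn't something from the benchmarks above), if you only want to check a handful of operations rather than instrumenting the whole program, gcc 5 and later and recent versions of clang also provide generic checked-arithmetic builtins. A minimal sketch of what that looks like:</p> <pre><code>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

// Checked addition using the generic __builtin_add_overflow builtin
// (gcc 5+ and recent clang). The builtin does the add, stores the
// wrapped result in *result, and returns true if the add overflowed,
// so the fast path can compile down to an add followed by a
// not-taken branch, much like the add / jo sequence discussed above.
int checked_add(int a, int b) {
  int result;
  if (__builtin_add_overflow(a, b, &amp;result)) {
    fprintf(stderr, &quot;overflow adding %d and %d\n&quot;, a, b);
    abort();
  }
  return result;
}
</code></pre> <p>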
I haven't gone back and re-run my benchmarks across different versions of clang and gcc, but I'd like to do that when I get some time.</p> <h3 id="cpu-internals-series">CPU internals series</h3> <ul> <li><a href="//danluu.com/branch-prediction/">A brief history of branch prediction</a></li> <li><a href="//danluu.com/new-cpu-features/">New CPU features since the 80s</a></li> <li><a href="//danluu.com/integer-overflow/">The cost of branches and integer overflow checking in real code</a></li> <li><a href="cpu-bugs/">CPU bugs</a></li> <li><a href="//danluu.com/hardware-unforgiving/">Why CPU development is hard</a></li> <li><a href="//danluu.com/why-hardware-development-is-hard/">Verilog sucks, part 1</a></li> <li><a href="//danluu.com/pl-troll/">Verilog sucks, part 2</a></li> </ul> <p><small> Thanks to Nathan Kurz for comments on this topic, including, but not limited to, the quote that's attributed to him, and to Stan Schwertly, Nick Bergson-Shilcock, Scott Feeney, Marek Majkowski, Adrian and Juan Carlos Borras for typo corrections and suggestions for clarification. Also, huge thanks to Richard Smith, who pointed out the <code>-fsanitize-undefined-trap-on-error</code> option to me. This post was updated with results for that option after Richard's comment. Also, thanks to Filipe Cabecinhas for noticing that clang fixed this behavior in clang 3.8 (released approximately 1.5 years after this post).</p> <p>John Regehr has some <a href="https://plus.google.com/105487075784331805819/posts/dyKKLrW8jXb">more comments here</a> on why clang's implementation of integer overflow checking isn't fast (yet).</p> <p></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:H"><p>People often <a href="http://yosefk.com/blog/the-high-level-cpu-challenge.html">call for hardware support</a> for integer overflow checking above and beyond the existing overflow flag. That would add expense and complexity to every chip made to get, at most, a few percent extra performance in the best case, on optimized code. That might be worth it -- there are lots of features Intel adds that only speed up a subset of applications by a few percent.</p> <p>This is often described as a chicken and egg problem; people would use overflow checks if checks weren't so slow, and hardware support is necessary to make the checks fast. But there's already hardware support to get good-enough performance for the vast majority of applications. It's just not taken advantage of because people don't actually care about this problem.</p> <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> </ol> </div> Malloc tutorial malloc-tutorial/ Thu, 04 Dec 2014 00:00:00 +0000 malloc-tutorial/ <p>Let's write a <a href="http://man7.org/linux/man-pages/man3/malloc.3.html">malloc</a> and see how it works with existing programs!</p> <p>This is basically an expanded explanation of what I did after reading <a href="http://www.inf.udec.cl/~leo/Malloc_tutorial.pdf">this tutorial</a> by Marwan Burelle and then sitting down and trying to write my own implementation, so the steps are going to be fairly similar. The main implementation differences are that my version is simpler and more vulnerable to memory fragmentation. 
In terms of exposition, my style is a lot more casual.</p> <p>This tutorial is going to assume that you know what pointers are, and that you know enough C to know that <code>*ptr</code> dereferences a pointer, <code>ptr-&gt;foo</code> means <code>(*ptr).foo</code>, that malloc is used to <a href="http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory/">dynamically allocate space</a>, and that you're familiar with the concept of a linked list. For a basic intro to C, <a href="http://www.amazon.com/gp/product/0673999866/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0673999866&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=5C3DNUKAQELP2KUL">Pointers on C</a> is one of my favorite books. If you want to look at all of this code at once, it's available <a href="https://github.com/danluu/malloc-tutorial/blob/master/malloc.c">here</a>.</p> <p>Preliminaries aside, malloc's function signature is</p> <pre><code>void *malloc(size_t size);
</code></pre> <p>It takes as input a number of bytes and returns a pointer to a block of memory of that size.</p> <p>There are a number of ways we can implement this. We're going to arbitrarily choose to use <a href="http://man7.org/linux/man-pages/man2/sbrk.2.html">sbrk</a>. The OS reserves stack and heap space for processes and sbrk lets us manipulate the heap. <code>sbrk(0)</code> returns a pointer to the current top of the heap. <code>sbrk(foo)</code> increments the heap size by <code>foo</code> and returns a pointer to the previous top of the heap.</p> <p><img src="images/malloc-tutorial/heap.png" alt="Diagram of linux memory layout, courtesy of Gustavo Duarte." width="194" height="301"></p> <p>If we want to implement a really simple malloc, we can do something like</p> <pre><code>#include &lt;assert.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;unistd.h&gt;

void *malloc(size_t size) {
  void *p = sbrk(0);
  void *request = sbrk(size);
  if (request == (void*) -1) {
    return NULL; // sbrk failed.
  } else {
    assert(p == request); // Not thread safe.
    return p;
  }
}
</code></pre> <p>When a program asks malloc for space, malloc asks sbrk to increment the heap size and returns a pointer to the start of the new region on the heap. This is missing a technicality, that <code>malloc(0)</code> should either return <code>NULL</code> or another pointer that can be passed to free without causing havoc, but it basically works.</p> <p>But speaking of free, how does free work? Free's prototype is</p> <pre><code>void free(void *ptr);
</code></pre> <p>When free is passed a pointer that was previously returned from malloc, it's supposed to free the space. But given a pointer to something allocated by our malloc, we have no idea what size block is associated with it. Where do we store that? If we had a working malloc, we could malloc some space and store it there, but we're going to run into trouble if we need to call malloc to reserve space each time we call malloc to reserve space.</p> <p>A common trick to work around this is to store meta-information about a memory region in some space that we squirrel away just below the pointer that we return. Say the top of the heap is currently at <code>0x1000</code> and we ask for <code>0x400</code> bytes. Our current malloc will request <code>0x400</code> bytes from <code>sbrk</code> and return a pointer to <code>0x1000</code>.
If we instead save, say, <code>0x10</code> bytes to store information about the block, our malloc would request <code>0x410</code> bytes from <code>sbrk</code> and return a pointer to <code>0x1010</code>, hiding our <code>0x10</code> byte block of meta-information from the code that's calling malloc.</p> <p>That lets us free a block, but then what? The heap region we get from the OS has to be contiguous, so we can't return a block of memory in the middle to the OS. Even if we were willing to copy everything above the newly freed region down to fill the hole, so we could return space at the end, there's no way to notify all of the code with pointers to the heap that those pointers need to be adjusted.</p> <p>Instead, we can mark that the block has been freed without returning it to the OS, so that future calls to malloc can re-use the block. But to do that we'll need to be able to access the meta information for each block. There are a lot of possible solutions to that. We'll arbitrarily choose to use a singly linked list for simplicity.</p> <p>So, for each block, we'll want to have something like</p> <pre><code>struct block_meta {
  size_t size;
  struct block_meta *next;
  int free;
  int magic; // For debugging only. TODO: remove this in non-debug mode.
};

#define META_SIZE sizeof(struct block_meta)
</code></pre> <p>We need to know the size of the block, whether or not it's free, and what the next block is. There's a magic number here for debugging purposes, but it's not really necessary; we'll set it to arbitrary values, which will let us easily see which code modified the struct last.</p> <p>We'll also need a head for our linked list:</p> <pre><code>void *global_base = NULL;
</code></pre> <p>For our malloc, we'll want to re-use free space if possible, allocating space when we can't re-use existing space. Given that we have this linked list structure, checking if we have a free block and returning it is straightforward. When we get a request of some size, we iterate through our linked list to see if there's a free block that's large enough.</p> <pre><code>struct block_meta *find_free_block(struct block_meta **last, size_t size) {
  struct block_meta *current = global_base;
  while (current &amp;&amp; !(current-&gt;free &amp;&amp; current-&gt;size &gt;= size)) {
    *last = current;
    current = current-&gt;next;
  }
  return current;
}
</code></pre> <p>If we don't find a free block, we'll have to request space from the OS using sbrk and add our new block to the end of the linked list.</p> <pre><code>struct block_meta *request_space(struct block_meta* last, size_t size) {
  struct block_meta *block;
  block = sbrk(0);
  void *request = sbrk(size + META_SIZE);
  assert((void*)block == request); // Not thread safe.
  if (request == (void*) -1) {
    return NULL; // sbrk failed.
  }

  if (last) { // NULL on first request.
    last-&gt;next = block;
  }
  block-&gt;size = size;
  block-&gt;next = NULL;
  block-&gt;free = 0;
  block-&gt;magic = 0x12345678;
  return block;
}
</code></pre> <p>As with our original implementation, we request space using <code>sbrk</code>. But we add a bit of extra space to store our struct, and then set the fields of the struct appropriately.</p> <p>Now that we have helper functions to check if we have existing free space and to request space, our malloc is simple. If our global base pointer is <code>NULL</code>, we need to request space and set the base pointer to our new block. If it's not <code>NULL</code>, we check to see if we can re-use any existing space.
If we can, then we do; if we can't, then we request space and use the new space.</p> <pre><code>void *malloc(size_t size) {
  struct block_meta *block;
  // TODO: align size?

  if (size &lt;= 0) {
    return NULL;
  }

  if (!global_base) { // First call.
    block = request_space(NULL, size);
    if (!block) {
      return NULL;
    }
    global_base = block;
  } else {
    struct block_meta *last = global_base;
    block = find_free_block(&amp;last, size);
    if (!block) { // Failed to find free block.
      block = request_space(last, size);
      if (!block) {
        return NULL;
      }
    } else { // Found free block
      // TODO: consider splitting block here.
      block-&gt;free = 0;
      block-&gt;magic = 0x77777777;
    }
  }

  return(block+1);
}
</code></pre> <p>For anyone who isn't familiar with C, we return block+1 because we want to return a pointer to the region after block_meta. Since block is a pointer of type <code>struct block_meta</code>, <code>+1</code> increments the address by one <code>sizeof(struct block_meta)</code>.</p> <p>If we just wanted a malloc without a free, we could have used our original, much simpler malloc. So let's write free! The main thing free needs to do is set <code>-&gt;free</code>.</p> <p>Because we'll need to get the address of our struct in multiple places in our code, let's define this function.</p> <pre><code>struct block_meta *get_block_ptr(void *ptr) {
  return (struct block_meta*)ptr - 1;
}
</code></pre> <p>Now that we have that, here's free:</p> <pre><code>void free(void *ptr) {
  if (!ptr) {
    return;
  }

  // TODO: consider merging blocks once splitting blocks is implemented.
  struct block_meta* block_ptr = get_block_ptr(ptr);
  assert(block_ptr-&gt;free == 0);
  assert(block_ptr-&gt;magic == 0x77777777 || block_ptr-&gt;magic == 0x12345678);
  block_ptr-&gt;free = 1;
  block_ptr-&gt;magic = 0x55555555;
}
</code></pre> <p>In addition to setting <code>-&gt;free</code>, we check for a NULL ptr, since it's valid to call free with NULL. Since free shouldn't be called on arbitrary addresses or on blocks that are already freed, we can assert that those things never happen.</p> <p>You never really need to assert anything, but it often makes debugging a lot easier. In fact, when I wrote this code, I had a bug that would have resulted in silent data corruption if these asserts weren't there. Instead, the code failed at the assert, which made it trivial to debug.</p> <p>Now that we've got malloc and free, we can write programs using our custom memory allocator! But before we can drop our allocator into existing code, we'll need to implement a couple more common functions, realloc and calloc. Calloc is just malloc that initializes the memory to 0, so let's look at realloc first. Realloc is supposed to adjust the size of a block of memory that we've gotten from malloc, calloc, or realloc.</p> <p>Realloc's function prototype is</p> <pre><code>void *realloc(void *ptr, size_t size)
</code></pre> <p>If we pass realloc a NULL pointer, it's supposed to act just like malloc. If we pass it a previously malloced pointer, it should free up space if the size is smaller than the previous size, and allocate more space and copy the existing data over if the size is larger than the previous size.</p> <p>Everything will still work if we don't resize when the size is decreased and we don't free anything, but we absolutely have to allocate more space if the size is increased, so let's start with that.</p> <pre><code>void *realloc(void *ptr, size_t size) {
  if (!ptr) {
    // NULL ptr. realloc should act like malloc.
    return malloc(size);
  }

  struct block_meta* block_ptr = get_block_ptr(ptr);
  if (block_ptr-&gt;size &gt;= size) {
    // We have enough space. Could free some once we implement split.
    return ptr;
  }

  // Need to really realloc. Malloc new space and free old space.
  // Then copy old data to new space.
  void *new_ptr;
  new_ptr = malloc(size);
  if (!new_ptr) {
    return NULL; // TODO: set errno on failure.
  }
  memcpy(new_ptr, ptr, block_ptr-&gt;size);
  free(ptr);
  return new_ptr;
}
</code></pre> <p>And now for calloc, which just clears the memory before returning a pointer.</p> <pre><code>void *calloc(size_t nelem, size_t elsize) {
  size_t size = nelem * elsize; // TODO: check for overflow.
  void *ptr = malloc(size);
  memset(ptr, 0, size);
  return ptr;
}
</code></pre> <p>Note that this doesn't check for overflow in <code>nelem * elsize</code>, which is actually required by the spec. All of the code here is just enough to get something that kinda sorta works.</p> <p>Now that we have something that kinda works, we can use our malloc with existing programs (and we don't even need to recompile the programs)!</p> <p>First, we need to compile our code. On linux, something like</p> <pre><code>clang -O0 -g -W -Wall -Wextra -shared -fPIC malloc.c -o malloc.so
</code></pre> <p>should work.</p> <p><code>-g</code> adds debug symbols, so we can look at our code with <code>gdb</code> or <code>lldb</code>. <code>-O0</code> will help with debugging, by preventing individual variables from getting optimized out. <code>-W -Wall -Wextra</code> adds extra warnings. <code>-shared -fPIC</code> will let us dynamically link our code, which is what lets us <a href="http://jvns.ca/blog/2014/11/27/ld-preload-is-super-fun-and-easy/">use our code with existing binaries</a>!</p> <p>On macs, we'll want something like</p> <pre><code>clang -O0 -g -W -Wall -Wextra -dynamiclib malloc.c -o malloc.dylib
</code></pre> <p>Note that <code>sbrk</code> is deprecated on recent versions of OS X. Apple uses an unorthodox definition of deprecated -- some deprecated syscalls are badly broken. I didn't really test this on a Mac, so it's possible that this will cause weird failures or just not work on a mac.</p> <p>Now, to get a binary to use our malloc on linux, we'll need to set the <code>LD_PRELOAD</code> environment variable. If you're using bash, you can do that with</p> <pre><code>export LD_PRELOAD=/absolute/path/here/malloc.so
</code></pre> <p>If you've got a mac, you'll want</p> <pre><code>export DYLD_INSERT_LIBRARIES=/absolute/path/here/malloc.dylib
</code></pre> <p>If everything works, you can run some arbitrary binary and it will run as normal (except that it will be a bit slower).</p> <pre><code>$ ls
Makefile malloc.c malloc.so README.md test test-0 test-1 test-2 test-3 test-4
</code></pre> <p>If there's a bug, you might get something like</p> <pre><code>$ ls
Segmentation fault (core dumped)
</code></pre> <h4 id="debugging">Debugging</h4> <p>Let's talk about debugging! If you're familiar with using a debugger to set breakpoints, inspect memory, and step through code, you can skip this section and go straight to <a href="#exercises">the exercises</a>.</p> <p>This section assumes you can figure out how to install gdb on your system. If you're on a mac, you may want to just use lldb and translate the commands appropriately. Since I don't know what bugs you might run into, I'm going to introduce a couple of bugs and show how I'd track them down.</p> <p>First, we need to figure out how to run gdb without having it segfault.
If ls segfaults, and we try to run <code>gdb ls</code>, gdb is almost certainly going to segfault, too. We could write a wrapper to do this, but gdb also lets us do this directly. If we start gdb and then run <code>set environment LD_PRELOAD=./malloc.so</code> before running the program, <code>LD_PRELOAD</code> will work as normal.</p> <pre><code>$ gdb /bin/ls
(gdb) set environment LD_PRELOAD=./malloc.so
(gdb) run
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7bd7dbd in free (ptr=0x0) at malloc.c:113
113       assert(block_ptr-&gt;free == 0);
</code></pre> <p>As expected, we get a segfault. We can look around with <code>list</code> to see the code near the segfault.</p> <pre><code>(gdb) list
108     }
109
110     void free(void *ptr) {
111       // TODO: consider merging blocks once splitting blocks is implemented.
112       struct block_meta* block_ptr = get_block_ptr(ptr);
113       assert(block_ptr-&gt;free == 0);
114       assert(block_ptr-&gt;magic == 0x77777777 || block_ptr-&gt;magic == 0x12345678);
115       block_ptr-&gt;free = 1;
116       block_ptr-&gt;magic = 0x55555555;
117     }
</code></pre> <p>And then we can use <code>p</code> (for print) to see what's going on with the variables here:</p> <pre><code>(gdb) p ptr
$6 = (void *) 0x0
(gdb) p block_ptr
$7 = (struct block_meta *) 0xffffffffffffffe8
</code></pre> <p><code>ptr</code> is <code>0</code>, i.e., <code>NULL</code>, which is the cause of the problem: we forgot to check for NULL.</p> <p>Now that we've figured that out, let's try a slightly harder bug. Let's say that we decided to replace our struct with</p> <pre><code>struct block_meta {
  size_t size;
  struct block_meta *next;
  int free;
  int magic; // For debugging only. TODO: remove this in non-debug mode.
  char data[1];
};
</code></pre> <p>and then return <code>block-&gt;data</code> instead of <code>block+1</code> from malloc, with no other changes. This seems pretty similar to what we're already doing -- we just define a member that points to the end of the struct, and return a pointer to that.</p> <p>But here's what happens if we try to use our new malloc:</p> <pre><code>$ /bin/ls
Segmentation fault (core dumped)

gdb /bin/ls
(gdb) set environment LD_PRELOAD=./malloc.so
(gdb) run
Program received signal SIGSEGV, Segmentation fault.
_IO_vfprintf_internal (s=s@entry=0x7fffff7ff5f0, format=format@entry=0x7ffff7567370 &quot;%s%s%s:%u: %s%sAssertion `%s' failed.\n%n&quot;, ap=ap@entry=0x7fffff7ff718) at vfprintf.c:1332
1332    vfprintf.c: No such file or directory.
1327    in vfprintf.c
</code></pre> <p>This isn't as nice as our last error -- we can see that one of our asserts failed, but gdb drops us into some print function that's being called when the assert fails. But that print function uses our buggy malloc and blows up!</p> <p>One thing we could do from here would be to inspect <code>ap</code> to see what <code>assert</code> was trying to print:</p> <pre><code>(gdb) p *ap
$4 = {gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7fffff7ff7f0, reg_save_area = 0x7fffff7ff730}
</code></pre> <p>That would work fine; we could poke around until we figure out what's supposed to get printed and figure out the fail that way. Some other solutions would be to write our own custom assert or to use the right hooks to prevent <code>assert</code> from using our malloc.</p> <p>But in this case, we know there are only a few asserts in our code. The one in malloc checking that we don't try to use this in a multithreaded program and the two in free checking that we're not freeing something we shouldn't.
Let's look at free first, by setting a breakpoint.</p> <pre><code>$ gdb /bin/ls
(gdb) set environment LD_PRELOAD=./malloc.so
(gdb) break free
Breakpoint 1 at 0x400530
(gdb) run /bin/ls
Breakpoint 1, free (ptr=0x61c270) at malloc.c:112
112     if (!ptr) {
</code></pre> <p><code>block_ptr</code> isn't set yet, but if we use <code>s</code> a few times to step forward to after it's set, we can see what the value is:</p> <pre><code>(gdb) s
(gdb) s
(gdb) s
free (ptr=0x61c270) at malloc.c:118
118     assert(block_ptr-&gt;free == 0);
(gdb) p/x *block_ptr
$11 = {size = 0, next = 0x78, free = 0, magic = 0, data = &quot;&quot;}
</code></pre> <p>I'm using <code>p/x</code> instead of <code>p</code> so we can see it in hex. The <code>magic</code> field is 0, which should be impossible for a valid struct that we're trying to free. Maybe <code>get_block_ptr</code> is returning a bad offset? We have <code>ptr</code> available to us, so we can just inspect different offsets. Since it's a <code>void *</code>, we'll have to cast it so that gdb knows how to interpret the results.</p> <pre><code>(gdb) p sizeof(struct block_meta)
$12 = 32
(gdb) p/x *(struct block_meta*)(ptr-32)
$13 = {size = 0x0, next = 0x78, free = 0x0, magic = 0x0, data = {0x0}}
(gdb) p/x *(struct block_meta*)(ptr-28)
$14 = {size = 0x7800000000, next = 0x0, free = 0x0, magic = 0x0, data = {0x78}}
(gdb) p/x *(struct block_meta*)(ptr-24)
$15 = {size = 0x78, next = 0x0, free = 0x0, magic = 0x12345678, data = {0x6e}}
</code></pre> <p>If we back off a bit from the address we're using, we can see that the correct offset is 24 and not 32. What's happening here is that structs get padded, so that <code>sizeof(struct block_meta)</code> is 32, even though the last valid member is at <code>24</code>. If we want to cut out that extra space, we need to fix <code>get_block_ptr</code>.</p>
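<p>One way to do that (a sketch I'm adding here, not code from the original tutorial) is to compute the offset of <code>data</code> explicitly with <code>offsetof</code> instead of relying on <code>sizeof</code>, which includes the trailing padding:</p> <pre><code>#include &lt;stddef.h&gt; // for offsetof

// Hypothetical fix for the data[] variant of the struct: back up by the
// offset of the data member (24 here) rather than by
// sizeof(struct block_meta), which is padded out to 32.
struct block_meta *get_block_ptr(void *ptr) {
  return (struct block_meta *)((char *)ptr - offsetof(struct block_meta, data));
}
</code></pre>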
<p>That's it for debugging!</p> <h4 id="exercises">Exercises</h4> <p>Personally, this sort of thing never sticks with me unless I work through some exercises, so I'll leave a couple exercises here for anyone who's interested.</p> <ol> <li><p>malloc is supposed to return a pointer “which is suitably aligned for any built-in type”. Does our malloc do that? If so, why? If not, fix the alignment. Note that “any built-in type” is basically up to 8 bytes for C because SSE/AVX types aren't built-in types.</p></li> <li><p>Our malloc is really wasteful if we try to re-use an existing block and we don't need all of the space. Implement a function that will split up blocks so that they use the minimum amount of space necessary.</p></li> <li><p>After doing <code>2</code>, if we call malloc and free lots of times with random sizes, we'll end up with a bunch of small blocks that can only be re-used when we ask for small amounts of space. Implement a mechanism to merge adjacent free blocks together so that any consecutive free blocks will get merged into a single block.</p></li> <li><p>Find bugs in the existing code! I haven't tested this much, so I'm sure there are bugs, even if this basically kinda sorta works.</p></li> </ol> <h4 id="resources">Resources</h4> <p>As noted above, there's Marwan Burelle's <a href="http://www.inf.udec.cl/~leo/Malloc_tutorial.pdf">tutorial</a>.</p> <p>For more on how Linux deals with memory management, see <a href="http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory/">this post</a> by Gustavo Duarte.</p> <p>For more on how real-world malloc implementations work, <a href="http://g.oswego.edu/dl/html/malloc.html">dlmalloc</a> and <a href="http://goog-perftools.sourceforge.net/doc/tcmalloc.html">tcmalloc</a> are both great reading. I haven't read the code for <a href="http://www.canonware.com/jemalloc/">jemalloc</a>, and I've heard that it's a bit more difficult to understand, but it's also the most widely used high-performance malloc implementation around.</p> <p>For help debugging, <a href="https://code.google.com/p/address-sanitizer/wiki/AddressSanitizer">Address Sanitizer</a> is amazing. If you want to write a thread-safe version, <a href="https://code.google.com/p/data-race-test/wiki/ThreadSanitizer">Thread Sanitizer</a> is also a great tool.</p> <p>There's a <a href="http://mgarciaisaia.github.io/tutorial-c/blog/2014/12/26/un-tutorial-rapido-para-implementar-y-debuggear-malloc/">Spanish translation of this post here</a> thanks to Matias Garcia Isaia.</p> <h4 id="acknowledgements">Acknowledgements</h4> <p>Thanks to Gustavo Duarte for letting me use one of his images to illustrate sbrk, and to Ian Whitlock, Danielle Sucher, Nathan Kurz, &quot;tedu&quot;, @chozu@fedi.absturztau.be, and David Farrel for comments/corrections/discussion. Please <a href="https://twitter.com/danluu">let me know</a> if you find other bugs in this post (whether they're in the writing or the code).</p> Markets, discrimination, and "lowering the bar" tech-discrimination/ Mon, 01 Dec 2014 00:00:00 +0000 tech-discrimination/ <p>Public discussions of discrimination in tech often result in someone claiming that discrimination is impossible because of market forces. Here's a quote from Marc Andreessen that sums up a common view<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>.</p> <blockquote> <p>Let's launch right into it. I think the critique that Silicon Valley companies are deliberately, systematically discriminatory is incorrect, and there are two reasons to believe that that's the case. ... No. 2, our companies are desperate for talent. Desperate. Our companies are dying for talent. They're like lying on the beach gasping because they can't get enough talented people in for these jobs. The motivation to go find talent wherever it is unbelievably high.</p> </blockquote> <p>Marc Andreessen's point is that the market is too competitive for discrimination to exist. But VC funded startups aren't the first companies in the world to face a competitive hiring market. Consider the market for PhD economists from, say, 1958 to 1987. Alan Greenspan had this to say about how that market looked to his firm, Townsend-Greenspan.</p> <blockquote> <p>Townsend-Greenspan was unusual for an economics firm in that the men worked for the women (we had about twenty-five employees in all). My hiring of women economists was not motivated by women's liberation. It just made great business sense. I valued men and women equally, and found that because other employers did not, good women economists were less expensive than men. Hiring women . . .
gave Townsend-Greenspan higher-quality work for the same money . . .</p> </blockquote> <p>Not only did competition not end discrimination, there was enough discrimination that the act of not discriminating provided a significant competitive advantage for Townsend-Greenspan. And this is in finance, which is known for being cutthroat. And not just any part of finance, but one where it's PhD economists hiring other PhD economists. This is one of the industries where the people doing the hiring are the most likely to be familiar with both the theoretical models and the empirical research showing that discrimination opens up market opportunities by suppressing wages of some groups. But even that wasn't enough to equalize wages between men and women when Greenspan took over Townsend-Greenspan in 1958, and it still wasn't enough when Greenspan left to become chairman of the Fed in 1987. That's the thing about discrimination. When it's part of a deep-seated belief, it's hard for people to tell that they're discriminating.</p> <p>And yet, in discussions on tech hiring, people often claim that, since markets and hiring are perfectly competitive or efficient, companies must already be hiring the best people presented to them. A corollary of this is that anti-discrimination or diversity oriented policies necessarily mean &quot;lowering the bar&quot; since these would mean diverging from existing optimal hiring practices. And conversely, even when &quot;market forces&quot; aren't involved in the discussion, claiming that increasing hiring diversity necessarily means &quot;lowering the bar&quot; relies on an assumption of a kind of optimality in hiring. I think that an examination of tech hiring practices makes it pretty clear that <a href="programmer-moneyball/">practices are</a> <a href="algorithms-interviews/">far from optimal</a>, but rather than address this claim based on practices (which has been done in the linked posts), I'd like to look at the meta-claim that market forces make discrimination impossible. People make vague claims about market efficiency and economics, like <a href="https://twitter.com/rogerdickey/status/1268327765905772546">this influential serial founder who concludes his remarks on hiring with &quot;Capitalism is real and markets are efficient.&quot;</a><sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">2</a></sup>. People seem to love handwave-y citations of &quot;the market&quot; or &quot;economists&quot;.</p> <p>But if we actually read what economists have to say on how hiring markets work, they do not, in general, claim that markets are perfectly efficient or that discrimination does not occur in markets that might colloquially be called highly competitive. Since we're talking about discrimination, a good place to start might be <a href="http://web.archive.org/web/20150401024833/http://economics.mit.edu/files/553">Becker's seminal work on discrimination</a>. What Becker says is that markets impose a cost on discrimination, and that under certain market conditions, what Becker calls &quot;taste-based&quot;<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">3</a></sup> discrimination occurring on average doesn't mean there's discrimination at the margin. This is quite a specific statement and, if you read other papers in the literature on discrimination, they also make similarly specific statements.
What you don't see is anything like the handwave-y claims in tech discussions, that &quot;market forces&quot; or &quot;competition&quot; is incompatible with discrimination or non-optimal hiring. Quite frankly, I've never had a discussion with someone who says things like &quot;Capitalism is real and markets are efficient&quot; where it appears that they have even a passing familiarity with Becker's seminal work in the field of the economics of discrimination or, for that matter, any other major work on the topic.</p> <p>In discussions among the broader tech community, I have never seen anyone make a case that the tech industry (or any industry) meets the conditions under which taste-based discrimination on average doesn't imply marginal taste-based discrimination. Nor have I ever seen people make the case that we only have taste-based discrimination or that we also meet the conditions for not having other forms of discrimination. When people cite &quot;efficient markets&quot; with respect to hiring or other parts of tech, it's generally vague handwaving that sounds like an appeal to authority, but the authority is what someone might call a teenage libertarian's idea of how markets might behave.</p> <p>Since people often don't find abstract reasoning of the kind you see in Becker's work convincing, let's look at a few concrete examples. You can see discrimination in a lot of fields. A problem is that it's hard to separate out the effect of discrimination from confounding variables because it's hard to get good data on employee performance v. compensation over time. Luckily, there's one set of fields where that data is available: sports. And before we go into the examples, it's worth noting that we should, directionally, expect much less discrimination in sports than in tech. Not only is there much better data available on employee performance, it's easier to predict future employee performance from past performance, the impact of employee performance on &quot;company&quot; performance is greater and easier to quantify, and the market is more competitive. Relative to tech, these forces both increase the cost of discrimination and make the cost more visible.</p> <p>In baseball, <a href="https://ideas.repec.org/a/ucp/jpolec/v82y1974i4p873-81.html">Gwartney and Haworth (1974)</a> found that teams that discriminated less against non-white players in the decade following de-segregation performed better. Studies of later decades using “classical” productivity metrics mostly found that salaries equalize. However, <a href="http://www.hardballtimes.com/searching-for-racial-earnings-differentials-in-major-league-baseball/">Swartz (2014)</a>, using newer and more accurate metrics for productivity, found that Latino players are significantly underpaid for their productivity level. Compensation isn't the only way to discriminate -- <a href="http://sf.oxfordjournals.org/content/67/2/524.short">Jibou (1988)</a> found that black players had higher exit rates from baseball after controlling for age and performance. This should sound familiar to anyone who's wondered about <a href="http://leanangry.tumblr.com/post/125716699460/your-pipeline-argument-is-bullshit">exit rates in tech fields</a>.</p> <p>This slow effect of the market isn't limited to baseball; it actually seems to be worse in other sports.
A review article by <a href="http://www.jstor.org/stable/2524152">Kahn (1991)</a> notes that in basketball, the most recent studies (up to the date of the review) found an 11%-25% salary penalty for black players as well as a higher exit rate. Kahn also noted multiple studies showing discrimination against French-Canadians in hockey, which is believed to be due to stereotypes about how French-Canadian men are less masculine than other men<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">4</a></sup>.</p> <p>In tech, some people are concerned that increasing diversity will &quot;lower the bar&quot;, but in sports, which has a more competitive hiring market than tech, we saw the opposite: increasing diversity raised the level instead of lowering it because it means hiring people on their qualifications instead of on what they look like. I don't disagree with people who say that it would be absurd for tech companies to leave money on the table by not hiring qualified minorities. But this is exactly what we saw in the sports we looked at, where that's even more absurd due to the relative ease of quantifying performance. And yet, for decades, teams left huge amounts of money on the table by favoring white players (and, in the case of hockey, non-French Canadian players) who were, quite simply, less qualified than their peers. <a href="https://twitter.com/danluu/status/1270181411534761984">The world is an absurd place</a>.</p> <p>In fields where there's enough data to see if there might be discrimination, we often find discrimination. Even in fields that are among the most competitive fields in existence, like major professional sports. Studies on discrimination aren't limited to empirical studies and data mining. There have been experiments showing discrimination at every level, from initial resume screening to phone screening to job offers to salary negotiation to workplace promotions. And those studies are mostly in fields where there's something resembling gender parity. In fields where discrimination is weak enough that there's gender parity or racial parity in entrance rates, we can see steadily decreasing levels of discrimination over the last two generations. Discrimination hasn't been eliminated, but it's much reduced.</p> <p><img src="images/tech-discrimination/cs_enrollment.png" alt="Graph of enrollment by gender in med school, law school, the sciences, and CS. Graph courtesy of NPR." width="568" height="418"></p> <p>And then we have computer science. The disparity in entrance rates is about what it was for medicine, law, and the physical sciences in the 70s. As it happens, the excuses for the gender disparity are the exact same excuses that were trotted out in the 70s to justify why women didn't want to go into or couldn't handle technical fields like medicine, economics, finance, and biology.</p> <p>One argument that's commonly made is that women are inherently less interested in the &quot;harder&quot; sciences, so you'd expect more women to go into biology or medicine than programming. There are two major reasons I don't find that line of reasoning to be convincing. First, proportionally more women go into fields like math and chemical engineering than go into programming.
I think it's pointless to rank math and the sciences by how &quot;hard science&quot; they are, but if you ask people to rank these things, most people will put math above programming and if they know what's involved in a chemical engineering degree, I think they'll also put chemical engineering above programming and yet those fields have proportionally more women than programming. Second, if you look at other countries, they have wildly different proportions of people who study computer science for reasons that seem to mostly be cultural. Given that we do see all of this variation, I don't see any reason to think that the U.S. reflects the &quot;true&quot; rate that women want to study programming and that countries where (proportionally) many more women want to study programming have rates that are distorted from the &quot;true&quot; rate by cultural biases.</p> <p>Putting aside theoretical arguments, I wonder how it is that I've had such a different lived experience than Andreessen. His reasoning must sound reasonable in his head and stories of discrimination from women and minorities must not ring true. But to me, it's just the opposite.</p> <p>Just the other day, I was talking to John (this and all other names were chosen randomly in order to maintain anonymity), a friend of mine who's a solid programmer. It took him two years to find a job, which is shocking in today's job market for someone my age, but sadly normal for someone like him, who's twice my age.</p> <p>You might wonder if it's something about John besides his age, but when a Google coworker and I mock interviewed him he did fine. I did the standard interview training at Google and I interviewed for Google, and when I compare him to that bar, I'd say that his getting hired at Google would pretty much be a coin flip. Yes on a good day; no on a bad day. And when he interviewed at Google, he didn't get an offer, but he passed the phone screen and after the on-site they strongly suggested that he apply again in a year, which is a good sign. But most places wouldn't even talk to John.</p> <p>And even at Google, which makes a lot of hay about removing bias from their processes, the processes often fail to do so. When I referred Mary to Google, she got rejected in the recruiter phone screen as not being technical enough and I saw William face increasing levels of ire from a manager because of a medical problem, which eventually caused him to quit.</p> <p>Of course, in online discussions, people will call into question the technical competency of people like Mary. Well, Mary is one of the most impressive engineers I've ever met in any field. People mean different things when they say that, so let me provide a frame of reference: the other folks who fall into that category for me include an <a href="http://en.wikipedia.org/wiki/IBM_Fellow">IBM Fellow</a>, the person that IBM Fellow called the best engineer at IBM, a Math Olympiad medalist who's now a professor at CMU, a distinguished engineer at Sun, and a few other similar folks.</p> <p>So anyway, Mary gets on the phone with a Google recruiter. The recruiter makes some comments about how Mary has a degree in math and not CS, and might not be technical enough, and questions Mary's programming experience: was it “algorithms” or “just coding”? It goes downhill from there.</p> <p>Google has plenty of engineers without a CS degree, people with degrees in history, music, and the arts, and lots of engineers without any degree at all, not even a high school diploma. 
But somehow a math degree plus my internal referral mentioning that this was one of the best engineers I've ever seen resulted in the decision that Mary wasn't technical enough.</p> <p>You might say that, like the example with John, this is some kind of a fluke. Maybe. But from what I've seen, if Mary were a man and not a woman, the odds of a fluke would have been lower.</p> <p>This dynamic isn't just limited to hiring. I notice it every time I read the comments on one of Anna's blog posts. As often as not, someone will question Anna's technical chops. It's not even that they find a &quot;well, actually&quot; in the current post (although that sometimes happens); it's usually that they dig up some post from six months ago which, according to them, wasn't technical enough.</p> <p>I'm no more technical than Anna, but I have literally never had that happen to me. I've seen it happen to men, but only those who are extremely high profile (among the top N most well-known tech bloggers, like Steve Yegge or Jeff Atwood), or who are pushing an agenda that's often condescended to (like dynamic languages). But it regularly happens to moderately well-known female bloggers like Anna.</p> <p>Differential treatment of women and minorities isn't limited to hiring and blogging. I've lost track of the number of times a woman has offhandedly mentioned to me that some guy assumed she was a recruiter, a front-end dev, a wife, a girlfriend, or a UX consultant. It happens everywhere. At conferences. At parties full of devs. At work. Everywhere. Not only has that never happened to me, the opposite regularly happens to me -- if I'm hanging out with physics or math grad students, people assume I'm a fellow grad student.</p> <p>When people bring up the market in discussions like these, they make it sound like it's a force of nature. It's not. It's just a word that describes the collective actions of people under some circumstances. Mary's situation didn't automatically get fixed because it's a free market. Mary's rejection by the recruiter got undone when I complained to my engineering director, who put me in touch with an HR director who patiently listened to the story and overturned the decision<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">5</a></sup>. The market is just humans. 
It's humans all the way down.</p> <p>We can fix this, if we stop assuming the market will fix it for us.</p> <h3 id="appendix-a-few-related-items">Appendix: a few related items</h3> <ul> <li><a href="//danluu.com/gender-gap/">People literally read the opposite result into studies they look at</a></li> <li>At <a href="http://fse22.gatech.edu/">FSE</a> this year, a speaker noted that if you show a bunch of programmers random data, they will interpret that data as supporting their prior beliefs about best practices</li> <li><a href="https://ed.stanford.edu/sites/default/files/uhlmann_et_2005.pdf">The more biased people are, the more objective they think they are</a></li> </ul> <p>Also, note that although this post was originally published in 2014, it was updated in 2020 with links to some more recent comments and a bit of re-organization.</p> <p><small> Thanks to Leah Hanson, Kelley Eskridge, Lindsey Kuper, Nathan Kurz, Scott Feeney, Katerina Barone-Adesi, Yuri Vishnevsky, @teles_dev, &quot;Negative12DollarBill&quot;, and Patrick Roberts for feedback on this post, and to Julia Evans for encouraging me to post this when I was on the fence about writing this up publicly.</p> <p>Note that all names in this post are aliases, taken from a list of common names in the U.S. as of 1880. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">If you're curious what his “No. 1” was, it was that there can't be discrimination because just look at all the diversity we have. Chinese. Indians. Vietnamese. And so on. The argument is that it's not possible that we're discriminating against some groups because we're not discriminating against other groups. In particular, it's not possible that we're discriminating against groups that don't fit the stereotypical engineer mold because we're not discriminating against groups that do fit the stereotypical engineer mold. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:A">See also this comment by Benedict Evans &quot;refuting&quot; a comment that SV companies may have sub-optimal hiring practices for employees by saying &quot;<a href="https://twitter.com/benedictevans/status/1274124694913011715">I don’t have to tell you that there is a ferocious war for talent in the valley.</a>&quot;. That particular comment isn't one about diversity or discrimination, but the general idea that the SV job market somehow enforces a kind of optimality is pervasive among SV thought leaders. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:T"><p>&quot;taste-based&quot; discrimination is discrimination based on preferences that are unrelated to any actual productivity differences between groups that might exist. Of course, it's common for people to claim that <a href="https://lwn.net/Articles/824148/">they've never seen racism or sexism in some context</a>, often with the implication and sometimes with an explicit claim that any differences we see are due to population level differences. If that were the case, we'd want to look at the literature on &quot;statistical&quot; discrimination. However, statistical discrimination doesn't seem like it should be relevant to this discussion.
A contrived example of a case where statistical discrimination would be relevant is if we had to hire basketball players solely off of their height and weight with no ability to observe their play, either directly or statistically.</p> <p>In that case, teams would want to exclusively hire tall basketball players, since, if all you have to go on is height, height is a better proxy for basketball productivity than nothing. However, if we consider the non-contrived example of actual basketball productivity and compare the actual productivity of NBA basketball players vs. their height, there is (with the exception of outliers who are very unusually short for basketball players) no correlation between height and performance. The reason is that, if we can measure performance directly, we can simply hire based on performance, which takes height out of the performance equation. The exception to this is for very short players, who have to overcome biases (taste-based discrimination) that cause people to overlook them.</p> <p>While measures of programming productivity are quite poor, the actual statistical correlation between race and gender and productivity among the entire population is zero as best anyone can tell, making statistical discrimination irrelevant.</p> <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:E">The evidence here isn't totally unequivocal. In the review, Kahn notes that for some areas, there are early studies finding no pay gap, but those were done with small samples of players. Also, Kahn notes that (at the time) there wasn't enough evidence in football to say much either way. <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> <li id="fn:B">In the interest of full disclosure, this <a href="http://fromthearchives.blogspot.com/2006/06/always-bid.html">didn't change the end result</a>, since Mary didn't want to have anything to do with Google after the first interview. Given that the first interview went how it did, making that Mary's decision and not Google's was probably the best likely result, though, and from the comments I heard from the HR director, it sounded like there might be a lower probability of the same thing happening again in the future. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> </ol> </div> TF-IDF linux commits linux-devs-say/ Mon, 24 Nov 2014 00:00:00 +0000 linux-devs-say/ <p>I was curious what different people worked on in Linux, so I tried grabbing data from the current git repository to see if I could pull that out of commit message data. This doesn't include history from before they switched to git, so it only goes back to 2005, but that's still a decent chunk of history.</p> <p>Here's a list of the most commonly used words (in commit messages), by the top four most frequent committers, with users ordered by number of commits.</p> <p></p> <table> <thead> <tr> <th>User</th> <th>1</th> <th>2</th> <th>3</th> <th>4</th> <th>5</th> </tr> </thead> <tbody> <tr> <td>viro</td> <td>to</td> <td>in</td> <td>of</td> <td>and</td> <td>the</td> </tr> <tr> <td>tiwai</td> <td>alsa</td> <td>the</td> <td>-</td> <td>for</td> <td>to</td> </tr> <tr> <td>broonie</td> <td>the</td> <td>to</td> <td>asoc</td> <td>for</td> <td>a</td> </tr> <tr> <td>davem</td> <td>the</td> <td>to</td> <td>in</td> <td>and</td> <td>sparc64</td> </tr> </tbody> </table> <p><br> Alright, so their most frequently used words are <code>to</code>, <code>alsa</code>, <code>the</code>, and <code>the</code>.
Turns out, Takashi Iwai (tiwai) often works on audio (alsa), and by going down the list we can see that David Miller's (davem) fifth most frequently used term is <code>sparc64</code>, which is a pretty good indicator that he does a lot of sparc work. But the table is mostly noise. Of course people use <code>to</code>, <code>in</code>, and other common words all the time! Putting that into a table provides zero information.</p> <p>There are a number of standard techniques for dealing with this. One is to explicitly filter out &quot;stop words&quot;, common words that we don't care about. Unfortunately, that doesn't work well with this dataset without manual intervention. Standard stop-word lists are going to miss things like <code>Signed-off-by</code> and <code>cc</code>, which are pretty uninteresting. We can generate a custom list of stop words using some threshold for common words in commit messages, but any threshold high enough to catch all of the noise is also going to catch commonly used but interesting terms like <code>null</code> and <code>driver</code>.</p> <p>Luckily, it only takes about a minute to do by hand. After doing that, the result is that many of the top words are the same for different committers. I won't reproduce the table of top words by committer because it's just many of the same words repeated many times. Instead, here's the table of the top words (ranked by number of commit messages that use the word, not raw count), with stop words removed, which has the same data without the extra noise of being broken up by committer.</p> <table> <thead> <tr> <th>Word</th> <th>Count</th> </tr> </thead> <tbody> <tr> <td>driver</td> <td>49442</td> </tr> <tr> <td>support</td> <td>43540</td> </tr> <tr> <td>function</td> <td>43116</td> </tr> <tr> <td>device</td> <td>32915</td> </tr> <tr> <td>arm</td> <td>28548</td> </tr> <tr> <td>error</td> <td>28297</td> </tr> <tr> <td>kernel</td> <td>23132</td> </tr> <tr> <td>struct</td> <td>18667</td> </tr> <tr> <td>warning</td> <td>17053</td> </tr> <tr> <td>memory</td> <td>16753</td> </tr> <tr> <td>update</td> <td>16088</td> </tr> <tr> <td>bit</td> <td>15793</td> </tr> <tr> <td>usb</td> <td>14906</td> </tr> <tr> <td>bug</td> <td>14873</td> </tr> <tr> <td>register</td> <td>14547</td> </tr> <tr> <td>avoid</td> <td>14302</td> </tr> <tr> <td>pointer</td> <td>13440</td> </tr> <tr> <td>problem</td> <td>13201</td> </tr> <tr> <td>x86</td> <td>12717</td> </tr> <tr> <td>address</td> <td>12095</td> </tr> <tr> <td>null</td> <td>11555</td> </tr> <tr> <td>cpu</td> <td>11545</td> </tr> <tr> <td>core</td> <td>11038</td> </tr> <tr> <td>user</td> <td>11038</td> </tr> <tr> <td>media</td> <td>10857</td> </tr> <tr> <td>build</td> <td>10830</td> </tr> <tr> <td>missing</td> <td>10508</td> </tr> <tr> <td>path</td> <td>10334</td> </tr> <tr> <td>hardware</td> <td>10316</td> </tr> </tbody> </table> <p><br> Ok, so there's been a lot of work on <code>arm</code>, lots of stuff related to <code>memory</code>, <code>null</code>, <code>pointer</code>, etc. But if we want to see what individuals work on, we'll need something else.</p> <p>That something else could be penalizing more common words without eliminating them entirely. A standard metric to normalize by is the inverse document frequency (IDF), <code>log(# of messages / # of messages with word)</code>.
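<p>Concretely, weighting each committer's word counts by this penalty only takes a few lines. Here's a minimal sketch (in Python rather than the Julia used for the original analysis), assuming the commit messages have already been pulled out of <code>git log</code> into (author, message) pairs, which isn't shown, and skipping the hand-picked stop words:</p> <pre><code>import math
from collections import Counter, defaultdict

def tf_idf_by_author(messages):
    # messages: list of (author, commit_message) pairs.
    doc_freq = Counter()                 # how many messages contain each word
    term_count = defaultdict(Counter)    # raw word counts per author
    for author, msg in messages:
        words = msg.lower().split()
        doc_freq.update(set(words))      # count each message at most once per word
        term_count[author].update(words)
    n_messages = len(messages)
    # Weight each count by log(# of messages / # of messages with word) and
    # sort each author's words from most to least distinctive.
    scores = {}
    for author, counts in term_count.items():
        weighted = [(w, c * math.log(n_messages / doc_freq[w])) for w, c in counts.items()]
        scores[author] = sorted(weighted, key=lambda pair: pair[1], reverse=True)
    return scores

# e.g. the top five words for one committer: tf_idf_by_author(messages)['tiwai'][:5]
</code></pre> <p>Note that the document frequency here is computed over individual commit messages, matching the formula above, rather than over each committer's messages concatenated into one big document; either choice works, but they penalize common words differently.</p>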
So instead of ordering by term count or term frequency, let's try ordering by <code>(term frequency) * log(# of messages / # of messages with word)</code>, which is commonly called TF-IDF<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">1</a></sup>. This gives us words that one person used that aren't commonly used by other people.</p> <p>Here's a list of the top 40 linux committers and their most commonly used words, according to TF-IDF.</p> <table> <thead> <tr> <th>User</th> <th>1</th> <th>2</th> <th>3</th> <th>4</th> <th>5</th> </tr> </thead> <tbody> <tr> <td>viro</td> <td>switch</td> <td>annotations</td> <td>patch</td> <td>of</td> <td>endianness</td> </tr> <tr> <td>tiwai</td> <td>alsa</td> <td>hda</td> <td>codec</td> <td>codecs</td> <td>hda-codec</td> </tr> <tr> <td>broonie</td> <td>asoc</td> <td>regmap</td> <td>mfd</td> <td>regulator</td> <td>wm8994</td> </tr> <tr> <td>davem</td> <td>sparc64</td> <td>sparc</td> <td>we</td> <td>kill</td> <td>fix</td> </tr> <tr> <td>gregkh</td> <td>cc</td> <td>staging</td> <td>usb</td> <td>remove</td> <td>hank</td> </tr> <tr> <td>mchehab</td> <td>v4l/dvb</td> <td>media</td> <td>at</td> <td>were</td> <td>em28xx</td> </tr> <tr> <td>tglx</td> <td>x86</td> <td>genirq</td> <td>irq</td> <td>prepare</td> <td>shared</td> </tr> <tr> <td>hsweeten</td> <td>comedi</td> <td>staging</td> <td>tidy</td> <td>remove</td> <td>subdevice</td> </tr> <tr> <td>mingo</td> <td>x86</td> <td>sched</td> <td>zijlstra</td> <td>melo</td> <td>peter</td> </tr> <tr> <td>joe</td> <td>unnecessary</td> <td>checkpatch</td> <td>convert</td> <td>pr_<level></td> <td>use</td> </tr> <tr> <td>tj</td> <td>cgroup</td> <td>doesnt</td> <td>which</td> <td>it</td> <td>workqueue</td> </tr> <tr> <td>lethal</td> <td>sh</td> <td>up</td> <td>off</td> <td>sh64</td> <td>kill</td> </tr> <tr> <td>axel.lin</td> <td>regulator</td> <td>asoc</td> <td>convert</td> <td>thus</td> <td>use</td> </tr> <tr> <td>hch</td> <td>xfs</td> <td>sgi-pv</td> <td>sgi-modid</td> <td>remove</td> <td>we</td> </tr> <tr> <td>sachin.kamat</td> <td>redundant</td> <td>remove</td> <td>simpler</td> <td>null</td> <td>of_match_ptr</td> </tr> <tr> <td>bzolnier</td> <td>ide</td> <td>shtylyov</td> <td>sergei</td> <td>acked-by</td> <td>caused</td> </tr> <tr> <td>alan</td> <td>tty</td> <td>gma500</td> <td>we</td> <td>up</td> <td>et131x</td> </tr> <tr> <td>ralf</td> <td>mips</td> <td>fix</td> <td>build</td> <td>ip27</td> <td>of</td> </tr> <tr> <td>johannes.berg</td> <td>mac80211</td> <td>iwlwifi</td> <td>it</td> <td>cfg80211</td> <td>iwlagn</td> </tr> <tr> <td>trond.myklebust</td> <td>nfs</td> <td>nfsv4</td> <td>sunrpc</td> <td>nfsv41</td> <td>ensure</td> </tr> <tr> <td>shemminger</td> <td>sky2</td> <td>net_device_ops</td> <td>skge</td> <td>convert</td> <td>bridge</td> </tr> <tr> <td>bunk</td> <td>static</td> <td>needlessly</td> <td>global</td> <td>patch</td> <td>make</td> </tr> <tr> <td>hartleys</td> <td>comedi</td> <td>staging</td> <td>remove</td> <td>subdevice</td> <td>driver</td> </tr> <tr> <td>jg1.han</td> <td>simpler</td> <td>device_release</td> <td>unnecessary</td> <td>clears</td> <td>thus</td> </tr> <tr> <td>akpm</td> <td>cc</td> <td>warning</td> <td>fix</td> <td>function</td> <td>patch</td> </tr> <tr> <td>rmk+kernel</td> <td>arm</td> <td>acked-by</td> <td>rather</td> <td>tested-by</td> <td>we</td> </tr> <tr> <td>daniel.vetter</td> <td>drm/i915</td> <td>reviewed-by</td> <td>v2</td> <td>wilson</td> <td>vetter</td> </tr> <tr> <td>bskeggs</td> <td>drm/nouveau</td> <td>drm/nv50</td> <td>drm/nvd0/disp</td> <td>on</td> 
<td>chipsets</td> </tr> <tr> <td>acme</td> <td>galbraith</td> <td>perf</td> <td>weisbecker</td> <td>eranian</td> <td>stephane</td> </tr> <tr> <td>khali</td> <td>hwmon</td> <td>i2c</td> <td>driver</td> <td>drivers</td> <td>so</td> </tr> <tr> <td>torvalds</td> <td>linux</td> <td>commit</td> <td>just</td> <td>revert</td> <td>cc</td> </tr> <tr> <td>chris</td> <td>drm/i915</td> <td>we</td> <td>gpu</td> <td>bugzilla</td> <td>whilst</td> </tr> <tr> <td>neilb</td> <td>md</td> <td>array</td> <td>so</td> <td>that</td> <td>we</td> </tr> <tr> <td>lars</td> <td>asoc</td> <td>driver</td> <td>iio</td> <td>dapm</td> <td>of</td> </tr> <tr> <td>kaber</td> <td>netfilter</td> <td>conntrack</td> <td>net_sched</td> <td>nf_conntrack</td> <td>fix</td> </tr> <tr> <td>dhowells</td> <td>keys</td> <td>rather</td> <td>key</td> <td>that</td> <td>uapi</td> </tr> <tr> <td>heiko.carstens</td> <td>s390</td> <td>since</td> <td>call</td> <td>of</td> <td>fix</td> </tr> <tr> <td>ebiederm</td> <td>namespace</td> <td>userns</td> <td>hallyn</td> <td>serge</td> <td>sysctl</td> </tr> <tr> <td>hverkuil</td> <td>v4l/dvb</td> <td>ivtv</td> <td>media</td> <td>v4l2</td> <td>convert</td> </tr> </tbody> </table> <p><br> That's more like it. Some common words still appear -- this would really be improved with manual stop words to remove things like <code>cc</code> and <code>of</code>. But for the most part, we can see who works on what. Takashi Iwai (tiwai) spends a lot of time in hda land and working on codecs, David S. Miller (davem) has spent a lot of time on sparc64, Ralf Baechle (ralf) does a lot of work with mips, etc. And then again, maybe it's interesting that some, but not all, people <code>cc</code> other folks so much that it shows up in their top 5 list even after getting penalized by <code>IDF</code>.</p> <p>We can also use this to see the distribution of what people talk about in their commit messages vs. how often they commit.</p> <p><img src="images/linux-devs-say/null-percentile.png" alt="Who cares about null? These people!"></p> <p>This graph has people on the x-axis and relative word usage (ranked by TF-IDF) on the y-axis. On the x-axis, the most frequent committers are on the left and the least frequent are on the right. On the y-axis, points are higher up if that committer used the word <code>null</code> more frequently, and lower if the person used the word <code>null</code> less frequently.</p> <p><img src="images/linux-devs-say/posix-percentile.png" alt="Who cares about POSIX? Almost nobody!"></p> <p>Relatively, almost no one works on <code>POSIX</code> compliance. You can actually count the individual people who mentioned <code>POSIX</code> in commit messages.</p> <p>This is the point of the blog post where you might expect some kind of summary, or at least a vague point. Sorry. No such luck. I just did this because TF-IDF is one of a zillion concepts presented in the <a href="https://www.coursera.org/course/mmds">Mining Massive Data Sets</a> course running now, and I knew it wouldn't really stick unless I wrote some code.</p> <p>If you really must have a conclusion, TF-IDF is sometimes useful and incredibly easy to apply. You should use it when you should use it (when you want to see what words distinguish different documents/people from each other) and you shouldn't use it when you shouldn't use it (when you want to see what's common to documents/people). The end.</p> <p><small> I'm experimenting with blogging more by spending less time per post and just spewing stuff out in a 30-90 minute sitting.
Please <a href="https://twitter.com/danluu">let me know</a> if something is unclear or just plain wrong. Seriously. I went way over time on this one, but that's mostly because argh data and tables and bugs in Julia, not because of proofreading. I'm sure there are bugs!</p> <p>Thanks to Leah Hanson for finding a bunch of writing bugs in this post and to Zack Maril for a conversation on how to maybe display change over time in the future. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:H">I actually don't understand why it's standard to take the log here. Sometimes you want to take the log so you can work with smaller numbers, or so that you can convert a bunch of multiplies into a bunch of adds, but neither of those is true here. Please <a href="https://twitter.com/danluu">let me know</a> if this is obvious to you. <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> </ol> </div> One week of bugs everything-is-broken/ Tue, 18 Nov 2014 00:00:00 +0000 everything-is-broken/ <p>If I had to guess, I'd say I probably work around hundreds of bugs in an average week, and thousands in a bad week. It's not unusual for me to run into a hundred new bugs in a single week. But I often get skepticism when I mention that I run into multiple new (to me) bugs per day, and that this is inevitable if we don't change how we write tests. Well, here's a log of one week of bugs, limited to bugs that were new to me that week. After a brief description of the bugs, I'll talk about what we can do to improve the situation. The obvious answer is to spend more effort on testing, but everyone already knows we should do that and no one does it. That doesn't mean it's hopeless, though.</p> <h3 id="one-week-of-bugs">One week of bugs</h3> <h4 id="ubuntu">Ubuntu</h4> <p>When logging into my machine, I got a screen saying that I entered my password incorrectly. After a five second delay, it logged me in anyway. This is probably at least two bugs, perhaps more.</p> <h4 id="github">GitHub</h4> <p>GitHub switched from Pygments to whatever they use for Atom, <a href="http://www.greghendershott.com/2014/11/github-dropped-pygments.html">breaking syntax highlighting for most languages</a>. The HN comments on this indicate that it's not just something that affects obscure languages; Java, PHP, C, and C++ all have noticeable breakage.</p> <p><a href="https://github.com/github/linguist/issues/1717">In a GitHub issue</a>, a GitHub developer says</p> <p><small></p> <blockquote> <p>You're of course free to fork the Racket bundle and improve it as you see fit. I'm afraid nobody at GitHub works with Racket so we can't judge what proper highlighting looks like. But we'll of course pull your changes thanks to the magic of O P E N S O U R C E.</p> </blockquote> <p></small></p> <p>A bit ironic after the recent keynote talk by another GitHub employee titled <a href="http://zachholman.com/talk/move-fast-break-nothing/">“move fast and break nothing”</a>. Not to mention that it's unlikely to work. The last time I submitted a PR to linguist, it only got merged after I wrote <a href="//danluu.com/discourage-oss/">a blog post pointing out that they had 100s of open PRs, some of which were a year old</a>, which got them to merge a bunch of PRs after the post hit reddit. As far as I can tell, &quot;the magic of O P E N S O U R C E&quot; is code for the magic of hitting the front page of reddit/HN or having lots of twitter followers.</p> <p>Also, icons were broken for a while.
Was that this past week?</p> <h4 id="linkedin">LinkedIn</h4> <p>After replying to someone's “InMail”, I checked on it a couple days later, and their original message was still listed as unread (with no reply). Did it actually send my reply? I had no idea, until the other person responded.</p> <h4 id="inbox">Inbox</h4> <p>The Inbox app (not to be confused with Inbox App) notifies you that you have a new message before it actually downloads the message. It takes an arbitrary amount of time before the app itself gets the message, and refreshing in the app doesn't cause the message to download.</p> <p>The other problem with notifications is that they sometimes don't show up when you get a message. About half the time I get a notification from the gmail app, I also get a notification from the Inbox app. The other half of the time, the notification is dropped.</p> <p>Overall, I get a notification for a message that I can read maybe 1/3 of the time.</p> <h4 id="google-analytics">Google Analytics</h4> <p>Some locations near the U.S. (like Mexico City and Toronto) aren't considered worthy of getting their own country. The location map shows these cities sitting in the blue ocean that's outside of the U.S.</p> <h4 id="octopress">Octopress</h4> <p>Footnotes don't work correctly on the main page if you allow posts on the main page (instead of the index) and use the syntax to put something below the fold. Instead of linking to the footnote, you get a reference to anchor text that goes nowhere. This is in addition to the other footnote bug I already knew about.</p> <p>Tags are only downcased in some contexts but not others, which means that any tags with capitalized letters (sometimes) don't work correctly. I don't even use tags, but I noticed this on someone else's blog.</p> <p>My Atom feed <a href="https://twitter.com/chmaynard/status/534540677078855680">doesn't work correctly</a>.</p> <p>If you consider performance bugs to be problems, I noticed so many of those this past week that they <a href="//danluu.com/octopress-speedup/">have their own blog post</a>.</p> <h4 id="running-with-rifles-game">Running with Rifles (Game)</h4> <p>Weapons that are supposed to stun injure you instead. I didn't even realize that was a bug until someone mentioned that would be fixed in the next version.</p> <p>It's possible to stab people through walls.</p> <p>If you're holding a key when the level changes, your character keeps doing that action continuously during the next level, even after you've released the key.</p> <p>Your character's position will randomly get out of sync from the server. When that happens, the only reliable fix I've found is to randomly shoot for a while. Apparently shooting causes the client to do something like send a snapshot of your position to the server? Not sure why that doesn't just happen regularly.</p> <p>Vehicles can randomly spawn on top of you, killing you.</p> <p>You can randomly spawn under a vehicle, killing you.</p> <p>AI teammates don't consider walls or buildings when throwing grenades, which often causes them to kill themselves.</p> <p>Grenades will sometimes damage the last vehicle you were in even when you're nowhere near the vehicle.</p> <p>AI vehicles can get permanently stuck on pretty much any obstacle.</p> <p>This is the first video game I've played in about 15 years. I tend to think of games as being pretty reliable, but that's probably because games were much simpler 15 years ago. 
MS Paint doesn't have many bugs, either.</p> <p>Update: The sync issue above is caused by memory leaks. I originally thought that the game just had very poor online play code, but it turns out it's actually ok for the first 6 hours or so after a server restart. There are scripts around to restart the servers periodically, but they sometimes have bugs which cause them to stop running. When that happens on the official servers, the game basically becomes unplayable online.</p> <h4 id="julia">Julia</h4> <p>Unicode sequence causes match/ismatch to blow up with a bounds error.</p> <p>Unicode sequence causes using a string as a hash index to blow up with a bounds error.</p> <p>Exception randomly not caught by catch. This sucks because putting things in a try/catch was the workaround for the two bugs above. I've seen other variants of this before; it's possible this shouldn't count as a new bug because it might be the same root cause as some bug I've already seen.</p> <p>Function (I forget which) returns completely wrong results when given bad length arguments. You can even give it length arguments of the wrong type, and it will still “work” instead of throwing an exception or returning an error.</p> <p>If API design bugs count, methods that operate on iterables sometimes take the iterable as the first argument and sometimes don't. There are way too many of these to list. To take one example, <code>match</code> takes a regex first and a string second, whereas <code>search</code> takes a string first and a regex second. This week, I got bit by something similar on a numerical function.</p> <p>And of course I'm still running into the <a href="https://github.com/JuliaLang/julia/issues/8631">1+ month old bug that breaks convert</a>, which is pervasive enough that anything that causes it to happen renders Julia unusable.</p> <p>Here's one which might be an OS X bug? I had some bad code that caused an infinite loop in some Julia code. Nothing actually happened in the <code>while</code> loop, so it would just run forever. Oops. The bug is that this somehow caused my system to run out of memory and become unresponsive. Activity monitor showed that the kernel was taking an ever increasing amount of memory, which went away when I killed the Julia process.</p> <p>I won't list bugs in packages because there are too many. Even in core Julia, I've run into so many Julia bugs that I don't file bugs any more. It's just too much of an interruption. When I have some time, I should spend a day filing all the bugs I can remember, but I think it would literally take a whole day to write up a decent, reproducible bug report for each bug.</p> <p>See <a href="//danluu.com/julialang/">this post</a> for more on why I run into so many Julia bugs.</p> <h4 id="google-hangouts">Google Hangouts</h4> <p>On starting a hangout: &quot;This video call isn't available right now. Try again in a few minutes.&quot;.</p> <p>Same person appears twice in contacts list. Both copies have the same email listed, and double clicking on either brings me to the same chat window.</p> <h4 id="uw-health">UW Health</h4> <p>The latch mechanism isn't quite flush to the door on about 10% of lockers, so your locker won't actually be latched unless you push hard against the door while moving the latch to the closed position.</p> <p>There's no visual (or other) indication that the latch failed to latch.
As far as I can tell, the only way to check is to tug on the handle to see if the door opens after you've tried to latch it.</p> <h4 id="coursera-mining-massive-data-sets">Coursera, Mining Massive Data Sets</h4> <p>Selecting the correct quiz answer gives you 0 points. The workaround (independently discovered by multiple people on the forums) is to keep submitting until the correct answer gives you 1 point. This is a week after a quiz had incorrect answer options which resulted in there being no correct answers.</p> <h4 id="facebook">Facebook</h4> <p>If you do something “wrong” with the mouse while scrolling down on someone's wall, the blue bar at the top can somehow transform into a giant block the size of your cover photo that doesn't go away as you scroll down.</p> <p>Clicking on the activity sidebar on the right pops something that's under other UI elements, making it impossible to read or interact with.</p> <h4 id="pandora">Pandora</h4> <p>A particular station keeps playing electronic music, even though I hit thumbs down every time an electronic song comes on. The seed song was a song from a Disney musical.</p> <h4 id="dropbox-zulip">Dropbox/Zulip</h4> <p>An old issue is that you can't disable notifications from <code>@all</code> mentions. Since literally none of them have been relevant to me for as long as I can remember, and <code>@all</code> notifications outnumber other notifications, it means that the majority of notifications I get are spam.</p> <p>The new thing is that I tried muting the streams that regularly spam me, but the notification blows through the mute. My fix for that is that I've disabled all notifications, but now I don't get a notification if someone DMs me or uses <code>@danluu</code>.</p> <h4 id="chrome">Chrome</h4> <p>The Rust guide is unreadable with my version of chrome (no plug-ins).</p> <p><img src="images/everything-is-broken/rust_guide.png" alt="Unreadable quoted blocks" width="773" height="356"></p> <h4 id="google-docs">Google Docs</h4> <p>I tried co-writing a doc with <a href="http://rose.github.io/">Rose Ames</a>. Worked fine for me, but everything displayed as gibberish for her, so we switched to hackpad.</p> <p>I didn't notice this until after I tried hackpad, but Docs is really slow. Hackpad feels amazingly responsive, but it's really just that Docs is laggy. It's the same feeling I had after I tried fastmail. Gmail doesn't seem slow until you use something that isn't slow.</p> <h4 id="hackpad">Hackpad</h4> <p>Hours after the doc was created, it says “ROSE AMES CREATED THIS 1 MINUTE AGO.”</p> <p>The right hand side list, which shows who's in the room, has a stack of N people even though there are only 2 people.</p> <h4 id="rust">Rust</h4> <p>After all that, Rose and I worked through the Rust guide. I won't list the issues here because they're so long that our hackpad doc that's full of bugs is at least twice as long as this blog post. And this isn't a knock against the Rust docs, the docs are actually much better than for almost any other language.</p> <h3 id="wat">WAT</h3> <p><blockquote class="twitter-tweet" lang="en"><p>I&#39;m in a super good mood. Everything is still broken, but now it&#39;s funny instead of making me mad.</p>&mdash; Gary Bernhardt (@garybernhardt) <a href="https://twitter.com/garybernhardt/status/296033898822004738">January 28, 2013</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script></p> <p>What's going on here? 
If you include the bugs I'm not listing because the software is so buggy that listing all of the bugs would triple the length of this post, that's about 80 bugs in one week. And that's only counting bugs I hadn't seen before. How come there are so many bugs in everything?</p> <p>A common response to this sort of comment is that it's open source, you ungrateful sod, why don't you fix the bugs yourself? I do fix some bugs, but there literally aren't enough hours in a week for me to debug and fix every bug I run into. There's a tragedy of the commons effect here. If there are only a few bugs, developers are likely to fix the bugs they run across. But if there are so many bugs that making a dent is hopeless, a lot of people won't bother.</p> <p>I'm going to take a look at Julia because I'm already familiar with it, but I expect that it's no better or worse tested than most of these other projects (except for Chrome, which is relatively well tested). As a rough proxy for how much test effort has gone into it, it has 18k lines of test code. But that's compared to about 108k lines of code in <code>src</code> plus <code>Base</code>.</p> <p>At every place I've worked, a 2k LOC prototype that exists just so you can get preliminary performance numbers and maybe play with the API is expected to have at least that much in tests because otherwise how do you know that it's not so broken that your performance estimates aren't off by an order of magnitude? Since complexity doesn't scale linearly in LOC, folks expect a lot more test code as the prototype gets bigger.</p> <p>At 18k LOC in tests for 108k LOC of code, users are going to find bugs. A lot of bugs.</p> <p>Here's where I'm supposed to write an appeal to take testing more seriously and <a href="//danluu.com/empirical-pl/#fn2">put real effort into it</a>. But we all know that's not going to work. It would take another 90k LOC of tests to get Julia to be as well tested as a poorly tested prototype -- matching the 1:1 test-to-code ratio above means roughly 108k LOC of tests against the existing 18k (and that's falsely assuming complexity scales linearly with size). That's two person-years of work, not even including time to debug and fix bugs (which probably brings it closer to four or five years). Who's going to do that? No one. Writing tests is like writing documentation. Everyone already knows you should do it. Telling people they should do it adds zero information<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">1</a></sup>.</p> <p>Given that people aren't going to put any effort into testing, what's the best way to do it?</p> <p>Property-based testing. Generative testing. Random testing. Concolic testing (which was done long before the term was coined). Static analysis. <a href="//danluu.com/everything-is-broken/">Fuzzing</a>. <a href="//danluu.com/bugalytics/">Statistical bug finding</a>. There are lots of options. Some of them are actually the same thing because the terminology we use is inconsistent and buggy. I'm going to arbitrarily pick one to talk about, but they're all worth looking into.</p> <p>People are often intimidated by these, though. I've seen a lot of talks on these and they often make it sound like this stuff is really hard. <a href="http://embed.cs.utah.edu/csmith/">Csmith</a> is 40k LOC. <a href="https://code.google.com/p/american-fuzzy-lop/">American Fuzzy Lop</a>'s compile-time instrumentation is smart enough to <a href="http://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thin-air.html">generate valid JPEGs</a>.
<a href="http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2989">Sixth Sense</a> has the same kind of intelligence as American Fuzzy Lop in terms of exploration, and in addition, uses symbolic execution to exhaustively explore large portions of the state space; it will formally verify that your asserts hold if it's able to collapse the state space enough to exhaustively search it, otherwise it merely tries to get the best possible test coverage by covering different paths and states. In addition, it will use symbolic equivalence checking to check different versions of your code against each other.</p> <p>That's all really impressive, but you don't need a formal methods PhD to do this stuff. You can write a fuzzer that will shake out a lot of bugs in an hour<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">2</a></sup>. Seriously. I'm a bit embarrassed to link to this, but <a href="https://github.com/danluu/Fuzz.jl">this fuzzer</a> was written in about an hour and found 20-30 bugs<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">3</a></sup>, including incorrect code generation, and crashes on basic operations like multiplication and exponentiation. My guess is that it would take another 2-3 hours to shake out another 20-30 bugs (with support for more types), and maybe another day of work to get another 20-30 (with very basic support for random expressions). I don't mention this because it's good. It's not. It's totally heinous. But that's the point. You can throw together an absurd hack in an hour and it will turn out to be pretty useful.</p> <p>Compared to writing unit tests by hand: even if I knew what the bugs were in advance, I'd be hard pressed to code fast enough to generate 30 bugs in an hour. 30 bugs in a day? Sure, but not if I don't already know what the bugs are in advance. This isn't to say that unit testing isn't valuable, but if you're going to spend a few hours writing tests, a few hours writing a fuzzer is going to go a longer way than a few hours writing unit tests. You might be able to hit 100 words a minute by typing, but your CPU can easily execute 200 billion instructions a minute. It's no contest.</p> <p>What does it really take to write a fuzzer? Well, you need to generate random inputs for a program. In <a href="https://github.com/danluu/Fuzz.jl">this case</a>, we're generating random function calls in some namespace. Simple. The only reason it took an hour was because I don't really get Julia's reflection capabilities well enough to easily generate random types, which resulted in my writing the type generation stuff by hand.</p> <p>This applies to a lot of different types of programs. Have a GUI? It's pretty easy to prod random UI elements. Read files or things off the network? Generating (or mutating) random data is straightforward. This is something anyone can do.</p> <p>But this isn't a silver bullet. Lackadaisical testing means that <a href="//danluu.com/cpu-bugs/">your users will find bugs</a>. 
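<p>Going back to what it takes to write that kind of fuzzer: here's roughly the shape of it. This is a minimal sketch in Python rather than Julia (the fuzzer linked above is in Julia), and the argument generators, the iteration count, and the set of &quot;boring&quot; exceptions are placeholders you'd tune for whatever you're actually testing:</p> <pre><code>import random
import traceback

def random_arg():
    # Placeholder value generators; swap in whatever types make sense
    # for the code under test (big ints, floats, lists, odd unicode).
    generators = [
        lambda: random.randint(-2**63, 2**63),
        lambda: random.random(),
        lambda: [random.randint(0, 10) for _ in range(random.randint(0, 5))],
        lambda: chr(random.randint(0, 0x10FFFF)),
    ]
    return random.choice(generators)()

def fuzz(module, iterations=10000):
    # Call random public functions from the module with random arguments
    # and report anything that blows up with an unexpected exception type.
    funcs = [getattr(module, name) for name in dir(module)
             if callable(getattr(module, name)) and not name.startswith('_')]
    for _ in range(iterations):
        f = random.choice(funcs)
        args = [random_arg() for _ in range(random.randint(0, 3))]
        try:
            f(*args)
        except (TypeError, ValueError, OverflowError):
            pass  # boring: rejecting bad arguments is the expected behavior
        except Exception:
            print('possible bug:', getattr(f, '__name__', f), args)
            traceback.print_exc()

# e.g.: import some_module_under_test; fuzz(some_module_under_test)
</code></pre> <p>It's the same basic idea as the fuzzer linked above: pick a function, pick some arguments, call it, and see what breaks. Most of the refinement (more types, random expressions) is just making the generators smarter.</p>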
However, even given that developers aren't going to spend nearly enough time on testing, we can do a lot better than we're doing right now.</p> <p><small></p> <h4 id="resources">Resources</h4> <p>There are a lot of great resources out there, but if you're just getting started, I found <a href="http://blog.regehr.org/archives/1039">this description of types of fuzzers</a> to be one of the most helpful (and simplest) things I've read.</p> <p>John Regehr has <a href="https://www.udacity.com/course/cs258">a udacity course on software testing</a>. I haven't worked through it yet (Pablo Torres just pointed to it), but given the quality of Dr. Regehr's writing, I expect the course to be good.</p> <p>For more on my perspective on testing, <a href="//danluu.com/testing/">there's this</a>.</p> <h4 id="acknowledgments">Acknowledgments</h4> <p>Thanks to Leah Hanson and Mindy Preston for catching writing bugs, to Steve Klabnik for explaining the cause/fix of the Chrome bug (bad/corrupt web fonts), and to Phillip Joseph for finding a markdown bug.</p> <p>I'm experimenting with blogging more by spending less time per post and just spewing stuff out in a 30-90 minute sitting. Please <a href="https://twitter.com/danluu">let me know</a> if something is unclear or just plain wrong. Seriously.</p> <p></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:D"><p>If I were really trying to convince you of this, I'd devote a post to the business case, diving into the data and trying to figure out the cost of bugs. The short version of that unwritten post is that response times are well studied and it's known that 100ms of extra latency will cost you a noticeable amount of revenue. A 1s latency hit is a disaster. How do you think that compares to having your product not work at all?</p> <p>Compared to 100ms of latency, how bad is it when your page loads and then bugs out in a way that makes it totally unusable? What if it destroys user state and makes the user re-enter everything they wanted to buy into their cart? Removing one extra click is worth a huge amount of revenue, and now we're talking about adding 10 extra clicks or infinite latency to a random subset of users. And not a small subset, either. Want to stop lighting piles of money on fire? Write tests. If that's too much work, at least <a href="//danluu.com/bugalytics/">use the data you already have to find bugs</a>.</p> <p>Of course it's sometimes worth it to light piles of money on fire. Maybe your rocket ship is powered by flaming piles of money. If you're a very rapidly growing startup, a 20% increase in revenue might not be worth that much. It could be better to focus on adding features that drive growth. The point isn't that you should definitely write more tests; it's that you should definitely do the math to see if you should write more tests.</p> <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:T">Plus debugging time. <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:R">I really need to update the readme with more bugs. <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> </ol> </div> Speeding up this site by 50x octopress-speedup/ Mon, 17 Nov 2014 00:00:00 +0000 octopress-speedup/ <p>I've seen all these studies that show how a 100ms improvement in page load time has a significant effect on page views, conversion rate, etc., but I'd never actually tried to optimize my site. This blog is a static Octopress site, hosted on GitHub Pages.
Static sites are supposed to be fast, and GitHub Pages uses Fastly, which is supposed to be fast, so everything should be fast, right?</p> <p>Not having done this before, I didn't know what to do. But in a great talk on how the internet works, <a href="https://twitter.com/danielespeset">Dan Espeset</a> suggested trying <a href="http://www.webpagetest.org">webpagetest</a>; let's give it a shot.</p> <p></p> <p>Here's what it shows with my nearly stock Octopress setup<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">1</a></sup>. The only changes I'd made were enabling Google Analytics, the social media buttons at the bottom of posts, and adding CSS styling for tables (which are, by default, unstyled and unreadable).</p> <p><img src="images/octopress-speedup/octo_initial.png" alt="time to start rendering: 9.7s"> <img src="images/octopress-speedup/octo_initial_detail.png" alt="time to visual completion: 10.9s"></p> <p>12 seconds to the first page view! What happened? I thought static sites were supposed to be fast. The first byte gets there in less than half a second, but the page doesn't start rendering until 9 seconds later.</p> <p><img src="images/octopress-speedup/octo_initial_waterfall.png" alt="Lots of js, CSS, and fonts"></p> <p>Looks like the first thing that happens is that we load a bunch of <code>js</code> and <code>CSS</code>. Looking at the source, we have all this <code>js</code> in <code>source/_includes/head.html</code>.</p> <pre><code>&lt;script src=&quot;{{ root_url }}/javascripts/modernizr-2.0.js&quot;&gt;&lt;/script&gt; &lt;script src=&quot;//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js&quot;&gt;&lt;/script&gt; &lt;script&gt;!window.jQuery &amp;&amp; document.write(unescape('%3Cscript src=&quot;./javascripts/lib/jquery.min.js&quot;%3E%3C/script%3E'))&lt;/script&gt; &lt;script src=&quot;{{ root_url }}/javascripts/octopress.js&quot; type=&quot;text/javascript&quot;&gt;&lt;/script&gt; {% include google_analytics.html %} </code></pre> <p>I don't know anything about web page optimization, but Espeset mentioned that <code>js</code> will stall page loading and rendering. What if we move the scripts to <code>source/_includes/custom/after_footer.html</code>?</p> <p><img src="images/octopress-speedup/octo_js_after_footer.png" alt="time to start rendering: 4.9s"></p> <p>That's a lot better! We've just saved about 4 seconds on load time and on time to start rendering.</p> <p>Those script tags load modernizer, jquery, octopress.js, and some google analytics stuff. What is in this <code>octopress.js</code> anyway? It's mostly code to support stuff like embedding flash videos, delicious integration, and github repo integration. There are a few things that do get used for my site, but most of that code is dead weight.</p> <p>Also, why are there multiple <code>js</code> files? Espeset also mentioned that connections are finite resources, and that we'll run out of simultaneous open connections if we have a bunch of different files. Let's strip out all of that unused <code>js</code> and combine the remaining <code>js</code> into a single file.</p> <p><img src="images/octopress-speedup/octo_one_js_file.png" alt="time to start rendering: 1.4s"></p> <p>Much better! But wait a sec. What do I need <code>js</code> for? As far as I can tell, the only thing my site is still using octopress's <code>js</code> for is so that you can push the right sidebar back and forth by clicking on it, and jquery and modernizer are only necessary for the js used in octopress. 
I never use that, and according to in-page analytics no one else does either. Let's get rid of it.</p> <p><img src="images/octopress-speedup/octo_no_js.png" alt="time to start rendering: .7s"> <img src="images/octopress-speedup/octo_no_js_detail.png" alt="time to visual completion: 1.2s"></p> <p>That didn't change total load time much, but the browser started rendering sooner. We're down to having the site visually complete after 1.2s, compared to 9.6s initially -- an 8x improvement.</p> <p>What's left? There's still some js for the twitter and fb widgets at the bottom of each post, but those all get loaded after things are rendered, so they don't really affect the user's experience, even though they make the “Load Time” number look bad.</p> <p><img src="images/octopress-speedup/octo_no_js_pie.png" alt="That's a lot of fonts!"></p> <p>This is a pie chart of how many bytes of my page are devoted to each type of file. Apparently, the plurality of the payload is spent on fonts. Despite my reference post being an unusually image heavy blog post, fonts are 43.8% and images are such a small percentage that webpagetest doesn't even list the number. Doesn't my browser already have some default fonts? Can we just use those?</p> <p><img src="images/octopress-speedup/octo_no_fonts.png" alt="time to start rendering: .6s"> <img src="images/octopress-speedup/octo_no_fonts_detail.png" alt="time to visual completion: .9s"></p> <p>Turns out, we can. The webpage is now visually complete in 0.9s -- a 12x improvement. The improvement isn't quite as dramatic for “Repeat View”<sup class="footnote-ref" id="fnref:R"><a rel="footnote" href="#fn:R">2</a></sup> -- it's only an 8.6x improvement there -- but that's still pretty good.</p> <p>The one remaining “obvious” issue is that the header loads two css files, one of which isn't minified. This uses up two connections and sends more data than necessary. Minifying the other css file and combining them speeds this up even further.</p> <p><img src="images/octopress-speedup/octo_one_css.png" alt="time to start rendering: .6s"> <img src="images/octopress-speedup/octo_one_css_detail.png" alt="time to visual completion: .7s"></p> <p>Time to visually complete is now 0.7s -- a 15.6x improvement<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">3</a></sup>. And that's on a page that's unusually image heavy for my site.</p> <p><img src="images/octopress-speedup/octo_one_css_waterfall.png" alt="Mostly image load time"></p> <p>At this point the only things that happen before the page starts displaying are loading the HTML, loading the one <code>css</code> file, and loading the giant image (reliability.png).</p> <p>We've already minified the css, so the main thing left to do is to make the giant image smaller. I already ran <code>optipng -o7 -zm1-9</code> on all my images, but ImageOptim was able to shave off another 4% of the image, giving a slight improvement.
Across all the images in all my posts, ImageOptim was able to reduce images by an additional 20% over optipng, but it didn't help much in this case.</p> <p>I also tried specifying the size of the image to see if that would let the page render before the image was finished downloading, but it didn't result in much of a difference.</p> <p>After that, I couldn't think of anything else to try, but webpagetest had some helpful suggestions.</p> <p><img src="images/octopress-speedup/octo_suggestions.png" alt="Blargh github pages"></p> <p>Apparently, the server I'm on is slow (it gets a D in sending the first byte after the initial request). It also recommends caching static content, but when I look at the individual suggestions, they're mostly for widgets I don't host/control. I should use a CDN, but Github Pages doesn't put content on a CDN for bare domains unless you use a DNS alias record, and my DNS provider doesn't support alias records. That's two reasons to stop serving from Github Pages (or perhaps one reason to move off Github Pages and one reason to get another DNS provider), so I switched to Cloudflare, which shaved over 100ms off the time to first byte.</p> <p>Note that if you use Cloudflare for a static site, you'll want to create a &quot;Page Rule&quot; and enable &quot;Cache Everything&quot;. By default, Cloudflare doesn't cache HTML, which is sort of pointless on a static blog that's mostly HTML. If you've done the optimizations here, you'll also want to avoid their &quot;Rocket Loader&quot; thing which attempts to load js asynchronously by loading blocking javascript. &quot;Rocket Loader&quot; is like AMP, in that it can speed up large, bloated websites, but is big enough that it slows down moderately optimized websites.</p> <p>Here's what happened after I initially enabled Cloudflare without realizing that I needed to create a &quot;Page Rule&quot;.</p> <p><img src="images/octopress-speedup/octo_cloudflare_wat.png" alt="Cloudflare saves 80MB out of 1GB"></p> <p>That's about a day's worth of traffic in 2013. Initially, Cloudflare was serving my CSS and redirecting to Github Pages for the HTML. Then I inlined my CSS and Cloudflare literally did nothing. Overall, Cloudflare served 80MB out of 1GB of traffic because it was only caching images and this blog is relatively light on images.</p> <p>I haven't talked about inlining CSS, but it's easy and gives a huge speedup on the first visit since it means only one connection is required to display the page, instead of two sequential connections. It's a disadvantage on future visits since it means that the CSS has to be re-downloaded for each page, but since most of my traffic is from people running across a single blog post, who don't click through to anything else, it's a net win. In <code>_includes/head.html</code></p> <pre><code>&lt;link href=&quot;{{ root_url }}/stylesheets/all.css&quot; media=&quot;screen, projection&quot; rel=&quot;stylesheet&quot; type=&quot;text/css&quot;&gt; </code></pre> <p>should change to</p> <pre><code>{% include all.css %} </code></pre> <p>In addition, there's a lot of pointless cruft in the css. Removing the stuff that even someone who doesn't know CSS can spot as pointless (like support for delicious, support for Firefox 3.5 and below, lines that firefox flags as having syntax errors such as <code>no-wrap</code> instead of <code>nowrap</code>) cuts down the remaining CSS by about half.
There's a lot of duplication remaining and I expect that the CSS could be reduced by another factor of 4, but that would require actually knowing CSS. Just doing those things, we get down to .4s before the webpage is visually complete.</p> <p><img src="images/octopress-speedup/octo_inline_css.png" alt="Inlining css"></p> <p>That's a <code>10.9/.4 = 27.5</code> fold speedup. The effect on mobile is a lot more dramatic; there, it's closer to 50x.</p> <p>I'm not sure what to think about all this. On the one hand, I'm happy that I was able to get a 25x-50x speedup on my site. On the other hand, I associate speedups of that magnitude with porting plain Ruby code to optimized C++, optimized C++ to a GPU, or GPU to quick-and-dirty exploratory ASIC. How is it possible that someone with zero knowledge of web development can get that kind of speedup by watching one presentation and then futzing around for 25 minutes? I was hoping to maybe find 100ms of slack, but it turns out there's not just 100ms, or even 1000ms, but 10000ms of slack in an Octopress setup. According to a study I've seen, going from 1000ms to 3000ms costs you 20% of your readers and 50% of your click-throughs. I haven't seen a study that looks at going from 400ms to 10900ms because the idea that a website would be that slow is so absurd that people don't even look into the possibility. But many websites are that slow!<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">4</a></sup></p> <h3 id="update">Update</h3> <p>I found it too hard to futz around with trimming down the massive CSS file that comes with Octopress, so I removed all of the CSS and then added a few lines to allow for a nav bar. This makes almost no difference on the desktop benchmark above, but it's a noticeable improvement for slow connections. <a href="//danluu.com/web-bloat/">The difference is quite dramatic for 56k connections as well as connections with high packet loss</a>.</p> <p>Starting the day I made this change, my analytics data shows a noticeable improvement in engagement and traffic. There are too many things confounded here to say what caused this change (performance increase, total lack of styling, etc.), but there are a couple of things I find interesting about this. First, it seems likely to show that the advice that it's very important to keep line lengths short is incorrect since, if that had a very large impact, it would've overwhelmed the other changes and resulted in reduced engagement and not increased engagement. Second, despite the Octopress design being widely used and lauded (it appears to have been the most widely used blog theme for programmers when I started my blog), it appears to cause a blog (or at least this blog) to get less readership than literally having no styling at all. Having no styling is surely not optimal, but there's something a bit funny about no styling beating the at-the-time most widely used programmer blog styling, which means it likely also beat wordpress, <a href="writing-non-advice/#svbtle">svbtle</a>, blogspot, medium, etc., since those have most of the same ingredients as Octopress.</p> <p><small></p> <h4 id="resources">Resources</h4> <p>Unfortunately, the video of the presentation I'm referring to is restricted to <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a> alums. If you're an RC alum, <a href="https://community.hackerschool.com/t/monday-night-talks-84-daniel-espeset-on-the-anatomy-of-a-web-request/25">check this out</a>.
Otherwise, <a href="https://hpbn.co/">high-performance browser networking</a> is great, but much longer.</p> <h4 id="acknowledgements">Acknowledgements</h4> <p>Thanks to Leah Hanson, Daniel Espeset, and Hugo Jobling for comments/corrections/discussion.</p> <p>I'm not a front-end person, so I might be totally off in how I'm looking at these benchmarks. If so, please <a href="https://twitter.com/danluu">let me know</a>.</p> <p></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:S">From whatever version was current in September 2013. It's possible some of these issues have been fixed, but based on the extremely painful experience of other people who've tried to update their Octopress installs, it didn't seem worth making the attempt to get a newer version of Octopress. <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:R">Why is “Repeat View” slower than “First View”? <a class="footnote-return" href="#fnref:R"><sup>[return]</sup></a></li> <li id="fn:1">If you look at a video of loading <a href="http://www.webpagetest.org/video/view.php?id=141115_AP_12A9.1.0">the original</a> vs. <a href="http://www.webpagetest.org/video/view.php?id=141116_RH_7WC.1.0">this version</a>, the difference is pretty dramatic. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:D">For example, <a href="//danluu.com/web-bloat/#appendix-experimental-caveats">slashdot</a> takes 15s to load over FIOS. The tests shown above were done on Cable, which is <a href="//danluu.com/web-bloat/">substantially slower</a>. <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> </ol> </div> How often is the build broken? broken-builds/ Mon, 10 Nov 2014 00:00:00 +0000 broken-builds/ <p>I've noticed that builds are broken and tests fail a lot more often on open source projects than on “work” projects. I wasn't sure how much of that was my perception vs. reality, so I grabbed the Travis CI data for a few popular categories on GitHub<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">1</a></sup>.</p> <p></p> <p><img src="images/broken-builds/reliability.png" alt="Graph of build reliability. Props to nu, oryx, caffe, catalyst, and Scala." width="640" height="1280"></p> <p>For reference, at every place I've worked, two 9s of reliability (99% uptime) on the build would be considered bad. That would mean that the build is failing for over three and a half days a year, or seven hours per month. Even three 9s (99.9% uptime) is about forty-five minutes of downtime a month. That's kinda ok if there isn't a hard system in place to prevent people from checking in bad code, but it's quite bad for a place that's serious about having working builds.</p> <p>By contrast, 2 9s of reliability is way above average for the projects I pulled data for<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">2</a></sup> -- only 8 of 40 projects are that reliable. Almost twice as many projects -- 15 of 40 -- don't even achieve one 9 of uptime. And my sample is heavily biased towards reliable projects. These are projects that were well-known enough to be “featured” in a hand curated list by GitHub. That already biases the data right there.
And then I only grabbed data from the projects that care enough about testing to set up TravisCI<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">3</a></sup>, which introduces an even stronger bias.</p> <p>To make sure I wasn't grabbing bad samples, I removed any initial set of failing tests (there are often a lot of fails as people try to set up Travis and have it misconfigured) and projects that use another system for tracking builds and only have Travis as an afterthought (like Rust)<sup class="footnote-ref" id="fnref:T"><a rel="footnote" href="#fn:T">4</a></sup>.</p> <p>Why doesn't the build fail all the time at work? Engineers don't like waiting for someone else to unbreak the build and managers can do the back-of-the-envelope calculation which says that N idle engineers * X hours of build breakage = $Y of wasted money.</p> <p>But that same logic applies to open source projects! Instead of wasting dollars, contributors' time is wasted.</p> <p>Web programmers are hyper-aware of how 100ms of extra latency on a web page load has a noticeable effect on conversion rate. Well, what's the effect on conversion rate when a potential contributor to your project spends 20 minutes installing dependencies and an hour building your project only to find the build is broken?</p> <p>I used to dig through these kinds of failures to find the bug, usually assuming that it must be some configuration issue specific to my machine. But having spent years debugging failures I run into with <code>make check</code> on a clean build, I've found that it's often just that someone checked in bad code. Nowadays, if I'm thinking about contributing to a project or trying to fix a bug and the build doesn't work, I move on to another project.</p> <p>The worst thing about regular build failures is that they're easy<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">5</a></sup> to prevent. Graydon Hoare literally calls keeping a clean build the “<a href="http://graydon2.dreamwidth.org/1597.html">not rocket science rule</a>”, and wrote an open source tool (bors) anyone can use to do not-rocket-science. And yet, most open source projects still suffer through broken and failed builds, along with the associated cost of lost developer time and lost developer “conversions”.</p> <p><small>Please don't read too much into the individual data in the graph. I find it interesting that DevOps projects tend to be more reliable than languages, which tend to be more reliable than web frameworks, and that ML projects are all over the place (but are mostly reliable). But when it comes to individual projects, all sorts of stuff can cause a project to have bad numbers.</p> <p>Thanks to Kevin Lynagh, Leah Hanson, Michael Smith, Katerina Barone-Adesi, and Alexey Romanov for comments.</p> <p>Also, props to Michael Smith of Puppetlabs for a friendly ping and working through the build data for puppet to make sure there wasn't a bug in my scripts. This is one of my most maligned blog posts because no one wants to believe the build for their project is broken more often than the build for other projects. But even though it only takes about a minute to pull down the data for a project and sanity check it using the links in this post, only one person actually looked through the data with me, while a bunch of people told me how it must quite obviously be incorrect without ever checking the data.</p> <p>This isn't to say that I don't have any bugs.
This is a quick hack that probably has bugs and I'm always happy to get bug reports! But some non-bugs that have been repeatedly reported are getting data from all branches instead of the main branch, getting data for all PRs and not just code that's actually checked in to the main branch, and using number of failed builds instead of the amount of time that the build is down. I'm pretty sure that you can check that any of those claims are false in about the same amount of time that it takes to make the claim, but that doesn't stop people from making the claim. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:C">Categories determined from GitHub's featured projects lists, which seem to be hand-curated. <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:B">Wouldn't it be nice if I had test coverage data, too? But I didn't try to grab it since this was a quick 30-minute project and coming up with cross-language test coverage comparisons isn't trivial. However, I spot-checked some projects and the ones that do poorly conform to an engineering version of what Tyler Cowen calls &quot;The Law of Below Averages&quot; -- projects that often have broken/failed builds also tend to have very spotty test coverage. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:1"><p>I used <a href="https://github.com/travis-ci/travis.rb">the official Travis API script</a>, modified to return build start time instead of build finish time. Even so, build start time isn't exactly the same as check-in time, which introduces some noise. Only data against the main branch (usually master) was used. Some data was incomplete because their script either got a 500 error from the Travis API server, or ran into a runtime syntax error. All errors happened with and without my modifications, which is pretty appropriate for this blog post.</p> <p>If you want to reproduce the results, apply <a href="https://github.com/danluu/dump/blob/master/travis-api/diff-from-original.diff">this patch</a> to the official script, run it with the appropriate options (usually with --branch master, but not always), and then aggregate the results. You can use <a href="https://github.com/danluu/dump/blob/master/travis-api/parse.jl">this script</a>, but if you don't have <a href="//danluu.com/julialang/">Julia</a> it may be easier to just do it yourself.</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:T">I think I filtered out all the projects that were actually using a different testing service. Please let me know if there are any still in my list. This removed one project with one-tenth of a 9 and two projects with about half a 9. BTW, removing the initial Travis fails for these projects bumped some of them up between half a 9 and a full 9 and completely eliminated a project that's had failing Travis tests for over a year. The graph shown looks much better than the raw data, and it's still not good. <a class="footnote-return" href="#fnref:T"><sup>[return]</sup></a></li> <li id="fn:E"><p>Easy technically. Hard culturally. Michael Smith brought up the issue of intermittent failures. When you get those, whether that's because the project itself is broken or because the CI build is broken, people will start checking in bad code.
There are environments where people don't do that -- for the better part of a decade, I worked at a company where people would track down basically any test failure ever, even (or especially) if the failure was something that disappeared with no explanation. How do you convince people to care that much? That's hard.</p> <p>How do you convince people to use a system like bors, where you don't have to care to avoid breaking the build? That's much easier, though still harder than the technical problems involved in building bors.</p> <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> </ol> </div> Literature review on the benefits of static types empirical-pl/ Fri, 07 Nov 2014 00:00:00 +0000 empirical-pl/ <p>There are some pretty strong statements about types floating around out there. The claims range from the oft-repeated phrase that when you get the types to line up, everything just works, to “not relying on type safety is unethical (if you have an SLA)”<sup class="footnote-ref" id="fnref:P"><a rel="footnote" href="#fn:P">1</a></sup>, &quot;<a href="https://news.ycombinator.com/item?id=4369359">It boils down to cost vs benefit, actual studies, and mathematical axioms, not aesthetics or feelings</a>&quot;, and <a href="https://twitter.com/posco/status/566130704157667328">I think programmers who doubt that type systems help are basically the tech equivalent of an anti-vaxxer</a>. The first and last of these statements are from &quot;types&quot; thought leaders who are widely quoted. There are probably plenty of strong claims about dynamic languages that I'd be skeptical of if I heard them, but I'm not in the right communities to hear the stronger claims about dynamically typed languages. Either way, it's rare to see people cite actual evidence.</p> <p>Let's take a look at the empirical evidence that backs up these claims.</p> <p></p> <p><a href="#wat_summary">Click here</a> if you just want to see the summary without having to wade through all the studies. The summary of the summary is that most studies find very small effects, if any. However, the studies probably don't cover contexts you're actually interested in. If you want the gory details, here's each study, with its abstract, and a short blurb about the study.</p> <h3 id="a-large-scale-study-of-programming-languages-and-code-quality-in-github-ray-b-posnett-d-filkov-v-devanbu-p-http-dl-acm-org-citation-cfm-id-2635922"><a href="http://dl.acm.org/citation.cfm?id=2635922">A Large Scale Study of Programming Languages and Code Quality in Github; Ray, B; Posnett, D; Filkov, V; Devanbu, P</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>What is the effect of programming languages on software quality? This question has been a topic of much debate for a very long time. In this study, we gather a very large data set from GitHub (729 projects, 80 Million SLOC, 29,000 authors, 1.5 million commits, in 17 languages) in an attempt to shed some empirical light on this question. This reasonably large sample size allows us to use a mixed-methods approach, combining multiple regression modeling with visualization and text analytics, to study the effect of language features such as static v.s. dynamic typing, strong v.s. weak typing on software quality. By triangulating findings from different methods, and controlling for confounding effects such as team size, project size, and project history, we report that language design does have a significant, but modest effect on software quality. 
Most notably, it does appear that strong typing is modestly better than weak typing, and among functional languages, static typing is also somewhat better than dynamic typing. We also find that functional languages are somewhat better than procedural languages. It is worth noting that these modest effects arising from language design are overwhelmingly dominated by the process factors such as project size, team size, and commit size. However, we hasten to caution the reader that even these modest effects might quite possibly be due to other, intangible process factors, e.g., the preference of certain personality types for functional, static and strongly typed languages.</p> </blockquote> <p><strong>Summary</strong></p> <p>The authors looked at the 50 most starred repos on github for each of the 20 most popular languages plus TypeScript (minus CSS, shell, and vim). For each of these projects, they looked at the languages used. The text in the body of the study doesn't support the strong claims made in the abstract. Additionally, the study appears to use a fundamentally flawed methodology that's not capable of revealing much information. Even if the methodology were sound, the study uses bogus data and has what Pinker calls the <a href="http://languagelog.ldc.upenn.edu/nll/?p=1897">igon value problem</a>.</p> <p>As Gary Bernhardt points out, the authors of the study <a href="https://twitter.com/garybernhardt/status/862907601557651457">seem to confuse memory safety and implicit coercion</a> and <a href="https://twitter.com/garybernhardt/status/862907391536279554">make other strange statements</a>, such as</p> <blockquote> <p>Advocates of dynamic typing may argue that rather than spend a lot of time correcting annoying static type errors arising from sound, conservative static type checking algorithms in compilers, it’s better to rely on strong dynamic typing to catch errors as and when they arise.</p> </blockquote> <p>The study uses the following language classification scheme:</p> <p><img src="images/empirical-pl/lang_classes.png" alt="Table of classifications"></p> <p>These classifications seem arbitrary and many people would disagree with some of these classifications. Since the results are based on aggregating results with respect to these categories, and the authors have chosen arbitrary classifications, this already makes the aggregated results suspect since they have a number of degrees of freedom here and they've made some odd choices.</p> <p>In order to get the language-level results, the authors looked at commit/PR logs to determine how many bugs there were for each language used. As far as I can tell, open issues with no associated fix don't count towards the bug count. Only commits that are detected by their keyword search technique were counted. With this methodology, the number of bugs counted will depend at least as strongly on the project's bug reporting culture as it does on the actual number of bugs.</p> <p>After determining the number of bugs, the authors ran a regression, controlling for project age, number of developers, number of commits, and lines of code.</p> <p><img src="images/empirical-pl/lang_defects.png" alt="Defect rate correlations"></p> <p>There are enough odd correlations here that, even if the methodology wasn't known to be flawed, I'd be skeptical that the authors have captured a causal relationship.
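<p>As a rough illustration of how sensitive the keyword-based bug counting described above is to commit-message culture, here's a minimal sketch of that general approach (my own toy example; the keywords and code are not the paper's actual tooling):</p> <pre><code>import re

# Label a commit as a "bug fix" if its message matches any keyword.
BUG_KEYWORDS = re.compile(r"\b(fix(es|ed)?|bug|defect|error|crash|fail(s|ed|ure)?)\b",
                          re.IGNORECASE)

def looks_like_bug_fix(commit_message):
    return bool(BUG_KEYWORDS.search(commit_message))

commits = [
    "Fix crash when config file is missing",   # counted
    "Refactor parser; no behavior change",     # not counted
    "Handle empty input (previously raised)",  # a real fix that is not counted
]
print(sum(looks_like_bug_fix(m) for m in commits), "of", len(commits), "commits flagged")
</code></pre> <p>A real fix whose message never uses one of the keywords is invisible to this kind of counting, while a project whose convention is to write &quot;fix&quot; in routine commit messages gets over-counted, which is why the resulting numbers say as much about reporting culture as about defects.</p>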
If you don't find it odd that Perl and Ruby are as reliable as each other and significantly more reliable than Erlang and Java (which are also equally reliable), which are significantly more reliable than Python, PHP, and C (which are similarly reliable), and that TypeScript is the safest language surveyed, then maybe this passes the sniff test for you, but even without reading further, this looks suspicious.</p> <p>For example, Erlang and Go are rated as having a lot of concurrency bugs, whereas Perl and CoffeeScript are rated as having few concurrency bugs. Is it more plausible that Perl and CoffeeScript are better at concurrency than Erlang and Go or that people tend to use Erlang and Go more when they need concurrency? The authors note that Go might have a lot of concurrency bugs because there's a good tool to detect concurrency bugs in Go, but they don't explore reasons for most of the odd intermediate results.</p> <p>As for TypeScript, Eirenarch has pointed out that the three projects they list as example TypeScript projects, which they call the &quot;top three&quot; TypeScript projects, are bitcoin, litecoin, and qBittorrent. <a href="http://bitcoin.stackexchange.com/questions/22311/why-does-github-say-that-the-bitcoin-project-is-74-typescript">These are C++ projects</a>. So the intermediate result appears to not be that TypeScript is reliable, but that projects mis-identified as TypeScript are reliable. Those projects are reliable because Qt translation files are identified as TypeScript and it turns out that, per line of code, giant dumps of config files from another project don't cause a lot of bugs. It's like saying that a project has few bugs per line of code because it has a giant README. This is the most blatant classification error, but it's far from the only one.</p> <p>For example, of what they call the &quot;top three&quot; perl projects, one is showdown, a javascript project, and one is rails-dev-box, a shell script and a vagrant file used to launch a Rails dev environment. Without knowing anything about the latter project, one might guess that it's not a perl project from its name, rails-dev-box, which correctly indicates that it's a rails-related project.</p> <p>Since this study uses Github's notoriously inaccurate code classification system to classify repos, it is, at best, a series of correlations with factors that are themselves only loosely correlated with actual language usage.</p> <p>There's more analysis, but much of it is based on aggregating the table above into categories based on language type. Since I'm skeptical of these results, I'm at least as skeptical of any results based on aggregating these results. This section barely even scratches the surface of this study. Even with just a light skim, we see multiple serious flaws, any one of which would invalidate the results, plus numerous <a href="http://languagelog.ldc.upenn.edu/nll/?p=1897">igon value problems</a>.
It appears that the authors didn't even look at the tables they put in the paper, since if they did, it would jump out that (just for example) they classified a project called &quot;rails-dev-box&quot; as one of the three biggest perl projects (it's a 70-line shell script used to spin up ruby/rails dev environments).</p> <h3 id="do-static-type-systems-improve-the-maintainability-of-software-systems-an-empirical-study-kleinschmager-s-hanenberg-s-robbes-r-tanter-e-stefik-a-http-pleiad-dcc-uchile-cl-papers-2012-kleinschmageral-icpc2012-pdf"><a href="http://pleiad.dcc.uchile.cl/papers/2012/kleinschmagerAl-icpc2012.pdf">Do Static Type Systems Improve the Maintainability of Software Systems? An Empirical Study Kleinschmager, S.; Hanenberg, S.; Robbes, R.; Tanter, E.; Stefik, A.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Static type systems play an essential role in contemporary programming languages. Despite their importance, whether static type systems influence human software development capabilities remains an open question. One frequently mentioned argument for static type systems is that they improve the maintainability of software systems - an often used claim for which there is no empirical evidence. This paper describes an experiment which tests whether static type systems improve the maintainability of software systems. The results show rigorous empirical evidence that static type are indeed beneficial to these activities, except for fixing semantic errors.</p> </blockquote> <p><strong>Summary</strong></p> <p>While the abstract talks about general classes of languages, the study uses Java and Groovy.</p> <p>Subjects were given classes in which they had to either fix errors in existing code or fill out stub methods. Static classes for Java, dynamic classes for Groovy. In cases of type errors (and the corresponding no-method runtime errors in Groovy), developers solved the problem faster in Java. For semantic errors, there was no difference.</p> <p>The study used a <a href="https://en.wikipedia.org/wiki/Repeated_measures_design">within-subject design</a>, with randomized task order over 33 subjects.</p> <p>A notable limitation is that the study avoided using “complicated control structures”, such as loops and recursion, because those increase variance in time-to-solve. As a result, all of the bugs are trivial bugs. This can be seen in the median time to solve the tasks, which is in the hundreds of seconds. Tasks can include multiple bugs, so the time per bug is quite low.</p> <p><img src="images/empirical-pl/groovy_v_java.png" alt="Groovy is both better and worse than Java"></p> <p>This paper mentions that its results contradict some prior results, and one of the possible causes they give is that their tasks are more complex than the tasks from those other papers. The fact that the tasks in this paper don't involve loops or recursion because they're too complicated should give you an idea of the complexity of the tasks involved in most of these papers.</p> <p>Other limitations in this experiment were that the variables were artificially named such that there was no type information encoded in any of the names, that there were no comments, and that there was zero documentation on the APIs provided.
That's an unusually hostile environment to find bugs in, and it's not clear how the results generalize if any form of documentation is provided.</p> <p>Additionally, even though the authors specifically picked trivial tasks in order to minimize the variance between programmers, the variance between programmers was still much greater than the variance between languages in all but two tasks. Those two tasks were both cases of a simple type error causing a run-time exception that wasn't near the type error.</p> <h3 id="a-controlled-experiment-to-assess-the-benefits-of-procedure-argument-type-checking-prechelt-l-tichy-w-f-http-ieeexplore-ieee-org-xpl-articledetails-jsp-arnumber-677186"><a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=677186">A controlled experiment to assess the benefits of procedure argument type checking, Prechelt, L.; Tichy, W.F.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Type checking is considered an important mechanism for detecting programming errors, especially interface errors. This report describes an experiment to assess the defect-detection capabilities of static, intermodule type checking.</p> <p>The experiment uses ANSI C and Kernighan &amp; Ritchie (K&amp;R) C. The relevant difference is that the ANSI C compiler checks module interfaces (i.e., the parameter lists calls to external functions), whereas K&amp;R C does not. The experiment employs a counterbalanced design in which each of the 40 subjects, most of them CS PhD students, writes two nontrivial programs that interface with a complex library (Motif). Each subject writes one program in ANSI C and one in K&amp;R C. The input to each compiler run is saved and manually analyzed for defects.</p> <p>Results indicate that delivered ANSI C programs contain significantly fewer interface defects than delivered K&amp;R C programs. Furthermore, after subjects have gained some familiarity with the interface they are using, ANSI C programmers remove defects faster and are more productive (measured in both delivery time and functionality implemented)</p> </blockquote> <p><strong>Summary</strong></p> <p>The “nontrivial” tasks are the inversion of a 2x2 matrix (with GUI) and a file “browser” menu that has two options, select file and display file. Docs for motif were provided, but example code was deliberately left out.</p> <p>There are 34 subjects. Each subject solves one problem with the K&amp;R C compiler (which doesn't typecheck arguments) and one with the ANSI C compiler (which does).</p> <p>The authors note that the distribution of results is non-normal, with highly skewed outliers, but they present their results as box plots, which makes it impossible to see the distribution.
They do some statistical significance tests on various measures, and find no difference in time to completion on the first task, a significant difference on the second task, but no difference when the tasks are pooled.</p> <p><img src="images/empirical-pl/mysterious_boxplot.png" alt="ANSI C is better, except when it's worse"></p> <p>In terms of how the bugs are introduced during the programming process, they do a significance test against the median of one measure of defects (which finds a significant difference in the first task but not the second), and a significance test against the 75%-quantile of another measure (which finds a significant difference in the second task but not the first).</p> <p>In terms of how many and what sort of bugs are in the final program, they define a variety of measures and find that some differences on the measures are statistically significant and some aren't. In the table below, bolded values indicate statistically significant differences.</p> <p><img src="images/empirical-pl/ansi_kr_table.png" alt="Breakdown by various metrics"></p> <p>Note that here, first task refers to whichever task the subject happened to perform first, which is randomized, which makes the results seem rather arbitrary. Furthermore, the numbers they compare are medians (except where indicated otherwise), which also seems arbitrary.</p> <p>Despite the strong statement in the abstract, I'm not convinced this study presents strong evidence for anything in particular. They have <a href="https://en.wikipedia.org/wiki/Multiple_comparisons_problem">multiple comparisons</a>, many of which seem arbitrary, and find that some of them are significant. They also find that many of their criteria don't have significant differences. Furthermore, they don't mention whether or not they tested any other arbitrary criteria. If they did, the results are much weaker than they look, and they already don't look strong.</p> <p>My interpretation of this is that, if there is an effect, the effect is dwarfed by the difference between programmers, and it's not clear whether there's any real effect at all.</p> <h3 id="an-empirical-comparison-of-c-c-java-perl-python-rexx-and-tcl-prechelt-l-http-page-mi-fu-berlin-de-prechelt-biblio-jccpprt-computer2000-pdf"><a href="http://page.mi.fu-berlin.de/prechelt/Biblio/jccpprt_computer2000.pdf">An empirical comparison of C, C++, Java, Perl, Python, Rexx, and Tcl, Prechelt, L.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>80 implementations of the same set of requirements are compared for several properties, such as run time, memory consumption, source text length, comment density, program structure, reliability, and the amount of effort required for writing them. The results indicate that, for the given programming problem, which regards string manipulation and search in a dictionary, “scripting languages” (Perl, Python, Rexx, Tcl) are more productive than “conventional languages” (C, C++, Java). In terms of run time and memory consumption, they often turn out better than Java and not much worse than C or C++. In general, the differences between languages tend to be smaller than the typical differences due to different programmers within the same language.</p> </blockquote> <p><strong>Summary</strong></p> <p>The task was to read in a list of phone numbers and return a list of words that those phone numbers could be converted to, using the letters on a phone keypad.</p> <p>This study was done in two phases. 
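<p>To make the task concrete, here's a minimal sketch of the core phone-keypad mapping (my own illustration; the actual study specified its own detailed encoding rules, allowed long numbers to map to sequences of words, and supplied a large dictionary):</p> <pre><code># Standard phone keypad letter groups; illustrative only.
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_DIGIT = {letter: digit for digit, letters in KEYPAD.items() for letter in letters}

def word_to_digits(word):
    return "".join(LETTER_TO_DIGIT[c] for c in word.lower() if c.isalpha())

def words_for_number(number, dictionary):
    digits = "".join(c for c in number if c.isdigit())
    return [word for word in dictionary if word_to_digits(word) == digits]

print(words_for_number("228-", ["cat", "bat", "dog"]))  # ['cat', 'bat']
</code></pre>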
There was a controlled study for the C/C++/Java group, and a self-timed implementation for the Perl/Python/Rexx/Tcl group. The former group consisted of students while the latter group consisted of respondents from a newsgroup. The former group received more criteria they should consider during implementation, and had to implement the program when they received the problem description, whereas some people in the latter group read the problem description days or weeks before implementation.</p> <p><img src="images/empirical-pl/yellow_runtime.png" alt="C and C++ are fast"> <img src="images/empirical-pl/yellow_devtime.png" alt="C, C++, and Java are slow to write"></p> <p>If you take the results at face value, it looks like the class of language used imposes a lower bound on both implementation time and execution time, but that the variance between programmers is much larger than the variance between languages.</p> <p>However, since the scripting language group had a significantly different (and easier) environment than the C-like language group, it's hard to say how much of the measured difference in implementation time is from flaws in the experimental design and how much is real.</p> <h3 id="static-type-systems-sometimes-have-a-positive-impact-on-the-usability-of-undocumented-software-mayer-c-hanenberg-s-robbes-r-tanter-e-stefik-a-http-swp-dcc-uchile-cl-tr-2012-tr-dcc-20120418-005-pdf"><a href="http://swp.dcc.uchile.cl/TR/2012/TR_DCC-20120418-005.pdf">Static type systems (sometimes) have a positive impact on the usability of undocumented software; Mayer, C.; Hanenberg, S.; Robbes, R.; Tanter, E.; Stefik, A.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Static and dynamic type systems (as well as more recently gradual type systems) are an important research topic in programming language design. Although the study of such systems plays a major role in research, relatively little is known about the impact of type systems on software development. Perhaps one of the more common arguments for static type systems is that they require developers to annotate their code with type names, which is thus claimed to improve the documentation of software. In contrast, one common argument against static type systems is that they decrease flexibility, which may make them harder to use. While positions such as these, both for and against static type systems, have been documented in the literature, there is little rigorous empirical evidence for or against either position. In this paper, we introduce a controlled experiment where 27 subjects performed programming tasks on an undocumented API with a static type system (which required type annotations) as well as a dynamic type system (which does not). Our results show that for some types of tasks, programmers were afforded faster task completion times using a static type system, while for others, the opposite held. In this work, we document the empirical evidence that led us to this conclusion and conduct an exploratory study to try and theorize why.</p> </blockquote> <p><strong>Summary</strong></p> <p>The experimental setup is very similar to the previous Hanenberg paper, so I'll just describe the main difference, which is that subjects used either Java or a restricted subset of Groovy that was equivalent to dynamically typed Java.
Subjects were students who had previous experience in Java, but not Groovy, giving them some advantage on the Java tasks.</p> <p><img src="images/empirical-pl/groovy_v_java_2.png" alt="Groovy is both better and worse than Java"></p> <p>Task 1 was a trivial warm-up task. The authors note that it's possible that Java is superior on task 1 because the subjects had prior experience in Java. The authors speculate that, in general, Java is superior to the untyped Java subset for more complex tasks, but they make it clear that they're just speculating and don't have enough data to conclusively support that claim.</p> <h3 id="how-do-api-documentation-and-static-typing-affect-api-usability-endrikat-s-hanenberg-s-robbes-romain-stefik-a-http-users-dcc-uchile-cl-rrobbes-p-icse2014-docstypes-pdf"><a href="http://users.dcc.uchile.cl/~rrobbes/p/ICSE2014-docstypes.pdf">How Do API Documentation and Static Typing Affect API Usability? Endrikat, S.; Hanenberg, S.; Robbes, Romain; Stefik, A.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>When developers use Application Programming Interfaces (APIs), they often rely on documentation to assist their tasks. In previous studies, we reported evidence indicating that static type systems acted as a form of implicit documentation, benefiting developer productivity. Such implicit documentation is easier to maintain, given it is enforced by the compiler, but previous experiments tested users without any explicit documentation. In this paper, we report on a controlled experiment and an exploratory study comparing the impact of using documentation and a static or dynamic type system on a development task. Results of our study both confirm previous findings and show that the benefits of static typing are strengthened with explicit documentation, but that this was not as strongly felt with dynamically typed languages.</p> </blockquote> <p>There's an earlier study in this series with the following abstract:</p> <blockquote> <p>In the discussion about the usefulness of static or dynamic type systems there is often the statement that static type systems improve the documentation of software. In the meantime there exists even some empirical evidence for this statement. One of the possible explanations for this positive influence is that the static type system of programming languages such as Java require developers to write down the type names, i.e. lexical representations which potentially help developers. Because of that there is a plausible hypothesis that the main benefit comes from the type names and not from the static type checks that are based on these names. In order to argue for or against static type systems it is desirable to check this plausible hypothesis in an experimental way. This paper describes an experiment with 20 participants that has been performed in order to check whether developers using an unknown API already benefit (in terms of development time) from the pure syntactical representation of type names without static type checking. The result of the study is that developers do benefit from the type names in an API's source code. But already a single wrong type name has a measurable significant negative impact on the development time in comparison to APIs without type names.</p> </blockquote> <p>The languages used were Java and Dart. The university running the tests teaches in Java, so subjects had prior experience in Java.
The task was one “where participants use the API in a way that objects need to be configured and passed to the API”, which was chosen because the authors thought that both types and documentation should have some effect. “The challenge for developers is to locate all the API elements necessary to properly configure [an] object”. The documentation was free-form text plus examples.</p> <p><img src="images/empirical-pl/java_v_dart.png" alt="Documentation is only helpful with types and vice versa?"></p> <p>Taken at face value, it looks like types+documentation is a lot better than having one or the other, or neither. But since the subjects were students at a school that used Java, it's not clear how much of the effect is from familiarity with Java and how much is from the language itself. Moreover, the task was a single task that was chosen specifically because it was the kind of task where both types and documentation were expected to matter.</p> <h3 id="an-experiment-about-static-and-dynamic-type-systems-hanenberg-s-http-courses-cs-washington-edu-courses-cse590n-10au-hanenberg-oopsla2010-pdf"><a href="http://courses.cs.washington.edu/courses/cse590n/10au/hanenberg-oopsla2010.pdf">An Experiment About Static and Dynamic Type Systems; Hanenberg, S.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Although static type systems are an essential part in teaching and research in software engineering and computer science, there is hardly any knowledge about what the impact of static type systems on the development time or the resulting quality for a piece of software is. On the one hand there are authors that state that static type systems decrease an application's complexity and hence its development time (which means that the quality must be improved since developers have more time left in their projects). On the other hand there are authors that argue that static type systems increase development time (and hence decrease the code quality) since they restrict developers to express themselves in a desired way. This paper presents an empirical study with 49 subjects that studies the impact of a static type system for the development of a parser over 27 hours working time. In the experiments the existence of the static type system has neither a positive nor a negative impact on an application's development time (under the conditions of the experiment).</p> </blockquote> <p><strong>Summary</strong></p> <p>This is another Hanenberg study with a basically sound experimental design, so I won't go into details about the design. Some unique parts are that, in order to control for familiarity and other things that are difficult to control for with existing languages, the author created two custom languages for this study.</p> <p>The author says that the language has similarities to Smalltalk, Ruby, and Java, and that the language is a class-based OO language with single implementation inheritance and late binding.</p> <p>The students had 16 hours of training in the new language before starting. The author argues that this was sufficient because “the language, its API as well as its IDE was kept very simple”. An additional 2 hours was spent explaining the type system to the static types group.</p> <p>There were two tasks, a “small” one (implementing a scanner) and a “large” one (implementing a parser).
The author found a statistically significant difference in time to complete the small task (the dynamic language was faster) and no difference in the time to complete the large task.</p> <p>There are a number of reasons this result may not be generalizable. The author is aware of them and there's a long section on ways this study doesn't generalize, as well as a good discussion of threats to validity.</p> <h3 id="work-in-progress-an-empirical-study-of-static-typing-in-ruby-daly-m-sazawal-v-foster-j-http-www-cs-umd-edu-jfoster-papers-plateau09-ruby-pdf"><a href="http://www.cs.umd.edu/~jfoster/papers/plateau09-ruby.pdf">Work In Progress: an Empirical Study of Static Typing in Ruby; Daly, M; Sazawal, V; Foster, J.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>In this paper, we present an empirical pilot study of four skilled programmers as they develop programs in Ruby, a popular, dynamically typed, object-oriented scripting language. Our study compares programmer behavior under the standard Ruby interpreter versus using Diamondback Ruby (DRuby), which adds static type inference to Ruby. The aim of our study is to understand whether DRuby's static typing is beneficial to programmers. We found that DRuby's warnings rarely provided information about potential errors not already evident from Ruby's own error messages or from presumed prior knowledge. We hypothesize that programmers have ways of reasoning about types that compensate for the lack of static type information, possibly limiting DRuby's usefulness when used on small programs.</p> </blockquote> <p><strong>Summary</strong></p> <p>Subjects came from a local Ruby user's group. Subjects implemented a simplified Sudoku solver and a maze solver. DRuby was randomly selected for one of the two problems for each subject. There were four subjects, but the authors changed the protocol after the first subject. Only three subjects had the same setup.</p> <p>The authors find no benefit to having types. This is one of the studies that the first Hanenberg study mentions as work its findings contradict. That first paper claimed that it was because their tasks were more complex, but it seems to me that this paper has a more complex task. One possible reason they found contradictory results is that the effect size is small. Another is that the specific type systems used matter, and that a DRuby v. Ruby study doesn't generalize to Java v. Groovy. Another is that the previous study attempted to remove anything hinting at type information from the dynamic implementation, including names that indicate types and API documentation. The participants of this study mention that they get a lot of type information from API docs, and the authors note that the participants encode type information in their method names.</p> <p>This study was presented in a case study format, with selected comments from the participants and an analysis of their comments. The authors note that participants regularly think about types, and check types, even when programming in a dynamic language.</p> <h3 id="haskell-vs-ada-vs-c-vs-awk-vs-an-experiment-in-software-prototyping-productivity-hudak-p-jones-m-http-haskell-cs-yale-edu-post-type-publication-p-366"><a href="http://haskell.cs.yale.edu/?post_type=publication&amp;p=366">Haskell vs. Ada vs. C++ vs. Awk vs. ...
An Experiment in Software Prototyping Productivity; Hudak, P; Jones, M.</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>We describe the results of an experiment in which several conventional programming languages, together with the functional language Haskell, were used to prototype a Naval Surface Warfare Center (NSWC) requirement for a Geometric Region Server. The resulting programs and development metrics were reviewed by a committee chosen by the Navy. The results indicate that the Haskell prototype took significantly less time to develop and was considerably more concise and easier to understand than the corresponding prototypes written in several different imperative languages, including Ada and C++.</p> </blockquote> <p><strong>Summary</strong></p> <p>Subjects were given an informal text description for the requirements of a geo server. The requirements were behavior-oriented and didn't mention performance. The subjects were “expert” programmers in the languages they used. They were asked to implement a prototype and track metrics such as dev time, lines of code, and docs. Metrics were all self-reported, and no guidelines were given as to how they should be measured, so metrics varied between subjects. Also, some, but not all, subjects attended a meeting where additional information was given on the assignment.</p> <p>Due to the time frame and funding constraints, the requirements for the server were extremely simple; the median implementation was a couple hundred lines of code. Furthermore, the panel that reviewed the solutions didn't have time to evaluate or run the code; they based their findings on the written reports and oral presentations of the subjects.</p> <p><img src="images/empirical-pl/hudak_table.png" alt="Table of LOC, dev time, and lines of code"></p> <p>This study hints at a very interesting result, but considering all of its limitations, the fact that each language (except Haskell) was only tested once, and that other studies show much larger intra-group variance than inter-group variance, it's hard to conclude much from this study alone.</p> <h3 id="unit-testing-isn-t-enough-you-need-static-typing-too-farrer-e-http-evanfarrer-blogspot-com-2012-06-unit-testing-isnt-enough-you-need-html"><a href="http://evanfarrer.blogspot.com/2012/06/unit-testing-isnt-enough-you-need.html">Unit testing isn't enough. You need static typing too; Farrer, E</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Unit testing and static type checking are tools for ensuring defect free software. Unit testing is the practice of writing code to test individual units of a piece of software. By validating each unit of software, defects can be discovered during development. Static type checking is performed by a type checker that automatically validates the correct typing of expressions and statements at compile time. By validating correct typing, many defects can be discovered during development. Static typing also limits the expressiveness of a programming language in that it will reject some programs which are ill-typed, but which are free of defects.</p> <p>Many proponents of unit testing claim that static type checking is an insufficient mechanism for ensuring defect free software; and therefore, unit testing is still required if static type checking is utilized.
They also assert that once unit testing is utilized, static type checking is no longer needed for defect detection, and so it should be eliminated.</p> <p>The goal of this research is to explore whether unit testing does in fact obviate static type checking in real world examples of unit tested software.</p> </blockquote> <p><strong>Summary</strong></p> <p>The author took four Python programs and translated them to Haskell. Haskell's type system found some bugs. Unlike most academic software engineering research, this study involves something larger than a toy program and looks at a type system that's more expressive than Java's type system. The programs were the NMEA Toolkit (9 bugs), MIDITUL (2 bugs), GrapeFruit (0 bugs), and PyFontInfo (6 bugs).</p> <p>As far as I can tell, there isn't an analysis of the severity of the bugs. The programs were 2324, 2253, 2390, and 609 lines long, respectively, so the bugs found / LOC were 17 / 7576 = 1 / 446. For reference, in Code Complete, Steve McConnell estimates that 15-50 bugs per 1kLOC is normal. If you believe that estimate applies to this codebase, you'd expect that this technique caught between 4% and 15% of the bugs in this code. There's no particular reason to believe the estimate should apply, but we can keep this number in mind as a reference in order to compare to a similarly generated number from another study that we'll get to later.</p> <p>The author does some analysis on how hard it would have been to find the bugs through testing, but only considers line-coverage-directed unit testing; the author comments that bugs might not have been caught by unit testing if they could be missed even with 100% line coverage. This seems artificially weak — it's generally well accepted that line coverage is a very weak notion of coverage and that testing merely to get high line coverage isn't sufficient. In fact, it is generally <a href="http://blog.regehr.org/archives/386">considered insufficient to even test merely to get high path coverage</a>, which is a much stronger notion of coverage than line coverage.</p> <h3 id="gradual-typing-of-erlang-programs-a-wrangler-experience-sagonas-k-luna-d-http-www-it-uu-se-research-group-hipe-dialyzer-publications-wrangler-pdf"><a href="http://www.it.uu.se/research/group/hipe/dialyzer/publications/wrangler.pdf">Gradual Typing of Erlang Programs: A Wrangler Experience; Sagonas, K; Luna, D</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Currently most Erlang programs contain no or very little type information. This sometimes makes them unreliable, hard to use, and difficult to understand and maintain. In this paper we describe our experiences from using static analysis tools to gradually add type information to a medium sized Erlang application that we did not write ourselves: the code base of Wrangler. We carefully document the approach we followed, the exact steps we took, and discuss possible difficulties that one is expected to deal with and the effort which is required in the process. We also show the type of software defects that are typically brought forward, the opportunities for code refactoring and improvement, and the expected benefits from embarking in such a project. We have chosen Wrangler for our experiment because the process is better explained on a code base which is small enough so that the interested reader can retrace its steps, yet large enough to make the experiment quite challenging and the experiences worth writing about.
However, we have also done something similar on large parts of Erlang/OTP. The result can partly be seen in the source code of Erlang/OTP R12B-3.</p> </blockquote> <p><strong>Summary</strong></p> <p>This is somewhat similar to the study in “Unit testing isn't enough”, except that the authors of this study created a static analysis tool instead of translating the program into another language. The authors note that they spent about half an hour finding and fixing bugs after running their tool. They also point out some bugs that would be difficult to find by testing. They explicitly state “what's interesting in our approach is that all these are achieved without imposing any (restrictive) static type system in the language.” The authors have a follow-on paper, “Static Detection of Race Conditions in Erlang”, which extends the approach.</p> <p>The list of papers that find bugs using static analysis without explicitly adding types is too long to include here. This is just one typical example.</p> <h3 id="0install-replacing-python-leonard-t-http-roscidus-com-blog-blog-2013-06-09-choosing-a-python-replacement-for-0install-pt2-http-roscidus-com-blog-blog-2013-06-20-replacing-python-round-2-pt3-http-roscidus-com-blog-blog-2014-02-13-ocaml-what-you-gain"><a href="http://roscidus.com/blog/blog/2013/06/09/choosing-a-python-replacement-for-0install/">0install: Replacing Python; Leonard, T.</a>, <a href="http://roscidus.com/blog/blog/2013/06/20/replacing-python-round-2/">pt2</a>, <a href="http://roscidus.com/blog/blog/2014/02/13/ocaml-what-you-gain/">pt3</a></h3> <p><strong>Abstract</strong></p> <p>No abstract because this is a series of blog posts.</p> <p><strong>Summary</strong></p> <p>This compares ATS, C#, Go, Haskell, OCaml, Python and Rust. The author assigns scores to various criteria, but it's really a qualitative comparison. Still, it's interesting reading because it seriously considers the effect of language on a non-trivial codebase (30kLOC).</p> <p>The author implemented parts of 0install in various languages and then eventually decided on OCaml and ported the entire thing to it. There are some great comments about why the author chose OCaml and what the author gained by using OCaml over Python.</p> <h3 id="verilog-vs-vhdl-design-competition-cooley-j-verilog-vs-vhdl"><a href="verilog-vs-vhdl/">Verilog vs. VHDL design competition; Cooley, J</a></h3> <p><strong>Abstract</strong></p> <p>No abstract because it's a usenet posting.</p> <p><strong>Summary</strong></p> <p>Subjects were given 90 minutes to create a small chunk of hardware, a synchronous loadable 9-bit increment-by-3 decrement-by-5 up/down counter that generated even parity, carry and borrow, with the goal of optimizing for cycle time of the synthesized result. For the software folks reading this, this is something you'd expect to be able to do in 90 minutes if nothing goes wrong, or maybe if only a few things go wrong.</p> <p>Subjects were judged purely by how optimized their result was, as long as it worked. Results that didn't pass all tests were disqualified. Although the task was quite simple, it was made substantially more complicated by the strict optimization goal. For any software readers out there, this task is approximately as complicated as implementing the same thing in assembly, where your assembler takes 15-30 minutes to assemble something.</p> <p>Subjects could use Verilog (unityped) or VHDL (typed).
9 people chose Verilog and 5 chose VHDL.</p> <p>During the experiment, there were a number of issues that made things easier or harder for some subjects. Overall, Verilog users were affected more negatively than VHDL users. The license server for the Verilog simulator crashed. Also, four of the five VHDL subjects were accidentally given six extra minutes. The author had manuals for the wrong logic family available, and one Verilog user spent 10 minutes reading the wrong manual before giving up and using his intuition. One of the Verilog users noted that they passed the wrong version of their code along to be tested and failed because of that. One of the VHDL users hit a bug in the VHDL simulator.</p> <p>Of the 9 Verilog users, 8 got something synthesized before the 90-minute deadline; of those, 5 had a design that passed all tests. None of the VHDL users were able to synthesize a circuit in time.</p> <p>Two of the VHDL users complained about issues with types: “I can't believe I got caught on a simple typing error. I used IEEE std_logic_arith, which requires use of unsigned &amp; signed subtypes, instead of std_logic_unsigned.”, and &quot;I ran into a problem with VHDL or VSS (I'm still not sure.) This case statement doesn't analyze: ‘subtype two_bits is unsigned(1 downto 0); case two_bits'(up &amp; down)...' But what worked was: ‘case two_bits'(up, down)...' Finally I solved this problem by assigning the concatenation first to a[n] auxiliary variable.&quot;</p> <h3 id="comparing-mathematical-provers-wiedijk-f-http-www-cs-ru-nl-f-wiedijk-comparison-diffs-pdf"><a href="http://www.cs.ru.nl/F.Wiedijk/comparison/diffs.pdf">Comparing mathematical provers; Wiedijk, F</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>We compare fifteen systems for the formalizations of mathematics with the computer. We present several tables that list various properties of these programs. The three main dimensions on which we compare these systems are: the size of their library, the strength of their logic and their level of automation.</p> </blockquote> <p><strong>Summary</strong></p> <p>The author compares the type systems and foundations of various theorem provers, and comments on their relative levels of proof automation.</p> <p><img src="images/empirical-pl/prover_types.png" alt="Type systems of provers"> <img src="images/empirical-pl/prover_foundations.png" alt="Foundations of provers"> <img src="images/empirical-pl/prover_graph.png" alt="Graph of automation level of provers"></p> <p>The author looked at one particular problem (proving the irrationality of the square root of two) and examined how different systems handle the problem, including the style of the proof and its length. There's a table of lengths, but it doesn't match the <a href="http://www.cs.ru.nl/~freek/comparison/">updated code examples provided here</a>. For instance, that table claims that the ACL2 proof is 206 lines long, but there's a <a href="http://www.cs.ru.nl/~freek/comparison/files/acl2.lisp">21-line ACL2 proof here</a>.</p> <p>The author has a number of criteria for determining how much automation each prover provides, but he freely admits that it's highly subjective. The author doesn't provide the exact rubric used for scoring, but he mentions that a more automated interaction style, user automation, powerful built-in automation, and the Poincare principle (basically whether the system lets you write programs to solve proofs algorithmically) all count towards being more automated, and more powerful logic (e.g., first-order v.
higher-order), a logical framework, dependent types, and the de Bruijn criterion (having a small guaranteed kernel) count towards being more mathematical.</p> <h3 id="do-programming-languages-affect-productivity-a-case-study-using-data-from-open-source-projects-delory-d-knutson-c-chun-s-http-sequoia-cs-byu-edu-lab-files-pubs-delorey2007a-pdf"><a href="http://sequoia.cs.byu.edu/lab/files/pubs/Delorey2007a.pdf">Do Programming Languages Affect Productivity? A Case Study Using Data from Open Source Projects; Delorey, D; Knutson, C; Chun, S</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Brooks and others long ago suggested that on average computer programmers write the same number of lines of code in a given amount of time regardless of the programming language used. We examine data collected from the CVS repositories of 9,999 open source projects hosted on SourceForge.net to test this assumption for 10 of the most popular programming languages in use in the open source community. We find that for 24 of the 45 pairwise comparisons, the programming language is a significant factor in determining the rate at which source code is written, even after accounting for variations between programmers and projects.</p> </blockquote> <p><strong>Summary</strong></p> <p>The authors say “our goal is not to construct a predictive or explanatory model. Rather, we seek only to develop a model that sufficiently accounts for the variation in our data so that we may test the significance of the estimated effect of programming language.” and that's what they do. They get some correlations, but it's hard to conclude much of anything from them.</p> <h3 id="the-unreasonable-effectiveness-of-dynamic-typing-for-practical-programs-smallshire-r-http-vimeo-com-74354480"><a href="http://vimeo.com/74354480">The Unreasonable Effectiveness of Dynamic Typing for Practical Programs; Smallshire, R</a></h3> <p><strong>Abstract</strong></p> <blockquote> <p>Some programming language theorists would have us believe that the one true path to working systems lies in powerful and expressive type systems which allow us to encode rich constraints into programs at the time they are created. If these academic computer scientists would get out more, they would soon discover an increasing incidence of software developed in languages such as Python, Ruby and Clojure which use dynamic, albeit strong, type systems. They would probably be surprised to find that much of this software—in spite of their well-founded type-theoretic hubris—actually works, and is indeed reliable out of all proportion to their expectations. This talk—given by an experienced polyglot programmer who once implemented Hindley Milner static type inference for “fun”, but who now builds large and successful systems in Python—explores the disconnect between the dire outcomes predicted by advocates of static typing versus the near absence of type errors in real world systems built with dynamic languages: Does diligent unit testing more than make up for the lack of static typing? Does the nature of the type system have only a low-order effect on reliability compared to the functional or imperative programming paradigm in use? How often is the dynamism of the type system used anyway? How much type information can JITs exploit at runtime?
Does the unwarranted success of dynamically typed languages get up the nose of people who write Haskell?</p> </blockquote> <p><strong>Summary</strong></p> <p>The speaker used data from Github to determine that approximately 2.7% of Python bugs are type errors. Python's <code>TypeError</code>, <code>AttributeError</code>, and <code>NameError</code> were classified as type errors. The speaker rounded 2.7% down to 2% and claimed that 2% of errors were type-related. The speaker mentioned that on a commercial codebase he worked with, 1% of errors were type-related, but that could be rounded down from anything less than 2%. The speaker mentioned looking at the equivalent errors in Ruby, Clojure, and other dynamic languages, but didn't present any data on those other languages.</p> <p>This data might be good but it's impossible to tell because there isn't enough information about the methodology. Something this has going for it is that the number is in the right ballpark, compared to the made-up number we got when we compared the bug rate from Code Complete to the number of bugs found by Farrer. Possibly interesting, but thin.</p> <h3 id="summary-of-summaries">Summary of summaries</h3> <p>This isn't an exhaustive list. For example, I haven't covered “An Empirical Comparison of Static and Dynamic Type Systems on API Usage in the Presence of an IDE: Java vs. Groovy with Eclipse”, and “Do developers benefit from generic types?: an empirical comparison of generic and raw types in java” because they didn't seem to add much to what we've already seen.</p> <p>I didn't cover a number of older studies that are in the related work section of almost all the listed studies both because the older studies often cover points that aren't really up for debate anymore and also because the experimental design in a lot of those older papers leaves something to be desired. Feel free to <a href="https://twitter.com/danluu">ping me</a> if there's something you think should be added to the list.</p> <p>Not only is this list not exhaustive, it's not objective and unbiased. If you read the studies, you can get a pretty good handle on how the studies are biased. However, I can't provide enough information for you to decide for yourself how the studies are biased without reproducing most of the text of the papers, so you're left with my interpretation of things, filtered through my own biases. That can't be helped, but I can at least explain my biases so you can discount my summaries appropriately.</p> <p>I like types. I find ML-like languages really pleasant to program in, and if I were king of the world, we'd all use F# as our default managed language. The situation with unmanaged languages is a bit messier. I certainly prefer C++ to C because std::unique_ptr and friends make C++ feel a lot safer than C. I suspect I might prefer Rust once it's more stable. But while I like languages with expressive type systems, I haven't noticed that they make me more productive or less bug-prone<sup class="footnote-ref" id="fnref:although-I-feel-a-lot-safer-http-blog-metaobject-com-2014-06-the-safyness-of-static-typing-html"><a rel="footnote" href="#fn:although-I-feel-a-lot-safer-http-blog-metaobject-com-2014-06-the-safyness-of-static-typing-html">0</a></sup>.</p> <p><a name="wat_summary"></a></p> <p>Now that you know what my biases are, let me give you my interpretation of the studies. Of the controlled experiments, only three show an effect large enough to have any practical significance.
the Prechelt study comparing C, C++, Java, Perl, Python, Rexx, and Tcl; the Endrikat study comparing Java and Dart; and Cooley's experiment with VHDL and Verilog. Unfortunately, they all have issues that make it hard to draw a really strong conclusion.</p> <p>In the Prechelt study, the populations were different between dynamic and typed languages, and the conditions for the tasks were also different. There was a follow-up study that illustrated the issue by inviting Lispers to come up with their own solutions to the problem, which involved comparing folks like <a href="http://wry.me/~darius/hacks/lisp-study.scm">Darius Bacon</a> to random undergrads. A follow-up to the follow-up literally involves <a href="http://norvig.com/java-lisp.html">comparing code from Peter Norvig</a> to code from random college students.</p> <p>In the Endrikat study, they specifically picked a task where they thought static typing would make a difference, and they drew their subjects from a population where everyone had taken classes using the statically typed language. They don't comment on whether or not students had experience in the dynamically typed language, but it seems safe to assume that most or all had less experience in the dynamically typed language.</p> <p>Cooley's experiment was one of the few that drew people from a non-student population, which is great. But, as with all of the other experiments, the task was a trivial toy task. While it seems damning that none of the VHDL (static language) participants were able to complete the task on time, it is extremely unusual to want to finish a hardware design in 1.5 hours anywhere outside of a school project. You might argue that a large task can be broken down into many smaller tasks, but a plausible counterargument is that there are fixed costs to using VHDL that can be amortized across many tasks.</p> <p>As for the rest of the experiments, the main takeaway I have from them is that, under the specific set of circumstances described in the studies, any effect, if it exists at all, is small.</p> <p>Moving on to the case studies, the two bug-finding case studies make for interesting reading, but they don't really make a case for or against types. One shows that transcribing Python programs to Haskell will find a non-zero number of bugs of unknown severity that might not be found through unit testing that's line-coverage oriented. The pair of Erlang papers shows that you can find some bugs that would be difficult to find through any sort of testing, some of which are severe, using static analysis.</p> <p>As a user, I find it convenient when my compiler gives me an error before I run separate static analysis tools, but that's minor, perhaps even smaller than the effect size of the controlled studies listed above.</p> <p>I found the 0install case study (that compared various languages to Python and eventually settled on Ocaml) to be one of the more interesting things I ran across, but it's the kind of subjective thing that everyone will interpret differently, which you can see by looking at how people have reacted to it.</p> <p>This fits with the impression I have (in my little corner of the world, ACL2, Isabelle/HOL, and PVS are the most commonly used provers, and it makes sense that people would prefer more automation when solving problems in industry), but that's also subjective.</p> <p>And then there are the studies that mine data from existing projects.
Unfortunately, I couldn't find anybody who did anything to determine causation (e.g., <a href="http://www.amazon.com/gp/product/0691120358/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0691120358&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=XGRCRGSCGXSGNMJG">find an appropriate instrumental variable</a>), so they just measure correlations. Some of the correlations are unexpected, but there isn't enough information to determine why. The lack of any causal instrument doesn't stop people like Ray et al. from making strong, unsupported claims.</p> <p>The only data-mining study that presents data that's potentially interesting without further exploration is Smallshire's review of Python bugs, but there isn't enough information on the methodology to figure out what his study really means, and it's not clear why he hinted at looking at data for other languages without presenting the data<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">2</a></sup>.</p> <p>Some notable omissions from the studies are comprehensive studies using experienced programmers, let alone studies that have large populations of &quot;good&quot; or &quot;bad&quot; programmers, looking at anything approaching a significant project (in places I've worked, a three-month project would be considered small, but that's multiple orders of magnitude larger than any project used in a controlled study), using &quot;modern&quot; statically typed languages, using gradual/optional typing, using modern mainstream IDEs (like VS and Eclipse), using modern radical IDEs (like LightTable), using old school editors (like Emacs and vim), doing maintenance on a non-trivial codebase, doing maintenance with anything resembling a realistic environment, doing maintenance on a codebase you're already familiar with, etc.</p> <p>If you look at the internet commentary on these studies, most of them are passed around to justify one viewpoint or another. The Prechelt study on dynamic vs. static, along with the follow-ups on Lisp, are perennial favorites of dynamic language advocates, and the GitHub mining study has recently become trendy among functional programmers.</p> <p><img src="images/empirical-pl/confirmation_bias.png" alt="A twitter thread! In the twitter thread, someone who actually read the study notes that other factors dominate language (which is explicitly pointed out by the authors of the study), which prompts someone else to respond 'Why don't you just read the paper? It explains all this. In the abstract, even.'"></p> <p>Other than cherry-picking studies to confirm a long-held position, the most common response I've heard to these sorts of studies is that the effect isn't quantifiable by a controlled experiment. However, I've yet to hear a specific reason that doesn't also apply to any other field that empirically measures human behavior. Compared to a lot of those fields, it's easy to run controlled experiments or do empirical studies. It's true that controlled studies only tell you something about a very limited set of circumstances, but the fix to that isn't to dismiss them, but to fund more studies. It's also true that it's tough to determine causation from ex-post empirical studies, but the solution isn't to ignore the data, but to do more sophisticated analysis.
For example, <a href="https://www.amazon.com/gp/product/0691120358/ref=as_li_qf_asin_il_tl?ie=UTF8&amp;tag=abroaview-20&amp;creative=9325&amp;linkCode=as2&amp;creativeASIN=0691120358&amp;linkId=8f8c46b4c33ea48f53edcc3691f57d5a">econometric methods</a> are often able to make a case for causation with data that's messier than the data we've looked at here.</p> <p>The next most common response is that their viewpoint is still valid because their specific language or use case isn't covered. Maybe, but if the strongest statement you can make for your position is that there's no empirical evidence against the position, that's not much of a position.</p> <p><small> If you've managed to read this entire thing without falling asleep, you might be interested in <a href="//danluu.com/everything-is-broken/">my opinion on tests</a>.</p> <h4 id="responses">Responses</h4> <p>Here are the responses I've gotten from people mentioned in this post. Robert Smallshire said &quot;Your review article is very good. Thanks for taking the time to put it together.&quot; On my comment about the F# &quot;mistake&quot; vs. trolling, his reply was &quot;Neither. That torque != energy is obviously solved by modeling quantities not dimensions. The point being that this modeling of quantities with types takes effort without necessarily delivering any value.&quot; Not having done much with units myself, I don't have an informed opinion on this, but my natural bias is to try to encode the information in types if at all possible.</p> <p>Bartosz Milewski said <a href="https://twitter.com/BartoszMilewski/status/532651581813325824">&quot;Guilty as charged!&quot;</a>. Wow. Much Respect. But notice that, as of this update, the correction has been retweeted 1/25th as often as the original tweet. People want to believe there's evidence their position is superior. People don't want to believe the evidence is murky, or even possibly against them. Misinformation people want to believe spreads faster than information people don't want to believe.</p> <p>In a related twitter conversation, Andreas Stefik said &quot;That is not true. It depends on which scientific question. Static vs. Dynamic is well studied.&quot;, &quot;Profound rebuttal. I had better retract my peer reviewed papers, given this new insight!&quot;, &quot;Take a look at the papers...&quot;, and &quot;This is a serious misrepresentation of our studies.&quot; I muted the guy since it didn't seem to be going anywhere, but it's possible there was a substantive response buried in some later tweet. It's pretty easy to take twitter comments out of context, so <a href="https://twitter.com/garybernhardt/status/531513870276231168">check out the thread yourself</a> if you're really curious.</p> <p>I have a lot of respect for the folks who do these experiments, which is, unfortunately, not mutual. But the really unfortunate thing is that some of the people who do these experiments think that static v. dynamic is something that is, at present, &quot;well studied&quot;. There are plenty of equally difficult-to-study subfields in the social sciences that have multiple orders of magnitude more research going on and are still considered open problems, yet at least some researchers already consider this to be well studied!</p> <h4 id="acknowledgements">Acknowledgements</h4> <p>Thanks to Leah Hanson, Joe Wilder, Robert David Grant, Jakub Wilk, Rich Loveland, Eirenarch, Edward Knight, and Evan Farrer for comments/corrections/discussion.
</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:P">This was from a talk at Strange Loop this year. The author later clarified his statement with &quot;To me, this follows immediately (a technical term in logic meaning the same thing as “trivially”) from the Curry-Howard Isomorphism we discussed, and from our Types vs. Tests: An Epic Battle? presentation two years ago. If types are theorems (they are), and implementations are proofs (they are), and your SLA is a guarantee of certain behavior of your system (it is), then how can using technology that precludes forbidding undesirable behavior of your system before other people use it (dynamic typing) possibly be anything but unethical?&quot; <a class="footnote-return" href="#fnref:P"><sup>[return]</sup></a></li> <li id="fn:H">Just as an aside, I find the online responses to Smallshire's study to be pretty great. There are, of course, the usual responses about how his evidence is wrong and therefore static types are, in fact, beneficial because there's no evidence against them, and you don't need evidence for them because you can arrive at the proper conclusion using pure reason. The really interesting bit is that, at one point, Smallshire presents an example of an F# program that can't catch a certain class of bug via its type system, and the online response is basically that he's an idiot who should have written his program in a different way so that the type system should have caught the bug. I can't tell if Smallshire's bug was an honest mistake or masterful trolling. <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> </ol> </div> CLWB and PCOMMIT clwb-pcommit/ Wed, 05 Nov 2014 00:00:00 +0000 clwb-pcommit/ <p>The latest version of the Intel manual has a couple of <a href="https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf">new instructions for non-volatile storage, like SSDs</a>. What's that about?</p> <p>Before we look at the instructions in detail, let's take a look at the issues that exist with super-fast NVRAM. One problem is that next-generation storage technologies (PCM, 3D XPoint, etc.) will be fast enough that syscall and other OS overhead can be more expensive than the actual cost of the disk access<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. Another is the impedance mismatch between the x86 memory hierarchy and persistent memory. In both cases, it's basically an <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's law</a> problem, where one component has improved so much that other components have to improve to keep up.</p> <p></p> <p>There's a <a href="http://research.microsoft.com/pubs/198366/Asplos2012_MonetaD.pdf">good paper by Todor Mollov, Louis Eisner, Arup De, Joel Coburn, and Steven Swanson</a> on the first issue; I'm going to present one of their graphs below.</p> <p><img src="images/clwb-pcommit/Moneta_overhead.png" alt="OS and other overhead for NVRAM operations"></p> <p>Everything says “Moneta” because that's the name of their system (which is pretty cool, BTW; I recommend reading the paper to see how they did it). Their “baseline” case is significantly better than you'll get out of a stock system. They did a number of optimizations (e.g., bypassing Linux's IO scheduler and removing context switches where possible), which reduce latency by 62% over plain old Linux. Despite that, the hardware + DMA cost of the transaction (the white part of the bar) is dwarfed by the overhead.
Note that they consider the cost of the DMA to be part of the hardware overhead.</p> <p>They're able to bypass the OS entirely and reduce a lot of the overhead, but it's still true that the majority of the cost of a write is overhead.</p> <p><img src="images/clwb-pcommit/Moneta_speedup.png" alt="OS bypass speedup for NVRAM operations"></p> <p>Despite not being able to get rid of all of the overhead, they get pretty significant speedups, both on small microbenchmarks and real code. So that's one problem. The OS imposes a pretty large tax on I/O when your I/O device is really fast.</p> <p>Maybe you can bypass large parts of that problem by just mapping your NVRAM device to a region of memory and committing things to it as necessary. But that runs into another problem, which is the impedance mismatch between how caches interact with the NVRAM region if you want something like transactional semantics.</p> <p>This is described <a href="http://www.hpl.hp.com/techreports/2012/HPL-2012-236.pdf">in more detail in this report by Kumud Bhandari, Dhruva R. Chakrabarti, and Hans-J. Boehm</a>. I'm going to borrow a couple of their figures, too.</p> <p><img src="images/clwb-pcommit/HP_hierarchy.png" alt="Rough memory hierarchy diagram"></p> <p>We've got this NVRAM region which is safe and persistent, but before the CPU can get to it, it has to go through multiple layers with varying ordering guarantees. They give the following example:</p> <blockquote> <p>Consider, for example, a common programming idiom where a persistent memory location N is allocated, initialized, and published by assigning the allocated address to a global persistent pointer p. If the assignment to the global pointer becomes visible in NVRAM before the initialization (presumably because the latter is cached and has not made its way to NVRAM) and the program crashes at that very point, a post-restart dereference of the persistent pointer will read uninitialized data. Assuming writeback (WB) caching mode, this can be avoided by inserting cache-line flushes for the freshly allocated persistent locations N before the assignment to the global persistent pointer p.</p> </blockquote> <p>Inserting <code>CLFLUSH</code> instructions all over the place works, but how much overhead is that?</p> <p><img src="images/clwb-pcommit/HP_synthetic.png" alt="Persistence overhead on reads and writes"></p> <p>The four memory types they look at (and the four that x86 supports) are writeback (WB), writethrough (WT), write combine (WC), and uncacheable (UC). WB is what you deal with under normal circumstances. Memory can be cached and it's written back whenever it's forced to be. WT allows memory to be cached, but writes have to be written straight through to memory, i.e., memory is kept up to date with the cache. UC simply can't be cached. WC is like UC, except that writes can be coalesced before being sent out to memory.</p> <p>The R, W, and RW benchmarks are just benchmarks of reading and writing memory. WB is clearly the best, by far (lower is better). If you want to get an intuitive feel for how much better WB is than the other policies, try booting an OS with anything but WB memory.</p> <p>I've had to do that on occasion because I used to work for a chip company, and when we first got the chip back, we often didn't know which bits we had to disable to work around bugs. The simplest way to make progress is often to disable caches entirely. That “works”, but even minimal OSes like DOS are noticeably slow to boot without WB memory.
My recollection is that Win 3.1 takes the better part of an hour, and that Win 95 is a multiple-hour process.</p> <p>The _b benchmarks force writes to be visible to memory. For the WB case, that involves an <code>MFENCE</code> followed by a <code>CLFLUSH</code>. WB with visibility constraints is significantly slower than the other alternatives. It's multiple orders of magnitude slower than WB when writes don't have to be ordered and flushed.</p> <p>They also run benchmarks on some real data structures, with the constraint that data should be persistently visible.</p> <p><img src="images/clwb-pcommit/HP_data_structures.png" alt="Persistence overhead on data structure operations"></p> <p>With that constraint, regular WB memory can be terribly slow: within a factor of 2 of the performance of running without caches at all. And that's just the overhead around getting out of the cache hierarchy -- that's true even if your persistent storage is infinitely fast.</p> <p>Now, let's look at how Intel decided to address this. There are two new instructions, <code>CLWB</code> and <code>PCOMMIT</code>.</p> <p><code>CLWB</code> acts like <code>CLFLUSH</code>, in that it forces the data to get written out to memory. However, it doesn't force the cache to throw away the data, which makes future reads and writes a lot faster. Also, <code>CLFLUSH</code> is only ordered with respect to <code>MFENCE</code>, but <code>CLWB</code> is also ordered with respect to <code>SFENCE</code>. Here's their description of <code>CLWB</code>:</p> <blockquote> <p>Writes back to memory the cache line (if dirty) that contains the linear address specified with the memory operand from any level of the cache hierarchy in the cache coherence domain. The line may be retained in the cache hierarchy in non-modified state. Retaining the line in the cache hierarchy is a performance optimization (treated as a hint by hardware) to reduce the possibility of cache miss on a subsequent access. Hardware may choose to retain the line at any of the levels in the cache hierarchy, and in some cases, may invalidate the line from the cache hierarchy. The source operand is a byte memory location.</p> <p>It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type allowing for speculative reads (such as, the WB, WC, and WT memory types). Because this speculative fetching can occur at any time and is not tied to instruction execution, the CLWB instruction is not ordered with respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can be speculatively loaded into a cache line just before, during, or after the execution of a CLWB instruction that references the cache line).</p> <p>CLWB instruction is ordered only by store-fencing operations. For example, software can use an SFENCE, MFENCE, XCHG, or LOCK-prefixed instructions to ensure that previous stores are included in the write-back. CLWB instruction need not be ordered by another CLWB or CLFLUSHOPT instruction. CLWB is implicitly ordered with older stores executed by the logical processor to the same address.</p> <p>Executions of CLWB interact with executions of PCOMMIT. The PCOMMIT instruction operates on certain store-to-memory operations that have been accepted to memory.
CLWB executed for the same cache line as an older store causes the store to become accepted to memory when the CLWB execution becomes globally visible.</p> </blockquote> <p><code>PCOMMIT</code> is applied to entire memory ranges and ensures that everything in the memory range is committed to persistent storage. Here's their description of <code>PCOMMIT</code>:</p> <blockquote> <p>The PCOMMIT instruction causes certain store-to-memory operations to persistent memory ranges to become persistent (power failure protected). Specifically, PCOMMIT applies to those stores that have been accepted to memory.</p> <p>While all store-to-memory operations are eventually accepted to memory, the following items specify the actions software can take to ensure that they are accepted:</p> <p>Non-temporal stores to write-back (WB) memory and all stores to uncacheable (UC), write-combining (WC), and write-through (WT) memory are accepted to memory as soon as they are globally visible. If, after an ordinary store to write-back (WB) memory becomes globally visible, CLFLUSH, CLFLUSHOPT, or CLWB is executed for the same cache line as the store, the store is accepted to memory when the CLFLUSH, CLFLUSHOPT or CLWB execution itself becomes globally visible.</p> <p>If PCOMMIT is executed after a store to a persistent memory range is accepted to memory, the store becomes persistent when the PCOMMIT becomes globally visible. This implies that, if an execution of PCOMMIT is globally visible when a later store to persistent memory is executed, that store cannot become persistent before the stores to which the PCOMMIT applies.</p> <p>The following items detail the ordering between PCOMMIT and other operations:</p> <p>A logical processor does not ensure previous stores and executions of CLFLUSHOPT and CLWB (by that logical processor) are globally visible before commencing an execution of PCOMMIT. This implies that software must use appropriate fencing instruction (e.g., SFENCE) to ensure the previous stores-to-memory operations and CLFLUSHOPT and CLWB executions to persistent memory ranges are globally visible (so that they are accepted to memory), before executing PCOMMIT.</p> <p>A logical processor does not ensure that an execution of PCOMMIT is globally visible before commencing subsequent stores. Software that requires that such stores not become globally visible before PCOMMIT (e.g., because the younger stores must not become persistent before those committed by PCOMMIT) can ensure by using an appropriate fencing instruction (e.g., SFENCE) between PCOMMIT and the later stores.</p> <p>An execution of PCOMMIT is ordered with respect to executions of SFENCE, MFENCE, XCHG or LOCK-prefixed instructions, and serializing instructions (e.g., CPUID).</p> <p>Executions of PCOMMIT are not ordered with respect to load operations. Software can use MFENCE to order loads with PCOMMIT.</p> <p>Executions of PCOMMIT do not serialize the instruction stream.</p> </blockquote> <p>How much <code>CLWB</code> and <code>PCOMMIT</code> actually improve performance will be up to their implementations. It will be interesting to benchmark these and see how they do. In any case, this is an attempt to solve the WB/NVRAM impedance mismatch issue.
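</p> <p>To make the intended usage concrete, here's a minimal sketch of how a single store to a persistent memory range might be made durable with these instructions. This is just my reading of the manual text quoted above, not Intel sample code; the function name is made up, and it assumes a GCC-style compiler and an assembler that accepts the <code>clwb</code> and <code>pcommit</code> mnemonics.</p> <pre><code>#include &lt;stdint.h&gt;

// Hypothetical sketch: make one 64-bit store to WB-mapped persistent memory durable.
static void persist_store(uint64_t* p, uint64_t val) {
  *p = val;                                   // ordinary store; may sit in the cache
  __asm__ volatile(&quot;clwb %0&quot; : &quot;+m&quot; (*p));    // write the dirty line back; the line may stay cached
  __asm__ volatile(&quot;sfence&quot; ::: &quot;memory&quot;);    // make the store + clwb globally visible (accepted to memory)
  __asm__ volatile(&quot;pcommit&quot; ::: &quot;memory&quot;);   // commit accepted stores to the persistence domain
  __asm__ volatile(&quot;sfence&quot; ::: &quot;memory&quot;);    // keep younger stores from becoming visible before the pcommit
}
</code></pre> <p>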
It doesn't directly address the OS overhead issue, but that can, to a large extent, be worked around without extra hardware.</p> <p><strong>If you liked this post, you'll probably also enjoy <a href="//danluu.com/intel-cat/">reading about cache partitioning in Broadwell and newer Intel server parts</a></strong>.</p> <p><small>Thanks to Eric Bron for spotting this in the manual and pointing it out, and to Leah Hanson, Nate Rowe, and 'unwind' for finding typos.</small></p> <p><small>If you haven't had enough of papers, Zvonimir Bandic pointed out <a href="https://www.usenix.org/conference/fast14/technical-sessions/presentation/vucinic">a paper by Dejan Vučinić, Qingbo Wang, Cyril Guyot, Robert Mateescu, Filip Blagojević, Luiz Franca-Neto, Damien Le Moal, Trevor Bunker, Jian Xu, and Steven Swanson on getting 1.4 us latency and 700k IOPS out of a type of NVRAM</a>.</p> <p>If you liked this post, you might also like <a href="//danluu.com/new-cpu-features/">this related post on &quot;new&quot; CPU features</a>. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">This should sound familiar to HPC and HFT folks with InfiniBand networks. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> Caches: LRU v. random 2choices-eviction/ Mon, 03 Nov 2014 00:00:00 +0000 2choices-eviction/ <p>Once upon a time, my <a href="http://pages.cs.wisc.edu/~sohi/">computer architecture professor</a> mentioned that using a random eviction policy for caches really isn't so bad. The fact that random eviction isn't bad can be surprising — if your cache fills up and you have to get rid of something, choosing the <a href="http://www.mathcs.emory.edu/~cheung/Courses/355/Syllabus/9-virtual-mem/LRU-replace.html">least recently used (LRU)</a> is an obvious choice, since you're more likely to use something if you've used it recently. If you have a tight loop, LRU is going to be perfect as long as the loop fits in cache, but it's going to cause a miss every time if the loop doesn't fit. A random eviction policy degrades gracefully as the loop gets too big.</p> <p>In practice, on real workloads, random tends to do worse than other algorithms. But what if we take two random choices (2-random) and just use LRU between those two choices?</p> <p>Here are the relative miss rates we get for SPEC CPU<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup> with a Sandy Bridge-like cache (<a href="//danluu.com/3c-conflict/">8-way associative</a>, 64k, 256k, and 2MB L1, L2, and L3 caches, respectively). These are ratios (algorithm miss rate : random miss rate); lower is better. Each cache uses the same policy at all levels of the cache.</p> <p></p> <table> <thead> <tr> <th>Policy</th> <th>L1 (64k)</th> <th>L2 (256k)</th> <th>L3 (2MB)</th> </tr> </thead> <tbody> <tr> <td>2-random</td> <td>0.91</td> <td>0.93</td> <td>0.95</td> </tr> <tr> <td>FIFO</td> <td>0.96</td> <td>0.97</td> <td>1.02</td> </tr> <tr> <td>LRU</td> <td>0.90</td> <td>0.90</td> <td>0.97</td> </tr> <tr> <td>random</td> <td>1.00</td> <td>1.00</td> <td>1.00</td> </tr> </tbody> </table> <p>Random and FIFO are both strictly worse than either LRU or 2-random. LRU and 2-random are pretty close, with LRU edging out 2-random for the smaller caches and 2-random edging out LRU for the larger caches.</p> <p>To see if anything odd is going on in any individual benchmark, we can look at the raw results on each sub-benchmark.
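</p> <p>Before digging into the per-benchmark results, here's a minimal sketch of what 2-random eviction means for a single set. This is just an illustration I wrote for this post (the numbers here come from dinero IV, not from this code), and it assumes we keep a last-access timestamp per line, i.e., at least as much bookkeeping as LRU needs:</p> <pre><code>#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;

// Hypothetical sketch: pick a victim way in one set under 2-random eviction.
// timestamp[w] is the last-access time of way w. Plain random eviction would
// just return rand() % ways; 2-random picks two random candidates and evicts
// the least recently used of the two.
static int victim_2random(const uint64_t* timestamp, int ways) {
  int a = rand() % ways;
  int b = rand() % ways;
  return (timestamp[a] &lt;= timestamp[b]) ? a : b;  // older timestamp = less recently used
}
</code></pre> <p>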
The L1, L2, and L3 miss rates are all plotted in the same column for each benchmark, below:</p> <p><img src="images/2choices-eviction/sandy-bridge-miss.png" alt="Cache miss rates for Sandy Bridge-like cache"></p> <p>As we might expect, LRU does worse than 2-random when the miss rates are high, and better when the miss rates are low.</p> <p>At this point, it's not clear if 2-random is beating LRU in L3 cache miss rates because it does better when the caches are large or because it's the third level in a hierarchical cache. Since a cache line that's being actively used in L1 or L2 isn't touched in L3, an eviction can happen from the L3 (which forces an eviction from both the L1 and L2) since, as far as the L3 is concerned, that line hasn't been used recently. This makes it less obvious that LRU is a good eviction policy for L3 cache.</p> <p>To separate out the effects, let's look at the relative miss rates for non-hierarchical (single level) vs. hierarchical caches at various sizes<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>. For the hierarchical cache, the L1 and L2 sizes are as above, 64k and 256k, and only the L3 cache size varies. Below, we've got the geometric means of the ratios<sup class="footnote-ref" id="fnref:G"><a rel="footnote" href="#fn:G">3</a></sup> of how each policy does (over all SPEC sub-benchmarks, compared to random eviction). A possible downside to this metric is that if we have some very low miss rates, those could dominate the mean since small fluctuations will have a large effect on the ratio, but we can look at the distribution of results to see if that's the case.</p> <p><img src="images/2choices-eviction/sweep-sizes-mean-ratios.png" alt="Cache miss ratios for cache sizes between 64K and 16M"></p> <p><img src="images/2choices-eviction/sweep-sizes-mean-ratios-l3.png" alt="L3 cache miss ratios for cache sizes between 512K and 16M"></p> <p>Sizes below 512k are missing for the hierarchical case because of the 256k L2 — we're using an inclusive L3 cache here, so it doesn't really make sense to have an L3 that's smaller than the L2. Sizes above 16M are omitted because cache miss rates converge when the cache gets too big, which is uninteresting.</p> <p>Looking at the single cache case, it seems that LRU works a bit better than 2-random for smaller caches (lower miss ratio is better), while 2-random edges out LRU as the cache gets bigger.
The story is similar in the hierarchical case, except that we don't really look at the smaller cache sizes where LRU is superior.</p> <p>Comparing the two cases, the results are different, but similar enough that it looks like our original results weren't only an artifact of looking at the last level of a hierarchical cache.</p> <p>Below, we'll look at the entire distribution so we can see if the mean of the ratios is being skewed by tiny results.</p> <p><img src="images/2choices-eviction/sweep-sizes-miss.png" alt="L3 cache miss ratios for cache sizes between 512K and 16M"></p> <p><img src="images/2choices-eviction/sweep-sizes-miss-l3.png" alt="L3 cache miss ratios for cache sizes between 512K and 16M"></p> <p>It looks like, for a particular cache size (one column of the graph), the randomized algorithms do better when miss rates are relatively high and worse when miss rates are relatively low, so, if anything, they're disadvantaged when we just look at the geometric mean — if we were to take the arithmetic mean, the result would be dominated by the larger results, where 2 random choices and plain old random do relatively well<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">4</a></sup>.</p> <p>From what we've seen of the mean ratios, 2-random looks fine for large caches, and from what we've seen of the distribution of the results, that's despite 2-random being penalized by the mean ratio metric, which makes it seem pretty good for large caches.</p> <p>However, it's common to implement pseudo-LRU policies because LRU can be too expensive to be workable. Since 2-random requires having at least as much information as LRU, let's take a look at what happens when we use pseudo 2-random (approximately 80% accurate), and pseudo 3-random (a two-level tournament, each level of which is approximately 80% accurate).</p> <p>Since random and FIFO are clearly not good replacement policies, I'll leave them out of the following graphs. Also, since the results were similar in the single cache as well as multi-level cache case, we can just look at the results from the more realistic multi-level cache case.</p> <p><img src="images/2choices-eviction/sweep-sizes-mean-ratios-l3-pseudo.png" alt="L3 cache miss ratios for cache sizes between 512K and 16M"></p> <p>Since pseudo 2-random acts like random 20% of the time and 2-random 80% of the time, we might expect it to fall somewhere between 2-random and random, which is exactly what happens. A simple tweak to try to improve pseudo 2-random is to try pseudo 3-random (evict the least recently used of 3 random choices). While that's still not quite as good as true 2-random, it's pretty close, and it's still better than LRU (and pseudo LRU) for caches larger than 1M.</p> <p>The one big variable we haven't explored is the set associativity. To see how LRU compares with 2-random across different cache sizes and associativities, let's look at the LRU:2-random miss ratio (higher/red means LRU is better, lower/green means 2-random is better).</p> <p><img src="images/2choices-eviction/sweep-sizes-assocs-mean-ratios.png" alt="Cache miss ratios for cache sizes between 64K and 16M with associativities up to 64"></p> <p>On average, increasing associativity increases the difference between the two policies. As before, LRU is better for small caches and 2-random is better for large caches. Associativities of 1 and 2 aren't shown because they should be identical for both algorithms.</p> <p>There's still a combinatorial explosion of possibilities we haven't tried yet.
One thing to do is to try different eviction policies at different cache levels (LRU for L1 and L2 with 2-random for L3 seems promising). Another thing to do is to try this for different types of caches. I happened to choose CPU caches because it's easy to find simulators and benchmark traces, but in today's “put a cache on it” world, there are a lot of other places 2-random can be applied<sup class="footnote-ref" id="fnref:H"><a rel="footnote" href="#fn:H">5</a></sup>.</p> <p>For any comp arch folks, from this data, I suspect that 2-random doesn't keep up with adaptive policies like DIP (although it might — it's in the right ballpark, but it was characterized on a different workload using a different simulator, so it's not 100% clear). However, a pseudo 2-random policy can be implemented that barely uses more resources than pseudo-LRU policies, which makes this very cheap compared to DIP. Also, we can see that pseudo 3-random is substantially better than pseudo 2-random, which indicates that k-random is probably an improvement over 2-random for the right k. Some k-random policy might be an improvement over DIP.</p> <p>So we've seen that this works, but why would anyone think to do this in the first place? <a href="http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf">The Power of Two Random Choices: A Survey of Techniques and Results by Mitzenmacher, Richa, and Sitaraman</a> has a great explanation. The mathematical intuition is that if we (randomly) throw n balls into n bins, the maximum number of balls in any bin is <code>O(log n / log log n)</code> with high probability, which is pretty much just <code>O(log n)</code>. But if (instead of choosing randomly) we choose the least loaded of k random bins, the maximum is <code>O(log log n / log k)</code> with high probability, i.e., even with two random choices, it's basically <code>O(log log n)</code> and each additional choice only reduces the load by a constant factor.</p> <p>This turns out to have all sorts of applications; things like <a href="http://brooker.co.za/blog/2012/01/17/two-random.html">load balancing</a> and hash distribution are natural fits for the balls and bins model. There are also a lot of applications that aren't obviously analogous to the balls and bins model, like <a href="http://citeseerx.ist.psu.edu/showciting?cid=1042246">circuit routing</a> and <a href="http://www.math.cmu.edu/math/mathcolloquium.php?SeminarSelect=522">Erdős–Rényi</a> graphs.</p> <p><small>Thanks to Jan Edler and Mark Hill for making dinero IV freely available, to Aleksandar Milenkovic for providing SPEC CPU traces, and to Carl Vogel, James Porter, Peter Fraenkel, Katerina Barone-Adesi, Jesse Luehrs, Lea Albaugh, and Kevin Lynagh for advice on plots and plotting packages, to Mindy Preston for finding a typo in the acknowledgments, to Lindsey Kuper for pointing out some terminology stuff, to Tom Wenisch for suggesting that I check out CMP$im for future work, and to Leah Hanson for extensive comments on the entire post.</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1"><p>Simulations were done with <a href="http://pages.cs.wisc.edu/~markhill/DineroIV/">dinero IV</a> with <a href="http://www.ece.uah.edu/~lacasa/sbc/sbc.html">SBC traces</a>. These were used because professors and grad students have gotten more protective of simulator code over the past couple decades, making it hard to find a modern open source simulator on GitHub.
However, dinero IV supports hierarchical caches with prefetching, so it should give a reasonable first-order approximation.</p> <p>Note that 175.vpr and 187.facerec weren't included in the traces, so they're missing from all results in this post.</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">Sizes are limited by dinero IV, which requires cache sizes to be a power of 2. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:G"><p>Why consider the geometric mean of the ratios? We have different “base” miss rates for different benchmarks. For example, 181.mcf has a much higher miss rate than 252.eon. If we're trying to figure out which policy is best, those differences are just noise. Looking at the ratios removes that noise.</p> <p>And if we were just comparing those two, we'd like being 2x better on both to be equivalent to being 4x better on one and just 1x on the other, or 8x better on one and 1/2x “better” on the other. Since the geometric mean is the nth-root of the product of the results, it has that property.</p> <a class="footnote-return" href="#fnref:G"><sup>[return]</sup></a></li> <li id="fn:C">We can see that 2-choices tends to be better than LRU for high miss rates by looking for the high up clusters of a green triangle, red square, empty diamond, and a blue circle, and seeing that it's usually the case that the green triangle is above the red square. It's too cluttered to really tell what's going on at the lower miss rates. I admit I cheated and looked at some zoomed in plots. <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:H">If you know of a cache simulator for some other domain that I can use, please <a href="https://twitter.com/danluu">let me know</a>! <a class="footnote-return" href="#fnref:H"><sup>[return]</sup></a></li> </ol> </div> Testing v. informal reasoning tests-v-reason/ Mon, 03 Nov 2014 00:00:00 +0000 tests-v-reason/ <p>This is an off-the-cuff comment for Hacker School's Paper of the Week Read Along series for <a href="https://github.com/papers-we-love/papers-we-love/raw/master/design/out-of-the-tar-pit.pdf">Out of the Tar Pit</a>.</p> <p></p> <p>I find the idea itself, which is presented in sections 7-10, at the end of the paper, pretty interesting. However, I have some objections to the motivation for the idea, which makes up the first 60% of the paper.</p> <p>Rather than do one of those blow-by-blow rebuttals that's so common on blogs, I'll limit my comments to one widely circulated idea that I believe is not only mistaken but actively harmful.</p> <p>There's a claim that “informal reasoning” is more important than “testing”<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>, based mostly on the strength of this quote from Dijkstra:</p> <blockquote> <p>testing is hopelessly inadequate....(it) can be used very effectively to show the presence of bugs but never to show their absence.</p> </blockquote> <p>They go on to make a number of related claims, like “The key problem is that a test (of any kind) on a system or component that is in one particular state tells you nothing at all about the behavior of that system or component when it happens to be in another state.”, with the conclusion that stateless simplicity is the only possible fix. 
Needless to say, they assume that simplicity is actually possible.</p> <p>I actually agree with the bit about testing -- there's no way to avoid bugs if you create a system that's too complex to formally verify.</p> <p>However, there are plenty of real systems with too much irreducible complexity to make simple. Drawing from my own experience, no human can possibly hope to understand a modern high-performance CPU well enough to informally reason about its correctness. That's not only true now, it's been true for decades. It becomes true the moment someone introduces any sort of speculative execution or caching. These things are inherently stateful and complicated. They're so complicated that the only way to model performance (in order to run experiments to design high performance chips) is to simulate precisely what will happen, since the exact results are too complex for humans to reason about and too messy to be mathematically tractable. It's possible to make a simple CPU, but not one that's fast and simple. This doesn't only apply to CPUs -- performance complexity <a href="http://duartes.org/gustavo/blog/post/performance-is-a-science/">leaks all the way up the stack</a>.</p> <p>And it's not only high performance hardware and software that's complex. Some domains are just really complicated. The tax code is <a href="http://finance.townhall.com/columnists/politicalcalculations/2014/04/13/2014-how-many-pages-in-the-us-tax-code-n1823832">73k pages long</a>. It's just not possible to reason effectively about something that complicated, and there are plenty of things that are that complicated.</p> <p>And then there's the fact that we're human. We make mistakes. Euclid's Elements contains a bug in the very first theorem. Andrew Gelman likes to use <a href="http://andrewgelman.com/2014/12/25/common-sense-statistics/">this example</a> of an &quot;obviously&quot; bogus published probability result (but not obvious to the authors or the peer reviewers). One of the famous Intel CPU bugs allegedly comes from not testing something because they &quot;knew&quot; it was correct. No matter how smart or knowledgeable, humans are incapable of reasoning correctly all of the time.</p> <p>So what do you do? You write tests! They're necessary for anything above a certain level of complexity. The argument the authors make is that they're not sufficient because the state space is huge and a test of one state tells you literally nothing about the behavior in any other state.</p> <p>That's true if you look at your system as some kind of unknowable black box, but it turns out to be untrue in practice. There are plenty of <a href="http://researcher.watson.ibm.com/researcher/view_group.php?id=2987">unit testing tools</a> that will do state space reduction based on how similar inputs affect similar states, do symbolic execution, etc. This turns out to work pretty well.</p> <p>And even without resorting to formal methods, you can see this with plain old normal tests. John Regehr has noted that when <a href="http://embed.cs.utah.edu/csmith/">Csmith</a> finds a bug, test case reduction often finds a slew of other bugs. Turns out, tests often tell you something about nearby states.</p> <p>This is not just a theoretical argument. I did CPU design/verification/test for 7.5 years at a company that relied primarily on testing. In that time I can recall two bugs that were found by customers (as opposed to our testing). One was a manufacturing bug that has no software analogue.
The software equivalent would be that the software works for years and then, after lots of usage at high temperature, 1% of customers suddenly can't use their software anymore. Bad, but not a failure of anything analogous to software testing.</p> <p>The other bug was a legitimate logical bug (in the cache memory hierarchy, of course). It's pretty embarrassing that we shipped samples of a chip with a real bug to customers, but I think that most companies would be pretty happy with one logical bug in seven and a half years.</p> <p>Testing may not be sufficient to find all bugs, but it can be sufficient to achieve better reliability than pretty much any software company cares to achieve.</p> <p><small>Thanks (or perhaps anti-thanks) to David Albert for goading me into writing up this response and to Govert Versluis for catching a typo.</small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">These kinds of claims are always a bit odd to talk about. Like nature v. nurture, we clearly get bad results if we set either quantity to zero, and they interact in a way that makes it difficult to quantify the relative effect of non-zero quantities. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> Assembly v. intrinsics assembly-intrinsics/ Sun, 19 Oct 2014 00:00:00 +0000 assembly-intrinsics/ <p>Every once in a while, I hear how intrinsics have improved enough that it's safe to use them for high performance code. That would be nice. The promise of intrinsics is that you can write optimized code by calling out to functions (intrinsics) that correspond to particular assembly instructions. Since intrinsics act like normal functions, they can be cross platform. And since your compiler has access to more computational power than your brain, as well as a detailed model of every CPU, the compiler should be able to do a better job of micro-optimizations. Despite decade-old claims that intrinsics can make your life easier, it never seems to work out.</p> <p>The last time I tried intrinsics was around 2007; for more on why they were hopeless then, <a href="http://virtualdub.org/blog/pivot/entry.php?id=162">see this exploration by the author of VirtualDub</a>. I gave them another shot recently, and while they've improved, they're still not worth the effort. The problem is that intrinsics are so unreliable that you have to manually check the result on every platform and every compiler you expect your code to be run on, and then tweak the intrinsics until you get a reasonable result. That's more work than just writing the assembly by hand. If you don't check the results by hand, it's easy to get bad results.</p> <p></p> <p>For example, as of this writing, the first two Google hits for <code>popcnt benchmark</code> (and 2 out of the top 3 Bing hits) claim that Intel's hardware <code>popcnt</code> instruction is slower than a software implementation that counts the number of bits set in a buffer, via a table lookup using the <a href="https://en.wikipedia.org/wiki/SSSE3">SSSE3</a> <code>pshufb</code> instruction. This turns out to be untrue, but it must not be obvious, or this claim wouldn't be so persistent. Let's see why someone might have come to the conclusion that the <code>popcnt</code> instruction is slow if they coded up a solution using intrinsics.</p> <p>One of the top search hits has sample code and benchmarks for both native <code>popcnt</code> as well as the software version using <code>pshufb</code>.
<a href="http://www.strchr.com/media/crc32_popcnt.zip">Their code</a> requires MSVC, which I don't have access to, but their first <code>popcnt</code> implementation just calls the <code>popcnt</code> intrinsic in a loop, which is fairly easy to reproduce in a form that gcc and clang will accept. Timing it is also pretty simple, since we're just timing a function (that happens to count the number of bits set in some fixed-size buffer).</p> <pre><code>uint32_t builtin_popcnt(const uint64_t* buf, int len) {
  int cnt = 0;
  for (int i = 0; i &lt; len; ++i) {
    cnt += __builtin_popcountll(buf[i]);
  }
  return cnt;
}
</code></pre> <p>This is slightly different from the code I linked to above, since they use the dword (32-bit) version of <code>popcnt</code>, and we're using the qword (64-bit) version. Since our version gets twice as much done per loop iteration, I'd expect our version to be faster than their version.</p> <p>Running <code>clang -O3 -mpopcnt -funroll-loops</code> produces a binary that we can examine. On Macs, we can use <code>otool -tv</code> to get the disassembly. On Linux, there's <code>objdump -d</code>.</p> <pre><code>_builtin_popcnt:
; address instruction
0000000100000b30 pushq %rbp
0000000100000b31 movq %rsp, %rbp
0000000100000b34 movq %rdi, -0x8(%rbp)
0000000100000b38 movl %esi, -0xc(%rbp)
0000000100000b3b movl $0x0, -0x10(%rbp)
0000000100000b42 movl $0x0, -0x14(%rbp)
0000000100000b49 movl -0x14(%rbp), %eax
0000000100000b4c cmpl -0xc(%rbp), %eax
0000000100000b4f jge 0x100000bd4
0000000100000b55 movslq -0x14(%rbp), %rax
0000000100000b59 movq -0x8(%rbp), %rcx
0000000100000b5d movq (%rcx,%rax,8), %rax
0000000100000b61 movq %rax, %rcx
0000000100000b64 shrq %rcx
0000000100000b67 movabsq $0x5555555555555555, %rdx
0000000100000b71 andq %rdx, %rcx
0000000100000b74 subq %rcx, %rax
0000000100000b77 movabsq $0x3333333333333333, %rcx
0000000100000b81 movq %rax, %rdx
0000000100000b84 andq %rcx, %rdx
0000000100000b87 shrq $0x2, %rax
0000000100000b8b andq %rcx, %rax
0000000100000b8e addq %rax, %rdx
0000000100000b91 movq %rdx, %rax
0000000100000b94 shrq $0x4, %rax
0000000100000b98 addq %rax, %rdx
0000000100000b9b movabsq $0xf0f0f0f0f0f0f0f, %rax
0000000100000ba5 andq %rax, %rdx
0000000100000ba8 movabsq $0x101010101010101, %rax
0000000100000bb2 imulq %rax, %rdx
0000000100000bb6 shrq $0x38, %rdx
0000000100000bba movl %edx, %esi
0000000100000bbc movl -0x10(%rbp), %edi
0000000100000bbf addl %esi, %edi
0000000100000bc1 movl %edi, -0x10(%rbp)
0000000100000bc4 movl -0x14(%rbp), %eax
0000000100000bc7 addl $0x1, %eax
0000000100000bcc movl %eax, -0x14(%rbp)
0000000100000bcf jmpq 0x100000b49
0000000100000bd4 movl -0x10(%rbp), %eax
0000000100000bd7 popq %rbp
0000000100000bd8 ret
</code></pre> <p>Well, that's interesting. Clang seems to be calculating things manually rather than using <code>popcnt</code>.
It seems to be using <a href="http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel">the approach described here</a>, which is something like</p> <pre><code>x = x - ((x &gt;&gt; 0x1) &amp; 0x5555555555555555);
x = (x &amp; 0x3333333333333333) + ((x &gt;&gt; 0x2) &amp; 0x3333333333333333);
x = (x + (x &gt;&gt; 0x4)) &amp; 0xF0F0F0F0F0F0F0F;
ans = (x * 0x101010101010101) &gt;&gt; 0x38;
</code></pre> <p>That's not bad for a simple implementation that doesn't rely on any kind of specialized hardware, but that's going to take a lot longer than a single <code>popcnt</code> instruction.</p> <p>I've got a pretty old version of clang (3.0), so let me try this again after upgrading to 3.4, in case they added hardware <code>popcnt</code> support “recently”.</p> <pre><code>0000000100001340 pushq %rbp            ; save frame pointer
0000000100001341 movq %rsp, %rbp       ; new frame pointer
0000000100001344 xorl %ecx, %ecx       ; cnt = 0
0000000100001346 testl %esi, %esi
0000000100001348 jle 0x100001363
000000010000134a nopw (%rax,%rax)
0000000100001350 popcntq (%rdi), %rax  ; “eax” = popcnt[rdi]
0000000100001355 addl %ecx, %eax       ; eax += cnt
0000000100001357 addq $0x8, %rdi       ; increment address by 64-bits (8 bytes)
000000010000135b decl %esi             ; decrement loop counter; sets flags
000000010000135d movl %eax, %ecx       ; cnt = eax; does not set flags
000000010000135f jne 0x100001350       ; examine flags. if esi != 0, goto popcnt
0000000100001361 jmp 0x100001365       ; goto “restore frame pointer”
0000000100001363 movl %ecx, %eax
0000000100001365 popq %rbp             ; restore frame pointer
0000000100001366 ret
</code></pre> <p>That's better! We get a hardware <code>popcnt</code>! Let's compare this to the SSSE3 <code>pshufb</code> implementation <a href="http://www.strchr.com/crc32_popcnt">presented here</a> as the fastest way to do a popcnt. We'll use a table like the one in the link to show speed, except that we're going to show a rate, instead of the raw cycle count, so that the relative speed between different sizes is clear. The rate is GB/s, i.e., how many gigs of buffer we can process per second. We give the function data in chunks (varying from 1kb to 16Mb); each column is the rate for a different chunk-size. If we look at how fast each algorithm is for various buffer sizes, we get the following.</p> <table> <thead> <tr> <th>Algorithm</th> <th>1k</th> <th>4k</th> <th>16k</th> <th>65k</th> <th>256k</th> <th>1M</th> <th>4M</th> <th>16M</th> </tr> </thead> <tbody> <tr> <td>Intrinsic</td> <td>6.9</td> <td>7.3</td> <td>7.4</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> </tr> <tr> <td>PSHUFB</td> <td>11.5</td> <td>13.0</td> <td>13.3</td> <td>13.4</td> <td>13.1</td> <td>13.4</td> <td>13.0</td> <td>12.6</td> </tr> </tbody> </table> <p>That's not so great. Relative to the benchmark linked above, we're doing better because we're using 64-bit <code>popcnt</code> instead of 32-bit <code>popcnt</code>, but the PSHUFB version is still almost twice as fast<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">1</a></sup>.</p> <p>One odd thing is the way <code>cnt</code> gets accumulated. <code>cnt</code> is stored in <code>ecx</code>. But, instead of adding the result of the <code>popcnt</code> to <code>ecx</code>, clang has decided to add <code>ecx</code> to the result of the <code>popcnt</code>.
To fix that, clang then has to move that sum into <code>ecx</code> at the end of each loop iteration.</p> <p>The other noticeable problem is that we only get one <code>popcnt</code> per iteration of the loop, which means the loop isn't getting unrolled, and we're paying the entire cost of the loop overhead for each <code>popcnt</code>. Unrolling the loop can also let the CPU extract more instruction-level parallelism from the code, although that's a bit beyond the scope of this blog post.</p> <p>Using clang, the loop doesn't get unrolled even with <code>-O3 -funroll-loops</code>. Using gcc, we get a properly unrolled loop, but gcc has other problems, as we'll see later. For now, let's try unrolling the loop ourselves by calling <code>__builtin_popcountll</code> multiple times during each iteration of the loop. For simplicity, let's try doing four <code>popcnt</code> operations on each iteration. I don't claim that's optimal, but it should be an improvement.</p> <pre><code>uint32_t builtin_popcnt_unrolled(const uint64_t* buf, int len) {
  assert(len % 4 == 0);
  int cnt = 0;
  for (int i = 0; i &lt; len; i+=4) {
    cnt += __builtin_popcountll(buf[i]);
    cnt += __builtin_popcountll(buf[i+1]);
    cnt += __builtin_popcountll(buf[i+2]);
    cnt += __builtin_popcountll(buf[i+3]);
  }
  return cnt;
}
</code></pre> <p>The core of our loop now has</p> <pre><code>0000000100001390 popcntq (%rdi,%rcx,8), %rdx
0000000100001396 addl %eax, %edx
0000000100001398 popcntq 0x8(%rdi,%rcx,8), %rax
000000010000139f addl %edx, %eax
00000001000013a1 popcntq 0x10(%rdi,%rcx,8), %rdx
00000001000013a8 addl %eax, %edx
00000001000013aa popcntq 0x18(%rdi,%rcx,8), %rax
00000001000013b1 addl %edx, %eax
</code></pre> <p>with pretty much the same code surrounding the loop body. We're doing four <code>popcnt</code> operations every time through the loop, which results in the following performance:</p> <table> <thead> <tr> <th>Algorithm</th> <th>1k</th> <th>4k</th> <th>16k</th> <th>65k</th> <th>256k</th> <th>1M</th> <th>4M</th> <th>16M</th> </tr> </thead> <tbody> <tr> <td>Intrinsic</td> <td>6.9</td> <td>7.3</td> <td>7.4</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> </tr> <tr> <td>PSHUFB</td> <td>11.5</td> <td>13.0</td> <td>13.3</td> <td>13.4</td> <td>13.1</td> <td>13.4</td> <td>13.0</td> <td>12.6</td> </tr> <tr> <td>Unrolled</td> <td>12.5</td> <td>14.4</td> <td>15.0</td> <td>15.1</td> <td>15.2</td> <td>15.2</td> <td>15.2</td> <td>15.2</td> </tr> </tbody> </table> <p>Between using 64-bit <code>popcnt</code> and unrolling the loop, we've already beaten the allegedly faster <code>pshufb</code> code! But it's close enough that we might get different results with another compiler or some other chip. Let's see if we can do better.</p> <p>So, what's the deal with this <a href="http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance">popcnt false dependency bug</a> that's been getting a lot of publicity lately? Turns out, <code>popcnt</code> has a false dependency on its destination register, which means that even though the result of <code>popcnt</code> doesn't depend on its destination register, the CPU thinks that it does and will wait until the destination register is ready before starting the <code>popcnt</code> instruction.</p> <p>x86 typically has two-operand operations, e.g., <code>addl %eax, %edx</code> adds <code>eax</code> and <code>edx</code>, and then places the result in <code>edx</code>, so it's common for an operation to have a dependency on its output register.
In this case, there shouldn't be a dependency, since the result doesn't depend on the contents of the output register, but that's an easy bug to introduce, and a hard one to catch<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">2</a></sup>.</p> <p>In this particular case, <code>popcnt</code> has a 3-cycle latency, but it's pipelined such that a <code>popcnt</code> operation can execute each cycle. If we ignore other overhead, that means that a single <code>popcnt</code> will take 3 cycles, 2 will take 4 cycles, 3 will take 5 cycles, and n will take n+2 cycles, as long as the operations are independent. But, if the CPU incorrectly thinks there's a dependency between them, we effectively lose the ability to pipeline the instructions, and that n+2 turns into 3n.</p> <p>We can work around this by buying a CPU from AMD or VIA, or by putting the <code>popcnt</code> results in different registers. Let's make an array of destinations, which will let us put the result from each <code>popcnt</code> into a different place.</p> <pre><code>uint32_t builtin_popcnt_unrolled_errata(const uint64_t* buf, int len) {
  assert(len % 4 == 0);
  int cnt[4];
  for (int i = 0; i &lt; 4; ++i) {
    cnt[i] = 0;
  }
  for (int i = 0; i &lt; len; i+=4) {
    cnt[0] += __builtin_popcountll(buf[i]);
    cnt[1] += __builtin_popcountll(buf[i+1]);
    cnt[2] += __builtin_popcountll(buf[i+2]);
    cnt[3] += __builtin_popcountll(buf[i+3]);
  }
  return cnt[0] + cnt[1] + cnt[2] + cnt[3];
}
</code></pre> <p>And now we get</p> <pre><code>0000000100001420 popcntq (%rdi,%r9,8), %r8
0000000100001426 addl %ebx, %r8d
0000000100001429 popcntq 0x8(%rdi,%r9,8), %rax
0000000100001430 addl %r14d, %eax
0000000100001433 popcntq 0x10(%rdi,%r9,8), %rdx
000000010000143a addl %r11d, %edx
000000010000143d popcntq 0x18(%rdi,%r9,8), %rcx
</code></pre> <p>That's better -- we can see that the first popcnt outputs into <code>r8</code>, the second into <code>rax</code>, the third into <code>rdx</code>, and the fourth into <code>rcx</code>. However, this does the same odd accumulation as the original, where instead of adding the result of the <code>popcnt</code> to <code>cnt[i]</code>, it does the opposite, which necessitates moving the results back to <code>cnt[i]</code> afterwards.</p> <pre><code>000000010000133e movl %ecx, %r10d
0000000100001341 movl %edx, %r11d
0000000100001344 movl %eax, %r14d
0000000100001347 movl %r8d, %ebx
</code></pre> <p>Well, at least in clang (3.4).
Gcc (4.8.2) is too smart to fall for this separate destination thing and “optimizes” the code back to something like our original version.</p> <table> <thead> <tr> <th>Algorithm</th> <th>1k</th> <th>4k</th> <th>16k</th> <th>65k</th> <th>256k</th> <th>1M</th> <th>4M</th> <th>16M</th> </tr> </thead> <tbody> <tr> <td>Intrinsic</td> <td>6.9</td> <td>7.3</td> <td>7.4</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> </tr> <tr> <td>PSHUFB</td> <td>11.5</td> <td>13.0</td> <td>13.3</td> <td>13.4</td> <td>13.1</td> <td>13.4</td> <td>13.0</td> <td>12.6</td> </tr> <tr> <td>Unrolled</td> <td>12.5</td> <td>14.4</td> <td>15.0</td> <td>15.1</td> <td>15.2</td> <td>15.2</td> <td>15.2</td> <td>15.2</td> </tr> <tr> <td>Unrolled 2</td> <td>14.3</td> <td>16.3</td> <td>17.0</td> <td>17.2</td> <td>17.2</td> <td>17.0</td> <td>16.8</td> <td>16.7</td> </tr> </tbody> </table> <p>To get a version that works with both gcc and clang, and doesn't have these extra <code>mov</code>s, we'll have to write the assembly by hand<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">3</a></sup>:</p> <pre><code>uint32_t builtin_popcnt_unrolled_errata_manual(const uint64_t* buf, int len) { assert(len % 4 == 0); uint64_t cnt[4]; for (int i = 0; i &lt; 4; ++i) { cnt[i] = 0; } for (int i = 0; i &lt; len; i+=4) { __asm__( &quot;popcnt %4, %4 \n\t&quot; &quot;add %4, %0 \n\t&quot; &quot;popcnt %5, %5 \n\t&quot; &quot;add %5, %1 \n\t&quot; &quot;popcnt %6, %6 \n\t&quot; &quot;add %6, %2 \n\t&quot; &quot;popcnt %7, %7 \n\t&quot; &quot;add %7, %3 \n\t&quot; // +r means input/output, r means input : &quot;+r&quot; (cnt[0]), &quot;+r&quot; (cnt[1]), &quot;+r&quot; (cnt[2]), &quot;+r&quot; (cnt[3]) : &quot;r&quot; (buf[i]), &quot;r&quot; (buf[i+1]), &quot;r&quot; (buf[i+2]), &quot;r&quot; (buf[i+3])); } return cnt[0] + cnt[1] + cnt[2] + cnt[3]; } </code></pre> <p>This directly translates the assembly into the loop:</p> <pre><code>00000001000013c3 popcntq %r10, %r10 00000001000013c8 addq %r10, %rcx 00000001000013cb popcntq %r11, %r11 00000001000013d0 addq %r11, %r9 00000001000013d3 popcntq %r14, %r14 00000001000013d8 addq %r14, %r8 00000001000013db popcntq %rbx, %rbx </code></pre> <p>Great! The <code>add</code>s are now going the right direction, because we specified exactly what they should do.</p> <table> <thead> <tr> <th>Algorithm</th> <th>1k</th> <th>4k</th> <th>16k</th> <th>65k</th> <th>256k</th> <th>1M</th> <th>4M</th> <th>16M</th> </tr> </thead> <tbody> <tr> <td>Intrinsic</td> <td>6.9</td> <td>7.3</td> <td>7.4</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> <td>7.5</td> </tr> <tr> <td>PSHUFB</td> <td>11.5</td> <td>13.0</td> <td>13.3</td> <td>13.4</td> <td>13.1</td> <td>13.4</td> <td>13.0</td> <td>12.6</td> </tr> <tr> <td>Unrolled</td> <td>12.5</td> <td>14.4</td> <td>15.0</td> <td>15.1</td> <td>15.2</td> <td>15.2</td> <td>15.2</td> <td>15.2</td> </tr> <tr> <td>Unrolled 2</td> <td>14.3</td> <td>16.3</td> <td>17.0</td> <td>17.2</td> <td>17.2</td> <td>17.0</td> <td>16.8</td> <td>16.7</td> </tr> <tr> <td>Assembly</td> <td>17.5</td> <td>23.7</td> <td>25.3</td> <td>25.3</td> <td>26.3</td> <td>26.3</td> <td>25.3</td> <td>24.3</td> </tr> </tbody> </table> <p>Finally! A version that blows away the <code>PSHUFB</code> implementation. How do we know this should be the final version? We can see from <a href="http://www.agner.org/optimize/instruction_tables.pdf">Agner's instruction tables</a> that we can execute, at most, one <code>popcnt</code> per cycle. 
I happen to have run this on a 3.4GHz Sandy Bridge, so we've got an upper bound of <code>8 bytes / cycle * 3.4 G cycles / sec = 27.2 GB/s</code>. That's pretty close to the <code>26.3 GB/s</code> we're actually getting, which is a sign that we can't make this much faster<sup class="footnote-ref" id="fnref:S"><a rel="footnote" href="#fn:S">4</a></sup>.</p> <p>In this case, the hand coded assembly version is about 3x faster than the original intrinsic loop (not counting the version from a version of clang that didn't emit a <code>popcnt</code>). It happens that, for the compiler we used, the unrolled loop using the <code>popcnt</code> intrinsic is a bit faster than the <code>pshufb</code> version, but that wasn't true of one of the two unrolled versions when I tried this with <code>gcc</code>.</p> <p>It's easy to see why someone might have benchmarked the same code and decided that <code>popcnt</code> isn't very fast. It's also easy to see why using intrinsics for performance critical code can be a huge time sink<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">5</a></sup>.</p> <p><small>Thanks to <a href="https://github.com/graue/">Scott</a> for some comments on the organization of this post, and to <a href="http://blog.leahhanson.us/">Leah</a> for extensive comments on just about everything</small></p> <p><strong>If you liked this, you'll probably enjoy <a href="//danluu.com/new-cpu-features/">this post about how CPUs have changed since the 80s</a>.</strong></p> <div class="footnotes"> <hr /> <ol> <li id="fn:B"><a href="https://github.com/danluu/dump/blob/master/popcnt-speed-comparison/popcnt.c">see this</a> for the actual benchmarking code. On second thought, it's an embarrassingly terrible hack, and I'd prefer that you don't look. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:D">If it were the other way around, and the hardware didn't realize there was a dependency when there should be, that would be easy to catch -- any sequence of instructions that was dependent might produce an incorrect result. In this case, some sequences of instructions are just slower than they should be, which is not trivial to check for. <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:A">This code is a simplified version of <a href="http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance">Alex Yee's stackoverflow answer about the popcnt false dependency bug</a> <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:S"><p>That's not quite right, since the CPU has TurboBoost, but it's pretty close. Putting that aside, this example is pretty simple, but calculating this stuff by hand can get tedious for more complicated code. Luckily, the <a href="https://software.intel.com/en-us/articles/intel-architecture-code-analyzer/">Intel Architecture Code Analyzer</a> can figure this stuff out for us. It finds the bottleneck in the code (assuming infinite memory bandwidth at zero latency), and displays how and why the processor is bottlenecked, which is usually enough to determine if there's room for more optimization.</p> <p>You might have noticed that the performance decreases as the buffer size becomes larger than our cache. 
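To see that drop-off for yourself, a rough timing harness along these lines will do (a sketch, not the actual benchmarking code linked in the first footnote; it uses the plain intrinsic loop so it compiles on its own, and any of the implementations above can be swapped in):</p> <pre><code>/* Rough sketch: gcc -O3 harness.c (add -lrt on older glibc). */
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;time.h&gt;

/* Stand-in for the functions above; swap in any of them to compare. */
static uint32_t popcnt_buf(const uint64_t* buf, int len) {
  uint32_t cnt = 0;
  for (int i = 0; i &lt; len; ++i) {
    cnt += __builtin_popcountll(buf[i]);
  }
  return cnt;
}

int main(void) {
  /* Buffer sizes from 1k to 16M bytes, matching the tables above. */
  for (size_t bytes = 1024; bytes &lt;= 16 * 1024 * 1024; bytes *= 4) {
    int len = (int)(bytes / sizeof(uint64_t));
    uint64_t* buf = malloc(bytes);
    for (int i = 0; i &lt; len; ++i) {
      buf[i] = (uint64_t)i * 0x9E3779B97F4A7C15ULL;  /* arbitrary nonzero data */
    }

    /* Run enough passes over the buffer that timing overhead doesn't dominate. */
    int iters = (int)(256 * 1024 * 1024 / bytes);
    struct timespec start, end;
    uint32_t sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &amp;start);
    for (int it = 0; it &lt; iters; ++it) {
      sink += popcnt_buf(buf, len);
    }
    clock_gettime(CLOCK_MONOTONIC, &amp;end);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf(&quot;%9zu bytes: %6.2f GB/s (checksum %u)\n&quot;,
           bytes, (double)bytes * iters / secs / 1e9, sink);
    free(buf);
  }
  return 0;
}
</code></pre> <p>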
It's possible to do a back of the envelope calculation to find the upper bound imposed by the limits of memory and cache performance, but working through the calculations would take a lot more space than this footnote has available to it. You can see a good example of how to do it for one simple case <a href="https://software.intel.com/en-us/forums/topic/480004">here</a>. The comments by Nathan Kurz and John McCalpin are particularly good.</p> <a class="footnote-return" href="#fnref:S"><sup>[return]</sup></a></li> <li id="fn:1"><p>In the course of running these benchmarks, I also noticed that <code>_mm_cvtsi128_si64</code> produces bizarrely bad code on gcc (although it's fine in clang). <code>_mm_cvtsi128_si64</code> is the intrinsic for moving an SSE (SIMD) register to a general purpose register (GPR). The compiler has a lot of latitude over whether or not a variable should live in a register or in memory. Clang realizes that it's probably faster to move the value from an SSE register to a GPR if the result is about to get used. Gcc decides to save a register and move the data from the SSE register to memory, and then have the next instruction operate on memory, if that's possible. In our <code>popcnt</code> example, clang loses about 2x for not unrolling the loop, and the rest comes from not being up to date on a CPU bug, which is understandable. It's hard to imagine why a compiler would do a register to memory move when it's about to operate on data unless it either doesn't do optimizations at all, or it has some bug which makes it unaware of the register to register version of the instruction. But at least it gets the right result, <a href="https://social.msdn.microsoft.com/Forums/en-US/b2e688e6-1d28-4cf0-9880-735e6838db6a/a-bug-in-vc-compiler?forum=vsprereleaseannouncements">unlike this version of MSVC</a>.</p> <p>icc and armcc are reputed to be better at dealing with intrinsics, but they're non-starters for most open source projects. Downloading icc's free non-commercial version has been disabled for the better part of a year, and even if it comes back, who's going to trust that it won't disappear again? As for armcc, I'm not sure it's ever had a free version?</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> Google wage fixing, 11-CV-02509-LHK, ORDER DENYING PLAINTIFFS' MOTION FOR PRELIMINARY APPROVAL OF SETTLEMENTS WITH ADOBE, APPLE, GOOGLE, AND INTEL google-wage-fixing/ Thu, 14 Aug 2014 00:00:00 +0000 google-wage-fixing/ <pre><code>UNITED STATES DISTRICT COURT NORTHERN DISTRICT OF CALIFORNIA SAN JOSE DIVISION IN RE: HIGH-TECH EMPLOYEE ANTITRUST LITIGATION THIS DOCUMENT RELATES TO: ALL ACTIONS Case No.: 11-CV-02509-LHK </code></pre> <h3 id="order-denying-plaintiffs-motion-for-preliminary-approval-of-settlements-with-adobe-apple-google-and-intel">ORDER DENYING PLAINTIFFS' MOTION FOR PRELIMINARY APPROVAL OF SETTLEMENTS WITH ADOBE, APPLE, GOOGLE, AND INTEL</h3> <p>Before the Court is a Motion for Preliminary Approval of Class Action Settlement with Defendants Adobe Systems Inc. (&quot;Adobe&quot;), Apple Inc. (&quot;Apple&quot;), Google Inc. (&quot;Google&quot;), and Intel Corp. (&quot;Intel&quot;) (hereafter, &quot;Remaining Defendants&quot;) brought by three class representatives, Mark Fichtner, Siddharth Hariharan, and Daniel Stover (hereafter, &quot;Plaintiffs&quot;). See ECF No. 920. The Settlement provides for $324.5 million in recovery for the class in exchange for release of antitrust claims. 
A fourth class representative, Michael Devine (&quot;Devine&quot;), has filed an Opposition contending that the settlement amount is inadequate. See ECF No. 934. Plaintiffs have filed a Reply. See ECF No. 938. Plaintiffs, Remaining Defendants, and Devine appeared at a hearing on June 19, 2014. See ECF No. 940. In addition, a number of Class members have submitted letters in support of and in opposition to the proposed settlement. ECF Nos. 914, 949-51. The Court, having considered the briefing, the letters, the arguments presented at the hearing, and the record in this case, DENIES the Motion for Preliminary Approval for the reasons stated below.</p> <h3 id="i-background-and-procedural-history">I. BACKGROUND AND PROCEDURAL HISTORY</h3> <p>Michael Devine, Mark Fichtner, Siddharth Hariharan, and Daniel Stover, individually and on behalf of a class of all those similarly situated, allege antitrust claims against their former employers, Adobe, Apple, Google, Intel, Intuit Inc. (&quot;Intuit&quot;), Lucasfilm Ltd. (&quot;Lucasfilm&quot;), and Pixar (collectively, &quot;Defendants&quot;). Plaintiffs allege that Defendants entered into an overarching conspiracy through a series of bilateral agreements not to solicit each other's employees in violation of Section 1 of the Sherman Antitrust Act, 15 U.S.C. § 1, and Section 4 of the Clayton Antitrust Act, 15 U.S.C. § 15. Plaintiffs contend that the overarching conspiracy, made up of a series of six bilateral agreements (Pixar-Lucasfilm, Apple-Adobe, Apple-Google, Apple-Pixar, Google-Intuit, and Google-Intel) suppressed wages of Defendants' employees.</p> <p>The five cases underlying this consolidated action were initially filed in California Superior Court and removed to federal court. See ECF No. 532 at 5. The cases were related by Judge Saundra Brown Armstrong, who also granted a motion to transfer the related actions to the San Jose Division. See ECF Nos. 52, 58. After being assigned to the undersigned judge, the cases were consolidated pursuant to the parties' stipulation. See ECF No. 64. Plaintiffs filed a consolidated complaint on September 23, 2011, see ECF No. 65, which Defendants jointly moved to dismiss, see ECF No. 79. In addition, Lucasfilm filed a separate motion to dismiss on October 17, 2011. See ECF No. 83. The Court granted in part and denied in part the joint motion to dismiss and denied Lucasfilm's separate motion to dismiss. See ECF No. 119.</p> <p>On October 1, 2012, Plaintiffs filed a motion for class certification. See ECF No. 187. The motion sought certification of a class of all of the seven Defendants' employees or, in the alternative, a narrower class of just technical employees of the seven Defendants. After full briefing and a hearing, the Court denied class certification on April 5, 2013. See ECF No. 382. The Court was concerned that Plaintiffs' documentary evidence and empirical analysis were insufficient to determine that common questions predominated over individual questions with respect to the issue of antitrust impact. See id. at 33. Moreover, the Court expressed concern that there was insufficient analysis in the class certification motion regarding the class of technical employees. Id. at 29. The Court afforded Plaintiffs leave to amend to address the Court's concerns. See id. at 52.</p> <p>On May 10, 2013, Plaintiffs filed their amended class certification motion, seeking to certify only the narrower class of technical employees. See ECF No. 418. Defendants filed their opposition on June 21, 2013, ECF No. 
439, and Plaintiffs filed their reply on July 12, 2013, ECF No. 455. The hearing on the amended motion was set for August 5, 2013.</p> <p>On July 12 and 30, 2013, after class certification had been initially denied and while an amended motion was pending, Plaintiffs settled with Pixar, Lucasfilm, and Intuit (hereafter, &quot;Settled Defendants&quot;). See ECF Nos. 453, 489. Plaintiffs filed a motion for preliminary approval of the settlements with Settled Defendants on September 21, 2013. See ECF No. 501. No opposition to the motion was filed, and the Court granted the motion on October 30, 2013, following a hearing on October 21, 2013. See ECF No. 540. The Court held a fairness hearing on May 1, 2014, ECF No. 913, and granted final approval of the settlements and accompanying requests for attorneys' fees, costs, and incentive awards over five objections on May 16, 2014, ECF Nos. 915-16. Judgment was entered as to the Settled Defendants on June 20, 2014. ECF No. 947.</p> <p>After the Settled Defendants settled, this Court certified a class of technical employees of the seven Defendants (hereafter, &quot;the Class&quot;) on October 25, 2013 in an 86-page order granting Plaintiffs' amended class certification motion. See ECF No. 532. The Remaining Defendants petitioned the Ninth Circuit to review that order under Federal Rule of Civil Procedure 23(f). After full briefing, including the filing of an amicus brief by the National and California Chambers of Commerce and the National Association of Manufacturing urging the Ninth Circuit to grant review, the Ninth Circuit denied review on January 15, 2014. See ECF No. 594.</p> <p>Meanwhile, in this Court, the Remaining Defendants filed a total of five motions for summary judgment and filed motions to strike and to exclude the testimony of Plaintiffs' principal expert on antitrust impact and damages, Dr. Edward Leamer, who opined that the total damages to the Class exceeded $3 billion in wages Class members would have earned in the absence of the anti-solicitation agreements.<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup> The Court denied the motions for summary judgment on March 28, 2014, and on April 4, 2014, denied the motion to exclude Dr. Leamer and denied in large part the motion to strike Dr. Leamer's testimony. ECF Nos. 777, 788.</p> <p>On April 24, 2014, counsel for Plaintiffs and counsel for Remaining Defendants sent a joint letter to the Court indicating that they had reached a settlement. See ECF No. 900. This settlement was reached two weeks before the Final Pretrial Conference and one month before the trial was set to commence.<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup> Upon receipt of the joint letter, the Court vacated the trial date and pretrial deadlines and set a schedule for preliminary approval. See ECF No. 904. Shortly after counsel sent the letter, the media disclosed the total amount of the settlement, and this Court received three letters from individuals, not including Devine, objecting to the proposed settlement in response to media reports of the settlement amount.<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">3</a></sup> See ECF No. 914. On May 22, 2014, in accordance with this Court's schedule, Plaintiffs filed their Motion for Preliminary Approval. See ECF No. 920. Devine filed an Opposition on June 5, 2014.<sup class="footnote-ref" id="fnref:4"><a rel="footnote" href="#fn:4">4</a></sup> See ECF No. 934. 
Plaintiffs filed a Reply on June 12, 2014. See ECF No. 938. The Court held a hearing on June 19, 2014. See ECF No. 948. After the hearing, the Court received a letter from a Class member in opposition to the proposed settlement and two letters from Class members in support of the proposed settlement. See ECF Nos. 949-51.</p> <h3 id="ii-legal-standard">II. LEGAL STANDARD</h3> <p>The Court must review the fairness of class action settlements under Federal Rule of Civil Procedure 23(e). The Rule states that &quot;[t]he claims, issues, or defenses of a certified class may be settled, voluntarily dismissed, or compromised only with the court's approval.&quot; The Rule requires the Court to &quot;direct notice in a reasonable manner to all class members who would be bound by the proposal&quot; and further states that if a settlement &quot;would bind class members, the court may approve it only after a hearing and on finding that it is fair, reasonable, and adequate.&quot; Fed. R. Civ. P. 23(e)(1)-(2). The principal purpose of the Court's supervision of class action settlements is to ensure &quot;the agreement is not the product of fraud or overreaching by, or collusion between, the negotiating parties.&quot; Officers for Justice v. Civil Serv. Comm'n of City &amp; Cnty. of S.F., 688 F.2d 615, 625 (9th Cir. 1982).</p> <p>District courts have interpreted Rule 23(e) to require a two-step process for the approval of class action settlements: &quot;the Court first determines whether a proposed class action settlement deserves preliminary approval and then, after notice is given to class members, whether final approval is warranted.&quot; Nat'l Rural Telecomms. Coop. v. DIRECTV, Inc., 221 F.R.D. 523, 525 (C.D. Cal. 2004). At the final approval stage, the Ninth Circuit has stated that &quot;[a]ssessing a settlement proposal requires the district court to balance a number of factors: the strength of the plaintiffs' case; the risk, expense, complexity, and likely duration of further litigation; the risk of maintaining class action status throughout the trial; the amount offered in settlement; the extent of discovery completed and the stage of the proceedings; the experience and views of counsel; the presence of a governmental participant; and the reaction of the class members to the proposed settlement.&quot; Hanlon v. Chrysler Corp., 150 F.3d 1011, 1026 (9th Cir. 1998).</p> <p>In contrast to these well-established, non-exhaustive factors for final approval, there is relatively scant appellate authority regarding the standard that a district court must apply in reviewing a settlement at the preliminary approval stage. Some district courts, echoing commentators, have stated that the relevant inquiry is whether the settlement &quot;falls within the range of possible approval&quot; or &quot;within the range of reasonableness.&quot; In re Tableware Antitrust Litig., 484 F. Supp. 2d 1078, 1079 (N.D. Cal. 2007); see also Cordy v. USS-Posco Indus., No. 12-553, 2013 WL 4028627, at *3 (N.D. Cal. Aug. 1, 2013) (&quot;Preliminary approval of a settlement and notice to the proposed class is appropriate if the proposed settlement appears to be the product of serious, informed, non-collusive negotiations, has no obvious deficiencies, does not improperly grant preferential treatment to class representatives or segments of the class, and falls with the range of possible approval.&quot; (internal quotation marks omitted)). 
To undertake this analysis, the Court &quot;must consider plaintiffs' expected recovery balanced against the value of the settlement offer.&quot; In re Nat'l Football League Players' Concussion Injury Litig., 961 F. Supp. 2d 708, 714 (E.D. Pa. 2014) (internal quotation marks omitted).</p> <h3 id="iii-discussion">III. DISCUSSION</h3> <p>Pursuant to the terms of the instant settlement, Class members who have not already opted out and who do not opt out will relinquish their rights to file suit against the Remaining Defendants for the claims at issue in this case. In exchange, Remaining Defendants will pay a total of $324.5 million, of which Plaintiffs' counsel may seek up to 25% (approximately $81 million) in attorneys' fees, $1.2 million in costs, and $80,000 per class representative in incentive payments. In addition, the settlement allows Remaining Defendants a pro rata reduction in the total amount they must pay if more than 4% of Class members opt out after receiving notice.<sup class="footnote-ref" id="fnref:5"><a rel="footnote" href="#fn:5">5</a></sup> Class members would receive an average of approximately $3,750<sup class="footnote-ref" id="fnref:6"><a rel="footnote" href="#fn:6">6</a></sup> from the instant settlement if the Court were to grant all requested deductions and there were no further opt-outs.<sup class="footnote-ref" id="fnref:7"><a rel="footnote" href="#fn:7">7</a></sup></p> <p>The Court finds the total settlement amount falls below the range of reasonableness. The Court is concerned that Class members recover less on a proportional basis from the instant settlement with Remaining Defendants than from the settlement with the Settled Defendants a year ago, despite the fact that the case has progressed consistently in the Class's favor since then. Counsel's sole explanation for this reduced figure is that there are weaknesses in Plaintiffs' case such that the Class faces a substantial risk of non-recovery. However, that risk existed and was even greater when Plaintiffs settled with the Settled Defendants a year ago, when class certification had been denied.</p> <p>The Court begins by comparing the instant settlement with Remaining Defendants to the settlements with the Settled Defendants, in light of the facts that existed at the time each settlement was reached. The Court then discusses the relative strengths and weaknesses of Plaintiffs' case to assess the reasonableness of the instant settlement.</p> <h2 id="a-comparison-to-the-initial-settlements">A. Comparison to the Initial Settlements</h2> <h3 id="1-comparing-the-settlement-amounts">1. Comparing the Settlement Amounts</h3> <p>The Court finds that the settlements with the Settled Defendants provide a useful benchmark against which to analyze the reasonableness of the instant settlement. The settlements with the Settled Defendants led to a fund totaling $20 million. See ECF No. 915 at 3. In approving the settlements, the Court relied upon the fact that the Settled Defendants employed 8% of Class members and paid out 5% of the total Class compensation during the Class period. See ECF No. 539 at 16:20-22 (Plaintiffs' counsel's explanation at the preliminary approval hearing with the Settled Defendants that the 5% figure &quot;giv[es] you a sense of how big a slice of the case this settlement is relative to the rest of the case&quot;). If Remaining Defendants were to settle at the same (or higher) rate as the Settled Defendants, Remaining Defendants' settlement fund would need to total at least $380 million. 
This number results from the fact that Remaining Defendants paid out 95% of the Class compensation during the Class period, while Settled Defendants paid only 5% of the Class compensation during the Class period.<sup class="footnote-ref" id="fnref:8"><a rel="footnote" href="#fn:8">8</a></sup></p> <p>At the hearing on the instant Motion, counsel for Remaining Defendants suggested that the relevant benchmark is not total Class compensation, but rather is total Class membership. This would result in a benchmark figure for the Remaining Defendants of $230 million (92 divided by 8 is 11.5; 11.5 times $20 million is $230 million). At a minimum, counsel suggested, the Court should compare the settlement amount to a range of $230 million to $380 million, within which the instant settlement falls. The Court rejects counsel's suggestion, which is contrary to the record. Counsel has provided no basis for why the number of Class members employed by each Defendant is a relevant metric. To the contrary, the relevant inquiry has always been total Class compensation. For example, in both of the settlements with the Settled Defendants and in the instant settlement, the Plans of Allocation call for determining each individual Class member's pay out by dividing the Class member's compensation during the Class period by the total Class compensation during the Class period. ECF No. 809 at 6 (noting that the denominator in the plan of allocation in the settlements with the Settled Defendants is the &quot;total of base salaries paid to all approved Claimants in class positions during the Class period&quot;); ECF No. 920 at 22 (same in the instant settlement); see also ECF No. 539 at 16:20-22 (Plaintiffs' counsel's statement that percent of the total Class compensation was relevant for benchmarking the settlements with the Settled Defendants to the rest of the case). At no point in the record has the percentage of Class membership employed by each Defendant ever been the relevant factor for determining damages exposure. Accordingly, the Court rejects the metric proposed by counsel for Remaining Defendants. Using the Settled Defendants' settlements as a yardstick, the appropriate benchmark settlement for the Remaining Defendants would be at least $380 million, more than $50 million greater than what the instant settlement provides.</p> <p>Counsel for Remaining Defendants also suggested that benchmarking against the initial settlements would be inappropriate because the magnitude of the settlement numbers for Remaining Defendants dwarfs the numbers at issue in the Settled Defendants' settlements. This argument is premised on the idea that Defendants who caused more damage to the Class and who benefited more by suppressing a greater portion of class compensation should have to pay less than Defendants who caused less damage and who benefited less from the allegedly wrongful conduct. This argument is unpersuasive. Remaining Defendants are alleged to have received 95% of the benefit of the anti-solicitation agreements and to have caused 95% of the harm suffered by the Class in terms of lost compensation. 
Therefore, Remaining Defendants should have to pay at least 95% of the damages, which, under the instant settlement, they would not.</p> <p>The Court also notes that had Plaintiffs prevailed at trial on their more than $3 billion damages claim, antitrust law provides for automatic trebling, see 15 U.S.C. § 15(a), so the total damages award could potentially have exceeded $9 billion. While the Ninth Circuit has not determined whether settlement amounts in antitrust cases must be compared to the single damages award requested by Plaintiffs or the automatically trebled damages amount, see Rodriguez v. W. Publ'g Corp., 563 F.3d 948, 964-65 (9th Cir. 2009), the instant settlement would lead to a total recovery of 11.29% of the single damages proposed by Plaintiffs' expert or 3.76% of the treble damages. Specifically, Dr. Leamer has calculated the total damages to the Class resulting from Defendants' allegedly unlawful conduct as $3.05 billion. See ECF No. 856-10. If the Court approves the instant settlements, the total settlements with all Defendants would be $344.5 million. This total would amount to 11.29% of the single damages that Dr. Leamer opines the Class suffered or 3.76% if Dr. Leamer's damages figure had been trebled.</p> <h3 id="2-relative-procedural-posture">2. Relative Procedural Posture</h3> <p>The discount that Remaining Defendants have received vis-a-vis the Settled Defendants is particularly troubling in light of the changes in the procedural posture of the case between the two settlements, changes that the Court would expect to have increased, rather than decreased, Plaintiffs' bargaining power. Specifically, at the time the Settled Defendants settled, Plaintiffs were at a particularly weak point in their case. Though Plaintiffs had survived Defendants' motion to dismiss, Plaintiffs' motion for class certification had been denied, albeit without prejudice. Plaintiffs had re-briefed the class certification motion, but had no class certification ruling in their favor at the time they settled with the Settled Defendants. If the Court ultimately granted certification, Plaintiffs also did not know whether the Ninth Circuit would grant Federal Rule of Civil Procedure 23(f) review and reverse the certification. Accordingly, at that point, Defendants had significant leverage.</p> <p>In contrast, the procedural posture of the case swung dramatically in Plaintiffs' favor after the initial settlements were reached. Specifically, the Court certified the Class over the vigorous objections of Defendants. In the 86-page order granting class certification, the Court repeatedly referred to Plaintiffs' evidence as &quot;substantial&quot; and &quot;extensive,&quot; and the Court stated that it &quot;could not identify a case at the class certification stage with the level of documentary evidence Plaintiffs have presented in the instant case.&quot; ECF No. 531 at 69. Thereafter, the Ninth Circuit denied Defendants' request to review the class certification order under Federal Rule of Civil Procedure 23(f). This Court also denied Defendants' five motions for summary judgment and denied Defendants' motion to exclude Plaintiffs' principal expert on antitrust impact and damages. 
The instant settlement was reached a mere two weeks before the final pretrial conference and one month before a trial at which damaging evidence regarding Defendants would have been presented.</p> <p>In sum, Plaintiffs were in a much stronger position at the time of the instant settlement—after the Class had been certified, appellate review of class certification had been denied, and Defendants' dispositive motions and motion to exclude Dr. Leamer's testimony had been denied—than they were at the time of the settlements with the Settled Defendants, when class certification had been denied. This shift in the procedural posture, which the Court would expect to have increased Plaintiffs' bargaining power, makes the more recent settlements for a proportionally lower amount even more troubling.</p> <h2 id="b-strength-of-plaintiffs-case">B. Strength of Plaintiffs' Case</h2> <p>The Court now turns to the strength of Plaintiffs' case against the Remaining Defendants to evaluate the reasonableness of the settlement.</p> <p>At the hearing on the instant Motion, Plaintiffs' counsel contended that one of the reasons the instant settlement was proportionally lower than the previous settlements is that the documentary evidence against the Settled Defendants (particularly, Lucasfilm and Pixar) is more compelling than the documentary evidence against the Remaining Defendants. As an initial matter, the Court notes that relevant evidence regarding the Settled Defendants would be admissible at a trial against Remaining Defendants because Plaintiffs allege an overarching conspiracy that included all Defendants. Accordingly, evidence regarding the role of Lucasfilm and Pixar in the creation of and the intended effect of the overarching conspiracy would be admissible.</p> <p>Nonetheless, the Court notes that Plaintiffs are correct that there are particularly clear statements from Lucasfilm and Pixar executives regarding the nature and goals of the alleged conspiracy. Specifically, Edward Catmull (Pixar President) conceded in his deposition that anti-solicitation agreements were in place because solicitation &quot;messes up the pay structure.&quot; ECF No. 431-9 at 81. Similarly, George Lucas (former Lucasfilm Chairman of the Board and CEO) stated, &quot;we cannot get into a bidding war with other companies because we don't have the margins for that sort of thing.&quot; ECF No. 749-23 at 9.</p> <p>However, there is equally compelling evidence that comes from the documents of the Remaining Defendants. This is particularly true for Google and Apple, the executives of which extensively discussed and enforced the anti-solicitation agreements. Specifically, as discussed in extensive detail in this Court's previous orders, Steve Jobs (Co-Founder, Former Chairman, and Former CEO of Apple, Former CEO of Pixar), Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO), and Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google) were key players in creating and enforcing the anti-solicitation agreements. The Court now turns to the evidence against the Remaining Defendants that the finder of fact is likely to find compelling.</p> <h3 id="1-evidence-related-to-apple">1. Evidence Related to Apple</h3> <p>There is substantial and compelling evidence that Steve Jobs (Co-Founder, Former Chairman, and Former CEO of Apple, Former CEO of Pixar) was a, if not the, central figure in the alleged conspiracy. 
Several witnesses, in their depositions, testified to Mr. Jobs' role in the anti-solicitation agreements. For example, Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO) stated that Mr. Jobs &quot;believed that you should not be hiring each others', you know, technical people&quot; and that &quot;it was inappropriate in [Mr. Jobs'] view for us to be calling in and hiring people.&quot; ECF No. 819-12 at 77. Edward Catmull (Pixar President) stated that Mr. Jobs &quot;was very adamant about protecting his employee force.&quot; ECF No. 431-9 at 97. Sergey Brin (Google Co-Founder) testified that &quot;I think Mr. Jobs' view was that people shouldn't piss him off. And I think that things that pissed him off were—would be hiring, you know—whatever.&quot; ECF No. 639-1 at 112. There would thus be ample evidence Mr. Jobs was involved in expanding the original anti-solicitation agreement between Lucasfilm and Pixar to the other Defendants in this case. After the agreements were extended, Mr. Jobs played a central role in enforcing these agreements. Four particular sets of evidence are likely to be compelling to the fact-finder.</p> <p>First, after hearing that Google was trying to recruit employees from Apple's Safari team, Mr. Jobs threatened Mr. Brin, stating, as Mr. Brin recounted, &quot;if you hire a single one of these people that means war.&quot; ECF No. 833-15.<sup class="footnote-ref" id="fnref:10"><a rel="footnote" href="#fn:10">9</a></sup> In an email to Google's Executive Management Team as well as Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google), Mr. Brin advised: &quot;lets [sic] not make any new offers or contact new people at Apple until we have had a chance to discuss.&quot; Id. Mr. Campbell then wrote to Mr. Jobs: &quot;Eric [Schmidt] told me that he got directly involved and firmly stopped all efforts to recruit anyone from Apple.&quot; ECF No. 746-5. As Mr. Brin testified in his deposition, &quot;Eric made a—you know, a—you know, at least some kind of—had a conversation with Bill to relate to Steve to calm him down.&quot; ECF No. 639-1 at 61. As Mr. Schmidt put it, &quot;Steve was unhappy, and Steve's unhappiness absolutely influenced the change we made in recruiting practice.&quot; ECF No. 819-12 at 21. Danielle Lambert (Apple's head of Human Resources) reciprocated to maintain Apple's end of the anti-solicitation agreements, instructing Apple recruiters: &quot;Please add Google to your 'hands-off list. We recently agreed not to recruit from one another so if you hear of any recruiting they are doing against us, please be sure to let me know.&quot; ECF No. 746-15.</p> <p>Second, other Defendants' CEOs maintained the anti-solicitation agreements out of fear of and deference to Mr. Jobs. For example, in 2005, when considering whether to enter into an anti-solicitation agreement with Apple, Bruce Chizen (former Adobe CEO), expressed concerns about the loss of &quot;top talent&quot; if Adobe did not enter into an anti-solicitation agreement with Apple, stating, &quot;if I tell Steve it's open season (other than senior managers), he will deliberately poach Adobe just to prove a point. Knowing Steve, he will go after some of our top Mac talent like Chris Cox and he will do it in a way in which they will be enticed to come (extraordinary packages and Steve wooing).&quot;<sup class="footnote-ref" id="fnref:11"><a rel="footnote" href="#fn:11">10</a></sup> ECF No. 
297-15.</p> <p>This was the genesis of the Apple-Adobe agreement. Specifically, after Mr. Jobs complained to Mr. Chizen on May 26, 2005 that Adobe was recruiting Apple employees, ECF No. 291-17, Mr. Chizen responded by saying, &quot;I thought we agreed not to recruit any senior level employees . . . . I would propose we keep it that way. Open to discuss. It would be good to agree.&quot; Id. Mr. Jobs was not satisfied, and replied by threatening to send Apple recruiters after Adobe's employees: &quot;OK, I'll tell our recruiters that they are free to approach any Adobe employee who is not a Sr. Director or VP. Am I understanding your position correctly?&quot; Id. Mr. Chizen immediately gave in: &quot;I'd rather agree NOT to actively solicit any employee from either company . . . . If you are in agreement I will let my folks know.&quot; Id. (emphasis in original). The next day, Theresa Townsley (Adobe Vice President Human Resources) announced to her recruiting team, &quot;Bruce and Steve Jobs have an agreement that we are not to solicit ANY Apple employees, and vice versa.&quot; ECF No. 291-18 (emphasis in original). Adobe then placed Apple on its &quot;[c]ompanies that are off limits&quot; list, which instructed Adobe employees not to cold call Apple employees. ECF No. 291-11.</p> <p>Google took even more drastic actions in response to Mr. Jobs. For example, when a recruiter from Google's engineering team contacted an Apple employee in 2007, Mr. Jobs forwarded the message to Mr. Schmidt and stated, &quot;I would be very pleased if your recruiting department would stop doing this.&quot; ECF No. 291-23. Google responded by making a &quot;public example&quot; out of the recruiter and &quot;terminat[ing] [the recruiter] within the hour.&quot; Id. The aim of this public spectacle was to &quot;(hopefully) prevent future occurrences.&quot; Id. Once the recruiter was terminated, Mr. Schmidt emailed Mr. Jobs, apologizing and informing Mr. Jobs that the recruiter had been terminated. Mr. Jobs forwarded Mr. Schmidt's email to an Apple human resources official and stated merely, &quot;:).&quot; ECF No. 746-9.</p> <p>A year prior to this termination, Google similarly took seriously Mr. Jobs' concerns. Specifically, in 2006, Mr. Jobs emailed Mr. Schmidt and said, &quot;I am told that Googles [sic] new cell phone software group is relentlessly recruiting in our iPod group. If this is indeed true, can you put a stop to it?&quot; ECF No. 291-24 at 3. After Mr. Schmidt forwarded this to Human Resources professionals at Google, Arnnon Geshuri (Google Recruiting Director) prepared a detailed report stating that an extensive investigation did not find a breach of the anti-solicitation agreement.</p> <p>Similarly, in 2006, Google scrapped plans to open a Google engineering center in Paris after a Google executive emailed Mr. Jobs to ask whether Google could hire three former Apple engineers to work at the prospective facility, and Mr. Jobs responded &quot;[w]e'd strongly prefer that you not hire these guys.&quot; ECF No. 814-2. The whole interaction began with Google's request to Steve Jobs for permission to hire Jean-Marie Hullot, an Apple engineer. The record is not clear whether Mr. Hullot was a current or former Apple employee. A Google executive contacted Steve Jobs to ask whether Google could make an offer to Mr. Hullot, and Mr. Jobs did not timely respond to the Google executive's request. 
At this point, the Google executive turned to Intuit's Board Chairman Bill Campbell as a potential ambassador from Google to Mr. Jobs. Specifically, the Google executive noted that Mr. Campbell &quot;is on the board at Apple and Google, so Steve will probably return his call.&quot; ECF No. 428-6. The same day that Mr. Campbell reached out to Mr. Jobs, Mr. Jobs responded to the Google executive, seeking more information on what exactly the Apple engineer would be working on. ECF No. 428-9. Once Mr. Jobs was satisfied, he stated that the hire &quot;would be fine with me.&quot; Id. However, two weeks later, when Mr. Hullot and a Google executive sought Mr. Jobs' permission to hire four of Mr. Hullot's former Apple colleagues (three were former Apple employees and one had given notice of impending departure from Apple), Mr. Jobs promptly responded, indicating that the hires would not be acceptable. ECF No. 428-9. Google promptly scrapped the plan, and the Google executive responded deferentially to Mr. Jobs, stating, &quot;Steve, Based on your strong preference that we not hire the ex-Apple engineers, Jean-Marie and I decided not to open a Google Paris engineering center.&quot; Id. The Google executive also forwarded the email thread to Mr. Brin, Larry Page (Google Co-Founder), and Mr. Campbell. Id.</p> <p>Third, Mr. Jobs attempted (unsuccessfully) to expand the anti-solicitation agreements to Palm, even threatening litigation. Specifically, Mr. Jobs called Edward Colligan (former President and CEO of Palm) to ask Mr. Colligan to enter into an anti-solicitation agreement and threatened patent litigation against Palm if Palm refused to do so. ECF No. 293 ¶¶ 6-8. Mr. Colligan responded via email, and told Mr. Jobs that Mr. Jobs' &quot;proposal that we agree that neither company will hire the other's employees, regardless of the individual's desires, is not only wrong, it is likely illegal.&quot; Id. at 4-5. Mr. Colligan went on to say that, &quot;We can't dictate where someone will work, nor should we try. I can't deny people who elect to pursue their livelihood at Palm the right to do so simply because they now work for Apple, and I wouldn't want you to do that to current Palm employees.&quot; Id. at 5. Finally, Mr. Colligan wrote that &quot;[t]hreatening Palm with a patent lawsuit in response to a decision by one employee to leave Apple is just out of line. A lawsuit would not serve either of our interests, and will not stop employees from migrating between our companies . . . . We will both just end up paying a lot of lawyers a lot of money.&quot; Id. at 5-6. Mr. Jobs wrote the following back to Mr. Colligan: &quot;This is not satisfactory to Apple.&quot; Id. at 8. Mr. Jobs went on to write that &quot;I'm sure you realize the asymmetry in the financial resources of our respective companies when you say: 'we will both just end up paying a lot of lawyers a lot of money.'&quot; Id. Mr. Jobs concluded: &quot;My advice is to take a look at our patent portfolio before you make a final decision here.&quot; Id.</p> <p>Fourth, Apple's documents provide strong support for Plaintiffs' theory of impact, namely that rigid wage structures and internal equity concerns would have led Defendants to engage in structural changes to compensation structures to mitigate the competitive threat that solicitation would have posed. 
Apple's compensation data shows that, for each year in the Class period, Apple had a &quot;job structure system,&quot; which included categorizing and compensating its workforce according to a discrete set of company-wide job levels assigned to all salaried employees and four associated sets of base salary ranges applicable to &quot;Top,&quot; &quot;Major,&quot; &quot;National,&quot; and &quot;Small&quot; geographic markets. ECF No. 745-7 at 14-15, 52-53; ECF No.517-16 ¶¶ 6, 10 &amp; Ex. B. Every salary range had a &quot;min,&quot; &quot;mid,&quot; and &quot;max&quot; figure. See id. Apple also created a Human Resources and recruiting tool called &quot;Merlin,&quot; which was an internal system for tracking employee records and performance, and required managers to grade employees at one of four pre-set levels. See ECF No. 749-6 at 142-43, 145-46; ECF No. 749-11 at 52-53; ECF No. 749-12 at 33. As explained by Tony Fadell (former Apple Senior Vice President, iPod Division, and advisor to Steve Jobs), Merlin &quot;would say, this is the employee, this is the level, here are the salary ranges, and through that tool we were then—we understood what the boundaries were.&quot; ECF No. 749-11 at 53. Going outside these prescribed &quot;guidelines&quot; also required extra approval. ECF No. 749-7 at 217; ECF No. 749-11 at 53 (&quot;And if we were to go outside of that, then we would have to pull in a bunch of people to then approve anything outside of that range.&quot;).</p> <p>Concerns about internal equity also permeated Apple's compensation program. Steven Burmeister (Apple Senior Director of Compensation) testified that internal equity—which Mr. Burmeister defined as the notion of whether an employee's compensation is &quot;fair based on the individual's contribution relative to the other employees in your group, or across your organization&quot;—inheres in some, &quot;if not all,&quot; of the guidelines that managers consider in determining starting salaries. ECF No. 745-7 at 61-64; ECF No. 753-12. In fact, as explained by Patrick Burke (former Apple Technical Recruiter and Staffing Manager), when hiring a new employee at Apple, &quot;compar[ing] the candidate&quot; to the other people on the team they would join &quot;was the biggest determining factor on what salary we gave.&quot; ECF No. 745-6 at 279.</p> <h3 id="2-evidence-related-to-google">2. Evidence Related to Google</h3> <p>The evidence against Google is equally compelling. Email evidence reveals that Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO) terminated at least two recruiters for violations of anti-solicitation agreements, and threatened to terminate more. As discussed above, there is direct evidence that Mr. Schmidt terminated a recruiter at Steve Jobs' behest after the recruiter attempted to solicit an Apple employee. Moreover, in an email to Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google), Mr. Schmidt indicated that he directed a for-cause termination of another Google recruiter, who had attempted to recruit an executive of eBay, which was on Google's do-not-cold-call list. ECF No. 814-14. Finally, as discussed in more detail below, Mr. Schmidt informed Paul Otellini (CEO of Intel and Member of the Google Board of Directors) that Mr. 
Schmidt would terminate any recruiter who recruited Intel employees.</p> <p>Furthermore, Google maintained a formal &quot;Do Not Call&quot; list, which grouped together Apple, Intel, and Intuit and was approved by top executives. ECF No. 291-28. The list also included other companies, such as Genentech, Paypal, and eBay. Id. A draft of the &quot;Do Not Call&quot; list was presented to Google's Executive Management Group, a committee consisting of Google's senior executives, including Mr. Schmidt, Larry Page (Google Co-Founder), Sergey Brin (Google Co-Founder), and Shona Brown (former Google Senior Vice President of Business Operations). ECF No. 291-26. Mr. Schmidt approved the list. See id.; see also ECF No. 291-27 (email from Mr. Schmidt stating: &quot;This looks very good.&quot;). Moreover, there is evidence that Google executives knew that the anti-solicitation agreements could lead to legal troubles, but nevertheless proceeded with the agreements. When Ms. Brown asked Mr. Schmidt whether he had any concerns with sharing information regarding the &quot;Do Not Call&quot; list with Google's competitors, Mr. Schmidt responded that he preferred that it be shared &quot;verbally[,] since I don't want to create a paper trail over which we can be sued later?&quot; ECF No. 291-40. Ms. Brown responded: &quot;makes sense to do orally. i agree.&quot; Id.</p> <p>Google's response to competition from Facebook also demonstrates the impact of the alleged conspiracy. Google had long been concerned about Facebook hiring's effect on retention. For example, in an email to top Google executives, Mr. Brin in 2007 stated that &quot;the facebook phenomenon creates a real retention problem.&quot; ECF No. 814-4. A month later, Mr. Brin announced a policy of making counteroffers within one hour to any Google employee who received an offer from Facebook. ECF No. 963-2.</p> <p>In March 2008, Arnnon Geshuri (Google Recruiting Director) discovered that non-party Facebook had been cold calling into Google's Site Reliability Engineering (&quot;SRE&quot;) team. Mr. Geshuri's first response was to suggest contacting Sheryl Sandberg (Chief Operating Officer for non-party Facebook) in an effort to &quot;ask her to put a stop to the targeted sourcing effort directed at our SRE team&quot; and &quot;to consider establishing a mutual 'Do Not Call' agreement that specifies that we will not cold-call into each other.&quot; ECF No. 963-3. Mr. Geshuri also suggested &quot;look[ing] internally and review[ing] the attrition rate for the SRE group,&quot; stating, &quot;[w]e may want to consider additional individual retention incentives or team incentives to keep attrition as low as possible in SRE.&quot; Id. (emphasis added). Finally, an alternative suggestion was to &quot;[s]tart an aggressive campaign to call into their company and go after their folks—no holds barred. We would be unrelenting and a force of nature.&quot; Id. In response, Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google), in his capacity as an advisor to Google, suggested &quot;Who should contact Sheryl [Sandberg] (or Mark [Zuckerberg]) to get a cease fire? We have to get a truce.&quot; Id. Facebook refused.</p> <p>In 2010, Google altered its salary structure with a &quot;Big Bang&quot; in response to Facebook's hiring, which provides additional support for Plaintiffs' theory of antitrust impact. 
Specifically, after a period in which Google lost a significant number of employees to Facebook, Google began to study Facebook's solicitation of Google employees. ECF No. 190 ¶ 109. One month after beginning this study, Google announced its &quot;Big Bang,&quot; which involved an increase to the base salary of all of its salaried employees by 10% and provided an immediate cash bonus of $1,000 to all employees. ECF No. 296-18. Laszlo Bock (Google Senior Vice President of People Operations) explained that the rationale for the Big Bang included: (1) being &quot;responsive to rising attrition;&quot; (2) supporting higher retention because &quot;higher salaries generate higher fixed costs;&quot; and (3) being &quot;very strategic because start-ups don't have the cash flow to match, and big companies are (a) too worried about internal equity and scalability to do this and (b) don't have the margins to do this.&quot; ECF No. 296-20.</p> <p>Other Google documents provide further evidence of Plaintiffs' theory of antitrust impact. For example, Google's Chief Culture Officer stated that &quot;[c]old calling into companies to recruit is to be expected unless they're on our 'don't call' list.&quot; ECF No. 291-41. Moreover, Google found that although referrals were the largest source of hires, &quot;agencies and passively sourced candidates offer[ed] the highest yield.&quot; ECF No. 780-8. The spread of information between employees had there been active solicitations—which is central to Plaintiffs' theory of impact—is also demonstrated in Google's evidence. For example, one Google employee states that &quot;[i]t's impossible to keep something like this a secret. The people getting counter offers talk, not just to Googlers and ex-Googlers, but also to the competitors where they received their offers (in the hopes of improving them), and those competitors talk too, using it as a tool to recruit more Googlers.&quot; ECF No. 296-23.</p> <p>The wage structure and internal equity concerns at Google also support Plaintiffs' theory of impact. Google had many job families, many grades within job families, and many job titles within grades. See, e.g., ECF No. 298-7, ECF No. 298-8; see also Cisneros Decl., Ex. S (Brown Depo.) at 74-76 (discussing salary ranges utilized by Google); ECF No. 780-4 at 25-26 (testifying that Google's 2007 salary ranges had generally the same structure as the 2004 salary ranges). Throughout the Class period, Google utilized salary ranges and pay bands with minima and maxima and either means or medians. ECF No. 958-1 ¶ 66; see ECF No. 427-3 at 15-17. As explained by Shona Brown (former Google Senior Vice President, Business Operations), &quot;if you discussed a specific role [at Google], you could understand that role was at a specific level on a certain job ladder.&quot; ECF No. 427-3 at 27-28; ECF No. 745-11. Frank Wagner (Google Director of Compensation) testified that he could locate the target salary range for jobs at Google through an internal company website. See ECF No. 780-4 at 31-32 (&quot;Q: And if you wanted to identify what the target salary would be for a certain job within a certain grade, could you go online or go to some place . . . and pull up what that was for that job family and that grade? . . . A: Yes.&quot;). Moreover, Google considered internal equity to be an important goal. Google utilized a salary algorithm in part for the purpose of &quot;[e]nsur[ing] internal equity by managing salaries within a reasonable range.&quot; ECF No. 814-19. 
Furthermore, because Google &quot;strive[d] to achieve fairness in overall salary distribution,&quot; &quot;high performers with low salaries [would] get larger percentage increases than high performers with high salaries.&quot; ECF No. 817-1 at 15.</p> <p>In addition, Google analyzed and compared its equity compensation to Apple, Intel, Adobe, and Intuit, among other companies, each of which it designated as a &quot;peer company&quot; based on meeting criteria such as being a &quot;high-tech company,&quot; a &quot;high-growth company,&quot; and a &quot;key labor market competitor.&quot; ECF No. 773-1. In 2007, based in part on an analysis of Google as compared to its peer companies, Mr. Bock and Dave Rolefson (Google Equity Compensation Manager) wrote that &quot;[o]ur biggest labor market competitors are significantly exceeding their own guidelines to beat Google for talent.&quot; Id.</p> <p>Finally, Google's own documents undermine Defendants' principal theory of lack of antitrust impact, that compensation decisions would be one off and not classwide. Alan Eustace (Google Senior Vice President) commented on concerns regarding competition for workers and Google's approach to counteroffers by noting that, &quot;it sometimes makes sense to make changes in compensation, even if it introduces discontinuities in your current comp, to save your best people, and send a message to the hiring company that we'll fight for our best people.&quot; ECF No. 296-23. Because recruiting &quot;a few really good people&quot; could inspire &quot;many, many others [to] follow,&quot; Mr. Eustace concluded, &quot;[y]ou can't afford to be a rich target for other companies.&quot; Id. According to him, the &quot;long-term . . . right approach is not to deal with these situations as one-offs but to have a <em>systematic approach</em> to compensation that makes it very difficult for anyone to get a better offer.&quot; Id. (emphasis added).</p> <p>Google's impact on the labor market before the anti-solicitation agreements was best summarized by Meg Whitman (former CEO of eBay) who called Mr. Schmidt &quot;to talk about [Google's] hiring practices.&quot; ECF No. 814-15. As Eric Schmidt told Google's senior executives, Ms. Whitman said &quot;Google is the talk of the valley because [you] are driving up salaries across the board.&quot; Id. A year after this conversation, Google added eBay to its do-not-cold-call list. ECF No. 291-28.</p> <h3 id="3-evidence-related-to-intel">3. Evidence Related to Intel</h3> <p>There is also compelling evidence against Intel. Google reacted to requests regarding enforcement of the anti-solicitation agreement made by Intel executives similarly to Google's reaction to Steve Jobs' request to enforce the agreements discussed above. For example, after Paul Otellini (CEO of Intel and Member of the Google Board of Directors) received an internal complaint regarding Google's successful recruiting efforts of Intel's technical employees on September 26, 2007, ECF No. 188-8 (&quot;Paul, I am losing so many people to Google . . . . We are countering but thought you should know.&quot;), Mr. Otellini forwarded the email to Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO) and stated &quot;Eric, can you pls help here???&quot; Id. Mr. Schmidt obliged and forwarded the email to his recruiting team, who prepared a report for Mr. Schmidt on Google's activities. ECF No. 291-34. The next day, Mr. Schmidt replied to Mr. 
Otellini, &quot;If we find that a recruiter called into Intel, we will terminate the recruiter,&quot; the same remedy afforded to violations of the Apple-Google agreement. ECF No. 531 at 37. In another email to Mr. Schmidt, Mr. Otellini stated, &quot;Sorry to bother you again on this topic, but my guys are very troubled by Google continuing to recruit our key players.&quot; See ECF No. 428-8.</p> <p>Moreover, Mr. Otellini was aware that the anti-solicitation agreement could be legally troublesome. Specifically, Mr. Otellini stated in an email to another Intel executive regarding the Google-Intel agreement: &quot;Let me clarify. We have nothing signed. We have a handshake 'no recruit' between eric and myself. I would not like this broadly known.&quot; Id.</p> <p>Furthermore, there is evidence that Mr. Otellini knew of the anti-solicitation agreements to which Intel was not a party. Specifically, both Sergey Brin (Google Co-Founder) and Mr. Schmidt of Google testified that they would have told Mr. Otellini that Google had an anti-solicitation agreement with Apple. ECF No. 639-1 at 74:15 (&quot;I'm sure that we would have mentioned it[.]&quot;); ECF No. 819-12 at 60 (&quot;I'm sure I spoke with Paul about this at some point.&quot;). Intel's own expert testified that Mr. Otellini was likely aware of Google's other bilateral agreements by virtue of Mr. Otellini's membership on Google's board. ECF No. 771 at 4. The fact that Intel was added to Google's do-not-cold-call list on the same day that Apple was added further suggests Intel's participation in an overarching conspiracy. ECF No. 291-28.</p> <p>Additionally, notwithstanding the fact that Intel and Google were competitors for talent, Mr. Otellini &quot;lifted from Google&quot; a Google document discussing the bonus plans of peer companies including Apple and Intel. Cisneros Decl., Ex. 463. True competitors for talent would not likely share such sensitive bonus information absent agreements not to compete.</p> <p>Moreover, key documents related to antitrust impact also implicate Intel. Specifically, Intel recognized the importance of cold calling and stated in its &quot;Complete Guide to Sourcing&quot; that &quot;[Cold] [c]alling candidates is one of the most efficient and effective ways to recruit.&quot; ECF No. 296-22. Intel also benchmarked compensation against other &quot;tech companies generally considered comparable to Intel,&quot; which Intel defined as a &quot;[b]lend of semiconductor, software, networking, communications, and diversified computer companies.&quot; ECF No. 754-2. According to Intel, in 2007, these comparable companies included Apple and Google. Id. These documents suggest, as Plaintiffs contend, that the anti-solicitation agreements led to structural, rather than individual depression, of Class members' wages.</p> <p>Furthermore, Intel had a &quot;compensation structure,&quot; with job grades and job classifications. See ECF No. 745-13 at 73 (&quot;[W]e break jobs into one of three categories—job families, we call them—R&amp;D, tech, and nontech, there's a lot more . . . .&quot;). The company assigned employees to a grade level based on their skills and experience. ECF No. 745-11 at 23; see also ECF No. 749-17 at 45 (explaining that everyone at Intel is assigned a &quot;classification&quot; similar to a job grade). Intel standardized its salary ranges throughout the company; each range applied to multiple jobs, and most jobs spanned multiple salary grades. ECF No. 745-16 at 59. 
Intel further broke down its salary ranges into quartiles, and compensation at Intel followed &quot;a bell-curve distribution, where most of the employees are in the middle quartiles, and a much smaller percentage are in the bottom and top quartiles.&quot; Id. at 62-63.</p> <p>Intel also used a software tool to provide guidance to managers about an employee's pay range which would also take into account market reference ranges and merit. ECF No. 758-9. As explained by Randall Goodwin (Intel Technology Development Manager), &quot;[i]f the tool recommended something and we thought we wanted to make a proposed change that was outside its guidelines, we would write some justification.&quot; ECF No. 749-15 at 52. Similarly, Intel regularly ran reports showing the salary range distribution of its employees. ECF No. 749-16 at 64.</p> <p>The evidence also supports the rigidity of Intel's wage structure. For example, in a 2004 Human Resources presentation, Intel states that, although &quot;[c]ompensation differentiation is desired by Intel's Meritocracy philosophy,&quot; &quot;short and long term high performer differentiation is questionable.&quot; ECF No. 758-10 at 13. Indeed, Intel notes that &quot;[l]ack of differentiation has existed historically based on an analysis of '99 data.&quot; Id. at 19. As key &quot;[v]ulnerability [c]hallenges,&quot; Intel identifies: (1) &quot;[m]anagers (in)ability to distinguish at [f]ocal&quot;—&quot;actual merit increases are significantly reduced from system generated increases,&quot; &quot;[l]ong term threat to retention of key players&quot;; (2) &quot;[l]ittle to no actual pay differentiation for HPs [high performers]&quot;; and (3) &quot;[n]o explicit strategy to differentiate.&quot; Id. at 24 (emphasis added).</p> <p>In addition, Intel used internal equity &quot;to determine wage rates for new hires and current employees that correspond to each job's relative value to Intel.&quot; ECF No. 749-16 at 210-11; ECF No. 961-5. To assist in that process, Intel used a tool that generates an &quot;Internal Equity Report&quot; when making offers to new employees. ECF No. 749-16 at 212-13. In the words of Ogden Reid (Intel Director of Compensation and Benefits), &quot;[m]uch of our culture screams egalitarianism . . . . While we play lip service to meritocracy, we really believe more in treating everyone the same within broad bands.&quot; ECF No. 769-8.</p> <p>An Intel human resources document from 2002—prior to the anti-solicitation agreements—recognized &quot;continuing inequities in the alignment of base salaries/EB targets between hired and acquired Intel employees&quot; and &quot;parallel issues relating to accurate job grading within these two populations.&quot; ECF No. 750-15. In response, Intel planned to: (1) &quot;Review exempt job grade assignments for job families with 'critical skills.' Make adjustments, as appropriate&quot;; and (2) &quot;Validate perception of inequities . . . . Scope impact to employees. Recommend adjustments, as appropriate.&quot; Id. An Intel human resources document confirms that, in or around 2004, &quot;[n]ew hire salary premiums drove salary range adjustment.&quot; ECF No. 298-5 at 7 (emphasis added).</p> <p>Intel would &quot;match an Intel job code in grade to a market survey job code in grade,&quot; ECF No. 749-16 at 89, and use that as part of the process for determining its &quot;own focal process or pay delivery,&quot; id. at 23. 
If job codes fell below the midpoint, plus or minus a certain percent, the company made &quot;special market adjustment[s].&quot; Id. at 90.</p> <h3 id="4-evidence-related-to-adobe">4. Evidence Related to Adobe</h3> <p>Evidence from Adobe also suggests that Adobe was aware of the impact of its anti-solicitation agreements. Adobe personnel recognized that &quot;Apple would be a great target to look into&quot; for the purpose of recruiting, but knew that they could not do so because, &quot;[u]nfortunately, Bruce [Chizen (former Adobe CEO)] and Apple CEO Steve Jobs have a gentleman's agreement not to poach each other's talent.&quot; ECF No. 291-13. Adobe executives were also part and parcel of the group of high-ranking executives that entered into, enforced, and attempted to expand the anti-solicitation agreements. Specifically, Mr. Chizen, in response to discovering that Apple was recruiting employees of Macromedia (a separate entity that Adobe would later acquire), helped ensure, through an email to Mr. Jobs, that Apple would honor Apple's pre-existing anti-solicitation agreements with both Adobe and Macromedia after Adobe's acquisition of Macromedia. ECF No. 608-3 at 50.</p> <p>Adobe viewed Google and Apple to be among its top competitors for talent and expressed concern about whether Adobe was &quot;winning the talent war.&quot; ECF No. 296-3. Adobe further considered itself in a &quot;six-horse race from a benefits standpoint,&quot; which included Google, Apple, and Intuit as among the other &quot;horses.&quot; See ECF No. 296-4. In 2008, Adobe benchmarked its compensation against nine companies including Google, Apple, and Intel. ECF No. 296-4; cf. ECF No. 652-6 (showing that, in 2010, Adobe considered Intuit to be a &quot;direct peer,&quot; and considered Apple, Google, and Intel to be &quot;reference peers,&quot; though Adobe did not actually benchmark compensation against these latter companies).</p> <p>Nevertheless, despite viewing other Defendants as competitors, evidence from Adobe suggests that Adobe had knowledge of the bilateral agreements to which Adobe was not a party. Specifically, Adobe shared confidential compensation information with other Defendants, despite the fact that Adobe viewed at least some of the other Defendants as competitors and did not have a bilateral agreement with them. For example, HR personnel at Intuit and at Adobe exchanged information labeled &quot;confidential&quot; regarding how much compensation each firm would give and to which employees that year. ECF No. 652-8. Adobe and Intuit shared confidential compensation information even though the two companies had no bilateral anti-solicitation agreement, and Adobe viewed Intuit as a direct competitor for talent. Such direct competitors for talent would not likely share such sensitive compensation information in the absence of an overarching conspiracy.</p> <p>Meanwhile, Google circulated an email that expressly discussed how its &quot;budget is comparable to other tech companies&quot; and compared the precise percentage of Google's merit budget increases to that of Adobe, Apple, and Intel. ECF No. 807-13. Google had Adobe's precise percentage of merit budget increases even though Google and Adobe had no bilateral anti-solicitation agreement. 
Such sharing of sensitive compensation information among competitors is further evidence of an overarching conspiracy.</p> <p>Adobe recognized that in the absence of the anti-solicitation agreements, pay increases would be necessary, echoing Plaintiffs' theory of impact. For example, out of concern that one employee—a &quot;star performer&quot; due to his technical skills, intelligence, and collaborative abilities—might leave Adobe because &quot;he could easily get a great job elsewhere if he desired,&quot; Adobe considered how best to retain him. ECF No. 799-22. In so doing, Adobe expressed concern about the fact that this employee had already interviewed with four other companies and communicated with friends who worked there. Id. Thus, Adobe noted that the employee &quot;was aware of his value in the market&quot; as well as the fact that the employee's friends from college were &quot;making approximately $15k more per year than he [wa]s.&quot; Id. In response, Adobe decided to give the employee an immediate pay raise. Id.</p> <p>Plaintiffs' theory of impact is also supported by evidence that every job position at Adobe was assigned a job title, and every job title had a corresponding salary range within Adobe's salary structure, which included a salary minimum, middle, and maximum. See ECF No. 804-17 at 4, 8, 72, 85-86. Adobe expected that the distribution of its existing employees' salaries would fit &quot;a bell curve.&quot; ECF No. 749-5 at 57. To assist managers in staying within the prescribed ranges for setting and adjusting salaries, Adobe had an online salary planning tool as well as salary matrices, which provided managers with guidelines based on market salary data. See ECF No. 804-17 at 29-30 (&quot;[E]ssentially the salary planning tool is populated with employee information for a particular manager, so the employees on their team [sic]. You have the ability to kind of look at their current compensation. It shows them what the range is for the current role that they're in . . . . The tool also has the ability to provide kind of the guidelines that we recommend in terms of how managers might want to think about spending their allocated budget.&quot;). Adobe's practice, if employees were below the minimum recommended salary range, was to &quot;adjust them to the minimum as part of the annual review&quot; and &quot;red flag them.&quot; Id. at 12. Deviations from the salary ranges would also result in conversations with managers, wherein Adobe's officers explained, &quot;we have a minimum for a reason because we believe you need to be in this range to be competitive.&quot; Id.</p> <p>Internal equity was important at Adobe, as it was at other Defendants. As explained by Debbie Streeter (Adobe Vice President, Total Rewards), Adobe &quot;always look[ed] at internal equity as a data point, because if you are going to go hire somebody externally that's making . . . more than somebody who's an existing employee that's a high performer, you need to know that before you bring them in.&quot; ECF No.749-5 at 175. Similarly, when considering whether to extend a counteroffer, Adobe advised &quot;internal equity should ALWAYS be considered.&quot; ECF No. 746-7 at 5.</p> <p>Moreover, Donna Morris (Adobe Senior Vice President, Global Human Resources Division) expressed concern &quot;about internal equity due to compression (the market driving pay for new hires above the current employees).&quot; ECF No. 
298-9 (&quot;Reality is new hires are requiring base pay at or above the midpoint due to an increasingly aggressive market.&quot;). Adobe personnel stated that, because of the fixed budget, they may not be able to respond to the problem immediately &quot;but could look at [compression] for FY2006 if market remains aggressive.&quot;<sup class="footnote-ref" id="fnref:12"><a rel="footnote" href="#fn:12">11</a></sup> Id.</p> <h3 id="d-weaknesses-in-plaintiffs-case">D. Weaknesses in Plaintiffs' Case</h3> <p>Plaintiffs contend that though this evidence is compelling, there are also weaknesses in Plaintiffs' case that make trial risky. Plaintiffs contend that these risks are substantial. Specifically, Plaintiffs point to the following challenges that they would have faced in presenting their case to a jury: (1) convincing a jury to find a single overarching conspiracy among the seven Defendants in light of the fact that several pairs of Defendants did not have anti-solicitation agreements with each other; (2) proving damages in light of the fact that Defendants intended to present six expert economists that would attack the methodology of Plaintiffs' experts; and (3) overcoming the fact that Class members' compensation has increased in the last ten years despite a sluggish economy and overcoming general anti-tech worker sentiment in light of the perceived and actual wealth of Class members. Plaintiffs also point to outstanding legal issues, such as the pending motions in limine and the pending motion to determine whether the per se or rule of reason analysis should apply, which could have aided Defendants' ability to present a case that the bilateral agreements had a pro-competitive purpose. See ECF No. 938 at 10-14.</p> <p>The Court recognizes that Plaintiffs face substantial risks if they proceed to trial. Nonetheless, the Court cannot, in light of the evidence above, conclude that the instant settlement amount is within the range of reasonableness, particularly compared to the settlements with the Settled Defendants and the subsequent development of the litigation. The Court further notes that there is evidence in the record that mitigate at least some of the weaknesses in Plaintiffs' case.</p> <p>As to proving an overarching conspiracy, several pieces of evidence undermine Defendants' contentions that the bilateral agreements were unrelated to each other. Importantly, two individuals, Steve Jobs (Co-Founder, Former Chairman, and Former CEO of Apple) and Bill Campbell (Chairman of Intuit Board of Directors, Co-Lead Director of Apple, and advisor to Google), personally entered into or facilitated each of the bilateral agreements in this case. Specifically, Mr. Jobs and George Lucas (former Chairman and CEO of Lucasfilm), created the initial anti-solicitation agreement between Lucasfilm and Pixar when Mr. Jobs was an executive at Pixar. Thereafter, Apple, under the leadership of Mr. Jobs, entered into an agreement with Pixar, which, as discussed below, Pixar executives compared to the Lucasfilm-Pixar agreement. It was Mr. Jobs again, who, as discussed above, reached out to Sergey Brin (Google Co-Founder) and Eric Schmidt (Google Executive Chairman, Member of the Board of Directors, and former CEO) to create the Apple-Google agreement. This agreement was reached with the assistance of Mr. Campbell, who was Intuit's Board Chairman, a friend of Mr. Jobs, and an advisor to Google. The Apple-Google agreement was discussed at Google Board meetings, at which both Mr. 
Campbell and Paul Otellini (Chief Executive Officer of Intel and Member of the Google Board of Directors) were present. ECF No. 819-10 at 47. After discussions between Mr. Brin and Mr. Otellini and between Mr. Schmidt and Mr. Otellini, Intel was added to Google's do-not-cold-call list. Mr. Campbell then used his influence at Google to successfully lobby Google to add Intuit, of which Mr. Campbell was Chairman of the Board of Directors, to Google's do-not-cold-call list. See ECF No. 780-6 at 8-9. Moreover, it was a mere two months after Mr. Jobs entered into the Apple-Google agreement that Apple pressured Bruce Chizen (former CEO of Adobe) to enter into an Apple-Adobe agreement. ECF No. 291-17. As this discussion demonstrates, Mr. Jobs and Mr. Campbell were the individuals most closely linked to the formation of each step of the alleged conspiracy, as they were present in the process of forming each of the links.</p> <p>In light of the overlapping nature of this small group of executives who negotiated and enforced the anti-solicitation agreements, it is not surprising that these executives knew of the other bilateral agreements to which their own firms were not a party. For example, both Mr. Brin and Mr. Schmidt of Google testified that they would have told Mr. Otellini of Intel that Google had an anti-solicitation agreement with Apple. ECF No. 639-1 at 74:15 (&quot;I'm sure we would have mentioned it[.]&quot;); ECF No. 819-12 at 60 (&quot;I'm sure I spoke with Paul about this at some point.&quot;). Intel's own expert testified that Mr. Otellini was likely aware of Google's other bilateral agreements by virtue of Mr. Otellini's membership on Google's board. ECF No. 771 at 4. Moreover, Google recruiters knew of the Adobe-Apple agreement. Id. (Google recruiter's notation that Apple has &quot;a serious 'hands-off policy with Adobe&quot;). In addition, Mr. Schmidt of Google testified that it would be &quot;fair to extrapolate&quot; based on Mr. Schmidt's knowledge of Mr. Jobs, that Mr. Jobs &quot;would have extended [anti-solicitation agreements] to others.&quot; ECF No. 638-8 at 170. Furthermore, it was this same mix of top executives that successfully and unsuccessfully attempted to expand the agreement to other companies in Silicon Valley, such as eBay, Facebook, Macromedia, and Palm, as discussed above, suggesting that the agreements were neither isolated nor one off agreements.</p> <p>In addition, the six bilateral agreements contained nearly identical terms, precluding each pair of Defendants from affirmatively soliciting any of each other's employees. ECF No. 531 at 30. Moreover, as discussed above, Defendants recognized the similarity of the agreements. For example, Google lumped together Apple, Intel, and Intuit on Google's &quot;do-not-cold-call&quot; list. Furthermore, Google's &quot;do-not-cold-call&quot; list stated that the Apple-Google agreement and the Intel-Google agreement commenced on the same date. Finally, in an email, Lori McAdams (Pixar Vice President of Human Resources and Administration), explicitly compared the anti-solicitation agreements, stating that &quot;effective now, we'll follow a gentleman's agreement with Apple that is similar to our Lucasfilm agreement.&quot; ECF No. 531 at 26.</p> <p>As to the contention that Plaintiffs would have to rebut Defendants' contentions that the anti-solicitation agreements aided collaborations and were therefore pro-competitive, there is no documentary evidence that links the anti-solicitation agreements to any collaboration. 
None of the documents that memorialize collaboration agreements mentions the broad anti-solicitation agreements, and none of the documents that memorialize broad anti-solicitation agreements mentions collaborations. Furthermore, even Defendants' experts conceded that those closest to the collaborations did not know of the anti-solicitation agreements. ECF No. 852-1 at 8. In addition, Defendants' top executives themselves acknowledge the lack of any collaborative purpose. For example, Mr. Chizen of Adobe admitted that the Adobe-Apple anti-solicitation agreement was &quot;not limited to any particular projects on which Apple and Adobe were collaborating.&quot; ECF No. 962-7 at 42. Moreover, the U.S. Department of Justice (&quot;DOJ&quot;) also determined that the anti-solicitation agreements &quot;were not ancillary to any legitimate collaboration,&quot; &quot;were broader than reasonably necessary for the formation or implementation of any collaborative effort,&quot; and &quot;disrupted the normal price-setting mechanisms that apply in the labor setting.&quot; ECF No. 93-1 ¶ 16; ECF No. 93-4 ¶ 7. The DOJ concluded that Defendants entered into agreements that were restraints of trade that were per se unlawful under the antitrust laws. ECF No. 93-1 ¶ 35; ECF No. 93-4 ¶ 3. Thus, despite the fact that Defendants have claimed since the beginning of this litigation that there were pro-competitive purposes related to collaborations for the anti-solicitation agreements and despite the fact that the purported collaborations were central to Defendants' motions for summary judgment, Defendants have failed to produce persuasive evidence that these anti-solicitation agreements related to collaborations or were pro-competitive.</p> <h3 id="iv-conclusion">IV. CONCLUSION</h3> <p>This Court has lived with this case for nearly three years, and during that time, the Court has reviewed a significant number of documents in adjudicating not only the substantive motions, but also the voluminous sealing requests. Having done so, the Court cannot conclude that the instant settlement falls within the range of reasonableness. As this Court stated in its summary judgment order, there is ample evidence of an overarching conspiracy between the seven Defendants, including &quot;[t]he similarities in the various agreements, the small number of intertwining high-level executives who entered into and enforced the agreements, Defendants' knowledge about the other agreements, the sharing and benchmarking of confidential compensation information among Defendants and even between firms that did not have bilateral anti-solicitation agreements, along with Defendants' expansion and attempted expansion of the anti-solicitation agreements.&quot; ECF No. 771 at 7-8. Moreover, as discussed above and in this Court's class certification order, the evidence of Defendants' rigid wage structures and internal equity concerns, along with statements from Defendants' own executives, are likely to prove compelling in establishing the impact of the anti-solicitation agreements: a Class-wide depression of wages.</p> <p>In light of this evidence, the Court is troubled by the fact that the instant settlement with Remaining Defendants is proportionally lower than the settlements with the Settled Defendants. This concern is magnified by the fact that the case evolved in Plaintiffs' favor since those settlements. 
At the time those settlements were reached, Defendants still could have defeated class certification before this Court, Defendants still could have successfully sought appellate review and reversal of any class certification, Defendants still could have prevailed on summary judgment, or Defendants still could have succeeded in their attempt to exclude Plaintiffs' principal expert. In contrast, the instant settlement was reached a mere month before trial was set to commence and after these opportunities for Defendants had evaporated. While the unpredictable nature of trial would have undoubtedly posed challenges for Plaintiffs, the exposure for Defendants was even more substantial, both in terms of the potential of more than $9 billion in damages and in terms of other collateral consequences, including the spotlight that would have been placed on the evidence discussed in this Order and other evidence and testimony that would have been brought to light. The procedural history and proximity to trial should have increased, not decreased, Plaintiffs' leverage from the time the settlements with the Settled Defendants were reached a year ago.</p> <p>The Court acknowledges that Class counsel have been zealous advocates for the Class and have funded this litigation themselves against extraordinarily well-resourced adversaries. Moreover, there very well may be weaknesses and challenges in Plaintiffs' case that counsel cannot reveal to this Court. Nonetheless, the Court concludes that the Remaining Defendants should, at a minimum, pay their fair share as compared to the Settled Defendants, who resolved their case with Plaintiffs at a stage of the litigation where Defendants had much more leverage over Plaintiffs.</p> <p>For the foregoing reasons, the Court DENIES Plaintiffs' Motion for Preliminary Approval of the settlements with Remaining Defendants. The Court further sets a Case Management Conference for September 10, 2014 at 2 p.m.</p> <p>IT IS SO ORDERED.</p> <pre><code>Dated: August 8, 2014

                                        LUCY H. KOH
                                        United States District Judge
</code></pre> <div class="footnotes"> <hr /> <ol> <li id="fn:1">Dr. Leamer was subject to vigorous attack in the initial class certification motion, and this Court agreed with some of Defendants' contentions with respect to Dr. Leamer and thus rejected the initial class certification motion. See ECF No. 382 at 33-43. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">Defendants' motions in limine, Plaintiffs' motion to exclude testimony from certain experts, Defendants' motion to exclude testimony from certain experts, a motion to determine whether the per se or rule of reason analysis applied, and a motion to compel were pending at the time the settlement was reached. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:3">Plaintiffs in the instant Motion represent that two of the letters are from non-Class members and that the third letter is from a Class member who may be withdrawing his objection. See ECF No. 920 at 18 n.11. The objection has not been withdrawn at the time of this Order. <a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li> <li id="fn:4">Devine stated in his Opposition that the Opposition was designed to supersede a letter that he had previously sent to the Court. See ECF No. at 934 n.2. The Court did not receive any letter from Devine. Accordingly, the Court has considered only Devine's Opposition. 
<a class="footnote-return" href="#fnref:4"><sup>[return]</sup></a></li> <li id="fn:5">Plaintiffs also assert that administration costs for the settlement would be $160,000. <a class="footnote-return" href="#fnref:5"><sup>[return]</sup></a></li> <li id="fn:6">Devine calculated that Class members would receive an average of $3,573. The discrepancy between this number and the Court's calculation may result from the fact that Devine's calculation does not account for the fact that 147 individuals have already opted out of the Class. The Court's calculation resulted from subtracting the requested attorneys' fees ($81,125,000), costs ($1,200,000), incentive awards ($400,000), and estimated administration costs ($160,000) from the settlement amount ($324,500,000) and dividing the resulting number by the total number of 7remaining class members (64,466). <a class="footnote-return" href="#fnref:6"><sup>[return]</sup></a></li> <li id="fn:7">If the Court were to deny any portion of the requested fees, costs, or incentive payments, this would increase individual Class members' recovery. If less than 4% of the Class were to opt out, that would also increase individual Class members' recovery. <a class="footnote-return" href="#fnref:7"><sup>[return]</sup></a></li> <li id="fn:8">One way to think about this is to set up the simple equation: 5/95 = $20,000,000/x. This equation asks the question of how much 95% would be if 5% were $20,000,000. Solving for x would result in $380,000,000. <a class="footnote-return" href="#fnref:8"><sup>[return]</sup></a></li> <li id="fn:10">On the same day, Mr. Campbell sent an email to Mr. Brin and to Larry Page (Google Co-Founder) stating, &quot;Steve just called me again and is pissed that we are still recruiting his browser guy.&quot; ECF No. 428-13. Mr. Page responded &quot;[h]e called a few minutes ago and demanded to talk to me.&quot; Id. <a class="footnote-return" href="#fnref:10"><sup>[return]</sup></a></li> <li id="fn:11">Mr. Jobs successfully expanded the anti-solicitation agreements to Macromedia, a company acquired by Adobe, both before and after Adobe's acquisition of Macromedia. <a class="footnote-return" href="#fnref:11"><sup>[return]</sup></a></li> <li id="fn:12">Adobe also benchmarked compensation off external sources, which supports Plaintiffs' theory of Class-wide impact and undermines Defendants' theory that the anti-solicitation agreements had only one off, non-structural effects. For example, Adobe pegged its compensation structure as a &quot;percentile&quot; of average market compensation according to survey data from companies such as Radford. ECF No. 804-17 at 4. Mr. Chizen explained that the particular market targets that Adobe used as benchmarks for setting salary ranges &quot;tended to be software, high-tech, those that were geographically similar to wherever the position existed.&quot; ECF No. 962-7 at 22. This demonstrated that the salary structures of the various Defendants were linked, such that the effect of one Defendant's salary structure would ripple across to the other Defendants through external sources like Radford. <a class="footnote-return" href="#fnref:12"><sup>[return]</sup></a></li> </ol> </div> Verilog Won & VHDL Lost? — You Be The Judge! verilog-vs-vhdl/ Thu, 14 Aug 2014 00:00:00 +0000 verilog-vs-vhdl/ <p><i>This is an archived USENET post from John Cooley on a competitive comparison between VHDL and Verilog that was done in 1997.</i></p> <p>I knew I hit a nerve. 
Usually when I publish a candid review of a particular conference or EDA product I typically see around 85 replies in my e-mail &quot;in&quot; box. Buried in my review of the recent Synopsys Users Group meeting, I very tersely reported that 8 out of the 9 Verilog designers managed to complete the conference's design contest yet <em>none</em> of the 5 VHDL designers could. I apologized for the terseness and promised to do a detailed report on the design contest at a later date. Since publishing this, my e-mail &quot;in&quot; box has become a veritable Verilog/VHDL Beirut filling up with 169 replies! Once word leaked that the detailed contest write-up was going to be published in the DAC issue of &quot;Integrated System Design&quot; (formerly &quot;ASIC &amp; EDA&quot; magazine), I started getting phone calls from the chairman of VHDL International, Mahendra Jain, and from the president of Open Verilog International, Bill Fuchs. A small army of hired gun spin doctors (otherwise known as PR agents) followed with more phone calls. I went ballistic when VHDL columnist Larry Saunders approached the Editor-in-Chief of ISD for an advance copy of my design contest report. He felt I was &quot;going to do a hatchet job on VHDL&quot; and wanted to write a rebuttal that would follow my article... and all this was happening before I had even written <em>one</em> damned word of the article!</p> <p>Because I'm an independent consultant who makes his living training and working <em>both</em> HDL's, I'd rather not go through a VHDL Salem witch trial where I'm publicly accused of being secretly in league with the Devil to promote Verilog, thank you. Instead I'm going to present <em>everything</em> that happened at the Design Contest, warts and all, and let <em>you</em> judge! At the end of court evidence, I'll ask you, the jury, to write an e-mail reply which I can publish in my column in the follow-up &quot;Integrated System Design&quot;.</p> <h4 id="the-unexpected-results">The Unexpected Results</h4> <p>Contestants were given 90 minutes using either Verilog or VHDL to create a gate netlist for the fastest fully synchronous loadable 9-bit increment-by-3 decrement-by-5 up/down counter that generated even parity, carry and borrow.</p> <p>Of the 9 Verilog designers in the contest, only 1 didn't get to a final gate level netlist because he tried to code a look-ahead parity generator. Of the 8 remaining, 3 had netlists that missed on functional test vectors. The 5 Verilog designers who got fully functional gate-level designs were:</p> <pre>
   Larry Fiedler      NVidia               3.90 nsec   1147 gates
   Steve Golson       Trilobyte Systems    4.30 nsec   1909 gates
   Howard Landman     HaL Computer         5.49 nsec   1495 gates
   Mark Papamarcos    EDA Associates       5.97 nsec   1180 gates
   Ed Paluch          Paluch & Assoc.      7.85 nsec   1514 gates
 </pre> <p>The surprise was that, during the same time, <em>none</em> of the 5 VHDL designers in the contest managed to produce any gate level designs.</p> <h4 id="not-vhdl-newbies-vs-verilog-pros">Not VHDL Newbies vs. Verilog Pros</h4> <p>The first reaction I get from the VHDL bigots (who weren't at the competition) is: &quot;Well, this is obviously a case where Verilog veterans whipped some VHDL newbies. Big deal.&quot; Well, they're partially right. Many of those Verilog designers are damned good at what they do — but so are the VHDL designers!</p> <p>I've known Prasad Paranjpe of LSI Logic for years. He has taught and still teaches VHDL with synthesis classes at U.C. Santa Cruz University Extension in the heart of Silicon Valley. 
He was VP of the Silicon Valley VHDL Local Users Group. He's been a full time ASIC designer since 1987 and has designed <em>real</em> ASIC's since 1990 using VHDL &amp; Synopsys since rev 1.3c. Prasad's home e-mail address is &quot;vhdl@ix.netcom.com&quot; and his home phone is (XXX) XXX-VHDL. ASIC designer Jan Decaluwe has a history of contributing insightful VHDL and synthesis posts to ESNUG while at Alcatel and later as a founder of Easics, a European ASIC design house. (Their company motto: &quot;Easics - The VHDL Design Company&quot;.) Another LSI Logic/VHDL contestant, Vikram Shrivastava, has used the VHDL/Synopsys design approach since 1992. These guys aren't newbies!</p> <h4 id="creating-the-contest">Creating The Contest</h4> <p>I followed a double-blind approach to putting together this design contest. That is, not only did I have Larry Saunders (a well known VHDL columnist) and Yatin Trivedi (a well known Verilog columnist), both of Seva Technologies, comment on the design contest — unknown to them I had Ken Nelsen (a VHDL oriented Methodology Manager from Synopsys) and Jeff Flieder (a Verilog based designer from Ford Microelectronics) also help check the design contest for any conceptual or implementation flaws.</p> <p>My initial concern in creating the contest was to not have a situation where the Synopsys Design Compiler could quickly complete the design by just placing down a DesignWare part. Yet, I didn't want to have contestants trying (and failing) to design some fruity, off-the-wall thingy that no one truly understood. Hence, I was restricted to &quot;standard&quot; designs that all engineers knew — but with odd parameters thrown in to keep DesignWare out of the picture. Instead of a simple up/down counter, I asked for an up-by-3 and down-by-5 counter. Instead of 8 bits, everything was 9 bits.</p> <pre>
   Fig.1) Basic block diagram outlining design's functionality: UP, DOWN, and
          DATA_IN[8:0] feed the up-by-3/down-by-5 logic, which produces
          new_count[8:0] plus carry and borrow; new_count[8:0] also drives an
          even parity generator (1 bit). COUNT_OUT[8:0], CARRY_OUT, BORROW_OUT,
          and PARITY_OUT are all registered on the rising edge of CLOCK, and
          COUNT_OUT[8:0] is recycled back into the up/down logic.
 </pre> <p>The even PARITY, CARRY and BORROW requirements were thrown in to give the contestants some space to make significant architectural trade-offs that could mean the difference between winning and losing.</p> <p>The counter loaded when UP and DOWN were both &quot;low&quot;, and held its state when UP and DOWN were &quot;high&quot; — exactly opposite to what 99% of the world's loadable counters traditionally do.</p> <pre>
   UP   DOWN   DATA_IN      |   COUNT_OUT
   -----------------------------------------
   0    0      valid        |   load DATA_IN
   0    1      don't care   |   (Q - 5)
   1    0      don't care   |   (Q + 3)
   1    1      don't care   |   Q unchanged

   Fig. 2) Loading and up/down counting specifications. All I/O events
           happen on the rising edge of CLOCK.
 </pre> <p>To spice things up a bit further, I chose to use the LSI Logic 300K ASIC library because wire loading &amp; wire delay is a significant factor in this technology. 
Having the &quot;home library&quot; advantage, one savvy VHDL designer, Prasad Paranjpe of LSI Logic, cleverly asked if the default wire loading model was required (he wanted to use a zero wire load model to save on timing!) I replied: &quot;Nice try. Yes, the default wire model is required.&quot;</p> <p>To let the focus be on design and not verification, contestants were given equivalent Verilog and VHDL testbenches provided by Yatin Trivedi &amp; Larry Saunders' Seva Technologies. These testbenches threw the same 18 vectors at the Verilog/VHDL source code the contestants were creating and if it passed, for contest purposes, their design was judged &quot;functionally correct.&quot;</p> <p>For VHDL, contestants had their choice of Synopsys VSS 3.2b and/or Cadence Leapfrog VHDL 2.1.4; for Verilog, contestants had their choice of Cadence Verilog-XL 2.1.2 or Chronologic VCS 2.3.2 plus their respective Verilog/VHDL design environments. (The CEO of Model Technology Inc., Bob Hunter, was too paranoid about the possibility of Synopsys employees seeing his VHDL to allow it in the contest.) LCB 300K rev 3.1A.1.1.101 was the LSI Logic library.</p> <p>I had a concern that some designers might not know that an XOR reduction tree is how one generates parity — but Larry, Yatin, Ken &amp; Jeff all agreed that any engineer not knowing this shouldn't be helped to win a design contest. As a last minute hint, I put in every contestant's directory an &quot;xor.readme&quot; file that named the two XOR gates available in the LSI 300K library (EO and EO3) plus their drive strengths and port lists.</p> <p>To be friendly synthesis-wise, I let the designers keep the unrealistic Synopsys default setting of all inputs having infinite drive strength and all outputs driving zero loads.</p> <p>The contest took place in three sessions over the same day. To keep things equal, my guiding philosophy throughout these sessions was to conscientiously <em>not</em> fix/improve <em>anything</em> between sessions — no matter how frustrating!</p> <p>After all that was said &amp; done, Larry &amp; Yatin thought that the design contest would be too easy while Ken &amp; Jeff thought it would have just about the right amount of complexity. I asked all four if they saw any Verilog or VHDL specific &quot;gotchas&quot; with the contest; all four categorically said &quot;no.&quot;</p> <h4 id="murphy-s-law">Murphy's Law</h4> <p>Once the contest began, Murphy's Law — &quot;that which can go wrong, will go wrong&quot; — prevailed. Because we couldn't get the SUN and HP workstations until a terrifying 3 days before the contest, I lived through a nightmare domino effect on getting all the Verilog, VHDL, Synopsys and LSI libraries in and installed. Nobody could cut keys for the software until the machine ID's were known — and this wasn't until 2 days before the contest! (As it was, I had to drop the HP machines because most of the EDA vendors couldn't cut software keys for HP machines as fast as they could for SUN workstations.)</p> <p>The LSI 300K Libraries didn't arrive until an hour before the contest began. The Seva guys found and fixed a bug in the Verilog testbench (that didn't exist in the VHDL testbench) some 15 minutes before the contest began.</p> <p>Some 50 minutes into the first design session, one engineer's machine crashed — which also happened to be the license server for all the Verilog simulation software! (Luckily, by this time all the Verilog designers were deep into the synthesis stage.) 
Unfortunately, the poor designer who had his machine crash couldn't be allowed to redo the contest in a following session because of his prior knowledge of the design problem. This machine was rebooted and used solely as a license server for the rest of the contest.</p> <p>The logistics nightmare once again reared its ugly head when two designers innocently asked: &quot;John, where are your Synopsys manuals?&quot; Inside I screamed to myself: &quot;OhMyGod! OhMyGod! OhMyGod!&quot;; outside I calmly replied: &quot;There are no manuals for any software here. You have to use the online docs available.&quot;</p> <p>More little gremlins danced in my head when I realized that six of the eight data books that the LSI lib person brought weren't for the <em>exact</em> LCB 300K library we were using — these data books would be critical for anyone trying to hand build an XOR reduction tree — and one Verilog contestant had just spent ten precious minutes reading a misleading data book! (There were two LCB 300K, one LCA 300K and five LEA 300K databooks.) Verilog designer Howard Landman of HaL Computer noted: &quot;I probably wasted 15 minutes trying to work through this before giving up and just coding functional parity — although I used parentheses in hopes of Synopsys using 3-input XOR gates.&quot;</p> <p>Then, just as things couldn't get worse, everyone got to discover that when Synopsys's Design Compiler runs for the first time in a new account — it takes a good 10 to 15 minutes to build your very own personal DesignWare cache. Verilog contestant Ed Paluch, a consultant, noted: &quot;I thought that first synthesis run building [expletive deleted] DesignWare caches would <em>never</em> end! It felt like days!&quot;</p> <p>Although, in my opinion, none of these headaches compromised the integrity of the contest, at the time I had to continually remind myself: &quot;To keep things equal, I can <em>not</em> fix nor improve <em>anything</em> no matter how frustrating.&quot;</p> <h4 id="judging-the-results">Judging The Results</h4> <p>Because I didn't want to be in the business of judging source code <em>intent</em>, all judging was based solely on whether the gate level passed the previously described 18 test vectors. Once done, the design was read into the Synopsys Design Compiler and all constraints were removed. Then I applied the command &quot;clocks_at 0, 6, 12 clock&quot; and then took the longest path as determined by &quot;report_timing -path full -delay max -max_paths 12&quot; as the final basis for comparing designs — determining that Verilog designer Larry Fiedler of NVidia won with a 1147 gate design timed at 3.90 nsec.</p> <pre>
  reg [9:0] cnt_up, cnt_dn;
  reg [8:0] count_nxt;

  always @(posedge clock)
  begin
     cnt_dn = count_out - 3'b 101;    // synopsys label add_dn
     cnt_up = count_out + 2'b 11;     // synopsys label add_up

     case ({up,down})
        2'b 00  : count_nxt = data_in;
        2'b 01  : count_nxt = cnt_dn;
        2'b 10  : count_nxt = cnt_up;
        2'b 11  : count_nxt = 9'bX;   // SPEC NOT MET HERE!!!
        default : count_nxt = 9'bX;   // avoiding ambiguity traps
     endcase

     parity_out <= ^count_nxt;
     carry_out  <= up & cnt_up[9];
     borrow_out <= down & cnt_dn[9];
     count_out  <= count_nxt;
  end

  Fig. 3) The winning Verilog source code. (Note that it failed to meet the
          spec of holding its state when UP and DOWN were both high.) 
</pre> <p>Since judging was open to any and all who wanted to be there, Kurt Baty, a Verilog contestant and well respected design consultant, registered a vocal double surprise because he knew his design was of comparable speed but had failed to pass the 18 test vectors. (Kurt's a good friend — I really enjoyed harassing him over this discovery — especially since he had bragged to so many people on how he was going to win this contest!) An on the spot investigation yielded that Kurt had accidentally saved the wrong design in the final minute of the contest. Even further investigation then also yielded that the 18 test vectors didn't cover exactly all the counter's specified conditions. Larry's &quot;winning&quot; gate level Verilog based design had failed to meet the spec of holding its state when UP and DOWN were high — even though his design had successfully passed the 18 test vectors!</p> <p>If human visual inspection of the Verilog/VHDL source code to subjectively check for places where the test vectors might have missed was part of the judging criteria, Verilog designer Steve Golson would have won. Once again, I had to reiterate that all designs which passed the testbench vectors were considered &quot;functionally correct&quot; by definition.</p> <h4 id="what-the-contestants-thought">What The Contestants Thought</h4> <p>Despite NASA VHDL designer Jeff Solomon's &quot;I didn't like the idea of taking the traditional concept of counters and warping it to make a contest design problem&quot;, the remaining twelve contestants really liked the architectural flexibility of the up-by-3/down-by-5, 9-bit, loadable, synchronous counter with even parity, carry and borrow. Verilog designer Mark Papamarcos summed up the majority opinion with: &quot;I think that the problem was pretty well devised. There was a potential resource sharing problem, some opportunities to schedule some logic to evaluate concurrently with other logic, etc. When I first saw it, I thought it would be very easy to implement and I would have lots of time to tune. I also noticed the 2 and 3-input XOR's in the top-level directory, figured that it might be somehow relevant, but quickly dismissed any clever ideas when I ran into problems getting the vectors to match.&quot;</p> <p>Eleven of the contestants were tempted by the apparent correlation between known parity and the adding/subtracting of odd numbers. Only one Verilog designer, Oren Rubinstein of Hewlett-Packard Canada, committed to this strategy but ran way out of time. Once home, Kurt Baty helped Oren conceptually finish his design while Prasad Paranjpe helped with the final synthesis. It took about 7 hours brain time and 8 hours coding/sim/synth time (15 hours total) to get a final design of 3.05 nsec &amp; 1988 gates. Observing that it took 10x the original estimated 1.5 hours to get a 22% improvement in speed, Oren commented: &quot;Like real life, it's impossible to create accurate engineering design schedules.&quot;</p> <p>Two of the VHDL designers, Prasad Paranjpe of LSI Logic and Jan Decaluwe of Easics, both complained of having to deal with type conversions in VHDL. Prasad confessed: &quot;I can't believe I got caught on a simple typing error. I used IEEE std_logic_arith, which requires use of unsigned &amp; signed subtypes, instead of std_logic_unsigned.&quot; Jan agreed and added: &quot;I ran into a problem with VHDL or VSS (I'm still not sure.) 
This case statement doesn't analyze: &quot;subtype two_bits is unsigned(1 downto 0); case two_bits'(up &amp; down)...&quot; But what worked was: &quot;case two_bits'(up, down)...&quot; Finally I solved this problem by assigning the concatenation first to an auxiliary variable.&quot;</p> <p>Verilog competitor Steve Golson outlined the first-get-a-working-design-and-then-tweak-it-in-synthesis strategy that most of the Verilog contestants pursued with: &quot;As I recall I had some stupid typos which held me up; also I had difficulty with parity and carry/borrow. Once I had a correctly functioning baseline design, I began modifying it for optimal synthesis. My basic idea was to split the design into four separate modules: the adder, the 4:1 MUXes, the XOR logic (parity and carry/borrow), and the top counter module which contains only the flops and instances of the other three modules. My strategy was to first compile the three (purely combinational) submodules individually. I used a simple &quot;max_delay 0 all_outputs()&quot; constraint on each of them. The top-level module got the proper clock constraint. Then &quot;dont_touch&quot; these designs, and compile the top counter module (this just builds the flops). Then to clean up I did an &quot;ungroup -all&quot; followed by a &quot;compile -incremental&quot; (which shaved almost 1 nsec off my critical path.)&quot;</p> <p>Typos and panic hurt the performance of a lot of contestants. Verilog designer Daryoosh Khalilollahi of National Semiconductor said: &quot;I thought I would not be able to finish it on time, but I just made it. I lost some time because I would get a Verilog syntax error that turned up because I had one extra file in my Verilog &quot;include&quot; file (verilog -f include) which was not needed.&quot; Also, Verilog designer Howard Landman of HaL Computer never realized he had put both a complete behavioral <em>and</em> a complete hand-instanced parity tree in his source Verilog. (Synopsys Design Compiler just optimized one of Howard's dual parity trees away!)</p> <p>On average, each Verilog designer managed to get two to five synthesis runs completed before running out of time. Only two VHDL designers, Jeff Solomon and Jan Decaluwe, managed to start (but not complete) one synthesis run. In both cases I disqualified them from the contest for not making the deadline but let their synthesis runs attempt to finish. Jan arrived a little late so we gave Jan's run some added time before disqualifying him. His unfinished run had to be killed after 21 minutes because another group of contestants were arriving. (Incidentally, I had accidentally given the third session an extra 6 design minutes because of a goof on my part. No Verilog designers were in this session but VHDL designers Jeff Solomon, Prasad Paranjpe, Vikram Shrivastava plus Ravi Srinivasan of Texas Instruments all benefited from this mistake.) Since Jeff was in the last session, I gave him all the time needed for his run to complete. After an additional 17 minutes (total) he produced a gate level design that timed out at 15.52 nsec. After a total of 28 more minutes he got the timing down to 4.46 nsec but his design didn't pass functional vectors. 
He had an error somewhere in his VHDL source code.</p> <p>Failed Verilog designer Kurt Baty closed with: &quot;John, I look forward to next year's design contest in whatever form or flavor it takes, and a chance to redeem my honor.&quot;</p> <h4 id="closing-arguments-to-the-jury">Closing Arguments To The Jury</h4> <p>Closing arguments the VHDL bigots might make in this trial would be: &quot;What 14 engineers do isn't statistically significant. Even the guy who ran this design contest admitted all sorts of last minute goofs with it. You had a workstation crash, no manuals &amp; misleading LSI databooks. The test vectors were incomplete. One key VHDL designer ran into a Synopsys VHDL simulator bug after arriving late to his session. The Verilog design which won this contest didn't even meet the spec completely! In addition, this contest wasn't put together to be a referendum on whether Verilog or VHDL is the better language to design in — hence it may miss some major issues.&quot;</p> <p>The Verilog bigots might close with: &quot;No engineers work under the contrived conditions one may want for an ideal comparison of Verilog &amp; VHDL. Fourteen engineers may or may not be statistically significant, but where there's smoke, there's fire. I saw all the classical problems engineers encounter in day to day designing here. We've all dealt with workstation crashes, bad revision control, bugs in tools, poor planning and incomplete testing. It's because of these realities I think this design contest was <em>perfect</em> to determine how each HDL measures up in real life. And Verilog won hands down!&quot;</p> <p>The jury's verdict will be seen in the next &quot;Integrated System Design&quot;.</p> <h4 id="you-the-jury">You The Jury...</h4> <p>You the jury are now asked to please take ten minutes to think about what you have just read and, in 150 words or less, send your thoughts to me at &quot;jcooley@world.std.com&quot;. Please don't send me &quot;VHDL sucks.&quot; or &quot;Verilog must die!!!&quot; — but personal experiences and/or observations that add to the discussion. It's OK to have strong/violent opinions, just back them with something more than hot air. (Since I don't want to be in the business of chasing down permissions, my default setting is <em>whatever</em> you send me is completely publishable. If you wish to send me letters with a mix of publishable and non-publishable material CLEARLY indicate which is which.) I will not only be reprinting the replies, I'll also be publishing stats on how many people reported each type of specific opinion/experience.</p> <p><i>John Cooley<br> Part Time EDA Consumer Advocate<br> Full Time ASIC, FPGA &amp; EDA Design Consultant</i></p> <p>P.S. In replying, please indicate your job, your company, whether you use Verilog or VHDL, why, and for how long. Also, please DO NOT copy this article back to me — I know why you're replying! :^)</p> Data-driven bug finding bugalytics/ Sun, 06 Apr 2014 00:00:00 +0000 bugalytics/ <p><a href="everything-is-broken/">I can't remember the last time I went a whole day without running into a software bug</a>. For weeks, I couldn't invite anyone to Facebook events due to a bug that caused the invite button to not display on the invite screen. Google Maps has been giving me illegal and sometimes impossible directions ever since I moved to a small city. And Google Docs regularly hangs when I paste an image in, giving me a busy icon until I delete the image.</p> <p>It's understandable that bugs escape testing. 
Testing is hard. Integration testing is harder. End-to-end testing is even harder. But there's an easier way. A third of bugs like this – bugs I run into daily – could be found automatically using analytics.</p> <p></p> <p>If you think finding bugs with analytics sounds odd, ask a hardware person about performance counters. Whether or not they're user accessible, every ASIC has analytics to allow designers to figure out what changes need to be made for the next generation chip. Because people look at perf counters anyway, they notice when a <a href="http://acg.cis.upenn.edu/papers/micro05_storeq.pdf">forwarding path</a> never gets used, when <a href="http://infoscience.epfl.ch/record/135571/files/micro01-way.pdf">way prediction</a> has a strange distribution, or when the prefetch buffer never fills up. <a href="discontinuities/">Unexpected distributions in analytics</a> are a sign of a misunderstanding, which is often a sign of a bug<sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">1</a></sup>.</p> <p>Facebook logs all user actions. That can be used to determine user dead ends. Google Maps reroutes after “wrong” turns. That can be used to determine when the wrong turns are the result of bad directions. Google Docs could track all undos<sup class="footnote-ref" id="fnref:U"><a rel="footnote" href="#fn:U">2</a></sup>. That could be used to determine when users run into misfeatures or bugs<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">3</a></sup>.</p> 
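<p>To make that concrete, here's a minimal sketch of the kind of log query I have in mind, written in Python over a hypothetical event stream. None of these products is implied to actually work this way; the event names, window, and thresholds are all invented for illustration.</p> <pre><code># Hypothetical sketch: flag actions that users "back out of" suspiciously often.
# The log format, event names, and thresholds are invented for illustration.
from collections import defaultdict

UNDO_LIKE = {'undo', 'delete_last_paste', 'session_disconnect'}
WINDOW_SECONDS = 10    # how soon the backout has to follow the action
MIN_SAMPLES = 1000     # ignore actions we rarely see
ALERT_RATE = 0.20      # backout rate that warrants a human looking at it

def suspicious_actions(events):
    # events: (user_id, unix_timestamp, action) tuples, sorted by user then time
    backouts = defaultdict(int)  # action -> times it was immediately backed out of
    totals = defaultdict(int)    # action -> times it had any follow-up event at all
    prev = None                  # previous event, only compared within one user
    for user, ts, action in events:
        if prev is not None and prev[0] == user and prev[2] not in UNDO_LIKE:
            totals[prev[2]] += 1
            if action in UNDO_LIKE and ts - prev[1] <= WINDOW_SECONDS:
                backouts[prev[2]] += 1
        prev = (user, ts, action)
    flagged = []
    for action, n in totals.items():
        rate = backouts[action] / n
        if n >= MIN_SAMPLES and rate >= ALERT_RATE:
            flagged.append((rate, action))
    return sorted(flagged, reverse=True)
</code></pre> <p>If something like &quot;paste_image&quot; floats to the top of that list, that's the software equivalent of a perf counter with a distribution that doesn't make sense: not proof of a bug, but a strong hint about where to look.</p> 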
Once in a blue moon, I'm pleasantly surprised to find that a software project uses <a href="http://hackage.haskell.org/package/QuickCheck">a test framework</a> which has 1% of the functionality that was standard a decade ago in chip designs.</p> <p>Considering the relative cost of <a href="//danluu.com/testing/">hardware bugs vs. software bugs</a>, it's not too surprising that a lot more effort goes into hardware testing. But here's a case where there's almost no extra effort. You've already got analytics measuring the conversion rate through all sorts of user funnels. The only new idea here is that clicking on an ad or making a purchase isn't the only type of conversion you should measure. Following directions at an intersection is a conversion, not deleting an image immediately after pasting it is a conversion, and using a modal dialogue box after opening it up is a conversion.</p> <p>Of course, whether it's ad click conversion rates or cache hit rates, blindly optimizing a single number will get you into a local optimum that will hurt you in the long run, and setting thresholds for conversion rates that should send you an alert is nontrivial. There's a combinatorially large space of user actions, so it takes judicious use of machine learning to figure out reasonable thresholds. That's going to cost time and effort. But think of all the effort you put into optimizing clicks. You probably figured out, years ago, <a href="http://www.kalzumeus.com/2009/03/07/how-to-successfully-compete-with-open-source-software/">that replacing boring text with giant pancake buttons gives you 3x the clickthrough rate</a>; you're now down to optimizing 1% here and 2% there. That's great, and it's a sign that you've captured all the low hanging fruit. But what do you think the future clickthrough rate is when a user encounters a show-stopping bug that prevents any forward progress on a modal dialogue box?</p> <p>If this sounds like an awful lot of work, find a known bug that you've fixed, and grep your log data for users who ran into that bug. Alienating those users by providing a profoundly broken product is doing a lot more damage to your clickthrough rate than a hard-to-find checkout button, and the exact same process that led you to that gigantic checkout button can solve your other problem, too. Everyone knows that adding 200ms of load time can cause 20% of users to close the window. What do you think the effect of exposing them to a bug that takes 5,000ms of user interaction to fix is?</p> <p>If that's worth fixing, pull out <a href="https://github.com/twitter/scalding">scalding</a>, <a href="http://research.google.com/pubs/pub36632.html">dremel</a>, <a href="http://cascalog.org/">cascalog</a>, or whatever your favorite data processing tool is. Start looking for user actions that don't make sense. Start looking for bugs.</p> <p><small> Thanks to Pablo Torres for catching a typo in this post </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:A">It's not that all chip design teams do this systematically (although they should), but that people are looking at the numbers anyway, and will see anomalies. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:U">Undos aren't just literal undos; pasting an image in and then deleting it afterwards because it shows a busy icon forever counts, too. <a class="footnote-return" href="#fnref:U"><sup>[return]</sup></a></li> <li id="fn:1"><p>This is worse than it sounds.
In addition to producing a busy icon forever in the doc, it disconnects that session from the server, which is another thing that could be detected: it's awfully suspicious if a certain user action is always followed by a disconnection.</p> <p>Moreover, both of these failure modes could have been found with fuzzing, since they should never happen. Bugs are hard enough to find that defense in depth is the only reasonable solution.</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">if you talk to a hardware person, call this verification instead of testing, or they'll think you're talking about DFT, testing silicon for manufacturing defects, or some other weird thing with no software analogue. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> </ol> </div> Editing binaries edit-binary/ Sun, 23 Mar 2014 00:00:00 +0000 edit-binary/ <p>Editing binaries is a trick that comes in handy a few times a year. You don't often need to, but when you do, there's no alternative. When I mention patching binaries, I get one of two reactions: complete shock or no reaction at all. As far as I can tell, this is because most people have one of these two models of the world:</p> <blockquote> <ol> <li><p>There exists source code. Compilers do something to source code to make it runnable. If you change the source code, different things happen.</p></li> <li><p>There exists a processor. The processor takes some bits and decodes them to make things happen. If you change the bits, different things happen.</p></li> </ol> </blockquote> <p>If you have the first view, breaking out a hex editor to modify a program is the action of a deranged lunatic. If you have the second view, editing binaries is the most natural thing in the world. Why wouldn't you just edit the binary? It's often the easiest way to get what you need.</p> <p></p> <p>For instance, you're forced to do this all the time if you use a <a href="http://forum.osdev.org/viewtopic.php?f=1&amp;t=26351">non-Intel non-AMD x86 processor</a>. Instead of checking <a href="http://www.sandpile.org/x86/cpuid.htm">CPUID feature flags, programs will check the CPUID family, model, and stepping</a> to determine features, which results in incorrect behavior on non-standard CPUs. Sometimes you have to do an edit to get the program to use the latest SSE instructions and sometimes you have to do an edit to get the program to run at all. You can try filing a bug, but it's much <a href="https://code.google.com/p/nativeclient/issues/detail?id=2508">easier to just edit your binaries</a>.</p> <p>Even if you're running on a mainstream Intel CPU, these tricks are useful when you run into <a href="http://blogs.msdn.com/b/oldnewthing/archive/2012/11/13/10367904.aspx?Redirected=true">bugs in closed sourced software</a>. And then there are emergencies.</p> <p>The other day, a DevOps friend of mine at a mid-sized startup told me about the time they released an internal alpha build externally, which caused their auto-update mechanism to replace everyone's working binary with a buggy experimental version. It only took a minute to figure out what happened. Updates gradually roll out to all users over a couple days, which meant that the bad version had only spread to <code>1 / (60*24*2) = 0.03%</code> of all users. But they couldn't push the old version into the auto-updater because the client only accepts updates from higher numbered versions. 
They had to go through the entire build and release process (an hour long endeavor) just to release a version that was identical to their last good version. If it had occurred to anyone to edit the binary to increment the version number, they could have pushed out a good update in a minute instead of an hour, which would have kept the issue from spreading to more than <code>0.06%</code> of their users, instead of sending <code>2%</code> of their users a broken update<sup class="footnote-ref" id="fnref:E"><a rel="footnote" href="#fn:E">1</a></sup>.</p> <p>This isn't nearly as hard as it sounds. Let's try an example. If you're going to do this sort of thing regularly, you probably want to use a real disassembler like <a href="https://hex-rays.com/products/ida/support/download_freeware.shtml">IDA</a><sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">2</a></sup>. But, you can get by with simple tools if you only need to do this every once in a while. I happen to be on a Mac that I don't use for development, so I'm going to use <a href="http://lldb.llvm.org/">lldb</a> for disassembly and <a href="http://ridiculousfish.com/hexfiend/">HexFiend</a> to edit this example. Gdb, otool, and objdump also work fine for quick and dirty disassembly.</p> <p>Here's a toy code snippet, <code>wat-arg.c</code>, that should be easy to binary edit:</p> <pre><code class="language-c">#include &lt;stdio.h&gt; int main(int argc, char **argv) { if (argc &gt; 1) { printf(&quot;got an arg\n&quot;); } else { printf(&quot;no args\n&quot;); } } </code></pre> <p>If we compile this and then launch lldb on the binary and step into main, we can see the following machine code:</p> <pre><code>$ lldb wat-arg (lldb) breakpoint set -n main Breakpoint 1: where = original`main, address = 0x0000000100000ee0 (lldb) run (lldb) disas -b -p -c 20 ; address hex opcode disassembly -&gt; 0x100000ee0: 55 pushq %rbp 0x100000ee1: 48 89 e5 movq %rsp, %rbp 0x100000ee4: 48 83 ec 20 subq $32, %rsp 0x100000ee8: c7 45 fc 00 00 00 00 movl $0, -4(%rbp) 0x100000eef: 89 7d f8 movl %edi, -8(%rbp) 0x100000ef2: 48 89 75 f0 movq %rsi, -16(%rbp) 0x100000ef6: 81 7d f8 01 00 00 00 cmpl $1, -8(%rbp) 0x100000efd: 0f 8e 16 00 00 00 jle 0x100000f19 ; main + 57 0x100000f03: 48 8d 3d 4c 00 00 00 leaq 76(%rip), %rdi ; &quot;got an arg\n&quot; 0x100000f0a: b0 00 movb $0, %al 0x100000f0c: e8 23 00 00 00 callq 0x100000f34 ; symbol stub for: printf 0x100000f11: 89 45 ec movl %eax, -20(%rbp) 0x100000f14: e9 11 00 00 00 jmpq 0x100000f2a ; main + 74 0x100000f19: 48 8d 3d 42 00 00 00 leaq 66(%rip), %rdi ; &quot;no args\n&quot; 0x100000f20: b0 00 movb $0, %al 0x100000f22: e8 0d 00 00 00 callq 0x100000f34 ; symbol stub for: printf </code></pre> <p>As expected, we load a value, compare it to 1 with <code>cmpl $1, -8(%rbp)</code>, and then print <code>got an arg</code> or <code>no args</code> depending on which way we jump as a result of the compare.</p> <pre><code>$ ./wat-arg no args $ ./wat-arg 1 got an arg </code></pre> <p>If we open up a hex editor and change <code>81 7d f8 01 00 00 00; cmpl 1, -8(%rbp)</code> to <code>81 7d f8 06 00 00 00; cmpl 6, -8(%rbp)</code>, that should cause the program to check for 6 args instead of 1</p> <p><img src="images/edit-binary/edit-binary.png" alt="Replace cmpl with cmpl 6"></p> <pre><code>$ ./wat-arg no args $ ./wat-arg 1 no args $ ./wat-arg 1 2 no args $ ./wat-arg 1 2 3 4 5 6 7 8 got an arg </code></pre> <p>Simple! 
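If you need to make the same edit on a fleet of machines, or you'd rather script it than click around in a hex editor, the whole operation is just a seek and a write. Here's a sketch of a tiny (hypothetical) patch tool; note that it takes a file offset, which is not the same thing as the virtual addresses shown in the disassembly above:</p> <pre><code class="language-c">/* Overwrite one byte of a file at a given offset. Finding the right
   offset is still up to you (objdump, otool, or a hex editor search
   for the surrounding bytes all work). */
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

int main(int argc, char **argv) {
    if (argc != 4) {
        fprintf(stderr, &quot;usage: %s binary offset byte\n&quot;, argv[0]);
        return 1;
    }
    long offset = strtol(argv[2], NULL, 0);           /* accepts 0x... */
    int byte = (int)strtol(argv[3], NULL, 0) &amp; 0xff;

    FILE *f = fopen(argv[1], &quot;r+b&quot;);
    if (!f) { perror(&quot;fopen&quot;); return 1; }
    if (fseek(f, offset, SEEK_SET) != 0) { perror(&quot;fseek&quot;); return 1; }
    if (fputc(byte, f) == EOF) { perror(&quot;fputc&quot;); return 1; }
    fclose(f);
    printf(&quot;wrote 0x%02x at offset 0x%lx\n&quot;, byte, offset);
    return 0;
}
</code></pre> <p>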
If you do this a bit more, you'll soon get in the habit of patching in <code>90</code><sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">3</a></sup> to overwrite things with NOPs. For example, if we replace <code>0f 8e 16 00 00 00; jle</code> and <code>e9 11 00 00 00; jmpq</code> with <code>90</code>, we get the following:</p> <pre><code> 0x100000ee1: 48 89 e5 movq %rsp, %rbp 0x100000ee4: 48 83 ec 20 subq $32, %rsp 0x100000ee8: c7 45 fc 00 00 00 00 movl $0, -4(%rbp) 0x100000eef: 89 7d f8 movl %edi, -8(%rbp) 0x100000ef2: 48 89 75 f0 movq %rsi, -16(%rbp) 0x100000ef6: 81 7d f8 01 00 00 00 cmpl $1, -8(%rbp) 0x100000efd: 90 nop 0x100000efe: 90 nop 0x100000eff: 90 nop 0x100000f00: 90 nop 0x100000f01: 90 nop 0x100000f02: 90 nop 0x100000f03: 48 8d 3d 4c 00 00 00 leaq 76(%rip), %rdi ; &quot;got an arg\n&quot; 0x100000f0a: b0 00 movb $0, %al 0x100000f0c: e8 23 00 00 00 callq 0x100000f34 ; symbol stub for: printf 0x100000f11: 89 45 ec movl %eax, -20(%rbp) 0x100000f14: 90 nop 0x100000f15: 90 nop 0x100000f16: 90 nop 0x100000f17: 90 nop 0x100000f18: 90 nop 0x100000f19: 48 8d 3d 42 00 00 00 leaq 66(%rip), %rdi ; &quot;no args\n&quot; 0x100000f20: b0 00 movb $0, %al 0x100000f22: e8 0d 00 00 00 callq 0x100000f34 ; symbol stub for: printf </code></pre> <p>Note that since we replaced a couple of multi-byte instructions with single byte instructions, the program now has more total instructions.</p> <pre><code>$ ./wat-arg got an arg no args </code></pre> <p>Other common tricks include patching in <code>cc</code> to redirect to an interrupt handler, <code>db</code> to cause a debug breakpoint, knowing which bit to change to flip the polarity of a compare or jump, etc. These things are all detailed in the <a href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html">Intel architecture manuals</a>, but the easiest way to learn these is to develop the muscle memory for them one at a time.</p> <p>Have fun!</p> <div class="footnotes"> <hr /> <ol> <li id="fn:E"><p>I don't actually recommend doing this in an emergency if you haven't done it before. Pushing out a known broken binary that leaks details from future releases is bad, but pushing out an update that breaks your updater is worse. You'll want, at a minimum, a few people who create binary patches in their sleep to code review the change to make sure it looks good, even after running it on a test client.</p> <p>Another solution, not quite as &quot;good&quot;, but much less dangerous, would have been to disable the update server until the new release was ready.</p> <a class="footnote-return" href="#fnref:E"><sup>[return]</sup></a></li> <li id="fn:D">If you don't have $1000 to spare, <a href="https://github.com/radare/radare2">r2</a> is a nice, free, tool with IDA-like functionality. <a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> <li id="fn:1">on x86 <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> That bogus gender gap article gender-gap/ Sun, 09 Mar 2014 00:00:00 +0000 gender-gap/ <p>Last week, Quartz published an article titled <a href="https://news.ycombinator.com/item?id=7334659">“There is no gender gap in tech salaries”</a>. That resulted in linkbait copycat posts all over the internet, from obscure livejournals to <a href="http://www.smithsonianmag.com/smart-news/female-computer-scientists-make-same-salary-their-male-counterparts-180949965/?no-ist">Smithsonian.com</a>. 
The claims are awfully strong, considering that the main study cited only looked at people who graduated with a B.S. exactly one year ago, not to mention the fact that the study makes literally the opposite claim.</p> <p>Let's look at the evidence from the <a href="http://www.aauw.org/files/2013/02/graduating-to-a-pay-gap-the-earnings-of-women-and-men-one-year-after-college-graduation.pdf?_ga=1.7578036.722397424.1379578621">AAUW study</a> that all these posts cite.</p> <p><img src="images/gender-gap/actual_result.png" alt="Who are you going to believe, me or your lying eyes?" width="626" height="390"></p> <p></p> <p>Looks like women make 88% of what men do in “engineering and engineering technology” and 77% of what men do in “computer and information sciences”.</p> <p>The study controls for a number of factors to try to find the source of the pay gap. It finds that after controlling for self-reported hours worked, type of employment, and quality of school, “over one-third of the pay gap cannot be explained by any of these factors and appears to be attributable to gender alone”. One-third is not zero, nor is one-third of 12% or 23%. If that sounds small, consider an average raise in the post-2008 economy and how many years of experience that one-third of 23% turns into.</p> <p>The Quartz article claims that, since the entire gap can be explained by some variables, the gap is by choice. In fact, the study explicitly calls out that view as being false, citing <a href="http://scholar.google.com/scholar_case?case=2721168065773704326">Stender v. Lucky Stores</a> and a <a href="http://ideas.repec.org/a/bpj/evoice/v4y2007i4n5.html">related study</a><sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">1</a></sup>, saying that “The case illustrates how discrimination can play a role in the explained portion of the pay gap when employers mistakenly assume that female employees prefer lower-paid positions traditionally held by women and --intentionally or not--place men and women into different jobs, ensuring higher pay for men and lower pay for women”. Women do not, in fact, just want lower paying jobs; this is, once again, diametrically opposed to the claims in the Quartz article.</p> <p>Note that the study selectively controls for factors that reduce the pay gap, but not for factors that increase it. For instance, the study notes that “Women earn higher grades in college, on average, than men do, so academic achievement does not help us understand the gender pay gap”. Adjusting for grades would increase the pay gap; adjusting for all possible confounding factors, not only the factors that reduce the gap, would only make the adjusted pay gap larger.</p> <p>The AAUW study isn't the only evidence the Quartz post cites. To support the conclusion that “Despite strong evidence suggesting gender pay equality, there is still a general perception that women earn less than men do”, the Quartz author cites three additional pieces of evidence. First, the BLS figure that, “when measured hourly, not annually, the pay gap between men and women is 14% not 23%”; 14% is not 0%. Second, <a href="http://www.bls.gov/cps/cpswom2012.pdf?_ga=1.211369494.241714935.1393863612">a BLS report</a> that indicates that men make more than women, cherry picking a single figure where women do better than men (“women who work between 30 and 39 hours a week … see table 4”); this claim is incorrect<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>. 
Third, a study from the 80s which is directly contradicted by the AAUW report from 2012; the older study indicates that cohort effects are responsible for the gender gap, but the AAUW report shows a gender gap despite studying only a single cohort.</p> <p>The Smithsonian Mag published a correction in response to criticism about their article, but most of the misinformed articles remain uncorrected.</p> <p>It's clear that the author of the Quartz piece had an agenda in mind, picked out evidence that supported that agenda, and wrote a blog post. A number of bloggers picked up the post and used its thesis as link bait to drive hits to their sites, without reading any of the cited evidence. If this is how “digitally native news” works, I'm opting out.</p> <p><strong>If you liked reading this, you might also enjoy <a href="//danluu.com/tech-discrimination/">this post on the interaction of markets with discrimination</a>, and <a href="//danluu.com/teach-debugging/">this post, which has a very partial explanation of why so many people drop out of science and engineering</a>.</strong></p> <h3 id="updates">Updates</h3> <p><em>Update: A correction! I avoided explicitly linking to the author of the original article, because I find the sort of twitter insults and witch hunts that often pop up to be unconstructive, and this is really about</em> what's <em>right and not</em> who's <em>right. The author obviously disagrees because I saw no end of insults until I blocked the author.</em></p> <p><em>Charlie Clarke was kind enough to wade through the invective and decode the author's one specific claim about my illiteracy, which was that footnote 3 was not rounded from 110.3 to 111. It turns out that instead of rounding from 110.3 to 111, the author of the article cited the wrong source entirely and the other source just happened to have a number that was similar to 111.</em></p> <div class="footnotes"> <hr /> <ol> <li id="fn:B">There's plenty of good news in this study. The gender gap has gotten much smaller over the past forty years. There's room for a nuanced article that explores why things improved, and why certain aspects have improved while others have remained stubbornly stuck in the 70s. I would love to read that article. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:2">The Quartz article claims that “women who work 30 to 39 hours per week make 111% of what men make (see table 4)”. Table 4 is a breakdown of part-time workers. There is no 111% anywhere in the table, unless 110.3% is rounded to 111%; perhaps the author is referring to the racial breakdown in the table, which indicates that among Asian part-time workers, women earn 110.3% of what men do per hour. Note that Table 3, showing a breakdown of full-time workers (who are the vast majority of workers) indicates that women earn much less than men when working full time. To find a figure that supports the author's agenda, the author had to not only look at part time workers, but only look at part-time Asian women, and then round .3% up to 1%. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> </ol> </div> That time Oracle tried to have a professor fired for benchmarking their database anon-benchmark/ Wed, 05 Mar 2014 00:00:00 +0000 anon-benchmark/ <p>In 1983, at the University of Wisconsin, Dina Bitton, David DeWitt, and Carolyn Turbyfill created a <a href="http://pages.cs.wisc.edu/~dewitt/includes/benchmarking/vldb83.pdf">database benchmarking framework</a>.
Some of their results included (lower is better):</p> <p>Join without indices</p> <style>table {border-collapse:collapse;margin:0px auto;}table,th,td {border: 1px solid black;}td {text-align:center;}</style> <table> <thead> <tr> <th>system</th> <th>joinAselB</th> <th>joinABprime</th> <th>joinCselAselB</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>10.2</td> <td>9.6</td> <td>9.4</td> </tr> <tr> <td>C-INGRES</td> <td>1.8</td> <td>2.6</td> <td>2.1</td> </tr> <tr> <td>ORACLE</td> <td>&gt; 300</td> <td>&gt; 300</td> <td>&gt; 300</td> </tr> <tr> <td>IDMnodac</td> <td>&gt; 300</td> <td>&gt; 300</td> <td>&gt; 300</td> </tr> <tr> <td>IDMdac</td> <td>&gt; 300</td> <td>&gt; 300</td> <td>&gt; 300</td> </tr> <tr> <td>DIRECT</td> <td>10.2</td> <td>9.5</td> <td>5.6</td> </tr> <tr> <td>SQL/DS</td> <td>2.2</td> <td>2.2</td> <td>2.1</td> </tr> </tbody> </table> <p>Join with indices, primary (clustered) index</p> <table> <thead> <tr> <th>system</th> <th>joinAselB</th> <th>joinABprime</th> <th>joinCselAselB</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>2.11</td> <td>1.66</td> <td>9.07</td> </tr> <tr> <td>C-INGRES</td> <td>0.9</td> <td>1.71</td> <td>1.07</td> </tr> <tr> <td>ORACLE</td> <td>7.94</td> <td>7.22</td> <td>13.78</td> </tr> <tr> <td>IDMnodac</td> <td>0.52</td> <td>0.59</td> <td>0.74</td> </tr> <tr> <td>IDMdac</td> <td>0.39</td> <td>0.46</td> <td>0.58</td> </tr> <tr> <td>DIRECT</td> <td>10.21</td> <td>9.47</td> <td>5.62</td> </tr> <tr> <td>SQL/DS</td> <td>0.92</td> <td>1.08</td> <td>1.33</td> </tr> </tbody> </table> <p>Join with indices, secondary (non-clustered) index</p> <table> <thead> <tr> <th>system</th> <th>joinAselB</th> <th>joinABprime</th> <th>joinCselAselB</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>4.49</td> <td>3.24</td> <td>10.55</td> </tr> <tr> <td>C-INGRES</td> <td>1.97</td> <td>1.80</td> <td>2.41</td> </tr> <tr> <td>ORACLE</td> <td>8.52</td> <td>9.39</td> <td>18.85</td> </tr> <tr> <td>IDMnodac</td> <td>1.41</td> <td>0.81</td> <td>1.81</td> </tr> <tr> <td>IDMdac</td> <td>1.19</td> <td>0.59</td> <td>1.47</td> </tr> <tr> <td>DIRECT</td> <td>10.21</td> <td>9.47</td> <td>5.62</td> </tr> <tr> <td>SQL/DS</td> <td>1.62</td> <td>1.4</td> <td>2.66</td> </tr> </tbody> </table> <p>Projection (duplicate tuples removed)</p> <table> <thead> <tr> <th>system</th> <th>100/10000</th> <th>1000/10000</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>64.6</td> <td>236.8</td> </tr> <tr> <td>C-INGRES</td> <td>26.4</td> <td>132.0</td> </tr> <tr> <td>ORACLE</td> <td>828.5</td> <td>199.8</td> </tr> <tr> <td>IDMnodac</td> <td>29.3</td> <td>122.2</td> </tr> <tr> <td>IDMdac</td> <td>22.3</td> <td>68.1</td> </tr> <tr> <td>DIRECT</td> <td>2068.0</td> <td>58.0</td> </tr> <tr> <td>SQL/DS</td> <td>28.8</td> <td>28.0</td> </tr> </tbody> </table> <p>Aggregate without indices</p> <table> <thead> <tr> <th>system</th> <th>MIN scalar</th> <th>MIN agg fn 100 parts</th> <th>SUM agg fn 100 parts</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>40.2</td> <td>176.7</td> <td>174.2</td> </tr> <tr> <td>C-INGRES</td> <td>34.0</td> <td>495.0</td> <td>484.4</td> </tr> <tr> <td>ORACLE</td> <td>145.8</td> <td>1449.2</td> <td>1487.5</td> </tr> <tr> <td>IDMnodac</td> <td>32.0</td> <td>65.0</td> <td>67.5</td> </tr> <tr> <td>IDMdac</td> <td>21.2</td> <td>38.2</td> <td>38.2</td> </tr> <tr> <td>DIRECT</td> <td>41.0</td> <td>227.0</td> <td>229.5</td> </tr> <tr> <td>SQL/DS</td> <td>19.8</td> <td>22.5</td> <td>23.5</td> </tr> </tbody> </table> <p>Aggregate with indices</p> <table> <thead> <tr>
<th>system</th> <th>MIN scalar</th> <th>MIN agg fn 100 parts</th> <th>SUM agg fn 100 parts</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>41.2</td> <td>186.5</td> <td>182.2</td> </tr> <tr> <td>C-INGRES</td> <td>37.2</td> <td>242.2</td> <td>254.0</td> </tr> <tr> <td>ORACLE</td> <td>160.5</td> <td>1470.2</td> <td>1446.5</td> </tr> <tr> <td>IDMnodac</td> <td>27.0</td> <td>65.0</td> <td>66.8</td> </tr> <tr> <td>IDMdac</td> <td>21.2</td> <td>38.0</td> <td>38.0</td> </tr> <tr> <td>DIRECT</td> <td>41.0</td> <td>227.0</td> <td>229.5</td> </tr> <tr> <td>SQL/DS</td> <td>8.5</td> <td>22.8</td> <td>23.8</td> </tr> </tbody> </table> <p>Selection without indices</p> <table> <thead> <tr> <th>system</th> <th>100/10000</th> <th>1000/10000</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>53.2</td> <td>64.4</td> </tr> <tr> <td>C-INGRES</td> <td>38.4</td> <td>53.9</td> </tr> <tr> <td>ORACLE</td> <td>194.2</td> <td>230.6</td> </tr> <tr> <td>IDMnodac</td> <td>31.7</td> <td>33.4</td> </tr> <tr> <td>IDMdac</td> <td>21.6</td> <td>23.6</td> </tr> <tr> <td>DIRECT</td> <td>43.0</td> <td>46.0</td> </tr> <tr> <td>SQL/DS</td> <td>15.1</td> <td>38.1</td> </tr> </tbody> </table> <p>Selection with indices</p> <table> <thead> <tr> <th>system</th> <th>100/10000 clustered</th> <th>1000/10000 clustered</th> <th>100/10000</th> <th>1000/10000</th> </tr> </thead> <tbody> <tr> <td>U-INGRES</td> <td>7.7</td> <td>27.8</td> <td>59.2</td> <td>78.9</td> </tr> <tr> <td>C-INGRES</td> <td>3.9</td> <td>18.9</td> <td>11.4</td> <td>54.3</td> </tr> <tr> <td>ORACLE</td> <td>16.3</td> <td>130.0</td> <td>17.3</td> <td>129.2</td> </tr> <tr> <td>IDMnodac</td> <td>2.0</td> <td>9.9</td> <td>3.8</td> <td>27.6</td> </tr> <tr> <td>IDMdac</td> <td>1.5</td> <td>8.7</td> <td>3.3</td> <td>23.7</td> </tr> <tr> <td>DIRECT</td> <td>43.0</td> <td>46.0</td> <td>43.0</td> <td>46.0</td> </tr> <tr> <td>SQL/DS</td> <td>3.2</td> <td>27.5</td> <td>12.3</td> <td>39.2</td> </tr> </tbody> </table> <p>In case you're not familiar with the database universe as of 1983: at the time, <code>INGRES</code> was a research project by Stonebraker and Wong at Berkeley that had been commercialized. <code>C-INGRES</code> is the commercial version and <code>U-INGRES</code> is the university version. <code>IDM*</code> are the <code>IDM/500</code> database machine, the first widely used commercial database machine; <code>dac</code> is with a &quot;database accelerator&quot; and <code>nodac</code> is without. <code>DIRECT</code> was a research project in database machines that was started by DeWitt in 1977.</p> <p>In Bitton et al.'s work, Oracle's performance stood out as unusually poor.</p> <p>Larry Ellison wasn't happy with the results and it's said that he <a href="https://starcounter.com/the-story-of-professor-dewitt/">tried to have</a> <a href="http://www.citeulike.org/user/marclijour/article/6883245">DeWitt fired</a>. Given how difficult it is to fire professors even when there's actual misconduct, the probability of Ellison successfully getting someone fired for doing legitimate research in their field was pretty much zero. It's also said that, after DeWitt's non-firing, Larry banned Oracle from hiring Wisconsin grads and Oracle added a term to their EULA forbidding the publication of benchmarks. Over the years, many major commercial database vendors added a license clause that made benchmarking their database illegal.</p> <!---more---> <p>Today, Oracle hires from Wisconsin, but Oracle still forbids benchmarking of their database.
Oracle's shockingly poor performance and Larry Ellison's response have gone down in history; anti-benchmarking clauses are now often known as &quot;DeWitt Clauses&quot;, and they've spread from databases to all software, from <a href="https://news.ycombinator.com/item?id=7306121">compilers</a> to cloud offerings<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>.</p> <p>Meanwhile, Bitcoin users have created <a href="http://rt.com/news/bitcoin-assassination-market-anarchist-983/">anonymous markets for assassinations</a> -- users can put money into a pot that gets paid out to the assassin who kills a particular target.</p> <p>Anonymous assassination markets appear to be a joke, but how about anonymous markets for benchmarks? People who want to know what kind of performance a database offers under a certain workload put money into a pot that gets paid out to whoever runs the benchmark.</p> <p>With things as they are now, you often see comments and blog posts about how someone was using <code>postgres</code> until management made them switch to &quot;some commercial database&quot; which had much worse performance, and it's hard to tell if the terrible database was Oracle, MS SQL Server, or perhaps another database.</p> <p>If we look at major commercial databases today, two out of the three big names in commercial databases forbid publishing benchmarks. Microsoft's SQL Server EULA says:</p> <blockquote> <p>You may not disclose the results of any benchmark test ... without Microsoft’s prior written approval</p> </blockquote> <p>Oracle says:</p> <blockquote> <p>You may not disclose results of any Program benchmark tests without Oracle’s prior consent</p> </blockquote> <p>IBM is notable for actually allowing benchmarks:</p> <blockquote> <p>Licensee may disclose the results of any benchmark test of the Program or its subcomponents to any third party provided that Licensee (A) publicly discloses the complete methodology used in the benchmark test (for example, hardware and software setup, installation procedure and configuration files), (B) performs Licensee's benchmark testing running the Program in its Specified Operating Environment using the latest applicable updates, patches and fixes available for the Program from IBM or third parties that provide IBM products (&quot;Third Parties&quot;), and (C) follows any and all performance tuning and &quot;best practices&quot; guidance available in the Program's documentation and on IBM's support web sites for the Program...</p> </blockquote> <p>This gives people ammunition for a meta-argument that IBM probably delivers better performance than either Oracle or Microsoft, since they're the only company that's not scared of people publishing benchmark results, but it would be nice if we had actual numbers.</p> <p><i>Thanks to Leah Hanson and Nathan Wailes for comments/corrections/discussion.</i></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1"><p>There's at least one cloud service that disallows not only publishing benchmarks, but even &quot;competitive benchmarking&quot;, running benchmarks to see how well the competition does.
As a result, there's a product I'm told I shouldn't use to avoid even the appearance of impropriety because I work in an office with people who work on cloud-related infrastructure.</p> <p>An example of a clause like this is the following term in the Salesforce agreement:</p> <blockquote> <p>You may not access the Services for purposes of monitoring their availability, performance or functionality, or for any other benchmarking or competitive purposes.</p> </blockquote> <p>If you ever wondered why uptime &quot;benchmarking&quot; services like cloudharmony don't include Salesforce, this is probably why. You will sometimes see speculation that Salesforce and other companies with these terms know that their service is so poor that public benchmarks would hurt them more than being known to be afraid of public benchmarks does.</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> Why don't schools teach debugging? teach-debugging/ Sat, 08 Feb 2014 00:00:00 +0000 teach-debugging/ <p>In the fall of 2000, I took my first engineering class: <a href="http://courses.engr.wisc.edu/ece/ece352.html">ECE 352</a>, an entry-level digital design class for first-year computer engineers. It was standing room only, filled with waitlisted students who would find seats later in the semester as people dropped out. We had been warned in orientation that half of us wouldn't survive the year. In class, we were warned again that half of us were doomed to fail, and that ECE 352 was the weed-out class that would be responsible for much of the damage.</p> <p>The class moved briskly. The first lecture wasted little time on matters of the syllabus, quickly diving into the real course material. Subsequent lectures built on previous lectures; anyone who couldn't grasp one had no chance at the next. Projects began after two weeks, and also built upon their predecessors; anyone who didn't finish one had no hope of doing the next.</p> <p>A friend of mine and I couldn't understand why some people were having so much trouble; the material seemed like common sense. The <a href="http://c2.com/cgi/wiki?FeynmanAlgorithm">Feynman Method</a> was the only tool we needed.</p> <blockquote> <ol> <li>Write down the problem</li> <li>Think real hard</li> <li>Write down the solution</li> </ol> </blockquote> <p>The Feynman Method failed us on the last project: the design of a <a href="http://i.stanford.edu/pub/cstr/reports/csl/tr/95/675/CSL-TR-95-675.pdf">divider</a>, a real-world-scale project an order of magnitude more complex than anything we'd been asked to tackle before. On the day he assigned the project, the professor exhorted us to begin early. Over the next few weeks, we heard rumors that some of our classmates worked day and night without making progress.</p> <p>But until 6pm the night before the project was due, my friend and I ignored all this evidence. It didn't surprise us that people were struggling because half the class had trouble with all of the assignments. We were in the half that breezed through everything. We thought we'd start the evening before the deadline and finish up in time for dinner.</p> <p>We were wrong.</p> <p>An hour after we thought we'd be done, we'd barely started; neither of us had a working design. Our failures were different enough that we couldn't productively compare notes.
The lab, packed with people who had been laboring for weeks alongside those of us who waited until the last minute, was full of bad news: a handful of people had managed to produce a working division unit on the first try, but no one had figured out how to convert an incorrect design into something that could do third-grade arithmetic.</p> <p>I proceeded to apply the only tool I had: thinking really hard. That method, previously infallible, now yielded nothing but confusion because the project was too complex to visualize in its entirety. I tried thinking about the parts of the design separately, but that only revealed that the problem was in some interaction between the parts; I could see nothing wrong with each individual component. Thinking about the relationship between pieces was an exercise in frustration, a continual feeling that the solution was just out of reach, as concentrating on one part would push some other critical piece of knowledge out of my head. The following semester I would acquire enough experience in managing complexity and thinking about collections of components as black-box abstractions that I could reason about a design another order of magnitude more complicated without problems — but that was three long winter months of practice away, and this night I was at a loss for how to proceed.</p> <p>By 10pm, I was starving and out of ideas. I rounded up people for dinner, hoping to get a break from thinking about the project, but all we could talk about was how hopeless it was. How were we supposed to finish when the only approach was to flawlessly assemble thousands of parts without a single misstep? It was a tedious version of a deranged Atari game with no lives and no continues. Any mistake was fatal.</p> <p>A number of people resolved to restart from scratch; they decided to work in pairs to check each other's work. I was too stubborn to start over and too inexperienced to know what else to try. After getting back to the lab, now half empty because so many people had given up, I resumed staring at my design, as if thinking about it for a third hour would reveal some additional insight.</p> <p>It didn't. Nor did the fourth hour.</p> <p>And then, just after midnight, a number of our newfound buddies from dinner reported successes. Half of those who started from scratch had working designs. Others were despondent, because their design was still broken in some subtle, non-obvious way. As I talked with one of those students, I began poring over his design. And after a few minutes, I realized that the Feynman method wasn't the only way forward: it should be possible to systematically apply a mechanical technique repeatedly to find the source of our problems. Beneath all the abstractions, our projects consisted purely of NAND gates (woe to those who dug around our toolbox enough to uncover dynamic logic), which output a 0 only when both inputs are 1. If the correct output is 0, both inputs should be 1. If the output is, incorrectly, 1, then at least one of the inputs must incorrectly be 0. The same logic can then be applied with the opposite polarity. We did this recursively, finding the source of all the problems in both our designs in under half an hour.</p> <p>We excitedly explained our newly discovered technique to those around us, walking them through a couple steps. No one had trouble; not even people who'd struggled with every previous assignment.
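The procedure is mechanical enough to fit in a few lines of code. Here's a sketch of the idea on a made-up NAND netlist: walk backwards from a wrong output until you reach the first node whose observed value can't be explained by its inputs:</p> <pre><code class="language-c">/* Back-tracing sketch: every node is either a primary input or the
   NAND of two other nodes. Given the values a (buggy) design
   actually produced and the values we expected, walk backwards from
   a wrong output to the first node that disagrees with its own
   inputs -- that's where the design is broken. The tiny netlist
   below is invented for illustration. */
#include &lt;stdio.h&gt;

typedef struct {
    const char *name;
    int in0, in1;   /* input node indices, or -1 for a primary input */
    int expected;   /* value the node should have                    */
    int observed;   /* value the buggy design actually produced      */
} node_t;

/* c = NAND(a, b), e = NAND(c, d), f = NAND(e, e). Node e's observed
   value is wrong even though its inputs are right, modeling a
   mistake in the logic that drives e. */
static node_t nodes[] = {
    {&quot;a&quot;, -1, -1, 1, 1},
    {&quot;b&quot;, -1, -1, 1, 1},
    {&quot;c&quot;,  0,  1, 0, 0},
    {&quot;d&quot;, -1, -1, 1, 1},
    {&quot;e&quot;,  2,  3, 1, 0},
    {&quot;f&quot;,  4,  4, 0, 1},
};

static void trace(int i) {
    if (nodes[i].observed == nodes[i].expected) return;
    if (nodes[i].in0 &lt; 0) {
        printf(&quot;primary input %s is wrong: check how it's driven\n&quot;,
               nodes[i].name);
        return;
    }
    int i0 = nodes[i].in0, i1 = nodes[i].in1;
    /* What a correct NAND of the observed inputs would produce. */
    int implied = !(nodes[i0].observed &amp;&amp; nodes[i1].observed);
    int in0_ok = nodes[i0].observed == nodes[i0].expected;
    int in1_ok = nodes[i1].observed == nodes[i1].expected;
    if (nodes[i].observed != implied || (in0_ok &amp;&amp; in1_ok)) {
        printf(&quot;node %s disagrees with its inputs: the bug is here\n&quot;,
               nodes[i].name);
        return;
    }
    /* Otherwise at least one input is itself wrong; follow it. */
    if (!in0_ok) trace(i0); else trace(i1);
}

int main(void) {
    trace(5);   /* start from the wrong top-level output, f */
    return 0;
}
</code></pre> <p>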
Within an hour, the group of folks within earshot of us had finished, and we went home.</p> <p>I understand now why half the class struggled with the earlier assignments. Without an explanation of how to systematically approach problems, anyone who didn't intuitively grasp the correct solution was in for a semester of frustration. People who were, like me, above average but not great, skated through most of the class and either got lucky or wasted a huge chunk of time on the final project. I've even seen people talented enough to breeze through the entire degree without ever running into a problem too big to intuitively understand; those people have a very bad time when they run into a 10 million line codebase in the real world. <a href="http://lwn.net/2000/0824/a/esr-sharing.php3">The more talented the engineer, the more likely they are to hit a debugging wall outside of school</a>.</p> <p>What I don't understand is why schools don't teach systematic debugging. It's one of the most fundamental skills in engineering: start at the symptom of a problem and trace backwards to find the source. It takes, at most, half an hour to teach the absolute basics – and even that little bit would be enough to save a significant fraction of those who wash out and switch to non-STEM majors. Using the standard engineering class sequence of progressively more complex problems, a focus on debugging could expand to fill up to a semester, which would be enough to cover an obnoxious real-world bug: perhaps there's a system that crashes once a day when a Blu-ray DVD is repeatedly played using hardware acceleration with a specific video card while two webcams record something with significant motion, as long as an obscure benchmark from 1994 is running<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">1</a></sup>.</p> <p>This dynamic isn't unique to ECE 352, or even Wisconsin – I saw the same thing when I TA'ed <a href="https://engineering.purdue.edu/~ee202/pastexams/F13FE.pdf">EE 202</a>, a second year class on signals and systems at Purdue. The problems were FFTs and Laplace transforms instead of dividers and Boolean<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">2</a></sup>, but the avoidance of teaching fundamental skills was the same. It was clear, from the questions students asked me in office hours, that those who were underperforming weren't struggling with the fundamental concepts in the class, but with algebra: the problems were caused by not having an intuitive understanding of, for example, the difference between <code>f(x+a)</code> and <code>f(x)+a</code>.</p> <p>When I suggested to the professor<sup class="footnote-ref" id="fnref:Z"><a rel="footnote" href="#fn:Z">3</a></sup> that he spend half an hour reviewing algebra for those students who never had the material covered cogently in high school, I was told in no uncertain terms that it would be a waste of time because some people just can't hack it in engineering. I was told that I wouldn't be so naive once the semester was done, because some people just can't hack it in engineering. I was told that helping students with remedial material was doing them no favors; they wouldn't be able to handle advanced courses anyway because some students just can't hack it in engineering.
I was told that Purdue has a loose admissions policy and that I should expect a high failure rate, because some students just can't hack it in engineering.</p> <p>I agreed that a few students might take an inordinately large amount of help, but it would be strange if people who were capable of the staggering amount of memorization required to pass first year engineering classes plus calculus without deeply understanding algebra couldn't then learn to understand the algebra they had memorized. I'm no great teacher, but I was able to get all but one of the office hour regulars up to speed over the course of the semester. An experienced teacher, even one who doesn't care much for teaching, could have easily taught the material to everyone.</p> <p>Why do we leave material out of classes and then fail students who can't figure out that material for themselves? Why do we make the first couple years of an engineering major some kind of hazing ritual, instead of simply teaching people what they need to know to be good engineers? For all the high-level talk about how we need to plug the <a href="http://www.vtcite.com/~vtcite/system/files/blickenstaff-1.pdf">leaks</a> <a href="http://www.itif.org/files/2010-refueling-innovation-exec-summ.pdf">in</a> <a href="http://digitalcommons.ilr.cornell.edu/cgi/viewcontent.cgi?article=1137&amp;context=workingpapers&amp;sei-redir=1&amp;referer=http%3A%2F%2Fscholar.google.com%2Fscholar_url%3Fhl%3Den%26q%3Dhttp%3A%2F%2Fdigitalcommons.ilr.cornell.edu%2Fcgi%2Fviewcontent.cgi%253Farticle%253D1137%2526context%253Dworkingpapers%26sa%3DX%26scisig%3DAAGBfm1oWNg44e2TO8gFdV0XAK2yRRHpfQ%26oi%3Dscholarr#search=%22http%3A%2F%2Fdigitalcommons.ilr.cornell.edu%2Fcgi%2Fviewcontent.cgi%3Farticle%3D1137%26context%3Dworkingpapers%22">our</a> <a href="http://www.tandfonline.com/doi/abs/10.1080/02783190903386553#.UvapAHddU_0">STEM</a> <a href="http://eric.ed.gov/?id=EJ789574">education</a> <a href="http://books.google.com/books?hl=en&amp;lr=&amp;id=zgsvXQ4GbfkC&amp;oi=fnd&amp;pg=PA3&amp;dq=leaky+stem+education&amp;ots=24p6ZeYtp4&amp;sig=t2HNOrp2P9BGwq44w3k7mwZjteE#v=onepage&amp;q=leaky%20stem%20education&amp;f=false">pipeline</a>, not only are we not plugging the holes, we're proud of how fast the pipeline is leaking.</p> <p><small>Thanks to Kelley Eskridge, @brcpo9, and others for comments/corrections.</p> <h3 id="elsewhere">Elsewhere</h3> <ul> <li><a href="https://blog.regehr.org/archives/849">John Regehr with four debugging book recommendations</a></li> </ul> <div class="footnotes"> <hr /> <ol> <li id="fn:2"><p>This is an actual CPU bug I saw that took about a month to track down. And this is the easy form of the bug, with a set of ingredients that causes the fail to be reproduced about once a day - the original form of the bug only failed once every few days. I'm not picking this example because it's particularly hard, either: I can think of plenty of bugs that took longer to track down and had stranger symptoms, including a disastrous bug that took six months for our best debugger to understand.</p> <p>For ASIC post-silicon debug folks out there, this chip didn't have anything close to full scan, and our only method of dumping state out of the chip perturbed the state of the chip enough to make some bugs disappear. Good times. On the bright side, after dealing with non-deterministic hardware bugs with poor state visibility, software bugs seem easy. 
At worst, they're boring and tedious because debugging them is a matter of tracing things backwards to the source of the issue.</p> <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:1">A co-worker of mine told me about a time at <a href="http://en.wikipedia.org/wiki/Cray">Cray</a> when a high-level PM referred to the lack of engineering resources by saying that the project “needed more Boolean.” Ever since, I've thought of digital designers as people who consume caffeine and produce Boolean. I'm still not sure what analog magicians produce. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:Z">When I TA'd EE 202, there were two separate sections taught by two different professors. The professor who told me that students who fail just can't hack it was the professor who was more liked by students. He's affable and charismatic and people like him. Grades in his section were also lower than grades under the professor who people didn't like because he was thought to be mean. TA'ing this class taught me quite a bit: that people have no idea who's doing a good job and who's helping them, and also basic signals and systems (I took signals and systems I as an undergrad to fulfill a requirement and showed up to exams and passed them without learning any of the material, so to walk students through signals and systems II, I had to actually learn the material from both signals and systems I and II; before TA'ing the course, I told the department I hadn't taken the class and should probably TA a different class, but they didn't care, which taught me another good life lesson). <a class="footnote-return" href="#fnref:Z"><sup>[return]</sup></a></li> </ol> </div> Do programmers need math? math-bias/ Thu, 09 Jan 2014 00:00:00 +0000 math-bias/ <p>Dear <a href="http://dpb.bitbucket.org/">David</a>,</p> <p>I'm afraid my off the cuff response the other day wasn't too well thought out; when you talked about taking calc III and linear algebra, and getting resistance from one of your friends because &quot;wolfram alpha can do all of that now,&quot; my first reaction was horror-- which is why I replied that while I've often regretted not taking a class seriously because I've later found myself in a situation where I could have put the skills to good use, I've never said to myself &quot;<a href="http://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.html">what a waste of time it was to learn that fundamental mathematical concept and use it enough that I truly understand it</a>.&quot;</p> <p></p> <p>But could this be selection bias? It's easier to recall the math that I use than the math I don't. To check, let's look at the nine math classes I took as an undergrad. If I exclude the jobs I've had that are obviously math oriented (pure math and CS theory, plus femtosecond optics), and consider only whether I've used math skills in non-math-oriented work, here's what I find: three classes whose material I've used daily for months or years on end (Calc I/II, Linear Algebra, and Calc III); three classes that have been invaluable for short bursts (Combinatorics, Error Correcting Codes, and Computational Learning Theory); one course I would have had use for had I retained any of the relevant information when I needed it (Graduate Level Matrix Analysis); one class whose material I've only relied on once (Mathematical Economics); and only one class I can't recall directly applying to any non-math-y work (Real Analysis).
Here's how I ended up using these:</p> <p><strong>Calculus I/II<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup></strong>: critical for dealing with real physical things as well as physically inspired algorithms. Moreover, one of my most effective tricks is substituting a Taylor or Remez series (or some other approximation function) for a complicated function, where the error bounds aren't too high and great speed is required.</p> <p><strong>Linear Algebra</strong>: although I've gone years without, it's hard to imagine being able to dodge linear algebra for the rest of my career because of <a href="http://walpurgisriot.github.io/blog/2014/01/03/matrix-meditation.html">how general matrices are</a>.</p> <p><strong>Calculus III</strong>: same as Calc I/II.</p> <p><strong>Combinatorics</strong>: useful for impressing people in interviews, if nothing else. Most of my non-interview use of combinatorics comes from seeing simplifications of seemingly complicated problems; combines well with probability and <a href="randomize-hn">randomized algorithms</a>.</p> <p><strong>Error Correcting Codes</strong>: there's no substitute when you need <a href="http://en.wikipedia.org/wiki/Error-correcting_code">ECC</a>. More generally, information theory is invaluable.</p> <p><strong>Graduate Level Matrix Analysis</strong>: had a decade long gap between learning this and working on something where the knowledge would be applicable. Still worthwhile, though, for the same reason Linear Algebra is important.</p> <p><strong>Real Analysis</strong>: can't recall any direct applications, although this material is useful for understanding topology and measure theory.</p> <p><strong>Computational Learning Theory</strong>: useful for making the parts of machine learning people think are scary quite easy, and for providing an intuition for areas of ML that are more alchemy than engineering.</p> <p><strong>Mathematical Economics</strong>: <a href="http://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multipliers</a> have come in handy sometimes, but more for engineering than programming.</p> <p>Seven out of nine. Not bad. So I'm not sure how to reconcile my experience with the common sentiment that, outside of a handful of esoteric areas like computer graphics and machine learning, there is <a href="https://news.ycombinator.com/item?id=6949849">no</a> <a href="https://news.ycombinator.com/item?id=519928">need</a> <a href="http://www.catb.org/~esr/faqs/hacker-howto.html#mathematics">to</a> <a href="https://news.ycombinator.com/item?id=3040286">understand</a> <a href="https://news.ycombinator.com/item?id=571090">textbook</a> <a href="http://www.reddit.com/r/programming/comments/1250eg/how_to_crack_the_toughest_coding_interviews_by/c6sb39l">algorithms</a>, let alone more abstract concepts like math.</p> <p>Part of it is selection bias in the jobs I've landed; companies that do math-y work are more likely to talk to me. A couple weeks ago, I had a long discussion with a group of our old <a href="https://www.hackerschool.com/">Hacker School</a> friends, who now do a lot of recruiting at career fairs; a couple of them, whose companies don't operate at the intersection of research and engineering, mentioned that they politely try to end the discussion when they run into someone like me because they know that I won't take a job with them<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">2</a></sup>.</p> <p>But it can't all be selection bias. 
I've gotten a lot of mileage out of math even in jobs that are not at all mathematical in nature. Even in low-level systems work that's as far removed from math as you can get, it's not uncommon to find a simple combinatorial proof to show that a solution that seems too stupid to be correct is actually optimal, or correct with high probability; even when doing work that's far outside the realm of numerical methods, it sometimes happens that the bottleneck is a function that can be more quickly computed using some freshman level approximation technique like a Taylor expansion or Newton's method.</p> <p>Looking back at my career, I've gotten more bang for the buck from understanding algorithms and computer architecture than from understanding math, but I really enjoy math and I'm glad that knowing a bit of it has biased my career towards more mathematical jobs, and handed me some mathematical interludes in profoundly non-mathematical jobs.</p> <p>All things considered, my real position is a bit more relaxed than I thought: if you enjoy math, taking more classes for the pure joy of solving problems is worthwhile, but math classes aren't the best use of your time if your main goal is to transition from an academic career to programming.</p> <p><br><br> Cheers,<br> Dan</p> <p><a href="http://itmozg.ru/news/1232/">Russian translation available here</a></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">A brilliant but mad lecturer crammed both semesters of the theorem/proof-oriented <a href="http://www.amazon.com/gp/product/0471000051/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0471000051&amp;linkCode=as2&amp;tag=abroaview-20">Apostol text</a> into two months and then started lecturing about complex analysis when we ran out of book. I didn't realize that math is fun until I took this class. This footnote really ought to be on the class name, but rdiscount doesn't let you put a footnote on or in bolded text. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:3">This is totally untrue, by the way. It would be super neat to see what a product oriented role is like. As it is now, I'm five teams removed from any actual customer. Oh well. I'm one step closer than I was in my last job. <a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li> </ol> </div> Data alignment and caches 3c-conflict/ Thu, 02 Jan 2014 00:00:00 +0000 3c-conflict/ <p>Here's the graph of <a href="https://github.com/danluu/setjmp-longjmp-ucontext-snippets/tree/master/benchmark/cache-conflict">a toy benchmark</a><sup class="footnote-ref" id="fnref:A"><a rel="footnote" href="#fn:A">1</a></sup> of page-aligned vs. mis-aligned accesses; it shows a ratio of performance between the two at different working set sizes.
If this benchmark seems contrived, it actually comes from a real world example of the <a href="https://groups.google.com/forum/#!msg/comp.arch/_uecSnSEQc4/mvfRnOvIyzUJ">disastrous performance implications of using nice power of 2 alignment, or page alignment in an actual system</a><sup class="footnote-ref" id="fnref:M"><a rel="footnote" href="#fn:M">2</a></sup>.</p> <p><img src="images/3c-conflict/sandy.png" alt="Graph of Sandy Bridge Performance" width="504" height="216"> <img src="images/3c-conflict/westmere.png" alt="Graph of Westmere Performance" width="504" height="504"></p> <p>Except for very small working sets (1-8), the unaligned version is noticeably faster than the page-aligned version, and there's a large region up to a working set size of 512 where the ratio in performance is somewhat stable, but more so on our Sandy Bridge chip than our Westmere chip.</p> <p>To understand what's going on here, we have to look at how caches organize data. By way of analogy, consider a 1,000 car parking garage that has 10,000 permits. With a direct mapped scheme (which you could call 1-way associative<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">3</a></sup>), each of the ten permits that has the same 3 least significant digits would be assigned the same spot, i.e., permits 0618, 1618, 2618, and so on, are only allowed to park in spot 618. If you show up at your spot and someone else is in it, you kick them out and they have to drive back home. The next time they get called in to work, they have to drive all the way back to the parking garage.</p> <p>Instead, if each car's permit allows it to park in a set that has ten possible spaces, we'll call that a 10-way set associative scheme, which gives us 100 sets of ten spots. Each set is now defined by the last 2 significant digits instead of the last 3. For example, with permit 2618, you can park in any spot from the set {018, 118, 218, …, 918}. If all of them are full, you kick out one unlucky occupant and take their spot, as before.</p> <p>Let's move out of analogy land and back to our benchmark. The main differences are that there isn't just one garage-cache, but a hierarchy of them, from the L1<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">4</a></sup>, which is the smallest (and hence, fastest), to the L2 and L3. Each seat in a car corresponds to an address. On x86, each address points to a particular byte. In the Sandy Bridge chip we're running on, we've got a 32kB L1 cache with a 64-byte line size, 64 sets, and 8-way set associativity. In our analogy, a line size of 64 would correspond to a car with 64 seats. We always transfer things in 64-byte chunks and the bottom log₂(64) = 6 bits of an address refer to a particular byte offset in a cache line. The next log₂(64) = 6 bits determine which set an address falls into<sup class="footnote-ref" id="fnref:L"><a rel="footnote" href="#fn:L">5</a></sup>. Each of those sets can contain 8 different things, so we have 64 sets * 8 lines/set * 64 bytes/line = 32kB. If we use the cache optimally, we can store 32,768 items. But, since we're accessing things that are page (4k) aligned, we effectively lose the bottom log₂(4k) = 12 bits, which means that every access falls into the same set, and we can only loop through 8 things before our working set is too large to fit in the L1!
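The bit arithmetic is easy to check directly; here's a sketch using the L1 parameters above (64-byte lines, 64 sets -- these numbers are specific to this chip, so adjust them for yours):</p> <pre><code class="language-c">/* Which L1 set does an address map to? With 64-byte lines we drop
   the bottom 6 bits, and with 64 sets the next 6 bits pick the set.
   Addresses that are 4kB apart differ only in bits 12 and up, so
   page-aligned objects all collide in one set; offsetting each
   object by an extra cache line spreads them across sets. */
#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

#define LINE_BITS 6   /* log2(64-byte line) */
#define SET_BITS  6   /* log2(64 sets)      */

static unsigned l1_set(uintptr_t addr) {
    return (unsigned)((addr &gt;&gt; LINE_BITS) &amp; ((1u &lt;&lt; SET_BITS) - 1));
}

int main(void) {
    for (int i = 0; i &lt; 4; i++) {
        uintptr_t aligned = (uintptr_t)i * 4096;         /* page aligned     */
        uintptr_t skewed  = (uintptr_t)i * (4096 + 64);  /* +1 line per item */
        printf(&quot;object %d: page-aligned set %2u, skewed set %2u\n&quot;,
               i, l1_set(aligned), l1_set(skewed));
    }
    return 0;
}
</code></pre> <p>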
<p>But if we'd misaligned our data to different cache lines, we'd be able to use 8 * 64 = 512 locations effectively.</p> <p>Similarly, our chip has a 512 set L2 cache, of which 8 sets are useful for our page aligned accesses, and a 12288 set L3 cache, of which 192 sets are useful for page aligned accesses, giving us 8 sets * 8 lines / set = 64 and 192 sets * 8 lines / set = 1536 useful cache lines, respectively. For data that's misaligned by a cache line, we have an extra 6 bits of useful address, which means that our L2 cache now has 32,768 useful locations.</p> <p>In the Sandy Bridge graph above, there's a region of stable relative performance between 64 and 512, as the page-aligned version is running out of the L3 cache and the unaligned version is running out of the L1. When we pass a working set of 512, the relative ratio gets better for the aligned version because it's now an L2 access vs. an L3 access. Our graph for Westmere looks a bit different because its L3 is only 3072 sets, which means that the aligned version can only stay in the L3 up to a working set size of 384. After that, we can see the terrible performance we get from spilling into main memory, which explains why the two graphs differ in shape above 384.</p> <p>For a visualization of this, you can think of a 32 bit pointer looking like this to our L1 and L2 caches:</p> <p><code>TTTT TTTT TTTT TTTT TTTT SSSS SSXX XXXX</code></p> <p><code>TTTT TTTT TTTT TTTT TSSS SSSS SSXX XXXX</code></p> <p>The bottom 6 bits are ignored, the next bits determine which set we fall into, and the top bits are a tag that lets us know what's actually in that set. Note that page aligning things, i.e., setting the address to</p> <p><code>???? ???? ???? ???? ???? 0000 0000 0000</code></p> <p>was just done for convenience in our benchmark. Not only will aligning to any large power of 2 cause a problem, but generating addresses with a power of 2 offset from each other will cause the same problem.</p> <p>Nowadays, the importance of caches is well understood enough that, when I'm asked to look at a cache related performance bug, it's usually due to the kind of thing we just talked about: conflict misses that prevent us from using our full cache effectively<sup class="footnote-ref" id="fnref:4"><a rel="footnote" href="#fn:4">6</a></sup>. This isn't the only way for that to happen -- bank conflicts and false dependencies are also common problems, but I'll leave those for another blog post.</p> <h4 id="resources">Resources</h4> <p>For more on caches and memory, see <a href="http://www.akkadia.org/drepper/cpumemory.pdf">What Every Programmer Should Know About Memory</a>. For something with more breadth, see <a href="//danluu.com/new-cpu-features/">this blog post for something &quot;short&quot;</a>, or <a href="http://www.amazon.com/gp/product/1478607831/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1478607831&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=HXBNFFJ3CIXMWUZP">Modern Processor Design</a> for something book length.
For even more breadth (those two links above focus on CPUs and memory), see <a href="http://www.amazon.com/gp/product/012383872X/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=012383872X&amp;linkCode=as2&amp;tag=abroaview-20&amp;linkId=ZHRJAO77SUQ6V7GL">Computer Architecture: A Quantitative Approach</a>, which talks about the whole system up to the datacenter level.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:A">The Sandy Bridge is an i7 3930K and the Westmere is a mobile i3 330M. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:M">Or anyone who aligned their data too nicely on a calculation with two source arrays and one destination when running on a chip with a 2-way associative or direct mapped cache. This is surprisingly common when you set up your arrays in some nice way in order to do cache blocking, if you're not careful. <a class="footnote-return" href="#fnref:M"><sup>[return]</sup></a></li> <li id="fn:1">Don't call it that. People will look at you funny the same way they would if you pronounced SQL as squeal or squll. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:B">In this post, L1 refers to the l1d. Since we're only concerned with data, the l1i isn't relevant. Apologies for the sloppy use of terminology. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:L">If it seems odd that the least significant available address bits are used for the set index, that's because of the cardinal rule of computer architecture: make the common case fast -- Google Instant completes “make the common” to “make the common case fast”, “make the common case fast mips”, and “make the common case fast computer architecture”. The vast majority of accesses are close together, so moving the set index bits upwards would cause more conflict misses. You might be able to get away with a hash function that isn't simply the least significant bits, but most proposed schemes hurt about as much as they help while adding extra complexity. <a class="footnote-return" href="#fnref:L"><sup>[return]</sup></a></li> <li id="fn:4">Cache misses are often described using the 3C model: conflict misses, which are caused by the type of aliasing we just talked about; compulsory misses, which are caused by the first access to a memory location; and capacity misses, which are caused by having a working set that's too large for a cache, even without conflict misses. Page-aligned accesses like these also make compulsory misses worse, because prefetchers won't prefetch beyond a page boundary. But if you have enough data that you're aligning things to page boundaries, you probably can't do much about that anyway. <a class="footnote-return" href="#fnref:4"><sup>[return]</sup></a></li> </ol> </div> PCA is not a panacea linear-hammer/ Fri, 13 Dec 2013 00:00:00 +0000 linear-hammer/ <p>Earlier this year, I interviewed with a well-known tech startup, one of the hundreds of companies that claim to have harder interviews, more challenging work, and smarter employees than Google<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. My first interviewer, John, gave me the standard tour: micro-kitchen stocked with a combination of healthy snacks and candy; white male 20-somethings gathered around a foosball table; bright spaces with cutesy themes; a giant TV set up for video games; and the restroom.
Finally, he showed me a closet-sized conference room and we got down to business.</p> <p>After the usual data structures and algorithms song and dance, we moved on to the main question: how would you design a classification system for foo<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>? We had a discussion about design tradeoffs, but the key disagreement was about the algorithm. I said that, if I had to code something up in an interview, I'd use a naive matrix factorization algorithm, but that I didn't expect to get great results because not everything can be decomposed easily. John disagreed – he was adamant that PCA was the solution for any classification problem.</p> <p></p> <p>We discussed the mathematical underpinnings for twenty-five minutes – half the time allocated for the interview – and it became clear that neither of us was going to convince the other with theory. I switched gears and tried the empirical approach, referring to an old result on classifying text with LSA (which can only capture pairwise correlations between words)<sup class="footnote-ref" id="fnref:C"><a rel="footnote" href="#fn:C">3</a></sup> vs. deep learning<sup class="footnote-ref" id="fnref:B"><a rel="footnote" href="#fn:B">4</a></sup>. Here's what you get with LSA:</p> <p><img src="images/linear-hammer/PCA.png" alt="2-d LSA" width="321" height="329"></p> <p>Each color represents a different type of text, projected down to two dimensions; you might not want to reduce the dimensionality that much, but it's a good way to visualize what's going on. There's some separation between the different categories; the green dots tend to be towards the bottom right, the black dots are a lot denser in the top half of the diagram, etc. But any classification based on that is simply not going to be very good when documents are similar and the differences between them are nuanced.</p> <p>Here's what we get with a deep autoencoder:</p> <p><img src="images/linear-hammer/deep_autoencoder.png" alt="2-d deep autoencoder" width="446" height="344"></p> <p>It's not perfect, but the results are a lot better.</p> <p>Even after the example, it was clear that I wasn't going to come to an agreement with my interviewer, so I asked if we could agree to disagree and move on to the next topic. No big deal, since it was just an interview. But I see this sort of misapplication of bog standard methods outside of interviews at least once a month, usually with the conviction that all you need to do is apply this linear technique for any problem you might see.</p> <p>Engineers are the first to complain when consultants with generic business knowledge come in, charge $500/hr and dispense common sense advice while making a mess of the details. But data science is new and hot enough that people get a pass when they call themselves data scientists instead of technology consultants. I don't mean to knock data science (whatever that means), or even linear methods<sup class="footnote-ref" id="fnref:D"><a rel="footnote" href="#fn:D">5</a></sup>. They're useful.
But I keep seeing people try to apply the same four linear methods to every problem in sight.</p> <p>In fact, as I was writing this, my girlfriend was in the other room taking a phone interview with the data science group of a big company, where they're attempting to use multivariate regression to predict the performance of their systems and decomposing resource utilization down to the application and query level from the regression coefficient, giving you results like 4000 QPS of foobar uses 18% of the CPU. The question they posed to her, which they're currently working on, was how do you speed up the regression so that you can push their test system to web scale?</p> <p>The real question is, why would you want to? There's a reason pretty much every intro grad level computer architecture course involves either writing or modifying a simulator; real system performance is full of non-linear cliffs, the sort of thing where you can't just apply a queuing theory model, let alone a linear regression model. But when all you have are linear hammers, non-linear screws look a lot like nails.</p> <p><small><em>In response to this, John Myles White made the good point that linear vs. non-linear isn't really the right framing, and that there really isn't a good vocabulary for talking about this sort of thing. Sorry for being sloppy with terminology. If you want to be more precise, you can replace each mention of &quot;linear&quot; with &quot;mumble mumble objective function&quot; or maybe &quot;simple&quot;.</em></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">When I was in college, the benchmark was MS. I wonder who's going to be next. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">I'm not disclosing the exact problem because they asked to keep the interview problems a secret, so I'm describing a similar problem where matrix decomposition has the same fundamental problems. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:C">If you're familiar with PCA and not LSA, <a href="http://stats.stackexchange.com/questions/65699/lsa-vs-pca-document-clustering">you can think of LSA as something PCA-like</a> <a class="footnote-return" href="#fnref:C"><sup>[return]</sup></a></li> <li id="fn:B"><a href="http://www.sciencemag.org/content/313/5786/504.abstract">http://www.sciencemag.org/content/313/5786/504.abstract</a>, <a href="http://www.cs.toronto.edu/~amnih/cifar/talks/salakhut_talk.pdf">http://www.cs.toronto.edu/~amnih/cifar/talks/salakhut_talk.pdf</a>. In a strict sense, this work was obsoleted by a slew of papers from 2011 which showed that you can <a href="http://www.stanford.edu/~acoates/papers/coatesleeng_aistats_2011.pdf">achieve</a> <a href="http://www.stanford.edu/~acoates/papers/coatesng_icml_2011.pdf">similar</a> <a href="http://cs.stanford.edu/~jngiam/papers/NgiamKohChenBhaskarNg2011.pdf">results</a> to this 2006 result with &quot;simple&quot; algorithms, but it's still true that current deep learning methods are better than the best &quot;simple&quot; feature learning schemes, and this paper was the first example that came to mind. <a class="footnote-return" href="#fnref:B"><sup>[return]</sup></a></li> <li id="fn:D">It's funny that I'm writing this blog post because I'm a huge fan of using the simplest thing possible for the job. That's often a linear method. Heck, one of my most common tricks is to replace a complex function with a first order Taylor expansion. 
<a class="footnote-return" href="#fnref:D"><sup>[return]</sup></a></li> </ol> </div> Why hardware development is hard hardware-unforgiving/ Sun, 10 Nov 2013 00:00:00 +0000 hardware-unforgiving/ <p>In CPU design, most successful teams have a fairly long lineage and rely heavily on experienced engineers. When we look at CPU startups, teams that have a successful exist often have a core team that's been together for decades. For example, PA Semi's acquisition by Apple was a moderately successful exit, but where did that team come from? They were the SiByte team, which left after SiByte was acquired by Broadcom, and SiByte was composed of many people from DEC who had been working together for over a decade. My old company was similar: an IBM fellow collected <a href="http://conservancy.umn.edu/handle/120173">the best people he worked with at IBM</a> who was a very early Dell employee and then exec (back when Dell still did interesting design work), then split off to create a chip startup. There have been quite a few CPU startups that have raised tens to hundreds of millions and leaned heavily on inexperienced labor; fresh PhDs and hardware engineers with only a few years of experience. Every single such startup I know of failed<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">1</a></sup>.</p> <p>This is in stark contrast to software startups, where it's common to see successful startups founded by people who are just out of school (or who dropped out of school). Why should microprocessors be any different? It's unheard of for a new, young, team to succeed at making a high-performance microprocessor, although this hasn't stopped people from funding these efforts.</p> <p>In software, it's common to hear about disdain for experience, such as Zuckerberg's comment, &quot;I want to stress the importance of being young and technical, Young people are just smarter.&quot;. Even when people don't explicitly devalue experience, they often don't value it either. As of this writing, Joel Spolsky's ”<a href="http://www.joelonsoftware.com/articles/GuerrillaInterviewing3.html">Smart and gets things done</a>” is probably the most influential piece of writing on software hiring. Note that it doesn't say &quot;smart, experienced, and gets things done.&quot;. Just &quot;smart and gets things done&quot; appears to be enough, no experience required. If you lean more towards the Paul Graham camp than the Joel Spolsky camp, there will be a lot of differences in how you hire, but Paul's advice is the same in that <a href="http://www.paulgraham.com/founders.html">experience doesn't rank as one of his most important criteria</a>, <a href="https://twitter.com/danluu/status/1356462960172883969">except as a diss</a>.</p> <p>Let's say you wanted to hire a plumber or a carptener, what would you choose? &quot;Smart and gets things done&quot; or &quot;experienced and effective&quot;? Ceteris paribus, I'll go for &quot;experienced and effective&quot;, doubly so if it's an emergency.</p> <p>Physical work isn't the kind of thing you can derive from first principles, no matter how smart you are. Consider South Korea after WWII. <a href="http://www.nationmaster.com/graph/eco_gdp_per_cap_in_195-economy-gdp-per-capita-1950">Its GDP per capita was lower than Ghana, Kenya, and just barely above the Congo</a>. 
For various reasons, the new regime didn't have to deal with legacy institutions, and they wanted Korea to become a first-world nation.</p> <p>The story <a href="http://delong.typepad.com/aeh/">I've heard</a> is that the government started by subsidizing concrete. After many years making concrete, they wanted to move up the chain and start more complex manufacturing. They eventually got to building ships, because shipping was a critical part of the export economy they wanted to create.</p> <p>They pulled some of their best business people who had learned skills like management and operations in other manufacturing. Those people knew they didn't have the expertise to build ships themselves, so they contracted it out. They made the choice to work with Scottish firms, because Scotland has a long history of shipbuilding. Makes sense, right?</p> <p>It didn't work. For historical and geographic reasons, Scotland's shipyards weren't full-sized; they built their ships in two halves and then assembled them. Worked fine for them, because they'd been doing it at scale since the 1800s, and had <a href="http://www.educationscotland.gov.uk/scotlandshistory/makingindustrialurban/shipbuilding/index.asp">world renowned expertise</a> by the 1900s. But when the unpracticed Koreans tried to build ships using Scottish plans and detailed step-by-step directions, the result was two ship halves that didn't quite fit together and sank when assembled.</p> <p>The Koreans eventually managed to start a shipbuilding industry by hiring foreign companies to come and build ships locally, showing people how it's done. And it took decades to get what we would consider basic manufacturing working smoothly, even though one might think that all of the requisite knowledge existed in books, was taught in university courses, and could be had from experts for a small fee. Now, their manufacturing industries are world class, e.g., according to Consumer Reports, Hyundai and Kia produce reliable cars. Going from producing unreliable econoboxes to reliable cars you can buy took over a decade, like it did for Toyota when they did it decades earlier. If there's a shortcut to quality other than hiring a lot of people who've done it before, no one's discovered it yet.</p> <p>Today, any programmer can take Geoffrey Hinton's <a href="https://www.coursera.org/course/neuralnets">course on neural networks and deep learning</a>, and start applying state of the art machine learning techniques. In software land, you can fix minor bugs in real time. If it takes a whole day to run your regression test suite, you consider yourself lucky because it means you're in one of the few environments that takes testing seriously. If the architecture is fundamentally flawed, you pull out your copy of Feathers' “<a href="http://www.amazon.com/gp/product/B005OYHF0A/ref=as_li_tf_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=B005OYHF0A&amp;linkCode=as2&amp;tag=abroaview-20">Working Effectively with Legacy Code</a>” and repeatedly apply fixes.</p> <p>This isn't to say that software isn't hard, but there are a lot of valuable problems that don't need a decade of hard-won experience to attack. But if you want to build a ship, and you &quot;only&quot; have a decade of experience with carpentry, milling, metalworking, etc., well, good luck. You're going to need it. With a large ship, “minor” fixes can take days or weeks, and a fundamental flaw means that your ship sinks and you've lost half a year of work and tens of millions of dollars.
By the time you get to something with the complexity of a modern high-performance microprocessor, a minor bug discovered in production costs three months and millions of dollars. A fundamental flaw in the architecture will cost you five years and hundreds of millions of dollars<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">2</a></sup>.</p> <p>Physical mistakes are costly. There's no undo and editing isn't simply a matter of pressing some keys; changes consume real, physical resources. You need enough wisdom and experience to avoid common mistakes entirely – especially the ones that can't be fixed.</p> <p><i>Thanks to Sophia Wisdom for comments/corrections/discussion.</i></p> <h3 id="cpu-internals-series">CPU internals series</h3> <ul> <li><a href="cpu-bugs/">CPU bugs</a></li> <li><a href="//danluu.com/new-cpu-features/">New CPU features since the 80s</a></li> <li><a href="//danluu.com/branch-prediction/">A brief history of branch prediction</a></li> <li><a href="//danluu.com/integer-overflow/">The cost of branches and integer overflow checking in real code</a></li> <li><a href="//danluu.com/hardware-unforgiving/">Why CPU development is hard</a></li> <li><a href="//danluu.com/why-hardware-development-is-hard/">Verilog sucks, part 1</a></li> <li><a href="//danluu.com/pl-troll/">Verilog sucks, part 2</a></li> </ul> <h3 id="2021-comments">2021 comments</h3> <p>In retrospect, I think that I was too optimistic about software in this post. If we're talking about product-market fit and success, I don't think the attitude in the post is wrong and people with little to no experience often do create hits. But now that I've been in the industry for a while and talked to numerous people about infra at various startups as well as large companies, I think creating high quality software infra requires no less experience than creating high quality physical items. Companies that decided this wasn't the case and hired a bunch of smart folks from top schools to build their infra have ended up with low quality, unreliable, expensive, and difficult to operate infrastructure. It just turns out that, if you have very good product-market fit, you don't need your infra to work. Your company can survive and even <a href="programmer-moneyball/">thrive while having infra that has 2 9s of uptime and costs an order of magnitude more than your competitor's infra</a> or <a href="nothing-works/">if your product's architecture means that it can't possibly work correctly</a>. You'll make less money than you would've otherwise, but the high order bits are all on the product side. If you contrast that with chip companies with inexperienced engineers that didn't produce a working product, well, you can't really sell a product that doesn't work even if you try. If you get very lucky, like if you happened to start a deep learning chip company at the right time, you might get a big company to acquire your non-working product. But, it's much harder to get an exit like that for a microprocessor.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:2">Comparing <a href="//danluu.com/glenn-henry-interview/">my old company</a> to another x86 startup founded within the year is instructive. Both started at around the same time. Both had great teams of smart people. Our competitor even had famous software and business people on their side. But it's notable that their hardware implementers weren't a core team of multi-decade industry veterans who had worked together before.
It took us about two years to get a working x86 chip, on top of $15M in funding. Our goal was to produce a low-cost chip and we nailed it. It took them five years, with over $250M in funding. Their original goal was to produce a high performance low-power processor, but they missed their performance target so badly that they were forced into the low-cost space. They ended up with worse performance than us, with a chip that was 50% bigger (and hence, cost more than 50% more to produce) using a team four times our size. They eventually went under, because there's no way they could survive with 4x our burn rate and weaker performance. But, not before burning through $969M in funding (including $230M from patent lawsuits). <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:1"><p>A funny side effect of the importance of experience is that age discrimination doesn't affect the areas I've worked in. At 30, I'm bizarrely young for someone who's done microprocessor design. The core folks at my old place were in their 60s. They'd picked up some younger folks along the way, but 30? Freakishly young. People are much younger at the new gig: I'm surrounded by ex-supercomputer folks from <a href="http://en.wikipedia.org/wiki/Cray#Cray_Research_Inc._and_Cray_Computer_Corporation:_1972_to_1996">Cray and SGI</a>, who are barely pushing 50, along with a couple kids from <a href="http://en.wikipedia.org/wiki/Synplicity">Synplicity</a> and <a href="http://www.deshawresearch.com/">DESRES</a> who, at 40, are unusually young. Not all hardware folks are that old. In another arm of the company, there are folks who grew up in the FPGA world, which is a lot more forgiving. In that group, I think I met someone who's only a few years older than me. Kidding aside, you'll see younger folks doing RTL design on complex projects at large companies that are willing to spend a decade mentoring folks. But, at startups and on small hardware teams that move fast, it's rare to hire someone into design who doesn't have a decade of experience.</p> <p>There's a crowd that's even younger than the FPGA folks, even younger than me, working on Arduinos and microcontrollers, doing hobbyist electronics and consumer products. I'm genuinely curious how many of those folks will decide to work on large-scale systems design. In one sense, it's inevitable as the area matures and solutions become more complex. The other sense is what I'm curious about: will the hardware renaissance spark an interest in supercomputers, microprocessors, and warehouse-scale computers?</p> <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> How to discourage open source contributions discourage-oss/ Sun, 27 Oct 2013 00:00:00 +0000 discourage-oss/ <p>What's the first thing you do when you find a bug or see a missing feature in an open source project? Check out the project page and submit a patch!</p> <p><img src="images/discourage-oss/send-us-116-pull-requests.png" alt="Send us a pull request! (116 open pull requests)" width="1012" height="152"></p> <p>Oh. Maybe their message is so encouraging that they get hundreds of pull requests a week, and the backlog isn't that bad.</p> <p><img src="images/discourage-oss/why-is-this-ignored.png" alt="Multiple people ask why this bug fix is being ignored. No response." width="745" height="472"></p> <p>Maybe not. Giant sucker that I am, I submitted a pull request even after seeing that.
All things considered, I should consider myself lucky that it's possible to <a href="http://www.jwz.org/blog/2013/02/wow-remember-when-people-tracked-bugs-welcome-to-the-next-level/">submit pull requests at all</a>. If I'm really lucky, maybe they'll get around to <a href="http://www.jwz.org/doc/cadt.html">looking at it one day</a>.</p> <p></p> <p>I don't mean to pick on this particular project. I can understand how this happens. You're a dev who can merge pull requests, but you're not in charge of triaging bugs and pull requests; you have a day job, projects that you own, and a life outside of coding. Maybe you take a look at the repo every once in a while, merge in good pull requests, and make comments on the ones that need more work, but you don't look at all 116 open pull requests; who has that kind of time?</p> <p>This behavior, eminently reasonable on the part of any individual, results in a systemic failure, a tax on new open source contributors. I often get asked how to get started with open source. It's easy for me to forget that getting started can be hard because the first projects I contributed to have a response time measured in hours for issues and pull requests<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. But a lot of people have experiences which aren't so nice. They contribute a few patches to a couple projects that get ignored, and have no idea where to go from there. It doesn't take <a href="http://lostechies.com/derickbailey/2012/12/14/dear-open-source-project-leader-quit-being-a-jerk/">egregious individual behavior</a> to create a hostile environment.</p> <p>That's kept me from contributing to some projects. At my last job, I worked on making a well-known open source project production quality, fixing hundreds of bugs over the course of a couple months. When I had some time, I looked into pushing the changes back to the open source community. But when I looked at the mailing list for the project, I saw a wasteland of good patches that were completely ignored, where the submitter would ping the list a couple times and then give up. Did it seem worth spending a week to disentangle our IP from the project in order to submit a set of patches that would, in all likelihood, get ignored? No.</p> <p>If you have commit access to a project that has this problem, please <a href="http://en.wikipedia.org/wiki/Diffusion_of_responsibility">own the process</a> for incoming pull requests (or don't ask for pull requests in your repo description). It doesn't have to be permanent; just until you have a system in place<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>. Not only will you get more contributors to your project, you'll help break down one barrier to becoming an open source contributor.</p> <p><img src="images/discourage-oss/plz-review.png" alt="joewiz replies to a month old comment. Asks for review months later. No reply." width="747" height="522"></p> <p><small></p> <p>For an update on the repo featured in this post, <a href="//danluu.com/everything-is-broken/#github">check out this response to a breaking change</a>.</p> <p></small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">Props to <a href="https://github.com/xianyi/OpenBLAS">OpenBlas</a>, <a href="https://github.com/mozilla/rust">Rust</a>, <a href="https://github.com/levskaya/jslinux-deobfuscated">jslinux-deobfuscated</a>, and <a href="https://github.com/softprops/np">np</a> for being incredibly friendly to new contributors.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">I don't mean to imply that this is trivial. It can be hard, if your project doesn't have an accepting culture, but there are <a href="http://paulmillr.com/posts/github-pull-request-stats/">popular, high traffic projects that manage to do it</a>. If all else fails, you can always try <a href="http://felixge.de/2013/03/11/the-pull-request-hack.html">the pull request hack</a>. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> </ol> </div> Randomize HN randomize-hn/ Fri, 04 Oct 2013 00:00:00 +0000 randomize-hn/ <p>You ever notice that there's this funny threshold for getting to the front page on sites like HN? The exact threshold varies depending on how much traffic there is, but, for articles that aren't wildly popular, there's this moment when the article is at N-1 votes. There is, perhaps, a 60% chance that the vote will come and the article will get pushed to the front page, where it will receive a slew of votes. There is, maybe, a 40% chance it will never get the vote that pushes it to the front page, causing it to languish in obscurity forever.</p> <p>It's non-optimal that an article that will receive 50 votes <a href="https://www.khanacademy.org/math/probability/random-variables-topic/random_variables_prob_dist/v/expected-value--e-x">in expectation</a> has a 60% chance of getting 100+ votes, and a 40% chance of getting 2 votes. Ideally, each article would always get its expected number of votes and stay on the front page for the expected number of time, giving readers exposure to the article in proportion to its popularity. Instead, by random happenstance, plenty of interesting content never makes it the front page, and as a result, the content that does make it gets a higher than optimal level of exposure.</p> <p></p> <p>You also see the same problem, with the sign bit flipped, on low traffic sites that push things to the front page the moment they're posted, like <a href="https://lobste.rs/">lobste.rs</a> and <a href="http://www.reddit.com/r/programming/">the smaller sub-reddits</a>: they displace links that most people would be interested in by putting links that almost no one cares about on the front page just so that the few things people do care about get enough exposure to be upvoted. On reddit, users &quot;fix&quot; this problem by heavily downvoting most submissions, pushing them off the front page, resulting in a problem that's fundamentally the same as the problem HN has.</p> <p>Instead of implementing some simple and easy to optimize, sites pile on ad hoc rules. Reddit <a href="https://news.ycombinator.com/item?id=6499144">implemented the rising page</a>, but it fails to solve the problem. On low-traffic subreddits, like <a href="http://www.reddit.com/r/programming/rising">r/programming</a> the threshold is so high that it's almost always empty. On high-traffic sub-reddits, anything that's upvoted enough to make it to the rising page is already wildly successful, and whether or not an article becomes successful is heavily dependent on whether or not the first couple voters happen to be people who upvote the post instead of downvoting it, i.e., the problem of getting onto the rising page is no different than the problem of getting to the top normally.</p> <p>HN tries to solve the problem by manually penalizing certain domains and keywords. That doesn't solve the problem for the 95% of posts that aren't penalized. 
<p>Adding noise to smooth out a discontinuity is a common trick when you can settle for an approximate result. I recently employed it to work around the classic floating point problem, where adding a tiny number to a large number results in no change, which is a problem when adding many small numbers to some large numbers<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">3</a></sup>. For a simple example of applying this, consider keeping a reduced precision counter that uses <code>loglog(n)</code> bits to store the value. Let <code>countVal(x) = 2^x</code> and <code>inc(x) = if (rand(2^x) == 0) x++</code><sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">4</a></sup>. Like understanding when to apply Taylor series, this is a simple trick that people are often impressed by if they haven't seen it before<sup class="footnote-ref" id="fnref:4"><a rel="footnote" href="#fn:4">5</a></sup>.</p>
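<p>Here's a minimal Python sketch of that counter, following the <code>countVal</code>/<code>inc</code> definitions above (the loop and variable names are just for illustration):</p> <pre><code>import random

# Reduced-precision counter: store x, report 2^x, and increment x with probability 1/2^x.
def count_val(x):
    return 2 ** x

def inc(x):
    # rand(2^x) == 0 happens with probability 1/2^x
    return x + 1 if random.randrange(2 ** x) == 0 else x

x = 0
for _ in range(100000):
    x = inc(x)
print(count_val(x))  # a rough estimate of 100000, kept in loglog-sized state
</code></pre>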
<p><strong>Update</strong>: HN tried this! Dan Gackle tells me that it didn't work very well (it resulted in a lot of low quality junk briefly hitting the front page and then disappearing). I think that might be fixable by tweaking some parameters, but the solution that HN settled on, having a human (or multiple humans) put submissions that are deemed to be good or interesting into a &quot;second chance queue&quot; that boosts the submission onto the front page, works better than a simple randomized algorithm with no direct human input could with any amount of parameter tweaking. I think this is also true of moderation, where the &quot;new&quot; dang/sctb moderation regime has resulted in a marked increase in comment quality, probably better than anything that could be done with an automated ML-based solution today — Google and FB have some of the most advanced automated systems in the world, and the quality of the result is much worse than what we see on HN.</p> <p>Also, at the time this post was written (2013), the threshold to get onto the front page was often 2-3 votes, making the marginal impact of a random passerby who happens to like a submission checking the new page very large. Even during off peak times now (in 2019), the threshold seems to be much higher, reducing the amount of randomness. Additionally, the rise in the popularity of HN increased the sheer volume of low quality content that languishes on the new page, which would reduce the exposure that any particular &quot;good&quot; submission would get if it were among the 30 items on the new page that would randomly get boosted onto the front page. That doesn't mean there aren't still problems with the current system: most people seem to upvote and comment based on the title of the article and not the content (to check this, read the comments of articles that are mistitled before someone calls this out for a particular post — it's generally quite clear that most commenters haven't even skimmed the article, let alone read it), but that's a topic for a different post.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:A">Another way to look at it is that it's A/B testing for upvotes (though, to be pedantic, it's actually closer to a multi-armed bandit). Another is that the distribution of people reading the front page and the new page aren't the same, and randomizing the front page prevents the clique that reads the new page from having undue influence. <a class="footnote-return" href="#fnref:A"><sup>[return]</sup></a></li> <li id="fn:1">If you want to do the exercise yourself, pg once said the formula for HN is: (votes - 1) / (time + 2)^1.5. It's possible the power of the denominator has been tweaked, but as long as it's greater than 1.0, you'll get a reasonable result. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2"><a href="http://en.wikipedia.org/wiki/Kahan_summation_algorithm">Kahan summation</a> wasn't sufficient, for fundamentally the same reason it won't work for the simplified example I gave above. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:3">Assume we use a rand function that returns a non-negative integer between 0 and n-1, inclusive. With x = 0, we start counting from 1, <a href="http://julialang.org/">as God intended</a>. inc(0) will definitely increment, so we'll increment and correctly count to countVal(1) = 2^1 = 2. Next, we'll increment with probability ½; we'll have to increment twice in expectation to increase x. That works out perfectly because countVal(2) = 2^2 = 4, so we want to increment twice before increasing x. Then we'll increment with probability ¼, and so on and so forth. <a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li> <li id="fn:4">See <a href="http://www.amazon.com/gp/product/0521835402/ref=as_li_tf_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0521835402&amp;linkCode=as2&amp;tag=abroaview-20">Mitzenmacher</a> for a good introduction to randomized algorithms that also has an explanation of all the math you need to know. If you already apply Chernoff bounds in your sleep, and want something more in-depth, <a href="http://www.amazon.com/gp/product/0521474655/ref=as_li_tf_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0521474655&amp;linkCode=as2&amp;tag=abroaview-20">Motwani &amp; Raghavan</a> is awesome. <a class="footnote-return" href="#fnref:4"><sup>[return]</sup></a></li> </ol> </div> Writing safe Verilog pl-troll/ Sun, 15 Sep 2013 00:00:00 +0000 pl-troll/ <p><img src="images/pl-troll/davidalbert-pl-troll.png" alt="PL troll: a statically typed language with no type declarations. Types are determined entirely using Hungarian notation" width="1160" height="448"></p> <p>Troll?
That's how people write Verilog<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. At my old company, we had a team of formal methods PhDs who wrote a linter that typechecked our code, based on our naming convention. For our chip (which was small for a CPU), building a model (compiling) took about five minutes, running a single short test took ten to fifteen minutes, and long tests took CPU months. The value of a linter that can run in seconds should be obvious, not even considering the fact that it can take hours of tracing through waveforms to find out why a test failed<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>.</p> <p>Let's look at some of the most commonly used naming conventions.</p> <p></p> <h3 id="pipeline-stage">Pipeline stage</h3> <p>When you <a href="http://en.wikipedia.org/wiki/Pipeline_(computing)">pipeline</a> hardware, you end up with many versions of the same signal, one for each stage of the pipeline the signal traverses. Even without static checks, you'll want some simple way to differentiate between these, so you might name them <code>foo_s1</code>, <code>foo_s2</code>, and <code>foo_s3</code>, indicating that they originate in the first, second, and third stages, respectively. In any particular stage, a signal is most likely to interact with other signals in the same stage; it's often a mistake when logic from other stages is accessed. There are reasons to access signals from other stages, like bypass paths and control logic that looks at multiple stages, but logic that stays contained within a stage is common enough that it's not too tedious to either “cast” or add a comment that disables the check, when looking at signals from other stages.</p>
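<p>As a rough illustration of how little machinery a check like this needs, here's a toy Python sketch that flags expressions mixing pipeline-stage suffixes; the signal names and the waiver comment are made up, and a real linter would do a bit more parsing:</p> <pre><code>import re

# Flag any line whose signals carry more than one pipeline-stage suffix (_s1, _s2, ...),
# unless the line has an explicit waiver comment (for bypass paths, control logic, etc.).
STAGE = re.compile(r'\b\w+_s(\d+)\b')

def check_line(line):
    if 'lint_waive_stage' in line:            # the "cast"/waiver escape hatch
        return None
    stages = set(STAGE.findall(line))
    if len(stages) not in (0, 1):
        return 'mixed pipeline stages %s: %s' % (sorted(stages), line.strip())
    return None

example = [
    'assign valid_s2 = valid_s1 ^ stall_s2;           // bug: s1 term in s2 logic',
    'assign result_s3 = a_s3 + b_s3;                  // fine',
    'assign bypass_s3 = alu_s2;  // lint_waive_stage  // deliberately waived',
]
for line in example:
    message = check_line(line)
    if message:
        print(message)
</code></pre>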
<h3 id="clock-domain">Clock domain</h3> <p>Accessing a signal in a different clock domain without synchronization is like accessing a data structure from multiple threads without synchronization. Sort of. But worse. Much worse. Driving combinational logic from a metastable state (where the signal is sitting between a 0 and 1) can burn a massive amount of power<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">3</a></sup>. Here, I'm not just talking about being inefficient. If you took a high-power chip from the late 90s and removed the heat sink, it would melt itself into the socket, even under normal operation. Modern chips have such a high maximum possible power consumption that the chips would self destruct if you disabled the thermal regulation, even with the heat sink. Logic that's floating at an intermediate value not only uses a lot of power, it bypasses a chip's usual ability to reduce power by slowing down the clock<sup class="footnote-ref" id="fnref:4"><a rel="footnote" href="#fn:4">4</a></sup>. Using cross clock domain signals without synchronization is a bad idea, unless you like random errors, high power dissipation, and the occasional literal meltdown.</p> <h3 id="module-region">Module / Region</h3> <p>In high speed designs, it's an error to use a signal that's sourced from another module without registering it first. This will insidiously sneak through simulation; you'll only notice when you look at the timing report. On the last chip I worked on, it took about two days to generate a timing report (and you couldn't concurrently generate a new one for each build due to the license costs). If you accidentally reference a signal from a distant module, not only will you not meet your timing budget for that path, but the synthesis tool will allocate resources to try to make that path faster, which will slow down everything else<sup class="footnote-ref" id="fnref:7"><a rel="footnote" href="#fn:7">5</a></sup>, making the entire timing report worthless<sup class="footnote-ref" id="fnref:9"><a rel="footnote" href="#fn:9">6</a></sup>.</p> <h3 id="pl-trolling">PL Trolling</h3> <p>I'd been feeling naked at my new gig, coding Verilog without any sort of static checking. I put off writing my own checker, because static analysis is one of those scary things you need a PhD to do, right? And writing a parser for SystemVerilog is a ridiculously large task<sup class="footnote-ref" id="fnref:10"><a rel="footnote" href="#fn:10">7</a></sup>. But, it turns out that you don't need much of a parser, and all the things I've talked about are simple enough that half an hour after starting, I had a tool that found seven bugs, with only two false positives. I expect we'll have 4x as much code by the time we're done, so that's 28 bugs from half an hour of work, not even considering the fact that two of the bugs were in heavily used macros.</p> <p>I think I'm done for the day, but there are plenty of other easy things to check that will certainly find bugs (e.g., checking for regs/logic that are declared or assigned, but not used). Whenever I feel like tackling a self-contained challenge, there are plenty of not-so-easy things, too (e.g., checking if things aren't clock gated or power gated when they should be, which isn't hard to do statistically, but is non-trivial statically).</p> <p>Huh. That wasn't so bad. I've now graduated to junior PL troll.</p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">Well, people usually use suffixes as well as prefixes. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> <li id="fn:2">You should, of course, write your own tool to script interaction with your waveform view because waveform viewers have such poor interfaces, but that's a whole ‘nother blog post. <a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li> <li id="fn:3">In <a href="http://en.wikipedia.org/wiki/CMOS">static CMOS</a> there's a network of transistors between power and output, and a dual network between ground and output. As a first-order approximation, only one of the two networks should be on at a time, except when switching, which is why switching logic gates use more power than unchanging gates -- in addition to the power used to discharge the capacitance that the output is driving, there is, briefly, a direct connection from power to ground. If you get stuck in a half-on state, there's a constant connection from power to ground. <a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li> <li id="fn:4">In theory, power gating could help, but you can't just power gate some arbitrary part of the chip that's too hot. <a class="footnote-return" href="#fnref:4"><sup>[return]</sup></a></li> <li id="fn:7">There are a number of reasons that this completely destroys the timing report.
First, for any high-speed design, there's not enough fast (wide) interconnect to go around. Gates are at the bottom, and wires sit above them. Wires get wider and faster in higher layers, but there's congestion getting to and from the fast wires, and relatively few of them. There are so few of them that people pre-plan where modules should be placed in order to have enough fast interconnect to meet timing demands. If you steal some fast wires to make some slow path fast, anything relying on having a fast path through that region is hosed. Second, the synthesis tool tries to place sources near sinks, to reduce both congestion and delay. If you place a sink on a net that's very far from the rest of the sinks, the source will migrate halfway in between, to try to match the demands of all the sinks. This is recursively bad, and will pull all the second order sources away from their optimal location, and so on and so forth. <a class="footnote-return" href="#fnref:7"><sup>[return]</sup></a></li> <li id="fn:9">With some tools, you can have them avoid optimizing paths that fail timing by more than a certain margin, but there's still always some window where a bad path will destroy your entire timing report, and it's often the case that there are real critical paths that need all the resources the synthesis tool can throw at it to make it across the chip in time. <a class="footnote-return" href="#fnref:9"><sup>[return]</sup></a></li> <li id="fn:10">The SV standard is 1300 pages long, vs 800 for C++, 500 for C, 300 for Java, and 30 for Erlang. <a class="footnote-return" href="#fnref:10"><sup>[return]</sup></a></li> </ol> </div> Verilog is weird why-hardware-development-is-hard/ Sat, 07 Sep 2013 00:00:00 +0000 why-hardware-development-is-hard/ <p>Verilog is the most commonly used language for hardware design in America (VHDL is more common in Europe). Too bad it's so baroque. If you ever browse the <a href="http://stackoverflow.com/questions/tagged/verilog">Verilog questions on Stack Overflow</a>, you'll find a large number of questions, usually downvoted, asking “why doesn't my code work?”, with code that's not just a little off, but completely wrong.</p> <p><img src="images/so_verilog.png" alt="6 questions, all but one with negative score"></p> <p></p> <p>Lets look at <a href="http://stackoverflow.com/questions/18678426/verilog-store-counter-value-when-reset-is-asserted">an example</a>: “Idea is to store value of counter at the time of reset . . . I get DRC violations and the memory, bufreadaddr, bufreadval are all optimized out.”</p> <pre><code>always @(negedge reset or posedge clk) begin if (reset == 0) begin d_out &lt;= 16'h0000; d_out_mem[resetcount] &lt;= d_out; laststoredvalue &lt;= d_out; end else begin d_out &lt;= d_out + 1'b1; end end always @(bufreadaddr) bufreadval = d_out_mem[bufreadaddr]; </code></pre> <p>We want a counter that keeps track of how many cycles it's been since reset, and we want to store that value in an array-like structure that's indexed by resetcount. If you've read a bit on semantics of Verilog, this is a perfectly natural way to solve the problem. Our poster knows enough about Verilog to use ‘&lt;=' in state elements, so that all of the elements are updated at the same time. Every time there's a clock edge, we'll increment d_out. When reset is 0, we'll store that value and reset d_out. 
What could possibly go wrong?</p> <p>The problem is that Verilog <a href="http://en.wikipedia.org/wiki/Verilog#History">was originally designed as a language to describe simulations</a>, so it has constructs to describe arbitrary interactions between events. When X transitions from 0 to 1, do Y. Great! Sounds easy enough. But then someone had the bright idea of using Verilog to represent hardware. The vast majority of statements you could write down don't translate into any meaningful hardware. Your synthesis tool, which translates from Verilog to hardware, will helpfully pattern match to the closest available thing, or produce nothing, if you write down something untranslatable. If you're lucky, you might get some warnings.</p> <p>Looking at the code above, the synthesis tool will see that there's something called d_out which should be a clocked element that's set to something when it shouldn't be reset, and is otherwise asynchronously reset. That's a legit hardware construct, so it will produce an N-bit flip-flop and some logic to make it a counter that gets reset to 0. BTW, this paragraph used to contain a link to <a href="http://en.wikipedia.org/wiki/Flip-flop">http://en.wikipedia.org/wiki/Flip-flop</a>_(electronics), but ever since I switched to Hugo, my links to URLs with parens in them are broken, so maybe try copy+pasting that URL into your browser window if you want to know what a flip-flop is.</p> <p>Now, what about the value we're supposed to store on reset? Well, the synthesis tool will see that it's inside a block that's clocked. But it's not supposed to do anything when the clock is active; only when reset is asserted. That's pretty unusual. What's going to happen? Well, that depends on which version of which synthesis tool you're using, and how the programmers of that tool decided to implement undefined behavior.</p> <p>And then there's the block that's supposed to read out the stored value. It looks like the intent is to create a 64:1 <a href="http://en.wikipedia.org/wiki/Multiplexer">MUX</a>. Putting aside the cycle time issues you'll get with such a wide MUX, the block isn't clocked, so the synthesis tool will have to infer some sort of combinational logic. But, the output is only supposed to change if bufreadaddr changes, and not if d_out_mem changes. It's quite easy to describe that in our simulation language, but the synthesis tool is going to produce something that is definitely not what the user wants here. Not to mention that laststoredvalue isn't meaningfully connected to bufreadval.</p> <p>How is it possible that a reasonable description of something in Verilog turns into something completely wrong in hardware? You can think of hardware as some state, with pure functions connecting the state elements. This makes it natural to think about modeling hardware in a functional programming language. Another natural way to think about it would be with OO. Classes describe how the hardware works. Instances of the class are actual hardware that will get put onto the chip. Yet another natural way to describe things would be declaratively, where you write down constraints the hardware must obey, and the synthesis tool outputs something that meets those constraints.</p> <p>Verilog does none of these things. To write Verilog that will produce correct hardware, you have to first picture the hardware you want to produce. Then, you have to figure out how to describe that in this weird C-like simulation language. That will then get synthesized into something like what you were imagining in the first step.</p>
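<p>To make the &quot;state plus pure functions&quot; view concrete, here's a tiny Python stand-in (not an HDL) for the counter-capture example above; the structure and the input trace are made up for illustration:</p> <pre><code># All state is explicit, and one pure function computes the next state from the
# current state and the inputs -- the mental model described above, not real hardware.
def step(state, reset):
    d_out, mem, resetcount = state
    if reset:
        new_mem = dict(mem)
        new_mem[resetcount] = d_out           # capture the counter value on reset
        return (0, new_mem, resetcount + 1)
    return (d_out + 1, mem, resetcount)

state = (0, {}, 0)
for reset in [0, 0, 0, 1, 0, 0, 1]:           # a made-up input trace
    state = step(state, reset)
print(state[1])                               # {0: 3, 1: 2}: captured counts, per reset
</code></pre>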
<p>As a software engineer, how would you feel if 99% of valid Java code ended up being translated to something that produced random results, even though tests pass on the untranslated Java code? And, by the way, to run tests on the translated Java code you have to go through a multi-day long compilation process, after which your tests will run 200 million times slower than code runs in production. If you're thinking of testing on some sandboxed production machines, sure, go ahead, but it costs 8 figures to push something to any number of your production machines, and it takes 3 months. But, don't worry, you can run the untranslated code only 2 million times slower than in production <sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. People used to statically typed languages often complain that you get run-time errors about things that would be trivial to statically check in a language with stronger types. We hardware folks are so used to the vast majority of legal Verilog constructs producing unsynthesizable garbage that we don't find it the least bit surprising that not only do you not get compile-time errors, you don't even get run-time errors, from writing naive Verilog code.</p> <p>Old school hardware engineers will tell you that it's fine. It's fine that the language is so counter-intuitive that almost all people who initially approach Verilog write code that's not just wrong but nonsensical. &quot;All you have to do is figure out the design and then translate it to Verilog&quot;. They'll tell you that it's totally fine that the mental model you have of what's going on is basically unrelated to the constructs the language provides, and that they never make errors now that they're experienced, much like some experienced C programmers will erroneously tell you that they never have security related buffer overflows or double frees or memory leaks now that they're experienced. It reminds me of talking to assembly programmers who tell me that assembly is as productive as a high level language once you get your functions written. Programmers who haven't talked to old school assembly programmers will think I'm making that up, but I know a number of people who still maintain that assembly is as productive as any high level language out there. But people like that are rare and becoming rarer. With hardware, we train up a new generation of people who think that Verilog is as productive as any language could be every few years!</p> <p>I won't even get into how Verilog is so inexpressive that many companies use an ad hoc tool to embed a scripting language in Verilog or generate Verilog from a scripting language.</p> <p>There have been a number of attempts to do better than jamming an ad hoc scripting language into Verilog, but they've all fizzled out. As a functional language that's easy to add syntax to, Haskell is a natural choice for Verilog code generation; it spawned ForSyDe, Hydra, Lava, HHDL, and Bluespec. But adoption of ForSyDe, Hydra, Lava, and HHDL is pretty much zero, not because of deficiencies in the language, but because it's politically difficult to get people to use a Haskell based language. Bluespec has done better, but they've done it by making their language look C-like, scrapping the original Haskell syntax and introducing Bluespec SystemVerilog and Bluespec SystemC.
The aversion to Haskell is so severe that when we discussed a hardware style at my new gig, one person suggested banning any Haskell based solution, even though Bluespec has been used to good effect in a couple projects within the company.</p> <p>Scala based solutions look more promising, not for any technical reason, but because Scala is less scary. Scala has managed to bring the modern world (in terms of type systems) to more programmers than ML, Ocaml, Haskell, Agda, etc., combined. Perhaps the same will be true in the hardware world. <a href="https://chisel.eecs.berkeley.edu">Chisel</a> is interesting. Like Bluespec, it simulates much more quickly than Verilog, and unsynthesizable representations are syntax errors. It's not as high level, but it's the only hardware description language with a modern type system that I've been able to discuss with hardware folks without people objecting that Haskell is a bad idea.</p> <p>Commercial vendors are mostly moving in the other direction because C-like languages make people feel all warm and fuzzy. A number of them are pushing high-level hardware synthesis from SystemC, or even straight C or C++. These solutions are also politically difficult to sell, but this time it's the history of the industry, and not the language. Vendors pushing high-level synthesis have a decades-long track record of overpromising and underdelivering. I've lost track of the number of times I've heard people dismiss modern offerings with “Why should we believe that they're for real this time?”</p> <p>What's the future? Locally, I've managed to convince a couple of people on my team that Chisel is worth looking at. At the moment, none of the Haskell based solutions are even on the table. I'm open to suggestions.</p> <h3 id="cpu-internals-series">CPU internals series</h3> <ul> <li><a href="cpu-bugs/">CPU bugs</a></li> <li><a href="//danluu.com/new-cpu-features/">New CPU features since the 80s</a></li> <li><a href="//danluu.com/branch-prediction/">A brief history of branch prediction</a></li> <li><a href="//danluu.com/integer-overflow/">The cost of branches and integer overflow checking in real code</a></li> <li><a href="//danluu.com/hardware-unforgiving/">Why CPU development is hard</a></li> <li><a href="//danluu.com/why-hardware-development-is-hard/">Verilog sucks, part 1</a></li> <li><a href="//danluu.com/pl-troll/">Verilog sucks, part 2</a></li> </ul> <p><small>P.S. Dear hardware folks, sorry for oversimplifying so much. I started writing footnotes explaining everything I was glossing over until I realized that my footnotes were longer than the post. The culled footnotes may make it into their own blog posts some day. A very long footnote that I'll briefly summarize is that semantically correct Verilog simulation is inherently slower than something like Bluespec or Chisel because of the complications involved with the event model. EDA vendors have managed to get decent performance out of Verilog, but only by hiring large teams of the best simulation people in the world to hammer at the problem, the same way JavaScript is fast not because of any property of the language, but because there are amazing people working on the VM. It should tell you something when a tiny team working on a shoestring grant-funded budget can produce a language and simulation infrastructure that smokes existing tools.</p> <p>You may wonder why I didn't mention linters.
They're a great idea and, for reasons I don't understand, two of the three companies I've done hardware development for haven't used linters. If you ask around, everyone will agree that they're a good idea, but even though a linter will run in the thousands to tens of thousands of dollars range, and engineers run in the hundreds of thousands of dollars range, it hasn't been politically possible to get a linter even on multi-person teams that have access to tools that cost tens or hundreds of thousands of dollars per license per year. Even though linters are a no-brainer, companies that spend millions to tens of millions a year on hardware development often don't use them, and good SystemVerilog linters are all out of the price range of the people who are asking StackOverflow questions that get downvoted to oblivion. </small></p> <div class="footnotes"> <hr /> <ol> <li id="fn:1">Approximate numbers from the last chip I worked on. We had licenses for both major commercial simulators, and we were lucky to get 500Hz, pre-synthesis, on the faster of the two, for a chip that ran at 2GHz in silicon. Don't even get me started on open source simulators. The speed is at least 10x better for most ASIC work. Also, you can probably do synthesis much faster if you don't have timing / parasitic extraction baked into the process. <a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li> </ol> </div> About danluu.com about/ Sun, 01 Sep 2013 00:00:00 +0000 about/ <h3 id="about-the-blog">About The Blog</h3> <p>This started out as a way to jot down thoughts on areas that seem interesting but underappreciated. Since then, this site has grown to the point where it gets <a href="blog-ads/#update">millions of hits a month</a> and I see that it's commonly cited by professors in their courses and on stackoverflow.</p> <p>That's flattering, but more than anything else, I view that as a sign there's a desperate shortage of <a href="//danluu.com/teach-debugging/">understandable explanations of technical topics</a>. There's nothing here that most of my co-workers don't know (with the exception of maybe three or four posts where I propose novel ideas). It's just that they don't blog and I do. I'm not going to try to <a href="https://sites.google.com/site/steveyegge2/you-should-write-blogs">convince you to start writing a blog</a>, since that has to be something you want to do, but I will point out that there's a large gap that's waiting to be filled by your knowledge. When I started writing this blog, I figured almost no one would ever read it; sure, Joel Spolsky and Steve Yegge created widely read blogs, but that was back when almost no one was blogging. Now that there are millions of blogs, there's just no way to start a new blog and get noticed. Turns out that's not true.</p> <p>This site also archives a few things that have fallen off the internet, like <a href="//danluu.com/subspace-history/">this history of subspace, the 90s video game</a>, the <a href="//danluu.com/su3su2u1/physics/">su3su2u1 introduction to physics</a>, <a href="//danluu.com/su3su2u1/hpmor/">the su3su2u1 review of hpmor</a>, <a href="//danluu.com/symbolics-lisp-machines/">Dan Weinreb's history of Symbolics and Lisp machines</a>, <a href="//danluu.com/open-social-networks/">this discussion of open vs.
closed social networks</a>, <a href="//danluu.com/mit-stanford/">this discussion about the differences between SV and Boston, and Stanford and MIT</a>, <a href="//danluu.com/threads-faq/">the comp.programming.threads FAQ</a>, and <a href="//danluu.com/microsoft-culture/">this presentation about Microsoft culture from 2000</a><a href="//danluu.com/mcilroy-unix">.</a></p> <p>P.S. If you enjoy this blog, you'd probably enjoy <a href="https://www.recurse.com/scout/click?t=b504af89e87b77920c9b60b2a1f6d5e8">RC</a>, which I've heard called &quot;nerd camp for programmers&quot;.</p> Latency mitigation strategies (by John Carmack) latency-mitigation/ Tue, 05 Mar 2013 00:00:00 +0000 latency-mitigation/ <p><i>this is an archive of an old article by John Carmack which seems to have disappeared off of the internet</i></p> <h4 id="abstract">Abstract</h4> <p>Virtual reality (VR) is one of the most demanding human-in-the-loop applications from a latency standpoint. The latency between the physical movement of a user’s head and updated photons from a head mounted display reaching their eyes is one of the most critical factors in providing a high quality experience.</p> <p>Human sensory systems can detect very small relative delays in parts of the visual or, especially, audio fields, but when absolute delays are below approximately 20 milliseconds they are generally imperceptible. Interactive 3D systems today typically have latencies that are several times that figure, but alternate configurations of the same hardware components can allow that target to be reached.</p> <p>A discussion of the sources of latency throughout a system follows, along with techniques for reducing the latency in the processing done on the host system.</p> <h4 id="introduction">Introduction</h4> <p>Updating the imagery in a head mounted display (HMD) based on a head tracking sensor is a subtly different challenge than most human / computer interactions. With a conventional mouse or game controller, the user is consciously manipulating an interface to complete a task, while the goal of virtual reality is to have the experience accepted at an unconscious level.</p> <p>Users can adapt to control systems with a significant amount of latency and still perform challenging tasks or enjoy a game; many thousands of people enjoyed playing early network games, even with 400+ milliseconds of latency between pressing a key and seeing a response on screen.</p> <p>If large amounts of latency are present in the VR system, users may still be able to perform tasks, but it will be by the much less rewarding means of using their head as a controller, rather than accepting that their head is naturally moving around in a stable virtual world. Perceiving latency in the response to head motion is also one of the primary causes of simulator sickness. Other technical factors that affect the quality of a VR experience, like head tracking accuracy and precision, may interact with the perception of latency, or, like display resolution and color depth, be largely orthogonal to it.</p> <p>A total system latency of 50 milliseconds will feel responsive, but still subtly lagging. One of the easiest ways to see the effects of latency in a head mounted display is to roll your head side to side along the view vector while looking at a clear vertical edge. Latency will show up as an apparent tilting of the vertical line with the head motion; the view feels “dragged along” with the head motion. 
When the latency is low enough, the virtual world convincingly feels like you are simply rotating your view of a stable world.</p> <p>Extrapolation of sensor data can be used to mitigate some system latency, but even with a sophisticated model of the motion of the human head, there will be artifacts as movements are initiated and changed. It is always better to not have a problem than to mitigate it, so true latency reduction should be aggressively pursued, leaving extrapolation to smooth out sensor jitter issues and perform only a small amount of prediction.</p> <h4 id="data-collection">Data collection</h4> <p>It is not usually possible to introspectively measure the complete system latency of a VR system, because the sensors and display devices external to the host processor make significant contributions to the total latency. An effective technique is to record high speed video that simultaneously captures the initiating physical motion and the eventual display update. The system latency can then be determined by single stepping the video and counting the number of video frames between the two events.</p> <p>In most cases there will be a significant jitter in the resulting timings due to aliasing between sensor rates, display rates, and camera rates, but conventional applications tend to display total latencies in the dozens of 240 fps video frames.</p> <p>On an unloaded Windows 7 system with the compositing Aero desktop interface disabled, a gaming mouse dragging a window displayed on a 180 hz CRT monitor can show a response on screen in the same 240 fps video frame that the mouse was seen to first move, demonstrating an end to end latency below four milliseconds. Many systems need to cooperate for this to happen: The mouse updates 500 times a second, with no filtering or buffering. The operating system immediately processes the update, and immediately performs GPU accelerated rendering directly to the framebuffer without any page flipping or buffering. The display accepts the video signal with no buffering or processing, and the screen phosphors begin emitting new photons within microseconds.</p> <p>In a typical VR system, many things go far less optimally, sometimes resulting in end to end latencies of over 100 milliseconds.</p> <h4 id="sensors">Sensors</h4> <p>Detecting a physical action can be as simple as watching a circuit close for a button press, or as complex as analyzing a live video feed to infer position and orientation.</p> <p>In the old days, executing an IO port input instruction could directly trigger an analog to digital conversion on an ISA bus adapter card, giving a latency on the order of a microsecond and no sampling jitter issues. Today, sensors are systems unto themselves, and may have internal pipelines and queues that need to be traversed before the information is even put on the USB serial bus to be transmitted to the host.</p> <p>Analog sensors have an inherent tension between random noise and sensor bandwidth, and some combination of analog and digital filtering is usually done on a signal before returning it. Sometimes this filtering is excessive, which can contribute significant latency and remove subtle motions completely.</p> <p>Communication bandwidth delay on older serial ports or wireless links can be significant in some cases. If the sensor messages occupy the full bandwidth of a communication channel, latency equal to the repeat time of the sensor is added simply for transferring the message.
Video data streams can stress even modern wired links, which may encourage the use of data compression, which usually adds another full frame of latency if not explicitly implemented in a pipelined manner.</p> <p>Filtering and communication are constant delays, but the discretely packetized nature of most sensor updates introduces a variable latency, or “jitter” as the sensor data is used for a video frame rate that differs from the sensor frame rate. This latency ranges from close to zero if the sensor packet arrived just before it was queried, up to the repeat time for sensor messages. Most USB HID devices update at 125 samples per second, giving a jitter of up to 8 milliseconds, but it is possible to receive 1000 updates a second from some USB hardware. The operating system may impose an additional random delay of up to a couple milliseconds between the arrival of a message and a user mode application getting the chance to process it, even on an unloaded system.</p> <h4 id="displays">Displays</h4> <p>On old CRT displays, the voltage coming out of the video card directly modulated the voltage of the electron gun, which caused the screen phosphors to begin emitting photons a few microseconds after a pixel was read from the frame buffer memory.</p> <p>Early LCDs were notorious for “ghosting” during scrolling or animation, still showing traces of old images many tens of milliseconds after the image was changed, but significant progress has been made in the last two decades. The transition times for LCD pixels vary based on the start and end values being transitioned between, but a good panel today will have a switching time around ten milliseconds, and optimized displays for active 3D and gaming can have switching times less than half that.</p> <p>Modern displays are also expected to perform a wide variety of processing on the incoming signal before they change the actual display elements. A typical Full HD display today will accept 720p or interlaced composite signals and convert them to the 1920×1080 physical pixels. 24 fps movie footage will be converted to 60 fps refresh rates. Stereoscopic input may be converted from side-by-side, top-down, or other formats to frame sequential for active displays, or interlaced for passive displays. Content protection may be applied. Many consumer oriented displays have started applying motion interpolation and other sophisticated algorithms that require multiple frames of buffering.</p> <p>Some of these processing tasks could be handled by only buffering a single scan line, but some of them fundamentally need one or more full frames of buffering, and display vendors have tended to implement the general case without optimizing for the cases that could be done with low or no delay. Some consumer displays wind up buffering three or more frames internally, resulting in 50 milliseconds of latency even when the input data could have been fed directly into the display matrix.</p> <p>Some less common display technologies have speed advantages over LCD panels; OLED pixels can have switching times well under a millisecond, and laser displays are as instantaneous as CRTs.</p> <p>A subtle latency point is that most displays present an image incrementally as it is scanned out from the computer, which has the effect that the bottom of the screen changes 16 milliseconds later than the top of the screen on a 60 fps display. 
This is rarely a problem on a static display, but on a head mounted display it can cause the world to appear to shear left and right, or “waggle” as the head is rotated, because the source image was generated for an instant in time, but different parts are presented at different times. This effect is usually masked by switching times on LCD HMDs, but it is obvious with fast OLED HMDs.</p> <h4 id="host-processing">Host processing</h4> <p>The classic processing model for a game or VR application is:</p> <pre><code>Read user input -&gt; run simulation -&gt; issue rendering commands -&gt; graphics drawing -&gt; wait for vsync -&gt; scanout I = Input sampling and dependent calculation S = simulation / game execution R = rendering engine G = GPU drawing time V = video scanout time </code></pre> <p>All latencies are based on a frame time of roughly 16 milliseconds, a progressively scanned display, and zero sensor and pixel latency.</p> <p>If the performance demands of the application are well below what the system can provide, a straightforward implementation with no parallel overlap will usually provide fairly good latency values. However, if running synchronized to the video refresh, the minimum latency will still be 16 ms even if the system is infinitely fast. This rate feels good for most eye-hand tasks, but it is still a perceptible lag that can be felt in a head mounted display, or in the responsiveness of a mouse cursor.</p> <pre><code>Ample performance, vsync: ISRG------------|VVVVVVVVVVVVVVVV| .................. latency 16 – 32 milliseconds </code></pre> <p>Running without vsync on a very fast system will deliver better latency, but only over a fraction of the screen, and with visible tear lines. The impact of the tear lines are related to the disparity between the two frames that are being torn between, and the amount of time that the tear lines are visible. Tear lines look worse on a continuously illuminated LCD than on a CRT or laser projector, and worse on a 60 fps display than a 120 fps display. Somewhat counteracting that, slow switching LCD panels blur the impact of the tear line relative to the faster displays.</p> <p>If enough frames were rendered such that each scan line had a unique image, the effect would be of a “rolling shutter”, rather than visible tear lines, and the image would feel continuous. Unfortunately, even rendering 1000 frames a second, giving approximately 15 bands on screen separated by tear lines, is still quite objectionable on fast switching displays, and few scenes are capable of being rendered at that rate, let alone 60x higher for a true rolling shutter on a 1080P display.</p> <pre><code>Ample performance, unsynchronized: ISRG VVVVV ..... latency 5 – 8 milliseconds at ~200 frames per second </code></pre> <p>In most cases, performance is a constant point of concern, and a parallel pipelined architecture is adopted to allow multiple processors to work in parallel instead of sequentially. Large command buffers on GPUs can buffer an entire frame of drawing commands, which allows them to overlap the work on the CPU, which generally gives a significant frame rate boost at the expense of added latency.</p> <pre><code>CPU:ISSSSSRRRRRR----| GPU: |GGGGGGGGGGG----| VID: | |VVVVVVVVVVVVVVVV| .................................. latency 32 – 48 milliseconds </code></pre> <p>When the CPU load for the simulation and rendering no longer fit in a single frame, multiple CPU cores can be used in parallel to produce more frames. 
It is possible to reduce frame execution time without increasing latency in some cases, but the natural split of simulation and rendering has often been used to allow effective pipeline parallel operation. Work queue approaches buffered for maximum overlap can cause an additional frame of latency if they are on the critical user responsiveness path.</p> <pre><code>CPU1:ISSSSSSSS-------| CPU2: |RRRRRRRRR-------| GPU : | |GGGGGGGGGG------| VID : | | |VVVVVVVVVVVVVVVV| .................................................... latency 48 – 64 milliseconds </code></pre> <p>Even if an application is running at a perfectly smooth 60 fps, it can still have host latencies of over 50 milliseconds, and an application targeting 30 fps could have twice that. Sensor and display latencies can add significant additional amounts on top of that, so the goal of 20 milliseconds motion-to-photons latency is challenging to achieve.</p> <h4 id="latency-reduction-strategies">Latency Reduction Strategies</h4> <h4 id="prevent-gpu-buffering">Prevent GPU buffering</h4> <p>The drive to win frame rate benchmark wars has led driver writers to aggressively buffer drawing commands, and there have even been cases where drivers ignored explicit calls to glFinish() in the name of improved “performance”. Today’s fence primitives do appear to be reliably observed for drawing primitives, but the semantics of buffer swaps are still worryingly imprecise. A recommended sequence of commands to synchronize with the vertical retrace and idle the GPU is:</p> <pre><code>SwapBuffers(); DrawTinyPrimitive(); InsertGPUFence(); BlockUntilFenceIsReached(); </code></pre> <p>While this should always prevent excessive command buffering on any conformant driver, it could conceivably fail to provide an accurate vertical sync timing point if the driver was transparently implementing triple buffering.</p> <p>To minimize the performance impact of synchronizing with the GPU, it is important to have sufficient work ready to send to the GPU immediately after the synchronization is performed. The details of exactly when the GPU can begin executing commands are platform specific, but execution can be explicitly kicked off with glFlush() or equivalent calls. If the code issuing drawing commands does not proceed fast enough, the GPU may complete all the work and go idle with a “pipeline bubble”. Because the CPU time to issue a drawing command may have little relation to the GPU time required to draw it, these pipeline bubbles may cause the GPU to take noticeably longer to draw the frame than if it were completely buffered. Ordering the drawing so that larger and slower operations happen first will provide a cushion, as will pushing as much preparatory work as possible before the synchronization point.</p> <pre><code>Run GPU with minimal buffering: CPU1:ISSSSSSSS-------| CPU2: |RRRRRRRRR-------| GPU : |-GGGGGGGGGG-----| VID : | |VVVVVVVVVVVVVVVV| ................................... latency 32 – 48 milliseconds </code></pre> <p>Tile based renderers, as are found in most mobile devices, inherently require a full scene of command buffering before they can generate their first tile of pixels, so synchronizing before issuing any commands will destroy far more overlap. 
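</p> <p>As a concrete illustration of the swap-then-fence idiom above, here is a minimal sketch using standard OpenGL 3.2 sync objects (glFenceSync / glClientWaitSync). The SwapWindow() and IssueTinyDraw() helpers are hypothetical stand-ins for the platform buffer swap and a trivial draw call, and a GL function loader is assumed to already be initialized:</p> <pre><code>// Sketch: drain the GPU command queue right after the buffer swap so the
// next frame starts with minimal buffering. Assumes a current OpenGL 3.2+
// context; SwapWindow() and IssueTinyDraw() are hypothetical placeholders.
#include &lt;glad/glad.h&gt;   // or any other GL loader

extern void SwapWindow();     // platform buffer swap (placeholder)
extern void IssueTinyDraw();  // tiny draw so the fence follows real work

void SwapAndIdleGpu() {
    SwapWindow();
    IssueTinyDraw();
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    // Flush the command stream and block until the GPU has executed
    // everything up to the fence (timeout given in nanoseconds).
    glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 20u * 1000u * 1000u);
    glDeleteSync(fence);
}
</code></pre> <p>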
In a modern rendering engine there may be multiple scene renders for each frame to handle shadows, reflections, and other effects, but increased latency is still a fundamental drawback of the technology.</p> <p>High end, multiple GPU systems today are usually configured for AFR, or Alternate Frame Rendering, where each GPU is allowed to take twice as long to render a single frame, but the overall frame rate is maintained because there are two GPUs producing frames</p> <pre><code>Alternate Frame Rendering dual GPU: CPU1:IOSSSSSSS-------|IOSSSSSSS-------| CPU2: |RRRRRRRRR-------|RRRRRRRRR-------| GPU1: | GGGGGGGGGGGGGGGGGGGGGGGG--------| GPU2: | | GGGGGGGGGGGGGGGGGGGGGGG---------| VID : | | |VVVVVVVVVVVVVVVV| .................................................... latency 48 – 64 milliseconds </code></pre> <p>Similarly to the case with CPU workloads, it is possible to have two or more GPUs cooperate on a single frame in a way that delivers more work in a constant amount of time, but it increases complexity and generally delivers a lower total speedup.</p> <p>An attractive direction for stereoscopic rendering is to have each GPU on a dual GPU system render one eye, which would deliver maximum performance and minimum latency, at the expense of requiring the application to maintain buffers across two independent rendering contexts.</p> <p>The downside to preventing GPU buffering is that throughput performance may drop, resulting in more dropped frames under heavily loaded conditions.</p> <h4 id="late-frame-scheduling">Late frame scheduling</h4> <p>Much of the work in the simulation task does not depend directly on the user input, or would be insensitive to a frame of latency in it. If the user processing is done last, and the input is sampled just before it is needed, rather than stored off at the beginning of the frame, the total latency can be reduced.</p> <p>It is very difficult to predict the time required for the general simulation work on the entire world, but the work just for the player’s view response to the sensor input can be made essentially deterministic. If this is split off from the main simulation task and delayed until shortly before the end of the frame, it can remove nearly a full frame of latency.</p> <pre><code>Late frame scheduling: CPU1:SSSSSSSSS------I| CPU2: |RRRRRRRRR-------| GPU : |-GGGGGGGGGG-----| VID : | |VVVVVVVVVVVVVVVV| .................... latency 18 – 34 milliseconds </code></pre> <p>Adjusting the view is the most latency sensitive task; actions resulting from other user commands, like animating a weapon or interacting with other objects in the world, are generally insensitive to an additional frame of latency, and can be handled in the general simulation task the following frame.</p> <p>The drawback to late frame scheduling is that it introduces a tight scheduling requirement that usually requires busy waiting to meet, wasting power. 
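</p> <p>A rough sketch of what late frame scheduling might look like on the CPU, assuming hypothetical SampleHeadTracker(), ComputeViewFromPose(), and SubmitView() calls and a predicted vsync time supplied by the caller; the spin loop at the end is exactly the busy waiting mentioned above:</p> <pre><code>// Sketch of late frame scheduling: defer the latency-sensitive view update
// until a deadline just before the predicted vsync, and only sample the
// sensor at that point. Device-specific calls are hypothetical placeholders.
#include &lt;chrono&gt;

using Clock = std::chrono::steady_clock;

struct HeadPose   { float qx, qy, qz, qw; };
struct ViewMatrix { float m[16]; };

extern HeadPose   SampleHeadTracker();                   // read the sensor
extern ViewMatrix ComputeViewFromPose(const HeadPose&amp;);  // cheap, deterministic
extern void       SubmitView(const ViewMatrix&amp;);         // hand off to rendering

void FinishFrame(Clock::time_point predicted_vsync) {
    // ... general simulation and rendering work for this frame happens earlier ...

    // Leave just enough margin for the (deterministic) view computation.
    auto view_deadline = predicted_vsync - std::chrono::milliseconds(2);

    // Busy-wait to hit the deadline accurately; sleeping is usually too
    // coarse, which is where the wasted power comes from.
    while (Clock::now() &lt; view_deadline) { /* spin */ }

    SubmitView(ComputeViewFromPose(SampleHeadTracker()));
}
</code></pre> <p>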
If your frame rate is determined by the video retrace rather than an arbitrary time slice, assistance from the graphics driver in accurately determining the current scanout position is helpful.</p> <h4 id="view-bypass">View bypass</h4> <p>An alternate way of accomplishing a similar, or slightly greater latency reduction is to allow the rendering code to modify the parameters delivered to it by the game code, based on a newer sampling of user input.</p> <p>At the simplest level, the user input can be used to calculate a delta from the previous sampling to the current one, which can be used to modify the view matrix that the game submitted to the rendering code.</p> <p>Delta processing in this way is minimally intrusive, but there will often be situations where the user input should not affect the rendering, such as cinematic cut scenes or when the player has died. It can be argued that a game designed from scratch for virtual reality should avoid those situations, because a non-responsive view in a HMD is disorienting and unpleasant, but conventional game design has many such cases.</p> <p>A binary flag could be provided to disable the bypass calculation, but it is useful to generalize such that the game provides an object or function with embedded state that produces rendering parameters from sensor input data instead of having the game provide the view parameters themselves. In addition to handling the trivial case of ignoring sensor input, the generator function can incorporate additional information such as a head/neck positioning model that modified position based on orientation, or lists of other models to be positioned relative to the updated view.</p> <p>If the game and rendering code are running in parallel, it is important that the parameter generation function does not reference any game state to avoid race conditions.</p> <pre><code>View bypass: CPU1:ISSSSSSSSS------| CPU2: |IRRRRRRRRR------| GPU : |--GGGGGGGGGG----| VID : | |VVVVVVVVVVVVVVVV| .................. latency 16 – 32 milliseconds </code></pre> <p>The input is only sampled once per frame, but it is simultaneously used by both the simulation task and the rendering task. Some input processing work is now duplicated by the simulation task and the render task, but it is generally minimal.</p> <p>The latency for parameters produced by the generator function is now reduced, but other interactions with the world, like muzzle flashes and physics responses, remain at the same latency as the standard model.</p> <p>A modified form of view bypass could allow tile based GPUs to achieve similar view latencies to non-tiled GPUs, or allow non-tiled GPUs to achieve 100% utilization without pipeline bubbles by the following steps:</p> <p>Inhibit the execution of GPU commands, forcing them to be buffered.
OpenGL has only the deprecated display list functionality to approximate this, but a control extension could be formulated.</p> <p>All calculations that depend on the view matrix must reference it independently from a buffer object, rather than from inline parameters or as a composite model-view-projection (MVP) matrix.</p> <p>After all commands have been issued and the next frame has started, sample the user input, run it through the parameter generator, and put the resulting view matrix into the buffer object for referencing by the draw commands.</p> <p>Kick off the draw command execution.</p> <pre><code>Tiler optimized view bypass: CPU1:ISSSSSSSSS------| CPU2: |IRRRRRRRRRR-----|I GPU : | |-GGGGGGGGGG-----| VID : | | |VVVVVVVVVVVVVVVV| .................. latency 16 – 32 milliseconds </code></pre> <p>Any view frustum culling that was performed to avoid drawing some models may be invalid if the new view matrix has changed substantially enough from what was used during the rendering task. This can be mitigated at some performance cost by using a larger frustum field of view for culling, and hardware clip planes based on the culling frustum limits can be used to guarantee a clean edge if necessary. Occlusion errors from culling, where a bright object is seen that should have been occluded by an object that was incorrectly culled, are very distracting, but a temporary clean encroaching of black at a screen edge during rapid rotation is almost unnoticeable.</p> <h4 id="time-warping">Time warping</h4> <p>If you had perfect knowledge of how long the rendering of a frame would take, some additional amount of latency could be saved by late frame scheduling the entire rendering task, but this is not practical due to the wide variability in frame rendering times.</p> <pre><code>Late frame input sampled view bypass: CPU1:ISSSSSSSSS------| CPU2: |----IRRRRRRRRR--| GPU : |------GGGGGGGGGG| VID : | |VVVVVVVVVVVVVVVV| .............. latency 12 – 28 milliseconds </code></pre> <p>However, a post processing task on the rendered image can be counted on to complete in a fairly predictable amount of time, and can be late scheduled more easily. Any pixel on the screen, along with the associated depth buffer value, can be converted back to a world space position, which can be re-transformed to a different screen space pixel location for a modified set of view parameters.</p> <p>After drawing a frame with the best information at your disposal, possibly with bypassed view parameters, instead of displaying it directly, fetch the latest user input, generate updated view parameters, and calculate a transformation that warps the rendered image into a position that approximates where it would be with the updated parameters. Using that transform, warp the rendered image into an updated form on screen that reflects the new input. If there are two dimensional overlays present on the screen that need to remain fixed, they must be drawn or composited in after the warp operation, to prevent them from incorrectly moving as the view parameters change.</p> <pre><code>Late frame scheduled time warp: CPU1:ISSSSSSSSS------| CPU2: |RRRRRRRRRR----IR| GPU : |-GGGGGGGGGG----G| VID : | |VVVVVVVVVVVVVVVV| .... latency 2 – 18 milliseconds </code></pre> <p>If the difference between the view parameters at the time of the scene rendering and the time of the final warp is only a change in direction, the warped image can be almost exactly correct within the limits of the image filtering. 
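</p> <p>One way the reprojection described above might be expressed, as a sketch that assumes GLM for the matrix math (in practice this transform would live in a shader): a pixel plus its depth value is lifted back through the inverse of the view-projection matrix used for rendering, then projected again with the newer view parameters:</p> <pre><code>// Sketch: reproject one pixel from the rendered view to the latest view.
// GLM is an assumed dependency; renderViewProj and latestViewProj are the
// view-projection matrices at render time and at warp time.
#include &lt;glm/glm.hpp&gt;

glm::vec2 WarpPixel(glm::vec2 ndcXY,          // pixel position in NDC, [-1, 1]
                    float depthNdc,           // depth buffer value remapped to [-1, 1]
                    const glm::mat4&amp; renderViewProj,
                    const glm::mat4&amp; latestViewProj) {
    // Back-project to a homogeneous world-space position.
    glm::vec4 world = glm::inverse(renderViewProj) * glm::vec4(ndcXY, depthNdc, 1.0f);

    // Re-project with the newer view parameters and divide by w.
    glm::vec4 reprojected = latestViewProj * world;
    return glm::vec2(reprojected) / reprojected.w;
}
</code></pre> <p>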
Effects that are calculated relative to the screen, like depth based fog (versus distance based fog) and billboard sprites will be slightly different, but not in a manner that is objectionable.</p> <p>If the warp involves translation as well as direction changes, geometric silhouette edges begin to introduce artifacts where internal parallax would have revealed surfaces not visible in the original rendering. A scene with no silhouette edges, like the inside of a box, can be warped significant amounts and display only changes in texture density, but translation warping realistic scenes will result in smears or gaps along edges. In many cases these are difficult to notice, and they always disappear when motion stops, but first person view hands and weapons are a prominent case. This can be mitigated by limiting the amount of translation warp, compressing or making constant the depth range of the scene being warped to limit the dynamic separation, or rendering the disconnected near field objects as a separate plane, to be composited in after the warp.</p> <p>If an image is being warped to a destination with the same field of view, most warps will leave some corners or edges of the new image undefined, because none of the source pixels are warped to their locations. This can be mitigated by rendering a larger field of view than the destination requires; but simply leaving unrendered pixels black is surprisingly unobtrusive, especially in a wide field of view HMD.</p> <p>A forward warp, where source pixels are deposited in their new positions, offers the best accuracy for arbitrary transformations. At the limit, the frame buffer and depth buffer could be treated as a height field, but millions of half pixel sized triangles would have a severe performance cost. Using a grid of triangles at some fraction of the depth buffer resolution can bring the cost down to a very low level, and the trivial case of treating the rendered image as a single quad avoids all silhouette artifacts at the expense of incorrect pixel positions under translation.</p> <p>Reverse warping, where the pixel in the source rendering is estimated based on the position in the warped image, can be more convenient because it is implemented completely in a fragment shader. It can produce identical results for simple direction changes, but additional artifacts near geometric boundaries are introduced if per-pixel depth information is considered, unless considerable effort is expended to search a neighborhood for the best source pixel.</p> <p>If desired, it is straightforward to incorporate motion blur in a reverse mapping by taking several samples along the line from the pixel being warped to the transformed position in the source image.</p> <p>Reverse mapping also allows the possibility of modifying the warp through the video scanout. The view parameters can be predicted ahead in time to when the scanout will read the bottom row of pixels, which can be used to generate a second warp matrix. The warp to be applied can be interpolated between the two of them based on the pixel row being processed. 
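</p> <p>The per-row blend can be sketched as follows (again assuming GLM; warpTopRow and warpBottomRow would be built from view parameters predicted for the scanout times of the first and last rows, and on real hardware the blend would normally be done in the warp shader):</p> <pre><code>// Sketch: interpolate between two warp matrices by the row being warped.
// Prediction of the two sets of view parameters is not shown.
#include &lt;glm/glm.hpp&gt;

glm::mat4 WarpForRow(int row, int screenHeight,
                     const glm::mat4&amp; warpTopRow,
                     const glm::mat4&amp; warpBottomRow) {
    float t = static_cast&lt;float&gt;(row) / static_cast&lt;float&gt;(screenHeight - 1);
    // Component-wise blend; adequate for the small per-frame changes involved.
    return warpTopRow + (warpBottomRow - warpTopRow) * t;
}
</code></pre> <p>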
This can correct for the “waggle” effect on a progressively scanned head mounted display, where the 16 millisecond difference in time between the display showing the top line and bottom line results in a perceived shearing of the world under rapid rotation on fast switching displays.</p> <h4 id="continuously-updated-time-warping">Continuously updated time warping</h4> <p>If the necessary feedback and scheduling mechanisms are available, instead of predicting what the warp transformation should be at the bottom of the frame and warping the entire screen at once, the warp to screen can be done incrementally while continuously updating the warp matrix as new input arrives.</p> <pre><code>Continuous time warp: CPU1:ISSSSSSSSS------| CPU2: |RRRRRRRRRRR-----| GPU : |-GGGGGGGGGGGG---| WARP: | W| W W W W W W W W| VID : | |VVVVVVVVVVVVVVVV| ... latency 2 – 3 milliseconds for 500hz sensor updates </code></pre> <p>The ideal interface for doing this would be some form of “scanout shader” that would be called “just in time” for the video display. Several video game systems like the Atari 2600, Jaguar, and Nintendo DS have had buffers ranging from half a scan line to several scan lines that were filled up in this manner.</p> <p>Without new hardware support, it is still possible to incrementally perform the warping directly to the front buffer being scanned for video, and not perform a swap buffers operation at all.</p> <p>A CPU core could be dedicated to the task of warping scan lines at roughly the speed they are consumed by the video output, updating the time warp matrix each scan line to blend in the most recently arrived sensor information.</p> <p>GPUs can perform the time warping operation much more efficiently than a conventional CPU can, but the GPU will be busy drawing the next frame during video scanout, and GPU drawing operations cannot currently be scheduled with high precision due to the difficulty of task switching the deep pipelines and extensive context state. However, modern GPUs are beginning to allow compute tasks to run in parallel with graphics operations, which may allow a fraction of a GPU to be dedicated to performing the warp operations as a shared parameter buffer is updated by the CPU.</p> <h4 id="discussion">Discussion</h4> <p>View bypass and time warping are complementary techniques that can be applied independently or together. Time warping can warp from a source image at an arbitrary view time / location to any other one, but artifacts from internal parallax and screen edge clamping are reduced by using the most recent source image possible, which view bypass rendering helps provide.</p> <p>Actions that require simulation state changes, like flipping a switch or firing a weapon, still need to go through the full pipeline for 32 – 48 milliseconds of latency based on what scan line the result winds up displaying on the screen, and translational information may not be completely faithfully represented below the 16 – 32 milliseconds of the view bypass rendering, but the critical head orientation feedback can be provided in 2 – 18 milliseconds on a 60 hz display. In conjunction with low latency sensors and displays, this will generally be perceived as immediate. 
Continuous time warping opens up the possibility of latencies below 3 milliseconds, which may cross largely unexplored thresholds in human / computer interactivity.</p> <p>Conventional computer interfaces are generally not as latency demanding as virtual reality, but sensitive users can tell the difference in mouse response down to the same 20 milliseconds or so, making it worthwhile to apply these techniques even in applications without a VR focus.</p> <p>A particularly interesting application is in “cloud gaming”, where a simple client appliance or application forwards control information to a remote server, which streams back real time video of the game. This offers significant convenience benefits for users, but the inherent network and compression latencies makes it a lower quality experience for action oriented titles. View bypass and time warping can both be performed on the server, regaining a substantial fraction of the latency imposed by the network. If the cloud gaming client was made more sophisticated, time warping could be performed locally, which could theoretically reduce the latency to the same levels as local applications, but it would probably be prudent to restrict the total amount of time warping to perhaps 30 or 40 milliseconds to limit the distance from the source images.</p> <h4 id="acknowledgements">Acknowledgements</h4> <p>Zenimax for allowing me to publish this openly.</p> <p>Hillcrest Labs for inertial sensors and experimental firmware.</p> <p>Emagin for access to OLED displays.</p> <p>Oculus for a prototype Rift HMD.</p> <p>Nvidia for an experimental driver with access to the current scan line number.</p> Kara Swisher interview of Jack Dorsey karajack/ Tue, 12 Feb 2013 00:00:00 +0000 karajack/ <p><i>This is a transcript of the Kara Swisher / Jack Dorsey interview from 2/12/2019, made by parsing the original Tweets because I wanted to be able to read this linearly. There's a &quot;moment&quot; that tries to track this, but since it doesn't distinguish between sub-threads in any way, you can't tell the difference between end of a thread and a normal reply. This linearization of the interview marks each thread break with a page break and provides some context from upthread where relevant (in <font color="#808080">grey text</font>).</i></p> <p><a href="https://twitter.com/karaswisher/status/1095440667373899776"><strong>Kara</strong></a>: Here in my sweatiest @soulcycle outfit for my Twitterview with @jack with @Laur_Katz at the ready @voxmediainc HQ. Also @cheezit acquired. #karajack</p> <p><a href="https://twitter.com/karaswisher/status/1095442716346011649"><strong>Kara</strong></a>: Oh hai @jack. Let’s set me set the table. First, I am uninterested in beard amulets or weird food Mark Zuckerberg served you (though WTF with both for my personal self). Second, I would appreciate really specific answers.</p> <p><a href="https://twitter.com/jack/status/1095442976497496064">Jack</a>: Got you. Here’s my setup. I work from home Tuesdays. In my kitchen. Tweetdeck. No one here with me, and no one connected to my tweetdeck. 
Just me focused on your questions!</p> <p><a href="https://twitter.com/karaswisher/status/1095443173260681217"><strong>Kara</strong></a>: Great, let's go</p> <p><a href="https://twitter.com/jack/status/1095443608340115457">Jack</a>: Ready</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095442857987661825"><strong>Kara</strong></a>: As @ashleyfeinberg wrote: “press him for a clear, unambiguous example of nearly anything, and Dorsey shuts down.” That is not unfair characterization IMHO. Third, I will thread in questions from audience, but to keep this non chaotic, let’s stay in one reply thread.</p> <p><a href="https://twitter.com/jack/status/1095443050271145984">Jack</a>: Deal</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095442999616724993"><strong>Kara</strong></a>: To be clear with audience, there is not a new event product, a glass house, if you will, where people can see us but not comment. I will ask questions and then respond to @jack answers. So it could be CHAOS.</p> <p><a href="https://twitter.com/jack/status/1095443402131271680">Jack</a>: To be clear, we’re interested in an experience like this. Nothing built yet. This gives us a sense of what it would be like, and what we’d need to focus on. If there’s something here at all!</p> <p><a href="https://twitter.com/karaswisher/status/1095443579147902977"><strong>Kara</strong></a>: Well an event product WOULD BE NICE. See my why aren't you moving faster trope.</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095443416148787202"><strong>Kara</strong></a>: Overall here is my mood and I think a lot of people when it comes to fixing what is broke about social media and tech: Why aren’t you moving faster? Why aren’t you moving faster? Why aren’t you moving faster?</p> <p><a href="https://twitter.com/jack/status/1095444282742173696">Jack</a>: A question we ask ourselves all the time. In the past I think we were trying to do too much. We’re better at prioritizing by impact now. Believe the #1 thing we should focus on is someone’s physical safety first. That one statement leads to a lot of ramifications.</p> <p><a href="https://twitter.com/karaswisher/status/1095444602545418246"><strong>Kara</strong></a>: It seems twitter has been stuck in a stagnant phase of considering/thinking about the health of the conversation, which plays into safety, for about 18-24 months. How have you made actual progress? Can you point me to it SPECIFICALLY?</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095443704448397312"><strong>Kara</strong></a>: You know my jam these days is tech responsibility. What grade do you gave Silicon Valley? Yourself?</p> <p><a href="https://twitter.com/jack/status/1095444570060419072">Jack</a>: Myself? C. We’ve made progress, but it has been scattered and not felt enough. Changing the experience hasn’t been meaningful enough. And we’ve put most of the burden on the victims of abuse (that’s a huge fail).</p> <p><a href="https://twitter.com/karaswisher/status/1095444895156789253"><strong>Kara</strong></a>: Well that is like telling me I am sick and am responsible for fixing it. YOU made the product, YOU run the platform. Saying it is a huge fail is a cop out to many. It is to me</p> <p><a href="https://twitter.com/jack/status/1095445502131195904">Jack</a>: Putting the burden on victims? Yes. It’s recognizing that we have to be proactive in enforcement and promotion of healthy conversation. This is our first priority in #health. 
We have to change a lot of the fundamentals of product to fix.</p> <p><a href="https://twitter.com/karaswisher/status/1095445998288207872"><strong>Kara</strong></a>: please be specific. I see a lot of beard-stroking on this (no insult to your Lincoln jam, but it works). WHAT are you changing? SPECIFICALLY.</p> <p><a href="https://twitter.com/jack/status/1095446428065837056">Jack</a>: First and foremost we’re looking at ways to proactively enforce and promote health. So that reporting/blocking is a last resort. Problem we’re trying to solve is taking that work away.</p> <p><a href="https://twitter.com/karaswisher/status/1095446571905441793"><strong>Kara</strong></a>: Ok name three initiatives.</p> <hr> <p><p style="color:#808080">Jack: Myself? C. We’ve made progress, but it has been scattered and not felt enough. Changing the experience hasn’t been meaningful enough. And we’ve put most of the burden on the victims of abuse (that’s a huge fail). </color></p> <p><a href="https://twitter.com/karaswisher/status/1095445631278243840"><strong>Kara</strong></a>: Also my son gets a C in coding and that is NO tragedy. You getting one matters a lot.</p> <p><a href="https://twitter.com/jack/status/1095446128844189696">Jack</a>: Agree it matters a lot. And it’s the most important thing we need to address and fix. I’m stating that it’s a fail of ours to put the majority of burden on victims. That’s how the service works today.</p> <p><a href="https://twitter.com/karaswisher/status/1095446310247907328"><strong>Kara</strong></a>: Ok but I really want to drill down on HOW. How much downside are you willing to tolerate to balance the good that Twitter can provide? Be specific</p> <p><a href="https://twitter.com/jack/status/1095447340809220097">Jack</a>: This is exactly the balance we have to think deeply about. But in doing so, we have to look at how the product works. And where abuse happens the most: replies, mentions, search, and trends. Those are the shared spaces people take advantage of</p> <p><a href="https://twitter.com/karaswisher/status/1095447844595662852"><strong>Kara</strong></a>: Well, WHERE does abuse happen most</p> <p><a href="https://twitter.com/jack/status/1095448481215336448">Jack</a>: Within the service? Likely within replies. That’s why we’ve been more aggressive about proactively downranking behind interstitials, for example.</p> <p><a href="https://twitter.com/karaswisher/status/1095448920757600258"><strong>Kara</strong></a>: Why not just be more stringent on kicking off offenders? It seems like you tolerate a lot. If Twitter ran my house, my kids would be eating ramen, playing Red Dead Redemption 2 and wearing filthy socks</p> <p><a href="https://twitter.com/jack/status/1095449592710148096">Jack</a>: We action all we can against our policies. Most of our system today works reactively to someone reporting it. If they don’t report, we don’t see it. Doesn’t scale. Hence the need to focus on proactive</p> <p><a href="https://twitter.com/karaswisher/status/1095450001101209600"><strong>Kara</strong></a>: But why did you NOT see it? It seems pretty basic to run your platform with some semblance of paying mind to what people are doing on it? Can you give me some insight into why that was not done?</p> <p><a href="https://twitter.com/jack/status/1095450592200151040">Jack</a>: I think we tried to do too much in the past, and that leads to diluted answers and nothing impactful. There’s a lot we need to address globally. We have to prioritize our resources according to impact. 
Otherwise we won’t make much progress.</p> <p><a href="https://twitter.com/karaswisher/status/1095451260088668161"><strong>Kara</strong></a>: Got it. But do you think the fact that you all could not conceive of what it is to feel unsafe (women, POC, LGBTQ, other marginalized people) could be one of the issues? (new topic soon)</p> <p><a href="https://twitter.com/jack/status/1095451822049783808">Jack</a>: I think it’s fair and real. No question. Our org has to be reflective of the people we’re trying to serve. One of the reason we established the Trust and Safety council years ago, to get feedback and check ourselves.</p> <p><a href="https://twitter.com/karaswisher/status/1095451952253677569"><strong>Kara</strong></a>: Yes but i want THREE concrete examples.</p> <hr> <p><p style="color:#808080">Jack: First and foremost we’re looking at ways to proactively enforce and promote health. So that reporting/blocking is a last resort. Problem we’re trying to solve is taking that work away.</color></p> <p><a href="https://twitter.com/karaswisher/status/1095447228615983106"><strong>Kara</strong></a>: Or maybe, tell me what you think the illness is you are treating? I think you cannot solve a disease without knowing that. Or did you create the virus?</p> <p><a href="https://twitter.com/jack/status/1095447781571932160">Jack</a>: Good question. This is why we’re focused on understanding what conversational health means. We see a ton of threats to health in digital conversation. We’re focuse first on off-platform ramifications (physical safety). That clarifies priorities of policy and enforcement.</p> <p><a href="https://twitter.com/karaswisher/status/1095448255301861376"><strong>Kara</strong></a>: I am still confused. What the heck is &quot;off-platform ramifications&quot;? You are not going to have a police force, right? Are you 911?</p> <p><a href="https://twitter.com/jack/status/1095448881054150656">Jack</a>: No, not a police force. I mean we have to consider first and foremost what online activity does to impact physical safety, as a way to prioritize our efforts. I don’t think companies like ours have admitted or focused on that enough.</p> <p><a href="https://twitter.com/karaswisher/status/1095449262639513600"><strong>Kara</strong></a>: So you do see the link between what you do and real life danger to people? Can you say that explicitly? I could not be @finkd to even address the fact that he made something that resulted in real tragedy.</p> <p><a href="https://twitter.com/jack/status/1095450020193525761">Jack</a>: I see the link, and that’s why we need to put physical safety above all else. That’s what we’re figuring out how to do now. We don’t have all the answers just yet. But that’s the focus. I think it clarifies a lot of the work we need to do. Not all of it of course.</p> <p><a href="https://twitter.com/karaswisher/status/1095450868902703106"><strong>Kara</strong></a>: I grade you all an F on this and that's being kind. I'm not trying to be a jackass, but it's been a very slow roll by all of you in tech to pay attention to this. Why do you think that is? I think it is because many of the people who made Twitter never ever felt unsafe.</p> <p><a href="https://twitter.com/jack/status/1095451280443506689">Jack</a>: Likely a reason. 
I’m certain lack of diversity didn’t help with empathy of what people experience on Twitter every day, especially women.</p> <p><a href="https://twitter.com/karaswisher/status/1095451773983227904"><strong>Kara</strong></a>: And so to end this topic, I will try again. Please give me three concrete things you have done to fix this. SPECIFIC.</p> <p><a href="https://twitter.com/jack/status/1095452345423429632">Jack</a>: 1. We have evolved our polices. 2. We have prioritized proactive enforcement to remove burden from victims 3. We have given more control in product (like mute of accounts without profile pics or associated phone/emails) 4. Much more aggressive on coordinated behavior/gaming</p> <p><a href="https://twitter.com/karaswisher/status/1095452756331106307"><strong>Kara</strong></a>: 1. WHICH? 2. HOW? 3. OK, MUTE BUT THAT WAS A WHILE AGO 4. WHAT MORE? I think people are dying for specifics.</p> <p><a href="https://twitter.com/jack/status/1095453539650830337">Jack</a>: 1. Misgendering policy as example. 2. Using ML to downrank bad actors behind interstitials 3. Not too long ago, but most of our work going forward will have to be product features. 4. Not sure the question. We put an entire model in place to minimize gaming of system.</p> <p><a href="https://twitter.com/karaswisher/status/1095453806442266624"><strong>Kara</strong></a>: thx. I meant even more specifics on 4. But see the Twitter purge one.</p> <p><a href="https://twitter.com/jack/status/1095454064047845376">Jack</a>: Just resonded to that. Don’t see the twitter purge one</p> <p><a href="https://twitter.com/karaswisher/status/1095454371494719493"><strong>Kara</strong></a>: I wanted to get off thread with Mark added! Like he needs more of me.</p> <p><a href="https://twitter.com/jack/status/1095454523311644672">Jack</a>: Does he check this much?</p> <p><a href="https://twitter.com/karaswisher/status/1095454681034342400"><strong>Kara</strong></a>: No, he is busy fixing Facebook. NOT! (he makes you look good)</p> <p><a href="https://twitter.com/karaswisher/status/1095455420020375552"><strong>Kara</strong></a>: I am going to start a NEW thread to make it easy for people to follow (@waltmossberg just texted me that it is a &quot;chaotic hellpit&quot;). Stay in that one. OK?</p> <p><a href="https://twitter.com/jack/status/1095455732458110976">Jack</a>: Ok. Definitely not easy to follow the conversation. Exactly why we are doing this. Fixing stuff like this will help I believe.</p> <p><a href="https://twitter.com/karaswisher/status/1095455941133324288"><strong>Kara</strong></a>: Yeah, it's Chinatown, Jake.</p> <hr> <p><p style="color:#808080">Jack: First and foremost we’re looking at ways to proactively enforce and promote health. So that reporting/blocking is a last resort. Problem we’re trying to solve is taking that work away.</color></p> <p><a href="https://twitter.com/jack/status/1095446924855898112">Jack</a>: Second, we’re constantly evolving our policies to address the issues we see today. We’re rooting them in fundamental human rights (UN) and putting physical safety as our top priority. Privacy next.</p> <p><a href="https://twitter.com/karaswisher/status/1095447726794424320"><strong>Kara</strong></a>: When you say physical safety, I am confused. What do you mean specifically? You are not a police force. In fact, social media companies have built cities without police, fire departments, garbage pickup or street signs. 
IMHO What do you think of that metaphor?</p> <p><a href="https://twitter.com/jack/status/1095448163479998464">Jack</a>: I mean off platform, offline ramifications. What people do offline with what they see online. Doxxing is a good example which threatens physical safety. So does coordinate harassment campaigns.</p> <p><a href="https://twitter.com/karaswisher/status/1095448510319726598"><strong>Kara</strong></a>: So how do you stop THAT? I mean regular police forces cannot stop that. It seems your job is not to let it get that far in the first place.</p> <p><a href="https://twitter.com/jack/status/1095449174873563136">Jack</a>: Exactly. What can we do within the product and policy to lower probability. Again, don’t think we or others have worked against that enough.</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095456464855019520"><strong>Kara</strong></a>: Ok, new one @jack</p> <p>What do you think about twitter breaks and purges. Why do you think that is? I can’t say I’ve heard many people say they feel “good” after not being on twitter for a while: <a href="https://twitter.com/TaylorLorenz/status/1095039347596898305">https://twitter.com/TaylorLorenz/status/1095039347596898305</a></p> <p><a href="https://twitter.com/jack/status/1095457041844334593">Jack</a>: Feels terrible. I want people to walk away from Twitter feeling like they learned something and feeling empowered to some degree. It depresses me when that’s not the general vibe, and inspires me to figure it out. That’s my desire</p> <p><a href="https://twitter.com/karaswisher/status/1095457517033963521"><strong>Kara</strong></a>: But why do they feel that way? You made it.</p> <p><a href="https://twitter.com/jack/status/1095458365516374016">Jack</a>: We made something with one intent. The world showed us how it wanted to use it. A lot has been great. A lot has been unexpected. A lot has been negative. We weren’t fast enough to observe, learn, and improve</p> <hr> <p><p style="color:#808080"><strong>Kara</strong>: Ok, new one @jack</color></p> <p><a href="https://twitter.com/karaswisher/status/1095456686977019905"><strong>Kara</strong></a>: In that vein, how does it affect YOU?</p> <p><a href="https://twitter.com/jack/status/1095457906848231424">Jack</a>: I also don’t feel good about how Twitter tends to incentivize outrage, fast takes, short term thinking, echo chambers, and fragmented conversation and consideration. Are they fixable? I believe we can do a lot to address. And likely have to change more fundamentals to do so.</p> <p><a href="https://twitter.com/karaswisher/status/1095458590679289857"><strong>Kara</strong></a>: But you invented it. You can control it. Slowness is not really a good excuse.</p> <p><a href="https://twitter.com/jack/status/1095459084785004544">Jack</a>: It’s the reality. We tried to do too much at once and were not focused on what matters most. That contributes to slowness. As does our technology stack and how quickly we can ship things. That’s improved a lot recently</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095458179285172225"><strong>Kara</strong></a>: Ok trying AGAIN @jack in another new thread! This one about @realDonaldTrump:</p> <p>We know a lot more about what Donald Trump thinks because of Twitter, and we all have mixed feelings about that.</p> <p><a href="https://twitter.com/karaswisher/status/1095459010751410176"><strong>Kara</strong></a>: Have you ever considered suspending Donald Trump? 
His tweets are somewhat protected because he’s a public figure, but would he have been suspended in the past if he were a “regular” user?</p> <p><a href="https://twitter.com/jack/status/1095459760567083008">Jack</a>: We hold all accounts to the same terms of service. The most controversial aspect of our TOS is the newsworthy/public interest clause, the “protection” you mention. That doesn’t extend to all public figures by default, but does speak to global leaders and seeing how they think.</p> <p><a href="https://twitter.com/karaswisher/status/1095460546831417345"><strong>Kara</strong></a>: That seems questionable to a lot of people. Let me try it a different way: What historic newsworthy figure would you ban? Is someone bad enough to ban. Be specific. A name.</p> <p><a href="https://twitter.com/jack/status/1095461409624711168">Jack</a>: We have to enforce based on our policy and what people do on our service. And evolve it with the current times. No way I can answer that based on people. Has to be focused on patterns of how people use the technology.</p> <p><a href="https://twitter.com/karaswisher/status/1095461706044768256"><strong>Kara</strong></a>: Not one name? Ok, but it is a copout imho. I have a long list.</p> <p><a href="https://twitter.com/jack/status/1095462135717482502">Jack</a>: I think it’s more durable to focus on use cases because that allows us to act broader. Likely that these aren’t isolated cases but things that spread</p> <p><a href="https://twitter.com/karaswisher/status/1095462374612582401"><strong>Kara</strong></a>: it would be really great to get specific examples as a lot of what you are doing appears incomprehensible to many.</p> <hr> <p><p style="color:#808080"><strong>Kara</strong>: Ok trying AGAIN @jack in another new thread! This one about @realDonaldTrump:</color></p> <p><a href="https://twitter.com/karaswisher/status/1095459296492560384"><strong>Kara</strong></a>: And will Twitter’s business/engagement suffer when @realDonaldTrump is no longer President?</p> <p><a href="https://twitter.com/jack/status/1095460252852482048">Jack</a>: I don’t believe our service or business is dependent on any one account or person. I will say the number of politics conversations has significantly increased because of it, but that’s just one experience on Twitter. There are multiple Twitters, all based on who you follow.</p> <p><a href="https://twitter.com/karaswisher/status/1095461276686471168"><strong>Kara</strong></a>: Ok new question (answer the newsworthy historical figure you MIGHT ban pls): Single biggest improvement at Twitter since 2016 that signals you’re ready for the 2020 elections?</p> <p><a href="https://twitter.com/jack/status/1095461785610534912">Jack</a>: Our work against automations and coordinated campaigns. Partnering with government agencies to improve communication around threats</p> <p><a href="https://twitter.com/karaswisher/status/1095461973926518784"><strong>Kara</strong></a>: Can you give a more detailed example of that that worked?</p> <p><a href="https://twitter.com/jack/status/1095462610462433280">Jack</a>: We shared a retro on 2018 within this country, and tested a lot with the Mexican elections too. Indian elections coming up. 
In mid-terms we were able to monitor efforts to disrupt both online and offline and able to stop those actions on Twitter.</p> <hr> <p><p style="color:#808080"><strong>Kara</strong>: Ok new question (answer the newsworthy historical figure you MIGHT ban pls): Single biggest improvement at Twitter since 2016 that signals you’re ready for the 2020 elections?</color></p> <p><a href="https://twitter.com/karaswisher/status/1095461526935429120"><strong>Kara</strong></a>: What confidence should we have that Russia or other state-sponsored actors won’t be able to wreak havoc on next year’s elections?</p> <p><a href="https://twitter.com/jack/status/1095463103561490432">Jack</a>: We should expect a lot more coordination between governments and platforms to address. That would give me confidence. And have some skepticism too. That’s healthy. The more we can do this work in public and share what we find, the better</p> <p><a href="https://twitter.com/karaswisher/status/1095463591099224064"><strong>Kara</strong></a>: I still am dying for specifics here. [meme image: Give me some specifics. I love specifics, the specifics were the best part!]</p> <hr> <p><p style="color:#808080">Jack: I think it’s more durable to focus on use cases because that allows us to act broader. Likely that these aren’t isolated cases but things that spread</color></p> <p><a href="https://twitter.com/karaswisher/status/1095462715332665351"><strong>Kara</strong></a>: going to shift to biz questions since it is not a lot of time and this system is CHAOTIC (as I thought it would be): What about the move to DAU instead of MAU. Why the move? And how are we to interpret the much smaller numbers?</p> <p><a href="https://twitter.com/jack/status/1095463620190789632">Jack</a>: We want to be valuable to people daily. Not monthly. It’s a higher bar for ourselves. Sure, it looks like a smaller absolute number, but the folks we have using Twitter are some of the most influential in the world. They drive conversation. We belevie we can best grow this.</p> <p><a href="https://twitter.com/karaswisher/status/1095463921748721664"><strong>Kara</strong></a>: Ok, then WHO is the most exciting influential on Twitter right now? BE SPECIFIC</p> <p><a href="https://twitter.com/jack/status/1095465039438393344">Jack</a>: To me personally? I like how @elonmusk uses Twitter. He’s focused on solving existential problems and sharing his thinking openly. I respect that a lot, and all the ups and downs that come with it</p> <p><a href="https://twitter.com/karaswisher/status/1095465160167301121"><strong>Kara</strong></a>: What about @AOC</p> <p><a href="https://twitter.com/jack/status/1095465322071486464">Jack</a>: Totally. She’s mastering the medium</p> <p><a href="https://twitter.com/karaswisher/status/1095465494407270400"><strong>Kara</strong></a>: She is well beyond mastering it. She speaks fluent Twitter.</p> <p><a href="https://twitter.com/jack/status/1095465722862395392">Jack</a>: True</p> <p><a href="https://twitter.com/karaswisher/status/1095465850423951360"><strong>Kara</strong></a>: Also are you ever going to hire someone to effectively be your number 2?</p> <p><a href="https://twitter.com/jack/status/1095466504689115136">Jack</a>: I think it’s better to spread that responsibility across multiple people. 
It creates less dependencies and the company gets more options around future leadership</p> <hr> <p><p style="color:#808080"><strong>Kara</strong>: going to shift to biz questions since it is not a lot of time and this system is CHAOTIC (as I thought it would be): What about the move to DAU instead of MAU. Why the move? And how are we to interpret the much smaller numbers?</color></p> <p><a href="https://twitter.com/karaswisher/status/1095463105021308929"><strong>Kara</strong></a>: Also: How close were you to selling Twitter in 2016? What happened?</p> <p>What about giving the company to a public trust per your NYT discussion.</p> <p><a href="https://twitter.com/jack/status/1095464676094205952">Jack</a>: We ultimately decided we were better off independent. And I’m happy we did. We’ve made a lot of progress since that point. And we got a lot more focused. Definitely love the idea of opening more to 3rd parties. Not sure what that looks like yet. Twitter is close to a protocol.</p> <p><a href="https://twitter.com/karaswisher/status/1095464894705684481"><strong>Kara</strong></a>: Chop chop on the other answers! I have more questions! If you want to use this method, quicker!</p> <p><a href="https://twitter.com/jack/status/1095465159202467842">Jack</a>: I’m moving as fast as I can Kara</p> <p><a href="https://twitter.com/karaswisher/status/1095465222951895040"><strong>Kara</strong></a>: Clip clop!</p> <hr> <p><p style="color:#808080"><strong>Kara</strong>: going to shift to biz questions since it is not a lot of time and this system is CHAOTIC (as I thought it would be): What about the move to DAU instead of MAU. Why the move? And how are we to interpret the much smaller numbers?</color></p> <p><a href="https://twitter.com/karaswisher/status/1095462909990309888"><strong>Kara</strong></a>: also: Is twitter still considering a subscription service? Like “Twitter Premium” or something?</p> <p><a href="https://twitter.com/jack/status/1095464073771114496">Jack</a>: Always going to experiment with new models. Periscope has super hearts, which allows us to learn about direct contribution. We’d need to figure out the value exchange on subscription. Has to be really high for us to charge directly</p> <hr> <p><p style="color:#808080">Jack: Totally. She’s mastering the medium</color></p> <p><a href="https://twitter.com/karaswisher/status/1095465790483116032"><strong>Kara</strong></a>: Ok, last ones are about you and we need to go long because your system here it confusing says the people of Twitter:</p> <ol> <li>What has been Twitter’s biggest missed opportunity since you came back as CEO?</li> </ol> <p><a href="https://twitter.com/jack/status/1095466230427742208">Jack</a>: Focus on conversation earlier. We took too long to get there. Too distracted.</p> <p><a href="https://twitter.com/karaswisher/status/1095466622956064770"><strong>Kara</strong></a>: By what? What is the #1 thing that distracted you and others and made this obvious mess via social media?</p> <p><a href="https://twitter.com/jack/status/1095467434440417282">Jack</a>: Tried to do too much at once. Wasn’t focused on what our one core strength was: conversation. That lead to really diluted strategy and approach. 
And a ton of reactiveness.</p> <p><a href="https://twitter.com/karaswisher/status/1095467977208668160"><strong>Kara</strong></a>: Speaking of that (CONVERSATION), let's do one with sounds soon, like this</p> <p><a href="https://www.youtube.com/watch?v=oiJkANps0Qw">https://www.youtube.com/watch?v=oiJkANps0Qw</a></p> <hr> <p><p style="color:#808080"><strong>Kara</strong>: She is well beyond mastering it. She speaks fluent Twitter.</color></p> <p><p style="color:#808080">Jack: True</color></p> <p><a href="https://twitter.com/karaswisher/status/1095465976626380801"><strong>Kara</strong></a>: Why are you still saying you’re the CEO of two publicly traded companies? What’s the point in insisting you can do two jobs that both require maximum effort at the same time?</p> <p><a href="https://twitter.com/jack/status/1095467046949642240">Jack</a>: I’m focused on building leadership in both. Not my desire or ambition to be CEO of multiple companies just for the sake of that. I’m doing everything I can to help both. Effort doesn’t come down to one person. It’s a team</p> <hr> <p><a href="https://twitter.com/karaswisher/status/1095467352395825152"><strong>Kara</strong></a>: LAST Q: For the love of God, please do Recode Decode podcast with me soon, because analog talking seems to be a better way of asking questions and giving answers. I think Twitter agrees and this has shown how hard this thread is to do. That said, thx for trying. Really.</p> <p><a href="https://twitter.com/jack/status/1095468327978205184">Jack</a>: This thread was hard. But we got to learn a ton to fix it. Need to make this feel a lot more cohesive and easier to follow. Was extremely challenging. Thank you for trying it with me. Know it wasn’t easy. Will consider different formats!</p> <p><a href="https://twitter.com/karaswisher/status/1095468672397832192"><strong>Kara</strong></a>: Make a glass house for events and people can watch and not throw stones. Pro tip: Twitter convos are wack</p> <p><a href="https://twitter.com/jack/status/1095469020508082177">Jack</a>: Yep. And they don’t have to be wack. Need to figure this out. This whole experience is a problem statement for what we need to fix</p> <hr> <p><p style="color:#808080">Jack: This thread was hard. But we got to learn a ton to fix it. Need to make this feel a lot more cohesive and easier to follow. Was extremely challenging. Thank you for trying it with me. Know it wasn’t easy. Will consider different formats!</color></p> <p><a href="https://twitter.com/karaswisher/status/1095469018310422529"><strong>Kara</strong></a>: My kid is hungry and says that you should do a real interview with me even if I am mean. Just saying.</p> <p><a href="https://twitter.com/jack/status/1095469279514750977">Jack</a>: I don’t think you’re mean. Always good to experiment.</p> <p><a href="https://twitter.com/karaswisher/status/1095469622130888705"><strong>Kara</strong></a>: Neither does my kid. He just wants to go get dinner</p> <p><a href="https://twitter.com/jack/status/1095469788439035905">Jack</a>: Go eat! Thanks, Kara</p> Jonathan Shapiro's Retrospective Thoughts on BitC bitc-retrospective/ Fri, 23 Mar 2012 00:00:00 +0000 bitc-retrospective/ <p><em>This is an archive of the Jonathan Shapiro's &quot;Retrospective Thoughts on BitC&quot; that seems to have disappeared from the internet; at the time, BitC was aimed at the same niche as Rust</em></p> <pre><code>Jonathan S. 
Shapiro shap at eros-os.org Fri Mar 23 15:06:41 PDT 2012 </code></pre> <p>By now it will be obvious to everyone that I have stopped work on BitC. An explanation of <em>why</em> seems long overdue.</p> <p>One answer is that work on Coyotos stopped when I joined Microsoft, and the work that I am focused on <em>now</em> doesn't really require (or seem to benefit from) BitC. As we all know, there is only so much time to go around in our lives. But that alone wouldn't have stopped me entirely.</p> <p>A second answer is that BitC isn't going to work in its current form. I had hit a short list of issues that required a complete re-design of the language and type system followed by a ground-up new implementation. Experience with the first implementation suggested that this would take quite a while, and it was simply more than I could afford to take on without external support and funding. Programming language work is not easy to fund.</p> <p>But the third answer may be of greatest interest, which is that I no longer believe that type classes &quot;work&quot; in their current form from the standpoint of language design. That's the only important science lesson here.</p> <p>In the large, there were four sticking points for the current design:</p> <ol> <li>The compilation model.</li> <li>The insufficiency of the current type system w.r.t. by-reference and reference types.</li> <li>The absence of some form of inheritance.</li> <li>The instance coherence problem.</li> </ol> <p>The first two issues are in my opinion solvable, though the second requires a nearly complete re-implementation of the compiler. The last (instance coherence) does not appear to admit any general solution, and it raises conceptual concerns about the use of type classes for method overload in my mind. It's sufficiently important that I'm going to deal with the first three topics here and take up the last as a separate note.</p> <p>Inheritance is something that people on the BitC list might (and sometimes have) argue about strongly. So a few brief words on the subject may be relevant.</p> <h3 id="prefacing-comments-on-objects-inheritance-and-purity">Prefacing Comments on Objects, Inheritance, and Purity</h3> <p>BitC was initially designed as an [imperative] functional language because of our focus on software verification. Specification of the typing and semantics of functional languages is an area that has a <em>lot</em> of people working on it. We (as a field) <em>kind</em> of know how to do it, and it was an area where our group at Hopkins didn't know very much when we started. Software verification is a known-hard problem, doing it over an imperative language was already a challenge, and this didn't seem like a good place for a group of novice language researchers to buck the current trends in the field. Better, it seemed, to choose our battles. We knew that there were interactions between inheritance and inference, and it <em>appeared</em> that type classes with clever compilation could achieve much of the same operational results. I therefore decided early <em>not</em> to include inheritance in the language.</p> <p>To me, as a programmer, the removal of inheritance and objects was a very reluctant decision, because it sacrificed any possibility of transcoding the large body of existing C++ code into a safer language. And as it turns out, you can't really remove the underlying semantic challenges from a successful systems language. A systems language <em>requires</em> some mechanism for existential encapsulation. 
The <em>mechanism</em> which embodies that encapsulation isn't really the issue; once you introduce that sort of encapsulation, you bring into play most of the verification issues that objects with subtyping bring into play, and once you do that, you might as well gain the benefit of objects. The remaining issue, in essence, is the modeling of the Self type, and for a range of reasons it's fairly essential to have a Self type in a systems language once you introduce encapsulation. So you end up pushed in to an object type system at some point in <em>any</em> case. With the benefit of eight years of hindsight, I can now say that this is perfectly obvious!</p> <p>I'm strongly of the opinion that multiple inheritance is a mess. The argument pro or con about single inheritance still seems to me to be largely a matter of religion. Inheritance and virtual methods certainly aren't the <em>only</em> way to do encapsulation, and they may or may not be the best primitive mechanism. I have always been more interested in getting a large body of software into a safe, high-performance language than I am in innovating in this area of language design. If transcoding current code is any sort of goal, we need something very similar to inheritance.</p> <p>The last reason we left objects out of BitC initially was purity. I wanted to preserve a powerful, pure subset language - again to ease verification. The object languages that I knew about at the time were heavily stateful, and I couldn't envision how to do a non-imperative object-oriented language. Actually, I'm <em>still</em> not sure I can see how to do that practically for the kinds of applications that are of interest for BitC. But as our faith in the value of verification declined, my personal willingness to remain restricted by purity for the sake of verification decayed quickly.</p> <p>The <em>other</em> argument for a pure subset language has to do with advancing concurrency, but as I really started to dig in to concurrency support in BitC, I came increasingly to the view that this approach to concurrency isn't a good match for the type of concurrent problems that people are actually trying to solve, and that the needs and uses for non-mutable state in practice are a lot more nuanced than the pure programming approach can address. Pure subprograms clearly play an important role, but they aren't enough.</p> <p>And I <em>still</em> don't believe in monads. :-)</p> <h3 id="compilation-model">Compilation Model</h3> <p>One of the objectives for BitC was to obtain acceptable performance under a conventional, static separate compilation scheme. It may be short-sighted on my part, but complex optimizations at run-time make me very nervous from the standpoint of robustness and assurance. I understand that bytecode virtual machines today do very aggressive optimizations with considerable success, but there are a number of concerns with this:</p> <ul> <li>For a robust language, we want to <em>minimize</em> the size and complexity of the code that is exempted from type checking and [eventual] verification. Run-time code is excepted in this fashion. The garbage collector taken alone is already large enough to justify assurance concerns. Adding a large and complex optimizer to the pile drags the credibility of the assurance story down immeasurably.</li> <li>Run-time optimization has very negative consequences for startup times, especially in the context of transaction processing. Lots of hard data on this from IBM (in DB/2) and others. 
It is one of the reasons that &quot;Java in the database&quot; never took hold. As the frequency of process and component instantiation in a system rises, startup delays become more and more of a concern. Robust systems don't recycle subsystems.</li> <li>Run-time optimization adds a <em>huge</em> amount of space overhead to the run-time environment of the application. While the <em>code</em> of the run-time compiler can be shared, the <em>state</em> of the run-time compiler cannot, and there is quite a lot of that state.</li> <li>Run-time optimization - especially when it is done &quot;on demand&quot; - introduces both variance and unpredictability into performance numbers. For some of the applications that are of interest to me, I need &quot;steady state&quot; performance. If the code is getting optimized on the fly such that it improves by even a modest constant factor, real-time scheduling starts to be a very puzzling challenge.</li> <li>Code that is <em>produced</em> by a run-time optimizer is difficult to share across address spaces, though this probably isn't solved very well by <em>other</em> compilation approaches either.</li> <li>If run-time optimization is present, applications will come to rely on it for performance. That is: for social reasons, &quot;optional&quot; run-time optimization tends to quickly become required.</li> </ul> <p>To be clear, I'm <em>not</em> opposed to continuous compilation. I actually think it's a good idea, and I think that there are some fairly compelling use-cases. I <em>do</em> think that the run-time optimizer should be implemented in a strongly typed, safe language. I <em>also</em> think that it took an awfully long time for the hotspot technology to stabilize, and that needs to be taken as a cautionary tale. It's also likely that many of the problems/concerns that I have enumerated can be solved - but probably not <em>soon</em>. For the applications that are most important to me, the concerns about assurance are primary. So from a language design standpoint, I'm delighted to exploit continuous compilation, but I don't want to design a language that <em>requires</em> continuous compilation in order to achieve reasonable baseline performance.</p> <p>The optimizer complexity issue, of course, can be raised just as seriously for conventional compilers. You are going to optimize <em>somewhere</em>. But my experience with dynamic translation tells me that it's a lot easier to do (and to reason about) one thing at a time. Once we have a high-confidence optimizer in a safe language, <em>then</em> it may make sense to talk about integrating it into the run-time in a high-confidence system. Until then, separation of concerns should be the watch-word of the day.</p> <p>Now strictly speaking, it should be said that run-time compilation actually isn't necessary for BitC, or for any other bytecode language. Run-time compilation doesn't become necessary until you combine run-time loading with compiler-abstracted representations (see below) and allow types having abstracted representation to appear in the signatures of run-time loaded libraries. Until then it is possible to maintain a proper phase separation between code generation and execution. Read on - I'll explain some of that below.</p> <p>In any case, I knew going in that strongly abstracted types would raise concerns on this issue, and I initially adopted the following view:</p> <ul> <li>Things like kernels can be whole-program compiled. 
This effectively eliminates the run-time optimizer requirement.</li> <li>Things like critical system components want to be statically linked anyway, so they can <em>also</em> be dealt with as whole-program compilation problems.</li> <li>For everything else, I hoped to adopt a kind of &quot;template expansion&quot; approach to run-time compilation. This wouldn't undertake the full complexity of an optimizer; it would merely extend run-time linking and loading to incorporate span and offset resolution. It's still a lot of code, but it's not horribly <em>complex</em> code, and it's the kind of thing that lends itself to rigorous - or even formal - specification.</li> </ul> <p>It took several years for me to realize that the template expansion idea wasn't going to produce acceptable baseline performance. The problem lies in the interaction between abstract types, operator overloading, and inlining.</p> <h3 id="compiler-abstracted-representations-vs-optimization">Compiler-Abstracted Representations vs. Optimization</h3> <p>Types have representations. This sometimes seems to make certain members of the PL community a bit uncomfortable. A thing to be held at arms length. Very much like a zip-lock bag full of dog poo (insert cartoon here). From the perspective of a systems person, I regret to report that where the bits are placed, how big they are, and their assemblage actually <em>does</em> matter. If you happen to be a dog owner, you'll note that the &quot;bits as dog poo&quot; analogy is holding up well here. It seems to be the lot of us systems people to wade daily through the plumbing of computational systems, so perhaps that shouldn't be a surprise. Ahem.</p> <p>In any case, the PL community set representation issues aside in order to study type issues first. I don't think that pragmatics was forgotten, but I think it's fair to say that representation issues are not a focus in current, mainstream PL research. There is even a school of thought that views representation as a fairly yucky matter that should be handled in the compiler &quot;by magic&quot;, and that imperative operations should be handled that way too. For systems code that approach doesn't work, because a lot of the representations and layouts we need to deal with are dictated to us by the hardware.</p> <p>In any case, types <em>do</em> have representations, and knowledge of those representations is utterly essential for even the simplest compiler optimizations. So we need to be a bit careful not to abstract types <em>too</em> successfully, lest we manage to break the compilation model.</p> <p>In C, the &quot;+&quot; operator is primitive, and the compiler can always select the appropriate opcode directly. Similarly for other &quot;core&quot; arithmetic operations. Now try a thought experiment: suppose we take every use of such core operations in a program and replace each one with a functionally equivalent procedure call to a runtime-implemented intrinsic. You only have to do this for <em>user</em> operations - addition introduced by the compiler to perform things like address arithmetic is always done on concrete types, so those can still be generated efficiently. But even though it is only done for user operations, this would clearly harm the performance of the program quite a lot. You <em>can</em> recover that performance with a run-time optimizer, but it's complicated.</p> <p>In C++, the &quot;+&quot; operator can be overloaded. 
But (1) the bindings for primitive types cannot be replaced, (2) we know, statically, what the bindings and representations <em>are</em> for the other types, and (3) we can control, by means of inlining, which of those operations entail a procedure call at run time. I'm not trying to suggest that we want to be forced to control that manually. The key point is that the compiler has enough visibility into the implementation of the operation that it is possible to inline the primitive operators (and many others) at static compile time.</p> <p>Why is this possible in C++, but not in BitC?</p> <p>In C++, the instantiation of an abstract type (a template) occurs in an environment where complete knowledge of the representations involved is visible to the compiler. That information may not all be in scope to the programmer, but the compiler can chase across the scopes, find all of the pieces, assemble them together, and understand their shapes. This is what induces the &quot;explicit instantiation&quot; model of C++. It also causes a lot of &quot;internal&quot; type declarations and implementation code to migrate into header files, which tends to constrain the use of templates and increase the number of header file lines processed for each compilation unit - we measured this at one point on a very early (pre templates) C++ product and found that we processed more than 150 header lines for each &quot;source&quot; line. The ratio has grown since then by at least a factor of ten, and (because of templates) quite likely 20.</p> <p>It's all rather a pain in the ass, but it's what makes static-compile-time template expansion possible. From the <em>compiler</em> perspective, the types involved (and more importantly, the representations) aren't abstracted at all. In BitC, <em>both</em> of these things <em>are</em> abstracted at static compile time. It isn't until link time that all of the representations are in hand.</p> <p>Now as I said above, we can imagine extending the linkage model to deal with this. All of that header file information is supplied to deal with * representation* issues, not type checking. Representation, in the end, comes down to sizes, alignments, and offsets. Even if we don't know the concrete values, we <em>do</em> know that all of those are compile-time constants, and that the results we need to compute at compile time are entirely formed by sums and multiples of these constants. We could imagine dealing with these as <em>opaque</em> constants at static compile time, and filling in the blanks at link time. Which is more or less what I had in mind by link-time template expansion. Conceptually: leave all the offsets and sizes &quot;blank&quot;, and rely on the linker to fill them in, much in the way that it handles relocation.</p> <p>The problem with this approach is that it removes key information that is needed for optimization and registerization, and it doesn't support inlining. In BitC, we can <em>and do</em> extend this kind of instantiation all the way down to the primitive operators! And perhaps more importantly, to primitive accessors and mutators. The reason is that we want to be able to write expressions like &quot;a + b&quot; and say &quot;that expression is well-typed provided there is an appropriate resolution for +:('a,'a)-&gt;'a&quot;. Which is a fine way to <em>type</em> the operation, but it leaves the representation of 'a fully abstracted. Which means that we cannot see when they are primitive types. 
Which means that we are <em>exactly</em> (or all too often, in any case) left in the position of generating <em>all</em> user-originated &quot;+&quot; operations as procedure calls. Now surprisingly, that's actually not the end of the world. We can imagine inventing some form of &quot;high-level assembler&quot; that our static code generator knows how to translate into machine code. If the static code generator does this, the run-time loader can be handed responsibility for emitting procedure calls, and can substitute intrinsic calls at appropriate points. Which would cause us to lose code sharing, but that might be tolerable on non-embedded targets.</p> <p>Unfortunately, this kind of high-level assembler has some fairly nasty implications for optimization: First, we no longer have any idea what the * cost* of the &quot;+&quot; operator is for optimization purposes. We don't know how many cycles that particular use of + will take, but more importantly, we don't know how many bytes of code it will emit. And without that information there is a very long list of optimization decisions that we can no longer make at static compile time. Second, we no longer have enough information at static code generation time to perform a long list of <em>basic</em> register and storage optimizations, because we don't know which procedure calls are actually going to use registers.</p> <p>That creaking and groaning noise that you are hearing is the run-time code generator gaining weight and losing reliability as it grows. While the impact of this mechanism actually wouldn't be as bad as I am sketching - because a lot of user types <em>aren't</em> abstract - the <em>complexity</em> of the mechanism really is as bad as I am proposing. In effect we end up deferring code generation and optimization to link time. That's an idea that goes back (at least) to David Wall's work on link time register optimization in the mid-1980s. It's been explored in many variants since then. It's a compelling idea, but it has pros and cons.</p> <p>What is going on here is that types in BitC are <em>too</em> successfully abstracted for static compilation. The result is a rather <em>large</em> bag of poo, so perhaps the PL people are on to something.:-)</p> <h3 id="two-solutions">Two Solutions</h3> <ul> <li>The most obvious solution - adopted by C++ - is to redesign the language so that representation issues are not hidden from the compiler. That's actually a solution that is worth considering. The problem in C++ isn't so much the number of header file lines per source line as it is the fact that the C preprocessor requires us to process those lines <em>de novo</em> for each compilation unit. BitC lacks (intentionally) anything comparable to the C preprocessor.</li> <li>The other possibility is to shift to what might be labeled &quot;install time compilation&quot;. Ship some form of byte code, and do a static compilation at install time. This gets you back all of the code sharing and optimization that you might reasonably have expected from the classical compilation approach, it opens up some interesting design point options from a systems perspective, and (with care) it can be retrofitted to existing systems. There are platforms today (notably cell phones) where we basically do this already.</li> </ul> <p>The design point that you don't want to cross here is dynamic <em>loading</em> where the loaded interface carries a type with an abstracted representation. 
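<p>C++ runs into a milder version of the same wall: a template shipped only as compiled code cannot be instantiated at types its author never saw, because the template body and the client's representations are never visible in the same place. A small sketch of the distinction (illustrative C++, not BitC code):</p> <pre><code>// If the template body is shipped as source (e.g. in a header), any client
// can instantiate sum&lt;T&gt; at its own types: the compiler sees T's
// representation and can resolve and inline &quot;+&quot; statically.
template &lt;typename T&gt;
T sum(T a, T b) { return a + b; }

// If only a compiled library is shipped, the usable instantiations are
// frozen at whatever the author generated explicitly:
template int sum&lt;int&gt;(int, int);
template double sum&lt;double&gt;(double, double);

// A client holding just the binary can call those two, but sum over a
// client-defined type cannot be generated by anyone: the client lacks the
// body, the library lacks the representation.
</code></pre> <p>Dynamic loading of an interface that carries an abstracted type is that second situation, with no header to fall back on.</p>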
At that point you are effectively committing yourself to run-time code generation<em>,</em> though I do have some ideas on how to mitigate that.</p> <h3 id="conclusion-concerning-compilation-model">Conclusion Concerning Compilation Model</h3> <p><strong>If static, separate compilation is a requirement, it becomes necessary for the compiler to see into the source code across module boundaries whenever an abstract type is used. That is: any procedure having abstract type must have an exposed source-level implementation.</strong></p> <p><strong>The practical alternative is a high-level intermediate form coupled with install-time or run-time code generation. That is certainly feasible, but it's more that I felt I could undertake.</strong></p> <p>That's all manageable and doable. Unfortunately, it isn't the path we had taken on, so it basically meant starting over.</p> <h3 id="insufficiency-of-the-type-system">Insufficiency of the Type System</h3> <p>At a certain point we had enough of BitC working to start building library code. It may not surprise you that the first thing we set out to do in the library was IO. We found that we couldn't handle typed input within the type system. Why not?</p> <p>Even if you are prepared to do dynamic allocation within the IO library, there is a level of abstraction at which you need to implement an operation that amounts to &quot;inputStream.read(someObject: ByRef mutable 'a)&quot; There are a couple of variations on this, but the point is that you want the ability at some point to move the incoming bytes into previously allocated storage. So far so good.</p> <p>Unfortunately, in an effort to limit creeping featurism in the type system, I had declared (unwisely, as it turned out) that the only place we needed to deal with ByRef types was at parameters. Swaroop took this statement a bit more literally than I intended. He noticed that if this is <em>really</em> the only place where ByRef needs to be handled, then you can internally treat &quot;ByRef 'a&quot; as 'a, merely keeping a marker on the parameter's identifier record to indicate that an extra dereference is required at code generation time. Which is actually quite clever, except that it doesn't extend well to signature matching between type classes and their instances. Since the argument type for <em>read</em> is <em>ByRef 'a</em>, InputStream is such a type class.</p> <p>So now we were faced with a couple of issues. The first was that we needed to make ByRef 'a a first-class type within the compiler so that we could unify it, and the second was that we needed to deal with the implicit coercion issues that this would entail. That is: conversion back and forth between ByRef 'a and 'a at copy boundaries. The coercion part wasn't so bad; ByRef is never inferred, and the type coercions associated with ByRef happen in exactly the same places that const/mutable coercions happen. We already had a cleanly isolated place in the type checker to deal with that.</p> <p>But even if ByRef isn't inferred, it can propagate through the code by unification. And <em>that</em> causes safety violations! The fact that ByRef was syntactically restricted to appear only at parameters had the (intentional) consequence of ensuring that safety restrictions associated with the lifespan of references into the stack were honored - that was why I had originally imposed the restriction that ByRef could appear only at parameters. 
Once the ByRef type can unify, the syntactic restriction no longer guarantees the enforcement of the lifespan restriction. To see why, consider what happens in:</p> <pre><code> define byrefID(x:ByRef 'a) { return x; } </code></pre> <p>Something that is <em>supposed</em> to be a downward-only reference ends up getting returned up the stack. Swaroop's solution was clever, in part, because it silently prevented this propagation problem. In some sense, his implementation doesn't really treat ByRef as a type, so it can't propagate. But *because *he didn't treat it as a type, we also couldn't do the necessary matching check between instances and type classes.</p> <p>It turns out that being able to do this is <em>useful</em>. The essential requirement of an abstract mutable &quot;property&quot; (in the C# sense) is that we have the ability within the language to construct a function that returns the <em>location</em> of the thing to be mutated. That location will often be on the stack, so returning the location is <em>exactly</em> like the example above. The &quot;ByRef only at parameters&quot; restriction is actually very conservative, and we knew that it was preventing certain kinds of things that we eventually wanted to do. We had a vague notion that we would come back and fix that at a later time by introducing region types.</p> <p>As it turned out, &quot;later&quot; had to be &quot;now&quot;, because region types are the right way to re-instate lifetime safety when ByRef types become first class. But <em>adding</em> region types presented two problems (which is why we had hoped to defer them):</p> <ul> <li>Adding region types meant rewriting the type checker and re-verifying the soundness and completeness of the inference algorithm, <em>and</em></li> <li>It wasn't just a re-write. Regions introduce subtyping. Subtyping and polymorphism don't get along, so we would need to go back and do a lot of study.</li> </ul> <p>Region polymorphism with region subtyping had certainly been done before, but we were looking at subtyping in another case too (below). That was pushing us toward a kinding system and a different type system.</p> <p>So to fix the ByRef problem, we very nearly needed to re-design both the type system and the compiler from scratch. Given the accumulation of cruft in the compiler, that might have been a good thing in any case, but Swaroop was now full-time at Microsoft, and I didn't have the time or the resources to tackle this by myself.</p> <h3 id="conclusion-concerning-the-type-system">Conclusion Concerning the Type System</h3> <p><strong>In retrospect, it's hard to imagine a strongly typed imperative language that doesn't type locations in a first-class way. If the language simultaneously supports explicit unboxing, it is effectively forced to deal with location lifespan and escape issues, which makes memory region typing of some form almost unavoidable.</strong></p> <p><strong>For this reason alone, even if for no other, the type system of an imperative language with unboxing must incorporate some form of subtyping. To ensure termination, this places some constraints on the use of type inference. On the bright side, once you introduce subtyping you are able to do quite a number of useful things in the language that are hard to do without it.</strong></p> <h3 id="inheritance-and-encapsulation">Inheritance and Encapsulation</h3> <p>Our first run-in with inheritance actually showed up in the compiler itself. 
In spite of our best efforts, the C++ implementation of the BitC compiler had not entirely avoided inheritance, so it didn't have a direct translation into BitC. And even if we changed the code of the compiler, there are a large number of third-party libraries that we would like to be able to transcode. A good many of those rely on [single] inheritance. Without having at least some form of interface (type) inheritance, we can't really even do a good job interfacing to those libraries as foreign objects.</p> <p>The compiler aside, we also needed a mechanism for encapsulation. I had been playing with &quot;capsules&quot;, but it soon became clear that capsules were really a degenerate form of subclassing, and that trying to duck that issue wasn't going to get me anywhere.</p> <p>I could nearly imagine getting what I needed by adding &quot;ThisType&quot; and inherited <em>interfaces</em>. But the combination of those two features introduces subtyping. In fact, the combination is equivalent (from a type system perspective) to single-inheritance subclassing.</p> <p>And the more I stared at interfaces, the more I started to ask myself why an interface wasn't just a type class. <em>That</em> brought me up against the instance coherence problem from a new direction, which was already making my head hurt. It also brought me to the realization that Interfaces work, in part, because they are always parameterized over a single type (the ThisType) - once you know that one, the bindings for all of the others are determined by type constructors or by explicit specification.</p> <p>And introducing SelfType was an even bigger issue than introducing subtypes. It means moving out of System F&lt;: entirely, and into the object type system of Cardelli <em>et al</em>. That wasn't just a matter of re-implementing the type checker to support a variant of the type system we already had. It meant re-formalizing the type system entirely, and learning how to think in a different model.</p> <p>Doable, but not within the time frame or the compiler framework that we had built. At this point, I decided that I needed to start over. We had learned a lot from the various parts of the BitC effort, but sometimes you have to take a step back before you can take more steps forward.</p> <h3 id="instance-coherence-and-operator-overloading">Instance Coherence and Operator Overloading</h3> <p>BitC largely borrows its type classes from Haskell. Type classes aren't just a basis for type qualifiers; they provide the mechanism for <em>ad hoc</em> polymorphism. A feature which, language purists notwithstanding, real languages actually do need.</p> <p>The problem is that there can be multiple type class instances for a given type class at a given type. So it is possible to end up with a function like:</p> <pre><code>define f(x : 'x) {
  ...
  a:int32 + b    // typing fully resolved at static compile time
  return x + x   // typing not resolvable until instantiation
}
</code></pre> <p>Problem: we don't know which instance of &quot;+&quot; to use when 'x instantiates to <em>int32</em>. In order for &quot;+&quot; to be meaningful in a+b, we need a static-compile-time resolution for +:(int32, int32)-&gt;int32. And we get that from Arith(int32). So far so good. But if 'x is instantiated to <em>int32</em>, we will get a type class instance supplied by the caller. 
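<p>To see the hazard concretely, imagine the instances written out as the dictionaries a type-class compiler passes around implicitly (an illustrative C++-flavored sketch with invented names, not BitC): one dictionary is resolved where f is compiled, the other is whatever the caller hands in, and nothing forces the two to agree.</p> <pre><code>// Illustrative only: Arith(int32) instances modeled as explicit dictionaries.
#include &lt;cstdio&gt;
#include &lt;limits&gt;

struct ArithInt32 { int (*add)(int, int); };

int add_plain(int a, int b) { return a + b; }
int add_wrapping(int a, int b) { return (int)((unsigned)a + (unsigned)b); }

// Two reasonable-looking instances for the same type.
const ArithInt32 arith_a = { add_plain };
const ArithInt32 arith_b = { add_wrapping };

int f(int x, const ArithInt32&amp; inst) {
    int a = 1, b = 2;
    int here  = arith_a.add(a, b);  // &quot;a + b&quot;: resolved where f was compiled
    int there = inst.add(x, x);     // &quot;x + x&quot;: resolved by f's caller
    return here + there;
}

int main() {
    std::printf(&quot;%d\n&quot;, f(20, arith_a));                              // both resolutions agree: 43
    std::printf(&quot;%d\n&quot;, f(std::numeric_limits&lt;int&gt;::max(), arith_b)); // they disagree: 1
}
</code></pre>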
The problem is that there is no way to guarantee that this is the <em>same</em> instance of Arith(int32) that we saw before.</p> <p>The solution in Haskell is to impose the <em>ad hoc</em> rule that you can only instantiate a type class once for each unique type tuple in a given application. This is similar to what is done in C++: you can only have one overload of a given global operator at a particular type. If there is more than one overload at that type, you get a link-time failure. This restriction is tolerable in C++ largely because operator overloading is so limited:</p> <ol> <li>The set of overloadable operators is small and non-extensible.</li> <li>Most of them can be handled satisfactorily as methods, which makes their resolution unambiguous.</li> <li>Most of the ones that <em>can't</em> be handled as methods are arithmetic operations, and there are practical limits to how much people want to extend those.</li> <li>The remaining highly overloaded global operators are associated with I/O. These <em>could</em> be methods in a suitably polymorphic language.</li> </ol> <p>In languages (like BitC) that enable richer use of operator overloading, it seems unlikely that these properties would suffice.</p> <p>But in Haskell and BitC, overloading is extended to <em>type properties</em> as well. For example, there is a type class &quot;Ord 'a&quot;, which states whether a type 'a admits an ordering. Problem: most types that admit ordering admit more than one! The fact that we know an ordering <em>exists</em> really isn't enough to tell us which ordering to <em>use</em>. And we can't introduce <em>two</em> orderings for 'a in Haskell or BitC without creating an instance coherence problem. And in the end, the instance coherence problem exists because the language design performs method resolution in what amounts to a non-scoped way.</p> <p>But if nothing else, you can hopefully see that the heavier use of overloading in BitC and Haskell places much higher pressure on the &quot;single instance&quot; rule. Enough so, in my opinion, to make that rule untenable. And coming from the capability world, I have a strong allergy to things that smell like ambient authority.</p> <p>Now we can get past this issue, up to a point, by imposing an arbitrary restriction on where (which compilation unit) an instance can legally be defined. But as with the &quot;excessively abstract types&quot; issue, we seemed to keep tripping on type class issues. There are other problems as well when multi-variable type classes get into the picture.</p> <p>At the end of the day, type classes just don't seem to work out very well as a mechanism for overload resolution without some other form of support.</p> <p>A second problem with type classes is that you can't resolve operators at static compile time. And if instances are explicitly named, references to instances have a way of turning into first-class values. At that point the operator reference can no longer be statically resolved at all, and we have effectively re-invented operator methods!</p> <h3 id="conclusion-about-type-classes-and-overloading">Conclusion about Type Classes and Overloading:</h3> <p><strong>The type class notion (more precisely: qualified types) is seductive, but absent a reasonable approach for instance coherence and lexical resolution it provides an unsatisfactory basis for operator overloading. There is a disturbingly close relationship between type class instances and object instances that needs further exploration by the PL community. 
The important distinction may be pragmatic rather than conceptual: type class instances are compile-time constants while object instances are run-time values. This has no major consequences for typing, but it leads to significant differences w.r.t. naming, binding, and [human] conceptualization.</strong></p> <p><strong>There are unresolved formal issues that remain with multi-parameter type classes. Many of these appear to have natural practical solutions in a polymorphic object type system, but concerns of implementation motivate kinding distinctions between boxed and unboxed types that are fairly unsatisfactory.</strong></p> <h3 id="wrapping-up">Wrapping Up</h3> <p>The current outcome is <em>extremely</em> frustrating. While the blind spots here were real, we were driven by the requirements of the academic research community to spend nearly three years finding a way to do complete inference over mutability. That was an enormous effort, and it delayed our recognition that we were sitting on the wrong kind of underlying type system entirely. While I continue to think that there is some value in mutability inference, I think it's a shame that a fairly insignificant wart in the original inference mechanism managed to prevent larger-scale success in the overall project for what amount to <em>political</em> reasons. If not for that distraction, I think we would probably have learned enough about the I/O and the instance coherency issues to have moved to a different type system while we still had a group to do it with, and we would have a working and useful language today.</p> <p>The distractions of academia aside, it is fair to ask why we weren't building small &quot;concept test&quot; programs as a sanity check of our design. There are a number of answers, none very satisfactory:</p> <ul> <li>Research languages can adopt simplifications on primitive types (notably integers) that systems languages cannot. That's what pushed us into type classes in the first place; we knew that polymorphism over unboxed types hadn't seen a lot of attention in the literature, and we knew that mutability inference had never been done. We had limited manpower, so we chose to focus on those issues first.</li> <li>We knew that parametric polymorphism and subtyping didn't get along, so we wanted to avoid that combination. Unfortunately, we avoided subtypes too well for too long, and they turned out to be something unavoidable.</li> <li>For the first several years, we were very concerned with software verification, which <em>also</em> drove us strongly away from object-based languages and subtyping. That blinded us.</li> <li>Coming to language design as &quot;systems&quot; people, working in a department that lacked deep expertise and interest in type systems, there was an enormous amount of subject matter that we needed to learn. Some of the reasons for our failure are &quot;obvious&quot; to people in the PL community, but others are not. Our desire for a &quot;systems&quot; language drove us to explore the space in a different way and with different priorities than are common in the PL community.</li> </ul> <p>I think we did make some interesting contributions. We now know how to do (that is: to implement) polymorphism over unboxed types with significant code sharing, and we understand how to deal with inferred mutability. Both of those are going to be very useful down the road. 
We have also learned a great deal about advanced type systems.</p> <p>In any case, BitC in its current form clearly needs to be set aside and re-worked. I have a fairly clear notion about how I would approach <em>continuing</em> this work, but that's going to have to wait until someone is willing to pay for all this.</p> Are closed social networks inevitable? open-social-networks/ Fri, 01 Jan 2010 00:00:00 +0000 open-social-networks/ <p><em>This is an archive of an old Google Buzz conversation (circa 2010?) on a variety of topics, including whether or not it's inevitable that a closed platform will dominate social.</em></p> <p><strong>Piaw</strong>: Social networks will be dominated primarily by network effects. That means in the long run, Facebook will dominate all the others.</p> <p><strong>Rebecca</strong>: ... which is also why no one company should dominate it. &quot;The Social Graph&quot; and its associated apps should be like the internet, distributed and not confined to one company's servers. I wish the narrative surrounding this battle was centered around this idea, and not around the whole Silicon Valley &quot;who is the most genius innovator&quot; self-aggrandizing unreality field. Thank god Tim Berners Lee wasn't from Silicon Valley, or we wouldn't have the Internet the way we know it in the first place.</p> <p>I suppose I shouldn't be being so snarky, revealing how much I hate your narratives sometimes. But I think for once this isn't, as it usually is, merely harmlessly cute and endearing - you all collectively screwing up something actually important, and I'm annoyed.</p> <p><strong>Piaw</strong>: The way network effects work, one company will control it. It's inevitable.</p> <p><strong>Rebecca</strong>: No it is not inevitable! What is inevitable is either that one company controls it or that no company controls it. If you guys had been writing the narrative of the invention of the internet you would have been arguing that it was inevitable that the entire internet live on one companies servers, brokered by hidden proprietary protocols. And obviously that's just nuts.</p> <p><strong>Piaw</strong>: I see, the social graph would be collectively owned. That makes sense, but I don't see why Facebook would have an incentive to make that happen.</p> <p><strong>Rebecca</strong>: Of course not! That's why I'm biting my fingernails hoping for some other company to be the white knight and ride up and save the day, liberating the social graph (or more precisely, the APIs of the apps that live on top of them) from any hope of control by a single company. Of course, there isn't a huge incentive for any other company to do it either --- the other companies are likely to just gaze enviously at Facebook and wish they got there first. Tim Berners Lee may have done great stuff for the world, but he didn't get rich or return massive value to shareholders, so the narrative of the value he created isn't included in the standard corporate hype machine or incentives.</p> <p>Google is the only company with the right position, somewhat appropriate incentives, and possibly the right internal culture to be the &quot;Tim Berners Lee&quot; of the new social internet. 
That's what I was hoping for, and I'm am more than a bit bummed they don't seem to be stepping up to the plate in an effective way in this case, especially since they are doing such a fabulous job in an analogous role with Android.</p> <p><strong>Rebecca</strong>: There is a worldview lurking behind these comments, which perhaps I should try to explain. I'm been nervous about this because it contains some strange ideas, but I'm wondering what you think.</p> <p>Here's a very strange assertion: Mark Zuckerberg is not a capitalist, and therefore should not be judged by capitalist logic. Before you dismiss me as nuts, stop and think for a minute. What is the essential property that makes someone a capitalist?</p> <p>For instance, when Nike goes to Indonesia and sets up sweatshops there, and if communists, unhappy with low pay &amp; terrible conditions, threaten to rebel, they are told &quot;this is capitalism, and however noxious it is, capitalism will make us rich, so shut up and hold your peace!&quot; What are they really saying? Nike brings poor people sewing machines and distribution networks so they can make and sell things they could not make and sell otherwise, so they become more productive and therefore richer. The productive capacity is scarce, and Nike is bringing to Indonesia a piece of this scarce resource (and taking it away from other people, like American workers.) So Indonesia gets richer, even if sweatshop workers suffer for a while.</p> <p>So is Mark Zuckerberg bringing to American workers a piece of scarce productive capacity, and therefore making American workers richer? It is true he is creating for people productive capacity they did not have before --- the possibility of writing social apps, like social games. This is &quot;innovation&quot; and it does make us richer.</p> <p>But it is not wealth that follows the rules of capitalist logic. In particular, this kind of wealth of productive capacity, unlike the wealth created by setting up sewing machines, does not have the kind of inherent scarcity that fundamentally drives capitalist logic. Nike can set up its sewing machines for Americans, or Indonesians, but not for everyone at once. But Tim Berners Lee is not forced to make such choices -- he can design protocols that allow everyone everywhere to produce new things, and he need not restrict how they choose to do it.</p> <p>But -- here's the key point -- though there is no natural scarcity, there may well be &quot;artificial&quot; scarcity. Microsoft can obfuscate Windows API's, and bind IE to Windows. Facebook can close the social graph, and force all apps to live on its servers. &quot;Capitalists&quot; like these can then extract rents from this artificial scarcity. They can use the emotional appeal of capitalist rhetoric to justify their rent-seeking. People are very conditioned to believe that when local companies get rich American workers in general will also get rich -- it works for Indonesia so why won't it work for us? And Facebook and Microsoft employees are getting richer. QED.</p> <p>But they aren't getting richer in the same way that sweatshop employees are getting richer. The sweatshop employees are becoming more productive than they would otherwise be, in spite of the noxious behavior of the capitalists. 
But if Zuckerberg or Gates behaves noxiously, by creating a walled garden, this may make his employees richer, in the sense of giving them more money, but &quot;more money&quot; is not the same as &quot;more wealth.&quot; More wealth means more productive capacity for workers, not more payout to individual employees. In a manufacturing economy those are linked, so people forget they are not the same.</p> <p>And in fact, shenanigans like these reduce rather than increase the productive capacity available to everyone, by creating an artificial scarcity of a kind of productive tool that need not be scarce at all, just for the purpose of extracting rents from them. No real wealth comes from this extraction. In aggregate it makes us poorer rather than richer.</p> <p>Here's where the kind of stunt that Google pulled with Android, that broke the iPhone's lock, even if it made Google no money, should be seen as the real generator of wealth, even if it is unclear whether it made any money for Google's shareholders. Wealth means I can build something I couldn't build before -- if I want I can install a Scheme or Haskell interpreter on an Android phone, which I am forbidden to put on the iPhone. It means a lot to me! Google's support of Firefox &amp; Chrome, which sped the adoption of open web standards and HTML5, also meant a huge amount to me. I'm an American worker, and I am made richer by Google, in the sense of having more productive capacity available to me, even if Google wasn't that great for my personal wealth in the sense of directly increasing my salary.</p> <p><strong>Rebecca</strong>: (That idea turned out to be sortof shockingly long comment by itself, and on review the last two paragraphs of the original comment were a slightly different thought, so I'm breaking them into a different section.)</p> <p>I'm upset that Google is getting a lot of anti-trust type flak, when I think the framework of anti-trust is just the wrong way to think. This battle isn't analogous to Roosevelt's big trust busting battles; it is much more like the much earlier battles at the beginning of the industrial revolution of the Yankee merchants against the old agricultural, aristocratic interests, which would have squelched industrialization. And Google is the company that has been most consistently on the side of really creating wealth, by not artificially limiting the productivity they make available for developers everywhere. Other companies, like Microsoft or Facebook, though they are &quot;new economy,&quot; though they are &quot;innovative,&quot; though they seem to generate a lot of &quot;wealth&quot; in the form of lots of money, really are much more like the old aristocrats rather than the scrappy new Yankees. In many ways they are slowing down the real revolution, not speeding it up.</p> <p>I've been reluctant to talk too much about these ideas, because I'm anxious about being called a raving commie. But I'm upset that Google is the target of misguided anti-trust logic, and it might be sufficiently weakened that it can't continue to be the bulwark of defense against the real &quot;new economy&quot; abuses that it has been for the last half-decade. That defense has meant a lot to independent developers, and I would hate to see it go away.</p> <p><strong>Phil</strong>: +100, Rebecca. 
It is striking how little traction the rhetoric of software freedom has here in Silicon Valley relative to pretty much everywhere else in the world.</p> <p><strong>Rebecca</strong>: Thanks - I worry whether my ultra-long comments are spam and it's good to hear if someone appreciates them. I have difficulty making my ideas short, but I'm using this Buzz conversation to practice.</p> <p>I'm not entirely happy with the way the &quot;software freedom&quot; crowd is pitching their message. I had an office down the hall from Richard Stallman for a while, and I was often harangued by him. However, I thought his message was too narrow and radicalized. But on the other hand, when I thought about it hard, I also realized that in many ways it was not radical enough...</p> <p>Why are we talking about freedom? To motivate this, I sometimes tell a little story about the past. When I was young my father read to me &quot;20,000 Leagues Under the Sea,&quot; advertising it as a great classic of futuristic science fiction. Unfortunately, I was unimpressed. It didn't seem &quot;futuristic&quot; at all: it seemed like an archaic fantasy. Why? Certainly it was impressive that an author in 1869 correctly predicted that people would ride in submarines under the sea. But it didn't seem like an image of the future, or even the past, at all. Why not? Because the person riding around on the submarine under the sea was a Victorian gentleman surrounded by appropriately deferential Victorian servants.</p> <p>Futurists consistently get their stories wrong in a particular way: when they say that technology changes the world, they tell stories of fabulous gadgets that will enable people to do new and exciting things. They completely miss that this is not really what &quot;change&quot; -- serious, massive, wrenching social change -- really is. When technology truly enables dreams of change, it doesn't mean it enables aristocrats to dream about riding around under the sea. What it means is that it enables the aristocrat's butler to dream of not being a butler any more --- a dream of freedom not through violence or revolution, but through economic independence. A dream of technological change -- really significant technological change -- is not a dream of spiffy gadgets, it is a dream of freedom, of social &amp; economic liberation enabled by technology.</p> <p>Let's go back to our Indonesian sweatshop worker. Even though in many ways the sweatshop job liberates her --- from backbreaking work on a farm, a garbage dump, or in brothels -- she is also quite enslaved. Why? She can sew, let us say, high-end basketball sneakers, which Nike sells for $150 apiece -- many, many times her monthly or even yearly wage. Why is she getting only a small cut of the profit from her labors? Because she is dependent on the productive capacity that Nike is providing to her, so as bad as the deal is, it is the best she can get.</p> <p>This is where new technology comes in. People talk about the information revolution as if it is about computers, or software, but I would really say it is about society figuring out (slowly) how to automate organization. We have learned to effectively automate manufacturing, but not all of the work of a modern economy is manufacturing. What is the service Nike provides that makes this woman dependent on such a painful deal? Some part of this service is the manufacturing capacity they provide -- the sewing machine -- but sewing machines are hardly expensive or unobtainable, even for poor people.
The much bigger deal is the organizational services Nike offers: all the branding, logistics, supply-chain management and retail services that go into getting a sneaker sewn in Indonesia into the hands of an eager consumer in America. One might argue that Nike is performing these services inefficiently, so even if our seamstress is effective and efficient, Nike must take an unreasonably large cut of the profits from the sale of the sneaker to support the rest of this inefficient, expensive, completely un-automated effort.</p> <p>That's where technological change comes in. Slowly, it is making it possible for all these organizational services to be made more automated, streamlined and efficient. This is really the business Google is in. It is said that Google is an &quot;advertising&quot; business, but to call what Google does &quot;advertising&quot; is to paper over the true story of the profound economic shift of which they are merely providing the opening volley.</p> <p>Consider the maker of custom conference tables who recently blogged in the New York Times about Adwords (<a href="http://boss.blogs.nytimes.com/2010/12/03/adwords-and-me-exploring-the-mystery/">http://boss.blogs.nytimes.com/2010/12/03/adwords-and-me-exploring-the-mystery/</a>). He said he paid Google $75,124.77 last year. What does that money represent -- what need is Google filling which is worth more than seventy thousand a year to this guy? You might say that they are capturing an advertising budget of a company, until you consider that without Google this company wouldn't exist at all. Before Google, did you regularly stumble across small businesses making custom conference tables? This is a new phenomenon! The right way to see it is that this seventy thousand isn't really a normal advertising budget -- instead, think of it as a chunk of the revenue of the generic conference table manufacturer that this person no longer has to work for. Because Google is providing for him the branding, customer management services, etc, etc that this old company used to be doing much less efficiently and creatively, this blogger has a chance to go into business for himself. He is paying Google seventy thousand a year for this privilege, but this is probably much less than the cut that was skimmed off the profits of his labors by his old management (not to mention issues of control and stifled creativity he is escaping). Google isn't selling &quot;advertising&quot;: Google is selling freedom. Google is selling to the workers of the world the chance to rid themselves of their chains -- nonviolently and without any revolutionary rhetoric -- but even without the rhetoric this service is still about economic liberation and social change.</p> <p>I feel strange when I hear Eric Schmidt talk about Google's plans for the future of their advertising business, because he seems to be telling Wall Street of a grand future where Google will capture a significant portion of Nike's advertising budget (with display ads and such). This seems like both an overambitious fantasy; and also strangely not nearly ambitious enough. For I think the real development of Google's business -- not today, not tomorrow, not next year, not even next decade, but eventually and inexorably (assuming Google survives the vicissitudes of fate and cultural decay) -- isn't that Google captures Nike's advertising budget. 
It is that Google captures a significant portion of Nike's entire revenue, paid to them by the workers who no longer have to work for Nike anymore, because Google or a successor in Google's tradition provides them with a much more efficient and flexible alternative vendor for the services Nike's management currently provides.</p> <p><strong>Rebecca</strong>: (Once again I looked at my comment and realized it was even more horrifyingly long. My thoughts seem short in my head, but even when I try to write them down as fast and effectively as I can, they aren't short anymore! Again, I saw the comment had two parts: first, explaining the basic idea of the &quot;freedom&quot; we are talking about, and second, tying it back into the context of our original discussion. So to try to be vaguely reasonable I am cutting it in two.)</p> <p>I suppose Eric Schmidt will never stand in front of Wall Street and say that. When it is really true that &quot;We will bury you!&quot; nobody ever stands up and bangs a shoe on the table while saying it. The architects of the old &quot;new economy&quot; didn't say such things either: the Yankee merchants never said to their aristocratic political rivals that they intended to eventually completely dismantle their social order. In 1780 there was no discussion that foretold the destructive violence of Sherman's march to the sea. I'm not sure they knew it themselves, and if they had been told that that was a possible result of their labors they might not have wanted to hear it. The new class wanted change, they wanted opportunity, they wanted freedom, but they did not want blood! That they would be cornered into seeking it anyway would have been as horrifying to them as to anyone else. Real change is not something anyone wants to crow about --- it is too terrifying.</p> <p>But it is nonetheless important to face, because in the short term this transformation is hardly inevitable or necessarily smooth. If our equivalent of Sherman's march to the sea might be in our future, we might want to think about how to manage or avoid it before it is too late.</p> <p>One major difficulty, as I explained in the last comment, is that while the &quot;automation of information,&quot; if developed properly, has the potential to break fundamental laws of the scarcity of productive capacity, and thereby free &quot;the workers of the world&quot;, nonetheless that potential can be captured, and turned into &quot;artificial&quot; scarcity, which doesn't set workers free, it further enslaves them. There is also a big incentive to do this, because it is the easiest way to make massive amounts of money quickly for a person in the right place at the right time.</p> <p>I see Microsoft as a company that has made a definite choice of corporate strategy to make money on &quot;artificial scarcity.&quot; I see Google as a company that has made a similar definite choice to make money &quot;selling freedom&quot;, specifically avoiding the tricks that create artificial scarcity, even when it doesn't help or even hurts their immediate business prospects.</p> <p>And Facebook? Where is Sheryl Sandberg (apparently the architect of business development at Facebook) on this crucial question? A hundred years from now, when all your &quot;genius&quot; and &quot;innovation,&quot; all the gadgets you build that so delight aristocrats, and are so celebrated by &quot;futurists&quot;, will be all but forgotten, the choices you make on this question will be remembered. This matters.</p> <p>Ms.
Sandberg seems to be similarly clear on her philosophy: she simply wants as much diversity of revenue streams for Facebook as she can possibly get. It is hard to imagine a more un-ideological antithesis of Richard Stallman. Freedom or scarcity, she doesn't care: if it's a way to make money, she wants it. As many different ones as possible! She wants it all! It's hard for me to get too worked up about this, especially since for other reasons I am rooting for Ms. Sandberg's success. Even so, I would prefer it if it were Google in control of this technological advance, because Google's preference on this question is so much more clear and unequivocal.</p> <p>I don't care who is the &quot;genius innovator&quot; and who is the &quot;big loser&quot;, whether this or that company has taken up the mantle of progress or not, who is going to get rich, which company will attract all the superstars, or all the other questions that seem to you such critical matters, but I do care that your work makes progress towards realizing the potential of your technology to empower the workers of the world, rather than slowing it down or blocking it. Since Google has made clear the most unequivocal preference in the right direction on this question, that means I want Google to win. This is too important to be trusted to the control of someone ambivalent about their values, no matter how much basic sympathy I have for the pragmatic attitude.</p> <p><strong>Baris Baser</strong>: +100! Liberate the social graph! I wish I could share the narrative taking place here on my buzz post, but I'll just plug it in.</p> <p><strong>Rob</strong>: Google SO dropped the ball with Orkut - they let Facebook run off with the Crown Jewels.</p> <p><strong>Helder Suzuki</strong>: I believe that Facebook's dominance will eventually be challenged just like credit card companies are being today. But I think it's gonna come much quicker for Facebook.</p> <p>There are lots of differences, but I like this comparison because credit card companies used strong network effects to dominate and shield the market from competition. If you look at them (Visa, Amex, Mastercard), all they have today is basically brand. Today we already know that &quot;credit card&quot; payments (and the margins) will be much different in the near future.</p> <p>Likewise, I don't think that the social graph will protect Facebook's &quot;market&quot; in the long run. Just like it's incredibly easier today to set up a POS network compared to a few years ago, the social graph is gonna be something trivial in the years to come.</p> <p><strong>Rebecca</strong>: Yay! People are reading my obscenely long and intellectual comments. Thanks guys!</p> <p><strong>Piaw</strong>: I disagree with Helder, even though I agree with Rebecca that it's better for Google to own the social graph. The magic trick that Facebook pulled off was getting the typical user to provide and upload all his/her personal information. It's incredibly hard to do that: Amazon couldn't do it, and neither could Google. I don't think it's one of those things that's technically difficult, but the social engineering required to do that demands critical mass. That's why I think that Facebook is (still) under-valued.</p> <p><strong>Rob</strong>: @Piaw - it was an accident of history I think. When Facebook started, they required a student ID to join.
This made a culture of &quot;real names&quot; that stuck, and that no one else has been able to replicate.</p> <p><strong>Piaw</strong>: @Rob: The accident of history that's difficult to replicate is what makes Facebook such a good authentication mechanism. I would be willing to not moderate my blog, for instance, if I could make all commenters disclose their true identity. The lowest quality arguments I've seen on Quora, for instance, were those in which one party was anonymous.</p> <p><strong>Elliotte Rusty Harold</strong>: This is annoying: I want to reshare Rebecca's comments, not the original post, but I can't seem to do that. :-)</p> <p><strong>Rebecca</strong>: In another conversation, someone linked a particular point in a Buzz commentary to Hacker News (<a href="http://news.ycombinator.com/item?id=1416348">http://news.ycombinator.com/item?id=1416348</a>). I'm not sure how they did it. It was a little strange, though, because then people saw it out of context. These comments were tailored for a context.</p> <p>Where do you want to share it? I'm not sure I'm ready to deal with too big an audience; there is a purpose to practicing writing and getting reactions in an obscure corner of the internet. After all, I am saying things that might be offensive or objectionable in Silicon Valley, and are, in any case, awfully forward -- it is useful to me to talk to a select group of my friends to get feedback from them on how well it does or doesn't fly. It's not like I mind being public, but I also don't mind obscurity for now.</p> <p><strong>Rebecca</strong>: Speaking of which, Piaw, I was biting my fingernails a little wondering how you would react to my way of talking about &quot;software freedom.&quot; I've sort of thought of becoming a software freedom advocate in the tradition of Stallman or ESR, but more intellectual, with more historical perspective, and (hopefully) with less of an edge of polemical insanity. However, adding in an intellectual and historical perspective also added in the difficulty of colliding with real intellectuals and historians, which makes the whole thing fraught, so for that reason among others I've been dragging my feet.</p> <p>This discussion made me dredge up this whole project, partly because I really wanted to know your reactions to it. However, you only reacted to the Facebook comments, not the more general software freedom polemic. What did you think about that?</p> <p><strong>Piaw</strong>: I mostly didn't react to the free software polemic because I agree with what you're saying. I agree that something like Adwords and Google makes possible businesses that didn't exist before. Facebook, for instance, recently showed me an ad for a Suzanne Vega concert that I definitely would not have known about but would have wanted to go to if not for a schedule conflict. I want to be able to &quot;like&quot; that ad so that I can get Facebook to show me more ads like those!</p> <p>Do I agree that the world would be a better place if Facebook's social graph were an open system? Yes and No. In the sense of Facebook having less control, I think it would be a good thing. But do I think I want anybody to have access to it? No.
People are already trained to click &quot;OK&quot; to whatever data access any applet wants in Facebook, and I don't need to be inundated with spam in Facebook --- one of the big reasons Facebook has so much less spam is because my friends are more embarrassed about spamming me than the average marketing person, and when they do spam me it's usually with something that I'm interested in, which makes it not spam.</p> <p>But yes, I do wish my Buzz comments (and yours too) all propagated to Facebook/Friendfeed/etc. and the world was one big open community with trusted/authenticated users and it would be all spam free (or at least, I get to block anonymous commenters who are unauthenticated). Am I holding my breath for that? No.</p> <p>I am grateful that Facebook has made a part of the internet (albeit a walled garden part) fully authenticated and therefore much more useful. I think most people don't understand how important that is, and how powerful that is, and that this bit is what makes Facebook worth whatever valuation Wall Street puts on it.</p> <p><strong>Baris</strong>: Piaw, a more fundamental question lurks within this discussion. Ultimately, will people gravitate toward others with similar interests and wait for resources to grow there (Facebook,) or go where the resources are mature, healthy, and growing fast, and wait for everyone else to arrive (Google?)</p> <p>Will people ultimately go to Google where amazing technology currently exists and will probably magnify, given the current trend (self driving cars, facial recognition, voice recognition, realtime language translation, impeccable geographic data, mobile OS with a bright future, unparalleled parallel computing, etc..) or join their friends first at the current largest social network, Facebook, and wait for the technology to arrive there?</p> <p>A hypothetical way of looking at this: Will people move to a very big city and wait for it to be an amazing city, or move to an already amazing city and wait for everyone else to follow suit? Or are people ok with a bunch of amazing small cities?</p> <p><strong>Piaw</strong>: Baris, I don't think you've got the analogy fully correct. The proper analogy is this: Would you prefer to live in a small neighborhood where you sometimes have to go a long way to get what you want/are interested in but is relatively crime free, or would you like to live in a big city where it's easy to get what you want but you get tons of spam and occasionally someone comes in and breaks into your house?</p> <p>The world obviously has both types of people, which is why suburbs and big cities both exist.</p> <p><strong>Baris</strong>: &quot;tons of spam and occasionally someone comes in and breaks into your house?&quot; I think this is a bit too draconian/general though... going with this analogy, I think becomes a bit more subjective, i.e. really depends on who you are in that city, where you live, what you own, how carefree you live your life, and so forth.</p> <p><strong>Piaw</strong>: Right. And Facebook has successfully designed a web-site around this ego-centricity. You can be the star of your tiny town by selectively picking your friends, or you can be the hub of a giant city and accept everyone as a friend. 
If the latter, then you gave up your privacy when your &quot;friend&quot; posts compromising pictures of you that get you in trouble with your employer.</p> <p><strong>Nick</strong>: Google is the only company with the right position, somewhat appropriate incentives, and possibly the right internal culture to be the &quot;Tim Berners Lee&quot; of the new social internet.</p> <p>I'd agree that Google hasn't done well at social, but surely they are better than that!</p> <p><strong>Rebecca</strong>: Oh, you aren't impressed with Tim Berners Lee's work? Was it the original HTML standard you didn't like, or the more recent W3C stuff? I would admit there is stuff to complain about in both of them.</p> <p><strong>Nick</strong>: It seems to me that TBL got lucky. His original work on the WWW was good, but I think it is difficult to argue he was responsible for its success - certainly no more than someone like Marc Andreessen, who has a pattern of success that repeated after his initial success with Mosaic.</p> <p><strong>Rebecca</strong>: @Piaw (a little ways back) So you found my free software polemic so unobjectionable as to be barely worth comment? Wasn't it a little intellectually radical, with all that &quot;not a capitalist&quot; and &quot;change in the nature of scarcity&quot; stuff? When I told Lessig parts of the basic story (not in the Google context, because it was many years ago), and asked him for advice about how to talk to economists, he warned me that the words I was using contained so many warning bells of crackpot intellectual radicalism that economists would immediately write me off for using them without any further consideration.</p> <p>It never ceases to amaze me how engineers will swallow shockingly strange ideas without a peep. I suppose in the company of Stallman and ESR, I am a raging intellectual conservative and pragmatist, and since engineers have accepted their style as at least a valid way to have a discussion (even if they don't agree with their politics), I seem tame by comparison. Of course talking to historians or economists is a different matter, because they don't already accept that this is a valid way to have a discussion.</p> <p>Actually, it is immensely useful to me to have this discussion thread to use to show people who might think I'm a crackpot, because it is evidence for the claim that in my own world nobody bats an eyelash at this way of talking.</p> <p>Incidentally, I started thinking about this subject because of Krugman. In the late nineties I was a rabid Krugman fan in a style that is now popular -- &quot;Krugman is always right&quot; -- but was a bit strange back then when he was just another MIT economics professor hardly anyone had ever heard of. However, when he talked about technology (<a href="http://pkarchive.org/column/61100.html">http://pkarchive.org/column/61100.html</a>), I thought he was wrong, which upset me terribly because I also was almost religiously convinced he was always right.
In another essay (<a href="http://pkarchive.org/personal/howiwork.html">http://pkarchive.org/personal/howiwork.html</a>) he said it was very important to him to &quot;Listen to the Gentiles&quot; i.e &quot;Pay attention to what intelligent people are saying, even if they do not have your customs or speak your analytical language.&quot; But he also said &quot;I have no sympathy for those people who criticize the unrealistic simplifications of model-builders, and imagine that they achieve greater sophistication by avoiding stating their assumptions clearly.&quot; So it seemed clear to me that he would be willing to hear me explain why he was wrong, as long as I would be willing to state my assumption clearly.</p> <p>Before I knew exactly what I was intending to say, my plan had been to figure out my assumptions well enough to meet his standards, and then ask him to help me do the rest of the work to cast it all into a real economic model. Back then he was just an MIT professor I'd taken a class from, not a famous NYTimes columnist, Nobel-prize winning celebratory, so this plan seemed natural. Profs at MIT don't object if their students point out mistakes, as long as the students are responsible about it. It took me a while to struggle through the process of figuring out what my assumptions were (assumptions? I have assumptions?). When I did I was somewhat horrified to realize that following through with my plan meant accosting him to demand he write a new Wealth of Nations for me! (He'd also left for Princeton by then and started to become famous, so my plan was logistically more difficult than I'd planned.) I had not originally realized what it was that I would be asking for, or that the whole thing would be so daunting.</p> <p>I asked Lessig for advice what to do (Lessig being the only person I knew who lived in both worlds) and Lessig read me the riot act about the rules of intellectual respectability. So it seemed it would be up to me to write the new Wealth of Nations, or at least enough of it to prove the respectability of the ideas contained therein. I was trying to be a computer science student, not an economist, so that degree of effort hardly fit into my plans. I tried to ask for help at the Lab for Computer Science (now CSAIL) by giving a talk in a Dangerous Ideas seminar series, but of the professors I talked to, only David Clark was sympathetic about the need for such a thing. However, he also said very clearly that resources to support grad students to work with economists were limited and really confined to only the kind of very specific net-neutrality stuff he was pushing in concert with his protocol work, not the kind of general worldview I was thinking about. So I was amazed to find that this kind of thing falls into the cracks between different parts of academic culture.</p> <p>I'm still not sure what to do, but I am more and more inclined to ignore Lessig's (implicit) advice to be apologetic and defensive about my lack of intellectual respectability. That would entail a degree of effort I can't afford, since I am still focused on proving myself as a computer scientist, not an intellectual in the humanities. (Having this discussion thread to point to is quite useful on that score.) I could just drop it (I did for a while), but I'm getting more and more upset that technology is moving much faster than the intellectual and social progress that is required to handle it. 
People seem to think that powerful technology is a good thing in itself, but that is not true: it is only technology in the presence of strong institutions to control its power that provides net benefits to society -- without such controls it can be fantastically destructive. From that point of view a &quot;new economy&quot; is not good news -- what &quot;new&quot; means is that all the old institutions are now out of date and their controls are no longer working. And academic culture is culturally dislocated in ways that ensure that no one anywhere is really dealing with this problem. Pretty picture, isn't it?</p> <p><strong>Nick</strong>: @Rebecca: I don't understand your argument. Why is Google selling advertising any more about freedom than Facebook selling advertising?</p> <p>It's true that Facebook doesn't make their social graph and/or demographic data available to third parties, but Google doesn't make a person's search history available to third parties either. Why is one so much worse than the other?</p> <p><strong>Piaw</strong>: Rebecca, I think that having more data be more open is ideal. However, I view it as a purely academic discussion for the same reason I view writing &quot;Independent Cycle Touring&quot; in TeX to be an academic discussion. Sure it could happen, but the likelihood of it happening is slim to none, so I don't find the discussion to be of interest.</p> <p>Now, I do agree that technology and its adoption do grow faster than our wisdom and controls for them. However, I don't think that information technology is the big offender. Humanity's big long-term problems have more to do with fossil fuels as an energy source, and that's pretty darn old technology. You can fix all the privacy problems in the world, but if we get a runaway greenhouse planet by 2100 it is all moot. Because of that you don't find me getting worked up about privacy or the openness of Facebook's social graph. If Facebook does become personally objectionable to me, then I will close my account. Otherwise, I will keep leveraging the work their engineers do.</p> <p><strong>Elliotte</strong>: Rebecca, going back and rereading your comments I'm not sure your analysis is right, but I'm not sure it's wrong either. Of course, I am not an economist. From my non-economist perspective it seems worth further thought, but I also suspect that economists have already thought much of this. The first thing I'd do is chat up a few economists and see if they say something like, &quot;Oh, that's Devereaux's Theory of Productive Capacity&quot; or some such thing.</p> <p>I guess I didn't see anything particularly radical and certainly nothing objectionable in what you wrote. You're certainly not the first to notice that software economics has a different relationship to scarcity than physical goods. Nor would I see that as incompatible with capitalism. It's only really incompatible with a particular religious view of capitalism that economists connected to the real world don't believe in anyway. The theological ideologues of the Austrian School and the nattering nabobs of television news will call you a commie (or more likely these days a socialist) but you can ignore them. Their claimed convictions are really just a bad parody of economics that bears only the slightest resemblance to the real world.</p> <p>You hear a lot from these fantasy world theorists because they have been well funded over the last 40 years or so by corporations and the extremely wealthy with the explicit goal of justifying wealth.
Academically this is most notable at the University of Chicago, and it's even more obvious in the pseudo-economics spouted on television news. At the extreme, these paid hucksters espouse the laissez-faire theological conviction that markets are perfectly efficient and rational and that therefore whatever the markets do must be correct; but the latest economic crises have caused most folks to realize that this emperor has no clothes. Economists doing science and not theology pay no attention to this priesthood. I wish the same could be said for the popular media.</p> <p><strong>Helder</strong>: I don't think I agree with the scarcity point that Rebecca made.</p> <p>Generally, if a company is making money from something it's because they are producing some kind of wealth; otherwise they won't sustain themselves economically. It doesn't have to be productive wealth like in factories, it could be cultural (e.g. a TV show), or something else.</p> <p>Even if you think of artificial scarcity, that's only possible for a company to create when they already have a lot of momentum (e.g. Windows or Facebook dominance). Artificial scarcity sucks when you look just at it, but it's more like a &quot;local&quot; optimization based on an already dominant market position.</p> <p>Perhaps Facebook, Microsoft and other companies wouldn't have thrived in the first place if they weren't &quot;allowed&quot; to make the most of their closed systems. The world is a better place with a closed Facebook and a proprietary Windows API than with no Facebook or Windows at all.</p> <p>TV producers try to do their best to create the right scarcity when releasing their shows and movies to maximize profit. If they were to adopt some kind of free and open philosophy and release their content for download on day 1, they would simply go broke and destroy wealth in the long run.</p> <p><strong>Rebecca</strong>: Thanks guys, for the great comments! I appreciate the opportunity to answer these objections, because this is a subtle issue and I can certainly see that the reasoning behind my position is far from obvious. I won't be able to do it today because I need to be out all day, and it's probably just as well that I have a little time to think of how to make the reply as clear and short as possible.</p> <p><strong>Rebecca</strong>: OK, I have about four different kinds of objections to answer, and I do want to keep this as short as I can, so I think I will arrange it carefully so I can use the ideas in one answer to help me explain the next one. That means I'll answer in the order: Elliotte, Piaw, then Nick &amp; Helder.</p> <p>It actually took me much of a week to write and edit an answer I liked and believed was as condensed as I could make it. And despite my efforts it is still quite long. However, your reaction to my first version has impressed on me that there are some key points I need to take the space to clarify:</p> <ol> <li><p>I shouldn't have tried to talk about a system that &quot;isn't capitalism&quot; in too short an essay, because that is just too easily misunderstood. I take a good bit of space in the arguments below matching Elliotte's disavowal of the religious view of capitalism with an explicit disavowal of the religious view of the end of capitalism.</p></li> <li><p>Piaw also asked a good question: &quot;why is this important?&quot; It isn't obvious; it's only something you can see once it sinks into you how dramatically decades of exponential technological growth can change the world.
Since this subject is pretty crazy-making and hard to see with any perspective, I try to use an image from the past to help us predict how people from the future will see us differently than we see ourselves. I want to impress on you why future generations are likely to make very different judgements about what is and isn't important.</p></li> <li><p>Finally, I said rather casually that I wanted to talk about software freedom in the standard tradition, only with more intellectual and historical perspective. As I write this, though, I'm realizing the historical perspective actually changes the substance of the position, in a way I need to make clear.</p></li> </ol> <p>And last of all I wanted to step back again and put this all in the context of what I am trying to accomplish in general, with some commentary on your reactions to the assertion that I am being intellectually radical.</p> <p>These replies are split into sections so you can choose the one you like if the whole thing is too long. But the long answer to Piaw contains the idea which is key to the rest of it.</p> <p><strong>Rebecca</strong>: So, first, @Elliotte -- &quot;I'm not the first to notice that software has a different relationship with scarcity than physical goods.&quot; But my take on the difference is not the usual: I am not repeating the infinitely-copyable thing everyone goes on about, but instead focusing on the scarcity (or increasing lack thereof) of productive capacity. That way of talking challenges more directly the fundamental assumptions of economic theory, and is therefore more intellectually radical: in a formal way, it challenges the justification for capitalism. But you didn't buy my &quot;incompatible with capitalism&quot; argument either, which I'm glad of, because it gives me the chance to mention that just as much as you want to disown the religious view of what capitalism is, I'd like to specifically disown the religious view of the end of capitalism.</p> <p>Marx talked about an &quot;end of capitalism&quot; as some magic system where it becomes possible for workers to seize the means of production (the factories) and make the economy work without ownership of capital. He was also predicting that capitalism must eventually end, because after all, feudalism had ended. But if you put those two assertions together, and solved the logical syllogism, you would get the assertion that feudalism ended because the serfs seized the means of production (the farms) and made an economy work without the ownership of land. That isn't true! I grew up in Iowa. There are landowners there who own more acreage than most fabled medieval kings. Nobody challenges their ownership, and yet nobody would call that system feudalism. Why not? Because their fields are harvested by machines, not serfs. Feudalism ended not because the landowning class changed their landowning ways. It was because the land-working class, the serfs, left for better jobs in factories; and the landowners don't care anymore, because they eventually replaced the serfs with machines. The end of feudalism was not the end of the ownership of land, it was the end of a social position and set of prerogatives that went along with that ownership.
If your vassals are machines, you can't lord over them.</p> <p>Similarly, in a non-religious view of the end of capitalism, it will come about not because the capitalist class, the class that owns factories, will ever disappear or change their ways, but because the proletariat will go away -- they will leave for better jobs doing something else, and the factory owners will replace them with machines. And in fact you can see that that is already happening. Are you proletariat? Am I? If I create an STL model and have it printed by Shapeways, I am manufacturing something, but I am not proletariat. Shapeways is certainly raising capital to buy their printers, which strictly speaking makes them &quot;capitalists,&quot; but in a social sense they are not capitalists, because their relationship with me has a different power structure from the one Marx objected to so violently. I am not a &quot;prole&quot; being lorded over by them. It isn't the big dramatic revolution Marx envisioned; it is so subtle you can almost miss it entirely. What if capitalism ended and nobody noticed?</p> <p><strong>Rebecca</strong>: Next @Piaw -- Piaw said he didn't think information technology was the biggest offender in the realm of technology that grows faster than our controls of it; for instance he thought global warming was a more pressing immediate problem.</p> <p>I definitely agree that the immediate problems created by information technology and the associated social change are, right now, small by comparison to global warming. It would be nice if we could tackle the most immediate and pressing problems first, and leave the others until they get big enough to worry about. But the problems of a new economy have the unique feature of being pressing not because they are necessarily immediate or large (right now), but because if they are left undealt-with they can destroy the political system's ability to effectively handle these or any other problems.</p> <p>I'm a believer in understanding the present through the lens of the past: since we have so much more perspective about things that happened many, many years ago, we can interpret the present and predict our future by understanding how things that are happening to us now are analogous to things that happened long ago. Towards that end, I'd like to point out an analogy with a fictional image of people who, very early on in the previous &quot;new economy,&quot; tried to push new ideas of freedom and met with the objection that they were making too big a deal over problems that were too unimportant. (That this image is fictional is part of my point -- bear with me.) My image comes from a dramatic scene in the musical 1776 (whose synopsis can be found at <a href="http://en.wikipedia.org/wiki/1776_%28musical%29">http://en.wikipedia.org/wiki/1776_%28musical%29</a>, scene seven), in which an &quot;obnoxious and disliked&quot; John Adams almost throws away the vote of Edward Rutledge and the rest of the southern delegation over the insistence that a condemnation of slavery be included in the Declaration of Independence.
He drops this insistence only when he is persuaded to change his mind by Franklin's arguments that the fight with the British is more important than any argument on the subject -- &quot;we must hang together or we will hang separately.&quot;</p> <p>In fact, nothing like that ever happened: as the historical notes on the Wikipedia page say, everyone at the time was so totally in agreement that the issue was too unimportant to be bothered to fight about it, let alone have the big showdown depicted in the musical, with Rutledge dramatically but improbably singing a spookily beautiful song in defense of the triangle trade: &quot;Molasses to Rum to Slaves.&quot; The scene was inserted to satisfy the sensibilities of modern audiences that whether or not such a showdown happened, it should have happened.</p> <p>Why are our sensibilities so different from reality? Why are we imposing on the past the idea that the fight ought to have been important to them, even though it wasn't, that John Adams ought to have made himself obnoxious and disliked in his intransigent insistence on America's founding values of freedom, even though he didn't and he wasn't, that Franklin ought to have argued with great reluctance that the fight with the British was more important, even though he never made that argument (because it went without saying), and that Edward Rutledge ought to have been a spooky, equally intransigent apologist for slavery, even though he wasn't either (later he freed his own slaves)? We are imposing this false narrative because we are looking backwards through a lens where we know something about the future the real actors had no idea about. This is important to understand because we may be in a similar position with respect to future generations -- they will think we should have had a fight we in fact have no inclination to have, because they will know something we don't know about our own future. The central argument I want to make to Piaw hinges on an understanding of this thing that later generations are likely to know about our future that we currently have difficulty imagining.</p> <p>So forgive me if I belabor this point: it is key to my answer both to Piaw's question and also to Nick &amp; Helder's objection. It's going to take a little bit of space to set up the scenery, because it is non-trivial for me to pull my audience back into a historical mentality very different from our own. But I want to go through this exercise in order to pull out of it a general understanding of how and why political ways of thinking shift in the face of dramatic technological change -- which we can use to predict our own future and the changing shape of our politics.</p> <p>What is it that the real people behind this story didn't know that we know now? Start with John Adams: to understand why the real John Adams wouldn't have been very obnoxious about pushing his idea of freedom on slaveowners in 1776, realize that his idea of freedom, if restated in economic rather than moral terms, would have been the assertion that &quot;it should be an absolute right of all citizens of the United States to leave the farm where they were born and seek a better job in a factory.&quot; But making a big deal about such a right in 1776 would have been absurd. There weren't very many factories, and they were sufficiently inefficient that the jobs they provided were unappealing at best.
For example, at the time Jefferson wrote in total seriousness about the moral superiority of agrarian over industrial life: such a sentiment seemed reasonable in 1776, because, not to put too fine a point on it, factory life was horrible. Because of this, the politicians in 1776, like Adams or Hamilton, who were deeply enamored of industrialization, pushed their obsession with an apologetic air, as if they were only talking about their own personal predilections, which they took great pains to make clear they were not going to impose on anyone else. The real John Adams was not nearly as obnoxious as our imaginary version of him: we imagine him differently only because we wish he had been different.</p> <p>We wish him different than he really was because there was one important fact that the people of 1776 may have understood intellectually, but whose full social significance they did not even begin to wrap their minds around: the factories were getting better exponentially, while the farms would always stay the same. Moore's Law-like growth rates in technology are not a new phenomenon. Improvements in the production of cotton textiles in the early nineteenth century stunned observers like the improvements in chips or memory impress us today -- and after cotton-spinning had its run, other advances stepped into the limelight each in turn, as the article at www.theatlantic.com/magazine/archive/1999/10/beyond-the-information-revolution/4658/ tries to impress on us. We forget that dramatic exponential improvements in technology are not a new phenomenon. We also forget that if exponential growth runs for decades, it changes things... and it changes things more than anybody at the beginning of such a run dares to imagine.</p> <p>This brings us to the other characters in our story who made choices we now wish they had made differently (and they also later regretted). Edward Rutledge and Thomas Jefferson didn't exactly defend slavery; they were quite open about being uncomfortable with it, but they didn't consider this discomfort important enough to do much about. That position would also have made sense in 1776: landowners had owned slaves since antiquity, but slavery in ancient times was not fantastically onerous compared to the other options available to the lower classes at the time -- there are famous stories of enterprising Greek and Roman slaves who earned their freedom and rose to high positions in society. Rutledge and Jefferson probably thought they were offering their slaves a similar deal, and that all in all, it wasn't half bad.</p> <p>They were wrong. American slavery turned out to be something unique, entirely different than the slavery of antiquity. My American history teacher presented this as a &quot;paradox,&quot; that the country that was founded on an ideal of freedom also was home to the most brutal system of slavery the world has ever seen. But I think this &quot;paradox&quot; is quite understandable: it is two parts of the same phenomenon. Ask the question: why could ancient slaveowners afford to be relatively benign? Because they were also relatively secure in their position -- their slaves knew as well as they did that the lower classes didn't have many other better options. Sally Hemmings, Jefferson's lover, considered running away when she was in France with him, but Jefferson successfully convinced her that she would get a better deal staying with him. 
He didn't have to take her home in chains: she left the possibility of freedom in France and came back of her own free will (if slightly wistfully).</p> <p>But as time passed and the factory jobs in the North proceeded in their Moore's Law trajectory, eventually the alternatives available to the lower classes began to look better than in any time before in human history. The slaves Harriet Tubman smuggled to Canada arrived to find options exponentially better than those Hemmings could have hoped for if she had left Jefferson. As a result, for the first time in human history, slaves had to be kept in chains.</p> <p>In the more abstract terms I was using before, slavery was relatively benign when the scarcity of opportunity that bound slaves to their masters was real, but as other opportunities became available, this &quot;real scarcity&quot; became &quot;artificial,&quot; something that had to be enforced with chains -- and laws. That is where the slaveowners transformed into something uniquely brutal: to preserve their way of life they needed not only to put their slaves in chains, they also needed to take over the political and legal apparatus of society to keep those chains legal. There came into existence the one-issue politician -- the politician whose motive to enter political life was not to understand or solve the problems facing the nation, to listen to other points of view or forge compromises, or any of the other natural things that a normal politician does, but merely to fight for one issue only: to write into law the &quot;artificial scarcity&quot; that was necessary to preserve the way of life of his constituents, and play whatever brutal political tricks were necessary to keep those laws on the books. Political violence was not off the table - a recent editorial &quot;When Congress Was Armed And Dangerous&quot; (www.nytimes.com/2011/01/12/opinion/12freeman.htm) reminds us that that the incitements to violence of today's politics are tame compared to the violence of the politics of the 1830's, 40's and 50's. The early 1860's were the culmination of the decades-long disaster we wish the Founding Fathers had foreseen and averted. We wish they had had the argument about slavery while there was still time for it to be a mere argument -- before the elite it supported poisoned the political system to provide for its defense.</p> <p>They, in their old age, wished it too: forty-five years after Jefferson declined to make slavery an important issue in the debate over the Declaration of Independence, he was awakened by the &quot;firebell in the night&quot; in the form of the Missouri compromise. News of this fight caused him to wake up to the real situation, and he wrote to a friend &quot;we have the wolf by the ears, and we can neither hold him, nor safely let him go. Justice is in one scale, and self-preservation in the other.... 
I regret that I am now to die in the belief that the useless sacrifice of themselves by the generation of '76, to acquire self government and happiness to their country, is to be thrown away by the unwise and unworthy passions of their sons, and that my only consolation is to be that I live not to weep over it.&quot;</p> <p>So, forty-five years after he declined to engage with an &quot;unimportant,&quot; &quot;academic&quot; question, he said of the consequences of that decision that his &quot;only consolation is to be that I live not to weep over it.&quot; He had not counted on the &quot;unwise and unworthy passions&quot; of his sons -- for his own part, he would have been happy to let slavery lapse when economic conditions no longer gave it moral justification. However, the next generation had different ideas -- they wanted to do anything it took to preserve their prerogatives. By that point the choices he had were defined by the company he kept: since he was a Virginian, he would have had to go to war for Virginia, and fight against everything he believed in. He would have wanted to go back to the time when he could have made a choice that was his own, but that time was past and gone, and no matter how &quot;unwise and unworthy&quot; were the passions which were now controlling him, he had no choice but to be swept along by them.</p> <p>This is my argument about why we should pay attention to &quot;unimportant&quot; and &quot;academic&quot; questions. In 1776 it was equally &quot;academic&quot; to consider looking ahead through seventy-five years of exponential growth to project the economic conditions of 1860, and use that projection to motivate a serious consideration of abstract principles that were faintly absurd in the conditions of the time, and would only become burning issues decades and decades later. Yet we wish they had done just that, and in their old age they also fervently wished that they had too. This seems strange: why plan for 1860 in 1776? Why plan for 2085 in 2010? Why not just cross that bridge when we come to it? Let the next generation worry about their own problems; why should we think for them? We have our own burning issues to worry about! The projected problems of 2085 are abstract, academic, and unimportant to us. Why not leave them alone and worry about our present burning concerns?</p> <p>The difficulty is that if we do leave them alone, if we don't project the battle over our values absurdly into the future and start dealing with the shape of our conflict as it will look when transformed by many decades of time and technological change, we may well lose the political freedom of action to solve these problems non-violently -- or to handle any others either. We will have &quot;a wolf by the ears.&quot; We wish the leaders of 1776 had envisioned and taken seriously the problems of 1860, because in 1776 they were still reasonable people who could talk to each other and effectively work out a compromise. By 1860 that option was no longer available. The problem is that when these kinds of problems eventually stop being &quot;academic,&quot; when they stop being the dreams of intellectuals and become burning issues for millions of real people, the fire burns too hot. Too many powerful people choose to &quot;take a wolf by the ears&quot;. This wolf may well consume the entire political and legal system and make it impossible to handle that problem or any other, until the only option left to restore the body politic is civil war.
Once that happens everyone will fervently wish they could go back to the time when the battles were &quot;merely academic&quot;.</p> <p>I worked out this story around 2003, because starting in 1998 I had wanted to have a name to give to a nameless anxiety (in between, I thrashed around for quite a while figuring out which historical image I believed in the most). When I was sure, I considered going to Krugman to use this story to fuel a temper tantrum about how he absolutely had to stop ignoring the geeks who tried to talk to him about &quot;freedom.&quot; But I was inhibited: I was afraid the whole argument would come across as intellectually suspect and emotionally manipulative. Besides, the immediate danger this story predicted -- that politics would devolve into 1830's style one-issue paralysis -- seemed a bit preposterous in 2003. Krugman wasn't happy about the 2002 election, but it wasn't that bad. But now I feel some remorse in the other direction: it has gotten worse faster than I ever dreamed it would. I didn't predict what has been happening exactly. I was very focussed on tech, so I didn't expect the politicians in the service of the powerful people with &quot;a wolf by the ears&quot; to be funded by the oldest old economy powers imaginable -- banking and oil. That result isn't incompatible with this argument: that very traditional capitalism should gain an unprecedented brutality just when the new economy is promising new freedoms, is, this line of reasoning predicts, exactly what you should expect. I'm afraid now that Krugman will be mad at me for not bothering him in 2003, because he would have wanted the extra political freedom of action more than he would have resented the very improper intellectual argument.</p> <p><strong>Rebecca</strong>: Now that I've laid the groundwork, it is much easier for me to answer Nick and Helder. Both of you are essentially telling me that I'm being unreasonable and obnoxious. I will break dramatically with Stallman by completely conceding this objection across the board. I am being unreasonably obnoxious. However, there is a general method to this madness: as I explained in the image above, I am essentially pushing values that will make sense in a few decades, and pulling them back to the current time, admittedly somewhat inappropriately. The main reason I think it is important to do this is not because I think the values I am promoting should necessarily apply in an absolute way right now (as Stallman would say) but instead because it is a lot easier to start this fight now than to deal with it later. The reason to fight now is exactly because the opponents are still reasonable, whereas they might not be later. Unlike Stallman, I want to emphasize my respect (and gratitude) for reasonable objections to my position. My opponents are unlikely to shoot me, which is not a priviledge to be taken for granted, and one I want to take advantage of while I have it.</p> <p>To address the specifics of your objections: Helder complained that companies needed the tactics I called &quot;exploitation of artificial scarcity&quot; to recoup their original investment -- if that wasn't allowed, the service wouldn't exist at all, which would be worse. Nick objected that 80 or 90% of Facebook's planned revenue was from essentially similar sources as Google's, so why should I complain just because of the other 10 or 20%? 
That was what I was complaining about -- that a portion of their revenue comes from closing their platform and taxing developers -- but that is only a small part of Ms. Sandberg's diversified revenue plans, and I admit that the rest is fairly logically indistinguishable from Google's strategy. In both cases it can easily be argued I am taking an extremely unreasonable hard line.</p> <p>Let's delve into a dissection of how unreasonable I'm being. In both cases the unreasonableness comes from a problem with my general argument: I said that Mark Zuckerberg is not a capitalist, that is to say, he is not raising capital to buy physical objects that make his workers more productive -- but that is not entirely true. Facebook's data centers are expensive, and they are necessary to allow his employees to do their work.</p> <p>The best story on this subject might also be the exception to prove the rule. The most &quot;capitalist&quot; story about a tech mogul's start is the account of how Larry &amp; Sergey began by maxing out their credit cards to buy a terabyte of disk (<a href="http://books.google.com/books?id=UVz06fnwJvUC&amp;pg=PA6#v=onepage&amp;q&amp;f=false">http://books.google.com/books?id=UVz06fnwJvUC&amp;pg=PA6#v=onepage&amp;q&amp;f=false</a>) This story could have been written by Horatio Alger -- it so exactly follows the script of a standard capitalist's start. But for all that, L&amp;S did not make all the standard capitalist noise. I was a fan of Google very early, and may even have pulled data from their original terabyte, and I never heard about how they needed to put restrictions on me to recoup their investment. A year or so later when I talked to Googlers at their recruiting events, I thought they were almost bizarrely chipper about their total lack of revenue strategy. Yet they got rich anyway. And now that same terabyte costs less than $100.</p> <p>That last is the key point: it is not that the investments aren't significant and need to be recouped. It is that their size is shrinking exponentially. In the previous section, I emphasized the enormous transformative effect of decades of exponential improvement in technology, and the importance of extrapolating one's values forward through those decades, even if that means making assertions that currently seem absurd. The investments that need to be recouped are real and significant, but they are also shrinking, at an exponential rate. So the economic basis for the assertion of a right to restrict people's freedom of opportunity in order to recoup investment is temporary at best. And, as I described in the last part, asserting prerogatives on the basis of scarcity which is now real but will soon be artificial is ... dangerous. Even if you honestly think that you will change with changing times, you may find to your sincere horror that when the time comes to make a new choice, you no longer have the option. Your choices will be dictated by the company you keep. I didn't say it earlier, but one of the things that worries me the most about Facebook is that they seem to have gotten in bed with Goldman Sachs. The idea of soon-to-be multibillionare tech moguls taking lessons in political tactics from Lloyd Blankfein doesn't make me happy.</p> <p>I am glad that you objected, and gave me licence to take the space to explain this more carefully, because actually my point is more subtle and nuanced than my original account -- which I was oversimplifying to save space -- suggested. 
(Lessig told me I had to write an account that fit in five pages in order to hope to be heard, to which I reacted with some despair. I can't! There is more than five pages worth of complexity to this problem! If I try to reduce beyond a certain point I burn the sauce. You are watching me struggle in public with this problem.)</p> <p>There is another part of the mythology of &quot;the end of capitalism&quot; that I should take the time to disavow. The mythology talks as if there is one clear historical moment when an angel of annunciation appears and declares that a new social order has arrived. In reality it isn't like that. It may seem like that in history books when a few pages cover the forty-five years between the time Jefferson scratched out his denunciation of slavery in the Declaration of Independence, and when he wrote the &quot;firebell in the night&quot; letter. But in real, lived, life, forty-five years are most of an adult lifetime. When Jefferson wrote &quot;justice is in one scale, self-preservation in the other,&quot; could he point to a particular moment in the previous forty-five years when the hand of justice had moved the weight to the other side of the scale? There was no one moment: just undramatic but steady exponential growth in the productivity of the industrial life whose moral inferiority seemed so obvious four decades earlier. He hadn't been paying attention, no clarion call was blown at the moment of transition (if such a moment even existed), so when he heard the alarm that woke him up to how much the world had changed, it was too late.</p> <p>In a similar way, I think the only clear judgement we can make now is that we are in the middle of a transition. There was some point in time when disk drives were so expensive and the data stored on them so trivial that the right of their owners to recoup their investment clearly outweighed all other concerns. There will be some other point, decades from now, when the disk drives are so cheap and the data on them so crucial to the livelihood of their users and the health of the economy, that the right to &quot;software freedom&quot; will clearly outweigh the rights of the owner of the hardware. There will be some point clearly &quot;too early&quot; to fight for new freedoms, and some point also clearly &quot;too late.&quot; In between these two points in time? Merely a long, slow, steady pace of change. In that time the hand of justice may refuse to put the weight on either side of the scale, no matter how much we plead with her for clarity. We may live our whole lives between two eras, where no judgement can be made black or white, where everything is grey.</p> <p>But people want solid judgement: they want to know what is right and wrong. This greyness is dangerous, for it opens a vacuum of power eagerly filled by the worst sorts, causes a nameless anxiety, and induces political panic. So what can you do? I think it is impossible to make black-and-white judgements about which era's values should apply. But one can say with certainty that it is desirable to preserve the freedom of political action that will make it possible to defend the values of a new era at the point when it becomes unequivocally clear that that fight is appropriate. I'm really not very upset if companies do whatever it takes to recoup their initial investment -- as long as it's temporary. 
But what guarantee do I have that if Facebook realizes its 50 billion market capitalization, they won't use that money to buy politicians to justify continuing their practices indefinitely? Their association with Goldman doesn't reassure me on that score. I trust Google more to allow, and even participate in, honest political discussion. That is the issue which I'm really worried about. The speed and effectiveness with which companies are already buying the political discourse has taken me by surprise, even though I had reason to predict it. When powerful people &quot;have a wolf by the ears&quot; they can become truly terrifying.</p> <p><strong>Rebecca</strong>: You could have a point that this kind of argument isn't as unorthodox as it once was. After all, plenty of real economists have been talking about the &quot;new economy&quot; -- Peter Drucker in &quot;Beyond the Information Revolution&quot; (linked above), Hal Varian in &quot;Information Rules&quot;, Larry Summers in a speech &quot;The New Wealth of Nations&quot;, even Krugman in his short essay &quot;The Dynamo and the Microchip&quot;, and a bunch of younger economists surveyed in David Brooks's recent editorial &quot;The Protocol Society.&quot; But I don't share Brooks's satisfaction in the observation &quot;it is striking how [these &quot;new economy&quot; theorists] are moving away from mathematical modeling and toward fields like sociology and anthropology.&quot; There is a sense that my attitude is now more orthodox than the orthodoxy -- though the argument I sketched here is not a mathematical model, it was very much designed and intended to be turned into one, and I am vehemently in agreement with Krugman's attitude that nothing less than math is good enough. Economics is supposed to be a science where mathematical discipline forces intellectual honesty and provides a bulwark of defense against corruption.</p> <p>I'm in shock that this crop of &quot;new economy talk&quot; is so loose, sloppy and journalistic ... and because it is so intellectually sloppy, it is hard even to tell whether it is corrupt or not. For instance, though I liked the historical observations and the conclusions drawn from them in Drucker's 1999 essay as much as anything I've read, his paean to the revolutionary effects of e-commerce reads so much like dot.com advertising it is almost embarrassing. Though, to be fair, there are some hints at the essay's conclusion of a consciousness that an industrial revolution isn't just about being sprinkled with technological fairy dust magic, but also involves some aspect of painful social upheaval -- even so, his story is so strangely upbeat, especially since, given his clearly deep understanding of historical thinking, he should have known better ... one wonders whom he was trying not to offend. Similarly, can we trust Summer's purported genius or do his millions in pay from various banks, um, influence his thinking? and it goes on... Krugman is about the only one left I'm sure I can trust. People who do this for a living should be doing better than this! Even though I understand Lessig's point about my intellectual radicalism, it's hard for me to want to follow it, because some part of me just wants to challenge these guys to show enough intellectual rigor to prove that I can trust them.</p> <p>To be fair, I admit part of Brooks's point that there is something &quot;anthropological&quot; about a new economy: it tends to drive everyone mad. 
Think about the new industrial rich of the nineteenth century -- dressed to the nines like pseudo-aristocrats, top hat, cane, affected accent, and maybe a bought marriage to get themselves a title too -- they were cuckoo like clocks! Deep, wrenching, technologically-driven change does that to people. But just because it is madness doesn't mean it doesn't have a method to it amenable to mathematical modeling. Krugman wrote (<a href="http://pkarchive.org/personal/incidents.html">http://pkarchive.org/personal/incidents.html</a>) that when he was young, Asimov's &quot;Foundation Trilogy&quot; inspired him to dream of growing up to be a &quot;psychohistorian who use[s his] understanding of the mathematics of society to save civilization as the Galactic Empire collapses&quot; but he said &quot;Unfortunately, there's no such thing (yet).&quot; Do you think he could be tempted by a real opportunity to be a &quot;psychohistorian&quot;?</p> <p>I never meant for this to be a fight I pursue on my own. The whole reason I translated it into an economic and historical language is that I wanted to convince the people who do this for a living to take it up for me. I can't afford to fight alone: I don't have the time to spend caught up in political arguments, nor can I afford to make enemies of people I might want to work for. I'm making these arguments here mostly because having a record of a debate with other tech people will help convince intellectuals of my seriousness. I'm having some difficulty getting you to understand, but I think I would have a terrible uphill battle trying to convince intellectuals that I am not &quot;crying wolf&quot; -- they have just heard this kind of argument misused too many times before. I have to admit that I am crying wolf, but the reason I'm doing it is because this time there really is a wolf!</p> <p><strong>Piaw</strong>: Here's the thing, Rebecca: it wasn't possible to have that argument about freedom/slavery in 1776. The changes brought about later made it possible to have that argument much later. The civil war was horrifying, but I really am not sure if it was possible to change the system earlier.</p> <p><strong>Ruchira</strong>: Hi Rebecca,</p> <p>I haven't yet read this long conversation. But if you're not already familiar with the concepts of rivalrous vs nonrivalrous and excludable vs nonexcludable</p> <p><a href="http://en.wikipedia.org/wiki/Rivalry_(economics)">http://en.wikipedia.org/wiki/Rivalry_(economics)</a></p> <p>these terms might help connect you with what others have thought about the issues you're talking about. See in particular the &quot;Possible solutions&quot; under Public goods:</p> <p><a href="http://en.wikipedia.org/wiki/Public_good">http://en.wikipedia.org/wiki/Public_good</a></p> <p><strong>Daniel Stoddart</strong>: I've said it before and I'll say it again: I wouldn't be so quick to count Google out of social. Oh, I know it's cool to diss Buzz like Scoble has been doing for a while now, saying that he has more followers on Quora. But that's kind of an apples and oranges comparison.</p> <p><strong>Ruchira</strong>: Rebecca: Okay, now I have read the long conversation. I do think you have an important point but I haven't digested it enough to form an opinion (which would require judging how it interconnects with other important issues). Just a couple of tangential thoughts:</p> <p>1) If you fear the loss of freedom, watch out for the military-industrial complex. 
You've elsewhere described some of the benefits from it, but this is precisely why you shouldn't be lulled into a false sense of comfort by these benefits, just as you're thinking others should not be lulled into a false sense of comfort about the issues you're describing. Think about the long-term consequences of untouchable and unaccountable defense spending, and about the interlocking attributes of the status quo that keep it untouchable and unaccountable. They are fundamentally interconnected with information hiding and lack of transparency.</p> <p>2) There exists a kind of psychohistory: cliodynamics. <a href="http://cliodynamics.info/">http://cliodynamics.info/</a> As far as I know it's not yet sufficiently developed to apply to the future, though.</p> <p><strong>Ruchira</strong>: Rebecca: On that note, I wonder what you think of Noam Scheiber's article &quot;Why Wikileaks Will Kill Big Business and Big Government&quot; <a href="http://www.tnr.com/article/politics/80481/game-changer">http://www.tnr.com/article/politics/80481/game-changer</a> He's certainly thinking about how technology will cause massive changes in how society is organized.</p> <p><strong>Helder</strong>: (note: I didn't read the whole thing with full attention) In the case of some closed systems, the cost of making them open (and the lack of a business justification for doing so), together with a general necessity to protect the business, usually outweighs the need for a return on investment by far. So it's not all about ROI.</p> <p>Also, society's technological development and the shrinking capital needs for new businesses (e.g. the cost of a terabyte) don't usually favor a closed-system business in the long run; they probably only weaken it. You can have a walled garden, but as the outside ground level goes up, the wall gets shorter and shorter. Just look at how the operating system is increasingly less relevant as most action gravitates towards the browser. Another example (perhaps to be seen?) is the credit card industry, as I mentioned in my first comment.</p> <p><strong>Rebecca</strong>: Thanks for reading this long, long post and giving me feedback!</p> <p><strong>Ruchira</strong>: Helder: Facebook makes the wall shorter for its developers (I'm sure Zynga thinks it has grown wealthy thanks to Facebook). This directly caused an outcry over privacy (the walled garden is not walled any more).</p> <p><strong>Rebecca</strong>: Hope you find it food for thought! You might also be interested in David Singh Grewal's Network Power <a href="http://amzn.to/h72nNJ">http://amzn.to/h72nNJ</a> It discusses a lot of relevant issues, and doesn't assume a lot of background (since it's targeted at multiple disciplines), so, as an outsider like you, I found it very helpful. After that, you might (or might not) become interested in the coordination problem -- if you do, Richard Tuck's Free Riding <a href="http://amzn.to/f9goyT">http://amzn.to/f9goyT</a> may be of interest.</p> <p><strong>Rebecca</strong>: Thanks, Ruchira, for the links.</p> How does Boston compare to SV and what do MIT and Stanford have to do with it? mit-stanford/ Fri, 01 Jan 2010 00:00:00 +0000 mit-stanford/ <p><em>This is an archive of an old Google Buzz conversation on MIT vs. Stanford and Silicon Valley vs. Boston</em></p> <p>There's no reason why the Boston area shouldn't be as much a hotbed of startups as Silicon Valley is. By contrast, there are lots of reasons why NYC is no good for startups. 
Nevertheless, Paul Graham gave up on the Boston area, so there must be something that hinders startup formation in the area.</p> <p><strong>Kevin</strong>: This has nothing to do with money, or talent, or what have you. All that matters is &quot;entrepreneur density&quot;.</p> <p>Boston may have the money, the talent, the intelligence, but does it have an entrepreneurial spirit and enough of a density?</p> <p><strong>Marya</strong>: From <a href="http://www.xconomy.com/boston/2009/01/22/paul-graham-and-y-combinator-to-leave-cambridge-stay-in-silicon-valley-year-round/">http://www.xconomy.com/boston/2009/01/22/paul-graham-and-y-combinator-to-leave-cambridge-stay-in-silicon-valley-year-round/</a> &quot;Graham says the reasons are mostly personal, having to do with the impending birth of his child and the desire not to try and be a bi-coastal parent&quot;. But then immediately after, we see he says: &quot;Boston just doesn’t have the startup culture that the Valley does. It has more startup culture than anywhere else, but the gap between number 1 and number 2 is huge; nothing makes that clearer than alternating between them.&quot; Here's an interview: <a href="http://www.xconomy.com/boston/2009/03/10/paul-graham-on-why-boston-should-worry-about-its-future-as-a-tech-hub-says-region-focuses-on-ideas-not-startups-while-investors-lack-confidence/">http://www.xconomy.com/boston/2009/03/10/paul-graham-on-why-boston-should-worry-about-its-future-as-a-tech-hub-says-region-focuses-on-ideas-not-startups-while-investors-lack-confidence/</a> Funny, because Graham seemed partial to the Boston area, earlier: <a href="http://www.paulgraham.com/cities.html">http://www.paulgraham.com/cities.html</a> <a href="http://www.paulgraham.com/siliconvalley.html">http://www.paulgraham.com/siliconvalley.html</a></p> <p><strong>Rebecca</strong>: I think he's partial because he likes the intellectual side of Boston, enough to make him sad that it doesn't match SV for startup culture. I know the feeling. I guess I have seen things picking up here recently, enough to make me a little wistful that I have given my intellectual side priority over any entrepreneurial urges I might have, for the time being.</p> <p><strong>Scoble</strong>: I disagree that Boston is #2. Seattle and Tel Aviv are better and even Boulder is better, in my view.</p> <p><strong>Piaw</strong>: Seattle does have a large number of Amazon and Microsoft millionaires funding startups. They just don't get much press. I wasn't aware that Boulder is a hotbed of startup activity.</p> <p><strong>Rebecca</strong>: On the comment &quot;there is no reason Boston shouldn't be a hotbed of startups...&quot; Culture matters. MIT's culture is more intellectual than entrepreneurial, and Harvard even more so. I'll tell you a story: I was hanging out in the MIT computer club in the early nineties, when the web was just starting, and someone suggested that one could claim domain names to make money reselling them. Everyone in the room agreed that was the dumbest idea they had ever heard. It was crazy. Everything was available back then, you know. And everyone in that room kind of knew they were leaving money on the ground. And yet we were part of this club that culturally needed to feel ourselves above wanting to make money that way. Or later, in the late nineties I was hanging around Philip Greenspun, who was writing a book on database-backed web development. He was really getting picked on by professors for doing stuff that wasn't academic enough, that wasn't generating new ideas. 
He only barely graduated because he was seen as too entrepreneurial, too commercial, not original enough. Would that have happened at Stanford? I read an interview with Rajeev Motwani where he said he dug up extra disk drives whenever the Google founders asked for them, while they were still grad students. I don't think that would happen at MIT: a professor wouldn't give a grad student lots of stuff just to build something on their own that they were going to commercialize eventually. They probably would encounter complaints that they weren't doing enough &quot;real science&quot;. There was much resentment of Greenspun for the bandwidth he &quot;stole&quot; from MIT while starting his venture, for instance, and people weren't shy about telling him. I'm not sure I like this about MIT.</p> <p><strong>Piaw</strong>: One of my friends once turned down a full-time offer at Netscape (after his internship) to return to graduate school. He said at that time, &quot;I didn't go to graduate school to get rich.&quot; Years later he said, &quot;I succeeded... at not getting rich.&quot;</p> <p><strong>Dan</strong>: As the friend in question (I interned at Netscape in '96 and '97), I'm reasonably sure I wouldn't have gotten very rich by dropping out of grad school. Instead, by sticking with academia, I've managed to do reasonably well for myself with consulting on the side, and it's not like academics are paid peanuts, either.</p> <p>Now, if I'd blown off academia altogether and joined Netscape in '93, which I have to say was a strong temptation, things would have worked out very differently.</p> <p><strong>Piaw</strong>: Well, there's always going to be another hot startup. :-) That's what Reed Hastings told me in 1995.</p> <p><strong>Rebecca</strong>: A venture capitalist with Silicon Valley habits (a very singular and strange beast around here) recently set up camp at MIT, and I tried to give him a little &quot;Toto, you're not in Kansas anymore&quot; speech. That is to say, I was trying to tell him that the habits one got from making money from Stanford students wouldn't work at MIT. It isn't that one couldn't make money investing in MIT students -- if one was patient enough, maybe one could make more, maybe a lot more. But it would only work if one understood how utterly different MIT culture is, and did something different out of an understanding of what one was buying. I didn't do a very good job talking to him, though; maybe I should try again by stepping back and talking more generally about the essential difference of MIT culture. You know, if I did that, maybe the Boston mayor's office might want to hear this too. Hmmm... you've given me an idea.</p> <p><strong>Marya</strong>: Apropos, Philip G just posted about his experience attending a conference on angel investing in Boston: <a href="http://blogs.law.harvard.edu/philg/2010/06/01/boston-angel-investors/">http://blogs.law.harvard.edu/philg/2010/06/01/boston-angel-investors/</a> He's in cranky old man mode, as usual. I imagine him shaking his cane at the conference presenters from the rocking chair on his front porch. Fun quotes: 'Asked if it wouldn’t make more sense to apply capital in rapidly developing countries such as Brazil and China, the speakers responded that being an angel was more about having fun than getting a good return on investment. (Not sure whose idea of “fun” included sitting in board meetings with frustrated entrepreneurs, but personally I would rather be flying a helicopter or going to the beach.)... 
'Nobody had thought about the question of whether Boston in fact needs more angel investors or venture capital. Nobody could point to an example of a good startup that had been unable to obtain funding. However, there were examples of startups, notably Facebook, that had moved to California because of superior access to capital and other resources out there... 'Nobody at the conference could answer a macro question: With the US private GDP shrinking, why do we need capital at all?'</p> <p><strong>Piaw</strong>: The GDP question is easily answered. Not all sectors are shrinking. For instance, Silicon Valley is growing dramatically right now. I wouldn't be able to help people negotiate 30% increases in compensation otherwise (well, more like 50% increases, depending on how you compute). The number of pre-IPO companies that are extremely profitable is also surprisingly high.</p> <p>And personally, I think that investing in places like China and Brazil is asking for trouble unless you are well attuned to the local culture, so whoever answered the question with &quot;it's fun&quot; is being an idiot.</p> <p>The fact that Facebook was asked by Accel to move to Palo Alto is definitely something Boston-area VCs should berate themselves about. But that &quot;forced move&quot; was very good for Facebook. By being in Palo Alto, they acquired Jeff Rothschild, Marc Kwiatkowski, Steve Grimm, Paul Buchheit, Sanjeev Singh, and many others who would not have moved to Boston for Facebook no matter what. It's not clear to me that staying in Boston would have been an optimal move for Facebook anyway. At least, not before things got dramatically better in Boston for startups.</p> <p><strong>Marya</strong>: &quot;The GDP question is easily answered. Not all sectors are shrinking. For instance, Silicon Valley is growing dramatically right now&quot;</p> <p>I'm guessing medical technology and biotech are still growing. What else?</p> <p>Someone pointed this out in the comments, and Philip addressed it; he argues that angel investors are unlikely to get a good return on their investment (partial quote): &quot;...we definitely need some sources of capital... But every part of the U.S. financial system, from venture capital right up through investment banks, is sized for an expanding private economy. That means it is oversized for the economy that we have. Which means that the returns to additional capital should be very small....&quot;</p> <p>He doesn't provide any supporting evidence, though.</p> <p><strong>Piaw</strong>: Social networks and social gaming are growing dramatically and fast.</p> <p><strong>Rebecca</strong>: Thanks, Marya, for pointing out Philip's blog post. I think the telling quote from it is this: &quot;What evidence is there that the Boston area has ever been a sustainable place for startups to flourish? When the skills necessary to build a computer were extremely rare, minicomputer makers were successful. As soon as the skills ... became more widespread, nearly all of the new companies started up in California, Texas, Seattle, etc. When building a functional Internet application required working at the state of the art, the Boston area was home to a lot of pioneering Internet companies, e.g., Lycos. As soon as it became possible for an average programmer to ... work effectively, Boston faded to insignificance.&quot; Philip is saying Boston can only compete when it can leverage skills that only it has. 
That's because its ability to handle business and commercialization are so comparatively terrible that when the technological skill becomes commoditized, other cities will do much better.</p> <p>But it does often get cutting-edge technical insight and skills first -- and then completely drops the ball on developing them. I find this frustrating. Now that I think about it, it seems like Boston's leaders are frustrated by this too. But I think they're making a mistake trying to remake Boston in Silicon Valley's image. If we tried to be you, at best we would be a pathetic shadow of you. We could only be successful by being ourselves, but getting better at it.</p> <p>There is a fundamental problem: the people at the cutting edge aren't interested in practical things, or they wouldn't be bothering with the cutting edge. Though it might seem strange to say now, the guy who set up the hundredth web server was quite an impractical intellectual. Who needs a web server when there are only 99 others (and no browsers yet, remember)? We were laughing at him, and he was protesting the worth of this endeavor merely out of a deep intellectual faith that this was the future, no matter how silly it seemed. Over and over I have seen the lonely obsessions of impractical intellectuals become practical in two or three years, become lucrative in five or eight, and become massive industries in seven to twelve years.</p> <p>So if the nascent idea that will become a huge industry in a dozen years shows up first in Boston, why can't we take advantage of it? The problem is that the people who hone their skill at nascent ideas that won't be truly lucrative for half a decade at least, are by definition impractical, too impractical to know how to take advantage of being first. But maybe Boston could become a winner if it could figure out how to pair these people up with practical types who could take advantage of the early warning about the shape of the future, and leverage the competitive advantage of access to skills no-one else has. It would take a very particular kind of practicality, different from the standard SV thing. Maybe I'm wrong, though; maybe the market just doesn't reward being first, especially if it means being on the bleeding edge of practicality. What do you think?</p> <p><strong>Piaw</strong>: Being 5 or 10 years ahead of your time is terrible. What you want to be is just 18 months or even 12 months ahead of your time, so you have just enough time to build product before the market explodes. My book covers this part as well. :-)</p> <p><strong>Marya</strong>: Rebecca, I don't know the Boston area well enough to form an opinion. I've been here two years, but I'm certainly not in the thick of things (if there is a &quot;thick&quot; to speak of, I haven't seen it). My guess would be that Boston doesn't have the population to be a huge center of anything, but that's a stab in the dark.</p> <p>Even so, this old survey (2004) says that Boston is #2 in biotech, close behind San Diego: <a href="http://www.forbes.com/2004/06/07/cz_kd_0607biotechclusters.html">http://www.forbes.com/2004/06/07/cz_kd_0607biotechclusters.html</a> So why is Boston so successful in biotech if the people here broadly lack an interest in business, or are &quot;impractical&quot;? 
(Here's a snippet from the article: &quot;...When the most successful San Diego biotech company, IDEC Pharmaceuticals, merged with Biogen last year to become Biogen Idec (nasdaq: BIIB - news - people ), it officially moved its headquarters to Biogen's hometown of Cambridge, Mass.&quot; Take that, San Diego!)</p> <p>When you talk about a certain type of person being &quot;impractical&quot;, I don't think that's really the issue. Such people can be very practical when it comes to pursuing their own particular kind of ambition. But their interests may not lie in the commercialization of an idea. Some extremely intelligent, highly skilled people just don't care about money and commerce, and may even despise them.</p> <p>Even with all that, I find it hard to believe that the intelligentsia of New England are so much more cerebral than their cousins in Silicon Valley. There's certainly a puritan ethic in New England, but I don't think that drives the business culture.</p> <p><strong>Rebecca</strong>: Marya, thanks for pointing out to me I wasn't being clear (I'm kind of practicing explaining something on you that I might try to say more formally later, hence the spam of your comment field. I hope you don't mind.) Your question &quot;why is Boston so successful in biotech if the people here broadly lack an interest in business?&quot; made me realize I'm not talking about people broadly -- there are plenty of business people in Boston, as everywhere. I'm talking about a particular kind of person, or even more specifically, a particular kind of relationship. Remember I contrasted the reports of Rajeev Motwani's treatment of the Google guys with the MIT CS lab's treatment of Philip? In general, I am saying that a university town like Palo Alto or Cambridge will be a magnet for ultra-ambitious young people who look for help realizing their ambitions, and a group of adults who are looking to attract such young people and enable those ambitions, and there is a characteristic relationship between them with (perhaps unspoken) terms and expectations. The idea I'm really dancing around is that these terms &amp; expectations are very different at MIT than (I've heard) they are at Stanford. Though there may not be very many people total directly involved in this relationship, it will still determine a great deal of what the city can and can't accomplish, because it is a combination of the energy of very ambitious young people and the mentorship of experienced adults that makes big things possible.</p> <p>My impression is that the most ambitious people at Stanford dream of starting the next big internet company, and if they show enough energy and talent, they will impress professors who will then open their Rolodex and tell their network of VCs &quot;this kid will make you tons of money if you support his work.&quot; The VCs who know that this professor has been right many times before will trust this judgement. 
So kids with this kind of dream go to Stanford and work to impress their professors in a particular kind of way, because it puts them on a fast track to a particular kind of success.</p> <p>The ambitious students most cultivated by professors in Boston have a different kind of dream: they might dream of cracking strong AI, or discovering the essential properties of programming languages that will enable fault-tolerant or parallel programming, or really understanding the calculus of lambda calculus, or revolutionizing personal genomics, or building the foundations of Bladerunner-style synthetic biology. If professors are sufficiently impressed with their student's energy and talent, they will open their Rolodex of program managers at DARPA (and NSF and NIH), and tell them &quot;what this kid is doing isn't practical or lucrative now, nor will it be for many years to come, but nonetheless it is critical for the future economic and military competitiveness of the US that this work is supported.&quot; The program managers who know that this professor has been right many times before will trust this judgment. In this way, the kid is put on a fast track to success -- but it is a very different kind of success than the Stanford kid was looking for, and a different kind of kid who will fight to get onto this track. The meaning of success is very different, much more intellectual and much less practical, at least in the short term.</p> <p>That's what I mean when I say &quot;Boston&quot; is less interested in business, more impractical, less entrepreneurial. It isn't that there aren't plenty of people here who have these qualities. But the &quot;ecosystem&quot; that gives ultra-ambitious young people the chance to do something singular which could be done nowhere else -- an ecosystem which it does have, but in a very different kind of way -- doesn't foster skill at commercialization or an interest in the immediate practical application of technology.</p> <p>Maybe there is nothing wrong with that: Boston's ecosystem just fosters a different kind of achievement. However, I can see it is frustrating to the mayor of Boston, because the young people whose ambitions are enabled by Boston's ecosystem may be doing work crucial to the economic and military competitiveness of the US in the long term, but they might not help the economy of Boston very much! What often happens in the &quot;long term&quot; is that the work supported by grants in Boston develops to the point it becomes practical and lucrative, and then it gets commercialized in California, Seattle, New York, etc... The program managers at DARPA who funded the work are perfectly happy with this outcome, but I can imagine that the mayor of Boston is not! The kid also might not be 100% happy with this deal, because the success which he is offered isn't much like SV success -- it's a fantastic amount of work, rather hermit-like and self-abnegating, which mostly ends up making it possible for other people far away to get very, very rich using the results of his labors. At best he sees only a minuscule slice of the wealth he enabled.</p> <p>What one might want instead is that the professors in Boston have two sections in their Rolodex. The first section has the names of all the relevant program managers at DARPA, and the professor flips to this section first. 
The second section has the names of suitable cofounders and friendly investors, and after the student has slaved away for five to seven years making a technology practical, the professor flips to the second section and sets the student up a second time to be the chief scientist or something like that at an appropriate startup.</p> <p>And it's not like this doesn't happen. It does happen. But it doesn't happen as much as it could, and I think the reason why it doesn't may be that it just takes a lot of work to maintain a really good Rolodex. These professors are busy and they just don't have enough energy to be the linchpin of a really top-quality ecosystem in two different ways at the same time.</p> <p>If the mayor of Boston is upset that Boston is economically getting the short end of the stick in this whole deal (which I think it is), a practical thing he could do is give these professors some help in beefing up the second section of their Rolodex, or perhaps try to build another network of mentors which was Rolodex-enabled in the appropriate way. If he took the latter route, he should understand that this second network shouldn't try to be a clone of the similar thing at Stanford (because at best it would only be a pale shadow) but instead be particularly tailored to incorporating the DARPA-project graduates that are unique to Boston's ecosystem. That way he could make Boston a center of entrepreneurship in a way that was uniquely its own and not merely a wannabe version of something else -- which it would inevitably do badly. That's what I meant when I said Boston should be itself better, rather than trying to be a poor pale copy of Silicon Valley.</p> <p><strong>Piaw</strong>: I like that line of thought, Rebecca. Here's the counter-example: Facebook. Facebook clearly was interested in monetizing something that was very developed, and in fact had been tried and failed many times because the timing wasn't right. Yet Facebook had to go to Palo Alto to get funding. So the business culture has to change sufficiently that the people with money are willing to risk it on very high-risk ventures like the Facebook that was around 4 years ago.</p> <p>Having invested my own money in startups, I find that it's definitely something very challenging. It takes a lot to convince yourself that this risk is worth taking, even if it's a relatively small portion of your portfolio. To get enough people to build critical mass, you have to have enough success in prior ventures to gain the kind of confidence that lets you fund Facebook where it was 4 years ago. I don't think I would have been able to fund Google or Facebook at the seed stage, and I've lived in the valley and worked at startups my entire career, so if anyone would be comfortable with risk, it should be me.</p> <p><strong>Dan</strong>: Rebecca: a side note on &quot;opening a rolodex for DARPA&quot;. It doesn't really work quite like that. It's more like &quot;hey, kid, you should go to grad school&quot; and you write letters of recommendation to get the kid into a top school. You, of course, steer the kid to a research group where you feel he or she will do awesome work, by whatever biased idea of awesomeness you have.</p> <p>My own professorial take: if one of my undergrads says &quot;I want to go to grad school&quot;, then I do as above. If he or she says &quot;I want to go work for a cool startup&quot;, then I bust out the VC contacts in my rolodex.</p> <p><strong>Rebecca</strong>: Dan: I know. 
I was oversimplifying for dramatic effect, just because qualifying it would have made my story longer, and it was already pushing the limits of the reasonable length for a comment. Of course the SV version of the story isn't that simple either.</p> <p>I have seen it happen that sufficiently brilliant undergraduates (and even high school students -- some amazing prodigies show up at MIT) can get direct support. But realize also I'm really talking about grad students -- after all, my comparison is with the relationship between the Google guys and Rajeev Motwani, which happened when they were graduate students. The exercise was to compare the opportunities they encountered with the opportunities similarly brilliant, energetic and ultra-ambitious students at MIT would have access to, and talk about how it would be similar and different. Maybe I shouldn't have called such people &quot;kids,&quot; but it simplified and shortened my story, which was pushing its length limit anyway. Thanks for the feedback; I'm testing out this story on you, and it's useful to know which ways of saying things work and which don't.</p> <p><strong>Rebecca</strong>: Piaw: I understand that investing in startups as an individual is very scary. I know some Boston angels (personally more than professionally) and I hear stories about how cautious their angel groups are. I should explain some context: the Boston city government recently announced a big initiative to support startups in Boston, and renovate some land opened up by the Big Dig next to some decaying seaport buildings to create a new Innovation District. I was thinking about what they could do to make that kind of initiative a success rather than a painful embarrassment (which it could easily become). So I was thinking about the investment priorities of city governments, more than individual investors like you.</p> <p>Cities invest in all sorts of crazy things, like Olympic stadiums, for instance, that lose money horrifyingly ... but when you remember that the city collects 6% hotel tax on every extra visitor, and benefits from extra publicity, and collects extra property tax when new people move to the city, it suddenly doesn't look so bad anymore. Boston is losing out because there is a gap in the funding of technology between when DARPA stops funding something, because it is developed to the point where it is commercializable, and when the cautious Boston angels will start funding something -- and other states step into the gap and get rich off of the product of Massachusetts' tax dollars. That can't make the local government too happy.</p> <p>Maybe the Boston city or state government has an incentive to do something to plug that hole. They might be more tolerant of losing money directly because even a modestly lucrative venture, or one very, very slow to generate big returns, which nonetheless successfully drew talent to the city, would make them money in hotel &amp; property tax, publicity etc. etc. -- or just not losing the huge investment they have already made in their universities! I briefly worked for someone who was funded by Boston Community Capital, an organization which, I think, divided its energies between developing low-income housing and funding selected startups that were deemed socially redeeming for Boston. When half your portfolio is low-income housing, you might have a different outlook on risk and return! I was hugely impressed by what great investors they were -- generous, helpful &amp; patient. 
Patience is necessary for us because the young prodigies in Boston go into fields whose time horizon is so long -- my friends are working on synthetic biology, but it will be a long, long time before you can buy a Bladerunner-style snake!</p> <p>Again, thanks for the feedback. You are helping me understand what I am not making clear.</p> <p><strong>Marya</strong>: Rebecca, you said The idea I'm really dancing around is that these terms &amp; expectations are very different at MIT than (I've heard) they are at Stanford</p> <p>I read your initial comments as being about general business conditions for startups in Boston. But now I think you're mainly talking about internet startups or at least startups that are based around work in computer science. You're saying MIT's computer science department in particular does a poor job of pointing students in an entrepreneurial direction, because they are too oriented towards academic topics.</p> <p>Both MIT and Stanford have top computer science and business school rankings. Maybe the problem is that Stanford's business school is more inclined to &quot;mine&quot; the computer science department than MIT's?</p> <p><strong>Doug</strong>: Rebecca, your description of MIT vs. Stanford sounds right to me (though I don't know Stanford well). What's interesting is that I remember UC Berkeley as being very similar to how you describe MIT: the brightest/most ambitious students at Cal ended up working on BSD or Postgres or The Gimp or Gnutella, rather than going commercial. Well, I haven't kept up with Berkeley since the mid-90s, but have there been any significant startups there since Berkeley Softworks?</p> <p><strong>Piaw</strong>: Doug: Inktomi. It was very significant for its time.</p> <p><strong>Dan</strong>: John Ousterhout built a company around Tcl. Eric Allman built a company around sendmail. Mike Stonebreaker did Ingres, but that was old news by the time the Internet boom started. Margo Seltzer built a company around Berkeley DB. None of them were Berkeley undergrads, though Seltzer was a grad student. Insik Rhee did a bunch of Internet-ish startup companies, but none of them had the visibility of something like Google or Yahoo.</p> <p><strong>Rebecca</strong>: Dan: I was thinking more about what you said about not involving undergraduates, but instead telling them to go to grad school. Sometimes MIT is in the nice sedate academic mode which steers undergrads to the appropriate research group when they are ready to work on their PhD. But sometimes it isn't. Let me tell you more about the story of the scene in the computer club concerning installation of the first web server. It was about the 100th web server anywhere, and its maintainer accosted me with an absurd chart &quot;proving&quot; the exponential growth of the web -- i.e. a graph going exponentially from 0 to 100ish, which he extrapolated forward in time to over a million -- you know the standard completely bogus argument -- except this one was exceptionally audacious in its absurdity. Yet he argued for it with such intensity and conviction, as if he was saying that this graph should convince me to drop everything and work on nothing but building the Internet, because it was the only thing that mattered!</p> <p>I fended him off with the biggest stick I could find: I was determined to get my money's worth for my education, do my psets, get good grades (I cared back then), and there is no way I would let that be hurt by this insane Internet obsession. But it continued like that. 
The Internet crowd only grew with time, and they got more insistent that they were working on the only thing that mattered and I should drop everything and join them. That I was an undergraduate did not matter a bit to anyone. Undergrads were involved, grad students were involved, everyone was involved. It wasn't just a research project; eventually so many different research projects blended together that it became a mass obsession of an entire community, a total &quot;Be Involved or Be Square&quot; kind of thing. I'd love to say that I did get involved. But I didn't; I simply sat in the office on the couch and did psets, proving theorems and solving Schrödinger's equation, and fended them off with the biggest stick I could find. I was determined to get a Real Education, to get my money's worth at MIT, you know.</p> <p>My point is that when the MIT ecosystem really does its thing, it is capable of tackling projects that are much bigger than ordinary research projects, because it can get a critical mass of research projects working together, involving enough grad students and also sucking in undergrads and everyone else, so that the community ends up with an emotional energy and cohesion that goes way, way beyond the normal energy of a grad student trying to finish a PhD.</p> <p>There's something else too, though I cannot report on this with that much certainty, because I was too young to see it all at the time. You might ask: if MIT had this kind of emotional energy focused on something in the 90's, then what is it doing in a similar way now? And the answer I'd have to say, painfully, is that it is frustrated and miserable about being an empty shell of what it once was.</p> <p>Why? Because in 2000 Bush got elected and he killed the version of DARPA with which so many professors had had such a long relationship. I didn't understand this in the 90's -- like a kid I took the things that were happening around me for granted without seeing the funding that made them possible -- but now I see that the kind of emotional energy expended by the Internet crowd at MIT in the 90's costs a lot of money, and needs an intelligent force behind it, and that scale of money and planning can only come from the military, not from NSF.</p> <p>More recently I've watched professors who clearly feel it is their birthright to be able to mobilize lots of students to do really large-scale projects, but then they try to find money for it out of NSF, and they spend all their time killing themselves writing grant proposals, never getting enough money to make themselves happy, and complaining about the cowardice of academia, and wishing they could still work with their old friends at DARPA. They aren't happy, because they are merely doing big successful research projects, and a mere research project isn't enough... when MIT is really MIT it can do more. It is an empty shell of itself when it is merely a collection of merely successful but not cohesive NSF-funded research projects. As I was saying, the Boston &quot;ecosystem&quot; has in itself the ability to do something singular, but it is singular in an entirely different way than SV's thing.</p> <p>This may seem obscure, a tale of funding woes at a distant university, but perhaps it is something you should be aware of, because maybe it affects your life. 
The reason you should care is that when MIT was fully funded and really itself, it was building the foundations of the things that are now making you rich.</p> <p>One might think of the relationship between technology and wealth like a story about potential energy: when you talk about finding a &quot;product/market fit&quot;, it's like pushing a big stone up a hill, until you get the &quot;fit&quot; at the top of the hill, and then the stone rolls down and the energy you put into it spins out and generates lots of money. In SV you focus on pushing stones up short hills -- like Piaw said, no more than 12-18 months of pushing before the &quot;fit&quot; happens.</p> <p>But MIT in its golden age could tackle much, much bigger hills -- the whole community could focus itself on ten years of nothing but pushing a really big stone up a really big hill. The potential energy that the obsessed Internet Crowd in the 90's was pushing into the system has been playing out in your life ever since. They got a really big stone over a really big hill and sent it down onto you, and then you pushed it over little bumps on the way down, and made lots of money doing it, and you thought the potential energy you were profiting from came entirely from yourselves. Some of it was, certainly, but not all. Some of it was from us. If we aren't working on pushing up another such stone, if we can't send something else over a huge hill to crash into you, then the future might not be like the past for you. Be worried.</p> <p>So you might ask, how did this story end? If I'm claiming that there was intense emotional energy being poured into developing the Internet at MIT in the 90's, why didn't those same people fan out and create the Internet industry in Boston? If we were once such winners, how did we turn into such losers? What happened to this energetic, cohesive group?</p> <p>I can tell you about this, because after years of fending off the emotional gravitational pull of this obsession, towards the end I began to relent. First I said &quot;No way!&quot; and then I said &quot;No!&quot; and then I said &quot;Maybe Later,&quot; and then I said &quot;OK, Definitely Later&quot;... and then, when I finally got around to Later, (perhaps the standard story of my life) Later turned out to be Too Late. By 2000 I was ready to join the crowd and remake myself as an Internet Person in the MIT style. So I ended up becoming seriously involved just at the time it fell apart. Because 2000ish, almost the beginning of the Internet Era for you, was the end for us.</p> <p>This weekend I was thinking of how to tell this story, and I was composing it in my head in a comic style, thinking to tell a story of myself as a &quot;Parable of the Boston Loser&quot; to talk about all my absurd mistakes as a microcosm of the difficulties of a whole city. I can pick on myself, can't I? No one will get upset at that. The short story is that in 2000ish the Internet crowd had achieved their product/market fit, DARPA popped the champagne -- you won, guys! Congratulations! Now go forth and commercialize! -- and pushed us out of the nest into the big world to tackle the standard tasks of commercializing a technology -- the tasks that you guys can do in your sleep. I was there, right in the middle of things, during that transition. 
I thought to tell you a comic story about the absurdity of my efforts in that direction, and make you laugh at me.</p> <p>But when I was trying to figure out how to explain what was making it so terribly hard for me, to my great surprise I was suddenly crying really hard. All Saturday night I was thinking about it and crying. I had repressed the memory, decided I didn't care that much -- but really it was too terrible to face. All the things you can do without thinking, for us hurt terribly. The declaration of victory, the &quot;achievement of product/market fit&quot;, the thing you long for more than anything, I -- and I think many of the people I knew -- experienced as a massive trauma. This is maybe why I've reacted so vehemently and spammed your comment field, because I have big repressed personal trauma about all this. I realized I had a much more earnest story to tell than I had previously planned.</p> <p>For instance, I was reflecting on my previous comment about what cities spend money on, and thinking that I sounded like the biggest jerk ever. Was I seriously suggesting that the city take money that they would have spent on housing for poor black babies and instead spend it on overeducated white kids with plenty of other prodigiously lucrative economic opportunities? Where do I get off suggesting something like that? If I really mean it I have a big, big burden of proof.</p> <p>So I'll try to combine my more earnest story with at least a sketch of how I'd tackle this burden of proof (and try to keep it short, to keep the spam factor to a minimum. The javascript is getting slow, so I'll cut this here and continue.)</p> <p><strong>Ruchira</strong>: Interlude (hope Rebecca continues soon!): Rebecca says &quot;that scale of money and planning can only come from the military, not from NSF.&quot; Indeed, it may be useful to check out this NY Times infographic of the federal budget: <a href="http://www.nytimes.com/interactive/2010/02/01/us/budget.html">http://www.nytimes.com/interactive/2010/02/01/us/budget.html</a></p> <p>I'll cite below some of the 2011 figures from this graphic that were proposed at that time; although these may have changed, the relative magnitudes of one sector versus another are not very different. I've mostly listed sectors in decreasing order of budget size for research, except I listed &quot;General science &amp; technology&quot; sector (which includes NSF) before &quot;Health&quot; sector (which includes NIH) since Rebecca had contrasted the military with NSF.</p> <p>The &quot;Research, development, test, and evaluation&quot; segment of the &quot;National Defense&quot; sector is $76.77B. I guess DARPA, ONR, etc. fit there.</p> <p>The &quot;General science &amp; technology&quot; sector is down near the lower right. The &quot;National Science Foundation programs&quot; segment gets $7.36B. There's also another $0.1B for &quot;National Science Foundation and other&quot;. The &quot;Science, exploration, and NASA supporting activities&quot; segment gets $12.78B. (I don't know to what extent satellite technology that is relevant to the national defense is also involved here, or in the $4.89B &quot;Space operations&quot; segment, or in the $0.18B &quot;NASA Inspector General, education, and other&quot; segment.) The &quot;Department of Energy science programs&quot; segment gets $5.12B. 
The &quot;Department of Homeland Security science and technology programs&quot; segment gets $1.02B.</p> <p>In the &quot;Health&quot; sector, the &quot;National Institutes of Health&quot; segment gets $32.09B. The &quot;Disease control, research, and training&quot; segment gets $6.13B (presumably this includes the CDC). There's also &quot;Other health research and training&quot; at $0.14B and &quot;Diabetes research and other&quot; at $0.095B.</p> <p>In the &quot;Natural resources and environment&quot; sector, the &quot;National Oceanic and Atmospheric Administration&quot; gets $5.66B. &quot;Regulatory, enforcement, and research programs&quot; gets $3.86B (is this the entire EPA?).</p> <p>In the &quot;Community and regional development&quot; sector, the &quot;National Infrastructure Innovation and Finance fund&quot; (new this year) gets $4B.</p> <p>In the &quot;Agriculture&quot; sector, which presumably includes USDA-funded research, &quot;Research and education programs&quot; gets $1.97B, &quot;Research and statistical analysis&quot; gets $0.25B, and &quot;Integrated research, education, and extension programs&quot; gets $0.025B.</p> <p>In the &quot;Transportation&quot; sector, &quot;Aeronautical research and technology&quot; gets $1.15B, which by the way would be a large (130%) relative increase. (Didn't MIT find a way of increasing jet fuel efficiency by 75% recently?)</p> <p>In the &quot;Commerce and housing credit&quot; sector, &quot;Science and technology&quot; gets $0.94B. I find this rather mysterious.</p> <p>In the &quot;Education, training, employment&quot; sector, &quot;Research and general education aids: Other&quot; gets $1.14B. The &quot;Institute for Education Sciences&quot; gets $0.74B.</p> <p>In the &quot;Energy&quot; sector, &quot;Nuclear energy R&amp;D&quot; gets $0.82B and &quot;Research and development&quot; gets $0.024B (presumably this is the portion outside the DoE).</p> <p>In the &quot;Veterans' benefits and services&quot; sector, &quot;Medical and prosthetic research&quot; gets $0.59B.</p> <p>In the &quot;Income Security&quot; sector there's a tiny segment, &quot;Children's research and technical assistance&quot;, at $0.052B. Not sure what that means.</p> <p><strong>Rebecca</strong>: I'll start with a non-sequitur which I hope to use to get at the heart of the difference between MIT and Stanford: recently I was at a Marine publicity event and I asked the recruiter what differentiates the Army from the Marines: since they both train soldiers to fight, why don't they do it together? He answered vehemently that they must be separate because of one simple attribute in which they are utterly opposed: how they think about the effect they want to have on the life their recruits have after they retire from the service. He characterized the Army as an organization which had two goals: first, to train good soldiers, and second, to give them skills that would get them a good start in the life they would have after they left. If you want to be a Senator, you might get your start in the Army, get connections, get job skills, have &quot;honorable service&quot; on your resume, and generally use it to start your climb up the ladder. The Army aspires to create a legacy of winners who began their career in the Army.</p> <p>By contrast the Marines, he said, have only one goal: they want to create the very best soldiers, the elite, the soldiers they can trust in the most difficult and dangerous situations to keep the Army guys behind them alive.
This elite training, he said, comes with a price. The price you pay is that the training you get does not prepare you for anything at all in the civilian world. You can be the best of the best in the Marines, and then come home and discover that you have no salable civilian job skills, that you are nearly unemployable, that you have to start all over again at the bottom of the ladder. And starting over is a lot harder than starting the first time. It can be a huge trauma. It is legendary that Marines do not come back to civilian life and turn into winners: instead they often self-destruct -- the &quot;transition to civilian life&quot; can be violently hard for them.</p> <p>He said this calmly and without apology. Did I say he was a recruiter? He said vehemently: &quot;I will not try to recruit you! I want you to understand everything about how painful a price you will pay to be a Marine. I will tell you straight out it probably isn't for you! The only reason you could possibly want it is because you want more than anything to be a soldier, and not just to be a soldier, but to be in the elite, the best of the best.&quot; He was saying: we don't help our alumni get started, we set them up to self-destruct, and we will not apologize for it -- it is merely the price you pay for training the elite!</p> <p>This story gets to the heart of what I am trying to say is the essential difference between Stanford and MIT. Stanford is like the Army: for its best students, it has two goals -- to make them engineers, and to make them winners after they leave. And MIT is like the Marines: it has only one goal -- to make its very best students into the engineering elite, the people about whom they can truthfully tell program managers at DARPA: you can utterly trust these engineers with the future of America's economic and military competitiveness. There is a strange property to the training you get to enter into that elite, much like the strange property the non-recruiter attributed to the training of the Marines: even though it is extremely rigorous training, once you leave you can find yourself utterly without any salable skills whatever.</p> <p>The skills you need to acquire to build the infrastructure ten years ahead of the market's demand for it may have zero intersection with the skills in demand in the commercial world. Not only are you not prepared to be a winner, you may not even be prepared to be basically employable. You leave and start again at the bottom. Worse than the bottom: you may have been trained with habits commercial entities find objectionable (like a visceral unwillingness to push pointers quickly, or a regrettable tendency to fight with the boss before the interview process is even over.) This can be fantastically traumatic. Much as ex-Marines suffer a difficult &quot;transition to civilian life,&quot; the chosen children of MIT suffer a traumatic &quot;transition to commercial life.&quot; And the leaders at MIT do not apologize for this: as the Marine said, it is just the price you pay for training the elite.</p> <p>These are the general grounds I might use to appeal to the city officials in Boston.
There's more to explain, but the shape of the idea would be roughly this: much as cities often pay for programs to help ex-Marines transition to civilian life, on the principle that they represent valuable human capital that ought not to be allowed to self-destruct, it might pay off for the city to understand the peculiar predicament of graduates of MIT's intense DARPA projects, and provide them with help with the &quot;transition to commercial life.&quot; There's something in it for them! Even though people who know nothing but how to think about the infrastructure of the next decade aren't generically commercially valuable, if you put them in proximity to normal business people, their perspective would rub off in a useful way. That's the way that Boston could have catalyzed an Internet industry of its own -- not by expecting MIT students to commercialize their work, which (with the possible exception of Philip) they were constitutionally incapable of, but by giving people who wanted to commercialize something but didn't know what a chance to learn from the accumulated (nearly ten years!) experience and expertise of the Internet Crowd.</p> <p>On that note, I wanted to say -- funny you should mention Facebook. You think of Mark Zuckerberg as the social networking visionary in Boston, and Boston could have won if they had paid to keep him. I think that strange -- Zuckerberg is fundamentally one of you, not one of us. It was right he should leave. But I'll ask you a question you've probably never thought about. Suppose the Internet had not broken into the public consciousness at the time it did; suppose the world had willfully ignored it for a few more years, so the transition from a DARPA-funded research project to a commercial proposition would have happened a few years later. There was an Internet Crowd at MIT constantly asking DARPA to let them build the &quot;next thing,&quot; where &quot;next&quot; is defined as &quot;what the market will discover it wants ten years from now.&quot; So if this crowd had gotten a few more years of government support, what would they have built?</p> <p>I'm pretty sure it would have been a social networking infrastructure, not like Facebook, really, but more like the Diaspora proposal. I'm not sure, but I remember in '98/'99 that's what all the emotional energy was pointing toward. It wasn't technically possible to build yet, but the instant it was that's what people wanted. I think it strange that everyone is talking about social networking and how it should be designed now; it feels to me like deja vu all over again, an echo from a decade ago. If the city or state had picked up these people after DARPA dropped them, and given them just a little more time, a bit more government support -- say by a Mass ARPA -- they could have made Boston the home, not of the big social networking company, but of the open social networking infrastructure and all the expertise and little industries such a thing would have thrown off. And it would have started years and years ago! That's how Boston could have become a leader by being itself better, rather than trying to be you badly.</p> <p><strong>Dan</strong>: I think you're perhaps overstating the impact of DARPA. DARPA, by and large, funds two kinds of university activities. First, it funds professors, which pays for post-docs, grad students, and sometimes full-time research staff.
Second, DARPA also funds groups that have relatively little to do with academia, such as the BSD effort at Berkeley (although I don't know for a fact that they had DARPA money, they didn't do &quot;publish or perish&quot; academic research; they produced Berkeley Unix).</p> <p>Undergrads at a place like MIT got an impressive immersion in computer science, with a rigor and verve that wasn't available in most other places (although Berkeley basically cloned 6.001, and others did as well). They call it &quot;drinking from a firehose&quot; for a reason. MIT, Berkeley, and other big schools of the late 80's and early 90's had more CS students than they knew what to do with, so they cranked up the difficulty of the major and produced very strong students, while others left for easier pursuits.</p> <p>The key inflection point is how the popular culture at the university, and the faculty, treat their &quot;rock star&quot; students. What are the expectations? At MIT, it's that you go to grad school, get a PhD, become a researcher. At Stanford, it's that you run off and get rich.</p> <p>The decline in DARPA funding (or, more precisely, the micromanagement and short-term thinking) in recent years can perhaps be attributed to the leadership of Tony Tether. He's now gone, and the &quot;new DARPA&quot; is very much planning to come back in a big way. We'll see how it goes.</p> <p>One last point: I don't buy the Army vs. Marines analogy. MIT and Stanford train students similarly, in terms of their preparation to go out and make money, and large numbers of MIT people are quite successfully out there making money. MIT had no lack of companies spinning out of research there, notably including Akamai. The differences we're talking about here are not night vs. day, they're not Army vs. Marines. They're more subtle but still significant.</p> <p><strong>Rebecca</strong>: Yes, I've been hearing about the &quot;unTethered Darpa.&quot; I should have mentioned that, but left it out to stay (vaguely) short. And yes, I am overstating in order to make a simple statement of what I might be asking for, couched in terms a city or state government official might be able to relate to. Maybe that's irresponsible; that's why I'm testing it on you first, to give you a chance to yell at me and tell me if you think that's so.</p> <p>They are casting about for a narrative of why Boston ceded its role as a leader of the Internet industry to SV, one that would point them to something to do about it. So I was talking specifically about the sense in which Boston was once a leader in internet technology and the weaknesses that might have caused it to lose its lead. Paul Graham says that Boston's weakness in developing industries is that it is &quot;too good&quot; at other things, so I wanted to tell a dramatized story specifically about what the other things were and why that would lead to fatal weakness -- how being &quot;too strong&quot; in a particular way can also make you weak.</p> <p>I certainly am overstating, but perhaps I am because I am trying to exert force against another predilection I find pernicious: the tendency to be eternally vague about the internal emotional logic that makes things happen in the world.
If people build a competent, cohesive, energetic community, and then it suddenly fizzles, fails to achieve its potential, and disbands, it might be important to know what weakness caused this surprising outcome so you know how to ask for the help that would keep it from happening the next time.</p> <p>And to tell the truth, I'm not sure I entirely trust your objection. I've wondered why so often I hear such weak, vague narratives about the internal emotional logic that causes things to happen in the world. Vague narratives make you helpless to solve problems! I don't cling to the right to overstate things, but I do cling to the right to sleuth out the emotional logic of cause and effect that drives the world around me. I feel sometimes that I am fighting some force that wants to thwart me in that goal -- and I suspect that that force sometimes originates, not always in rationality, but in a male tendency to not want to admit to weakness just for the sake of &quot;seeming strong.&quot; A facade of strength can exact a high price in the currency of the real competence of the world, since often the most important action that actually makes the world better is the action of asking for help. I was really impressed with that Marine for being willing to admit to the price he paid, to the trauma he faced. That guy didn't need to fake strength! So maybe I am holding out the image of him as an example. We have government officials who are actively going out of their way to offer to help us; we have a community that accomplishes many of its greatest achievements because of government support; we shouldn't squander an opportunity to ask for what might help us. And this narrative might be wrong; that's why I'm testing it first. I'm open to criticism. But I don't want to pass by an opportunity, an opening to ask for help from someone who is offering it, merely because I'm too timid to say anything for fear of overstatement.</p> <p><strong>Dan</strong>: Certainly, Boston's biggest strength is the huge number of universities in and around the area. Nowhere else in the country comes close. And, unsurprisingly, there are a large number of high-tech companies in and around Boston. Another MIT spin-out I forgot to mention above is iRobot, the Roomba people, which also does a variety of military robots.</p> <p>To the extent that Boston &quot;lost&quot; the Internet revolution to Silicon Valley, consider the founding of Netscape. A few guys from Illinois and one from Kansas. They could well have gone anywhere. (Simplifying the story, but) they hooked up with an angel investor (Jim Clark) and he dragged them out to the valley where they promptly hired a bunch of ex-SGI talent and hit the ground running. Could they have gone to Boston? Sure. But they didn't.</p> <p>What seems to be happening is that different cities are developing their own specialties and that's where people go. Dallas, for example, has carved out a niche in telecom, and all the big players (Nortel, Alcatel, Cisco, etc.) do telecom work there. In Houston, needless to say, it's all about oilfield engineering. It's not that there's any particular Houston tax advantage or city/state funding that brings these companies here. Rather, the whole industry (or, at least the white collar part of it) is in Houston, and many of the big refineries are close nearby (but far enough away that you don't smell them).</p> <p>Greater Boston, historically, was where the minicomputer companies were, notably DEC and Data General.
Their whole world got nuked by workstations and PCs. DEC is now a vanishing part of HP and DG is now a vanishing part of EMC. The question is what sort of thing the greater Boston area will become a magnet for, in the future, and how you can use whatever leverage you've got to help make it happen. Certainly, there's no lack of smart talent graduating from Boston-area universities. The question is whether you can incentivize them to stay put.</p> <p>I'd suggest that you could make headway by getting cheap office space in and around Cambridge (an &quot;incubator&quot;) plus building a local pot of VC money. I don't think you can decide, in advance, what you want the city's specialty to be. You pretty much just have to hope that it evolves organically. And, once you see a trend emerging, you might want to take financial steps to reinforce it.</p> <p><strong>Thomas</strong>: BBN (which does DARPA funded research) has long been considered a halfway house between MIT and the real world.</p> <p><strong>Piaw</strong>: It looks like there's another conversation about this thread over at Hacker News: <a href="http://news.ycombinator.com/item?id=1416348">http://news.ycombinator.com/item?id=1416348</a> I love conversation fragmentation.</p> <p><strong>Doug</strong>: Conversation fragmentation can be annoying, but do you really want all those Hacker News readers posting on this thread?</p> <p><strong>Piaw</strong>: Why not? Then I don't have to track things in two places.</p> <p><strong>Ruchira</strong>: hga over at Hacker News says: &quot;Self-selection by applicants is so strong (MIT survived for a dozen years without a professional as the Director), whatever gloss the Office is now putting on the Institute, it's able to change things only so much. E.g. MIT remains a place where you don't graduate without taking (or placing out of) a year of the calculus and classical physics (taught at MIT speed), for all majors.&quot;</p> <p>Well, the requirements for all majors at Caltech are: two years of calculus, two years of physics (including quantum physics), a year of chemistry, and a year of biology (the biology requirement was added after I went there); freshman chemistry lab and another introductory lab; and a total of four years of humanities and social sciences classes. The main incubator I know of near Caltech is the Idealab. Certainly JPL (the Jet Propulsion Laboratory) as well as Hollywood CGI and animation have drawn from the ranks of Caltech grads. The size of the Caltech freshman class is also much smaller than those at Stanford or MIT.</p> <p>I don't know enough to gauge the relative success of Caltech grads at transitioning to local industry, versus Stanford or MIT; does anyone else?</p> <p><strong>Rebecca</strong>: The comments are teaching me what I didn't make clear, and this is one of the worst ones. When I talked about the &quot;transition to the commercial world&quot; I didn't mainly mean grads transitioning to industry. I was thinking more about the transition that a project goes through when it achieves product/market fit.</p> <p>This might not be something that you think of as such a big deal, because when companies embark on projects, they usually start with a fairly specific plan of the market they mean to tackle and what they mean to do if and when the market does adopt their product. There is no difficult transition because they were planning for it all along. After all, that's the whole point of a company! But a ten year research project has no such plan.
The web server enthusiast did not know when the market would adopt his &quot;product&quot; -- remember, browsers were still primitive then -- nor did he really know what it would look like when they did. Some projects are even longer term than that: a programming language professor said that the expected time from the conception of a new programming language idea to its widespread adoption is thirty years. That's a good chunk of a lifetime.</p> <p>When you've spent a good bit of your life involved with something as a research project that no-one besides your small crowd cares about, when people do notice, when commercial opportunities show up, when money starts pouring out of the sky, it's a huge shock! You haven't planned for it at all. Have you heard Philip's story of how he got his first contract for what became ArsDigita? I couldn't find the story exactly, but it was something like this: he had posted some of the code for his forum software online, and HP called him up and asked him to install and configure it for them. He said &quot;No! I'm busy! Go away!&quot; They said &quot;we'll pay you $100,000.&quot; He's in shock: “You'll give me $100,000 for 2 weeks of work?”</p> <p>He wasn't exactly planning for money to start raining down out of the sky. When he started doing internet applications, he said, people had told him he was crazy, there was no future in it. I remember when I first started seeing URL's in ads on the side of buses, and I was just bowled over -- all the time my friends had been doing web stuff, I had never really believed any of it would ever be adopted. URL's are just so geeky, after all! I mean, seriously, if some wild-eyed nerd told you that in five years people would print &quot;http://&quot; on the side of a bus, what would you think? I paid attention to what they were doing because they thought it was cool, I thought it was cool, and the fact that I had no real faith anyone else ever would made no difference. So when the world actually did, we entered a new world that none of us were prepared for, that nobody had planned for, that we had not given any thought to developing skills to be able to deal with. I guess this is a little hard to convey, because it wouldn't happen in a company. You wouldn't ever do something just because you thought it was cool, without any faith that anyone would ever agree with you, and then get completely caught by surprise, completely bowled over, when the rest of the world goes crazy about what you thought was your esoteric geeky obsession.</p> <p><strong>Piaw</strong>: I think we were all bowled over by how quickly people started exchanging e-mail addresses, and then web-sites, etc. I was stunned. But it took a really long time for real profits to show up! It took 20 or so search engine companies to start up and fail before someone succeeded!</p> <p><strong>Rebecca</strong>: Of course; you are bringing up what was in fact the big problem. The question was: in what mode is it reasonable to ask the local government for help? And if you are in the situation where $100,000 checks are raining on you out of the sky without you seeming to make the slightest effort to even solicit them, then it seems like only the biggest jerk on the planet would claim to the government that they were Needy and Deserving. Black babies without roofs over their heads are needy and deserving; rich white obnoxious nerds with money raining down on them are not.
But remember, though Philip doesn't seem to be expending much effort in his story, he also said in the late 90's that he had been building web apps for ten years. Who else on the planet in 1999 could show someone a ten year long resume of web app development?</p> <p>As Piaw said, it isn't like picking up the potential wealth really was just a matter of holding out your hand as money rained from the sky. Quite the contrary. It wasn't easy; in fact it was singularly difficult. Sure, Philip talked like it was easy, until you think about how hard it would have been to amass the resume he had in 1999.</p> <p>When the local government talks about how it wants to attract innovators to Boston, to turn the city into a Hub of Innovation, my knee-jerk reaction is -- and what are we, chopped liver? But then I realize that when they say they want to attract innovators, what they really mean is not that they want innovators, but that they want people who can innovate for a reasonable, manageable amount of time, preferably short, and then turn around, quick as quicksilver, and scoop up all the return on investment in that innovation before anyone else can get at it -- and give a big cut in taxes to the city and state! Those are the kind of innovators who are attractive! Those are the kind who properly make Boston the kind of Hub of Innovation the Mayor of Boston wants it to be. Innovators like those in Tech Square or Stata, not so much. We definitely qualify for the Chopped Liver department.</p> <p>And this hurts. It hurts to think that the Mayor of Boston might be treating us with more respect now if we had been better in ~2000 at turning around, quick as quicksilver, and remaking ourselves into people who could scoop up all, or some, or even a tiny fraction of the return on investment of the innovation at which we were then, in a technical sense, well ahead of anyone else. But remaking yourself is not easy! Especially when you realize that the state from which we were remaking ourselves was sort of like the Marines -- a somewhat ascetic state, one that gave you the nerd equivalent of military rations, a tent, maybe a shower every two weeks, and no training in any immediately salable skills whatsoever -- but one that also gave you a community, an identity, a purpose, a sense of who you were that you never expected to change. But all of a sudden we &quot;won,&quot; and all of a sudden there was a tremendous pressure to change. It was like being thrown in the deep end of the pool without swim lessons, and yes we sank, we sank like a stone with barely a dog paddle before making a beeline for the bottom. So we get no respect now. But was this a reasonable thing to expect? What does the mayor of Boston really want? Yes, the sense in which Boston is a Hub of Innovation (for it already is one, it is silly for it to try to become what it already is!) is problematic and not exactly what a Mayor would wish for. I understand his frustration. But I think he would do better to work with his city for what it is, in all its problematic incompetence and glory, than to try to remake it in the image of something else it is not.</p> <p><strong>Rebecca</strong>: On the subject of Problematic Innovators, I was thinking back to the scene in the computer lab where everyone agreed that hoarding domain names was the dumbest idea they had ever heard of. I'm arguing that scooping up return on the investment in innovation was hard, but registering a domain name is the easiest thing in the world.
I think they were free back then, even. If I remember right, they started out free, and then Procter &amp; Gamble registered en masse every name that had even the vaguest etymological relation with the idea of &quot;soap,&quot; at which point the administrators of the system said &quot;Oops!&quot; and instituted registration fees to discourage that kind of behavior -- which, of course, would have done little to deter P&amp;G. They really do want to utterly own the concept of soap. (I find it amusing that P&amp;G was the first at bat in the domain name scramble -- they are not exactly the world's image of a cutting-edge tech-savvy company -- but when it comes to the problem of marketing soap, they quietly dominate.)</p> <p>How can I explain that we were not able to expend even the utterly minimal effort in capturing the return on investment in innovation of registering a free domain name, so as to keep the resulting tax revenues in Massachusetts?</p> <p>Thinking back on it, I don't think it was either incapacity, or lack of foresight, or a will to fail in our duty as Boston and Massachusetts taxpayers. It was something else: it was almost a &quot;semper fidelis&quot;-like group spirit that made it seem dishonorable to hoard a domain name that someone else might want, just to profit from it later. Now one might ask, why should you refrain from hoarding it sooner just so that someone else could grab it and hoard it later? That kind of honor doesn't accomplish anything for anyone!</p> <p>But you have to realize, this was right at the beginning, when the domain name system was brand new and it wasn't at all clear it would be adopted. These were the people who were trying to convince the world to accept this system they had designed and whose adoption they fervently desired. In that situation, honor did make a difference. It wouldn't look good to ask the world to accept a naming system with all the good names already taken. You actually noticed back then when something (like &quot;soap&quot;) got taken -- the question wasn't what was available, the question was what was taken, and by whom. You'd think it wouldn't hurt too much to take one cool name: recently I heard that someone got a $38 million offer for &quot;cool.com.&quot; That's a lot of money! -- would it have hurt that much to offer the world a system with all the names available except, you know, one cool one? But there was a group spirit that was quite worried that once you started down that slope, who knew where it would lead?</p> <p>There were other aspects of infrastructure, deeper down, harder to talk about, where this group ethos was even more critical. You can game an infrastructure to make it easier to personally profit from it -- but it hurts the infrastructure itself to do that. So there was a vehement group discipline that maintained a will to fight any such urge to diminish the value of the infrastructure for individual profit.</p> <p>This partly explains why we were not able, when the time came, to turn around, quick as quicksilver, and scoop up the big profits. To do that would have meant changing, not only what we were good at, but what we thought was right.</p> <p>When I think back, I wonder: why weren't people more scared? When we chose not to register &quot;cool.com&quot; or similar names, why didn't we think, life is hard, the future is uncertain, and money does really make a difference in what you can do?
I think this group ethic was only possible because there was a certain confidence -- the group felt itself party to a deal: in return for being who we are, the government would take care of us, forever. Not until the time when the product achieved sufficient product/market fit that it became appropriate to expect return on investment. Forever.</p> <p>This story might give a different perspective on why it hurts when the Mayor of Boston announces that he wants to make the city a Hub of Innovation. The innovators he already has are chopped liver? Well, it's understandable that he isn't too pleased with the innovators in this story, because they aren't exactly a tax base. But that is the diametric opposite of the deal we thought we had with the government.</p> Work-life balance at Bioware bioware/ Sat, 31 May 2008 00:00:00 +0000 bioware/ <p><i>This is an archive of some posts in a forum thread titled &quot;Beware of Bioware&quot; in a now defunct forum, with comments from that forum as well as blog comments from a now defunct blog that made the first attempt to archive this content. The original posts were deleted shortly after being posted, replaced with &quot;big scary company vs. li'l ol' me.&quot;</i></p> <p><i>Although the comments below seem to be about Bioware's main studio in Edmonton, I knew someone at Bioware Austin during the time period under discussion, which is how I learned about the term &quot;sympathy crunch&quot;, where you're required to be at the office because other teams are &quot;crunching&quot;. I'd never heard of this concept before, so I looked it up and found the following thread around 2008 or so.</i></p> <p><i>Searching for &quot;sympathy crunch&quot; today, in 2024, doesn't return many hits. One of the few hits is a 2011 blog post by a former director at BioWare titled &quot;Loving the Crunch&quot;, which has this to say about sympathy crunches:</i></p> <blockquote> <p><i>If you find yourself working sympathy crunch, in that even though you have no bugs of your own to take care of, don’t be pissed off about it. Play test the game you made! And enjoy it for what it is, something you contributed to. And if that’s not enough to make you happy then be satisfied that every bug you send to one of your co-workers will make them more miserable. (Though do try and be constructive.)</i></p> </blockquote> <p><i>Another one of the few hits is a 2013 recruiting marketing blog post on the Bioware blog, where a developer notes that &quot;We are clearly moving away from the concept of 'sympathy crunch'&quot;. In the 2008 thread below, there's a heated debate over whether the promise by leadership that sympathy crunch had been abolished was kept or not. Even if you ignore the comments from the person I knew at Bioware, these later comments from Bioware employees, especially the recruiting marketing post in 2013, seem to indicate that sympathy crunch was not abolished until well after 2008.</i></p> <hr> <p><i>Milez5858</i></p> <p>EA gets a lot of grief for their employment history.</p> <p>For anyone considering work at Bioware, beware of them as well.</p> <p>They use seriously cult-like tactics to keep their employees toeing the company line, but don't be fooled. The second you don't toe that line you'll be walked out the doors.</p> <p>They love out of country workers because they don't understand the Canadian labour laws. They continually fire people without warning. This is illegal in Canada.
You must warn people of performance problems and give them a certain number of warnings before you are allowed to fire someone.</p> <p>They are smarter about it than EA by offering food and free ice cream and other on site amenities, but it all adds up to a lot of extra hours with no extra pay. <hr> <i>Milez5858</i></p> <p>BTW you could say my post is somewhat personal, but I've not worked for Bioware. I have several friends that do. Three have been walked out the door in the last year for refusing to work more than the 40 hours they are paid for and wanting to spend time with their wives and kids.</p> <p>The friends I have remaining there are from out of country and feel as though they are somewhat trapped. They are unhappy but will only admit it behind very closed doors because they have it in their head they will get black listed or something.</p> <p>I'm an outside observer. I get paid more than these poor kids, and I only work about 35 hours a week. I've always put MY life ahead of my employers, but I work very hard and am dedicated to my job. When I see what in my opinion amounts to cult-like mind control over these young men who are enamoured by the legend of Ray and Greg, the founders of Bioware, I'm almost sickened.</p> <p>I used to think it would be cool to work in the gaming industry, and now I'm just happy as hell I'm not in it.</p> <p>I will certainly be pointing them to this site as a resource and hope that, like any other entertainment industry, they get organized in some fashion. It's absolutely dehumanizing what this industry does to people. <hr> <i>Arty</i></p> <p>Hell is happy?</p> <p>And, didn't EA buy BioWare?</p> <p>Just asking, not complaining. Welcome aboard!</p> <hr> <p><i>Anguirel</i></p> <p>Yes, Bioware and Pandemic were bought by EA. However, it sounds like this started well before that acquisition. It actually sounds more like what amounts to a Cult of Personality (such as you see at Blizzard, Maxis, id, or Ion Storm) where a single person or a few people have such a huge reputation that they can get a more-or-less unlimited supply of reasonably good new hires -- and thus, someone (possibly someone in between said person and the average worker) takes advantage and pushes the employees much harder than they should.</p> <p>In some cases the cults that build up around certain individuals insulate them from bad working conditions (I've heard Maxis enjoyed that happy fate while Will Wright was there, keeping the people inside in much better conditions than the remainder of EA), but in many cases it results in an attitude of &quot;if you don't like it, we can get any number of people to replace you.&quot; Which is what EA certainly had for a long time, and several other studios still have, though the people who work at them tend to be quieter about it (and I don't think any other studios ever got to EA's legendary status of continuous crunch). <hr> <i>Milez5858</i></p> <p>Yes, these things were happening well ahead of the EA purchase of Bioware.</p> <p>You are exactly accurate about the Cult of Personality thing. People think that Ray and Greg are looking out for them. Bioware Pandemic just sold for nearly a billion dollars. People are marked for assassination for far smaller stakes. Can anyone really believe that the owners are in there saying... &quot;ya know.. I can't accept your billion dollars unless I know the employees are really well taken care of&quot;. It defies logic.</p> <p>I suspect it will just get worse.
Now people that have been forced to leave the company are also being told that they MUST sell their stocks back to the company.</p> <p>I don't know enough about it to say if it's legal or not, but it sure sounds fishy to me. Again I've recommended to people that they check with a lawyer first, but nobody wants any more hassle than they already have. Unfortunately it's this attitude that keeps the gaming industry from getting organized.</p> <p>I would love to be able to say this sort of intimidation is anomalous, but that would just be untrue -- it plagues the industry. Maybe we should drive this ubiquitous arrogance out of the development process; I'm crazy enough to believe this can happen without the U-word. <hr> <i>Anonymous</i></p> <p>The truth is life at Bioware is both as bad and not as bad as implied in this article. But mostly not as bad.</p> <p>The original article states that Bioware uses cult tactics. I dont know what cult tactics are so I cant give a simple answer but I know that Bioware uses traditional tactics to make for good morale. They give you the free breakfast. They bring you dinner if you are working overtime. They give out Oilers tickets or tickets to theatre or gokarts and the like. They let you wander away from your desk for an hour to go to the lunch room and take a nap on the sofa or play video games. Or they let you leave to run errands or go out for an hour for coffee. They also have company meetings where they highlight the development of the various projects ongoing. On this last one some think its a great way to keep up on other projects while some think its just a way to keep employees excited about everything. Is any of this a cult tactic?</p> <p>Bioware is notoriously loyal. And the managers are notoriously wimpy and avoid confrontation. There isnt a way to emphasise this enough. In fact people joke that they havent done good work in months and yet they receive a strong review and a raise or more stock options. Getting fired as a fulltime employee from Bioware is such a shockingly rare event that there have been company meetings or emails sent out to explain the firing. Employees are given so many chances to get it right when they do get a bad review. This is all different for contract employees. A contract employee who sucks wont be fired but they will have their contract not renewed. A good contract employee is always offered an opening for fulltime (since contract employees do get overtime and making them fulltime saves money) or if there is none has their contract renewed.</p> <p>I dont know much about HR so I dont know about the hiring tactics but I can agree that a lot of employees come from around the world or outside of the Edmonton area. It was always assumed that they couldnt find good talent around Edmonton. I cant comment much more than that because I dont know. I know there are lots of recruiting drives all over.</p> <p>Hours are definitely a problem and I can agree wholeheartedly:</p> <p>Some think it gets better each project cycle but its just transposed. In the early days the staff might work 30 to 40 hour weeks for a year or two and then come to the end and realize there was too much to be done. They spend the last few months working at least double the hours. In later projects things were more controlled by project managers and producers and it turned into what some call death crunch or death marches. Employees start working 50 hour weeks a year or two before ship.
It isnt much extra hours and no one complains much but there are problems if an employee wants to make plans and never knows if she has to work or not. Managers are good about making sure employees can take time off and get lots of extra time off as compensation for the work but a lot is still asked of employees.</p> <p>When the time comes closer to release the employees have their hours scaled up more and more. It might start as 9-9 on Tuesday and Thursday, and then become 9-9 Monday until Thursday. Then it's Saturday 10-4. Then it's 9-9 Friday nights too. Then it's 9-5 Saturday. And in desperate times they even say maybe 12-4 on Sunday.</p> <p>Morale gets low when employees think the game is awful and they cant get it done right. Thats when management tells the staff that they have decided to not ship the game until it is done and that they are extending the release date. This is good for the title but still hurts morale when employees think of having to work so much longer when they were working so hard to make a date. For example to use the most recent title Mass Effect employees were told the game would ship in Christmas of 2006. Then it was pushed to February of 2007 and then later spring and then June. But we know the game didnt get shipped until November. That does wear on employees.</p> <p>Each time it gets pushed back the management cuts hours for a few weeks or gives out breaks of a few extra days off or a four-day weekend to recharge employees. Plus people say &quot;in the old days we worked 90 hours or 100 or more. Now its only 50 or 60 or maybe in rare times sometimes 70 so this is great.&quot; which is meant to make you remember that you are making a game and should be happy to hang out at Bioware for only 50 or 60 hours a week to make “the best games ever”.</p> <p>After a game is shipped people take long breaks, weeks at a time. Then they slowly ease back into it. They get maybe 20 or 30 hours of work assigned per week. This can go on for months before employees return to regular 40 hour weeks. And that happens for many months or a year before the project pushes for release and the cycle starts again.</p> <p>This is not an indictment of Bioware. All companies in video gaming do this. It's unfair to point at Bioware as an exception, except as a good one to me.</p> <p>People leave all the time for these reasons. More have left in the last year. Maybe two years. Maybe three. Im not sure. But Bioware hires so many people and the ones who leave are generally not as good as the ones who are hired so it works out. Many people in Bioware wish more people would leave. Maybe with less dead weight, since no one ever gets fired, there could be a stronger staff who works more efficiently.</p> <p>When EA Spouse was out everyone at Bioware got curious about it. But then the reports came out about what EA Spouse's husband was working on and the situations he was put into, and everyone at Bioware realised that they had nothing anywhere close to as bad as that. Bioware employees enjoy comfort and support by managers and long projects but not 100 hour weeks. And some employees do manage to hold on to 9-5 for very long stretches if they can get their work done to the highest standards.</p> <p>But still people hoped that the industry in general would improve. Many people at Bioware would be thrilled to have nothing else change other than to never have to work more than 40 or 45 hours.
But also many people at Bioware are workaholics who would never work less than 50 or 60 and they always create a dangerous precedent and control the pace unfairly.</p> <p>Everyone at Bioware is aware that EA owns them now but no one gives it much thought. Nothing has changed. Life is the same in every way except for more money. Its easy to forget that Bioware was ever bought by EA because the culture is unchanged and no one from EA is coming in and yelling about how things have to be changed. Thats kind of amazing considering Bioware doesnt make really big selling titles and probably deserves someone like EA coming in and saying this is how to do it. Bioware titles eventually after a year or two sell a million or two but they never have that 5 million in the first month kind of release that the big titles get and that Bioware sorely wants.</p> <p>If you want to work video games you are going to work lots of hours no matter where you go but Bioware is a great place to do it because they do treat you well. Many people leave Bioware and write back to say they regret it. Some come back. Some also do find a better life elsewhere but say that Bioware life is good too and that they miss many facets of it. I think more people leave Bioware to get away from Deadmonton then any other reason!!</p> <p>Bioware censoring that article if they did isnt surprising to me because they are very controlling of their image. They only want positive talk about them and want to protect fragile egos and morale of the employee staff. Censoring bad publicity doesnt make the bad publicity true. Bioware just doesnt want that out there. <hr> <i>Anonymous</i></p> <p>I've worked my share of crunches (almost 400 hours in three months during summer) on various projects. My takeaway from that is:</p> <p>1) Don't be an ass about it. If you need to crunch, admit what the reason is and create a sensible plan for the crunch.</p> <p>People will take their free time any way they can. Crunches that last over two weeks are too much without breaks in-between and lead to more errors than actual work being done.</p> <p>2) Realize that a full day's job has on average about 5-6 hours of actual work that benefits the company. With crunch, you can have people in the office for 12+ hours a day, but the additional hours don't really pay off that well in comparison.</p> <p>3) Pretty much every developer I know is very savvy with the industry and how it operates. Most of them put their family and life above work and wouldn't hesitate to resign in a minute if the crunch or the company seems unfair or badly managed. Again, don't be an ass about the crunch.</p> <p>4) I don't want to spend years and make a shit game. It's a waste of my life and time. I'll crunch for you if you plan it with a brain in your head. If you don't, I'll do something more relevant with my life.</p> <p>What anonymous posts about Bioware sounds very much like normal circumstances and I'd be willing to work in a company like that. I want to wander away from my desk to play a demo or whatever and I like people trusting me that yes, I will do my tasks by the deadline. In fact, I think that's how every game company should operate. I'm glad I haven't experienced it any other way yet during the years. <hr> <i>Anonymous</i></p> <p>While those kinds of hours may be typical or expected, they are in no way excusable.
Companies that ask people to work 60 hour weeks for long periods of time in the middle of a project are either a) run by incompetent project managers or b) guilty of taking advantage of their teams.</p> <p>Just because it's the video game industry, many people think that it is just par for the course. I call shenanigans, and so did Erin Hoffman (EA Spouse). Thank you, Erin, for getting this kind of crap out of the closet.</p> <p>Not every company is like that. I work at a fantastic game company right now that is creating a AAA game for the Wii. We've hit some tough times at points, but our crunch times are 50 hour weeks, and we almost never do them back-to-back.</p> <p>The idea that you have to suck up and deal sometimes in games? Yes, absolutely. We're way too young as an industry with our production practices and there is a ton of money at stake. The idea that you should be forced to do long stretches of 60 or more hour weeks out of love for a project? As a responsible company owner, you need to turn around and either put some more resources onto the problem, or start cutting scope. Because at that point, you're abusing your team for the mistakes you have made.</p> <hr> <p><i>Anonymous</i></p> <p>WOW, you folks are lucky to think that 60 hour weeks are some kind of exception in modern corporate America. I'm a gamer, not someone who works in the games industry, but I do work in film production in Hollywood, and these hours are truly par for the course out here. I'm talking about EVERY DAY, in by 9 am and you don't leave before 7:30 pm, and you're expected to take home scripts or novels to read and write up for the following day. This is not highly paid, either: average salary for a &quot;creative executive&quot; is in the 50K range and you're working 65 hour weeks, not including any reading or additional work that's required outside of the office.</p> <p>I'm not saying this is excusable, just that this seems to be the trend in America today, and it's afflicting a lot of industries, from banking to lawyering to consulting to video games and film. And the sad truth is, most of the &quot;work&quot; people are doing during their 10 or 12 hour work day could be finished in 5 hours if employees weren't such believers in presenteeism, or the idea that the amount of hours you sit at a desk = your productivity as a worker.</p> <p>Anyway, I just wanted to chime in to say these abuses are not symptomatic of just the games industry but of American white collar work in general, so... beware! The fact is, in competitive industries like these, there are so many willing workers who will step in and suffer these kinds of abuses that we have little power to organize or protect ourselves. Good luck out there.</p> <hr> <p><i>Anonymous</i></p> <p>I worked at Bio for a few years and I can verify that what the OP said is true. Greg and Ray have spent years cultivating a perception in the gaming industry that they are somehow better than other companies. They aren't. I am not saying they are monsters, because in person they are both pretty cool guys. However, when it comes to business they are pretty ruthless. I think it's mostly Ray. They continually promise that things will get better and they don't. Almost every project at Bio has had extended crunch because the project directors always plan way more than they are able to deliver. So, the employees suffer and the directors get a pat on the back and more and bigger shares. The latest example is Mass Effect. There was a 9 month crunch on that game.
Some people came close to nervous breakdowns. They implemented sympathetic crunch which they also promised they had abolished. That's where the whole team has to be on site just in case something goes wrong even if they don't have anything to do. What it's really about though is the politics of making sure that the programmers don't get pissed that the artist got off early or whoever. I think it's just going to get worse under EA. Eventually people will realize that BioWare is just like every other crunchy game dev out there.</p> <hr> <p><i>Anonymous</i></p> <p>(FYI, I posted before, but with the number of anonymous posters I'll clarify I'm the one who spoke about a 400 hour summer crunch)</p> <p>I've never understood or had to withstand a sympathetic crunch - even though our projects had crunches, it was more about sharing workload or just realizing that yes, the coders do have more work ahead for them.</p> <p>I recently talked to a U.S. friend of mine who was astounded by my 4 week summer vacation and 1 week winter vacation. She couldn't really imagine it since she had never had one. The most she had was 1 week off during a year. Add to that the 10+ hour workdays and I can't understand how you cope with that.</p> <p>I work in northern Europe, do 8 hour days 90% of the time and at my current place I can show the total crunch hours with 10 fingers.</p> <p>When you're young, 400 hour crunches and sleeping in a sleeping bag might not be a big deal, but nearing 30, you'll get more interested in your rights and such. The 400 hours for me was a good eye opener on how not to do things. It was valuable, but never again. It took me more than a year to recover the friendships and social aspects of my life after that.</p> <hr> <p><i>Anonymous</i></p> <blockquote> <p>I think more people leave Bioware to get away from Deadmonton then any other reason!!</p> </blockquote> <p>Hey! Nuts to you!</p> <p>(Edmontonian here)</p> <p>I'm acquainted with some people in Bioware. Their loyalty to the company is astonishing - if, as previous posters have said, there is a cultivated sense that Bioware's better than other game companies then it's absolutely taken root. They joke that it feels almost like living in a self-sustained arcology.</p> <p>I remember one of them dismissing EA's Bioware takeover as no big deal, business as usual, why would anything change blah blah blah. I'm sure it was supposed to sound reassuring but it came off as naively dismissive in light of EA's infamous track record.</p> <p>This is the first time I've actually seen numbers on their crunch time, which is disappointing if true. It smacks of poor management and I guess I had some of that fairy-tale view of the company. I hope for Bioware's sake it doesn't get any worse.</p> <hr> <p><i>Anonymous</i></p> <p>One anonymous to another (from this page, I'm the anonymous who defended Bioware earlier), I will respond to this comment -- They implemented sympathetic crunch which they also promised they had abolished.</p> <p>My response is that they didn't implement that (??). People were on call of course but they didn't tell everyone to be there because some people had to work. That's incorrect. If a manager thought their team had deliverables to make for sprints then they told their team to be there. If people were behind they had to be there. If people could help the team they had to be there.
But that was almost always up to individual managers and if an employee had to be there just because, then that is the fault of an individual manager and that individual manager should have been discussed with Casey or the project manager.</p> <p>Not disagreeing with the rest of your post which pretty much said the same as mine. But this one point was inaccurate.</p> <hr> <p><i>Anonymous</i></p> <p>Anon, in response to your argument about the sympathetic crunch: Of course it's the manager's fault, but do you really think anyone is going to buck Casey and not suffer for it? They never come right out and say that people have to be there for political reasons, they couch it in all kind of ways. All I know is that I spent more than a few nights at work until 2:00 or 3:00 AM because something might go wrong even though I lived less than 30 minutes away and told them I could be called at any time. I also know that I was taken to task for always asking to leave when I was done because of the perception it created amongst people that were forced to stay.</p> History of Symbolics lisp machines symbolics-lisp-machines/ Fri, 16 Nov 2007 00:00:00 +0000 symbolics-lisp-machines/ <p><em>This is an archive of Dan Weinreb's comments on Symbolics and Lisp machines.</em></p> <h3 id="rebuttal-to-stallman-s-story-about-the-formation-of-symbolics-and-lmi">Rebuttal to Stallman’s Story About The Formation of Symbolics and LMI</h3> <p>Richard Stallman has been telling a story about the origins of the Lisp machine companies, and the effects on the M.I.T. Artificial Intelligence Lab, for many years. He has published it in a book, and in a widely-referenced paper, which you can find at <a href="http://www.gnu.org/gnu/rms-lisp.html">http://www.gnu.org/gnu/rms-lisp.html</a>.</p> <p>His account is highly biased, and in many places just plain wrong. Here’s my own perspective on what really happened.</p> <p>Richard Greenblatt’s proposal for a Lisp machine company had two premises. First, there should be no outside investment. This would have been totally unrealistic: a company manufacturing computer hardware needs capital. Second, Greenblatt himself would be the CEO. The other members of the Lisp machine project were extremely dubious of Greenblatt’s ability to run a company. So Greenblatt and the others went their separate ways and set up two companies.</p> <p>Stallman’s characterization of this as “backstabbing”, and his claim that Symbolics decided to “not have scruples”, is pure hogwash. There was no backstabbing whatsoever. Symbolics was extremely scrupulous. Stallman’s characterization of Symbolics as “looking for ways to destroy” LMI is pure fantasy.</p> <p>Stallman claims that Symbolics “hired away all the hackers” and that “the AI lab was now helpless” and “nobody had envisioned that the AI lab’s hacker group would be wiped out, but it was” and that Symbolics “wiped out MIT”. First of all, had there been only one Lisp machine company as Stallman would have preferred, exactly the same people would have left the AI lab. Secondly, Symbolics only hired four full-time and one part-time person from the AI lab (see below).</p> <p>Stallman goes on to say: “So Symbolics came up with a plan. They said to the lab, ‘We will continue making our changes to the system available for you to use, but you can’t put it into the MIT Lisp machine system.
Instead, we’ll give you access to Symbolics’ Lisp machine system, and you can run it, but that’s all you can do.’” In other words, software that was developed at Symbolics was not given away for free to LMI. Is that so surprising? Anyway, that wasn’t Symbolics’s “plan”; it was part of the MIT licensing agreement, the very same one that LMI signed. LMI’s changes were all proprietary to LMI, too.</p> <p>Next, he says: “After a while, I came to the conclusion that it would be best if I didn’t even look at their code. When they made a beta announcement that gave the release notes, I would see what the features were and then implement them. By the time they had a real release, I did too.” First of all, he really was looking at the Symbolics code; we caught him doing it several times. But secondly, even if he hadn’t, it’s a whole lot easier to copy what someone else has already designed than to design it yourself. What he copied were incremental improvements: a new editor command here, a new Lisp utility there. This was a very small fraction of the software development being done at Symbolics.</p> <p>His characterization of this as “punishing” Symbolics is silly. What he did never made any difference to Symbolics. In real life, Symbolics was rarely competing with LMI for sales. LMI’s existence had very little to do with Symbolics’s bottom line.</p> <p>And while I’m setting the record straight, the original (TECO-based) Emacs was created and designed by Guy L. Steele Jr. and David Moon. After they had it working, and it had become established as the standard text editor at the AI lab, Stallman took over its maintenance.</p> <p>Here is the list of Symbolics founders. Note that Bruce Edwards and I had worked at the MIT AI Lab previously, but had already left to go to other jobs before Symbolics started. Henry Baker was not one of the “hackers” of which Stallman speaks.</p> <ul> <li>Robert Adams (original CEO, California)</li> <li>Russell Noftsker (CEO thereafter)</li> <li>Minoru Tonai (CFO, California)</li> <li>John Kulp (from MIT Plasma Physics Lab)</li> <li>Tom Knight (from MIT AI Lab)</li> <li>Jack Holloway (from MIT AI Lab)</li> <li>David Moon (half-time at MIT AI Lab)</li> <li>Dan Weinreb (from Lawrence Livermore Labs)</li> <li>Howard Cannon (from MIT AI Lab)</li> <li>Mike McMahon (from MIT AI Lab)</li> <li>Jim Kulp (from IIASA, Vienna)</li> <li>Bruce Edwards (from IIASA, Vienna)</li> <li>Bernie Greenberg (from Honeywell CISL)</li> <li>Clark Baker (from MIT LCS)</li> <li>Chris Terman (from MIT LCS)</li> <li>John Blankenbaker (hardware engineer, California)</li> <li>Bob Williams (hardware engineer, California)</li> <li>Bob South (hardware engineer, California)</li> <li>Henry Baker (from MIT)</li> <li>Dave Dyer (from USC ISI)</li> </ul> <h3 id="why-did-symbolics-fail">Why Did Symbolics Fail?</h3> <p>In a comment on a previous blog entry, I was asked why Symbolics failed. The following is oversimplified but should be good enough. My old friends are very welcome to post comments with corrections or additions, and of course everyone is invited to post comments.</p> <p>First, remember that at the time Symbolics started around 1980, serious computer users used timesharing systems. The very idea of a whole computer for one person was audacious, almost heretical. Every computer company (think Prime, Data General, DEC) did their own hardware and their own software suite. There were no PCs, no Macs, no workstations. 
At the MIT Artificial Intelligence Lab, fifteen researchers shared a computer with a .001 GHz CPU and .002 GB of main memory.</p> <p>Symbolics sold to two kinds of customers, which I’ll call primary and secondary. The primary customers used Lisp machines as software development environments. The original target market was the MIT AI Lab itself, followed by similar institutions: universities, corporate research labs, and so on. The secondary customers used Lisp machines to run applications that had been written by some other party.</p> <p>We had great success amongst primary customers. I think we could have found a lot more of them if our marketing had been better. For example, did you know that Symbolics had a world-class software development environment for Fortran, C, Ada, and other popular languages, with amazing semantics-understanding in the editor, a powerful debugger, the ability for the languages to call each other, and so on? We put a lot of work into those, but they were never publicized or advertised.</p> <p>But we knew that the only way to really succeed was to develop the secondary market. ICAD made an advanced constraint-based computer-aided design system that ran only on Symbolics machines. Sadly, they were the only company that ever did. Why?</p> <p>The world changed out from under us very quickly. The new “workstation” category of computer appeared: the Suns and Apollos and so on. New technology for implementing Lisp was invented that allowed good Lisp implementations to run on conventional hardware; not quite as good as ours, but good enough for most purposes. So the real value-added of our special Lisp architecture was suddenly diminished. A large body of useful Unix software came to exist and was portable amongst the Unix workstations: no longer did each vendor have to develop a whole software suite. And the workstation vendors got to piggyback on the ever-faster, ever-cheaper CPU’s being made by Intel and Motorola and IBM, with whom it was hard for Symbolics to keep up. We at Symbolics were slow to acknowledge this. We believed our own “dogma” even as it became less true. It was embedded in our corporate culture. If you disputed it, your co-workers felt that you “just didn’t get it” and weren’t a member of the clan, so to speak. This stifled objective analysis. (This is a very easy problem to fall into — don’t let it happen to you!)</p> <p>The secondary market often had reasons that they needed to use workstation (and, later, PC) hardware. Often they needed to interact with other software that didn’t run under Symbolics. Or they wanted to share the cost of the hardware with other applications that didn’t run on Symbolics. Symbolics machines came to be seen as “special-purpose hardware” as compared to “general-purpose” Unix workstations (and later Windows PCs). They cost a lot, but could not be used for the wider and wider range of available Unix software. Very few vendors wanted to make a product that could only run on “special-purpose hardware”. (Thanks, ICAD; we love you!)</p> <p>Also, a lot of Symbolics sales were based on the promise of rule-based expert systems, of which the early examples were written in Lisp. Rule-based expert systems are a fine thing, and are widely used today (but often not in Lisp). But they were tremendously over-hyped by certain academics and by their industry, resulting in a huge backlash around 1988. 
“Artificial Intelligence” fell out of favor; the “AI Winter” had arrived.</p> <p>(Symbolics did launch its own effort to produce a Lisp for the PC, called CLOE, and also partnered with other Lisp companies, particularly Gold Hill, so that customers could develop on a Symbolics and deploy on a conventional machine. We were not totally stupid. The bottom line is that interest in Lisp just declined too much.)</p> <p>Meanwhile, back at Symbolics, there were huge internal management conflicts, leading to the resignation of much of top management, who were replaced by the board of directors with new CEOs who did not do a good job, and did not have the vision to see what was happening. Symbolics signed long-term leases on big new offices and a new factory, anticipating growth that did not come, and were unable to sublease the properties due to office-space gluts, which drained a great deal of money. There were rounds of layoffs. More and more of us realized what was going on, and that Symbolics was not reacting. Having created an object-oriented database system for Lisp called Statice, I left in 1988 with several co-workers to form Object Design, Inc., to make an object-oriented database system for the brand-new mainstream object-oriented language, C++. (The company was very successful and currently exists as the ObjectStore division of Progress Software (www.objectstore.com). I’m looking forward to the 20th-year reunion party next summer.)</p> <p>Symbolics did try to deal with the situation, first by making Lisp machines that were plug-in boards that could be connected to conventional computers. One problem is that they kept betting on the wrong horses. The MacIvory was a Symbolics Ivory chip (yes, we made our own CPU chips) that plugged into the NuBus (oops, long-since gone) on a Macintosh (oops, not the leading platform). Later, they finally gave up on competing with the big chip makers, and made a plug-in board using a fast chip from a major manufacturer: the DEC Alpha architecture (oops, killed by HP/Compaq, should have used the Intel). By this time it was all too little, too late.</p> <p>The person who commented on the previous blog entry referred to an MIT Master’s thesis by one Eve Philips (see <a href="http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/ai-business.pdf">http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/ai-business.pdf</a>) called “If It Works, It’s Not AI: A Commercial Look at Artificial Intelligence Startups”. This is the first I’ve heard of it, but evidently she got help from Tom Knight, who is one of the other Symbolics co-founders and knows as much or more about Symbolics history than I. Let’s see what she says.</p> <p>Hey, this looks great. Well worth reading! She definitely knows what she’s talking about, and it’s fun to read. It brings back a lot of old memories for me. If you ever want to start a company, you can learn a lot from reading “war stories” like the ones herein.</p> <p>Here are some comments, as I read along. Much of the paper is about the AI software vendors, but their fate had a strong effect on Symbolics.</p> <p>Oh, of course, the fact that DARPA cut funding in the late ’80s is very important. Many of the Symbolics primary-market customers had been ultimately funded by DARPA research grants.</p> <p>Yes, there were some exciting successes with rule-based expert systems. 
Inference’s “Authorizer’s Assistant” for American Express, to help the people who talk to you on the phone to make sure you’re not using an AmEx card fraudulently, ran on Symbolics machines. I learn here that it was credited with a 45-67% internal rate of return on investment, which is very impressive.</p> <p>The paper has an anachronism: “Few large software firms providing languages (namely Microsoft) provide any kind of Lisp support.” Microsoft’s dominance was years away when these events happened. For example, remember that the first viable Windows O/S, release 3.1, came out in 1990. But her overall point is valid.</p> <p>She says “There was a large amount of hubris, not completely unwarranted, by the AI community that Lisp would change the way computer systems everywhere ran.” That is absolutely true. It’s not as wrong as it sounds: many ideas from Lisp have become mainstream, particularly managed (garbage-collected) storage, and Lisp gets some of the credit for the acceptance of object-oriented programming. I have no question that Lisp was a huge influence on Java, and thence on C#. Note that the Microsoft Common Language Runtime technology is currently under the direction of the awesome Patrick Dussud, who was the major Lisp wizard from the third MIT-Lisp-machine company, Texas Instruments.</p> <p>But back then we really believed in Lisp. We felt only scorn for anyone trying to write an expert system in C; that was part of our corporate culture. We really did think Lisp would “change the world” analogously to the way “sixties-era” people thought the world could be changed by “peace, love, and joy”. Sorry, it’s not that easy.</p> <p>Which reminds me, I cannot recommend highly enough the book “Patterns of Software: Tales from the Software Community” by Richard Gabriel (<a href="http://www.dreamsongs.com/Files/PatternsOfSoftware.pdf">http://www.dreamsongs.com/Files/PatternsOfSoftware.pdf</a>) regarding the process by which technology moves from the lab to the market. Gabriel is one of the five main Common Lisp designers (along with Guy Steele, Scott Fahlman, David Moon, and myself), but the key points here go way beyond Lisp. This is the culmination of the series of papers by Gabriel starting with his original “Worse is Better”. Here the ideas are far more developed. His insights are unique and extremely persuasive.</p> <p>OK, back to Eve Philips: in chapter 5 she describes “The AI Hardware Industry”, starting with the MIT Lisp machine. Does she get it right? Well, she says “14 AI lab hackers joined them”; see my previous post about this figure, but in context this is a very minor issue. The rest of the story is right on. (She even mentions the real-estate problems I pointed out above!) She amply demonstrates the weaknesses of Symbolics management and marketing, too. This is an excellent piece of work.</p> <p>Symbolics was tremendously fun. We had a lot of success for a while, and went public. My colleagues were some of the most skilled and likable technical people you could ever hope to work with. I learned a lot from them. I wouldn’t have missed it for the world.</p> <p>After I left, I thought I’d never see Lisp again. But now I find myself at ITA Software, where we’re writing a huge, complex transaction-processing system (a new airline reservation system, initially for Air Canada), whose core is in Common Lisp. We almost certainly have the largest team of Common Lisp programmers in the world. 
Our development environment is OK, but I really wish I had a Lisp machine again.</p> <h3 id="more-about-why-symbolics-failed">More about Why Symbolics Failed</h3> <p>I just came across “Symbolics, Inc: A failure of heterogeneous engineering” by Alvin Graylin, Kari Anne Hoir Kjolaas, Jonathan Loflin, and Jimmie D. Walker III (it doesn’t say with whom they are affiliated, and there is no date), at <a href="http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/Symbolics.pdf">http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/Symbolics.pdf</a></p> <p>This is an excellent paper, and if you are interested in what happened to Symbolics, it’s a must-read.</p> <p>The paper’s thesis is based on a concept called “heterogeneous engineering”, but it’s hard to see what they mean by that other than “running a company well”. They have fancy ways of saying that you can’t just do technology, you have to do marketing and sales and finance and so on, which is rather obvious. They are quite right about the wide diversity of feelings about the long-term vision of Symbolics, and I should have mentioned that in my essay as being one of the biggest problems with Symbolics. The random directions of R&amp;D, often not co-ordinated with the rest of the company, are well-described here (they had good sources, including lots of characteristically, harshly honest email from Dave Moon). The separation between the software part of the company in Cambridge, MA and the hardware part of the company in Woodland Hills (later Chatsworth) CA was also a real problem. They say “Once funds were available, Symbolics was spending money like a lottery winner with new-found riches” and that’s absolutely correct. Feature creep was indeed extremely rampant. The paper also has financial figures for Symbolics, which are quite interesting and revealing, showing a steady rise through 1986, followed by falling revenues and negative earnings from 1987 to 1989.</p> <p>Here are some points I dispute. They say “During the years of growth Symbolics had been searching for a CEO”, leading up to the hiring of Brian Sear. I am pretty sure that only happened when the trouble started. I disagree with the statement by Brian Sear that we didn’t take care of our current customers; we really did work hard at that, and I think that’s one of the reasons so many former Symbolics customers are so nostalgic. I don’t think Russell is right that “many of the Symbolics machines were purchased by researchers funded through the Star Wars program”, a point which they repeat many times. However, many were funded through DARPA, and if you just substitute that for all the claims about “Star Wars”, then what they say is right. The claim that “the proliferation of LISP machines may have exceeded the proliferation of LISP programmers” is hyperbole. It’s not true that nobody thought about a broader market than the researchers; rather, we intended to sell to value-added resellers (VAR’s) and original equipment manufacturers (OEM’s). The phrase “VARs and OEMs” was practically a mantra. Unfortunately, we only managed to do it once (ICAD). While they are right that Sun machines “could be used for many other applications”, the interesting point is the reason for that: why did Sun’s have many applications available? The rise of Unix as a portable platform, which was a new concept at the time, had a lot to do with it, as well as Sun’s prices. They don’t consider why Apollo failed.</p> <p>There’s plenty more. 
To the authors, wherever you are: thank you very much!</p> Subspace / Continuum History subspace-history/ Wed, 01 Feb 2006 00:00:00 +0000 subspace-history/ <p>Archived from an unknown source. Possibly Gravitron?</p> <p>In regards to the history:</p> <h3 id="chapter-1">Chapter #1</h3> <p>(Around) December 1995 is when it all started. Rod Humble wished to create something like Air Warrior, but online. He approached Virgin Interactive Entertainment with the idea and they replied with something along the lines of &quot;here's the cash, good luck&quot;. Rod called Jeff Petersen and asked if he was interested in helping him create an online-only game, and Jeff agreed. In order to overcome lag, Jeff decided it would be best to test an engine that simulates Newtonian physics principles - an object in motion tends to keep the same vector - and on top of that he also built prediction formulas (see the sketch just below). They also enlisted Juan Sanchez, with whom Rod had worked before, to design some graphics for it. Either way, after some initial progress they decided to put it in front of the gaming public to test it, code-named Sniper. They got a few people to play it and give feedback. After a short alpha testing period they decided that they had learned enough and pulled the plug. The shockwave that came back from the community when the announcement was received impressed them enough to keep it in development. They moved on to beta in early-mid '96 and dubbed it SubSpace. From there it entered a real development cycle and was opened up and advertised by word of mouth to many people. Michael Simpson (Blackie) was assigned by VIE as an external producer from their Westwood Studios division to serve as a promotional agent, community manager, and overall public relations person. Jeff and Rod started preparing to leave VIE.</p>
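<p>To make the idea of &quot;prediction formulas&quot; concrete, here is a minimal, hypothetical dead-reckoning sketch in Python. It only illustrates the general technique of extrapolating a remote ship from its last known position and velocity so that motion stays smooth between laggy network updates; it is not SubSpace's actual code, and every name in it is invented.</p> <pre><code># Illustrative dead-reckoning sketch (not SubSpace's real engine).
# Each client extrapolates a remote ship from the last state it received,
# then snaps to the true state whenever a fresh update arrives.
class RemoteShip:
    def __init__(self, x, y, vx, vy, timestamp):
        self.x, self.y = x, y          # last reported position
        self.vx, self.vy = vx, vy      # last reported velocity
        self.timestamp = timestamp     # time of the last update

    def predict(self, now):
        # Newtonian assumption: an object in motion keeps the same vector.
        dt = now - self.timestamp
        return (self.x + self.vx * dt, self.y + self.vy * dt)

    def update(self, x, y, vx, vy, timestamp):
        # When a real packet arrives, replace the stale state.
        self.x, self.y, self.vx, self.vy = x, y, vx, vy
        self.timestamp = timestamp
</code></pre>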
<h3 id="chapter-2">Chapter #2</h3> <p>In late '97, SubSpace officially entered its retail cycle, with pre-orders being collected and demo privileges being revoked (demo clients were confined to 15 minutes of play and only the first four ships were accessible). The reason behind this was that VIE was going downhill, losing money (the only things keeping them afloat this long were Westwood and the C&amp;C franchise), and tried to cash out on any bone they could get. In early '98 a small effort, primarily on Juan's part and mainly talked up by Michael, began on SubSpace 2; it soon enough dissolved to dust and was never discussed again. Skipping to '98, VIE classified SubSpace as a B-rated product, which meant it got no advertising budget. In addition, they manufactured a mere 10,000 or so copies of it and tossed it to only a select few retailers for $30 a box. Along those lines, VIE also lost an opportunity to sell SubSpace to Microsoft, as part of &quot;The Zone&quot;, for a very nice sum - a deal which would've ensured the game's long-term success and continuance. The deal fell through the cracks due to the meddling of Viacom, who owned VIE at the time, until it was completely screwed up. Rod and Jeff, enraged by all of this, realized it was over and notified VIE several months ahead of their contracts expiring (they were employed on a 1-year contract with an option to extend it another year at a time) of their intention to leave and go independent. They tried to negotiate with VIE to enter a developer-publisher relationship; naturally it didn't work, and they went their separate ways. In October '98 word broke from an inside source, and rumors, which would later be proven true, began to fly about VIE going bankrupt and SubSpace being abandoned, left without support or a development team. Although frantically denied by Michael, the horror was proven true, and not long after, VIE officially announced the shutdown of SubSpace and the complete withdrawal of support, accompanied by a Chapter 11 filing and a sell-off of its remaining assets (Viacom had already sold Westwood to Electronic Arts, along with Michael). The owners of Brilliant Digital Entertainment (Kazaa/Altnet.com) created an asset holding company called Ozaq2, and are now the sole holders of the SubSpace copyrights. By then, the original developers were long gone.</p> <h3 id="chapter-3">Chapter #3</h3> <p>In early '98/late '97 the ex-SubSpace developers - Rod, Jeff, and Juan - move to Origin, which contracts them to create Crusader Online. Unfortunately, however, after they produce an alpha version, Origin executes a clause which states that they may pull the plug if they do not like the demo, and terminates the project. Nick Fisher (AKA Trixter, as he is known in SubSpace) approaches them, and together they form Harmless Games. Their first task is taking what they had done and building it further into a viable, profitable online game; the Crusader Online demo is redubbed Infantry (Online). On a side note, I have no idea if what they made was what is known as the would-be &quot;Crusader: No Mercy&quot; (the so-called online version of Crusader, with only 1, possibly fake, screenshot of the project ever released). Nick creates the GameFan Network, which will rehost warzone.com and Infantry's gaming servers, among other websites and deals. Jeff has the game quickly plow through pre-alpha, rapidly working it up to be suitable for alpha testing. Larry Cordner is contracted to create content editors for the game, though he won't be staying on pay for long (and disappears/is sacked upon the move to SOE). By October '98 the Harmless Games site is put up, along with the &quot;most&quot; official Infantry section, which is the only part of that site that gets any attention at all. Juan creates for Rod the insignia of HG - the Tri-Nitro-Teddy. Jeremy Weeks is contracted to create secondary concept art for Infantry. In November '98 HG officially announces Infantry; alpha testing is to commence shortly. Juan, with his part of the artwork finished, leaves the team. In March '99, after many attempts at signing a publishing deal kept falling through, HG officially announces BrainScan, a company founded by Nick, as the game's publisher; beta testing is to begin later that year, with a full pay-to-play release scheduled not far behind. Rod and Jeff clash over Rod's desire to bring Infantry to Verant/Studio 989 (later renamed SOE) via his connections; Rod eventually leaves Infantry and HG to take a high position at Sony Online Entertainment (senior executive of games development, I believe it was). In late 2000, due to the dot-com crash, express.com fails to pay GameFan Network its dues (advertising banner payments, of course) and GFN crashes and burns under millions of dollars of debt it cannot cover (as does, quietly, BrainScan). Infantry's servers are slated for shutdown; the hunt for a new host/publisher begins. 
And so they contact Rod, and eventually all of the intellectual properties owned by Nick are sold to SOE (the ICQ-esque EGN2 program as well), with Infantry &amp; Jeff among them, for an &quot;undisclosed sum&quot; (according to Nick, the deal earned him a figure around 6 million USD). SOE's &quot;The Station&quot; announces the acquisition of Infantry. Infantry is still being run on GFN's last remaining online server, which for some reason someone (whoever the box hosting was bought from) forgot to take down - that is, until late October, at which point it is brought down and the long coma begins.</p> <h3 id="chapter-4">Chapter #4</h3> <p>Come November 2000, Infantry goes back online at SOE. Not long after, Cosmic Rift begins development, the SubSpace clone which in April 2001 is announced publicly. Jeff becomes more and more absent until finally disappearing from Infantry and its development altogether (we later learn that he was pulled away by Rod's steering and moved onto EQ projects and SWG). SOE partially &quot;fires&quot; Jeremy, only to rehire him later. Then in April 2002 the hellfire spits brimstone: Infantry is going pay-to-play, and the chain of broken promises and of EQ customer support personnel being assigned as the game's executive producers begins. A lot of discontent and grief arises from the player base. Some people begin to run private, limited-ability servers from the beta era. Infantry players who had access to beta software, Gravitron among them, outraged by the injustice done to Jeff and the game, by the betrayal by Rod, and unable to stand SOE's continued abuse, mistreatment, and lies, make a statement by gathering all available material (namely the beta client, beta server, and editing tools) and releasing it to the public and whoever desires it (despite the predictable effect of anger and alienation from Jeff). Rod plummets into the depths of EQ; Jeff disappears off the radar. Everyone else continues with their separate lives, employment, and projects.</p> <h3 id="chapter-5">Chapter #5</h3> <p>A supplement on SubSpace's well-being.</p> <p>About post-VIE SS: a Norwegian named Robert Oslo, alias BaudChaser, approached a Finnish ISP called Inet. Cutting a lot of events (and shit) short, he, along with the one known as Xalimar (an Exodus/C&amp;W employee) whose real name eludes me, became the two carriers of the SS torch, as they arranged for the hosting of the game's zones. BaudChaser formed the SubSpace Council and, for as long as he stayed around up to his departure, took care to keep SS going and battled a lot of cheating, staff troubles (abuse), and grief. Eventually Inet stopped hosting SS, and now Xalimar alone carries the burden, for the most part, of hosting the core SS zones. Priit Kasesalu, who had apparently been playing the game, started working for the current chief in power of the SSC and ex-Vangel AML league sysop, Alex Zinner (AKA Ghost Ship), hacking the server software and eventually creating his own client by reverse engineering the original SS, possibly having some sort of access to the source.</p> <p>About the &quot;SubSpace 2&quot; rumors of mid 2003: the owners of BDE wished to create the perfect peer-to-peer network (Altnet.com), and they needed a flagship product to prove to investors that their way was just and right. For that, they contacted a company called Horizon, which specialized in P2P technology. 
Horizon was creating a P2P technology called Horizon's DEx; later on, Horizon renamed itself SilverPlatter and its technology Alloy. Somewhere around 2002-2003 they were supposed to use BDE's Altnet at an E3 show to present the manifestation of this technology - Space Commander, presumably SubSpace remade anew and used as the first massive multiplayer online game over peer-to-peer. However, SilverPlatter eventually went bankrupt, for some reason, and nothing was known before or since about BDE's attempts at using the SubSpace properties which they owned, aside from this single E3 presentation.</p> <h3 id="chapter-6">Chapter #6</h3> <p>Additional update:</p> <p>(wow, I must put this through a grammatical correctional application) Somewhere along 2004-2005, Rod quit SOE as Executive Producer/VP of production (SOE seeming nowadays like a leaking boat about to drown) and joined Maxis to head up Sims projects. In October 2005, in a series of lay-offs, Jeremy Weeks (yankee) is fired from SOE, apparently permanently this time; any shred of hope (not much to begin with) Infantry had is now diminished to next to null. Jeff is still assumed to be working at SOE. Juan surfaces at Pandemic Studios, working for LucasArts on Battlefront I &amp; II (and has a website, www.allinjuan.com). Somewhere later that year or in the beginning of 2006, a high-ranking moderator-player known as Mar snaps in the face of the continued abuse/neglect by the owners and, in an anti-SOE move, releases the latest editor tools; his efforts are quickly quashed, however, and it is unknown if anyone got their hands on the software; he is of course stripped of his status and subsequently banned from the game. In February 2006, Rod is tracked down and grants his point of view, having been accused of not lending assistance to Infantry while he was a games development exec at SOE and clearly in a position to help:</p> <hr /> <p>You know what? You are probably right. At the time I was focused entirely on the big EQ issues which the entire company's survival hinged on. In retrospect Infantry could have been turned into a bigger product than it was by extra resources (although I will say it got more than other titles of similar sub bases). Somewhat ironically, now I am completely fatigued by graphical MUDs; games like Infantry are interesting to me again. So yeah, I could have done some more at the time. Hopefully a lesson learned. Anyways I hope that serves by way of an honest explanation. I can imagine how frustrating it must have been as a player. All the best,</p> <p>Rod</p> <hr /> Glenn Henry interview glenn-henry-interview/ Wed, 09 Jun 2004 00:00:00 +0000 glenn-henry-interview/ <p>This is an archive of an interview from the now defunct linuxdevices.com. It originally lived at linuxdevices.com/articles/AT2656883479.html. It's archived here because I refer to it on this site and the original source is gone.</p> <h4 id="q1-can-you-give-us-a-short-history-of-centaur">Q1: Can you give us a short history of Centaur?</h4> <p>A1: The idea came to me and my co-founders in 1993. We were working at Dell -- I was Senior Vice President in charge of products. At that time, we were paying Intel I think $160 per processor. That was the lowest Intel price, and that was a special deal. So, it occurred to me that you could make a compatible part and sell it a lot lower. And that part, if not equally fast, would be fast enough for the masses of people.</p> <p>No one seemed interested in doing that. 
AMD was just starting in the x86 business at the time, and they were trying to compete head-on with Intel. So, in early 1994, I quit Dell, and three other people came with me. We spent a year working out of our homes trying to get funding to start a company to build low-cost, low-power, x86 chips that were differentiated from Intel but fully compatible with all the x86 software.</p> <p>Our theory at that time was sort of a &quot;build it, and they will come&quot; theory. We thought that if we could lower the price of the processor, it would stimulate not only low-cost PCs, but new applications we didn't know about in 1994.</p> <p>We found funding from an American semiconductor company called IDT, and started Centaur. Centaur has never been an independent company in one sense -- we were previously wholly owned by IDT, and now we're wholly owned by VIA. On the other hand, we're an independent company in the way we operate. We have our own culture, our own payroll, etc.</p> <p>We started officially on Apr. 1, 1995, the day the check came in the mail, an auspicious date. We shipped our first processor two years later, and then another a year and a half after that, in early-1999.</p> <p>IDT decided to sell us because they had no presence in x86 or the PC world -- there was no synergism there. So they publicly put us on sale, and VIA bought us in September of 1999. The marriage was perfect, because VIA produces all the other silicon that goes into a PC. They design boards, their sister and cousin companies produce boards, their other cousin company makes all the other little low-cost parts for a PC -- all that was missing, from a hardware point of view, was the processor.</p> <p>In fact, since you're LinuxDevices, I'll make a comment. When I was going around selling this argument, I would point out that the price of everything in a PC but two things was going down drastically, and therefore there's this huge opportunity to move &quot;PC processing&quot; into new dimensions. But the two things that weren't going down were reducing the opportunity. And those two things were the Intel processor, and the Microsoft software.</p> <p>When we started, we had no answer for what to do about the Microsoft software. We just attacked the Intel processor part of it. But in the meantime, along came Linux. Our percentage of Linux -- I suspect, although I don't have the numbers to give you -- is much higher than other peoples' percentage of Linux, just because of the characteristics of our part.</p> <p>VIA also had that vision of low-cost, low-power system platforms, so it was a good marriage, because we had the secret ingredient that was missing. As long as you have to buy a processor from Intel, you're obviously restricted in how low a price or small a form factor you can have.</p> <h4 id="q2-so-currently-the-relationship-to-via-is-wholly-owned-subsidiary">Q2: So, currently the relationship to VIA is &quot;wholly owned subsidiary?&quot;</h4> <p>A2: Yes, in one sense. We're very independent, on a day-to-day basis. They don't send people here. My titular boss is WenChi Chen, the head of VIA, but I talk to him once a month by phone, and it's usually on strategic things. Day-to-day, month-to-month, we operate independently. We have our own payroll, own culture, etc. On the other hand, in terms of product strategy for the future, and practical issues like manufacturing and product support, we work very closely with them. 
In one sense we're an integrated division, and in one sense we're a contract processor design firm.</p> <h4 id="q3-how-many-employees-do-you-have-now">Q3: How many employees do you have now?</h4> <p>A3: We have roughly 82. That was one of our selling themes when I started this. But, it was a catch-22. To get people even vaguely interested, we had to have a low-cost theme. To have a low-cost theme, you have to have a very lean design cost, too. But on the other hand, when I told people, &quot;Well, there's four of us in our kitchens, and with another 20 or 30 people we could build an x86 processor,&quot; no one would believe that.</p> <h4 id="q4-yeah-how-is-that-possible">Q4: Yeah, how is that possible?</h4> <p>A4: It's made possible by two things. One is sort of a product focus. The other is the culture, how we operate. Let me talk about the product focus first.</p> <p>Intel designs processors -- and so does AMD -- number one, to be the world's fastest, and number two, to cover the whole world. The same basic processor is packaged in the morning as a server processor, and in the afternoon as a mobile processor, etc., etc. Not quite true, but... They sort of cover the world of workstations, servers, high-cost desktops, and mobile units. And AMD tried to do that. But they're also trying to be the world's fastest.</p> <p>The idea I had, which actually was hard for people to accept could be successful, was, &quot;Let's not try to make the world's fastest. Let's look at all the characteristics. Speed is one, cost is another, power consumption is another, footprint is another. And let's make something that wins on footprint size, cost, and power, and is fast enough to get the job done for 90 percent of the people but is not the fastest thing in the world.&quot;</p> <p>The last 10 percent of performance is a huge cost. And not just the hardware side, but also in design complexity. So, in fairness, our parts are slower than Intel's. On the other hand, our parts are fast enough to get the job done for most people. Other than the marketing disadvantage of having slower parts, our parts perform quite well. And they have much lower power, and a much smaller footprint, and they cost much less. Those characteristics appeal to a number of applications.</p> <p>So, that's our theme. The marketing guys don't like me to say &quot;fast enough&quot; performance, but you know, we're not as fast, head-to-head, as Intel is. But, we are fast enough to do 90 percent of the applications that are done, using a processor. Maybe even more than that. And we have very low power -- much lower than Intel or AMD -- and a really small footprint size -- much smaller than Intel or AMD.</p> <p>The small footprint only appeals to some people, but for that class of people, i.e., in the embedded space, that characteristic is important. And, of course, the cost is very, very good. I can't give you actual numbers, but our parts sell in the neighborhood of $30 to $40 for a 1GHz part.</p> <p>There's a second secret ingredient. I had 28 years of management before I started this. And, I had the luxury of starting a company with a clean sheet of paper, with three other guys who had worked for me for a long time. Our original owner, IDT, sent money and left us alone. So, we have created a culture that I think is the best in the world for doing this.</p> <p>Our engineers are extremely experienced, and very, very productive. One engineer here does the work of many, many others in a big company. 
I won't go through all the details, because they're not technical, but basically, we started with the theory, &quot;We're going to do this with 20 or 30 designers.&quot; Remember the 82 people I mentioned? Of those, only 35 are actual designers. The rest are testers, and things like that.</p> <p>So, we said, we were going to do it, with that many people. As I said before, we had this idea to constrain the complexity of the hardware. We hired just the right people, and gave them just the right environment to work in. We bought the best tools, and developed our own tools when the best tools weren't available, etc., etc.</p> <p>So I have a long story there, but the punchline is that we were able to hire extremely experienced and good people, and keep those people. We just passed our nine year anniversary, and the key people who started the company in the first year are almost all still here.</p> <p>So, this is the secret, actually: all the things I do are underlying things to allow us to hire the right people and keep them motivated and happy, and not leaving.</p> <p>Our general claim is, &quot;This is a company designed by me to be the kind of place I wanted to work in as an engineer.&quot; It doesn't fit everyone, but it fits certain people, and for those people, it's probably the best environment in the world.</p> <h4 id="q5-bravo-that-sounds-like-a-great-way-to-create-a-company">Q5: Bravo! That sounds like a great way to create a company.</h4> <p>A5: We were very lucky. I found two people who personally believed in that. This is a very hard story to sell. You got to go back to 1994, when I was travelling from company to company, and my story was, &quot;Lookit, there's four of us in our kitchens in Austin, I want you to give us $15 million dollars, I want you to leave me alone for two years, and then I'll deliver an x86 processor.&quot; Right?</p> <p>That's a very hard story to sell. We were lucky to find at the time, the CEO of IDT, a person by the name of Len Perham, who, personally believed the story. Since he was the CEO, Len was able to get others to believe in it. With VIA, the person we found that believed that was WenChi Chen, who's the CEO of VIA. Those people were both visionaries in their own right, and understood the importance of having an x86 processor.</p> <p>My basic argument for why a person would want to do this was simple. It's clear that the processor is the black hole, and that all silicon is going to fall into it at some point. Integration is inevitable, in our business. Those who can do a processor can control their system destiny, and those who don't will end up totally at the mercy of other people, who can shut them out of business right away.</p> <p>And as an example, when IDT bought us, they were making a lot of money on SRAMs that went into caches on PC motherboards. Two years later, there were no SRAMs on PC motherboards, because Intel put them on the die. That's going to happen to some of the chips today. All the other chips are gonna disappear at some point, and all that's left is the big chip, with the processor in the middle. You have to own that processor technology, or it won't be your chip.</p> <p>So that's my basic sell job. It reached two people, but they were the right people.</p> <h4 id="q6-can-you-give-a-real-quick-summary-of-the-history-of-the-processors">Q6: Can you give a real quick summary of the history of the processors?</h4> <p>A6: Our first processor we shipped was called the WinChip C6. &quot;WinChip&quot; was a branding name that IDT had. 
It was 200MHz, but it had MMX. It was a Pentium with MMX. We shipped two or three more straightforward derivatives of that, added 3DNow, put an on-chip cache on it, then further enhanced the speed, and that's where we were when VIA bought us.</p> <p>With VIA, we've shipped several major designs.</p> <p>The first one we call internally C5A. There are three different names... it's very confusing. When I talk to people, I usually end up using our internal names. I used those in my talk at the Microprocessor Forum. VIA also uses another codename for a class of products that covers several of our products, names like Samuel, Nehemiah -- they're Bible names. And then there's the way the product is sold.</p> <p>The first part we sold for VIA had a Pentium III bus on it, and was around 600MHz. Since VIA bought us, four and a half years ago, we have shipped four different variations. Each one is faster; each one has more functions, is more capable; each one is relatively lower power. The top Megahertz goes up, but the watts per Megahertz always goes down. They're all the same cost.</p> <p>The product we're shipping now, the C5P, has a top speed of 1.4 to 1.5GHz today, but the sweet spot is 1GHz. We have a fanless version at 1GHz. We also sell all the way down to 533 or even 400MHz, for low-power applications.</p> <p>To give you an idea about the 1GHz version we're selling today, the worst case power -- not &quot;typical&quot; or &quot;average&quot; power, which other people talk about -- our worst case power is 7 watts, which is low enough to do fanless at 1GHz [story], and no one else can do that.</p> <p>Second, we also sell that same part at 533. Its worst case power is 2.5 watts. So, remember, I'm talking worst-case power. Typical power, which a lot of people quote, is about half of the worst-case power. So, if we want to play games, we could say it's a 1 to 1.5 watt part at 533MHz, and it's a 3-watt part at 1GHz.</p> <p>Along the way, we've used four different technologies. All our technologies since we were bought by VIA have been with TSMC [Taiwan Semiconductor Manufacturing Company]. We've used four major technologies, with, obviously, sub-technologies. So we've shipped four major designs, with two or three minor variations, but they weren't radically different. That's in four years.</p> <p>We design products very quickly. That was also part of my theme: be lower cost than everybody else, and be able to move faster than everybody else. The things that make you able to do things with a small group also allow you to do things quickly. Actually, the more people you have, the slower things go: more communication, more decision-making, etc.</p> <p>By the way, I stole this idea from Michael Dell. Quick anecdote: I was an IBM Fellow, and I managed very large groups -- hundreds of people -- at IBM. And I went to Dell originally to be their first Vice President of R&amp;D. This was in 1988. So, I get to Dell, and find that the R&amp;D department is six or seven guys that work directly for Michael! And Michael says, &quot;Your job is to compete with Compaq.&quot; And I say, &quot;Well, how can you do that with six or seven guys?&quot; And he says, &quot;That's the secret. We'll always be lower-cost, and we'll move quicker than they are.&quot; And of course, that's worked out very well at Dell.</p> <p>We put out a lot of products, in a short period of time, which is actually a major competitive advantage. 
To give you one minor example, early this year, Intel started shipping a new processor called the Prescott. It's a variation of the Pentium 4. And it had new instructions in it, basically, that are called SSE3. We got our first part in late January. Those instructions are already in the next processor, that we've taped out to IBM.</p> <h4 id="q7-that-s-the-c5j">Q7: That's the C5J?</h4> <p>A7: Yes. That's what I talked about [at the recent Embedded Processor Forum] out in California. I said there's four major processor types that we've shipped in four and a half years, but I take it back. There's five. The C5A, C5B, C5C, C5XL, and C5P. Five major designs that we've shipped in four and a half years. And the sixth is the C5J. It's headed for IBM.</p> <h4 id="q8-is-tsmc-still-your-normal-fab">Q8: Is TSMC still your normal fab?</h4> <p>A8:The old designs will still sell for quite a while. The new design is going to IBM. So we'll have both. We'll be shipping partial IBM and partial TSMC for quite a while. And we may go back again to TSMC in the future. This is normal business. We haven't burned any bridges. At least, we don't think we have. VIA does a tremendous amount of business with TSMC, and has a very close relation.</p> <p>As technology advances, no one remains the best in the world across all technology versions. Typically, in 0.13 [micron process], this person's better than this person. Then you go to 0.09, and it may be different.</p> <h4 id="q9-in-terms-of-competition-between-embedded-x86-processors-with-amd-and-intel-where-does-via-stand-for-example-in-comparison-with-the-amd-nx-line">Q9: In terms of competition between embedded x86 processors, with AMD and Intel, where does VIA stand? For example, in comparison with the AMD NX line?</h4> <p>A9: All I know is what's in the specs. [AMD's NX] looks suspiciously like a 32-bit Athlon...</p> <h4 id="q10-they-did-tell-us-it-was-a-tweaked-athlon-story">Q10: They did tell us it was a tweaked Athlon. [story]</h4> <p>A10: Right. And their power is reduced over their normal Athlon numbers, but it is still higher than ours. Let's be fair about it. If you wanted to build the fastest thing in the world, you'd choose the AMD part, or the Intel Pentium 4. However, we beat them on power -- both of them -- we beat them on cost, and we beat them on footprint [comparison chart at right, click to enlarge]. So, if what you want to build is a set-top box, or what you want to build is sort of a classic PC, or what you want to build is a thin client terminal, or what you want to build is a Webpad, our processor's performance is adequate and we win easily on cost, power, and footprint.</p> <h4 id="q11-how-do-you-position-yourself-relative-to-the-gx-the-old-geode-stuff">Q11: How do you position yourself relative to the GX, the old Geode stuff?</h4> <p>A11: Well, we blow it away in performance. I mean, it's a 400MHz part. It is a two-chip solution, and we are a three-chip solution today. I don't really know its specs, but power is probably close when you get down to the same Megahertz. It stops at 400MHz, while we start at 400MHz and go up to 1.5GHz. From 400 on up, we're faster, and our power is good. If 100MHz would do you, or 200MHz -- whatever their low point is, which I don't know exactly -- if that's good enough for you, then they will have lower power, because we don't go below 400MHz.</p> <p>We think they're squeezed into a really narrow niche in the world, because their performance is so low. 
Anything from 400MHz on up, in the power that goes with that, we win.</p> <p>So here's how we look at the world:</p> <p>From 400MHz to 1GHz, there's nothing but us, that's competitive.</p> <p>From 1GHz to 1.5GHz, we'll compete with the low-end of Intel and AMD.</p> <p>From 1.5GHz on up today, if that's the speed you want, then you choose AMD or Intel.</p> <p>My opinion is down at 400MHz and below, the GX has a very narrow slice. They're competing with things that have even better power than they do and good prices, the classical non-x86 embedded processors based on ARM and MIPS. And right on top of them, there's us. We have better performance and equal power at our low end, which is their high end. And, we stretch on up until we run into the bottom end of AMD and Intel.</p> <p>We think that our area, the 500MHz to 1.5GHz range, represents a potentially massive opportunity. What we're seeing is interesting. About 50 percent of our sales are going into non-PCs. We're not taking sales away from other people, as much as we are enabling new applications -- things that have historically not been done at all, because it either was too expensive, or the power was too high, or the software cost was too great.</p> <p>You know the VIA mini-ITX boards? [story] That is one of the smartest moves.</p> <p>All of our engineers like to play. You know, there are robots roaming around here; people have built things like that? One guy built a rocket controller last week -- you know, normal engineering work. What our engineers do is what I tell people to do: buy a mini-ITX board, and add your value-add with frames, cover, and software.</p> <p>It really is a major improvement over the classic thing where you had to do your own hardware. That board is so cheap: $70 to $150 down at Fry's, depending upon the variation. And it has all the I/O in the world. You can get 18 different versions of it, etc., you know the story. You can customize it using software, and you have a wide variety of operating systems to choose from, ranging from the world's most powerful, most expensive, to things that are free and very good.</p> <h4 id="q12-what-about-nano-itx-story-and-even-smaller-opportunities-like-pc-104-and-system-on-modules-there-are-standards-from-sbc-companies-like-kontron-and-20-other-companies-you-ought-to-have-a-devcon-like-arm-and-intel-do-to-enable-all-these-guys-do-you-do-anything-special-to-enable-them">Q12: What about nano-ITX [story] and even smaller opportunities, like PC/104 and System-on-Modules. There are standards from SBC companies like Kontron and 20 other companies. You ought to have a DevCon, like ARM and Intel do, to enable all these guys. Do you do anything special to enable them?</h4> <p>A12: I'm not sure of all that is done. VIA does make sample designs, board schematics, etc., available, and works with many customers on their unique solutions. And, of course, the mini-ITX does stimulate lots of designs.</p> <p>We have other customers, that are doing their own unique designs, and some of them we work with to make things small. Right now, we are a three-chip solution. We're working on reducing that to a two-chip solution. By three-chip solution, I mean processor, northbridge, and southbridge to make a complete system. That'll be reduced to a two-chip solution, at some point, which reduces the footprint even more.</p> <p>We've made a major improvement with the new package. You've seen that little teeny package we're shipping today, that we call the nano-BGA? 
[story] It's the size of a penny-type-thing. That's a real breakthrough on processor size. We need to get the other two chips boiled down to go much further.</p> <p>People are doing other designs. For example, one of the things VIA's touting -- I don't know much about the details of it -- there's a group that's doing this handheld x86 gaming platform? [story] It sits in your hand like a Gameboy, but it's got a full x86 platform under it. And the theory is, you can run PC games on it. It has a custom motherboard in it, using our processor.</p> <h4 id="q13-when-you-mentioned-reducing-from-three-chips-to-two-chips-that-reminded-me-of-transmeta-what-would-you-say-about-transmeta-as-an-x86-competitor">Q13: When you mentioned reducing from three chips to two chips, that reminded me of Transmeta. What would you say about Transmeta as an x86 competitor?</h4> <p>A13: Well, Transmeta has good power. They're really not any better than us, but we're better than the other guys. So I'd say, yeah, equal in power. They have okay footprint. They have two chips, but their two chips are big. I had a chart in the fall Microprocessor Forum showing that our three chips were the same size as their two chips. So there's not a big difference in footprint size. The difference is threefold.</p> <p>One is that they're very costly. In their last quarterly earnings conference call, they orally quoted an ASP [average selling price] of, I think, $70, which is ridiculously high. You notice they're losing mass amounts of money -- $20 million a quarter -- so they need to try and keep prices high.</p> <p>Two is the fact that our performance on real-world applications is much better than Transmeta's. They do well at simple repetitive loops, because their architecture is designed to do well there. But, for real applications with big working sets, rambling threads of control, etc., we beat them badly. For example, in office applications such as Word, Excel, etc. -- the bread and butter of what the real world does.</p> <p>The third is sort of subtle, and people miss it. Their platform is very restricted. Their two-chip solution only talks to two graphics chips that exist, in the world, right? And you have to choose. And if you don't like the memory controller they have, let's say you want X, etc. -- they only support certain memories, etc., etc.</p> <p>We have a processor that talks to 15 different northbridges, each of which talks to 17 different southbridges. You want four network things? Fine, you got four network things. You want an integrated graphics panel, AGP... fine. You can configure all of those things using the normal parts. And, what we've found in the embedded world is that one size definitely does not fit all. That's why VIA itself has done eight variations of the mini-ITX board, and keeps doing more, and other vendors have made lots of other variations. Some have four serial ports, some have one; some have four networks, some have none, and so forth and so on. The flexibility offered by these standard parts in the PC world is, I think, a significant advantage.</p> <h4 id="q14-you-are-chipset-compatible-with-intel-parts">Q14: You are chipset compatible with Intel parts?</h4> <p>A14: Oh, yes. We do most of our testing with Intel parts, just to make sure. We can drop into Pentium III motherboards, and do it all the time. In the embedded world, we sell primarily with VIA parts. 
They usually sell our processor either with a bag of parts, or built into the VIA mini-ITX motherboards, and now the nano-ITX [pictured at right, click to enlarge]. The other half of our business is in the world of low-cost PCs. There, it's whatever the board designer or OEM chose to use.</p> <p>That's always been part of our secret. Basically, my selling strategy for that part of the world [using low-end PCs] is, hey, &quot;Take your design, take your screwdriver, pry up that Pentium III, plug us in, and you'll save XX dollars.&quot; That isn't a very appealing thing in the United States. But if you go to the rest of the world, where people don't have as many dollars as we do, and where there isn't as much PC penetration, that's a reasonably appealing story. We've actually sold millions of parts, with that strategy.</p> <h4 id="q15-can-you-explain-your-packaging-variations">Q15: Can you explain your packaging variations?</h4> <p>A15: The same die goes into multiple packages, and goes into different versions of the product. The C3, that's a desktop part that has the highest power of our parts. Antaur, that's the mobile version. It has fancy, very sophisticated dynamic power management enabled. Eden is the embedded part. All Edens are fanless. They run low enough power to be fanless, at whatever speed they are. So, one of our dies ends up in three branding bins: the C3, the Antaur, or the Eden. When you buy an Eden, 1GHz, there are sub-variations that may not be obvious that actually distinguish whether it was a C5P or a C5XL. But both are branded as Eden 1GHz. So that's the branding strategy.</p> <p>Cutting across that, there are three packages. One is the ceramic pin grid array, which is compatible with Pentium III. The PC market usually has a socketed processor, so we have a package that's compatible with that. We also have a cheaper ball grid array version of that, which is a little smaller, and it's cheaper because it's not socketed. And then we have the nanoBGA, which is that little teeny 15 by 15 mm thing; and that's an expensive package, so we charge a little bit more for that. But for people where space is a concern, that's appropriate. That small size is only offered in Eden, because that small size is only appealing to embedded guys. The standard PC market in general doesn't care about small size. When they design a motherboard, they generally design it big enough to handle anybody's processors -- AMD's, Intel's, or ours. We fit easily into that world.</p> <h4 id="q16-our-reader-surveys-story-show-arm-overtaking-x86-in-new-embedded-project-starts-though-we-can-t-identify-the-mhz-range-of-those-projects-can-you-comment-on-the-competition-from-the-higher-ends-of-xscale-and-arm-and-mips">Q16. Our reader surveys [story] show ARM overtaking x86 in new embedded project starts -- though we can't identify the MHz range of those projects. Can you comment on the competition from the higher ends of XScale and ARM and MIPS?</h4> <p>A16: I think there's a class of applications where an x86 is clearly going to win, a class where it's clearly going to lose, and that always leaves a middle.</p> <p>One where it's going to lose is where your power budget, or your size budget, are really, really small. For example, the PDA, which has got an ARM in it. An x86 processor, even ours or Transmeta's, just consumes too much power. 
There isn't a size problem, probably, with our design -- there would be with Intel or AMD, where the chips just won't fit in that package [see size comparison photo at right]. That type of application, you're going to choose an ARM, or a MIPS.</p> <p>Where I think x86 wins easily, in my experience, is where, number one, the battery size versus runtime expectation is hours -- right? -- not days, not weeks. Or, there's a power supply. Set-top boxes, for example. They all want fanless parts, so they want really low power, but they have a power supply. They're not running on batteries, obviously.</p> <p>The other characteristic that I think really shows the advantage of x86, is where an application has a user interface of any sophistication, or where the application is doing something particularly sophisticated. Obviously, you can run Windows CE, and even Linux, on something the size of a PDA. But those things are very limited. I have a 400MHz Toshiba PDA, the 800, the latest one out. The 400MHz ARM is the fastest you can buy. I installed an aviation application on it last week, and it runs really slow. I wish I had 800MHz, or 1GHz.</p> <p>If you want very low power, there's really no choice. If you're not worried too much about the power, then there are two factors. One is how fast you need to run, and the other is the complexity of the software you need to run. Speed and software complexity favor x86.</p> <p>We're not going to take over the printer world, or the carburetor control world. What we are doing is pushing the size and power envelope down every year, so x86 is able to reach smaller sizes and lower power every year but still maintain the high ground.</p> <p>I made a joke when I attended the Embedded Processor Forum... I always talk at the fall Forum, you know, the Microprocessor Forum? And there, I'm always the slowest thing. I'm put in the group with 3GHz this, etc. But at this Embedded Forum, at 1GHz, we were far and away the fastest thing there, bar none.</p> <p>There is definitely a major performance gap. We may be slower than a Pentium 4, for example, but we're still a lot faster than the MIPS and ARM chips.</p> <h4 id="q17-except-a-geode-might-have-the-distinction-of-being-speed-competitive-with-the-arm-chips">Q17: Except a Geode might have the distinction of being speed-competitive with the ARM chips?</h4> <p>A17: I'll bet you a 400MHz ARM is a lot faster than a 400MHz Geode. There's an age-old penalty of x86. Plus, Geode's a three-year-old design. AMD hasn't changed it a bit. It doesn't have SSE, it's still using 3DNow, etc., etc. It doesn't have a number of other modern instructions that you'd like it to have. The ARM 400s are going to have lower power and faster performance. I don't actually know the prices in those markets, so I can't comment on that.</p> <h4 id="q18-we-d-like-to-get-you-to-talk-a-little-bit-more-about-linux-and-the-importance-of-linux-in-terms-of-centaur-s-success-and-how-the-fit-is">Q18: We'd like to get you to talk a little bit more about Linux and the importance of Linux in terms of Centaur's success, and how the fit is.</h4> <p>A18: I think there are three things. First of all, our theme all along is to be able to produce the lowest-cost PC platform that there is. When I was going around selling this, I was talking about the sub-$1,000 PC, which is now a joke. There are PCs being sold with us in them that are sub-$200. So, let's look at a $200 PC, and right next to it is a $100 off-the-shelf version of Windows XP.
The presence of a low-cost operating system is substantially more important at the low end of the low-price hardware market, which is our focus. I think it's a very big deal that we have a low-cost operating system to go with a sub-$200 PC.</p> <p>The second thing is the embedded space. In embedded -- I'm using that as a very broad term -- one of the characteristics is customization. People are building applications that do a particular thing well, be it a set-top box, or a rocket controller, or whatever it may be. Customization for hardware is probably easier to do on Linux than it is on Windows. If it's an application that needs to have close control of hardware, needs to run very fast, needs to be lean, etc., then a lot of people are going to want to do it on Linux.</p> <p>The third one, is one you haven't asked me about, this is actually my pet hobby, here -- we've added these fully sophisticated and very powerful security instructions into the...</p> <h4 id="q19-that-was-my-last-question">Q19: That was my last question!</h4> <p>A19: So the classic question is, hey, you built some hardware, who's going to use it? Well, the answer is, six months after we first started shipping our product with encryption in it [story], we have three or four operating systems, including Linux, OpenBSD, and FreeBSD, directly supporting our security features in the kernel.</p> <p>Getting support that quickly can't happen in the Microsoft world. Maybe they'll support it someday, maybe they won't. Quite honestly, if you want to build it, and hope that someone will come, you've got to count on something like the free software world. Free software makes it very easy for people to add functionality. You've got extremely talented, motivated people in the free software world who, if they think it's right to do it, will do it. That was my strategy with security.</p> <p>We didn't have to justify it, because it's my hobby, so we did it. But, it would have been hard to justify these new hardware things without a software plan. My theory was simple: if we do it, and we do it right, it will appeal to the really knowledgeable security guys, most of whom live in the free software world. And those guys, if they like it, and see it's right, then they will support it. And they have the wherewithal to support it, because of the way open software works.</p> <p>So those are my three themes, ignoring the fourth one, that's obvious: that without competition, Windows would cost even more. To summarize, for our business, [Linux is] important because it allows us to build lower-cost PC platforms, it allows people to build new, more sophisticated embedded applications more easily, and it allows us, without any software costs, to add new features that we think are important to the world.</p> <p>Our next processor -- I haven't ever told anyone, so I won't say what it is -- but our next processor has even more things in it that I think will be just as quickly adopted by the open source software world, and provide even more value.</p> <p>It's always bothered me that hardware can do so many things relatively easily and fast that aren't done today because there's no software to support it. We just decided to try to break the mold. We were going to do hardware that, literally, had no software support at the start. And now the software is there, in several variations, and people are starting to use it.
I actually think that's only going to happen in the open source world.</p> <h4 id="q20-we-d-like-a-few-words-from-you-about-your-security-strategy-how-you-ve-been-putting-security-in-the-chips-and-so-on">Q20: We'd like a few words from you about your security strategy, how you've been putting security in the chips, and so on.</h4> <p>A20: Securing one's information and data is sort of fundamental to human needs -- it's certainly fundamental to business needs. With the current world, in which everyone's attached to the Internet -- with most people's machines having back-door holes in them, whether they know it or not -- and with all the wireless stuff going on, people's data, whether they know it or not, is relatively insecure.</p> <p>The people who know that are using secure operating systems, and they're encrypting their data. Encrypting data has been around for a long time. We believe, though, that this should be a pervasive thing that should appear on all platforms, and should be built into all things.</p> <p>It turns out, though, that security features are all computationally intensive. That's what they do. They take the bits and grind them up using computations, in a way that makes it hard to un-grind them.</p> <p>So, we said, they're a perfect candidate for hardware. They're well-defined, they're not very big, they run much faster in hardware than in software -- 10 to 30 times, in the examples we use. And, they are so fundamental that we should add the basic primitives to our processor.</p> <p>How did we know what to add? We added government standards. The U.S. government has done extensive work on standardizing encryption protocols, secure digital signature protocols, and secure hash protocols. We used the most modern of government standards, built the basic functions into our chip, and did it in such a way that made it very easy for software to use.</p> <p>Every time you send an email, every time you send a file to someone, that data should be encrypted. It's going out on the Internet, where anyone with half a brain can steal it.</p> <p>Second, if you really care about not letting people have access to certain data that's on your hard drive, it ought to be encrypted, because half the PCs these days have some, I don't know what the right word is, some &quot;spy&quot; built into them, through a virus or worm, that can steal data and pass it back. You'll never get that prevented through operating system upgrades.</p> <p>I do have some background, sort of, in security: it's always been my hobby. The fundamental assumption you should make is, assume that someone else can look at what you're looking at. In other words, don't try to protect your data by assuming that no one's going to come steal your hard drive, or no one can snoop through a backdoor in Windows. You protect your data by saying, &quot;Even if they can see the data, what good is it going to do them?&quot;</p> <p>We think this is going to be a pervasive need. The common, if-you-will, person's awareness of worms and viruses has gone up a million percent in the last few years, based on all the problems. The awareness of the need to protect data is going to go up substantially, too.</p> <p>We're doing more than encryption, though. There's another need, which is coming, related to message authentication and digital signatures.</p> <p>We're encrypting all the time. Every time you buy something over the Web, your order is encrypted. So there is encryption going on already.
But the next major thing -- and this is already done in the high-security circles of banks -- is message authentication through digital signatures. How do you know someone didn't intercept that order, and they're sending in their own orders using your credit card number? How do you know, when you get a message from somebody, that they didn't substitute the word &quot;yes&quot; for &quot;no,&quot; things like that? These are very important in the world of security. They're well understood in the government world, or the high-security world, and there are government standards on how you do these things. They are called secure hashes, and things like that. So we've added features for those.</p> <p>To summarize, the things we've added fall into three categories. One is a good hardware random number generator. That was actually the first thing, and it's one of the hardest things to do. It sounds trivial, but it's actually very hard to generate randomness with any kind of process. It needs to be done in hardware. Software cannot generate random numbers that pass the tests that the government and others define.</p> <p>The second thing we did is a significant speedup in the two basic forms of encryption. One's called symmetric key encryption, and the government standard is AES, which is a follow-on to a thing called DES. So we do AES encryption very fast. The other form of encryption that's widely used is public key encryption, and the most common form there is a thing called RSA. That's what's being used, you know, for secure Web transactions. We think we're the only people who've done this: we added instructions, in our new processor that's coming, to speed up RSA.</p> <p>The third thing we've done is added what's called a secure hash algorithm. Again, it's a government standard. It's used for message authentication and digital signatures. It deals with the issue: if you send me an email, how do I know that the email I got was the one you sent? That it wasn't intercepted and changed? And more fundamentally, how do I know that it actually came from you? Anyone can put their name, in our world, on that email. Things like that. So there's got to be some code in that email that I can look at, and know that only you could have sent it. I can explain this more if you want to know.</p> <h4 id="q21-that-s-probably-sufficient-we-re-looking-more-for-the-strategy">Q21: That's probably sufficient. We're looking more for the strategy.</h4> <p>A21: Okay, let me back up. Our strategy was, assuming that we believe that security is fundamental and ought to be there, to define the primitive operations that need to be done as the building blocks of security. Those we put into hardware. We're not trying to impose a particular, I don't know, protocol or use. We're just making available the tools. We're doing it for free. The tools are in the processors, at no extra price. They don't require any OS support, no kernel support, no device drivers. It's getting into the kernels of BSD and Linux, but applications can directly use the features [even without kernel support], and the hardware takes care of the multitasking aspects.</p> <p>The two guys who worked on it with me are both heavy Linux users. They wrote to friends in the security and Linux communities. Very little marketing money was spent.</p> <p>When the security press release went out, at the Embedded Processor Forum, it had three key quotes, real quotes. Not quotes written by PR managers. My quote was written by a PR manager, but the others weren't. All three were big names in the security world, and all were saying good stuff.</p>
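<p>To make the three categories above concrete, here's a minimal sketch of what the corresponding operations look like from ordinary application code. It uses Python's standard library plus the third-party <code>cryptography</code> package (a recent version is assumed); the key sizes, messages, and variable names are made up for illustration and have nothing to do with VIA's actual hardware interface. The point is only that these are small, well-defined primitives that a library or kernel can route to hardware without the application changing.</p> <pre><code># Illustrative sketch only: generic library calls that mirror the three
# primitive categories described above. Key sizes, nonces, and messages
# are arbitrary examples, not anything specific to the VIA hardware.
import hashlib
import hmac
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# 1. Random numbers: ask the OS, which may in turn draw on a hardware
#    generator if one is present.
aes_key = AESGCM.generate_key(bit_length=128)
nonce = os.urandom(12)

# 2a. Symmetric (AES) encryption of a message.
message = b'order 1234: yes, ship it'
ciphertext = AESGCM(aes_key).encrypt(nonce, message, None)
assert AESGCM(aes_key).decrypt(nonce, ciphertext, None) == message

# 2b. Public-key (RSA) signing and verification, the kind of operation
#     an RSA speedup instruction would accelerate.
rsa_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)
signature = rsa_key.sign(message, pss, hashes.SHA256())
rsa_key.public_key().verify(signature, message, pss, hashes.SHA256())

# 3. Secure hash plus a keyed tag for message authentication: the
#    receiver can detect a substituted 'yes' or 'no'.
mac_key = os.urandom(32)
tag = hmac.new(mac_key, message, hashlib.sha256).digest()
received_tag = hmac.new(mac_key, message, hashlib.sha256).digest()
assert hmac.compare_digest(tag, received_tag)</code></pre> <p>On a machine where the operating system or crypto library is wired up to hardware primitives, calls like these can be accelerated transparently, which is roughly the adoption path described in the answer above.</p>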
<h4 id="q22-beyond-security-are-other-cool-features-planned">Q22: Beyond security, are other cool features planned?</h4> <p>A22: The next chip has some tools to do computationally intensive things where hardware provides a big advantage. But I don't want to say yet what they are.</p> <h4 id="q23-would-they-be-useful-for-multimedia">Q23: Would they be useful for multimedia?</h4> <p>A23: Yes, for multimedia, and for other things.</p> <h4 id="q24-like-a-dsp">Q24: Like a DSP?</h4> <p>A24: Kind of like that.</p> <h4 id="q25-okay-we-won-t-push-we-appreciate-you-taking-the-time-to-speak-with-us-we-can-t-imagine-getting-the-president-of-amd-or-intel-to-do-this">Q25: Okay, we won't push. We appreciate you taking the time to speak with us. We can't imagine getting the president of AMD or Intel to do this.</h4> <p>A25: Our whole strategy is so close to, if you will, the fate of Linux. We identify so much with it. We're low-cost, aimed at the common person, we're aimed at new applications, and we don't have any massive PR or marketing or sales budget, so. Actually, I have a special softness in my heart for Linux. I think without Linux our business would be much less than what it is today. It's just very important to us, so, I wanted to give you guys the time.</p> <h4 id="other-interviews">Other Interviews</h4> <p>If you liked this interview, you <a href="http://conservancy.umn.edu/handle/11299/120173">might also like this interview with Glenn on IBM history</a>.</p>