Team of Me

Mindset · What you can't see

AI is a black box, and that's not the problem it sounds like

You can read every number inside a model and still not know why it answered the way it did. That feels like it should disqualify the thing. It doesn't, because we already trust dozens of boxes we can't read. The move is to stop demanding to see inside and start learning to test the outside.

The closed box

Works anyway

You give it words, it gives you an answer. You don't need to see the machinery to judge whether the haiku is good. You read it.


Open the box and you learn nothing. A large model is a few hundred billion to a trillion numbers (weights) multiplied together in a particular order. Every one of them is right there on disk; nothing is hidden. And yet reading them tells you as much about why it wrote that haiku as reading the firing pattern of every neuron in a poet's brain would tell you why she chose that word. The information isn't concealed. It's just not legible. Meaning lives in the pattern of a trillion tiny interactions, not in any line you can point to.

"Black box" doesn't mean hidden. It means the behavior is observable but the internal cause isn't human-readable. The weights are public in an open model. The explanation still isn't. Transparency of the parts is not the same as understanding the whole.

Here's the part that took me a while to accept: we run our whole lives on black boxes and it's fine. Acetaminophen (Tylenol) has been sold for over 60 years and its exact mechanism is still debated; we trust it because it's been tested on millions of people, not because anyone can draw you the pathway. General anesthesia renders you unconscious by mechanisms that are still being worked out. You trust other people's minds every day without reading their neurons. We don't withhold trust until we can trace the cause. We grant it when the behavior is reliable and tested, and the failures are survivable.

The wrong question

"Show me exactly why you said that." A model can produce a reason after the fact, but that explanation is itself generated: a plausible story, not a readout of what actually happened inside. Demanding it can make you more confident and less safe.

The right question

"How do I check this is right?" Read the haiku. Verify the math. Look up the cited case. You evaluate the output the way you'd evaluate any expert you can't see inside: by testing what comes out, against something you can check.

None of this means opacity is harmless. It matters enormously that we can't fully audit a system before it denies a loan or reads a scan, and there's real, serious science (mechanistic interpretability) slowly prying the box open, finding individual concepts represented inside and learning to name them. That work is worth doing. But you, today, using this thing to write or think or build, don't have to wait for it. The skill that pays off now isn't seeing inside. It's building the habit of verification on the outside: knowing which outputs you can check yourself, which you must check, and which you should never trust unchecked.

The shift in one line. Stop asking the box to justify itself. Start asking yourself how you'd know if it were wrong, and make that check cheap and routine. That single habit is most of what "being good with AI" actually is.
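To make "cheap and routine" concrete, here is a minimal sketch of checking a claimed number by recomputing it yourself. The claim, the figures, and the helper name `matches_recomputation` are all invented for illustration; the only point is that the check runs outside the model.

```python
import math

def matches_recomputation(claimed, recomputed, rel_tol=1e-6):
    """True if the model's claimed figure agrees with your own calculation."""
    return math.isclose(claimed, recomputed, rel_tol=rel_tol)

# Hypothetical model claim: "17% of 2,340 is 379.8" (two digits transposed).
claimed = 379.8
recomputed = 0.17 * 2340  # your own arithmetic, done independently

print(matches_recomputation(claimed, recomputed))  # the transposed claim fails
print(matches_recomputation(397.8, recomputed))    # the correct figure passes
```

The check never asks the model to explain itself. It only compares what came out against something you can compute on your own.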

Sources: Anthropic, Towards Monosemanticity (2023) & Mapping the Mind of a Large Language Model (2024), on extracting human-interpretable features from a model · Olah et al., Distill, "Zoom In: An Introduction to Circuits" · Acetaminophen mechanism still debated; review literature in Annals of Palliative Medicine / pharmacology texts · EU GDPR Recital 71, the contested "right to explanation." Tiered: the limits of weight-level legibility are established; interpretability successes are real but partial; the brain analogy is illustrative, not a proof.

Mindset · How meaning is stored

Thinking in vector space when you came from Boolean

I grew up matching exact words: AND, OR, NOT, quotation marks. Then a model's page told me it had millions of "runs," and chasing that small confusion taught me the real shift: words aren't tokens you match, they're points in space, and what matters is how near they sit.

Nearest in meaning

By distance

Drag the ember point. The closest words light up, not because they share letters, but because they sit near each other in meaning-space.


For twenty years, a search was a set of exact words. You typed "new world screwworm" AND texas NOT cattle and the engine handed back documents containing those literal tokens. Boolean search is set membership: a page is either in the result or out of it, decided by whether the exact strings appear. It is precise, legible, and completely blind to meaning. Cat and kitten are as unrelated to it as cat and carburetor: different strings, different sets.
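A rough sketch of that set logic, assuming a four-document toy corpus I made up for the purpose. The helper `containing` is hypothetical and stands in for a real search index; it matches whole tokens exactly, which is the behavior described above.

```python
# Toy corpus; the documents and IDs are invented for illustration.
docs = {
    1: "new world screwworm detected in texas cattle herd",
    2: "new world screwworm outbreak reaches texas wildlife",
    3: "kitten adoption event in texas",
    4: "cat adoption event in austin",
}

def containing(term: str) -> set[int]:
    """IDs of documents that contain the exact word or phrase."""
    words = term.split()
    hits = set()
    for doc_id, text in docs.items():
        tokens = text.split()
        # Exact match on whole tokens, in order, for multi-word phrases.
        spans = range(len(tokens) - len(words) + 1)
        if any(tokens[i:i + len(words)] == words for i in spans):
            hits.add(doc_id)
    return hits

# "new world screwworm" AND texas NOT cattle
result = (containing("new world screwworm") & containing("texas")) - containing("cattle")
print(result)

# Searching "cat" never finds the kitten page: different strings, different sets.
print(containing("cat"), containing("kitten"))
```

AND is set intersection, NOT is set difference, and nothing in the code knows that a kitten is a small cat.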

The confusion that opened the door. I was looking at a model's page on a hosting site and saw it listed "3.4M runs." I asked, naively, "why does it have so many runs?" The answer is dull: a "run" is just one execution of that model; the count is popularity, how many times people pressed go. But the question I should have been asking was sitting one layer down: how does a thing made of numbers know that monarch belongs near queen?

The answer is that, somewhere along the way, words stopped being strings and became vectors: lists of numbers, coordinates of a point in a space with hundreds of dimensions. You can't picture three hundred dimensions, so the map above flattens it to two. But the idea survives the flattening: every word is a location, and words that mean similar things end up living in the same neighborhood. King, queen, prince, throne cluster together. Dog, puppy, cat sit somewhere else entirely. Nothing matched the letters. They were placed by meaning.

Where the coordinates come from

This isn't hand-drawn. In 2013, Mikolov and colleagues at Google published word2vec, a method that learns a word's vector from the company it keeps. Feed it enough text and it notices that dog and puppy show up in nearly interchangeable contexts ("my ___ needs a walk"), so it nudges their vectors close together. Do that across billions of words and a geometry of meaning falls out on its own, learned, never specified. That geometry is the thing your AI tools think inside.
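Here is a small sketch of "the company it keeps." To be clear, this is not word2vec itself, which trains a small neural network. It is a count-based stand-in that only records which neighbors each word appears between, on a six-sentence corpus I invented. The function names are mine.

```python
from collections import defaultdict

# Tiny invented corpus; real training uses billions of words.
sentences = [
    "my dog needs a walk",
    "my puppy needs a walk",
    "my dog chewed the shoe",
    "my puppy chewed the shoe",
    "my carburetor needs a rebuild",
    "the mechanic cleaned the carburetor",
]

def contexts(corpus):
    """Map each word to the set of (left, right) neighbor pairs it appears between."""
    seen = defaultdict(set)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            seen[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return seen

def shared_context_ratio(a, b, seen):
    """Jaccard overlap of two words' context sets; 1.0 means interchangeable here."""
    return len(seen[a] & seen[b]) / len(seen[a] | seen[b])

seen = contexts(sentences)
print(shared_context_ratio("dog", "puppy", seen))
print(shared_context_ratio("dog", "carburetor", seen))
```

In this toy corpus dog and puppy fill exactly the same slots, while dog and carburetor share only "my ___ needs." A learned embedding turns that kind of overlap into nearby coordinates.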

The famous trick: true, but tier it. The headline result was arithmetic on meaning: king − man + woman ≈ queen. Genuinely real, and genuinely startling. But it's illustrative, not a law. Later work showed the analogy is partly an artifact of how it's scored: the method often quietly excludes the input words, and most analogies are far messier than the cherry-picked one. Believe the geography; don't oversell the algebra.
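The scoring artifact is easy to see on toy data. The sketch below uses five hand-made three-dimensional vectors (invented, not real word2vec output) and a hypothetical `analogy` helper, and compares the answer with and without the input words allowed as candidates.

```python
import math

# Hand-made 3-d vectors, invented for illustration (not real word2vec output).
vectors = {
    "king":  (0.9, 0.5, 0.1),
    "queen": (0.8, 0.1, 0.5),
    "man":   (0.1, 0.5, 0.3),
    "woman": (0.1, 0.4, 0.4),
    "apple": (0.0, 0.1, 0.1),
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c, exclude_inputs):
    """Nearest word to vec(a) - vec(b) + vec(c) by cosine similarity."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if not (exclude_inputs and w in (a, b, c))]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", exclude_inputs=False))
print(analogy("king", "man", "woman", exclude_inputs=True))
```

With these made-up numbers the nearest point to king − man + woman is king itself; queen only wins once the three input words are removed from the candidate list. That is the pattern the later papers describe, shown here under my assumptions rather than measured on a real model.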

Boolean: the old set

Membership by exact string. A document is in or out. "Color" never finds "colour." Powerful when you know the precise word; useless the moment you only know the idea.

Vector: the new space

Membership by nearness. Closeness is measured as cosine similarity: the cosine of the angle between two vectors. Small angle, cosine near 1, related meaning. You retrieve by describing the meaning, and near-misses still come back.
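A minimal sketch of retrieval by nearness, assuming invented two-dimensional coordinates in place of real embeddings with hundreds of dimensions. The names `cosine_similarity`, `points`, and `nearest` are mine.

```python
import math

def cosine_similarity(u, v):
    """cos(angle) between u and v: 1.0 = same direction, 0.0 = perpendicular."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Invented 2-d coordinates standing in for real embeddings.
points = {
    "king": (0.9, 0.2),
    "queen": (0.8, 0.3),
    "dog": (0.2, 0.9),
    "puppy": (0.3, 0.8),
}

def nearest(query, k=2):
    """The k words whose vectors make the smallest angle with the query vector."""
    ranked = sorted(points, key=lambda w: cosine_similarity(query, points[w]), reverse=True)
    return ranked[:k]

print(nearest((0.85, 0.25)))  # a query dropped into the royalty neighborhood
print(nearest((0.25, 0.85)))  # a query dropped into the pets neighborhood
```

The query never has to equal any stored word. It only has to land near the right neighborhood, which is why a plain-language description can retrieve something a keyword would miss.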

That single change rewires how you should talk to these tools. On Boolean systems I hunted for the magic keyword, the exact phrase a document must contain, and a near-synonym got me nothing. In vector space the burden flips: I describe the meaning I'm after, in whatever plain words I have, and the model meets me in its own geometry and pulls back what sits nearby. You stop guessing the password and start pointing at a region.

The shift in one line. Boolean asks "which documents contain these exact words?" Vector space asks "what sits near this meaning?" Stop hunting for the keyword. Describe the idea and let nearness do the retrieving.

Which loops back to those millions of runs. The count was never the interesting number, but it taught me the lesson by accident. Popularity concentrates: a few models get pressed millions of times while most sit untouched, the way a few words sit dense in the middle of a corpus. Meaning has a geography too. Once you can feel that landscape, its clusters, distances, and neighborhoods, you stop typing at a search box and start navigating a space.

Sources: Mikolov, Chen, Corrado & Dean (2013), Efficient Estimation of Word Representations in Vector Space (word2vec) · Mikolov, Sutskever et al. (2013), Distributed Representations of Words and Phrases (the king−man+woman analogies) · Levy & Goldberg (2014) and Linzen (2016) on analogy results being partly an artifact of the scoring (excluding input terms) · cosine similarity, the standard linear-algebra measure of vector angle · "run count" on a model-hosting site (e.g. Replicate) = number of executions, a popularity metric. Tiered: word vectors and cosine similarity are established; word2vec's specific method is established; the clean king−man+woman≈queen analogy is illustrative and partly a scoring artifact; "run count = popularity" is a product fact.

Limits · The narrow world

AI hasn't learned the world, just the parts of it we wrote down

It folds proteins and crushes chess, then fumbles a task a toddler does without thinking. That gap isn't about how smart it is. It's about two things: how densely the skill was written down, and whether the answer can be checked. Map those two and the whole pattern snaps into focus, including where you should and shouldn't trust it.


Each dot is a thing humans do. Its spot left-to-right is how richly that skill was written down and how cheaply you can check a right answer. Its height is how good AI is at it. Tap one to see why it sits where it sits.


The same system can be superhuman and helpless in the same hour. Ask a frontier model to find the structure of a protein and it does in minutes what took a lab a year. Ask it to fold a pile of laundry, soothe a crying toddler, or figure out why your kitchen faucet drips, and it has nothing, or worse, a confident answer that's wrong. We reach for "it's smart at some things, dumb at others," but that's not a pattern, it's a shrug. There is a pattern, and once you see it you can predict where the thing will shine and where it'll fail before you ask.

The two dials. Competence tracks density (how many examples of this were written down) and verifiability (can a right answer be checked and fed back). High on both (chess, code, protein structure) and AI soars. Low on either (the feel of a ripe avocado, your house's quirks) and it's thin. Not because it's "dumb there." Because that knowledge was never in the corpus, and can't be cheaply graded.

Hans Moravec named the surprising half of this in 1988. The hard problems for AI, he noticed, are the ones evolution spent hundreds of millions of years on: seeing, walking, grabbing, balancing, reading a room. The "easy" problems, the ones we find impressive, are the recent thin veneer: algebra, chess, logic. A four-year-old's effortless perception and movement turn out to be the deepest, most computationally expensive thing the brain does; we just can't feel the effort. So the things any toddler does, and can't explain, are exactly the things machines find hardest. The things we hand out PhDs for are, comparatively, easy. That inversion is Moravec's paradox, and it's the spine of the whole map.

Why density matters

A model learns by reading. Where humanity wrote down millions of games, papers, and code samples, and where there's a clean right answer to grade against, the training signal is a firehose. Chess, Go, code, protein structures: dense and checkable, so the climb is steep.

Why so much is missing

The feel of dough that's ready, the weight of a sleeping child, the give of an old pipe, the mood of your house: almost none of it was ever typed out. It lives in hands and rooms, not text. The corpus can't contain what the species never wrote.

AlphaFold is the cleanest example of the bright corner of the map. Predicting how a protein folds is a brutally hard problem, but it's a problem with a dense, checkable shape: decades of experimentally-determined structures to train on, and a precise way to score how close a prediction lands. DeepMind's system (Jumper et al., 2021) reached near-experimental accuracy and reset a 50-year-old scientific challenge. Notice what made that possible. Not general intelligence: a rich pile of right answers, and a way to measure error and feed it back. Give AI that, in any field, and it tends to win. It almost never gets that for the things you do all day.

Now the dim corner. Try to give a robot the toddler's skill (pick up the soft toy, walk across the cluttered floor) and you hit the reality gap. Robot control policies are typically trained in simulation, where physics is clean, then deployed into the real world, where friction, light, wear, and slip never quite match. A policy that's flawless in sim falls over on the actual floor. Engineers fight this with "sim-to-real" tricks such as domain randomization, varying the simulation so wildly that the real world looks like just one more variation, and it's improving fast, but the gap is the point: the physical world is not text, and it does not hand you a clean answer to grade. Embodiment is the part the corpus can't carry.
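The randomizing trick can be sketched in a few lines. This is only the sampling half of the idea, with no simulator or policy attached, and every parameter name and range below is a hypothetical placeholder rather than a value from any real robotics setup.

```python
import random

def sample_sim_params(rng):
    """Draw one randomized simulation setup. Names and ranges are hypothetical."""
    return {
        "friction": rng.uniform(0.4, 1.2),
        "light_level": rng.uniform(0.2, 1.0),
        "object_mass_kg": rng.uniform(0.05, 0.5),
        "motor_slip": rng.uniform(0.0, 0.1),
    }

rng = random.Random(0)  # seeded so the sketch is repeatable
episodes = [sample_sim_params(rng) for _ in range(1000)]

# A policy trained across all of these has seen many slightly different "worlds".
# The hope is that the real floor's friction falls somewhere inside the sampled range.
frictions = [e["friction"] for e in episodes]
print(min(frictions), max(frictions))
```

Notice what the sketch cannot do: it can only cover variations someone thought to list. Whatever the real world does outside those ranges is still the gap.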

Polanyi's clue. In 1966 the philosopher Michael Polanyi put it in one line: "we know more than we can tell." You recognize a face in a crowd, ride a bike, hear that a joke landed wrong, and you cannot write down the rule you used. That's tacit knowledge, and because it was never told, it was never written, and because it was never written, it never entered the data. Most of human skill is tacit. So most of human skill is exactly the part AI never saw.

This is the practical turn, and it's good news for you. The map gives you a default. Trust AI most where the domain is text-dense and the answer is checkable: summarizing a document you can skim, drafting code you can run, recalling a fact you can verify, structured reasoning you can re-derive. Trust it least where the knowledge is tacit, physical, or specific to your life: the quirks of your own house, your kid's particular fears, the judgment your trade taught your hands. There it will sound just as fluent and be far less reliable. And notice the flip side: the knowledge that's hardest for the machine is the contextual, lived, hard-won knowledge that's most yours. That's not a gap to be embarrassed about. It's the half of the work that's still, distinctly, your job.

The move in one line. Before you trust an answer, ask two things: was this written down densely, and can I check it? Yes and yes: lean in. No to either: treat the fluent reply as a guess, and bring your own ground truth.

Sources: Moravec, Mind Children (1988), the paradox that perception and motor skill are hardest for machines · Jumper et al., "Highly accurate protein structure prediction with AlphaFold," Nature (2021) · Polanyi, The Tacit Dimension (1966), "we know more than we can tell" · sim-to-real / reality-gap robotics literature, e.g. domain-randomization work (Tobin et al., 2017) and surveys on transfer from simulation. Tiered: Moravec's paradox and Polanyi's tacit knowledge are established; the density-plus-verifiability framing is a useful illustrative synthesis, not a measured law; AlphaFold's accuracy and the existence of the reality gap are established; how best to close that gap is still an active research frontier.

Data · The scrubbed web

A model's worldview is the web, minus everything that got filtered out

Before a model reads a single page, the page has to survive a gauntlet of cleaning filters. Each one drops a slice of the web, and not a random slice. Step through the funnel and watch the pile shrink. What's left at the bottom is what the model learned the world is.

The raw web

Nothing dropped yet

Common Crawl is a giant public scrape of the open web: billions of pages, untouched. No model trains on this directly. It's far too noisy. Everything that follows is a decision about what to throw away.


A model never reads the web. It reads a version of the web that survived cleaning. When people say a language model "knows the internet," they're picturing the whole, messy thing: every page, every forum, every voice. What actually went in is narrower than that, by a lot. The raw scrape is enormous and mostly unusable: boilerplate, spam, broken markup, duplicate pages, sixty languages tangled together. So before training, the data passes through a pipeline of filters. Each filter is reasonable on its own. Together they decide whose web the model learns.

"Trained on the internet" is shorthand for something much more specific. A filtered snapshot, of a particular slice of the web, captured at a particular time, run through a particular set of cleaning heuristics. Understanding the filters is most of understanding the blind spots.

Start with the source. Common Crawl is a nonprofit that has scraped the public web for years and publishes the result openly, petabytes of it. It's the raw material behind most large text corpora. But raw Common Crawl is a firehose of noise, so every team that uses it cleans it first, and the cleaning choices are where the worldview gets set.
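A toy version of the funnel, to show the shape of it. The five "pages" are invented, and the two filters (exact deduplication and a crude sentence count) are simplified stand-ins I chose for illustration, not any team's actual pipeline.

```python
# Invented mini-"crawl"; real pipelines run over billions of pages.
pages = [
    "Buy now!!! Click here",
    "The council met on Tuesday. It approved the budget. Work starts in May.",
    "The council met on Tuesday. It approved the budget. Work starts in May.",
    "Home | About | Contact",
    "Sourdough needs time. Feed the starter daily. Bake when it doubles.",
]

def drop_duplicates(docs):
    """Keep the first copy of each exact-duplicate page."""
    return list(dict.fromkeys(docs))

def drop_thin_pages(docs, min_sentences=3):
    """Keep pages with at least min_sentences period-terminated sentences."""
    def sentence_count(text):
        # Crude heuristic: periods followed by a space, plus a final period.
        return text.count(". ") + text.endswith(".")
    return [d for d in docs if sentence_count(d) >= min_sentences]

stages = [("raw", lambda d: d), ("dedup", drop_duplicates), ("thin-page filter", drop_thin_pages)]
remaining = pages
for name, step in stages:
    remaining = step(remaining)
    print(f"{name}: {len(remaining)} pages left")
```

Each step is defensible on its own, and each one removes a particular kind of page rather than a random sample. A page written in fragments, or without periods, would fall to the second filter no matter how good it was.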

Take one famous filter. For GPT-2, OpenAI didn't want to hand-rate billions of pages, so they used a clever proxy for "is this page worth reading": they took every outbound link posted to Reddit that had received at least 3 karma, and scraped those pages. They called the result WebText. It's an elegant trick: let a crowd of humans pre-filter the web for you. But read what it actually selects for. The corpus's sense of "quality" became whatever Reddit's users, up through the end of 2017, chose to upvote and share. That's a real community with a real demographic skew and real blind spots. Pages that circulate in other communities, or in no community at all, simply weren't in the room.

What the karma filter keeps

Pages a particular, skewed slice of English-speaking internet found shareable in a particular window of years. Plenty of genuinely good writing, and a specific taste, a specific set of preoccupations, a specific sense of humor.

What it can't see

Communities that don't post to Reddit. Topics that don't trend. Whole languages and dialects. "3 karma" is a human quality signal, but it's those humans' signal, not a neutral measure of worth.
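The karma proxy is simple enough to sketch. The `Submission` record, the example URLs, and the function name are all hypothetical; only the threshold of 3 comes from the description above.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    url: str
    karma: int

# Invented submissions; the URLs are placeholders.
submissions = [
    Submission("https://example.com/long-essay", 57),
    Submission("https://example.org/niche-forum-thread", 1),
    Submission("https://example.net/tutorial", 3),
    Submission("https://example.com/long-essay", 12),  # same link, posted twice
]

MIN_KARMA = 3  # the threshold described for WebText

def links_worth_scraping(posts, min_karma=MIN_KARMA):
    """Unique outbound URLs whose submission reached the karma threshold."""
    kept = []
    for post in posts:
        if post.karma >= min_karma and post.url not in kept:
            kept.append(post.url)
    return kept

print(links_worth_scraping(submissions))
```

The niche forum thread drops out, not because anyone judged it bad, but because too few people in this one community clicked an arrow. A page that was never submitted at all does not even appear in the input.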

The next corpus made the costs visible. For the T5 model, Google built C4, the Colossal Clean Crawled Corpus (Raffel et al., 2020): Common Crawl, scrubbed with a stack of heuristics. Drop pages without enough real sentences. Drop boilerplate. And run a blocklist of "bad words," discarding any page that contained one. Sensible-sounding. Then in 2021, Dodge and colleagues did something the field hadn't: they actually documented C4, reading what the filters removed. The blocklist, it turned out, disproportionately scrubbed text written in minority dialects, and writing about LGBTQ+ identities, because the same words appear in reclaimed, in-group, and ordinary uses, not only as slurs. The "clean" corpus was cleaner of some voices than others. The same paper found the corpus also swept in sources nobody intended. Cleaning, in other words, is not neutral. It has a direction.

This is the load-bearing finding. A word-level blocklist can't tell a slur from its reclaimed use, or a dialect from an obscenity. So it removes them together, and the writing it removes belongs disproportionately to communities already underrepresented online. The filter did its job. Its job had a bias baked in.
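Here is why a word-level filter can't make that distinction, as a minimal sketch. The blocked word "zorp" is a made-up stand-in, the three pages are invented, and `passes_blocklist` is my own simplified version of a page-level blocklist check, not C4's actual code.

```python
# "zorp" is a made-up stand-in for a blocklisted word; the pages are invented.
BLOCKLIST = {"zorp"}

pages = {
    "abuse": "you are a zorp and should leave",
    "reclaimed": "our zorp pride meetup is on friday, everyone welcome",
    "news": "the council approved the budget",
}

def passes_blocklist(text):
    """True if no token of the page is on the blocklist. Context is never consulted."""
    return not (set(text.lower().split()) & BLOCKLIST)

kept = [name for name, text in pages.items() if passes_blocklist(text)]
print(kept)
```

The abusive page and the community's own event listing are dropped by the same rule. The function only sees tokens, so it has no way to treat them differently.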

Not every corpus leans only on Common Crawl. The Pile (Gao et al., 2020) deliberately mixed in curated sources (academic papers, code, books, structured text) to widen and raise the quality of what a model sees. That's a different lever on the same problem: instead of only subtracting noise, you add chosen signal. But "chosen" is the operative word. Curation is a worldview too. Every corpus, scrubbed or curated, is a set of decisions about what counts.

And all of it is frozen at a moment. A corpus is a snapshot in time: the web as it looked when the scrape ran. That's the real meaning of a model's "knowledge cutoff": not just that it stops at a date, but that it learned the world from a web that no longer exists in that form. The proportions, the live debates, the communities that have grown or vanished since: none of that updated. The map is fixed even as the territory keeps moving.

The pipeline, honestly

Filtering is necessary and mostly good. You cannot train on raw spam-soaked petabytes, and dedup, language filtering, and quality heuristics genuinely improve what comes out. None of this is an indictment.

The catch

Necessary isn't neutral. Every filter is a choice with a side effect, and the side effects compound. The model at the bottom learned from a minority of a minority, and it can't tell you what fell away.

Here's the practical takeaway, the thing I actually use. When a model is confidently bland, when it gives you the flattened, median, sanded-down take and seems blind to a community's real voice or a niche you know well, don't reach first for "the model is dumb." Suspect the filters. A flat answer often means the textured version of that topic lived in the part of the web that got scrubbed, deduped, or never upvoted into the corpus in the first place. The blandness isn't a flaw in the reasoning. It's the shape of the funnel showing through.

The shift in one line. Read a model's gaps as evidence about its training data, not just its intelligence. When a voice is missing or an answer is suspiciously smooth, ask which filter would have removed the richer version, and go find that version yourself.

Sources: Radford et al., 2019, Language Models are Unsupervised Multitask Learners (GPT-2 / WebText; outbound Reddit links with ≥3 karma) · Raffel et al., 2020, Exploring the Limits of Transfer Learning… (T5 / the C4 Colossal Clean Crawled Corpus & its blocklist) · Dodge et al., 2021, Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (filters disproportionately remove minority-dialect & identity-related text; unexpected included sources) · Gao et al., 2020, The Pile: An 800GB Dataset of Diverse Text (curated multi-source mix) · Common Crawl (commoncrawl.org), the primary public scrape. Tiered: the existence and mechanics of these filters are established; the documented bias of C4's blocklist is an established empirical finding (Dodge 2021); the "confidently bland → suspect the filters" diagnostic is illustrative, a useful heuristic, not a measured law.

Data · The language of the web

Why AI is fluent in HTML, and why it started handing you HTML instead of Markdown

The model writes HTML this well for a dull, deep reason: the web it learned from is made of HTML. And lately its answers have been quietly turning from plain Markdown into living, clickable pages. Both facts come from the same place. Flip the same answer between the two and see why.

Markdown: the old default

You read it

For years the model answered in Markdown: a few symbols that render to tidy text. Cheap in tokens, perfect for something you skim once. But it's inert; there's nothing to do but read.

Same answer, three ways. The rendered one is something you can actually use

The web is written in HTML, so the model learned to think in it. When a model trains on a scrape of the internet, a staggering share of what it reads isn't prose at all; it's markup. Every page is a tree of tags: <h1>, <p>, <ul>, <a href>, nested and closed in a rigid, repeating grammar. The model saw billions of examples of that grammar wrapped around real meaning. So fluency in HTML isn't a feature anyone bolted on. It's a side effect of the training data being, quite literally, HTML.
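To see the "tree of tags" directly, here is a small sketch using Python's standard-library `html.parser`. The sample page and the `OutlinePrinter` class are invented for illustration, and the sketch assumes every tag is explicitly closed.

```python
from html.parser import HTMLParser

class OutlinePrinter(HTMLParser):
    """Collects each opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

page = "<body><h1>Title</h1><ul><li><a href='/x'>link</a></li><li>two</li></ul></body>"
parser = OutlinePrinter()
parser.feed(page)
print("\n".join(parser.outline))
```

Every open has a matching close, and every level of indentation is a level of nesting. That rigid, repeating grammar is what the model saw billions of times.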

It's transfer, not special training. Researchers found that models pretrained on ordinary text, never specialized for the web, handle HTML tasks remarkably well, in one benchmark completing 50% more web-navigation tasks while using 192 times less fine-tuning data than purpose-built systems. HTML is just structured text, and structure is exactly what these models are built to predict. The patterns it learned from plain language carry straight over to angle brackets.

Which connects to something you may have noticed without naming it: sometime recently, the answers changed shape. You'd ask for a comparison or a little tool and instead of a wall of Markdown you'd get a thing: a styled card, a working calculator, a tabbed layout you could click. That's the second half of this story, and it has a clear turning point. In May 2026 an engineer on Anthropic's Claude Code team, Thariq Shihipar, published a piece called "The Unreasonable Effectiveness of HTML" that spread fast, and his argument is the cleanest version of it.

Why Markdown won first

It was the token economy. Every character the model emits costs compute and money, and Markdown gets you headings, lists, and bold with almost no overhead. When tokens were precious, the leanest format that still looked organized was the right call.

Why HTML is winning now

An abundance economy. Tokens got cheap, so the deciding question flipped from "what's cheapest to emit?" to "what actually gets used?" And the answer people reach for, click on, edit, and share is a rendered page, not a block of text.

That's the real reason for the shift, and it's worth saying plainly: the right metric was never token cost. It's whether the output gets used. A Markdown table you copy into a doc is fine. But a question that wants to be reviewed, compared, toggled, or handed to someone else is asking to be an interface, and HTML is the only output a model can emit that a browser turns into one for free. Anthropic's Artifacts and OpenAI's Canvas are both built on exactly this: the assistant stops handing you text about the thing and starts handing you the thing.

The quiet superpower: HTML is the one format where what the model writes and what you use are the same artifact. It needs no compiler, no install, no runtime; every browser on earth renders it, and has for thirty years. Tim Berners-Lee sketched the first HTML in 1991; a page from then still opens today. That backward-compatible, render-anywhere durability is why it was everywhere in the training data, and why it's a safe thing for a model to hand you.

So here's the practical lever, the thing I actually do now. When I want something to keep (notes, a record, anything that should stay readable and portable for years) I ask for Markdown (the family-record piece in this collection is exactly that case). When I want something to use (a quick tool, a comparison I'll click through, a little page to share) I ask for HTML, and I get an artifact instead of a description of one. Same model, same fluency, two different jobs. Knowing which to ask for is a small skill that pays off every single day.
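The two jobs are easy to compare side by side. This sketch renders the same invented two-row comparison as a Markdown table and as a bare HTML page; the data and function names are mine, and the HTML is deliberately unstyled.

```python
from html import escape

# Invented comparison data.
rows = [("Plan A", "$10"), ("Plan B", "$25")]

def as_markdown(rows):
    """A Markdown table: compact, and readable even as plain text."""
    lines = ["| Plan | Price |", "| --- | --- |"]
    lines += [f"| {name} | {price} |" for name, price in rows]
    return "\n".join(lines)

def as_html(rows):
    """A standalone HTML page that a browser can render directly."""
    body = "".join(
        f"<tr><td>{escape(name)}</td><td>{escape(price)}</td></tr>" for name, price in rows
    )
    return (
        "<!doctype html><html><body><table>"
        "<tr><th>Plan</th><th>Price</th></tr>"
        f"{body}</table></body></html>"
    )

print(as_markdown(rows))
print(as_html(rows))
print(len(as_markdown(rows)), len(as_html(rows)))  # character counts of each form
```

The Markdown version is shorter, which is the old token-economy argument. The HTML version is longer, but saved to a file it opens in any browser as a real table, and it is the form you can keep adding styling and behavior to.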

Sources: Gur et al., "Understanding HTML with Large Language Models" (Findings of EMNLP 2023; arXiv 2210.03945), showing text-pretrained LLMs transfer to HTML, ~50% more MiniWoB tasks with ~192× less fine-tuning data · Thariq Shihipar (Anthropic, Claude Code), "Using Claude Code: The Unreasonable Effectiveness of HTML" (May 2026), the token-economy → abundance-economy argument and "does it get used" as the real metric · Anthropic Claude Artifacts & OpenAI ChatGPT Canvas, HTML/interactive outputs as a first-class product surface · Google Search Central, "Should I use markdown for my site?" (Jun 2026), on Markdown's history & the llms.txt debate · Tim Berners-Lee, first HTML description, 1991. Tiered: HTML fluency as a training-data transfer effect is established (Gur 2023); the Markdown→HTML output shift and its framing are emerging: a 2026 product trend and one influential argument, not settled science; "ask for HTML to use, Markdown to keep" is my illustrative rule of thumb.

Practice · Working memory

How to work within a context window

A model has no memory of you between turns and a fixed amount of room to think this turn. Everything you give it, and everything it says back, competes for the same shelf. Learning where to put what is most of the craft.

Plenty of room

Healthy

Nothing's loaded yet but the system prompt. Tap blocks below to fill the window, and watch what the model has to give up.

Chat history (drag to let the thread grow): short

Add the doc, then grow the history. Watch the earliest block fall out the left, and the answer space shrink

The model doesn't remember you. It re-reads you, every single turn. There is no file on disk with your name on it, no running sense of who you are between sessions. Each time you hit send, the model is handed one block of text (the system prompt, your instructions, anything you pasted, and the conversation so far) and it answers from that and nothing else. That block has a hard size limit: the context window. Think of it less as long-term memory and more as a desk of a fixed size. Whatever's on the desk, it can use. Whatever isn't, it has genuinely never seen.

Working memory, not a relationship. Close the tab and the model forgets everything, not because it's being coy, but because there was never anything stored. If a detail matters next session, you re-supply it. The corollary: anything you do put in is competing for finite space with everything else, including the room the model needs to write its answer.

And the window is finite in a way that bites twice. First, when you exceed it, the oldest material simply scrolls out of view; the model can no longer see the start of a long thread even though you can. Second, and less obvious: even inside the window, attention isn't even. Liu and colleagues (2023) found models attend best to what sits at the very beginning and the very end of the context, and worst to what's buried in the middle; they called it "lost in the middle." A critical instruction dropped into the center of a wall of pasted text can be technically present and effectively invisible.
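The first bite, old material scrolling out, can be sketched as a budget calculation. This is an assumption-heavy toy: it counts whitespace-separated words instead of real tokens, the conversation is invented, and `fit_to_window` is a hypothetical helper, not how any particular product trims history.

```python
def fit_to_window(system_prompt, turns, budget, reserve_for_answer):
    """Keep the system prompt plus as many of the newest turns as fit.

    Sizes are counted in whitespace-separated words, a crude stand-in for tokens.
    """
    def size(text):
        return len(text.split())

    room = budget - reserve_for_answer - size(system_prompt)
    kept = []
    for turn in reversed(turns):  # walk from newest to oldest
        if size(turn) > room:
            break  # this turn and everything older drops out
        kept.append(turn)
        room -= size(turn)
    return [system_prompt] + kept[::-1]

turns = [
    "user: my name is Ada and I prefer metric units",
    "assistant: noted",
    "user: convert 5 miles",
    "assistant: about 8 km",
    "user: and 12 miles?",
]
window = fit_to_window("system: be concise", turns, budget=24, reserve_for_answer=6)
print(window)
```

With this budget the very first turn, the one carrying the name and the unit preference, is the one that falls out. You can still see it on screen; the model, in this sketch, cannot.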

Why bigger isn't a fix

Million-token windows exist, but attention cost grows with the square of the length (Vaswani et al., 2017), so long context is slower and more expensive, and the middle still gets read least. A bigger desk doesn't make you tidier; it just lets you bury more.

What actually helps

Curation. Decide what goes in, what stays out, and where it sits. Put the one line that matters near the top or the bottom. Paste the paragraph you need, not the report it came from. The skill is editing the input, not enlarging it.
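The "square of the length" claim is worth seeing as numbers. The sketch assumes plain full self-attention, where every token is scored against every other token, and ignores the optimizations real systems layer on top. The function name is mine.

```python
def attention_pairs(n_tokens):
    """Query-key scores computed by one full self-attention head: n * n."""
    return n_tokens * n_tokens

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} pairwise scores")
```

Doubling the context quadruples the pairwise work, and a thousandfold longer context means a millionfold more pairs. That is the arithmetic behind "a bigger desk" not being free.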

So working well with a model is, mostly, context curation: three quiet decisions you make before you ever read the reply, namely what to include, what to leave out, and where to place the part that must not be missed.

Put the key instruction near the top or bottom, never mid-wall.
Start a fresh chat when the thread gets long and off-topic; a clean window beats a bloated one.
Paste only the relevant slice of a document, not the whole thing.
Restate the goal every so often in a long session, so it stays in the high-attention end.

Skip

Dumping huge, mostly-irrelevant context "just in case"; it crowds out the answer.
Assuming it remembers last week, or even the top of this thread once it's long.
Burying the one critical line in the middle of a wall of text.
Treating a longer window as a reason to be sloppier; it isn't.
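The placement advice can be turned into a tiny template. This is a sketch of one way to do it: the `TASK`, `SOURCE EXCERPT`, and `REMINDER` labels are my own convention, not something any model requires, and the example instruction is invented.

```python
def build_prompt(instruction, excerpt):
    """Put the key instruction first and restate it last, with the pasted text between."""
    return "\n\n".join([
        f"TASK: {instruction}",
        f"SOURCE EXCERPT:\n{excerpt}",
        f"REMINDER: {instruction}",
    ])

prompt = build_prompt(
    "List every deadline mentioned, as ISO dates.",
    "Paste only the relevant paragraph of the report here.",
)
print(prompt)
```

The instruction sits at both high-attention ends of the window, and the pasted material, trimmed to the slice that matters, sits in the middle where being skimmed costs the least.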

The shift in one line. Stop thinking "I'll just tell it everything." Start thinking like an editor with a fixed page count: every word you add is a word of attention you spend, and the answer needs room too.

Sources: Liu, Lin, Hewitt, Paranjape, Bevilacqua, Petroni & Liang (2023), "Lost in the Middle: How Language Models Use Long Contexts" (TACL), on the start/end attention bias · Vaswani et al. (2017), "Attention Is All You Need", where self-attention scales quadratically with sequence length, which is why long context is costly · general transformer documentation on fixed context-length limits and per-turn statelessness. Tiered: per-turn statelessness and a fixed window are established; "lost in the middle" is an established finding with model-dependent strength; the desk metaphor and the bar's exact proportions are illustrative.

Mindset · The throughline

What you might have missed

Most people pick up AI by osmosis and never notice the handful of small shifts that separate fumbling from fluency. None of them are about prompts. They're about a posture, and once you see them, you can't unsee them.

Seven small turns

Each row is an old habit, carried over from the search-engine era, and the move that replaces it. Tap one to see why the new way wins. The difference is the whole skill.

Tap a row · old habit → new habit

Nobody sat you down and taught you this. You started using AI the way you'd use anything else: typed something in, got something back, learned by bumping into the edges. That works, up to a point. But the people who got fluent didn't learn better tricks. They quietly changed a handful of assumptions, most of them without noticing, and those changes are the difference. This piece is just me naming them out loud, so the change can be deliberate instead of accidental.

The shift in one word: posture. Fluency with AI is less about the words you type and more about how you hold the whole interaction: what you expect, what you check, and what you supply. The prompt is the smallest part.

Four turns sit under all seven rows. Describe, don't search. The old reflex hunts for the page that already has the answer; the new one describes the outcome and lets the model assemble it. Verify, don't assume. The old reflex trusts a confident tone; the new one treats every answer as a claim to be checked, and arranges the work so checking is cheap. Steer, don't one-shot. The old reflex chases the one perfect prompt; the new one starts a conversation and nudges it until it converges. Supply context, don't expect memory. The old reflex assumes the thing remembers you; the new one hands over what it cannot possibly know.

I want to be honest that this is synthesis: my read, not a law of nature. But it isn't arbitrary. It's grounded in how these systems actually behave. They're stochastic: the same prompt can yield different answers, so a single perfect prompt was never going to be the unit of skill; steering is. They carry no memory between turns except what's in front of them, so "remember what I told you yesterday" is a category error, not a glitch. And their fluency is no guide to their accuracy: they can sound exactly as sure when they're wrong as when they're right, which is why "ask it to be correct" does so little and "make the answer cheap to check" does so much.
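"Stochastic" is easy to see in miniature. The sketch below is a toy, not how any particular model is implemented: the three candidate words and their scores are invented, and sample_word is a hypothetical helper. It shows the standard idea of temperature sampling, where the next word is drawn at random with higher-scored words more likely.

```python
import math
import random


def sample_word(scores: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one word at random, weighted by exp(score / temperature)."""
    words = list(scores)
    weights = [math.exp(scores[w] / temperature) for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


# Made-up scores for the next word after "The pond was ..."
scores = {"calm": 2.0, "quiet": 1.8, "still": 1.5}
rng = random.Random(0)
draws = [sample_word(scores, temperature=1.0, rng=rng) for _ in range(200)]
print(sorted(set(draws)))  # the same "prompt" produces more than one answer
```

Lowering the temperature concentrates the draws on the top-scored word; raising it spreads them out. Either way, the unit you are working with is a distribution, not a single fixed reply.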

The habit that doesn't transfer

Knowing the magic phrasing. Phrasings drift across models and versions; a clever prompt that worked last month can quietly stop working. Memorizing incantations is the search-era reflex wearing new clothes.

The habit that does

Knowing what "right" looks like. The ability to recognize a good answer, smell a wrong one, and design a quick check is portable across every model, every tool, and every year. It's the part the machine can't supply.

If you've read the other essays, you've met these turns already, dressed differently. The black-box piece is the verify-don't-assume turn in full: you can't see inside, so you test the outside. The vector-space piece is describe-don't-search: meaning lives in a space of relationships, so you describe what you mean instead of guessing the keyword. The context-window piece is supply-don't-expect-memory: working memory is finite and resets, so you curate what the model is allowed to hold. This essay is just the seam that joins them: the recognition that they're one change in stance, not three separate tricks.

Why "missed" and not "wrong." None of the old habits were foolish. They were the correct moves for the tools we had: Boolean operators, exact syntax, pages you could read top to bottom. The shift isn't that you were doing it wrong. It's that the tool changed underneath you, and the old reflexes kept firing.

Here's the part that should land softest and matter most. As you hand more of the doing to the machine, the skills that stay yours aren't the mechanical ones. They're judgment: deciding what's worth making and what to trust. Taste: knowing the difference between fine and good. And the quiet, hard-won sense of what "right" looks like, which is the only thing that lets you catch the machine when it's confidently wrong. The model can draft, sort, summarize, and tirelessly try again. It cannot want the right thing, or know when it has arrived. That part is still, stubbornly, the work, and it's still yours.

Sources: Stochastic generation & lack of cross-session memory: primary model documentation and the standard transformer/decoding literature (e.g. Vaswani et al., 2017, on the architecture; temperature/sampling as standard practice) · Confidence uncorrelated with correctness: the model-calibration and "hallucination" literature broadly (calibration studies on large language models; surveys of factual reliability), here described as a field rather than a single paper · Sibling essays in this collection on the black box, vector space, and context windows. Tiered: that these systems are stochastic and contextless between turns is established; that fluency and accuracy are decoupled is established in pattern, with magnitude varying by model; the four-turns framing and the "seven shifts" are illustrative synthesis: my organizing scheme, not a settled taxonomy; the closing claims about judgment and taste are opinion, offered as such.

Craft · Talking to a machine about a thing it can't see

You can't say "move that" to something with no eyes, so I learned to name the parts first

Early on, instructing the model about a build I was making drove me a little mad. "Shift the thing near the other thing" works between two people looking at the same screen. It fails completely in text. The fix wasn't a better prompt. It was a map: a number and a name for every part, built before I tried to give a single instruction.

A panel with no words

Unaddressable

Right now you and the model are looking at the same picture and still can't talk about it. "That knob, no, the other one" goes nowhere in text. Switch on Numbers or Names, then tap a part.

Set a mode, then tap any part of the panel

The hardest part of telling a machine what to change was that I had no words for the parts. I was building something with a lot of pieces on a screen, and I wanted the model to adjust one of them. So I typed what I'd say to a person standing next to me: "make that thing under the top one a bit smaller, and move it closer to the other one." A person would glance up and know. The model could only guess, and it guessed wrong, and I'd correct it, and it would guess wrong again, politely, tirelessly, in a slightly different wrong direction each time. It was like giving directions over the phone to someone standing in a room I couldn't see.

The real failure. It wasn't that the model was dumb. It was that "that," "this," and "there" carry no meaning unless both of us already share a reference for what they point at. Standing together, our eyes supply that reference for free. In a chat box, it has to be manufactured.

The breakthrough was embarrassingly simple, and it took me far too long to reach it: I gave every part a number and a name. I built a little overlay, exactly the kind you just toggled, that could outline each piece, badge it with a digit, or label it in plain words. Suddenly I could write "shrink item 14" or "the intake valve sits too high" and the model knew, with no ambiguity at all, which one I meant. The same instruction that had failed a dozen times now landed on the first try. Nothing about the model had changed. I had simply given us a shared vocabulary.
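The idea fits in a few lines of code. This is a minimal sketch, not the overlay I built: the part names are invented, and Part, build_map, legend, and resolve are hypothetical names chosen for this example. The point it illustrates is that a reference either lands on exactly one part or fails loudly, instead of being guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    number: int
    name: str


def build_map(names: list[str]) -> dict[int, Part]:
    """Give every part a stable number, starting at 1."""
    return {i: Part(i, name) for i, name in enumerate(names, start=1)}


def legend(parts: dict[int, Part]) -> str:
    """The text you paste into the chat so both sides share the same labels."""
    return "\n".join(f"item {p.number}: {p.name}" for p in parts.values())


def resolve(parts: dict[int, Part], reference: str) -> Part:
    """Turn 'item 2' or 'left dial' into exactly one part, or fail loudly."""
    ref = reference.strip().lower()
    if ref.startswith("item "):
        number = ref[len("item "):]
        if number.isdecimal() and int(number) in parts:
            return parts[int(number)]
    for part in parts.values():
        if part.name.lower() == ref:
            return part
    raise KeyError(f"no part is called {reference!r}; name it first")


panel = build_map(["power switch", "left dial", "intake valve"])
print(legend(panel))
print(resolve(panel, "item 2"))
```

Asking for "that thing" raises an error here, which is the behavior you want: a missing name gets noticed instead of papered over with a plausible guess.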

Before the map

"Move that thing near the other thing." Every pronoun is a guess. The model fills the gap with a plausible interpretation, you both drift, and the back-and-forth multiplies. Effort goes into re-describing, not building.

After the map

"Item 14, the left dial, rotate it down." One number, one name, zero ambiguity. The instruction is checkable: you can both point at the same thing, so a correction actually corrects.

This is older than software. Linguists have a word for pointing-words (this, that, here, there, the one over there): deixis. The whole point of such words is that they only work when speaker and listener already share a frame of reference. Face to face, your gaze and your finger do the sharing. Over a wire, with a partner that has no eyes, you have to build the shared reference yourself, out of nothing but text. A numbered, named map is exactly that: a common ground laid down on purpose so that pointing words finally have something to point at.

The pattern under the pattern. Half of working well with AI isn't the prompt; it's noticing when there's no shared vocabulary yet, and building the small piece of scaffolding that creates one. The map is not the work. But often you have to make the map before the real work is even sayable.

I've come to think of it as a quiet rule. When an instruction keeps landing wrong, I stop rewording it and ask a different question: do we even have a name for the thing I'm pointing at? Usually we don't. So I make one (a label, a number, a fixed term I promise to reuse) and the conversation gets sharp again. Naming is humble work. It's also the precondition for everything precise that comes after it.

Sources: Deixis & reference: the linguistics of pointing-words (this/that/here/there) and their dependence on shared context, as treated in standard pragmatics and reference-grammar texts (e.g. Levinson, Pragmatics; Lyons, Semantics) · "Common ground" as a precondition for successful reference, in Clark, Using Language · The HCI principle that addressable, named elements enable precise control (the same reason interface elements carry IDs and accessibility labels). Tiered: deixis requiring shared reference is established linguistics; naming/addressability enabling precise instruction is an established design principle; the personal struggle and the specific build are illustrative / personal.

Practice · One true source

When someone you love is sick, write it down in plain text

The information lives in two portals, a pharmacy app, three printouts, and four people's half-memories, and no two of you hold the same picture. One canonical Markdown file fixes more of that than you'd think. It's just text: durable, portable, readable by a person today and a model tomorrow, and shareable without anyone needing an account.

Markdown: the source

Lasts 50 years

A few #, - and | characters in a text file. Ugly up close, but it's the durable thing: openable by any tool, on any machine, with no app to go out of business.

Same file, three ways. Toggle and watch nothing but the formatting change

It's the same file the whole time. Toggle between the raw view and the pretty one and you'll notice the words never move; only the formatting does. That's the quiet trick of Markdown: the thing people read and the thing that lasts are not two files you keep in sync. They're one file, shown two ways. You write a little plain text with a few symbols, and any tool that can read it will render it into something readable. Strip the rendering away and you've lost nothing: the record is still all there, in characters you could type on any keyboard made in the last forty years.

When someone in your family gets sick, the information doesn't arrive in one place. It's spread across a hospital portal and a different clinic's portal and the pharmacy's app, plus discharge printouts, plus the thing the nurse said on the phone that only one of you heard. Everyone who's helping holds a slightly different version, and the gaps between those versions are where mistakes live. A single canonical file, one source of truth that everyone reads from, collapses all of that into one picture you can actually trust.

Why plain text, specifically. No vendor owns it. No subscription can strip it from you. It opens in Notes, in a code editor, in an email, on a phone, on a machine that hasn't been built yet. It diffs cleanly, so you can see exactly what changed and when. And it's small enough to keep forever. The fanciest health app you adopt today is one acquisition away from disappearing; a .md file is just letters.

The discipline that makes it work is append-only. You don't rewrite history; you add to the bottom. Each entry gets a date. When a dose changes, you don't quietly edit the old number; you write a new dated line saying it changed, and on what day. Done this way the file becomes a timeline, not just a snapshot, and a timeline is what catches the slow drift no single visit reveals: the weight creeping up, the dizziness that started the week before the medication changed. The old number, kept, is what lets you see the trend.
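You can keep this discipline by hand, but if you like a small helper, the append-only rule is a few lines of Python. A minimal sketch under stated assumptions: append_entry and the file name care-notes.md are hypothetical, and the two notes are invented placeholders, not medical guidance.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional


def append_entry(path: Path, note: str, on: Optional[date] = None) -> str:
    """Add one dated line to the end of the file; earlier lines are never touched."""
    line = f"- {(on or date.today()).isoformat()}: {note}\n"
    with path.open("a", encoding="utf-8") as f:  # "a" opens the file for appending only
        f.write(line)
    return line


# Demo in a throwaway folder so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as folder:
    log = Path(folder) / "care-notes.md"
    append_entry(log, "Evening dose: 2 tablets.", on=date(2024, 4, 2))
    append_entry(log, "Evening dose changed from 2 tablets to 1.", on=date(2024, 4, 9))
    timeline = log.read_text(encoding="utf-8")

print(timeline)
```

The change is recorded as a second dated line rather than an edit to the first, so the old number stays visible and the file reads as a timeline.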

Scattered (the default)

Two portals, a pharmacy app, three printouts, and four relatives' memories. No two pictures match; the most current fact lives in whoever happened to be on the last call. Every handoff loses a little.

One file (the move)

A single dated text file everyone reads from. Updates go at the bottom. Anyone can open it, no account required. The family argues less because there's one thing to point at.

Because it's plain text, it's also machine-readable, and that turns out to matter more than it sounds. Paste the whole file into a model and ask it to summarize what changed since April, or to spot a trend across the timeline, or to draft the three questions worth asking at the next appointment, and it can, because everything it needs is right there in legible characters. It isn't pulling from a hidden database or guessing; it's reading the same words you wrote. The same property that makes the file durable for humans makes it usable by a model. You don't have to choose.

A note on what this is, and isn't. This file complements the medical record; it never replaces it. It is not a diagnosis and not a chart. It's your family's working notes, a way to walk into a visit prepared and walk out without losing what was said. The clinicians keep the authoritative record. This keeps you all reading from the same page in between.

And there's a part that has nothing to do with formats. Keeping this file is itself an act of attention. To write the line, you have to have noticed the change. To keep it for someone you love is to keep paying attention on the days it's tedious and the days it's frightening. When you share it with the rest of the family, you're not just handing them data; you're handing them one calm, true picture in a situation that mostly delivers chaos. The plain text is the durable part. The caring is the point.

Sources: John Gruber, Markdown (2004), daringfireball.net, the original specification of the plain-text formatting syntax · the long-standing "plain text lasts" / future-proofing case made across the plain-text-notes and digital-preservation communities · "single source of truth" and append-only / immutable-log as established data-keeping principles (event-sourcing, version-control practice). Tiered: plain-text durability and portability are established; Markdown as the de-facto lightweight format is established; using a model to summarize a personal record is illustrative of the practice, not a clinical recommendation; and this file complements, never replaces, the medical record and the people who keep it.

Building · Rent vs. own

You don't have to build the whole machine, just own one stage of it

Calling an AI API gives you a finished result you can't shape. Owning even one stage of the pipeline (picking the model, the voice, the parameters) turns you from someone who takes what he's given into someone who makes something that's actually his.

The voice model stage

The stage worth owning

This is the stage I chose to own. Instead of accepting whatever one generic voice gave me, I picked the model and the voice myself, and built a page to compare candidates side by side before committing.

Pick a stage below, then flip between renting and owning it

There's a quiet difference between renting an AI feature and owning a stage of it. When you call a finished API, you hand over your input and take back whatever comes out. It works, and for a lot of jobs that's exactly right. But you can't reach inside. You can't pick the voice, nudge the parameters, or swap the engine for a better one next month. You get one result, shaped by someone else's defaults, and your only move is to accept it or leave. That's renting. You're a tenant in someone else's box.

Owning a stage is different. A pipeline is just a chain: input goes in, passes through a few transformations, a result comes out the other end. Owning one of those transformations means you choose the model that runs it, you tune how it behaves, and crucially, you can compare. You can line up three options and listen, or look, and pick the one that's actually right rather than the one that was handed to you. You don't have to own the whole chain. You just have to find the one link where ownership gives you leverage.

Rent vs. own, in one line. Renting gives you a black-box result you can't shape. Owning a stage gives you the model, the parameters, and the ability to compare and swap, which is also the ability to develop taste about what "good" even means here.
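In code, "a pipeline is just a chain" and "own one link" look roughly like this. It is a sketch with toy stand-ins, not the build described in this essay: Stage, run_pipeline, and audition are hypothetical names, and the lambdas only tag text where a real stage would call a real model.

```python
from typing import Callable

Stage = Callable[[str], str]


def run_pipeline(text: str, stages: list[Stage]) -> str:
    """Pass the input through each stage in order."""
    for stage in stages:
        text = stage(text)
    return text


def audition(text: str, before: list[Stage], candidates: dict[str, Stage],
             after: list[Stage]) -> dict[str, str]:
    """Run the same input through every candidate for the one stage you own."""
    return {
        name: run_pipeline(text, [*before, candidate, *after])
        for name, candidate in candidates.items()
    }


# Toy stand-ins: real stages would call real models.
tidy: Stage = lambda t: " ".join(t.split())
candidates: dict[str, Stage] = {
    "voice_a": lambda t: f"<voice A> {t}",
    "voice_b": lambda t: f"<voice B> {t}",
}
package: Stage = lambda t: f"[audio: {t}]"

results = audition("  good   morning ", [tidy], candidates, [package])
for name, output in results.items():
    print(name, "->", output)
```

The rented stages (tidy, package) stay fixed; only the owned stage varies, so any difference between the outputs comes from the choice you are actually making.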

Here's where it got concrete for me. I was building something that needed to speak, to turn text into a voice you'd actually want to listen to. The easy path was to call a generic text-to-speech service and take its default voice. It would have worked. But the voice was the soul of the thing, and I didn't want to be a tenant in that decision. So I owned that one stage. I picked my own speech models, generated the audio myself, and, because I couldn't tell from a spec sheet which voice was right, I built a page to audition and compare Japanese voices side by side. Same lines, different engines, played back to back, until the right one was obvious.

What renting costs you

One voice, chosen by someone else's defaults. No way to compare, no way to tune, no way to swap when something better appears. The result is fine, but it's theirs, not yours. You never develop an ear for the choice because you never made it.

What owning bought me

A short list of candidates I could play side by side, judge with my own ears, and choose deliberately. Control over the parameters. The freedom to swap engines later. And a result that carried a decision I actually made, which is what makes it feel like mine.

I'm not an engineer, and a year ago wiring real models together myself would have been out of reach. What changed is that an AI coding assistant let me do the plumbing (pull a model down, feed it text, capture the audio, stand up a little comparison page) without needing to be the person who could write all of that from scratch. That's the second quiet shift. The same tools that tempt you to rent everything are also what make owning a stage finally possible for someone like me.

The takeaway. You don't have to own the whole stack. Look at the pipeline you're building, find the one stage where the result really matters (the voice, the retrieval, the ranking, whatever carries the soul of the thing) and own that one. Rent the rest. Owning even a single stage is the difference between assembling someone else's product and making something that's yours.

Sources: Modern neural text-to-speech and voice cloning as a field, the category of systems that synthesize natural speech from text and can mimic a target voice (the broad neural-TTS / voice-conversion literature; no specific paper claimed) · a page I built to audition and compare Japanese voices side by side, as a primary artifact of the "own a stage" pattern. Tiered: that high-quality neural TTS exists and is widely usable is established; the existence and behavior of the comparison page is a primary artifact; the "rent vs. own / own the bottleneck" framing is illustrative, a way of thinking, not a measured result.

Practice · Learning a language

AI didn't give me a language; it gave me a tutor that never gets tired

The principles that actually grow a second language are decades old, and one is more than a century old: understand messages just above your level, review a word right before you'd forget it, and study the real sentences you actually met. AI didn't replace any of that. It made all three cheap, and let me build the small tools that fit my own gaps.

Comprehensible input

i + 1

Tap any word in the sentence. The reading and meaning appear instantly; the message is now one notch inside your reach. That "just a little above you" is the whole engine.

Tap a word above · then rate it to schedule the next review

For years I thought learning a language meant finishing a course. Buy the app, do the streak, get to the end. I never got to the end. What finally moved the needle was the opposite shape: stop consuming someone else's course, and start assembling the exact practice my own weak spots needed. AI is what made that possible for a non-engineer, not because it "teaches languages," but because it can do three old, well-understood things on demand, for free, without ever getting bored of me.

The reframe. The durable principles are decades old. AI is not a new theory of learning; it's a patient, infinite tutor that makes the old principles cheap enough to actually run, every day, on your own sentences.

The first principle is comprehensible input. Stephen Krashen's argument is that you acquire a language mainly by understanding messages a little beyond your current level, what he wrote as "i + 1." Not drilling grammar tables; understanding things you almost understand. The bottleneck was always supply: finding a steady stream of material pitched exactly at your edge is hard. A model can take any sentence you actually met and gloss it instantly (reading, meaning, the nuance you'd otherwise skip), turning "too hard to bother" into "one notch inside reach." That's the tap in the demo above.

The second is spaced repetition. In 1885 Hermann Ebbinghaus measured how fast he forgot nonsense syllables and drew the forgetting curve: memory decays, but each well-timed review flattens the next decay. Algorithms like Leitner's boxes and SM-2 turn that into a schedule: review a card right before you'd lose it, and the interval grows from minutes to a day, then days, then weeks. The "Again / Good / Easy" buttons in the demo do exactly this in miniature: rate how it felt, watch the next interval stretch. AI's contribution here is small but real: it can generate the cards, write fresh example sentences for a word, and explain why you keep missing one.
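For the curious, here is roughly what such a schedule looks like in code. It is a simplified sketch in the spirit of SM-2, not the exact code behind any app or the demo on this page: Card and review are names chosen for this example, ratings run from 0 (blackout) to 5 (easy), and this version starts at one day rather than minutes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    interval_days: int = 0   # days until the next review
    repetitions: int = 0     # successful reviews in a row
    ease: float = 2.5        # how fast the interval grows


def review(card: Card, quality: int) -> Card:
    """Return the card's next state after a review rated 0 (blackout) to 5 (easy)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality < 3:
        # Forgotten: start the sequence over, keep the ease as it was.
        return Card(interval_days=1, repetitions=0, ease=card.ease)
    if card.repetitions == 0:
        interval = 1
    elif card.repetitions == 1:
        interval = 6
    else:
        interval = round(card.interval_days * card.ease)
    miss = 5 - quality
    ease = max(1.3, card.ease + 0.1 - miss * (0.08 + miss * 0.02))
    return Card(interval, card.repetitions + 1, ease)


card = Card()
for rating in (4, 4, 4):
    card = review(card, rating)
    print(card.interval_days, "day(s) until the next review")
```

Three solid reviews in a row stretch the gap from one day to six to about two weeks. A failed review drops the card back to a one-day interval, which is the "Again" case.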

Understand it now

Comprehensible input. A model glosses any real sentence (reading, meaning, nuance) so material just past your level becomes readable instead of overwhelming. The supply problem disappears.

Schedule it to stick

Spaced repetition. Each word you meet becomes a card reviewed right before you'd forget it. The interval grows with each success; the forgetting curve gets flatter every pass.

The third is immersion and sentence mining: the community practice of pulling real sentences from things you're actually reading or watching, and turning each one into a study item rather than memorizing word lists in a vacuum. The point is that you study language you've already encountered in context, so the card carries meaning, not just a definition. This is where the first two principles meet: mine a sentence you met (input), make a card from it (repetition), and let the schedule decide when it comes back.

Here's the part I didn't expect. The winning move wasn't finding a bigger, smarter app; it was building tiny ones. A non-engineer can now wire a model into a small tool that does one thing his learning needs: a reader where you tap a word to gloss it, a deck that schedules your own mined sentences, a drill for the exact sound you keep mangling. Each is a weekend's worth of glue around a model. None of them is impressive software. All of them are fitted to a gap a generic course can't see, because the course doesn't know which words I keep forgetting. That fit is the entire advantage.

Build for your gap, not the average learner's. A course optimizes for everyone, which means it's wrong for you in a hundred small ways. The smallest tool you build around your own sentences beats the biggest app built for no one in particular.

One honest caution. A model is a fluent, confident guide and it is sometimes wrong about subtle language facts: a pitch accent, a register, whether a phrase is natural or merely grammatical, which of two near-synonyms a native would actually say. For anything load-bearing, verify against a dictionary or a native source before you bake it into a card you'll review fifty times. A wrong card is worse than no card: spaced repetition will faithfully drill the error into you. Use the tutor for breadth and instant glosses; reach for a trusted reference for the things you're about to memorize.

Sources: Krashen, The Input Hypothesis (1985) and the comprehensible-input / "i + 1" framework · Ebbinghaus, Über das Gedächtnis (1885), the forgetting curve · spaced-repetition algorithms: Leitner's box system and the SM-2 algorithm behind SuperMemo/Anki · sentence mining / immersion as a documented language-learning community practice. Tiered: Krashen's input hypothesis is influential and widely used but contested (emerging), not settled science; the retention benefit of spaced repetition is well-evidenced (established); the "build small personal tools / mine your own sentences" workflow is illustrative of what's now possible, not a measured result.

A place to explore
the journey

Learning to think
in a new kind of machine.

AI is a black box, and that's not the problem it sounds like

Thinking in vector space when you came from Boolean

Where the coordinates come from

AI hasn't learned the world, just the parts of it we wrote down

A model's worldview is the web, minus everything that got filtered out

Why AI is fluent in HTML, and why it started handing you HTML instead of Markdown

How to work within a context window

What you might have missed

You can't say "move that" to something with no eyes, so I learned to name the parts first

When someone you love is sick, write it down in plain text

You don't have to build the whole machine, just own one stage of it

AI didn't give me a language; it gave me a tutor that never gets tired
