Before a model reads a single page, the page has to survive a gauntlet of cleaning filters. Each one drops a slice of the web, and not a random slice. Step through the funnel and watch the pile shrink. What's left at the bottom is what the model learned the world is.
A model never reads the web. It reads a version of the web that survived cleaning. When people say a language model "knows the internet," they're picturing the whole, messy thing: every page, every forum, every voice. What actually went in is narrower than that, by a lot. The raw scrape is enormous and mostly unusable: boilerplate, spam, broken markup, duplicate pages, sixty languages tangled together. So before training, the data passes through a pipeline of filters. Each filter is reasonable on its own. Together they decide whose web the model learns.
"Trained on the internet" is shorthand for something much more specific. A filtered snapshot, of a particular slice of the web, captured at a particular time, run through a particular set of cleaning heuristics. Understanding the filters is most of understanding the blind spots.
Start with the source. Common Crawl is a nonprofit that has scraped the public web for years and publishes the result openly, petabytes of it. It's the raw material behind most large text corpora. But raw Common Crawl is a firehose of noise, so every team that uses it cleans it first, and the cleaning choices are where the worldview gets set.
Take one famous filter. For GPT-2, OpenAI didn't want to hand-rate billions of pages, so they used a clever proxy for "is this page worth reading": they took every outbound link posted to Reddit that had received at least 3 karma, and scraped those pages. They called the result WebText. It's an elegant trick: let a crowd of humans pre-filter the web for you. But read what it actually selects for. The corpus's sense of "quality" became whatever Reddit's users, in the years up to the end of 2017 (the cutoff reported for the links used in the GPT-2 paper), chose to upvote and share. That's a real community with a real demographic skew and real blind spots. Pages that circulate in other communities, or in no community at all, simply weren't in the room.
What the karma filter keeps
Pages that a particular, skewed slice of the English-speaking internet found shareable in a particular window of years. Plenty of genuinely good writing, and a specific taste, a specific set of preoccupations, a specific sense of humor.
What it can't see
Communities that don't post to Reddit. Topics that don't trend. Whole languages and dialects. "3 karma" is a human quality signal, but it's those humans' signal, not a neutral measure of worth.
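To make the mechanics concrete, here is a minimal sketch of a karma-threshold filter. It is not OpenAI's code: the `Submission` record, the `karma_filter` function, the community names, and the example URLs are all hypothetical. Only the "at least 3 karma" threshold comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """One link posted to a community (hypothetical record layout)."""
    url: str
    karma: int
    community: str


MIN_KARMA = 3  # threshold described for WebText


def karma_filter(submissions, min_karma=MIN_KARMA):
    """Return unique URLs whose submission met the threshold, in first-seen order."""
    kept = []
    seen = set()
    for sub in submissions:
        if sub.karma >= min_karma and sub.url not in seen:
            seen.add(sub.url)
            kept.append(sub.url)
    return kept


# Illustrative data only; none of these are real submissions.
submissions = [
    Submission("https://example.com/essay", 57, "community_a"),
    Submission("https://example.org/niche-zine", 1, "community_b"),
    Submission("https://example.net/howto", 3, "community_c"),
    Submission("https://example.com/essay", 12, "community_d"),
]

kept_urls = karma_filter(submissions)
print(kept_urls)
# ['https://example.com/essay', 'https://example.net/howto']
```

Notice what the input already excludes. A page nobody posted never becomes a `Submission`, so the filter never gets the chance to judge it.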
The next corpus made the costs visible. For the T5 model, Google built C4, the Colossal Clean Crawled Corpus (Raffel et al., 2020): Common Crawl, scrubbed with a stack of heuristics. Drop pages without enough real sentences. Drop boilerplate. And run a blocklist of "bad words," discarding any page that contained one. Sensible-sounding. Then in 2021, Dodge and colleagues did something the field hadn't: they documented C4 in detail, including what the filters had removed. The blocklist, it turned out, disproportionately scrubbed text written in minority dialects, and writing about LGBTQ+ identities, because the same words appear in reclaimed, in-group, and ordinary uses, not only as slurs. The "clean" corpus was cleaner of some voices than others. The same paper found the corpus also swept in sources nobody intended, such as patents and US military websites. Cleaning, in other words, is not neutral. It has a direction.
This is the load-bearing finding. A word-level blocklist can't tell a slur from its reclaimed use, or a dialect from an obscenity. So it removes them together, and the writing it removes belongs disproportionately to communities already underrepresented online. The filter did its job. Its job had a bias baked in.
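A small sketch shows why the filter can't make that distinction. This is an assumed, simplified version of page-level blocklist filtering, not the C4 implementation: `flagword` is a neutral placeholder standing in for a real blocklisted term, and the function names and tokenizer are made up for illustration.

```python
import re

# Placeholder term; a real blocklist would hold actual words.
BLOCKLIST = {"flagword"}

# Deliberately naive tokenizer: lowercase alphabetic runs and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def contains_blocked_word(page_text, blocklist=BLOCKLIST):
    """True if any token on the page is in the blocklist, regardless of context."""
    tokens = TOKEN_RE.findall(page_text.lower())
    return any(token in blocklist for token in tokens)


def blocklist_filter(pages, blocklist=BLOCKLIST):
    """Keep only pages with no blocklisted token. One hit drops the whole page."""
    return [page for page in pages if not contains_blocked_word(page, blocklist)]


pages = [
    "You are a flagword and should leave.",         # hostile use
    "We call ourselves flagword with pride now.",   # reclaimed, in-group use
    "A history of how the term flagword changed.",  # writing about the word
    "A recipe for lentil soup.",                    # unrelated page
]

kept = blocklist_filter(pages)
print(kept)
# ['A recipe for lentil soup.']
```

The function only ever sees tokens, never intent. The hostile page, the reclaimed use, and the history of the term all look identical to it, so all three go.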
Not every corpus leans only on Common Crawl. The Pile (Gao et al., 2020) deliberately mixed in curated sources (academic papers, code, books, structured text) to widen and raise the quality of what a model sees. That's a different lever on the same problem: instead of only subtracting noise, you add chosen signal. But "chosen" is the operative word. Curation is a worldview too. Every corpus, scrubbed or curated, is a set of decisions about what counts.
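Adding chosen signal usually comes down to a set of mixing weights, and the weights are where the choice lives. The sketch below is illustrative only: the source names and numbers in `SOURCE_WEIGHTS` are invented and are not The Pile's actual composition, and `mixture_shares` and `sample_sources` are hypothetical helpers.

```python
import random
from collections import Counter

# Invented weights for illustration; not any real corpus's mixture.
SOURCE_WEIGHTS = {
    "web_crawl": 5,
    "academic_papers": 2,
    "code": 2,
    "books": 1,
}


def mixture_shares(weights):
    """Normalize raw weights into the fraction of training text per source."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def sample_sources(weights, n, seed=0):
    """Draw n source names in proportion to their weights (seeded for repeatability)."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=n)


shares = mixture_shares(SOURCE_WEIGHTS)
draws = Counter(sample_sources(SOURCE_WEIGHTS, n=1000))

print(shares)
print(draws.most_common())
```

Change one number in `SOURCE_WEIGHTS` and the model reads a different world. Nothing in the code says which numbers are right; someone has to decide.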
And all of it is frozen at a moment. A corpus is a snapshot in time: the web as it looked when the scrape ran. That's the real meaning of a model's "knowledge cutoff": not just that it stops at a date, but that it learned the world from a web that no longer exists in that form. The proportions, the live debates, the communities that have grown or vanished since: none of that updated. The map is fixed even as the territory keeps moving.
The pipeline, honestly
Filtering is necessary and mostly good. You cannot train on raw spam-soaked petabytes, and dedup, language filtering, and quality heuristics genuinely improve what comes out. None of this is an indictment.
The catch
Necessary isn't neutral. Every filter is a choice with a side effect, and the side effects compound. The model at the bottom learned from a minority of a minority, and it can't tell you what fell away.
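Here is the funnel as a toy, to show how the losses stack. Everything in it is an assumption made for illustration: the three stages are crude stand-ins (exact-match dedup, an ASCII ratio in place of a real language-ID model, a punctuation count in place of a real quality heuristic), and the thresholds are arbitrary rather than taken from any published pipeline.

```python
def dedup(pages):
    """Drop pages identical to an earlier one after whitespace and case normalization."""
    seen = set()
    kept = []
    for page in pages:
        key = " ".join(page.split()).lower()
        if key not in seen:
            seen.add(key)
            kept.append(page)
    return kept


def mostly_ascii(pages, min_ratio=0.9):
    """Crude stand-in for language ID: keep pages that are mostly ASCII characters."""
    return [
        page for page in pages
        if page and sum(ch.isascii() for ch in page) / len(page) >= min_ratio
    ]


def enough_sentences(pages, min_sentences=2):
    """Crude stand-in for a quality heuristic: count sentence-ending punctuation."""
    return [
        page for page in pages
        if sum(page.count(mark) for mark in ".!?") >= min_sentences
    ]


STAGES = [("dedup", dedup), ("language", mostly_ascii), ("quality", enough_sentences)]


def run_funnel(pages, stages=STAGES):
    """Apply each stage in order and record how many pages survive it."""
    report = [("raw", len(pages))]
    for name, stage in stages:
        pages = stage(pages)
        report.append((name, len(pages)))
    return pages, report


raw_pages = [
    "The council met on Tuesday. Minutes are attached.",
    "the council met on  Tuesday. minutes are attached.",  # duplicate after normalization
    "Это страница на русском. Она тоже о супе.",            # dropped by the ASCII stand-in
    "click here",                                          # no sentences
    "Soup is good. Here is how to make it.",
]

survivors, report = run_funnel(raw_pages)
survival_rate = report[-1][1] / report[0][1]
print(report)         # [('raw', 5), ('dedup', 4), ('language', 3), ('quality', 2)]
print(survival_rate)  # 0.4
```

Each stage looks defensible alone, and each sees only what the previous stage left behind. The report counts what fell away but keeps none of it, which is the position the model is in.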
Here's the practical takeaway, the thing I actually use. When a model is confidently bland, when it gives you the flattened, median, sanded-down take and seems blind to a community's real voice or a niche you know well, don't reach first for "the model is dumb." Suspect the filters. A flat answer often means the textured version of that topic lived in the part of the web that got scrubbed, deduped, or never upvoted into the corpus in the first place. The blandness isn't a flaw in the reasoning. It's the shape of the funnel showing through.
The shift in one line. Read a model's gaps as evidence about its training data, not just its intelligence. When a voice is missing or an answer is suspiciously smooth, ask which filter would have removed the richer version, and go find that version yourself.
Sources: Radford et al., 2019, Language Models are Unsupervised Multitask Learners (GPT-2 / WebText; outbound Reddit links with ≥3 karma) · Raffel et al., 2020, Exploring the Limits of Transfer Learning… (T5 / the C4 Colossal Clean Crawled Corpus & its blocklist) · Dodge et al., 2021, Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (filters disproportionately remove minority-dialect & identity-related text; unexpected included sources) · Gao et al., 2020, The Pile: An 800GB Dataset of Diverse Text (curated multi-source mix) · Common Crawl (commoncrawl.org), the primary public scrape. Tiered: the existence and mechanics of these filters are established; the documented bias of C4's blocklist is an established empirical finding (Dodge 2021); the "confidently bland → suspect the filters" diagnostic is illustrative, a useful heuristic, not a measured law.