When I was 35, I collected every targeted web ad that popped into my browser for a month. It was a lot of ads: more than 3,000 in total. I hadn’t done this on a whim. I was working with my team at the Office for Creative Research to build a tool called Floodwatch, a browser extension that allowed anyone to collect and view all of the web ads that were being targeted toward them.
The project was born from a conversation I fell into one morning, on a train from Oxford to London, with Ashkan Soltani. Ashkan was, at the time, a privacy researcher and journalist with a byline at The Washington Post. In the next few years, he’d go on to win a Pulitzer and serve as chief technologist at the Federal Trade Commission. In between stories of computational intrigue and white hat hacking and TSA harassment, Ashkan shared with me a particular lament. In his work investigating the vagaries of online advertising, it was extraordinarily challenging to gather data, because individual users spend their online days in their own personalized version of the web.
To get around this, Ashkan would set up little farms of headless browsers — virtual web users without screens or keyboards or brains. He’d set these zombie users loose, wandering from website to website, following specific patterns that would fool ad trackers into thinking they were a particular kind of user, from a specific place, and belonging to a certain demographic. Using this approach, Soltani could see how the web looked from the perspective of a young Black man in Georgia, or a senior citizen in Greece or a middle-aged white couple living on the Upper West Side with a keen interest in luxury goods.
There were two problems with these beheaded armies. First, as diligently as Ashkan might try to model a real user’s behavior, the virtual users were just that: models. They contained few of the unpredictable whims that define your web-browsing behavior or mine. Rarely would they check travel prices to North Dakota on a whim, or tangent off to learn about flying squirrels.
Second, there was a limit to how many headless browsers could be set into action. To get a real sense of the machineries behind web advertising, and the ways in which it discriminated, Ashkan and other researchers like him needed more data.
Computers Think They Know Us
At first, Floodwatch was simple. It collected ads and then showed them back to you in a scrolling wall, a cascade of gaudy commercialism. In the background, it’d make your anonymized ad data available to trusted researchers, along with however much demographic information you were willing to share.
When I started using the tool, I was surprised by the sheer number of ads that I was seeing, typically more than 100 per day. Scrolling through weeks and then months of them, I could see patterns of my life reflected: Every time I was in an airport, for example, I’d be barraged with hotel and rental car ads. There were also families of ads that seemed to make little sense; for weeks I’d see dozens of ads a day for legal training. In the background were the stalwart advertising categories of my internet life as a middle-aged male: watches, flashlights, key chains.
Inspired by a project called Cookie Jar by the digital artist Julia Irwin, I paid 10 strangers $10 each to tell me, in writing, what kind of a person I was, based only on what they could deduce from my thousands of browser ads. I’m a 26-year-old from Montreal, a 27-year-old in Vegas, and a retiree with a penchant for photography. Mostly I’m unemployed. I’m a gamer, a fashionista, an occasional beer drinker and a dog owner. I graduated from a community college, I like to travel, and I wear glasses. Also, I might be Jewish.
My ad-based biographers got a few things right. I do indeed wear glasses, I enjoy perhaps slightly more than the occasional beer, I travel a lot, and I have a dog. The rest, though? These things are detritus from my digital life, signals that advertisers, in their zeal to garner a click, took too seriously. My browser is perhaps more zigzagged than most because I tend to wiggle into strange rabbit holes quite often as I’m researching for artwork or articles.
Still, this is a common theme when people are confronted with their web personas, according to advertisers: Signals from side roads of browsing activity seem to be taken as seriously as the thoroughfares. A bigger shared response from people who used Floodwatch to get a look at their browser history doppelgangers is that, by and large, they are deeply erroneous. And yet a lot of money is spent on assembling your profile, building a picture of you that can be used to make the purchase of ad space in your web windows a little less of a gamble.
That the government, Facebook, and the Ford Motor Company know things about you seems a given. But the fact is that many of the things they “know” are statistical kludges, pieced together from some combination of data, chance and guesswork. The pervasive branding message of web-era capitalism is that these tactics are effective; that by collecting great quantities of data and processing it with sophisticated machine learning algorithms, these surveillant interests can get some precise idea of not only where we are and what we're buying but who we are. That advertisers in particular have an ability to get at our true selves with their computational machinery.
In the last 25 years, a ramshackle computational system has been duct-taped together, one that operates in the very briefest slice of time between when you load a web page and when that page is fully rendered in the browser. All of this vast machinery was developed for a simple purpose: to put an ad into your web browser that you are likely to click.
In recent years, this massive system of trackers and servers and databases designed for placing ads has been turned in other directions — to insurance, to health care, to HR and hiring, to military intelligence. To understand living in data’s ubiquitous condition of being collected from, it’s important to know how ad targeting was meant to work, how it works differently against different people and how, at the root of it, it doesn’t really work at all.
A Brief History of the Web
For a few years in the 1980s, I ran a dial-up bulletin board system called the Hawk’s Nest. Bulletin board systems (BBSs) were a community-based precursor to the internet; while DARPA and other scientific and military interests were creating the technical backbone for what would one day support the web, BBS system operators (sysops) and users were figuring out how social spaces could exist on phone lines. My own BBS had only two lines, which meant that at most two users could be online at the same time, but a social group of a few hundred hung out on the Nest, leaving messages for one another on boards and sharing software of dubious legality in the file areas.
One of the defining experiences of the BBS era was waiting.
If the dial-in line was busy, you’d have to wait to log in. If you wanted to talk to a friend, you’d have to wait for them to show up. If you wanted to download even the smallest of files, you’d have to wait, the whole time praying that no one else in the house would pick up the phone. Today’s web browsers cache an image in your computer’s memory, storing the bits and bytes until the image is complete, only then drawing it to the screen. My BBS client would render the image as it loaded; I’d watch as the (black-and-white) picture was painfully assembled, one line of pixels at a time.
When I got to college in 1993, I landed a job at the library teaching new students how to use the internet. Actually, the job was first to teach students what the internet was (almost none of them had an email address before coming to the school), and then to teach them how to use it. The first public web browser, Mosaic, had been released earlier in the year, and it was with genuine enthusiasm that I’d show the students how to load an image from a server across the world.
I’d enter the URL, and then we’d all wait 10 seconds for the image to appear: a 64-by-64-pixel picture of the Mona Lisa. It wasn’t unusual for there to be applause.
Two years later, I moved into newly built university housing and experienced for the first time the joy of broadband. The building was wired up for ADSL, and downloads ran at four megabits per second, a speed that is still pretty respectable today. I’d just built my own web page, and I remember refreshing it again and again, marveling at how fast my carefully photoshopped buttons loaded against the tiled putting-green-grass background. It seemed clear to me that it was a matter of time before pretty much all of the web would appear instantly in a browser.
Indeed, the web as it first existed was fast. The New York Times launched its first website in January 1996, and the page was 49 kilobytes in total; from my room at the University of British Columbia, I could load it in a tenth of a second, less than half of a blink of an eye.
Before a web page fully loads, a complicated network of ad sellers and buyers, data brokers and real-time exchanges activates to “place” web ads into your browser. Ads sold directly to you, or to the person advertisers believe you are.
Here is what happens in the space of a second and a half:
When you arrive at a web page, a request is initiated for one or several advertisements to be placed in a particular spot on the page. In the lingo of online marketing, delivery of an ad is called an impression. An impression request is assembled, which includes information about the page you’ve arrived on (that it’s a news story, that it's in the culture section, that it mentions cheese) and also whatever the publisher of the page already knows about you.
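As a rough illustration, an impression request can be pictured as a small bundle of page context plus whatever is known about the visitor. The Python sketch below is an assumption made for clarity, not any publisher’s actual format: the class name, the field names, the ad slot label and the example IP address are all invented.

```python
from dataclasses import dataclass, field


@dataclass
class ImpressionRequest:
    """Hypothetical shape of an impression request; every field name is illustrative."""
    page_type: str   # e.g. "news story"
    section: str     # e.g. "culture"
    keywords: list   # topics the page mentions, e.g. ["cheese"]
    ad_slot: str     # made-up label for the spot on the page to be filled
    user: dict = field(default_factory=dict)  # whatever the publisher already knows


def build_impression_request(page: dict, ad_slot: str, known_user: dict) -> ImpressionRequest:
    """Bundle page context and known user data into one request."""
    return ImpressionRequest(
        page_type=page["type"],
        section=page["section"],
        keywords=list(page["keywords"]),
        ad_slot=ad_slot,
        user=dict(known_user),
    )


# Example values are invented; 203.0.113.7 is a reserved documentation address.
request = build_impression_request(
    page={"type": "news story", "section": "culture", "keywords": ["cheese"]},
    ad_slot="top-banner",
    known_user={"ip_address": "203.0.113.7"},
)
print(request)
```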
This personal information almost certainly includes your IP address (the electronic signature of your device) and data about you that is stored in any number of cookies. Cookies are local files stored on your computer that hold information about your online behavior. Most important, they store unique identifiers for you, such that the next time that cookie is loaded, the owners of the page can know with confidence that it was you who came back.
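The returning-visitor trick can be sketched in a few lines of Python using the standard library’s `http.cookies` module. This is a simplified assumption about how a site might do it, not a description of any real publisher’s code; the cookie name `visitor_id` and the one-year lifetime are made up for the example.

```python
import uuid
from http.cookies import SimpleCookie
from typing import Tuple

COOKIE_NAME = "visitor_id"  # hypothetical cookie name


def identify_visitor(cookie_header: str) -> Tuple[str, bool]:
    """Return (visitor_id, is_returning) given the raw Cookie header a browser sent."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    if COOKIE_NAME in cookie:
        # The browser handed back an identifier we issued earlier: a returning visitor.
        return cookie[COOKIE_NAME].value, True
    # No identifier yet: mint a fresh one for this first-time visitor.
    return uuid.uuid4().hex, False


def set_cookie_header(visitor_id: str) -> str:
    """Build the Set-Cookie header that stores the identifier in the browser."""
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = visitor_id
    cookie[COOKIE_NAME]["max-age"] = 60 * 60 * 24 * 365  # assumed one-year lifetime
    return cookie.output(header="Set-Cookie:")


# First visit: no cookie arrives, so an identifier is minted and sent back.
first_id, returning = identify_visitor("")
header = set_cookie_header(first_id)

# Next visit: the browser returns the cookie and the site recognizes it.
second_id, returning_again = identify_visitor(COOKIE_NAME + "=" + first_id)
```

The identifier itself says nothing about you; its value comes from everything the site and its partners file under it over time.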
Indeed, the first cookie ever deployed, by the Netscape website in 1994, was used only to check to see if the user had already visited the site. Today’s cookies are elaborate chocolate-chip-with-shredded-coconut-and-flax-seeds-and-dried-cranberry affairs; the cookie stored by The New York Times contains 177 pieces of data about you, with strange labels like fixed_external_8248186_bucket_map, bfp_sn_rf_2a9d14d67e59728elbiSb2c86cb4ac6c4, and pickleAdsCampaigns, each storing an equally cryptic value (160999873%3A1610106sg, i s4s942676435, [{"c":"4261","e":1518722674680,"v":1}]).
Advertising partners can also deliver and collect their own cookies; loading nytimes.com today without an ad blocker sees 11 different cookies written or read from your machine, including ones from Facebook, Google, Snapchat and DoubleClick.
Once a request for an impression is put together from available user data, it’s sent off to the publisher’s ad server. There, a program checks to see if the request matches any of its presold inventory: Whether, for example, a cheese maker had already bought advertisements for stories that mention cheese, or if a real estate developer had bought placements for anyone who comes from a particular ZIP code (easily gleaned from your IP address).
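The ad server’s check against presold inventory might look something like the Python sketch below. It is a toy under stated assumptions: the deal list, the `match_presold` function, the advertiser called Hypothetical Realty and the ZIP code are all invented, and a real ad server would weigh far more than a keyword or a ZIP code.

```python
from typing import Optional

# Invented presold deals: one keyed on a page keyword, one on a visitor ZIP code.
PRESOLD = [
    {"advertiser": "Cheese of the Month Club", "keyword": "cheese"},
    {"advertiser": "Hypothetical Realty", "zip_code": "10024"},
]


def match_presold(request: dict, inventory: list) -> Optional[str]:
    """Return the first advertiser whose presold deal matches the request, else None."""
    for deal in inventory:
        if "keyword" in deal and deal["keyword"] in request.get("keywords", []):
            return deal["advertiser"]
        if "zip_code" in deal and deal["zip_code"] == request.get("zip_code"):
            return deal["advertiser"]
    # Nothing presold fits, so the request would move on to an ad exchange.
    return None


print(match_presold({"keywords": ["cheese"]}, PRESOLD))  # Cheese of the Month Club
print(match_presold({"keywords": ["opera"]}, PRESOLD))   # None
```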
If there’s not an ad ready to be placed, the impression request is sent off to one of several ad exchanges. These exchanges are bustling automated marketplaces in which thousands of ad impressions are sold every second.
Prospective ad buyers communicate with other servers to build on the data that they already have about the user. A data broker might think they know a lot about your identity from your IP address — that you live in Chicago, that you have a gym membership, that you drive a Honda, that you have a chronic bowel condition, that you belong to a gay dating site, that you vote Democrat. This data (right or wrong) is sold to the prospective ad buyer all with the intention that they might make a better decision on whether to buy your ad impression.
So far, 65 milliseconds (a fifth of a blink) have elapsed. The impression request has been assembled and sent, brokers have been consulted and a particular data picture of you has been brought into focus. Now the ad exchange holds a real-time auction to bid on the chance to show you an advertisement. As many as a dozen potential ad buyers might be vying for space in your browser, and depending on who you are and what you’re reading, the price for an ad might range from a tenth of a cent to more than a dollar. The auction takes another 50 milliseconds.
The highest bidder is granted the chance to place an ad on your page, and the image is delivered, loaded and rendered. The page is loaded, Cheese of the Month Club waits eagerly for a click, and you, the user, are blissfully unaware of all that has happened.
In the space of a second, we see much of capitalism in miniature. Market research and sales teams and purchase and delivery shrunk down to milliseconds. Google’s and Yahoo’s ad exchanges use a procedure for auction that traces its roots to 19th-century stamp collectors: The second-price sealed-bid auction, also known as the Vickrey auction.
In this type of sale, bidders submit bids without knowing what others in the auction are proposing to pay. The party that bids the highest wins, but they win for the second-highest price. Mathematical models have shown that this structure for an auction encourages “truthful bidding”; that is, the parties involved tend to bid around what they believe the actual value is. The shortcomings of the Vickrey auction (namely, the chance that two bidders could collude, lowering their bids collaboratively while ensuring that one party wins) are mitigated by the extraordinarily short auction time. There’s not much space for collusion in 50 milliseconds, a twentieth of a second.
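The second-price rule is simple enough to sketch in a few lines of Python. The bidders and dollar amounts below are made up, and the tie-breaking choice (the earliest-listed bidder wins) is an assumption of this sketch; real exchanges layer reserve prices, fees and other rules on top, none of which is modeled here.

```python
from typing import Dict, Tuple


def second_price_auction(bids: Dict[str, float]) -> Tuple[str, float]:
    """Run a sealed-bid second-price (Vickrey) auction.

    bids maps each bidder to a sealed bid in dollars. The highest bidder wins
    but pays the second-highest bid. Ties go to the bidder listed first.
    """
    if len(bids) < 2:
        raise ValueError("a second-price auction needs at least two bids")
    # Sort from highest bid to lowest; sorted() is stable, so ties keep input order.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]  # the winner pays the runner-up's bid
    return winner, price


# Invented sealed bids for a single impression.
bids = {
    "Cheese of the Month Club": 0.40,
    "rental car company": 0.25,
    "watch retailer": 0.10,
}
print(second_price_auction(bids))  # ('Cheese of the Month Club', 0.25)
```

Because the winner’s payment is set by someone else’s bid, shading a bid below what the impression is really worth cannot lower the price paid; it only risks losing the auction. That is the intuition behind “truthful bidding.”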
The first banner ads appeared on the top of Wired magazine’s affiliate HotWired’s home page on October 27, 1994. That evening, the publishers held a rave, got drunk on Zima and celebrated breaking the internet.
“People told us if you put ads online, the internet would throw up on us,” said Wired’s co-founder Louis Rossetto. “I thought the opposition was ridiculous. There is hardly an area of human activity that isn’t commercial. Why should the internet be the exception? So we said, ‘Fuck it,’ and just went ahead and did it.”
One of the first 12 ads was for AT&T. It featured a block of text filled with a confetti of random colors that read, “Have you ever clicked your mouse right HERE?” Beside the text was an arrow pointing to the words “YOU WILL.”
The combination of bad design and arrogance seems, in hindsight, very fitting. AT&T had paid $10,000 to place that ad, and it wanted to know if it had worked, so Rossetto’s colleagues went line by line through server logs, counting how many people had clicked on the image.
What followed was a decade-long game, played against a backdrop of elevator pitches and VC funding. Advertisers asked for more and more: How many people clicked, and then who clicked and from where. Developers built systems to track these things, and then other things.
“Can we place different ads to different people?” the corporations asked. Dutifully, the web teams built systems just for this, wrangling together a system of cookies and indexed user data. And then came the ad servers and exchanges, the data brokers and the rest, a mudslide of collection and hopeful correlation. No one, it seems, paused to ask whether any of this was a good idea, whether it was legal, whether their targeting tech might be used for more nefarious purposes than selling phone-and-internet packages. Or, it turns out, whether any of it actually worked.
In 2013, Latanya Sweeney, then the chief technologist at the Federal Trade Commission, published research showing disturbing racial discrimination in Google’s AdSense product, one of its most popular and pervasive ad placement systems. She showed that on pages containing personal names — for example, a staff page at a research institute — AdSense was placing particular ads much more often for people with names assigned primarily to Black babies such as DeShawn, Darnell and Germain. These ads were suggestive of an arrest record in ways that ads placed for white-sounding names (Geoffrey, Jill, Emma) were not.
In 2016, ProPublica set out to buy blocks of web advertisements for rental housing from Facebook and requested they be targeted to a number of very specific user sets: African Americans, wheelchair users, mothers of high school kids, Jews, Spanish speakers. It picked these groups on purpose because they are protected by the federal Fair Housing Act, which prohibits any advertisements that discriminate based on race, color, religion, sex, handicap, familial status or national origin.
“Every single ad,” ProPublica wrote, “was approved within minutes.”
Facebook quickly apologized. “This was a failure in our enforcement,” mea-culpa-ed Ami Vora, the company’s vice president of product management, “and we’re disappointed that we fell short of our commitments.” It promised to fix the problem.
In 2017, ProPublica repeated the experiment. It even expanded the groups it attempted to purchase for, adding “soccer moms,” people interested in American Sign Language, gay men, and Christians. As in the previous experiment, its ads were approved right away. Facebook again promised to fix the problem, although it took a while: In March 2019, Facebook announced that advertisers could no longer target users by protected categories for housing, employment and credit offers.
Why did it take Facebook so long to close the doors on a practice that was fundamentally illegal? It might have been bureaucratic inefficiency, or a failure to prioritize legal compliance ahead of its more favored metrics of user counts and ad spends. Or, it might have been because Facebook knew the problem of discriminatory ad targeting went deeper than anyone might have imagined.
Early in the summer of 2019, a group of researchers from Cornell demonstrated that even when ad buyers are deliberately inclusive, the workings of the massive and convoluted delivery machine can exclude particular groups of users. On Facebook, they showed that the company’s efforts at financial optimization, combined with its own AI-based systems designed to predict ad “relevance,” colluded to show content to wealthy white users, despite neutral settings for targeting.
Using clever methodologies designed to isolate market effects from Facebook’s own automated systems, the researchers demonstrated results that feel similar to those of Matthew Kenney’s word2vec investigations. Housing ads were routed based on race, with certain ads delivered to audiences of more than 85 percent white users and others as little as 35 percent.
Employment ads showed high bias to gender, as well as race. Jobs for janitorial work more often appeared in feeds of Black men. Secretarial jobs were more often shown to women. Jobs in the AI industry ended up in front of mostly white men. In all of these cases, no specific choice was made by the buyers to direct ads to a certain demographic; Facebook took care of the discrimination all by itself.
What this study and others like it suggest is that the ad-targeting machine itself is biased, breaking the protections laid out in the federal Fair Housing Act and indeed in the Constitution. There are no check boxes needed to target (or exclude) white people or Black people, trans people or Muslims or the disabled, when the system obediently delivers ads based on its own built-in biases.
Facebook and other ad-centered platforms have spent a decade being trained on the reward of per-ad profits, and they have dutifully learned to discriminate.
Excerpted from Living in Data: A Citizen's Guide to a Better Information Future. Published by MCD, a division of Farrar, Straus and Giroux, on May 4th, 2021. Copyright © 2021 by Jer Thorp. All rights reserved.