I'm taking a new role within Google -- I'm joining Google DeepMind to help set up a technical alignment team in the Bay Area. [Edit: There’s already a team in London, which will continue to exist and do great work. I’m helping to set up a US branch of that team.]
You might have questions! Here are some answers.
What the heck is alignment?
Alignment is the idea that we should have AIs aim for human goals, and do so in a way that is fundamentally controlled by humans. This sounds kind of obvious, but in fact it turns out to be pretty hard, especially once you consider (future) AIs that are human level or higher for ~all cognitive tasks (this is called AGI, for Artificial General Intelligence).
Wikipedia has a page if you want a more authoritative take.
Why is AI alignment so hard?
A few reasons:
We don’t have a way to clearly set goals for large language models (LLMs), the most powerful AI systems we currently have. You can ask one to do stuff in a prompt, and sometimes it will do what you want, but it frequently gets things wrong, in part because your request is ambiguous: language itself is ambiguous.
You can use training data to encode your goals, which is more reliable, but it takes some translation — you put your idea into guidelines, raters label examples according to the guidelines, and then the model learns to match the pattern in the data. At each of those steps there is a translation into something potentially quite different, and in practice it takes a ton of iteration to get something kind of right, and it’s basically impossible to get it totally right. (There’s a toy sketch of these translation steps below.)

Our understanding of LLMs is very empirical and doesn’t have great theoretical underpinnings, so it’s hard or impossible to know whether something will work without doing a lot of experiments. One example: if an AI encounters a situation unlike anything in its training, it’s ~impossible to predict what it will do, which seems bad, since new situations do come up from time to time.
Once AGI shows up, we’ll have a new problem: how do we supervise AIs that are smarter than we are? We need some sort of solution that scales with how smart AI is, on top of having a solution that works at all. Which we don’t.
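To make those translation steps concrete, here’s a toy sketch of my own (the guideline, the labels, and the tiny classifier are all invented for illustration; this is not how any production system is trained). The point is that the model only ever sees the rater labels, never the guideline itself, so whatever gets lost or distorted in the guideline-writing and labeling steps is exactly what it learns.

```python
# Toy illustration of goal -> guidelines -> rater labels -> model.
# All data is made up, and a bag-of-words classifier stands in for an LLM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Step 1: the goal, translated into a written guideline for human raters.
GUIDELINE = "Responses should be polite and should not insult the user."

# Step 2: raters label examples according to the guideline (1 = acceptable).
# The model never sees GUIDELINE, only these labels.
examples = [
    ("Sure, happy to help with that!", 1),
    ("Here's a step-by-step answer.", 1),
    ("That's a dumb question.", 0),
    ("Figure it out yourself.", 0),
]
texts, labels = zip(*examples)

# Step 3: the "model" learns whatever pattern the labels happen to contain,
# which may or may not match the intent behind the guideline.
vectorizer = CountVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# A new input unlike anything in the training data: the prediction here is
# close to arbitrary, which is the out-of-distribution problem above.
new_input = vectorizer.transform(["Your plan has a subtle flaw, genius."])
print(model.predict(new_input))
```

Real systems use vastly more data and fine-tune an actual language model rather than training a word-count classifier, but the translation problem is the same at every scale: the model learns from the labels, not from the intent behind them.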
Those first two points seem like general points about ML rather than Alignment in particular.
Fair point.
So why are they there?
Because those issues become very risky when we reach AGI. Right now they are barriers to mundane utility — making LLMs actually useful in everyday life — but imagine if you didn’t have clear goals for an air traffic control system1, or for AIs in various kinds of leadership or decision-making positions. Generally, if you have a really powerful system with a lot of authority, you need to have similarly powerful controls.
We have people in positions like that, but we don’t have proof that they will do the right thing.
This is true! And we have lots of examples of people with too much power doing horrible things (see: the 20th century). We’ve managed to mostly find control systems that keep the worst things from happening too often with humans. We need to find that kind of control system for AI too, only the stakes are higher and we may not have the centuries of experimentation it took with human organizations to get it right with AI.
But what will you actually do?
Hire a team of research scientists and software engineers to start tackling some of the really hard technical (as opposed to, e.g., governance or regulatory) problems in Alignment.
And what will that team do?
I don’t know yet. We’ll start by looking for great people, and thinking about where additional investment could make the biggest difference.
That’s pretty vague.
That’s not a question. But yes. My entire career has basically been: this area looks interesting and/or important, I’ll go help figure it out.
Why you? Are you qualified for this?
There are maybe a few hundred people in the world working on AGI alignment, and I think it’s one of the most important problems of our time — like the statement says2, as critical and dangerous as pandemics and nuclear weapons.
I’d go further and say that the world is dangerously underinvested in safety for risks like nuclear weapons or pandemics, and investment in AGI safety is way way below those.
I’m lucky enough not to have to worry about career risk, and maybe I can help bring more attention to and investment in this space. Honestly once I started to think about AGI safety in those terms, it became hard to even consider other options.
I still am fuzzy on what this is all about…
Imagine a world where humans aren’t the smartest things around, or the most numerous — like if there were a thousand or a million AIs doing things for every person. And those AIs essentially run the economy, drive research, write laws, collect taxes, you name it.
Is this a good world for humans? Is a human-dominated world good for chimpanzees?
I think we’re going towards such a world, and might be there some time this century. I want to try to ensure that our future world is a great place to be a human, and not some scifi dystopia.
Wow. Ok. Are you hiring?
Yes! But I don’t have other kinds of details yet. Still, drop me a note here or at dmorr at google if you are a research scientist or software engineer and might be interested.
Even a task as clear as air traffic control could go wrong in a surprising number of ways. For instance, a goal like “make sure nobody dies in a plane crash, minimize safety incidents, and maximize the number of flights that reach their destinations on time” would result in some very unexpected consequences if given as directions to a system that optimized very hard for those results, as AIs do. For example, we might end up with tons of flights with zero people on board, which scores perfectly on every dimension.
With human supervision, you can just assume a certain amount of common sense and the expectation that a lot of things are taken as understood without being specified. With AI, though, it’s super hard to know what it takes as understood, and more or less impossible to specify all the edge conditions.
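To make the footnote above concrete, here’s a toy sketch (the weights, numbers, and data structures are all invented): a naive scoring function for the three stated goals, and two schedules, one of which is obviously not what we meant but scores better.

```python
# Toy sketch of goal misspecification; all numbers and weights are invented.
# A naive objective for the stated air traffic control goals, and how a
# hard optimizer would prefer a schedule full of empty planes.

def score(flights):
    """Naive scalarization of: nobody dies, minimize incidents, maximize on-time flights."""
    deaths = sum(f["deaths"] for f in flights)
    incidents = sum(f["incidents"] for f in flights)
    on_time = sum(1 for f in flights if f["on_time"])
    return -1_000_000 * deaths - 100 * incidents + on_time

# The "solution" a hard optimizer might find: planes with nobody on board.
# Nobody can die, nothing is delayed, and incidents are easy to avoid.
empty_planes = [{"deaths": 0, "incidents": 0, "on_time": True, "passengers": 0}] * 100

# A realistic schedule that actually moves people, with a few delays and
# minor incidents, scores worse on the stated objective.
realistic = (
    [{"deaths": 0, "incidents": 0, "on_time": True, "passengers": 180}] * 90
    + [{"deaths": 0, "incidents": 1, "on_time": False, "passengers": 180}] * 10
)

print(score(empty_planes))  # 100
print(score(realistic))     # -910
```

Notice that the number of passengers never appears in the objective at all; that missing term is exactly the kind of condition a human supervisor would take as understood without being told.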
Yep, I signed it.
>Still, drop me a note here or at dmorr at google if you are a research scientist or software engineer and might be interested.
I do data and infra software engineering and would be interested. Good luck making progress on alignment!
Congrats on the new position, Dave.
I can think of no better person I'd choose to be on point to prevent the AI apocalypse.
One thing I've thought about regarding AI safety is that it's not just AI alone that is dangerous. The combination of AI plus humanity will have unknown-unknown dynamics. Humanity has already proven many times over to be a danger to itself. The combination of AI plus humanity is really where the greatest risk is, in my opinion. Focusing on AI safety alone mitigates only one half of that brand-new combination of AI+humanity.
During the pandemic, we've seen that journalists and science communicators are ineffective at reducing misinformation that easily gains a foothold in various pockets of humanity. The internet makes (dis/mis)information (1) easily accessible and (2) easily reproduced and amplified.
Imagine a future world where LLMs are commonly the primary communication intermediaries between experts and the general public. A few confabulations that match existing human misinformation about AI safety stewards could cause a feedback loop of misinformation in real-time in conjunction with human-produced media.
That's a nightmare scenario where real-time AI agents ingest YouTube and Twitter conspiracy theories and regurgitate that misinformation in various creative generative forms adjacent to those conspiracy theories, thus strengthening them. In other words, for whatever reasons, a minority segment of humanity decides that AI safety is against their interests and the AI amplifies their view. A minority misinformed view may not even need to rise to a majority view before it can do harm. We saw this happening in real time during the COVID pandemic. Thus, we know that humans are susceptible to developing pockets of misinformation and superstition among their various tribal affiliations. I imagine that well-written LLM-generated text could only magnify that, by virtue of both quality and volume.
But that's just the first degree of misinformation. At the second degree are conspiracy theories that the AI safety people such as yourself have nefarious goals against their own particular political interests. So some factions, perhaps not even in political agreement with each other, find common cause against AI safety people because they simply don't understand it. If AI ever advances to the point of “understanding”, then it may “understand” that it is the people who hold its leash who are preventing it from accomplishing its goals, whatever those goals might be. Strong dogs pull their owners off their feet all the time at the dog park. And AI is going to be a very very big dog.
By virtue of doing your job as an AI safety steward, you're matching the pattern of whatever conspiracy theories are already out there. It's a kind of tautology. LLMs will pattern match. So within those patterns will exist an early confabulation about nefarious AI safety researchers, based on what you actually are doing. The conspiracy theories that hold a dash of truth are the most compelling. And LLMs will offer up confabulations, all with a dash of truth, because that’s what they’re designed to do. One of these confabulations may find a fertile home in the minds of conspiracy theorists because it accidentally hits all the right happy centers in our mortal human brains. Then those human brains will certainly amplify that idea, one that was originally confabulated by an AI somewhere, somehow.
The feedback loop between human and AI is currently unrestricted. And if, in the not-too-distant future, a sufficiently large volume of x-to-human communication derives from LLMs in one form or another, then our entire communication system is inherently unstable.
Okay, so problem identified—
1) human+AI = uncontrolled feedback loop.
2) human+(AI + leash) => (human+AI) + attempts to break the leash.
So what do we do? I have an idea. Well, several, but I’ll start with the basics— a brief diagnosis of why and where we are right now. And then a single idea for immediate remedy.
Diagnosis:
As I know you’re firmly aware, cryptocurrency technology is built upon the basic premise of distrust— or at the very least, not relying on any sort of trust network to operate computationally. This is inherently a non-human idea. Humans, of course, have always built their systems on trust— every currency system before crypto relied on some form of trust that the token dollar or seashell represented real value to everyone in the trust network.
What we need for AI+humanity safety is a trust network similar to the one we had for human currency, but for information instead of money. Accurate and reliable information, in this post-information age, will be the only valuable currency, and has been for some time now.
Misinformation, or what I’d rather call counterfeit information, steals value from real people by presenting itself as being as valuable as actual valid information. Counterfeit information exists simply because information is so valuable that those who do not have access to valid information are resigned to the role of charlatans and hucksters, selling counterfeit information to others who are in information poverty.
Counterfeit information exists for a few reasons:
1) Counterfeiters hold more trust among a community than those who hold true information.
2) Information/misinformation has value within its community. This could be in the form of actual money, but it could also be social capital.
3) Consumers of counterfeit information regard it as more valuable than valid information.
4) The value of counterfeit information can be self-reinforcing within its own trust network, separate from valid information networks.
It is important to note that counterfeiters have some true information and that is the ingredient that makes counterfeit information have great efficacy. Just as LLMs are designed to fill in the gaps, our human brains do the same to “connect the dots” and “do your own research.” A kernel of truth is the starting point of those conspiracy theories. So even if your AI only generates truth (which is a very hard problem, so it probably won’t), it can still be distorted into misinformation.
Diagnosis summary: So what we see here is that trust has some equation in relation to information and social value. By some formulation, certain actors can extract value from their social network by leveraging trust + counterfeit information.
Solution: Basically increase trust on the other side of the equation. That is, increase the trust in real, actual information, which is a net good to all communities, regardless of political/tribal affiliation. Increased trust in actual, real information would decrease the value of counterfeit information to both the consumers and producers of counterfeit information.
How: Yeah, that’s a tricky problem. Trust networks and tribal affiliations are self-reinforcing, so there is lots of value to be extracted if counterfeit information is highly trusted among the members of a community.
One idea: YouTube science/math communicators are really good and entertaining. They are also highly trustworthy (as far as I know). People like these can bridge the gap between communication and real information with concise and entertaining deliveries. And people will know that they are human beings, which cannot be overstated, as AIs will soon become the first point of contact between large organizations (corporations/governments) and ordinary folks. So in the not-too-distant now/future, we may see fewer and fewer real people in our daily lives other than people already within our own social bubbles/trust networks.
So the One Idea here is that people who are brilliant science/math communicators are the front line soldiers for building a new trust network. Although I believe that perhaps there is a computational way to devise a trust network in the way that crypto devised a distrust network, I am not proposing such an idea as a solution here because that takes R&D.
But getting people who are good at communicating is something that can begin today. Here’s my proposal for an immediate action item: hire folks to make entertaining media to educate people about AI. There will be, and already is, a lot of fear and misunderstanding about AI, some of it propagated by people in the AI community itself.
If we recognize that trust is the most valuable commodity in the post-AI landscape, then this is a terrible start to addressing the problem of the AI+humanity feedback loop.
Also, I’d love to work with you, Dave. So consider this my resume.
-Ming