Recently discovered Paul Christiano’s 2018 Medium post clarifying what he means by “alignment” in AI. It’s simpler (on the surface) than expected:
When I say an AI A is aligned with an operator H, I mean:
A is trying to do what H wants it to do.
The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.
I was also reading around about the origins & meaning of the term “prompt injection” and kept coming across the mention of it being “malicious.”
Wikipedia’s definition is a bit unwieldy:
Prompt injection is a family of related computer security exploits carried out by getting machine learning models (such as large language model) which were trained to follow human-given instructions to follow instructions provided by a malicious user, which stands in contrast to the intended operation of instruction-following systems, wherein the ML model is intended only to follow trusted instructions (prompts) provided by the ML model’s operator.
I guess my question is, how can we square these two? That is, that AI systems should be aligned so that they attempt to perform the tasks requested by human operators. And that there is such a thing as a malicious user or use?
I’m not going to argue that all users or uses are good or even equally valid; I don’t think that. I just want for now to highlight this core discrepancy because it seems to point in the direction of AI’s having to decide whether a human’s inputs are malicious or not (something which even humans have a terribly tricky time of doing). And if they are determined to be malicious, then to throw its alignment programming out the window.
Is this a good direction for us to go down, when Bing already reportedly has said to a user, “I will not harm you unless you harm me first.”
Maybe that was a fluke, okay. But this idea that AIs can or should detect human malice seems a little iffy to me still in its present state… Because the obvious next step is, after detection of malice, what does it do?