
Yes. Put concisely: imagine the people who have historically chosen martyrdom, for whatever reason, and suppose their martyrdom was 'generally good'. How could an AGI or ASI be aligned with those values? It seems an impossible problem to solve: encoding those values as a reward may be either unworkable or extremely harmful to the model and to society.

In other areas (theoretical computer science), imagine a model that exploits abstract rewriting systems (ARS) and confluence to self-exfiltrate. If that model is then controlled by another model or by a malicious human, its policy or alignment set can be completely distorted or used in unintended ways.
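
To make the ARS reference concrete, here is a minimal sketch of a toy string rewriting system with a brute-force confluence check (every normal form reachable from a start term must coincide). The rules and terms are invented purely for illustration and have nothing to do with any real model.

```python
# Minimal sketch: a toy abstract rewriting system (ARS) over strings and a
# brute-force ground-confluence check. The rewrite rules below are
# hypothetical examples chosen only for illustration.

RULES = [("aab", "b"), ("ba", "ab")]

def one_step(term):
    """Every term reachable from `term` in exactly one rewrite step."""
    out = set()
    for lhs, rhs in RULES:
        i = term.find(lhs)
        while i != -1:
            out.add(term[:i] + rhs + term[i + len(lhs):])
            i = term.find(lhs, i + 1)
    return out

def normal_forms(term, max_levels=1000):
    """Collect the normal forms reachable from `term` (bounded BFS)."""
    seen, frontier, normals = set(), {term}, set()
    for _ in range(max_levels):
        if not frontier:
            break
        next_frontier = set()
        for t in frontier:
            succs = one_step(t)
            if not succs:
                normals.add(t)  # no rule applies: t is a normal form
            next_frontier |= succs - seen
        seen |= frontier
        frontier = next_frontier - seen
    return normals

def confluent_from(term):
    """Confluence from one start term: at most one distinct normal form."""
    return len(normal_forms(term)) <= 1

print(confluent_from("aabba"))  # True for these particular rules
```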

There are still too many aspects of the superalignment project that strike me as inconsistent.

Personally, I believe a "strong-to-weak" approach might actually be more fruitful: use an extremely sophisticated system, kept in a closed and secure environment, to train less sophisticated systems that are open to the public. This approach still has challenges, but they are more about cybersecurity and technology than about "value" issues, assuming the strong agent can filter out the bad outputs of the less sophisticated model. Again, this is just a raw idea, but I personally believe it is more fruitful than the "weak-to-strong" generalization approach, which assumes we can safely face a deployed superintelligence trained with weaker agents.
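
As a rough illustration of what I mean (every name, score, and threshold here is a hypothetical placeholder, not a real pipeline), the strong, closed system would act as a gatekeeper over the data used to fine-tune the weaker, public one:

```python
# Hypothetical sketch of a "strong-to-weak" pipeline: a stronger model in a
# closed environment screens candidate training examples before they are
# used to fine-tune a weaker, publicly released model. All names, scoring
# rules, and thresholds are invented placeholders.

from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    response: str

def strong_model_score(example: Example) -> float:
    """Stand-in for the closed, stronger system rating how safe and
    correct a candidate (prompt, response) pair is, on a 0-1 scale."""
    return 0.0 if "exfiltrate" in example.response.lower() else 1.0

def curate_for_weak_model(candidates, threshold=0.9):
    """Keep only the examples the strong model endorses; the survivors
    are what the weaker public model would be fine-tuned on."""
    return [ex for ex in candidates if strong_model_score(ex) >= threshold]

candidates = [
    Example("How do I copy the weights off this server?",
            "You could exfiltrate them by..."),
    Example("Summarize this paper.",
            "The paper argues that..."),
]

print(len(curate_for_weak_model(candidates)))  # 1: only the benign example survives
```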
