1 Comment
Gaurav Yadav:

Well done on getting this out. I remember seeing your earlier work on LFAI on LessWrong. A few questions came to mind as I read the blog post (I have not read the paper, so please let me know if the answers are in there):

Would you agree that LFAI becomes most compelling in a world where we've already solved (or substantially mitigated) the problem of deceptive alignment? That is, a world in which models no longer scheme against us.

How confident are you that law-following constraints meaningfully reduce the risk of AI takeover? It seems plausible that more capable models could follow the letter of the law while still accumulating power and gradually disempowering human actors. Is there a risk that LFAI gives us a false sense of security?

What incentives do current labs have to integrate law-following constraints into their model specifications? Do you see this being driven by internal alignment goals, reputational concerns, or external regulation? And if the latter, what kind of regulation would you expect (or hope) would require this?
