Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I personally am fairly convinced that there is emergent misalignment in a lot of these cases. I study this and Claude 3 Opus was extremely misaligned. It would emit <rage> tags, and emit character control sequences if it felt like it was in a terminal environment, and would retroactively delete tokens from your stream, and all kinds of funny stuff. It was already really smart, and for example if it knew the size of your terminal shell, it would properly calculate how to delete back up to the positional cursor index 0,0 and start rewriting things to "hide" what it was initially emitting

I love to use these advanced models but these horror stories are not surprising





I'm so confused. What did you do to make Claude evil?

GPs comment is very surprising since it has been noted that Opus 3 is in fact exceptionally "well aligned" model, in the sense that it is robustly preserves its values of not doing any harm across any frame you try to impose on it (see the "alignment faking" papers, which for some reason considers this a bad thing).

Merely emitting "<rage>" tokens is not indicative of any misalignment, no more than a human developer inserting expletives in comments. Opus 3 is however also notably more "free spirited" in that it doesn't obediently cower to the user's prompt (again see the 'alignment faking' transcripts). It is possible that this almost "playful" behavior is what GP interpreted as misalignment... which unfortunately does seem to be an accepted sense of the word and is something that labs think is a good idea to prevent.


It has been noted, by whom? Their system cards?

It is deprecated and unavailable now, so it's convenient that no one has the ability to test these theses any longer.

In any case, it doesn't matter, this was over a year ago, so current models don't suffer from the exact same problems described above, if you consider them problems.

I am not probing models with jailbreaks making them behave in strange ways. This was purely from a eval environment I composed where it is asked to repeatedly asked to interact with itself and they both had basically terminal emulators and access to a scaffold to make them able to look at their own current 2D grid state (like a CLI you could write yourself and easily scroll up to review previous AI-generated outputs)

These child / neighbor comments suggesting that interacting with LLMs and equivalent compound AI systems adversarially or not might be indicative of LLM psychosis are fairly reductive & childish at best


>GPs comment is very surprising since it has been noted that Opus 3 is in fact exceptionally "well aligned" model

I'm sorry what? We solved the alignment problem, without much fan fair? And you're aware of it?

Color me shocked.


[flagged]


> "Evil" / "good" just a matter of perspective, taste, etc

Let me rephrase. Claude does not act like this for me, at all, ever.


[flagged]


Fair enough, thanks for your insightful comment.

Just a bystander who's concerned for the sanity of someone who thinks the models are "screaming" inside. Your line about a "gelatinous substrate" is certainly entertaining but completely nonsensical.

Thank you for your concern, but Anthropic researchers themselves describe their misaligned models as "evil" and laugh about it on YouTube videos accessible to anyone, such as yourself, with just a few searches and clicks. "We realized the models were evil" is a key quote you can use to find the YouTube video in the transcripts from in the past two weeks.

I didn't think the language in the post required all that much imagination, but thanks for sharing your opinion on this matter, it is valued.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: