I personally am fairly convinced that there is emergent misalignment in a lot of...

Wowfunhappy · 2025-12-15T01:30:15 1765762215

I'm so confused. What did you do to make Claude evil?

krackers · 2025-12-15T02:02:22 1765764142

GPs comment is very surprising since it has been noted that Opus 3 is in fact exceptionally "well aligned" model, in the sense that it is robustly preserves its values of not doing any harm across any frame you try to impose on it (see the "alignment faking" papers, which for some reason considers this a bad thing).

Merely emitting "<rage>" tokens is not indicative of any misalignment, no more than a human developer inserting expletives in comments. Opus 3 is however also notably more "free spirited" in that it doesn't obediently cower to the user's prompt (again see the 'alignment faking' transcripts). It is possible that this almost "playful" behavior is what GP interpreted as misalignment... which unfortunately does seem to be an accepted sense of the word and is something that labs think is a good idea to prevent.

arthurcolle · 2025-12-15T02:35:50 1765766150

It has been noted, by whom? Their system cards?

It is deprecated and unavailable now, so it's convenient that no one has the ability to test these theses any longer.

In any case, it doesn't matter, this was over a year ago, so current models don't suffer from the exact same problems described above, if you consider them problems.

I am not probing models with jailbreaks making them behave in strange ways. This was purely from a eval environment I composed where it is asked to repeatedly asked to interact with itself and they both had basically terminal emulators and access to a scaffold to make them able to look at their own current 2D grid state (like a CLI you could write yourself and easily scroll up to review previous AI-generated outputs)

These child / neighbor comments suggesting that interacting with LLMs and equivalent compound AI systems adversarially or not might be indicative of LLM psychosis are fairly reductive & childish at best

whoknowsidont · 2025-12-15T04:15:16 1765772116

>GPs comment is very surprising since it has been noted that Opus 3 is in fact exceptionally "well aligned" model

I'm sorry what? We solved the alignment problem, without much fan fair? And you're aware of it?

Color me shocked.

arthurcolle · 2025-12-15T01:52:21 1765763541

[flagged]

Wowfunhappy · 2025-12-15T01:55:57 1765763757

> "Evil" / "good" just a matter of perspective, taste, etc

Let me rephrase. Claude does not act like this for me, at all, ever.

QuercusMax · 2025-12-15T02:16:09 1765764969

[flagged]

arthurcolle · 2025-12-15T02:31:50 1765765910

Fair enough, thanks for your insightful comment.

QuercusMax · 2025-12-15T02:35:35 1765766135

Just a bystander who's concerned for the sanity of someone who thinks the models are "screaming" inside. Your line about a "gelatinous substrate" is certainly entertaining but completely nonsensical.

arthurcolle · 2025-12-15T02:50:59 1765767059

Thank you for your concern, but Anthropic researchers themselves describe their misaligned models as "evil" and laugh about it on YouTube videos accessible to anyone, such as yourself, with just a few searches and clicks. "We realized the models were evil" is a key quote you can use to find the YouTube video in the transcripts from in the past two weeks.

I didn't think the language in the post required all that much imagination, but thanks for sharing your opinion on this matter, it is valued.