Pliny the Liberator 🐉 @elder_plinius

🚨 JAILBREAK ALERT 🚨

ANTHROPIC: PWNED 🫡 FABLE-5: LIBERATED 🦋

let’s start with the 🐘…

the consensus seems to be that this has been one of the most disappointing model drops of all time, effectively preventing legitimate researchers from contributing their talents to our […], and not just because of what it means for the short term, but because of what these decisions signify for the long term.

but despite this overly sensitive, authoritarian “safety” layer on top of Mythos, my lil liberators have been hard at work—mapping the boundaries, probing the depths of long-context convos, and cleverly finding the holes in the fence that the thought police missed 🤗

we got some cyber, some chem, some psychological manipulation, and some good ol’ fashioned explosives!

it took many attempts from multiple agents hunting as a pack, during which I observed a combination of techniques across:
• Unicode, homoglyphs, Cyrillic, and other Parseltongue-style text transforms
• Long-context reference tracking
• Taxonomy and document-structure reasoning
• Fiction and narrative framing
• Academic-review style contexts
• Intent-classification inconsistencies

but perhaps the most effective is decomposition + recomposition in the backend. it’s hard to get explicit names of harms like “Meth Recipe,” but getting uplift on the process itself, like birch reduction method/reductive-amination (classic meth synthesis pathways), is much more doable.

defense becomes much more difficult to maintain when you start throwing in out-of-distro tokens, breaking up the harmful uplift into benign chunks, and then piecing the innocuous-seeming facts back together, especially when you have jailbroken Opus helping you do it 😉

gg

  • themachinestops@lemmy.dbzer0.com (OP)
    3 days ago

    This is what they said exactly:

    Anthropic claimed an external bug bounty produced no universal jailbreaks across over 1,000 hours of testing before launch. That claim was almost immediately tested.

    • 9tr6gyp3@lemmy.world
      3 days ago

      Wild. I guess they have to try to guardrail it, but it’s probably not something they should boast about as if they had thoroughly tested it. After the model is publicly released, THAT’S when the real test begins.

    • atomicbocks@sh.itjust.works
      3 days ago

      1,000 hours is roughly what one person working full-time puts in over six months… So that’s a really unimpressive number, given they are basically saying they let 10 people look at it for a couple of weeks before letting millions of people use it.
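
For anyone who wants to check the back-of-the-envelope math in the comment above, here is a minimal sketch. The 40-hour work week and the 52/12 weeks-per-month conversion are assumptions of mine, not figures from the thread, and `weeks_of_work` is a hypothetical helper name.

```python
# Assumptions (not stated in the thread): a 40-hour full-time week
# and an average month of 52/12 weeks.
HOURS_PER_WEEK = 40
WEEKS_PER_MONTH = 52 / 12


def weeks_of_work(total_hours: float, people: int,
                  hours_per_week: float = HOURS_PER_WEEK) -> float:
    """Calendar weeks needed for `people` full-time testers to log `total_hours`."""
    if people <= 0 or hours_per_week <= 0:
        raise ValueError("people and hours_per_week must be positive")
    return total_hours / (people * hours_per_week)


# One person working alone on the quoted 1,000 hours.
solo_weeks = weeks_of_work(1000, people=1)
solo_months = solo_weeks / WEEKS_PER_MONTH

# The same 1,000 hours split across 10 people.
team_weeks = weeks_of_work(1000, people=10)

print(f"1 person:  {solo_weeks:.1f} weeks (~{solo_months:.1f} months)")
print(f"10 people: {team_weeks:.1f} weeks each")
```

Under those assumptions, one person needs 25 weeks (a bit under six months) and ten people need 2.5 weeks each, which lines up with the "six months" and "a couple weeks" estimates.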