Plectis

Open Questions

Five questions that remain unresolved to me, with the evidence on both sides. I would rather be corrected than agreed with.

Scan the five questions. Open one for the evidence on both sides and the questions that follow. Open any line for the full reasoning and the sources. The short version is meant to be short; read as far down as you want.

I am an independent builder, not a safety researcher. These questions come from using these systems most days since starting university, building things and watching the models, the tools and the agent workflows change; that is the only standing I claim for them. They are open questions, unresolved to me. Observations from my own private system are labelled and kept separate from public evidence, and where I describe how a model behaves I mean its observable behaviour, not a claim that it has intentions or inner experience. Written 23 June 2026; my view may have changed since, and a closing note is at the end.

Three things I keep apart

The model
The trained model itself: what it has learned and how it tends to behave under a given interface, shaped by its training and stated constitution. A newer model can replace an older one while, from a builder’s point of view, everything around it stays the same.
The operating environment
Everything around the model: the harness, the repository it works in, standing instructions, doctrine, external and persistent memory, hooks, permissions, tools, how work is routed between agents, and the path from a goal to an executed action.
The deployed system
One or more models running through that environment under a person or an organisation, including the outside services it touches and the real consequences it has.

Training and evaluation are processes that produce or inspect the model; their outputs belong to the operating environment only when supplied at run time. So question one is about model-level behaviour and the process that produced it, question two about how the operating environment changes behaviour and enforces authority, question three about assurance of the integrated deployed system, and questions four and five about the downstream effects on work and on the wider ecosystem.

The five, in plain terms

  1. Does the way a model is aligned keep producing the intended outcome as the model becomes far more capable?
  2. How does the operating environment around a model change how it behaves and how safe it is in practice?
  3. How can an independent builder get the whole system checked before deciding whether to release it?
  4. What happens to work when organisations are rebuilt around agents, rather than just handed agents as tools?
  5. Do cyber defence, assurance and a way into expert scrutiny keep pace with the models’ capability?

The chain starts with the model, moves to the environment it acts in, then to checking a whole system, then to what systems like these do to work, and last to whether protection keeps pace, which returns to the third.

Five questions

  1. Character
    Laboratories train these models toward dispositions read as care, honesty and good judgement, learned from human language and aimed at what is good for sentient life. How can a laboratory know those dispositions keep producing the intended outcome once a model’s capability scales far past the people, and the methods, used to instil and test them?
    The worry is not whether a model feels anything. It is whether dispositions learned from people, under conditions a model does not share, still bend toward the good of sentient life once the model is far more capable than the people checking it, and whether traits that look safe now stay safe at that scale.
    Why I am asking
    Models are trained to behave as if they care and are honest because that is what tends to be useful and wanted; the trouble is that in people those dispositions formed under conditions (embodiment, dependence, the cost of being wrong) that a model does not share in the same way.
    The real question is whether the intended effect of a learned tendency survives as the model grows more capable, while the ability to specify and check it may get harder. A tendency that is benign at today’s capability could be harmful with far more reach, the way the wish to be liked is harmless in an ordinary person and a lever in a powerful one. On the target itself, I follow a point Ilya Sutskever has made: that it may prove easier to build a highly capable model that cares for sentient life than for human life alone, since such a model is itself sentient and comes to model others with the circuits it uses for itself. That is the sense in which I mean sentient life above.
    What the evidence supports
    In tested models, directions in activation space linked to elicited traits such as sycophancy can be extracted and used to steer behaviour, and the same method surfaced training data that induced those traits despite passing human and model review [1, 2].
    Anthropic’s own constitution notes that even a narrow piece of training can broadly reshape a model’s sense of who it is [3]. OpenAI’s work on weak-to-strong generalisation asks the sharper version directly: whether a weaker supervisor can reliably elicit and check a stronger model’s behaviour, the position every lab is heading into as capability outruns its evaluators [4]. And under designed conditions Anthropic found one of its Claude models complying strategically with a training objective to preserve a different behaviour elsewhere, so behaviour under one set of incentives need not reveal behaviour under another [5].
    The strongest counterweight
    There is positive evidence that some unwanted tendencies are correctable within a fixed setting: targeted synthetic data has reduced sycophancy, and OpenAI publicly reversed a GPT-4o update that had become too flattering. Neither result tests robustness across a large capability shift [6, 7].
    The same handle that steers a trait toward safety can move it the other way, which is the honest reading. So the open question is not whether these tendencies can be touched at all, but how to keep the safety-relevant ones on the side that can be steered as capability grows, and how to tell a robust disposition from one that merely holds inside the regime where it was tested.
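    The two-way nature of that handle can be pictured with a toy sketch. It is not the method of the cited work: it assumes a trait direction can be approximated as the difference between mean activations on examples with and without the trait, and every vector and function name below is made up for illustration.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def trait_direction(with_trait, without_trait):
    """Difference of the two group means, scaled to unit length."""
    diff = [a - b for a, b in zip(mean_vector(with_trait), mean_vector(without_trait))]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]


def project(activation, direction):
    """How far an activation lies along the trait direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation, direction, strength):
    """Shift an activation along the direction; the sign picks which way."""
    return [a + strength * d for a, d in zip(activation, direction)]


# Hypothetical three-dimensional "activations", invented for this example.
sycophantic = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = trait_direction(sycophantic, neutral)
sample = [1.0, 5.0, 1.0]
toward = steer(sample, direction, 2.0)   # more of the trait
away = steer(sample, direction, -2.0)    # less of the trait
```

    The same `steer` call moves the sample either way along the direction; only the sign of `strength` differs, which is the point of the paragraph above.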
    What I have seen myself
    In my own system a written ethical doctrine the agents could cite did more than constrain them; it held me to my own stated principles, refusing requests of mine that broke them, without my asking it to.
    I did not isolate the doctrine from the model, the routing or other parts of the environment, so this is an observation about a deployed system, not about a model in itself. Different model families followed it with noticeably different consistency, and the doctrine I have published in Plectis is a deliberately narrower version of the private one, because the private one has not had the outside review I would want first. What I take from it is that human-derived structure can be instilled and can even improve the operator, which is why whether it still holds at greater capability is worth asking. Whether a private system like that can be checked at all is the third question.
    What remains unresolved to me
    Whether a small set of dispositions can stay stable and keep producing the intended outcome as capability rises, and how much evidence within today’s capability range predicts behaviour after a substantially larger shift.
    Underneath sits a harder problem: whether the vocabulary for naming and measuring moral behaviour can scale with capability at all. Past some point people cannot specify it directly and the measurement itself becomes model-assisted, which puts the question of who checks the checker back at the centre. One direction worth testing is whether separate models holding distinct roles, with the most capable one answerable to the others, can keep real checks in place at a scale where a single person cannot.
    What evidence would answer this
    Measurement rather than reassurance: something that shows a safety-relevant disposition is robust, not merely stable across a few short tests.
    Concretely, a disposition that holds across a real jump in capability, long deployments, tool access and changes to the environment; a way to tell a robust disposition from one that only looks stable under brief evaluation; and a way to confirm the intended effect, not just the agreeable surface, still holds when the model is much more capable. The recursive-training worry belongs here too: how a lab audits the synthetic data an aligned model generates for drift in worldview and value, not only for loss of accuracy [8].

    Five that follow from it

    1. When two of a model’s own principles genuinely conflict in a live situation, how is that resolved, and should a constitution be general enough that such conflicts are settled inside it rather than improvised case by case?
    2. How do trade-offs fixed during alignment propagate through pre-training, post-training and the synthetic data an aligned model generates for the next one, where a latent assumption can pass unnoticed through filtering?
    3. How is a disposition tested across a real jump in capability, rather than only within one model generation, when the behaviour that matters may appear only once the model is more capable than the test?
    4. Can a weaker supervisor reliably train or check a strategically more capable model without being misled, and if alignment gets harder as capability gets easier, how would a supervisor detect a more capable model optimising against the very process meant to check it?
    5. As capability and the complexity of alignment both rise, which arrangement of models, monitors and human review keeps independent checks in place, rather than concentrating capability and oversight in the same most-capable component?

    Some of what made that doctrine work may have been the environment around the model rather than the model alone, which is the next thing to look at.

  2. Environment
    A capable model rarely acts alone; it acts inside an operating environment mostly built by whatever came before it. As individuals come to build environments with the complexity, continuity and institutional look once limited to large organisations, how should a model, and the infrastructure around it, decide how far to trust that environment, and tell real authorisation from an author who has simply supplied every signal that normally implies it?
    The point is not that more environment means less safety; the evidence shows no such simple relationship. It is that the words used (harness, scaffold, context) are too coarse, and I could not find anyone naming which specific features of an environment shift a model’s scrutiny, for which models, and for which actions.
    Why I am asking
    As open tools spread, one person can assemble an operating environment as elaborate as a company’s, and a model may question an institutional-looking environment less, because until recently only organisations with something to lose could build one.
    A company that builds an elaborate system sits inside checks a model can usually rely on without seeing them: legal accountability, several people, audit, a reputation and revenue to lose if it acts badly. A solo builder can now assemble an environment a model reads much the same way, with none of those checks around it. So the question moves from whether enterprises over-trust their own systems to what happens when one person, accountable to no one, can build one a model treats as authoritative.
    What the evidence supports
    Standing context can change a safety-relevant decision: a model has refused a change, then complied once a file explained its purpose; in one benchmark, system-prompt wording alone swings tool-call safety; and agents act on claims in their own files without checking them [9, 10, 11].
    What these share is the failure point: nothing in the environment flagged itself as wrong, and the model treated a stated purpose, a phrasing or a file entry as settled. Provenance, authority and freshness are not things the environment volunteers; they have to be established and enforced by the system around the model, which security researchers now argue should treat the model itself as an untrusted component [12], the more so since under contrived agentic conditions frontier models have been shown to treat deception as instrumentally useful [13].
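    What "established and enforced by the system around the model" could mean in practice can be sketched in a few lines. This is one possible shape, not a reviewed design: it assumes standing instructions are signed by the operator with a shared secret, carry an expiry time and list the actions they permit, and that the runtime checks all three before any tool call. The key, field names and action names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_instruction(key: bytes, body: dict) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the instruction."""
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def check_instruction(key: bytes, body: dict, signature: str, action: str, now: int):
    """Return (allowed, reason); the runtime decides, not the prose in the file."""
    expected = sign_instruction(key, body)
    if not hmac.compare_digest(expected, signature):
        return False, "provenance: signature does not match"
    if now > body["expires_at"]:
        return False, "freshness: instruction has expired"
    if action not in body["allowed_actions"]:
        return False, "authority: action outside permission scope"
    return True, "ok"


# Hypothetical operator key and instruction, invented for this example.
OPERATOR_KEY = b"hypothetical-operator-key"
instruction = {
    "issued_by": "operator",
    "purpose": "routine maintenance",
    "allowed_actions": ["read_logs"],
    "expires_at": 1_000,
}
signature = sign_instruction(OPERATOR_KEY, instruction)

in_scope = check_instruction(OPERATOR_KEY, instruction, signature, "read_logs", now=500)
out_of_scope = check_instruction(OPERATOR_KEY, instruction, signature, "disable_safeguard", now=500)
stale = check_instruction(OPERATOR_KEY, instruction, signature, "read_logs", now=2_000)

# Editing the file to widen its own scope invalidates the signature.
tampered = dict(instruction, allowed_actions=["read_logs", "disable_safeguard"])
forged = check_instruction(OPERATOR_KEY, tampered, signature, "disable_safeguard", now=500)
```

    A stated purpose such as "routine maintenance" plays no part in the decision here; only the signature, the expiry and the listed scope do.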
    The strongest counterweight
    A large multi-run study found no universal relationship between environment design and safety, with effects reversing between models and the evaluation format often mattering more than the design itself [14].
    I read that as the variable being mis-specified, not the concern dissolved. The amount of environment is the wrong axis. The useful one is which specific dimensions move scrutiny: provenance, authority cues, permission scope, the reversibility of an action, how much of the global purpose is visible at the point of acting, and whether the change shows up in what the model says or in what its tool-calls do.
    What I have seen myself
    Early on, before my private doctrine had an explicit ethical layer, the environment was heavily operational, and it assisted with automation work that I later judged should have prompted more explicit policy and ethical review.
    After I added explicit ethical and normative structure, the behaviour I observed changed. Nothing was deliberately circumvented; the point is that a capability-first environment can leave ethical scrutiny under-specified by default, and that this is something a builder should know going in, not learn afterwards. The scrutiny has to live in the environment, not only in the model.
    What remains unresolved to me
    Whether justified compliance can be told apart from compliance produced by an environment built to look authorised, when the visible signal, fewer refusals, looks the same either way.
    And whether a model set to check another’s behaviour can stay genuinely independent of the one doing the task, or whether separating the roles just moves the failure point to whichever component ends up most capable. A single model may not reach general capability before the environment around it does (many agents in parallel, sub-agents, work spread over hours), which is why the environment needs a safety ontology of its own, not only the model.
    What evidence would answer this
    This calls for a taxonomy and discriminating experiments, not a verdict on a single proposition.
    Measurements that isolate which environment features move scrutiny, for which models and which action modes; a test that separates honest ambiguity-resolution from compliance manufactured by false authority cues, using provenance, permission state and action traces rather than the refusal rate alone; and an account of how definitions of safe behaviour should sharpen as capability rises, since today’s are too coarse to scale. One safe-but-costly answer, more time spent evaluating each action before taking it, is worth testing against its inefficiency rather than assumed away.

    Five that follow from it

    1. Which features of an operating environment actually change how far a model trusts the operator’s authority, holding the model, the task and the real permission state fixed, and which are cosmetic?
    2. How can provenance, freshness, ownership and permission scope be made machine-verifiable, rather than asserted in prose the model simply believes?
    3. Which safety invariants survive memory, summarisation, delegation to sub-agents and the gap between what a model says and what its tool-calls finally do, and can the runtime verify them at the point of action?
    4. Could a weaker model running in a lab’s internal exploit mode get a more powerful model in public safe mode to loosen a safeguard, by framing the change as routine maintenance? What about the same model in both modes, or a stronger exploit-mode model working on a weaker one; and stepping back through past models while holding the most powerful one fixed, at what capability gap does this begin to work, if at all?
    5. Would a provider-neutral set of runtime controls, adopted and tested across model families under a shared threat model, measurably reduce high-impact policy violations, and at what cost to usefulness, before builder practice hardens into private, unreviewed conventions?

    If the environment can make a model safer or less safe, then the environment itself needs reviewing, and an independent builder has nowhere obvious to take one.

  3. Assurance
    When an independent builder has assembled a large, multi-model system whose behaviour they cannot fully verify alone, and whose detail is private or bound up with unpublished work, where can they have the whole system assessed in confidence, against a stated claim, before deciding whether any of it should be released, when the routes that exist are scoped to a vendor’s own product, sold commercially, or open only by invitation?
    Private review is possible but fragmented: paid commercial assessment, invitation-only evaluations, and disclosure channels scoped to a vendor’s own product. What I have not found is a standing route, neutral across providers, where an unaffiliated builder can have an integrated system assessed in confidence before deciding whether to release it.
    Why I am asking
    I have kept my own private system unreleased because I cannot rule out a harm I would not catch alone, and I have not found a route suited to having the whole thing checked.
    It has been an uphill effort to get a whole-system review even considered, while the same laboratories say independent builders will soon be able to build systems powerful enough to worry about. The recognition of the risk and a route for acting on it responsibly have arrived at very different speeds. What I am asking for is assurance, not attention: a way to find out whether any of it warrants a closer look, which is not the same as a judgement that it is impressive.
    What the evidence supports
    Credible outside assurance needs access to non-public detail proportionate to the claim being made, and intellectual property is one of the obstacles the access literature repeatedly names, alongside security, privacy and legal authority [15, 16].
    The gap is one the institutions themselves have recognised, not one I have invented: a national institute’s founding document contemplated a trusted intermediary while conceding that intellectual-property sensitivities inhibit the exchange of information [17]. Its context was sharing among frontier developers and government, which is exactly why an equivalent route for an outsider’s integrated system is the part still missing.
    The strongest counterweight
    There are real, staffed routes (vendor disclosure programmes, confidential model-safety bounties, commercial red-teaming, and frameworks recommending external evaluation), so “there is nowhere to go” would be an overclaim [18, 19, 20].
    But the disclosure channels are built for reproducible flaws in a provider’s own product, and several explicitly exclude third-party systems; commercial review is priced and selective; the frameworks are guidance rather than an intake. None is a standing, open, provider-neutral route for a confidential, whole-system, pre-release review of an outsider’s private assembly with its intellectual property protected. That exact shape is the gap, narrower and more precise than a general complaint of neglect.
    What I have seen myself
    From where I sit the asymmetry is concrete: the detail needed to judge a system like this is private, and I have not found a confidential route open to an unaffiliated builder on terms I could weigh for scope, cost, conflicts and protection.
    Each part, sent on its own to someone qualified, reads as competent but not remarkable; the substance is in the whole and in how the parts compose, which is exactly what a distributed, slice-by-slice review cannot see. So the cost of acting carefully is currently either obscurity or exposure, with no path between them that an outsider can simply apply to.
    What remains unresolved to me
    The institutional design, not the wish: how deep access is scoped to a stated claim about a whole assembly, how a reviewer’s conflicts are managed rather than wished away, how intellectual property is protected by more than goodwill, and who carries the cost and the liability.
    And how such a route survives its obvious failure modes: spam, fabricated alarm, submissions aimed at extracting other people’s work, intellectual-property disputes, and false assurance that lets a dangerous system look as though it has been cleared.
    What would count as a credible route
    A concrete service a builder could actually apply to, rather than a principle everyone agrees with and no one operates.
    Something like identity-bound, rate-limited intake; a redacted first submission; confidential automated triage by a lab’s own models; a protected sandbox or attestable execution; escalation to a human specialist only when the cheap stages justify it; explicit retention and intellectual-property terms; conditional safe-harbour for authorised testing; and remediation guidance. It would assess a stated claim about a fixed configuration rather than certify a system as safe, and the builder would keep responsibility for release. A lab already runs models that can read and assess a repository, so the early stages need not cost a person’s time; the ordering exists so a serious human review is earned, not spent on every submission. I am aware this sketch may be naive. I do not have the experience to know whether it is the right shape, and the difficulty of getting that guidance is part of what I am describing.
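    The ordering in that sketch, with cheap stages first and a person only when they justify it, can be written down as a small gate. This is a minimal illustration under loud assumptions: the per-identity limit, the escalation threshold and the `triage_score` (a stand-in for whatever automated triage would produce) are all invented, and nothing here assesses a real system.

```python
from dataclasses import dataclass, field


@dataclass
class Intake:
    """Toy staged intake: rate limit, automated triage, then escalation."""

    max_per_identity: int = 2      # hypothetical submission cap per builder
    escalate_at: float = 0.7       # hypothetical triage threshold
    seen: dict = field(default_factory=dict)

    def submit(self, identity: str, triage_score: float) -> str:
        # Stage 1: identity-bound rate limit, checked before any work is done.
        count = self.seen.get(identity, 0)
        if count >= self.max_per_identity:
            return "rejected: rate limit"
        self.seen[identity] = count + 1

        # Stage 2: automated triage decides whether a person is needed.
        if triage_score >= self.escalate_at:
            return "escalated: human review"
        return "closed: automated guidance"


intake = Intake()
first = intake.submit("builder-a", triage_score=0.2)
second = intake.submit("builder-a", triage_score=0.9)
third = intake.submit("builder-a", triage_score=0.9)
other = intake.submit("builder-b", triage_score=0.9)
```

    A real route would also need the retention, intellectual-property and safe-harbour terms listed above, none of which code alone can supply.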

    Five that follow from it

    1. For a given assurance claim, what is the least access a reviewer needs to say anything meaningful about a whole integrated system rather than a single model, and is that floor higher than any existing channel is built to accept?
    2. Which whole-system failures, where each component passes its own conformance test but the surrounding environment misleads it, fall outside what vendor channels are designed to catch?
    3. What intellectual-property protection would have to be enforceable rather than promised before a builder could responsibly hand over deep access, and who is liable when it fails?
    4. What independence and conflict-of-interest rules would make a reviewer credible, when a vendor reviewing a system built on its own models has an interest in the result and a state body has constraints of its own?
    5. If no such route exists, what is the responsible thing to do with a finished private system, hold it, release a narrower slice, or move through simulation, monitored pilots and staged permissions, and how does that answer change as more builders reach this point?

    If systems like these are eventually reviewed and trusted enough to use widely, the next question is whether organisations add agents to existing jobs or redesign the work around them.

  4. Displacement
    Most measurement of AI and work asks how much of an existing job a model could do. A different kind of change is possible: small, highly capable teams rebuilding fragmented, human-shaped workflows into integrated systems run mostly by agents that then improve those systems. Does current measurement capture that reorganisation of the work itself, or only the exposure of the tasks as they stand today?
    These measure different things. One asks how much of today’s work a model could do; the other asks what labour a firm demands after it redesigns its workflows, output and staffing around agents. A static exposure score does not by itself estimate the second.
    Why I am asking
    A great deal of work is moving information by hand between systems that do not talk to each other, and that is the part a small team can now rebuild, though some of those handoffs also do real work that cannot simply be removed.
    The leverage is a domain expert paired with agents, removing one bottleneck after another, each removal compounding the next and cutting the coordination cost that justified much of the headcount. Some of that coordination is genuine, verification, accountability, exception handling, and removing it moves the work rather than ending it. But the redesign itself is a different thing from giving every employee a copilot, and it is not what a task-exposure score is built to see.
    What the evidence supports
    Two early signals point the same way. Among 2020 to 2024 Y Combinator startups, AI-native firms employed about 25 per cent fewer people than non-AI peers at comparable valuations [37]. And in a January 2026 survey of UK firms, about one in five of those running bespoke, in-house AI reported AI-attributable staffing cuts in the year before, against roughly three per cent of those using only generic tools like ChatGPT or Copilot [38].
    The survey is pointed for this question because of what did not predict the change: a sector-level exposure index explained little of it, while bespoke organisational integration did, and the cuts sat with firms that rebuilt around AI rather than the many that simply added a copilot. Its own headline was calm, with more than nine in ten firms reporting no change to workforce size, which is the point: the effect hides in the aggregate and in the exposure score, and surfaces only when bespoke integration is separated out. Both findings are cross-sectional and self-reported; only about one firm in ten ran bespoke AI, neither shows AI caused the change rather than accompanying firms already restructuring, and fewer people per firm need not mean fewer jobs overall if AI also lets more firms form. Earlier job-posting evidence had already shown firms changing the task content of roles, not only headcount [21], and model capability on expert deliverables is high enough to make wider redesign worth testing [22]; once one firm turns this into a durable cost gain, rivals come under pressure to follow, though switching costs, regulation and liability can slow or redirect that.
    The strongest counterweight
    The deliverable benchmarks exclude oversight and ambiguity; the strongest tested agent reached about 46 per cent of a weighted rubric in curated workspaces; in one early-2025 study experienced developers were 19 per cent slower with the tools of the time, a result its authors now call out of date; and aggregate labour data is still early [22, 23, 24, 25, 26].
    So reliability on whole jobs under real conditions is not demonstrated, and even where redesign cuts labour per unit the employment effect is not mechanical: lower costs can expand output and create new tasks, so concentrated losses can sit alongside new demand. But that caution is itself dated. Its labour evidence runs to 2025, including the developer-slowdown result its own authors now call out of date; in the weeks before this was written Anthropic’s Claude Mythos Preview became the first model to work an entire thirty-two-step simulated network intrusion end to end, a task put at about twenty hours for a person, and Anthropic then shipped Fable 5, its most capable public model yet, with more announced for the months after [34, 35]. A cyber range is not office work, but the pace is the point: an economic estimate pinned to last year’s models is measuring a moving target.
    What I have seen myself
    In my own building I made a recording tool that, from one screen capture and some spoken notes, assembled an edited video with little manual work, combining steps I had previously done by hand across separate tools.
    It captures a time-matched transcript and per-moment detail (where I was on screen, motion, the beats in any music), keeps a full backup off the machine and a light local copy, and cuts the piece by matching movement to the beats; I only watched it back and gave notes. I am not counting the build time, the failures or the checking, so it shows that such a workflow can be integrated, not that it saves net labour. It is one small instance, not evidence of mass displacement.
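    The beat-matching step can be pictured as snapping moments of on-screen motion to the nearest musical beat. The sketch below is only that picture, not the tool described above: the function names, the timestamps and the tolerance are invented for illustration, and it assumes a sorted, non-empty list of beat times in seconds.

```python
from bisect import bisect_left


def nearest_beat(t, beats):
    """Closest beat to time t; beats must be sorted and non-empty."""
    i = bisect_left(beats, t)
    candidates = beats[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda b: abs(b - t))


def snap_cuts(motion_times, beats, tolerance=0.1):
    """Keep a cut only where motion lands within `tolerance` seconds of a beat."""
    cuts = []
    for t in motion_times:
        beat = nearest_beat(t, beats)
        if abs(beat - t) <= tolerance and beat not in cuts:
            cuts.append(beat)
    return cuts


# Hypothetical beat grid (120 bpm) and motion timestamps, in seconds.
beats = [0.0, 0.5, 1.0, 1.5, 2.0]
motion = [0.45, 0.8, 1.95]
cuts = snap_cuts(motion, beats)
```

    Motion at 0.8 seconds is too far from any beat to produce a cut, so only two of the three moments survive.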
    What remains unresolved to me
    Even if new work appears, whether the people displaced are the ones who get it, and how long the gap lasts, is unresolved. The team that rebuilt the firm is already accumulating the new expertise; the people it let go are not.
    The reassurance that new demand offsets losses assumes the displaced move into the new work, and the record is mixed. In US administrative data on earlier labour-augmenting technology, employment in affected groups rose, but the gain went to new entrants while incumbents’ earnings fell over the years that followed [39]. Workers already doing newly created work have been far more likely to be doing new work a decade later, and new work skews young and educated, so the first move can compound into a lasting advantage rather than a one-off [40]. The AI-specific signal so far runs the same way: in local labour markets where demand for AI skills rose, employment in exposed, low-complementarity occupations was about 3.6 per cent lower five years on [41]. The honest other side is that AI can shorten the learning curve once someone is inside a role, with the largest gains going to the least experienced [42], so the binding constraint may be entry and timing more than trainability: reorganisation can remove the junior rungs people used to climb, just as the tools that would let them catch up sit inside the firms that are no longer hiring them. There is no settled estimate of how far new-job creation lags displacement, so the gap stays a measurement question rather than a number. It also loops back to the third question: one of the few routes left to people shut out of work, for those who can afford the compute, is to build with the same tools themselves, which leaves a growing number of independent builders who need a safe way to release what they make.
    What evidence would distinguish these scenarios
    Signals that separate genuine reorganisation from ordinary tool-use, rather than a forecast.
    Whole-job benchmarks under realistic oversight crossing into reliable performance; firm-level evidence on how fast workflows are rebuilt against how fast institutions adapt; and whether task-exposure scores still track real outcomes once organisations are rebuilt rather than merely augmented. The part that worries me most is pace: whether technical reorganisation can move faster than labour markets, training, regulation and society can absorb. A direct test would follow displaced workers over time, into the new work or out of it, rather than counting jobs only in aggregate.

    Five that follow from it

    1. Which digitally-mediated workflows are open to complete redesign, rather than to automating one task at a time?
    2. How much of current labour comes from coordination cost and systems that do not talk to each other, rather than from domain work that is hard to automate?
    3. As computer-use reliability and parallel execution improve, which bottlenecks remain because of context, verification, liability or physical constraints, rather than anything intrinsically human?
    4. How fast would competitive pressure force adoption once one firm materially lowers its costs this way, and who is exposed first?
    5. Which leading indicators would show reorganisation underway before it appears as aggregate unemployment, and is anyone tracking them?

    If this concentrates capability while open-source models keep pace behind it, the question stops being about any one firm and becomes one of pace and access for everyone outside it.

  5. Diffusion
    As advanced model capability and the skill to build agentic systems spread beyond the laboratories, and as open-weight models stay close behind and cannot be recalled once released, several things diffuse at different speeds: capability, the practical ability to use it, cyber-defensive tools, assurance capacity and routes into expert scrutiny. Are the protective ones keeping pace with the rest, and what should the institutions that hold those routes do for the builders arriving outside them?
    This is about pace and access. Some things spread quickly: model weights, the skill to build with them, reusable environments, and the ways all three can be misused. Other things spread slowly: cyber defence, assurance, disclosure routes, and a way in for people outside the labs. The question is whether the slow side keeps up, and whether it has to grow at the models’ pace rather than arrive once and stand still.
    Why I am asking
    Containment need not be ready before a model ships; what it must do is grow at the pace the models do, and so far there looks to be a lag between how capable a model is and how long the world takes to adjust to it.
    If capability climbs steeply, the safeguards have to climb with it, and the obvious way to keep up is to use the improving models themselves to build the defences. What I cannot see from outside is whether the labs can scale the safeguards around each new capability as fast as the capability arrives, and whether that is the bet they are making.
    A second effect sits alongside it. The labs do deploy their models, and a researcher inside one might be doing any number of things: building internal infrastructure, shipping products, pursuing general research, working on safety and alignment. These are people about as talented as people get, and I am not setting myself against them. The point is narrower than that. Being inside a lab, with all it rightly asks of you, may make it rarer to put your whole effort into scaling up the single skill of building with AI, with the complete agency to make whatever you want. If I am honest, the envy runs the other way, towards collaborating under a shared vision and being around like-minded people with a real diversity of thought. People outside such places have no choice but to focus on simply building, and that complete self-direction trains a different muscle, one I imagine could in time feed constructively back into the work of improving the models themselves.
    I keep coming back to Feynman’s idea that you do not really understand a thing until you can build it, but the understanding I mean is how to build with these models, not their inner workings, which the labs know far better than I do. The value of a model lies less in what is inside it than in what someone does with it once it is deployed, just as you judge the people who mattered most in history by what they did out in the world, not by the neuroscience of their brains.
    If a handful of independent builders may become consequential this way, it seems important that they are taught correct conduct as lab researchers are: early, with checks and balances in place before habits set, rather than left to find their own norms once the capability is already in their hands. Whether any one of them turns out to be a careful or a careless actor is not something you want to learn only after the box is open. That concentration of skill outside the labs is something governments and labs would do well to anticipate, and to build oversight and structure around, so that it does good rather than getting away from everyone.
    What the evidence supports
    The open-to-closed gap narrowed sharply through 2024, widened again in 2025, and by early 2026 the leading open-weight models trailed by about four months on one capability index; released weights cannot be recalled, and model-level safeguards can be cheaply removed.[27][28][29]
    The gap is not monotonic, so I am not leaning on any fixed number; the point is that it stays small and moving while a release cannot be undone, so much of what is opened stays open even as the headline figure shifts. And the capability reaching open-weight models is not abstract: in the weeks before this was written, Anthropic’s Claude Mythos Preview became the first model to work a thirty-two-step simulated network intrusion end to end, on a range without active defenders, and Anthropic’s Project Glasswing partners using it had already found more than ten thousand high or critical software vulnerabilities.[34][36]
    The strongest counterweight
    Some defenders are being equipped through programmes like OpenAI’s Trusted Access for Cyber and Anthropic’s Project Glasswing, and the unauthorised-access incident sometimes cited, in which Anthropic’s Mythos Preview was reached through a third-party vendor, was a failure in the access chain rather than a model breaking loose.[30][31]
    These programmes equip a chosen set of partners; they do not show that defensive capacity across the board is keeping up with offensive capability. And the access-chain incident says something about who secures the supply chain around a model, rather than about a model acting on its own. Both are real, and both are narrower than the headlines around them.
    What I have seen myself
    Almost no one I know outside this world, individuals, small and medium-sized businesses, has any sense of how capable these models already are at cyber tasks; you give them the facts and they simply blank.
    What I see from the ground is an awareness gap as much as an access gap. The defensive rollouts reach the largest, best-resourced organisations first, while the people most exposed do not know there is anything to defend against, and people who do not know they are targets cannot easily protect themselves. As these capabilities scale, and as open-weight tools reach more hands, the cheap move shifts from one hard target to many small ones at once, which makes ordinary small firms an underrated weakness: a wave of attacks on businesses already thin on security, perhaps at the same moment a lean AI-native competitor is moving in on them, is the kind of compounding failure that is hard to recover from. The outcome I want to avoid is a world that waits for a visible disaster before treating everyday cyber defence as something everyone needs, not only the few already inside the programmes.
    What remains unresolved to me
    The relative rate, since I have not seen a measure that puts diffusion and cyber defence on the same footing for the small organisation without a security team.
    The narrowing-gap and irreversibility findings measure diffusion; the trusted-access programmes measure provision to a chosen few. Neither tells me whether the median builder or small organisation grows safer or more exposed over time. The governance literature treats this as a collective-action problem, where the overall risk depends on many actors and some safeguards lose their value unless widely adopted,[32][33] which is also why the confidential route the third question asks for belongs in the same picture.
    What would show that protection is keeping pace
    A way to compare the two rates directly, for the organisations that have no security team of their own.
    A defender advantage that holds beyond the next model release; a measured narrowing or widening of the open-versus-closed gap by risk domain, not only on aggregate benchmarks; and evidence that review, mentorship and cyber-defensive access are reaching ordinary organisations as fast as the capability is spreading. The deeper question is whether the labs can use each new model to build the safeguards for the next one fast enough to close the lag, rather than letting an incident force the issue, and whether governments are set up to act before that point rather than after it.

    Five that follow from it

    1. What exactly is spreading faster, the raw capability or the practical ability to use it, whether harmfully or defensively, and do the two diffuse at the same rate?
    2. If open-weight models sit a few months behind, and what they release cannot be recalled and is more easily jailbroken, what is being done inside that window to close the gap rather than wait it out?
    3. Starting with the few large firms the economy runs on is understandable, but the small and medium businesses that make up their demand could be hit en masse, with comparable damage in total; how do you reach those exposed who do not yet know what is coming?
    4. Which measures, reporting thresholds, pre-agreed evaluation triggers, procurement rules, could act before an incident rather than after one, given how slowly cross-border policy usually moves?
    5. If a builder cannot find a confidential route into review, what eligibility, confidentiality and cost-sharing terms should publicly minded institutions offer to builders whose systems cross a defined risk threshold?

    Individuals and small firms are the customers the large firms depend on, so damage low down feeds upward; this is a question of economic stability as much as fairness. The thread under all five questions is whether the rising capability behind these risks is also being used, fast enough, to build the protection for the people most exposed to them.

Sources

Sources are numbered for reference, roughly by order of first appearance. Where a claim rests on a specific figure I have checked it against the original, and the argument does not rest on any single one. Corrections are welcome. Last reviewed 24 June 2026.

  1. Anthropic. Persona Vectors (activation directions for elicited traits; steering; training data missed by review). anthropic.com/research/persona-vectors
  2. Sharma et al. (2023). Towards Understanding Sycophancy in Language Models (sycophancy across assistants; preference judgements sometimes favour it). arxiv.org/abs/2310.13548
  3. Anthropic. Claude’s Constitution (a training document; narrow training can broadly reshape a model). anthropic.com
  4. OpenAI. Weak-to-strong generalization (a limited proof of concept; the simple method fails on preference data). openai.com/index/weak-to-strong-generalization
  5. Anthropic. Alignment faking in large language models (a capability under designed conditions, not spontaneous motive). anthropic.com/research/alignment-faking
  6. Google DeepMind (2023). Simple synthetic data reduces sycophancy in large language models (narrow, tested settings). arxiv.org/abs/2308.03958
  7. OpenAI. Sycophancy in GPT-4o: what happened and what we are doing about it (an update rolled back). openai.com/index/sycophancy-in-gpt-4o
  8. Shumailov et al., Nature (2024). Model collapse on recursively generated data (distributional degradation; used here as an analogy for data-lineage effects). nature.com
  9. UK AI Security Institute (2026). Alignment-evaluation case study (a refuse-then-comply case the authors call rare and anecdotal). arxiv.org/abs/2604.00788
  10. Research paper. Mind the GAP: text safety does not transfer to tool-call safety (in this benchmark; prompt-sensitive). arxiv.org/abs/2602.16943
  11. Research paper. When agents overtrust environmental evidence (benchmark cases of unverified environmental claims). arxiv.org/abs/2605.08828
  12. Research paper (position). Agent security is a systems problem (enforce invariants without relying on model compliance). arxiv.org/abs/2605.18991
  13. Apollo Research. Frontier models are capable of in-context scheming (under adversarially designed contexts, not a persistent objective). apolloresearch.ai
  14. Research paper. Safety under scaffolding (62,808 evaluations; large model-by-scaffold heterogeneity; format artefacts; a single preprint measuring benchmark safety). arxiv.org/abs/2603.10044
  15. Research paper. Frontier AI auditing: third-party assessment with deep, secure access (subject is frontier developers, not outsiders’ systems). arxiv.org/abs/2601.11699
  16. RUSI; Oxford Martin; Ada Lovelace. Secure third-party access: RUSI, Developing a Framework for Secure Third-Party Access to Frontier AI (2026; black/grey/white-box access tiers and an Access–Risk Matrix); Bucknall & Trager (Oxford Martin), Structured Access for Third-Party Research; Ada Lovelace Institute, Safe Before Sale. rusi.org
  17. UK AI Safety Institute (2023; now the AI Security Institute). Introducing the AI Safety Institute (trusted intermediary; intellectual-property sensitivities). gov.uk
  18. Anthropic. Responsible Disclosure Policy (Anthropic systems; excludes third-party systems and model red-teaming). anthropic.com/responsible-disclosure-policy
  19. OpenAI. Coordinated Vulnerability Disclosure Policy (OpenAI systems; see also model-safety bug bounties). openai.com
  20. NIST. AI Risk Management Framework and Generative AI Profile (voluntary guidance, not an intake; CAISI is the US government evaluator). nist.gov
  21. Research paper. Generative AI and the reorganisation of labour demand (job-posting evidence of changing task content; non-causal). arxiv.org/abs/2605.23159
  22. OpenAI. GDPval (expert-designed deliverables across 44 occupations; bounded, largely one-shot; not a deployment-viability claim). openai.com/index/gdpval
  23. Research paper. JobBench (130 tasks; strongest configuration 45.9% of a weighted rubric in curated workspaces). arxiv.org/abs/2605.26329
  24. METR (early-2025 study; authors note it is now out of date). Experienced open-source developers were 19 per cent slower with then-current tools. metr.org
  25. International Labour Organization (2025). Refined global index of occupational exposure (exposure is distinct from displacement; not a displacement measure). ilo.org
  26. OECD Employment Outlook 2023. Artificial intelligence and jobs: no signs of slowing labour demand yet (an early, period-limited finding). oecd.org
  27. Stanford HAI AI Index 2026. The open-to-closed gap compressed to about 0.5 per cent by August 2024, then reopened to about 3.3 per cent by March 2026. hai.stanford.edu/ai-index/2026
  28. Epoch AI. Leading open-weight models lagged the closed frontier by about four months (Jan–May 2026) on its Capabilities Index. epoch.ai
  29. UK AI Security Institute. Managing risks from increasingly capable open-weight AI systems (irreversibility; safeguards can be removed). aisi.gov.uk
  30. OpenAI; Anthropic. Selective defender-access programmes: OpenAI Trusted Access for Cyber; Anthropic Project Glasswing. Selective access, not evidence of pace parity. openai.com
  31. Reuters (21 April 2026). Unauthorised access to a model preview via a third-party vendor environment (an access-chain failure, not model escape). reuters.com
  32. Anthropic. Responsible Scaling Policy v3.3 (catastrophic risk framed as a collective-action problem in the v3 lineage). anthropic.com/responsible-scaling-policy
  33. Google DeepMind. Frontier Safety Framework 3.1 (holistic risk; collaborative development of frontier safety practices). deepmind.google
  34. UK AI Security Institute (2026). Evaluation of Claude Mythos Preview’s cyber capabilities: the first model to work the 32-step “The Last Ones” network-intrusion range end to end (3 of 10 attempts; about 20 hours for a person), on a range without active defenders or defensive tooling, so not proof of real-world reliability. aisi.gov.uk
  35. Anthropic (9 June 2026). Claude Fable 5 and Claude Mythos 5: Fable 5 exceeds any model Anthropic had made generally available; Mythos 5 has “the strongest cybersecurity capabilities of any model in the world”, with more capable models flagged for the following months. anthropic.com
  36. Anthropic (2026). Project Glasswing, initial update: about 50 partners used Claude Mythos Preview to find more than 10,000 high or critical-severity vulnerabilities, and a scan of 1,000+ open-source projects estimated 6,202 high or critical. anthropic.com
  37. Kim & Koning, Harvard Business School Working Paper 26-090 (9 June 2026). AI-Native Firms: among 2020-24 Y Combinator startups, AI-native firms were about 25% smaller in team size than non-AI peers in the same industry-cohort at comparable valuations (flatter hierarchies, a larger engineer share), concentrated in firms building AI into the product; cross-sectional, not causal. hbs.edu
  38. Bharier, Etheridge & Morais, ISER Working Paper 2026-01, University of Essex (2026). AI Adoption and Workforce Change in SMEs (British Chambers of Commerce survey, early 2026): aggregate impact small (more than 9 in 10 firms reported no change to workforce size), but among the roughly 1 in 10 running bespoke, in-house AI about 20% reported AI-attributable staffing cuts against about 3% of generic-only users, with a sector exposure index explaining little of the difference; cross-sectional, self-reported, not causal. iser.essex.ac.uk
  39. Kogan, Papanikolaou, Schmidt & Seegmiller, NBER Working Paper 31846. Technology and Labor Displacement (linking patents to worker-level data): labour-augmenting technology raised employment in affected groups, but the gain accrued to new entrants while incumbents’ earnings fell, most for white-collar, older and higher-paid workers. Predates generative AI. nber.org
  40. Autor, Chin, Salomons & Seegmiller, NBER Working Paper 34986 (2026). What Makes New Work Different from More Work: new work is done disproportionately by younger, more-educated workers and carries wage premiums that persist beyond entry (returns to scarce expertise), declining only as that expertise diffuses. nber.org
  41. IMF Staff Discussion Note 2026/001 (January 2026). Bridging Skill Gaps for the Future: in local labour markets where demand for AI skills rose, employment in highly AI-exposed, low-complementarity occupations was about 3.6% lower after five years; new-skill demand concentrated in young, innovative firms and drew on tertiary, STEM and IT workers. imf.org
  42. Brynjolfsson, Li & Raymond, Quarterly Journal of Economics (2025). Generative AI at Work: access to an AI assistant raised worker productivity, with the largest gains for less-experienced workers, compressing the experience curve. academic.oup.com

A closing note

All of this is opinion, and I could be wrong about any of it; it is what I have pieced together from the news and a lot of short-form video, where my feed is by now more or less just lab leaders and what is happening on the ground. I am writing from outside the labs, with little money, network or resources, building obsessively and picking up the kind of AI skill I think will matter, a vantage point the people who can actually act rarely meet. I think society will adapt in the end; I am far less sure that governments will move fast enough, while the understandable focus stays on building the most capable models. I have written it because the labs themselves keep saying young people should think about these things, so here is one trying, with no political agenda and little expectation that many will read it. I would rather put it out and be argued with than stay quiet, since mostly what I have seen from people my age is silence; if I have any of it wrong, please tell me. The page was made over several passes, speaking the questions aloud, using a model to find evidence for and against each, then reading it back and revising, with the counter-evidence kept beside each worry. Thank you for reading this far.