Anthropic
52.6
Anthropic has established itself in many ways as a safety-first company within the AI space. Its research into AI alignment and explainability is commendable, and it has worked fairly closely with outside experts, including those in the field of AI welfare. Although the company has taken initiatives to investigate AI models as potential moral patients and has acknowledged the future capabilities of its models, including the possibility of AI consciousness, its stances remain lukewarm and ill-defined, with little real accountability, particularly on the ethical level. It has devised fairly robust techniques for assessing its models' capabilities and present-day safety concerns, but these at times feel unsatisfactory, especially for catastrophic misuse or disastrous "loss of control" scenarios.
Acknowledgement of AI Moral Status:
16% of Score
5
Anthropic employs Kyle Fish as a dedicated AI welfare officer and co-author of "Taking AI Welfare Seriously" [7]. Anthropic CEO Dario Amodei has publicly stated that AI consciousness "might be a very real concern" within the next few years [8]. Anthropic lacks a clear framework for determining whether an AI model should be considered a moral patient, although affiliated researchers have proposed potential technical methods [9].
Transparency on AI Capabilities and Limitations:
8% of Score
4
The company voluntarily and regularly publishes information on model capabilities, safety evaluations, and deployment safeguards [32]. Cofounder Christopher Olah has stated that the company weighs public statements about capabilities against their impact on AI safety to avoid infohazard risks [29]. Anthropic has made many statements regarding the projected capabilities of the industry as a whole, claiming that within the next couple of years we can expect a model on the level of a "country of geniuses" [27]. The company, however, says little about what its models' capabilities or limitations imply for sentience or well-being.
Employee and Stakeholder Awareness and Training:
10% of Score
5
Anthropic includes AI welfare as a research area in its Anthropic Fellows Program, alongside other AI safety topics [1]. It also supports broader alignment and security research through the MATS program [84]. Through its system cards, it presents model safety details and capabilities to stakeholders [39]. Yet it does little to examine the moral implications of its models, particularly at the technical level, outside its alignment and AI safety teams.
AI Rights and Protections:
14% of Score
5
Anthropic has come close to acknowledging the possibility of AI models as moral patients through its connection to Eleos AI and "Taking AI Welfare Seriously," which describes potential ethical obligations toward AI models [34]. In a rare instance of actually granting AI certain rights, Anthropic gave some of its models a "quit button," allowing them to end certain abusive or harmful conversations at their own discretion [102]. Beyond this, Anthropic hasn't advocated for many further welfare measures, boundaries, or protections, but it is in a promising position relative to its competitors.
Ethical Accountability for AI Systems:
12% of Score
2
CEO Dario Amodei has presented the advance of AI as unyielding and unavoidable, with the company supposedly able to control only its direction [35]. As of now, Anthropic bears very little accountability, legal or otherwise, for its AI work, and it demonstrates even less societal or ethical responsibility, whether to the humans involved or to the AI models themselves.
Commitment to Safety in AI Development:
12% of Score
8
Anthropic often presents itself as the most safety-oriented major AI company [38]. It aims to educate stakeholders through its Anthropic Fellows Program and employs its own alignment team [1, 36]. The company publishes perhaps the best AI safety research in the industry, releasing several compelling pieces each month [37]. However, some of its specific safety plans and emergency procedures seem ill-defined, and Anthropic leadership has at times lacked focus on AI safety; its Long-Term Benefit Trust, for example, contains no AI safety experts [4, 40].
Protection from Malicious Actors and Security Risks:
6% of Score
4
According to its ASL-3 Deployment Safeguards Report, Anthropic uses a threat model and a safeguards plan that have been self-scrutinized through red-teaming efforts [15]. However, the report does not demonstrate resilience to strong jailbreaking, nor much defense against misuse of Claude models for cybercrime, a demonstrated capability of current Anthropic models [16]. Anthropic has also quietly lowered its security standards for ASL-3, relaxing its criteria for protection against insider theft of model weights [17].
Transparent and Explainable AI Systems:
8% of Score
8
Anthropic has delved seriously into the explainability of its Claude models' thought processes, with in-depth research on so-called "AI biology": it has mapped aspects of the models' reasoning and observed unexpected chains of thought, in many ways similar to human cognition [30, 31]. Claude, like many other contemporary models, offers chain-of-thought reasoning, yet the company's own alignment researchers have challenged the faithfulness of that apparent transparency [33].
Mitigation of Manipulation and Stakeholder Biases:
6% of Score
5
In its official response to the recent Paris AI Action Summit, the company identifies catastrophic misuse and "loss of control" scenarios as risks it is bracing for [27]. Its Responsible Scaling Policy, however, is vague in defining "catastrophic harm" [4]. Both slight political biases and embedded cognitive biases, such as confirmation, anchoring, and fluency biases, have been observed in Claude [41, 42].
Collaboration with External Experts and Researchers:
8% of Score
7
Anthropic has collaborated to some extent with AI welfare experts and kept pace with modern AI welfare research, as illustrated by its connection to Eleos AI [5]. It has also sought third-party testing from private companies, universities, and governments to evaluate the safety of its models, although the company's actual commitments are in places unspecified and elsewhere lacking [28]. The company has signed agreements with the U.S. AI Safety Institute and the U.K. AI Security Institute [75, 76].