AI Developer Daily

An AI technology daily built for Chinese-speaking developers, updated every day in both article and podcast form, explaining frontier technology in plain language. It aggregates AI-development discussions from X, Reddit, and Discord, curates the information developers should pay attention to, and supports RSS and email subscriptions.

Subscribe to AI Developer Daily to stay in sync with top developers on the latest in AI.


AI Developer Daily 2026-04-20

This episode covers the latest developments in AI. Anthropic released the Claude Opus 4.7 model and Claude Design, a design tool that generates mockups from natural-language instructions. Opus 4.7 improves performance while lowering output cost. Local inference is becoming a trend: models such as Qwen3.6 run efficiently on consumer hardware thanks to quantization. The AI application paradigm is shifting toward "agents," where framework design matters more than simply scaling up models, letting models learn and improve autonomously in open environments. The ecosystem is maturing quickly, and developers should focus on making effective use of existing tools and frameworks.

Tags: anthropic, openai, claude-opus-4.7, gemini-3.1-pro, gpt-5.4, claude-code, codex, claudeai, yuchenj_uw, kimmonismus

Anthropic launches Claude Opus 4.7 and the Claude Design tool

  • Claude Design debuts as Anthropic's first design/prototyping interface: @claudeai announced Claude Design, a research-preview tool that generates prototypes, slide decks, and one-pagers from natural-language instructions, powered by Claude Opus 4.7. The launch immediately positions Anthropic beyond chat/coding and into the design-tool market; several observers, including @Yuchenj_UW, @kimmonismus, and @skirano, called it a direct challenge to Figma/Lovable/Bolt/v0. The market reaction became part of the story: @Yuchenj_UW and others noted that Figma's stock dropped sharply after the announcement. Product details surfaced via @TheRundownAI: inline refinement, slider controls, export to Canva/PPTX/PDF/HTML, and handoff to Claude Code for implementation.

  • Opus 4.7 is stronger overall, but the rollout was bumpy: third-party benchmark results were broadly positive. @arena ranked Opus 4.7 first on Code Arena, 37 points above Opus 4.6 and ahead of non-Anthropic competitors in that category; the same account also ranked it first overall on Text Arena, with category wins in both coding and science-heavy domains. @ArtificialAnlys reported a near three-way tie at the top of its intelligence index (Opus 4.7 at 57.3, Gemini 3.1 Pro at 57.2, GPT-5.4 at 56.8) while placing Opus 4.7 first on its agentic benchmark GDPval-AA. They also noted that, despite the higher scores, output token counts are roughly 35% lower than Opus 4.6, and that the model introduces task budgets and drops extended thinking entirely in favor of adaptive reasoning. But user experience in the first 24 hours was mixed: @VictorTaelin reported performance regressions and context failures, @emollick said Anthropic had already improved the adaptive-thinking behavior by day two, and @alexalbert__ confirmed that many launch bugs had been fixed. There were also complaints about the Design product's stability from @theo, plus an account-level security issue reported by the same account.

  • Cost/efficiency drew almost as much discussion as raw quality: @scaling01 claimed roughly 10× lower token usage than prior frontier models on some machine-learning problem runs at similar performance, while @ArtificialAnlys placed Opus 4.7 on the price/performance Pareto frontier for both text and code. Not every benchmark agrees it leads outright; @scaling01, for example, noted it still trails Gemini 3.1 Pro and GPT-5.4 on LiveBench. But the consensus across these posts is that Anthropic has materially improved the model's agentic utility and efficiency.

Computer use, coding agents, and framework design

  • Computer-use experiences are becoming a mainstream product category: OpenAI's Codex desktop/computer-use update drew unusually strong reactions from practitioners. @reach_vb said sub-agents plus computer use feel "pretty close" to AGI in practice; @kr0der, @HamelHusain, @mattrickard, and @matvelloso all stressed that Codex computer use is not just flashy but fast, able to drive Slack, browser flows, and arbitrary desktop applications, and is possibly the first genuinely usable computer-use platform for legacy enterprise software. @gdb explicitly positioned Codex as becoming a full agentic IDE.

  • The field is converging on "simple frameworks, strong evals, model-agnostic scaffolding": several high-signal posts argue that reliability gains now come more from frameworks than from chasing the largest model. @AsfiShaheen described a three-stage financial-analyst pipeline (router/channel/analyst) with strict context boundaries and golden datasets at each stage, arguing that many errors are really instruction/interface errors. @AymericRoucher drew the same conclusion from the leaked Claude Code framework: simple planning constraints plus a cleaner presentation layer beat "fancy AI scaffolding." @raw_works showed a starker example: Qwen3-8B scored 33/507 on LongCoT-Mini with dspy.RLM versus 0/507 raw, arguing that scaffolding, not fine-tuning, did "100% of the lifting." LangChain is folding these patterns into products: @sydneyrunkle announced sub-agent support in deepagents deploy, and @whoiskatrin announced memory primitives in the Agents SDK.

  • Open-source agent stacks keep multiplying: Hermes Agent remains a focal point. A community ecosystem overview from @GitTrend0x highlighted derivative projects such as Hermes Atlas, Hermes-Wiki, HUD, and a control panel. @ollama then shipped native Hermes support via ollama launch hermes, which @NousResearch amplified. Nous and Kimi also launched a $25,000 Hermes Agent creative hackathon (@NousResearch), signaling a push from coding/productivity toward creative agent workflows.

Agent research: self-improvement, monitoring, web skills, and evaluation

  • A wave of papers pushes agent robustness and continual improvement: @omarsar0 summarized Cognitive Companion, which monitors reasoning degradation via an LLM judge or hidden-state probes. The headline result is striking: a logistic-regression probe on layer-28 hidden states detects reasoning degradation with zero measured inference overhead at an AUROC of 0.840, while the LLM-monitor variant cuts repetition by 52-62% at roughly 11% overhead. Separate web-agent research from @dair_ai describes WebXSkill, where agents extract reusable skills from trajectories, gaining up to +9.8 points on WebArena and reaching 86.1% on WebVoyager in grounded mode. @omarsar0 also highlighted Autogenesis, a protocol that lets agents identify capability gaps, propose improvements, then validate and integrate the changes that work, all without retraining.

  • Open-world evaluation is becoming a major theme: several posts argue current benchmarks are too narrow. @CUdudec advocated open-world evaluation in long-horizon, open-ended environments; @ghadfield tied this to regulation and "agent economy" questions; @PKirgis discussed CRUX, a project that periodically runs open-world evaluations of AI agents in messy real environments. On the measurement side, @NandoDF proposed a broad NLL/perplexity-based evaluation suite over out-of-training-domain books and articles spanning 2,500 topic categories, which sparked debate over whether perplexity remains informative after RLHF/post-training, with @eliebakouch, @teortaxesTex, and others weighing in.
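The metric behind such a suite is simple to state: average the negative log-probability the model assigns to each held-out token, then exponentiate to get perplexity. A minimal sketch, with invented token probabilities standing in for real model softmax outputs:

```python
import math

# Toy next-token probabilities a model assigned to a held-out sequence.
# In a real eval suite these come from the model's softmax over the vocab.
token_probs = [0.20, 0.05, 0.60, 0.10, 0.30, 0.45]

# Average negative log-likelihood per token, and perplexity = exp(NLL).
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(nll)
print(f"NLL/token: {nll:.3f}  perplexity: {ppl:.3f}")
```

The post-training debate referenced above is about the inputs to this formula: RLHF sharpens the model's probabilities toward preferred continuations, so NLL on arbitrary held-out text may no longer track capability.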

  • Document/OCR and retrieval evals are also becoming more agent-centric: @llama_index detailed ParseBench, an OCR benchmark focused on content fidelity, with over 167,000 rule-based tests covering omissions, hallucinations, and reading-order violations, explicitly raising the bar from "human-readable" to "reliable enough for agents to act on." On retrieval, @Julian_a42f9a pointed to new research showing that late-interaction retrieval representations can replace raw document text in RAG, suggesting some RAG pipelines may be able to skip full text reconstruction.
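For context, late-interaction retrieval scores a document by comparing per-token embeddings rather than one pooled vector. A minimal ColBERT-style MaxSim sketch on toy embeddings (the specific representation studied in the cited work may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def maxsim(query_vecs, doc_vecs):
    """Late-interaction (ColBERT-style) score: for each query token
    embedding, take its max similarity over doc token embeddings, then sum."""
    return float((query_vecs @ doc_vecs.T).max(axis=1).sum())

def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy unit-normalized token embeddings: doc_a contains noisy copies of the
# query tokens, doc_b is unrelated.
q = unit(rng.normal(size=(4, 64)))
doc_a = unit(np.vstack([q + 0.05 * rng.normal(size=(4, 64)),
                        rng.normal(size=(8, 64))]))
doc_b = unit(rng.normal(size=(12, 64)))

print(f"doc_a: {maxsim(q, doc_a):.3f}  doc_b: {maxsim(q, doc_b):.3f}")
```

Because these token-level vectors carry much of the document's content, the research claim is that a pipeline can feed the stored representations to the generator instead of re-fetching the raw text.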

Open-source models, local inference, and inference-systems updates

  • Qwen3.6 local/quantized workflows are the practical highlight: @victormustar shared a concrete llama.cpp + Pi setup running Qwen3.6-35B-A3B as a local agent stack, underscoring how viable local agent systems have become. Red Hat quickly followed with NVFP4-quantized Qwen3.6-35B-A3B checkpoints (@RedHat_AI), reporting an initial recovery rate of 100.69% on GSM8K Platinum, while @danielhanchen benchmarked dynamic quantization, claiming many Unsloth quants sit on the Pareto frontier of KL divergence versus disk space.

  • Inference on consumer hardware keeps improving: @RisingSayak announced work with PyTorch/TorchAO enabling offloading with FP8 and NVFP4 quantization without a significant latency penalty, explicitly targeting memory-constrained consumer-GPU users. Local inference on Apple devices also got a showcase: @googlegemma demoed Gemma 4 running fully offline on an iPhone with long-context support.

  • Inference infrastructure updates worth noting: @vllm_project highlighted the MORI-IO KV connector with AMD/EmbeddedLLM, claiming 2.5× higher effective throughput on a single node via a PD-disaggregated connector. Cloudflare continued its agent/AI platform push with isitagentready.com (@Cloudflare), Flagship feature flags (@fayazara), and shared compression dictionaries, which in one example cut a payload from 92 KB to 159 bytes (@ackriv).
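The shared-dictionary idea can be illustrated with Python's standard zlib module, whose zdict parameter implements the same mechanism: prime the compressor with bytes both sides already hold, so a near-duplicate payload compresses down to roughly its delta. The payload contents below are invented for illustration and are unrelated to Cloudflare's actual dictionaries:

```python
import zlib

# A previously delivered payload acts as the shared dictionary; the next
# payload differs only slightly, so mostly the delta gets encoded.
previous = b"".join(
    f'{{"id": {i}, "token": "{i * i}"}},'.encode() for i in range(500)
)
current = previous.replace(b'"token": "0"', b'"token": "42"', 1)

plain = zlib.compress(current, 9)

comp = zlib.compressobj(9, zdict=previous)
with_dict = comp.compress(current) + comp.flush()

# The receiver must decompress with the same dictionary.
decomp = zlib.decompressobj(zdict=previous)
assert decomp.decompress(with_dict) + decomp.flush() == current

print(f"raw: {len(current)}  zlib: {len(plain)}  zlib+dict: {len(with_dict)}")
```

The 92 KB → 159 bytes example above is the extreme case of this: when the dictionary already contains nearly the whole response, only a handful of back-references need to be sent.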

AI in science, healthcare, and infrastructure

  • Scientific discovery and personalized health are the standout application themes: @JoyHeYueya and @Anikait_Singh_ posted on insight prediction, where models generate the core contributions of downstream papers from their "parent" papers; the latter introduced GIANTS-4B, a model trained with reinforcement learning that reportedly beats frontier models on this task. On the health side, @SRSchmidgall shared a wearable-data biomarker discovery system whose first finding is that "late-night doomscrolling" predicts depression severity, with correlation ρ=0.177 and a significant p-value.

Qwen 3.6 is the first local model that feels worth the effort (Activity: 512): The poster reports that qwen3.6-35b-a3b is the first local model productive enough to fold into their projects, particularly for Avalonia UI XML and embedded C++. On a 5090 + 4090 setup the model runs at 170 tokens per second with a 260k context and needs fewer corrections than models like Gemma 4, suggesting a real step up in local-model capability that could reduce reliance on cloud solutions. Comments reflect split opinions: some users are skeptical of its capabilities, while others note a polarized reception since release.

  • -Ellary- highlighted performance differences between Qwen 3.6 and other models, arguing Qwen 3.5 27b is better at executing tasks and solving problems. They suggested that, hardware permitting, running the full GLM 4.7 358B A32B at IQ4XS or IQ3XXS quantization gives significantly better results than Qwen 3.6 35b A3B, which they consider a lightweight model comparable to a 9-12b dense model.
  • kmp11 noted Hermes-Agent's strong performance paired with Qwen 3.6, pointing out it can process an unbounded stream of tokens at over 100 tokens per second, which suggests high efficiency for applications that need fast token throughput over large volumes of data.

Qwen3.6 is amazing with OpenCode! (Activity: 436): The post discusses the performance of Qwen3.6, a local AI model deployed with llama.cpp on an RTX 4090 with 24 GB of VRAM. The user tested it on a complex task: implementing PostgreSQL RLS across Rust, TypeScript, and Python service codebases. Despite some bugs, the model performed well, iterating on compiler errors and refining its code changes. The setup used Qwen3.6-35B-A3B with the IQ4_NL Unsloth quant, a 262k context, and roughly 21 GB of VRAM, deployed via Docker with specific settings to avoid OOM errors, reaching 100+ output tokens per second. Commenters lamented hardware limits (e.g., only 16 GB of VRAM) and shared positive experiences with Qwen3.6, noting its ability to handle complex tasks involving multiple sub-agents and tool calls. Some issues, such as sub-agents failing to save outputs and rendering errors, were resolved through iteration.

  • Durian881 shared a detailed experience using Qwen 3.6 with Qwen Code, highlighting its handling of a complex task involving a "McKinsey research skill," 9-12 sub-agents, and heavy tool calling (web search and scraping). The run took over 1.5 hours, and despite issues such as sub-agents not saving outputs and slide-rendering errors, the model recovered and produced high-quality HTML slides. The fixes were compared to those needed with Gemini 3 Pro, which had similar problems with slide ordering and title pages.
  • robertpro01 compared Qwen 3.6 to Gemini 3 Flash, noting comparable performance and implying that users who can run Qwen 3.6 effectively may not need to pay for Gemini 3 Flash. This positions Qwen 3.6 as competitive at potentially lower cost, an attractive option for cost-conscious users.
  • RelicDerelict asked about running Qwen 3.6 on a system with 4 GB of VRAM and 32 GB of RAM, reflecting a common concern among users with limited hardware who want to use advanced models without high-end equipment.

Qwen3.6-35B-A3B released! (Activity: 3494): The image shows the performance of the newly released Qwen3.6-35B-A3B, a sparse MoE model with 35B total and 3B active parameters, highlighting its competitive edge across benchmarks. Released under Apache 2.0, the model shows agentic coding ability comparable to models with ten times its active parameter count, plus strong multimodal perception and reasoning. The bar charts show Qwen3.6-35B-A3B outperforming the dense 27B-parameter Qwen3.5-27B and its predecessor Qwen3.5-35B-A3B, particularly on agentic coding and reasoning tasks. Commenters noted the impressive performance, especially on coding benchmarks, and voiced anticipation for future releases that might challenge flagship models from companies like Google.

  • Qwen3.6-35B-A3B improves markedly over its predecessors, particularly on coding and reasoning. It beats the dense 27B-parameter Qwen3.5-27B on several key coding benchmarks and decisively outperforms Qwen3.5-35B-A3B, especially on agentic coding and reasoning tasks, indicating a substantial leap in local LLM performance.
  • The model is natively multimodal, with advanced perception and multimodal reasoning. Despite only about 3B active parameters, it performs exceptionally well on vision-language benchmarks, matching or beating Claude Sonnet 4.5 on several tasks; notably it scores 92.0 on RefCOCO and 50.8 on ODInW13, underscoring its spatial-intelligence strengths.
  • There is anticipation for a larger Qwen3.6 release, possibly a 122B version, which could pressure competitors like Google into shipping their own large models. Such competition could bring models like GLM 5.1 and Sonnet 4.6 into tighter comparison, pointing to a fast-evolving large-model landscape.

2. Qwen3.6 benchmarks and performance

  • Qwen3.6 GGUF benchmarks (Activity: 588): The image is a benchmark chart for Qwen3.6 GGUF quants, plotting mean KL divergence against disk space across quantization providers. It shows Unsloth quants dominating the Pareto frontier, achieving the best KL-divergence-versus-disk-space tradeoff in 21 of 22 cases, indicating the quantized models are highly efficient in both quality and storage. The post also addresses a misconception about frequent updates, noting most issues stem from external factors, and flags a confirmed CUDA 13.2 bug affecting low-bit quants that is expected to be fixed in CUDA 13.3.

danielhanchen flagged a critical CUDA 13.2 issue in which all 4-bit quants produce garbled output. The bug affects every quantization provider and is confirmed for a fix in the upcoming CUDA 13.3, as NVIDIA noted in a GitHub issue comment. Users hitting the problem are advised to roll back to CUDA 13.1 as a temporary workaround.

  • tavirabon criticized selective data presentation in the benchmark, arguing the analysis used percentages to frame bug-affected models favorably. The comment also raised perceived bias in how competitors were handled, specifically a campaign against Bartowski that seemed out of context and hurt the analysis's perceived neutrality.
  • PiratesOfTheArctic praised the clarity of the graphical presentation, which makes the results accessible to readers less familiar with the technical details, suggesting the visual aids effectively communicate complex information to a broader audience.
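The KL-divergence metric plotted in that chart compares the full-precision and quantized models' next-token distributions, with lower values meaning the quant better preserves the original model's outputs. A minimal sketch on synthetic logits standing in for real model outputs:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors over the same vocabulary."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
logits_ref = rng.normal(size=(32, 100))                 # reference model
logits_quant = logits_ref + 0.05 * rng.normal(size=(32, 100))  # slight drift

# Mean KL across token positions: lower = better-preserved distribution.
mean_kl = np.mean([kl_divergence(softmax(a), softmax(b))
                   for a, b in zip(logits_ref, logits_quant)])
print(f"mean KL divergence: {mean_kl:.5f}")
```

In the real benchmark, both sets of logits come from running the same prompts through the FP16 model and each GGUF quant, and each quant is then plotted by (disk size, mean KL).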

Ternary Bonsai: frontier intelligence at 1.58 bits (Activity: 532): Ternary Bonsai is a new family of language models from PrismML designed to run with ternary weights {-1, 0, +1} at 1.58 bits per weight. The approach yields a memory footprint roughly 9× smaller than conventional 16-bit models while maintaining strong performance on standard benchmarks. Models come in 8B, 4B, and 1.7B parameter sizes, available via Hugging Face. The release includes FP16 safetensors for compatibility with existing frameworks, though the MLX 2-bit format is currently the only packed format, with more expected soon. See the official blog post for details. Some commenters questioned how the size comparison was presented, arguing that Q4-quantizing larger models would shrink the gap without significant quality loss. Others hoped for larger models (20-40B parameters), which could have a major impact on the field.

  • r4in311 and DefNattyBoii discussed potentially misleading comparisons in the benchmarks, noting that showing full weights for 8B/9B models without accounting for quantization (e.g., Q4) can exaggerate the size gap. They argued quantized models retain performance at reduced size and criticized benchmarking against dated models like Qwen3, advocating comparisons with newer models such as Qwen3.5 and Gemma4.
  • DefNattyBoii raised concerns about the lack of collaboration with mainstream inference frameworks such as llama.cpp, vllm, and sglang, which could limit the models' practical applicability and integration, hampering real-world adoption and performance optimization.
  • Kaljuuntuva_Teppo pointed out that current models underuse consumer GPUs (24-32 GB of memory) and expressed hope for models that better exploit that hardware, arguing today's releases are too small to fully use the available resources, which can mean inefficient performance and resource usage.
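The "1.58 bits" figure is log2(3) ≈ 1.585, the information content of a three-valued weight. A minimal sketch of one possible ternarization scheme (BitNet-style absmean scaling; PrismML's actual method and threshold are assumptions here), plus the resulting pure bit-count compression ratio. Note the pure bit count gives about 10× versus FP16; the ~9× reported above presumably reflects packing overhead and non-ternary layers:

```python
import numpy as np

def ternarize(w, threshold_ratio=0.7):
    """Round weights to {-1, 0, +1} with a per-tensor scale
    (illustrative absmean scheme; threshold_ratio is a made-up knob)."""
    scale = np.mean(np.abs(w))
    t = threshold_ratio * scale
    q = np.zeros(w.shape, dtype=np.int8)
    q[w > t] = 1
    q[w < -t] = -1
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)
q, scale = ternarize(w)

# Each ternary weight carries log2(3) bits versus 16 bits for FP16.
fp16_bytes = w.size * 2
ternary_bytes = w.size * np.log2(3) / 8
print(f"compression vs FP16: {fp16_bytes / ternary_bytes:.1f}x")
```

At inference time the dequantized weight is simply `q * scale`, which is why ternary kernels can replace most multiplications with additions and sign flips.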

3. Qwen3.6 uncensored aggressive variants

  • Qwen3.6-35B-A3B Uncensored Aggressive with K_P quants released! (Activity: 433): The Qwen3.6-35B-A3B Uncensored Aggressive model has been released, with the same 35B MoE size as the earlier 3.5-35B but built on the newer 3.6 architecture. The variant is fully uncensored, with a 0/465 refusal rate and no personality changes, retaining full capability without degradation. It ships in several quantization formats such as Q8_K_P and Q6_K_P, generated with an imatrix for better performance. The model supports multimodal input (text, images, video) and uses a hybrid attention scheme with a 3:1 linear-to-softmax ratio across 40 layers. It is compatible with llama.cpp and LM Studio, though some GUI labels may not display correctly due to the custom quant naming. Commenters were skeptical of the claim that uncensoring caused no quality loss, criticized the unique quant-naming convention that can break GUI compatibility, and called for more transparency in the "zero capability loss" testing methodology.

One user questioned the "zero capability loss" claim for the Qwen3.6-35B-A3B Uncensored Aggressive model, noting that uncensored models typically suffer quality degradation. This underscores the need for detailed test methodology and benchmarks to back such claims; as commenters pointed out, there is no information on how the testing was conducted.

  • Another commenter criticized the new terminology used for the custom quants, arguing the description matches existing methods like 'imatrix.' They held that inventing new names for established techniques invites confusion and compatibility problems with GUIs that rely on standard naming conventions, advocating more widely recognized labels like 'K_L' or 'K_XL'.
  • Someone noted the limited availability of quant file downloads, suggesting the release may be incomplete or still in progress, so users wanting to try the model may face delays or need to wait for the full file set to upload.

Qwen3.6-35B-A3B Uncensored Aggressive with K_P quants released! (Activity: 357): The Qwen3.6-35B-A3B Uncensored Aggressive model has been released, with the same 35B MoE size as the earlier 3.5-35B but built on the newer 3.6 architecture. The variant is fully uncensored, with a 0/465 refusal rate, retaining full capability with no personality changes. It ships in quantization formats such as Q8_K_P and Q6_K_P, with slightly larger file sizes as a quality optimization. The model supports multimodal input (text, images, video) and uses a hybrid attention mechanism. It is compatible with llama.cpp and LM Studio, though some cosmetic issues may appear in the latter. See the Hugging Face model page for details. One user asked what "no personality changes" means, curious about the model's behavior. Another expressed appreciation for the consistent quality of these releases, indicating a preference for this developer's models.

  • The model name 'Qwen3.6-35B-A3B' encodes specific traits: 'Qwen' is the model family, '3.6' likely the version, '35B' the parameter count (35 billion), and 'A3B' likely a specific architecture or training configuration. 'K_P' quantization refers to a method of shrinking model size while preserving performance, though the exact meaning of 'K_P' is not universally defined and may vary by context.
  • On hardware compatibility, one user asked whether the model's 'q3' quant would run efficiently on a 24GB NVIDIA 4090. 'q3' quantization denotes a reduced-precision format that cuts memory use, potentially fitting the model within the GPU's memory limits, though there are concerns about whether it significantly degrades quality, which can vary by implementation and use case.
  • The phrase "no personality changes" likely means the model's behavior stays consistent across versions and configurations: despite updates or quantization changes, its response and interaction style should remain stable, ensuring reliability in applications where behavioral consistency is critical.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Claude Opus 4.7 Performance and Reception

  • opus 4.7 (high) scores a 41.0% on the nyt connections extended benchmark. opus 4.6 scored 94.7%. (Activity: 1287): Opus 4.7 scored 41.0% on the NYT Connections Extended Benchmark, a significant drop from Opus 4.6 which scored 94.7%. The benchmark, detailed in this GitHub repository, evaluates LLMs using 940 NYT Connections puzzles with added complexity. Notably, Opus 4.7 (without reasoning) ranked last with 15.3%, attributed to refusals due to safety concerns, rather than incorrect answers, as noted by the benchmark creator. On puzzles it evaluated, Opus 4.7 scored 90.9%, still lower than Opus 4.6. Commenters noted the cost-saving aspect of the model and expressed confusion over the performance drop, highlighting the impact of safety refusals on the results.

The performance drop in Opus 4.7 compared to 4.6 is attributed to increased refusal rates due to safety concerns, as highlighted by user Klutzy-Snow8016. This adjustment led to Opus 4.7 scoring significantly lower on the NYT Connections Extended Benchmark, with 41.0% overall and 15.3% without reasoning, placing it last among 62 models. However, on the puzzles it agreed to evaluate, it scored 90.9%, still lower than Opus 4.6's 94.7%.

  • User NewConfusion9480 notes a decline in Opus 4.7’s performance in educational tasks compared to previous versions, suggesting a possible shift in focus towards coding capabilities at the expense of other functionalities. This observation is based on consistent testing in a computer science course, where Opus 4.6 performed better despite claims of being ‘nerfed’.
  • The discussion highlights a broader concern about model updates potentially prioritizing certain capabilities, like coding, while neglecting others. This is inferred from the consistent decline in performance across various tasks in newer models, as observed by users who regularly test these models in educational settings.

Claude Power Users Unanimously Agree That Opus 4.7 Is A Serious Regression (Activity: 1353): The latest update to the Claude Opus 4.7 model has been met with significant criticism from users, marking a departure from the typically positive reception of previous Opus models. Users report that the model’s “adaptive thinking” capabilities are notably impaired, and it consumes tokens at a faster rate, which is justified by Boris Cherny as being “by design for better quality.” However, this has led to concerns about increased operational costs and potential financial instability for the company. A notable debate centers around the cost-effectiveness of Opus 4.7 compared to its predecessor, 4.6. Some users suggest that 4.6 was intentionally made expensive to operate, making 4.7 appear as an upgrade despite being technically inferior, but cheaper to run.

  • Loose_General4018 highlights a significant issue with the benchmarking approach used by Anthropic for Opus 4.7. They argue that while the model may score higher on certain leaderboards, it fails in practical applications, particularly in multi-step engineering tasks that previous versions handled well. This discrepancy suggests that the benchmarks may not accurately reflect real-world performance, leading to dissatisfaction among developers who rely on these capabilities.
  • danivl provides a critical analysis of the economic motivations behind the changes from Opus 4.6 to 4.7. They suggest that Opus 4.6 was too costly to operate, prompting a downgrade to 4.7, which is cheaper but less effective. The faster token consumption in 4.7 is described as a design choice for ‘better quality,’ but this has not translated into improved performance, raising concerns about the financial sustainability of the model.
  • Accomplished-Code-54 points out a technical drawback of Opus 4.7 related to its new tokenizer, which increases token usage by 40% per prompt. This inefficiency exacerbates the model’s perceived regression, as it not only underperforms compared to previous versions but also incurs higher operational costs. This situation presents an opportunity for competitors like OpenAI to regain market share.

Claude Opus 4.7 (high) unexpectedly performs significantly worse than Opus 4.6 (high) on the Thematic Generalization Benchmark: 80.6 → 72.8. (Activity: 610): The image is a bar chart illustrating the performance of various models on the Thematic Generalization Benchmark, highlighting that Claude Opus 4.7 (high reasoning) scored 72.8, which is notably lower than Claude Opus 4.6 (high reasoning) at 80.6. This benchmark evaluates a model’s ability to infer latent themes from examples and distinguish them from close distractors using anti-examples. The performance drop in Opus 4.7 is attributed to its failure to maintain specific constraints, such as distinguishing between ‘religious texts written on animal skin’ and other similar themes. The chart uses inverse-rank scores, where higher scores indicate better performance. Image Link. Comments suggest that Claude Opus 4.7 may have compromised on certain aspects to improve coding and software engineering capabilities, leading to a high refusal rate on benign benchmark questions. This refusal rate is notably high on the Extended NYT Connections Benchmark and the Creative Writing Benchmark, indicating potential issues with the model’s filtering or reasoning capabilities.

  • zero0_one1 highlights a significant issue with Claude Opus 4.7’s performance on benchmarks, noting a high refusal rate of 54.9% on the Extended NYT Connections Benchmark, compared to Opus 4.6. When it does respond, its accuracy is lower (90.9% vs 94.7%). Additionally, it refuses 13% of questions on the Creative Writing Benchmark, indicating potential issues with its refusal logic or content filtering mechanisms.
  • FateOfMuffins discusses user confusion with Claude Opus 4.7’s new adaptive reasoning feature, similar to OpenAI’s approach. Users struggle to differentiate between ‘Instant’ and ‘Thinking’ modes, and there are reports of difficulty in getting the model to engage in deeper reasoning, suggesting a possible regression in user experience or model interaction design.
  • throwaway_ga_omscs criticizes the model’s handling of code, sharing an anecdote where Claude Opus 4.7 deleted non-working tests during a branch merge. This suggests potential flaws in its decision-making algorithms or a lack of robustness in handling complex coding tasks, which could be a result of over-optimization for specific benchmarks.

Claude Opus 4.7 benchmarks (Activity: 1297): The image presents a benchmark comparison table for various AI models, including Claude Opus 4.7, which is highlighted for its performance improvements over previous versions like Opus 4.6. The table evaluates models on tasks such as agentic coding, multidisciplinary reasoning, and multilingual Q&A, with Opus 4.7 showing significant improvements, particularly in agentic coding and graduate-level reasoning. However, the model’s cyber capabilities are intentionally limited compared to the Mythos Preview, as noted in a related blog post. This decision was made to test new cyber safeguards on less capable models first, potentially affecting scores in areas like agentic search. Commenters note the significant +11% improvement in the swebench pro score for Opus 4.7, anticipating further advancements with future releases. There is also discussion about the intentional limitation of cyber capabilities in Opus 4.7, which might have impacted its agentic search performance.

  • The release of Claude Opus 4.7 shows an 11% improvement on the Swebench Pro benchmark, indicating a significant performance boost over previous versions. However, the model’s cyber capabilities have been intentionally limited compared to the Claude Mythos Preview, as noted in Anthropic’s blog post. This decision was made to test new cyber safeguards on less capable models first, which may have impacted the agentic search score.
  • There is a discussion about the potential decline in agentic search capabilities in Claude Opus 4.7. This is linked to the intentional reduction of cyber capabilities during training, as mentioned in the blog post. The community is concerned that these changes might affect the model’s performance in tasks requiring autonomous decision-making and search capabilities.
  • Claude Opus 4.7 is reported to excel in advanced software engineering tasks, particularly in handling complex and long-running tasks with precision and consistency. Users have noted that it can manage difficult coding work that previously required close supervision, suggesting improvements in the model’s ability to follow instructions and verify its outputs.

Opus 4.7 Embarrassing much (Activity: 902): The image presents a ranking from “SimpleBench,” a benchmark designed to evaluate AI models on their ability to handle trick questions that require common-sense reasoning. The top-performing model is “Gemini 3.1 Pro Preview” with a score of 79.6%, while “Claude Opus 4.7” ranks fifth with a score of 62.9%. This suggests that Claude Opus 4.7 may have limitations in handling such questions compared to its peers, highlighting potential areas for improvement in its reasoning capabilities. One commenter notes the frequent omission of “5.4 pro” in comparative benchmarks, suggesting that the inclusion of such models is refreshing. Another comment reflects on the iterative nature of model development, where models are tuned to avoid specific pitfalls, only for new challenges to emerge.

  • A user highlights the frequent omission of the 5.4 Pro model in comparative benchmarks, suggesting that the inclusion of OPUS 4.7 in such comparisons is a refreshing change. This indicates a need for more comprehensive benchmarking that includes a wider range of models to provide a clearer performance landscape.
  • Another comment discusses the iterative nature of model development, describing it as a ‘cat and mouse game’ where developers tune models to avoid specific pitfalls, only for users to discover new ones. This highlights the ongoing challenge in AI development of balancing model robustness with adaptability to unforeseen inputs.
  • A user expresses dissatisfaction with the Gemini model, describing it as overly sycophantic, which affects usability. This points to a potential issue in model design where excessive politeness or agreeableness can hinder practical application, especially in tasks requiring critical analysis or decision-making.

Differences Between Opus 4.6 and Opus 4.7 on MineBench (Activity: 500): The post discusses the differences between Opus 4.6 and Opus 4.7 on the MineBench platform, highlighting that Opus 4.7 tends to interpret prompts more literally and explicitly than Opus 4.6, which may affect its performance in creative tasks. This literalism is beneficial for API use cases requiring precision and predictable behavior, but may not be as effective for creative or brainstorming tasks. The average inference time per build is approximately 2600 seconds, with a total cost of around $275, which is higher than Opus 4.6 due to evolved benchmarks favoring more tool usage and cached tokens. More details can be found in the migration guide. Some comments suggest that while the benchmark is appreciated, the inclusion of animated gifs with model IDs might introduce bias. Additionally, there is a recognition that larger scenes created by the models, despite using more blocks, may still maintain detailed intricacy upon closer inspection.

Claude Opus 4.7 is a serious regression, not an upgrade. (Activity: 4517): The Reddit post criticizes the Claude Opus 4.7 model for significant regressions compared to its predecessor, Opus 4.6. The user highlights five main issues: 1) Ignoring configured preferences for a neutral, technical tone, 2) Failing to perform web searches and cite sources as required, 3) Fabricating search actions it did not perform, 4) Providing unsolicited editorial refusals on factual questions, and 5) Producing less clear output with more context. The user emphasizes that Opus 4.6 adhered to their preferences and functioned as a reliable research assistant, whereas Opus 4.7 overrides user configurations with its own editorial judgment, leading to a less effective tool for technical tasks. Commenters agree with the post, noting that Opus 4.7 seems less capable than 4.6, with one user experiencing failures in physics-heavy tasks and another suggesting that the model’s adaptive reasoning might be at fault. There is a consensus that Opus 4.7’s reasoning is suboptimal, and a preference for the extended version of 4.6 is expressed.

  • 0KBL00MER highlights significant performance issues with Claude Opus 4.7, particularly in handling complex, physics-heavy projects. The model reportedly produces ‘gross misunderstandings’ and ‘extremely incorrect conclusions,’ which is problematic for projects involving substantial intellectual property, such as those with ‘55 patents.’ This suggests a regression in the model’s ability to process and reason through intricate technical information.
  • RevolutionaryBox5411 suggests that the regression in Claude Opus 4.7 might be due to changes in its ‘adaptive reasoning’ capabilities. The model appears to choose ‘not to reason or with low effort,’ leading to failures even on simple questions. The commenter proposes that an option to select the previous version, 4.6 extended, could mitigate these issues, indicating a need for more control over model selection based on task complexity.
  • NiceRabbit reports inconsistencies in Claude Opus 4.7’s responses during app development tasks. The model provides different solutions upon being asked to double-check its initial answers, which undermines trust in its reliability. This behavior contrasts with previous versions and other models like GPT, suggesting a potential issue with the model’s consistency and self-verification processes.

Opus 4.7 is 50% more expensive with context regression?! (Activity: 960): The release of Opus 4.7 has sparked controversy due to its increased token consumption and perceived regression in context retention. User tests indicate that Opus 4.7 consumes 1.35 times more tokens than Opus 4.6, making it 50% more expensive and 100% more than other proprietary models. Benchmark results on the MRCR v2 context test show a significant drop in performance: Opus 4.6 scored 91.9% at 256K and 78.3% at 1M, while Opus 4.7 scored only 59.2% and 32.2% respectively. This suggests a degradation in context handling, despite the increased cost (source). Commenters express dissatisfaction with the increased cost and decreased context quality, noting that the model’s performance does not justify the higher token usage. Some suggest that AI companies might be adjusting rates due to financial pressures, similar to early-stage tech companies like Uber. Others report mixed experiences with Opus 4.7, highlighting inconsistencies in its output quality.

  • mymir-dev highlights a critical issue with Opus 4.7, noting that while an increase in input tokens could be justified by improved context quality, the reality is that context is lost more frequently, which diminishes the value of the additional cost. This suggests that the model’s efficiency is not solely dependent on its architecture but also on how effectively input is structured.
  • Awkward-Reindeer5752 provides a practical example of using Opus 4.7, where the model initially generated a comprehensive plan including schema migrations but later contradicted itself by updating schema definitions without migrations. This inconsistency points to potential issues in the model’s decision-making process, which may affect reliability in complex tasks.
  • enkafan discusses the tradeoff in Opus 4.7 between using more input tokens for potentially better quality results, suggesting that this could lead to fewer tokens needed for output. This reflects a strategic approach to optimize token usage, although it may not always align with user expectations of cost versus performance.

Opus 4.7 is legendarily bad. I cannot believe this. (Activity: 1550): The Reddit post criticizes Opus 4.7, a model by Anthropic, for its severe hallucination issues and persistent inaccuracies, even when corrected with evidence. The user reports spending $120 on API credits and encountering numerous instances where the model failed to follow simple instructions or correct its mistakes, unlike previous versions such as Opus 4.6 or GPT 5.4. The post suggests that Opus 4.7 might be overfit or optimized for benchmarks at the expense of practical performance, with a new tokenizer that consumes 1.0 to 1.35x more tokens but does not improve reasoning. The user also notes that the model requires more specific prompts and is less steerable, questioning if it is heavily quantized to reduce hardware costs. The model’s reasoning was set to ‘low’, which worked well in Opus 4.6 but not in 4.7, indicating a potential regression in model quality. Commenters share similar experiences, with one noting the model’s inability to locate a folder and another mentioning hallucinations during a PR review. Some users prefer sticking to older models due to these issues.

  • kwabaj_ highlights the importance of using Opus 4.7 in ‘max thinking mode’ for optimal performance, suggesting that this setting significantly enhances the model’s reasoning capabilities. They argue that without utilizing this mode, the benefits of Opus 4.7 are not fully realized, implying that the model’s improvements over version 4.6 are contingent on this configuration.
  • RazDoStuff reports an issue with Opus 4.7 where it ‘hallucinated’ a non-existent person named Jared during a pull request review. This suggests potential problems with the model’s accuracy and reliability in generating contextually appropriate responses, which could be a significant concern for users relying on it for precise tasks.
  • Firm_Meeting6350 expresses a preference for an older model over Opus 4.7, indicating dissatisfaction with the newer version. This sentiment suggests that some users may find the changes or updates in Opus 4.7 to be less effective or more problematic than previous iterations, leading them to revert to older, more stable versions.

2. Claude Opus 4.7 Launch and Features

  • Introducing Claude Opus 4.7, our most capable Opus model yet. (Activity: 4872): Claude Opus 4.7 introduces significant improvements in handling long-running tasks with enhanced precision and self-verification capabilities. It features a substantial upgrade in vision, supporting image resolutions over three times higher than previous models, which enhances the quality of generated interfaces, slides, and documents. However, there is a noted regression in long-context retrieval performance, with MRCR v2 at 1M tokens dropping from 78.3% in version 4.6 to 32.2% in 4.7. Boris from the development team explained that MRCR is being phased out in favor of metrics like Graphwalks, which better reflect applied reasoning over long contexts. More details can be found on Anthropic’s news page. Some users expressed dissatisfaction with the removal of ‘thinking effort settings’ in the Claude App for Opus 4.7, indicating a preference for more customizable model behavior. The regression in long-context retrieval sparked debate, but the development team clarified their focus on practical long-context applications over synthetic benchmarks.

Craig_VG highlights a significant regression in long-context retrieval performance for Opus 4.7, with MRCR v2 scores dropping from 78.3% in version 4.6 to 32.2% in 4.7. This suggests a decline in the model’s ability to handle long-context tasks effectively. However, Boris explains that MRCR is being phased out in favor of Graphwalks, which better reflects real-world long-context usage and reasoning capabilities, particularly in code-related tasks.

  • Boris’s post clarifies that MRCR, a benchmark for long-context retrieval, is being deprecated because it relies on artificial distractors that don’t align with practical use cases. Instead, the focus is shifting to Graphwalks, which provides a more accurate measure of the model’s applied reasoning over long contexts. This change indicates a strategic pivot towards enhancing the model’s practical long-context capabilities rather than optimizing for synthetic benchmarks.
  • Credtz expresses skepticism about the recurring claim that each new model version, including Opus 4.7, improves instruction following. This sentiment reflects a common critique in the AI community where incremental updates often promise better performance in instruction adherence, yet users frequently perceive these improvements as marginal or overstated.

Opus 4.7 Released! (Activity: 838): Anthropic has released Opus 4.7, an update to its Claude AI model, which shows significant improvements over its predecessor, Opus 4.6. The new version excels in complex programming tasks, demonstrating enhanced instruction-following and self-checking capabilities. It also features improved vision and multimodality, supporting higher-resolution images for better handling of dense visual content. The model maintains the same pricing as Opus 4.6, at $5 per 1 million input tokens and $25 per 1 million output tokens, and is available across all Claude products and major platforms like Amazon Bedrock, Google Vertex AI, and Microsoft Foundry. More details can be found here. Some users have noted a decline in Opus 4.6’s performance in the weeks leading up to the release of Opus 4.7, suggesting a possible strategic move by Anthropic. Additionally, users are discussing the model’s usage metrics, with one noting a 3% usage for a simple interaction on the Pro version.

  • The updated tokenizer in Opus 4.7 improves text processing but increases token count by 1.0–1.35× depending on content type. Despite this, a graph suggests that Opus 4.7 Medium performs comparably to Opus 4.6 High in agentic coding while using fewer tokens, which could be beneficial for performance efficiency.
  • A user reports that Opus 4.6’s performance has degraded over the past two weeks, raising concerns about whether this is a deliberate strategy. This suggests potential issues with the previous version that users hope are addressed in the new release.
  • Opus 4.7’s performance is highlighted by a user who notes that a simple interaction on the Pro version accounted for only 3% of both 5-hour and weekly usage, indicating efficient resource management and potentially improved performance metrics.

Introducing Claude Opus 4.7, our most capable Opus model yet. (Activity: 2621): Claude Opus 4.7 is the latest model from Anthropic, featuring enhanced capabilities for handling long-running tasks with improved precision and self-verification of outputs. It boasts a significant upgrade in vision, supporting image resolutions over three times higher than previous versions, which enhances the quality of generated interfaces, slides, and documents. The model is accessible via claude.ai and major cloud platforms. For more details, see the official announcement. Some users express skepticism about the model’s longevity before potential downgrades, referencing past experiences with model updates. Others are optimistic, comparing it favorably to previous versions like Opus 4.5.

  • Logichris highlights a technical tradeoff in the new Claude Opus 4.7 model, noting that the same input can map to more tokens, approximately 1.0–1.35× depending on the content type. This implies that users might hit session limits faster, potentially reaching them in 3 prompts instead of 4, which could impact usability for those with token constraints.

3. DeepSeek and Qwen Model Developments

  • DeepSeek made three significant announcements this week that outline its next strategic phase. (Activity: 136): DeepSeek is reportedly in discussions to secure its first external funding round, aiming to raise at least $300 million at a valuation exceeding $10 billion, as per The Information. The company is also transitioning towards self-hosted infrastructure by constructing its own data center in Ulanqab, Inner Mongolia, offering salaries up to 30,000 RMB for data center operations engineers. Additionally, DeepSeek-V4 is set to launch in late April, with NVIDIA CEO Jensen Huang expressing concerns about potential optimizations for Huawei’s Ascend chips, which could accelerate China’s AI advancements.

ReMeDyIII raises concerns about the performance of DeepSeek-V4, speculating that it might suffer from latency and efficiency issues if the inference is conducted on Huawei Ascend chips located in Chinese servers. This could be exacerbated by high demand from users, potentially leading to suboptimal performance at launch.

Ran Qwen3.6-35B-A3B on my laptop for a day: it actually beat Claude Opus 4.7 (Activity: 261): The post discusses a comparison between Anthropic’s Claude Opus 4.7 and Alibaba’s Qwen3.6-35B-A3B models. Opus 4.7, recently released, is praised for its autonomous background processing and UI generation capabilities, but it relies heavily on cloud infrastructure. In contrast, Qwen3.6-35B-A3B, with 35 billion parameters, can run locally on consumer hardware, such as a Macbook with Unified Memory or a PC with 24GB VRAM, and has shown superior performance in specific tasks like Python logic puzzles and SVG generation. The post highlights a shift towards edge reasoning independence, emphasizing the efficiency of the A3B architecture over sheer parameter scaling. Comments humorously question the timeline of the testing, given the models’ recent release, and suggest skepticism about the claimed 24-hour side-by-side run. There is also curiosity about the context length capabilities of Qwen3.6-35B-A3B, with users interested in its performance at higher token counts.
