AI Developer Daily 2026-01-21
X (formerly Twitter) has open-sourced its core recommendation algorithm, sparking discussion about platform transparency and fairness. Zhipu AI's GLM-4.7-Flash model is drawing attention for its efficient performance, while peripheral factors such as prompt engineering and quantization strategy significantly affect model quality. Google AI proposed new reasoning approaches such as the "society of thought", and DeepMind research finds that smaller models may hold an advantage in synthetic data generation. Agent technology is advancing rapidly, but coordination/consistency and engineering discipline shape the outcomes. The MCP ecosystem keeps evolving amid integration challenges. GPU performance optimization, AI model evaluation methods, and tools in development workflows also continue to be updated.
Open-sourced platform algorithm: X publishes its "For You" recommendation system
- X Engineering open-sources the X algorithm (a Grok-style Transformer recommender): X announced it has open-sourced its new algorithm (the ranking/recommendation stack), which "uses the same Transformer architecture as xAI's Grok models"; the code is published on GitHub (XEng). The release immediately drew community reactions, both optimistic ("now anyone can 'ask' how a major platform's algorithm works") (David Holz) and adversarial ("I'm fixing it") (Yuchenj_UW).
- Early reverse-reading of the system diagram: One summary notes that the high-level architecture is unsurprising: isolated candidate generation, "no content features", and an emphasis on out-of-network discovery (nearcyan), plus skepticism that "uses a Transformer" is being overhyped as Grok "reading every post" (nearcyan). Another meta-observation: the product's drift from a "following feed" toward "generic content" is the predictable result of incentives (nearcyan).
- Operational/user-impact narrative: Alongside the code release, creators complained of sudden reach suppression ("reach got nuked") (giffmana), reinforcing the tension between engineering and user experience: algorithmic transparency does not automatically translate into perceived fairness.
Open weights and local inference: GLM-4.7-Flash momentum and KV-cache realities
- GLM-4.7-Flash emerges as a "local workhorse" candidate: Multiple posts highlight GLM-4.7-Flash (30B-A3B) for its performance-per-parameter. Benchmarks and hands-on evaluations suggest it is competitive enough to replace larger local defaults (sam_paech). Unsloth explicitly pushes the "run it locally" angle: 200K context length, billed as the best 30B model on SWE-Bench and GPQA, runnable locally "with just 24GB RAM", with GGUF packaging provided (UnslothAI).
- Systems detail: MLA / KV-cache cost dominates: Discussion around GLM-4.7-Flash stresses that KV-cache memory can become the bottleneck earlier than many expect, and that MLA is not free: running an MLA model under a naive MHA path can blow up cache usage (teortaxesTex). A concrete debugging question: why does vLLM show a GLM-4.7-Flash context cost of ~1MB/token under naive MHA when the theoretical figure should be ~54KB (teortaxesTex)?
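The back-of-envelope arithmetic behind that discrepancy fits in a few lines. The layer count, head counts, and MLA latent dimensions below are hypothetical (DeepSeek-V2-style values chosen to land near the quoted figures), not GLM-4.7-Flash's actual config:

```python
def mha_kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Naive MHA path: cache full K and V for every head in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def mla_kv_bytes_per_token(n_layers, kv_lora_rank, rope_dim, dtype_bytes=2):
    """MLA path: cache only the compressed latent plus the decoupled RoPE key."""
    return n_layers * (kv_lora_rank + rope_dim) * dtype_bytes

# Hypothetical 48-layer config at bf16 (2 bytes per value):
mla = mla_kv_bytes_per_token(48, kv_lora_rank=512, rope_dim=64)  # 55,296 B ≈ 54 KB
mha = mha_kv_bytes_per_token(48, n_kv_heads=32, head_dim=128)    # 786,432 B ≈ 768 KB
print(mla, mha)
```

Under these assumed numbers the MLA cache is ~54 KB/token while materializing full per-head K/V is an order of magnitude larger, which is the shape of the vLLM observation above.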
- Quantization behavior and mitigations: Unsloth reported looping issues with quantized GLM-4.7-Flash and suggests setting --dry-multiplier 1.1, using higher-quality quants (UD-Q4_K_XL and above), and adding more tool-calling data during calibration (danielhanchen).
- Local throughput engineering: exo labs demonstrated GLM-4.7-Flash with tensor parallelism across 4 M4 Pro Mac Minis, using RDMA over Thunderbolt and an MLX backend, reaching ~100 tokens/sec with a target of ~200 tokens/sec (alexocheema).
- GLM ecosystem spillover: A lighter but notable signal: developers are already "one-shotting" small projects locally (e.g., a Mario game via Claude Code + Ollama running GLM-Flash) (nopmobiel). GLM-Image has also earned a spot on the image leaderboards (8th among open models in that snapshot) (arena).
Reasoning and training research: society of thought, multi-tokens, distillation, and compute allocation
- "Society of thought" as the mechanism behind reasoning traces: A widely shared Google AI paper argues that the gains of reasoning models (OpenAI o-series, DeepSeek-R1, QwQ) come not merely from "thinking longer" but from the emergence of internal debate patterns (questioning steps, exploring alternatives, diverging and converging) that measurably mediate the accuracy gains (a reported **20%+** advantage) (rohanpaul_ai).
- Multi-thinking (branch-and-merge tokens): The "multi-thinking" paper proposes sampling K tokens per step into a multi-token, adapted by uncertainty; confident steps behave like CoT, while uncertain steps represent multiple paths, achieving better results with shorter sequences (HuggingPapers, akhaliq).
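The adaptive-branching idea can be illustrated with a minimal sketch: branch wide when the next-token distribution is uncertain (high entropy), keep a single path when it is confident. The threshold and the distributions below are invented for illustration; this is not the paper's implementation:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def branch_width(probs, k_max=4, threshold=0.5):
    """How many tokens to sample at this step: 1 when confident
    (low entropy, CoT-like), up to k_max when uncertain."""
    return k_max if entropy(probs) > threshold else 1

confident = [0.97, 0.01, 0.01, 0.01]  # entropy ≈ 0.17 → single path
uncertain = [0.3, 0.3, 0.2, 0.2]      # entropy ≈ 1.37 → branch
print(branch_width(confident), branch_width(uncertain))
```

In the actual method the branched tokens are merged into one multi-token and the sequence continues; this sketch only shows the uncertainty gate.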
- Distillation via a logit/ranking loss: A practical distillation trick: instead of KL/SFT, train the student to preserve the teacher's token ordering via a ranking loss over token pairs mined from the teacher's top-K logits, framed as a concise PyTorch exercise and linked to DistillKit (cwolferesearch, cwolferesearch).
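The ranking-loss idea can be sketched without any framework. `teacher_rank_loss` is a hypothetical name, and the hinge/margin form is one common choice for such a pairwise loss, not necessarily the exact loss from the thread:

```python
def teacher_rank_loss(student_logits, teacher_logits, k=4, margin=1.0):
    """Hinge loss over token pairs mined from the teacher's top-k logits:
    penalize the student whenever it fails to keep the teacher's ordering
    by at least `margin`. A sketch of the idea, not DistillKit's code."""
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]
    loss, pairs = 0.0, 0
    for a in range(k):
        for b in range(a + 1, k):
            hi, lo = top[a], top[b]  # teacher ranks token hi above token lo
            loss += max(0.0, margin - (student_logits[hi] - student_logits[lo]))
            pairs += 1
    return loss / pairs
```

In training this would be applied per position over a batch (on tensors rather than lists) and typically combined with the usual cross-entropy or KD terms.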
- Synthetic reasoning data: "more samples, not bigger models": A summarized DeepMind result argues that under compute-matched sampling, smaller models can produce better synthetic reasoning data: cheaper models generate more attempts, improving coverage (+11%) and diversity (+86%), with reported training gains of up to 31.6% at the same inference budget (LiorOnAI).
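The coverage intuition is simple arithmetic: at a fixed sampling budget, a cheaper model buys more independent attempts, and coverage (the probability that at least one attempt is correct) grows as 1 - (1 - p)^n. The per-sample costs and success rates below are invented for illustration:

```python
def coverage(p_correct, budget, cost_per_sample):
    """Probability that at least one of n = budget // cost samples is correct,
    assuming independent attempts."""
    n = budget // cost_per_sample
    return 1 - (1 - p_correct) ** n

# Big model: stronger per sample but 8x the cost; small model: weaker but cheap.
big = coverage(0.6, budget=8, cost_per_sample=8)    # 1 attempt  -> 0.60
small = coverage(0.3, budget=8, cost_per_sample=1)  # 8 attempts -> ~0.94
print(big, small)
```

Even with half the per-attempt success rate, the cheaper model's extra attempts can dominate on coverage, which is the mechanism the summary points at.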
- RL compute-scaling guidance: Another thread on LLM RL claims that optimal compute allocation in LLM RL "scales predictably", aiming to provide the missing pretraining-scaling-law equivalent for RL fine-tuning budgets (ChengZhoujun).
- NanoGPT "speedrun" optimization: A notable hacker-style result: a new NanoGPT speedrun record of ~99.3 seconds, achieved by adding bigram-hash embeddings to the residual stream before every layer (inspired by Hash Embeddings and DeepSeek Engram), plus a provocative tokens/parameter ratio that departs from the Chinchilla prescription (classiclarryd).
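The lookup idea behind bigram-hash embeddings is small enough to sketch: each position gets an extra vector keyed by a hash of its (previous token, current token) pair, added to the residual stream. The function and table here are illustrative; the speedrun's actual implementation differs in detail:

```python
def bigram_hash_embedding(tokens, table, bos=0):
    """For each position, look up an extra embedding keyed by a hash of the
    (previous token, current token) pair. `table` is a list of vectors whose
    size bounds the number of hash buckets; names are illustrative."""
    out, prev = [], bos
    for tok in tokens:
        idx = hash((prev, tok)) % len(table)  # cheap bigram bucket
        out.append(table[idx])
        prev = tok
    return out

table = [[0.0] * 4 for _ in range(97)]  # 97 buckets of 4-dim vectors
embeds = bigram_hash_embedding([1, 2, 3], table)
```

Because the table is indexed by a hash rather than a learned vocabulary-squared matrix, it stays tiny while still giving every layer a direct signal about local token pairs.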
Agents in production: RLMs, trace analysis, "pragmatic agents", and agent frameworks
- Recursive Language Models (RLMs) as a compute/context-management tool: Several posts treat RLMs as a promising abstraction for long-running systems: not just "bigger context", but a way to manage compute, recursion, and selective reading (doesdatmaksense). A key advantage is symbolic recursion: the model can delegate many sub-read/edit operations without emitting every intermediate step as tokens, avoiding the context-window blowup typical of sub-agent prompting (lateinteraction, lateinteraction). (Mainstream coverage exists too, but the technical discussion centers on context economy and recursion.)
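The symbolic-recursion point can be made concrete with a toy sketch: a long document is handled by recursive sub-calls whose intermediate reads never enter the top-level context, only their short notes do. `llm` is a stand-in callable, not a real API:

```python
def recursive_read(llm, document, query, chunk=2000):
    """Answer `query` over `document` by recursively delegating sub-reads.
    Only the short per-chunk notes (not the chunks themselves) reach the
    final prompt, so top-level context stays small."""
    if len(document) <= chunk:
        return llm(f"Context:\n{document}\n\nQuestion: {query}")
    parts = [document[i:i + chunk] for i in range(0, len(document), chunk)]
    notes = [recursive_read(llm, p, query, chunk) for p in parts]
    return llm("Notes:\n" + "\n".join(notes) + f"\n\nQuestion: {query}")
```

Swapping in a real model client for `llm` turns this into the standard map-reduce-over-context pattern; the RLM framing adds that the model itself decides when and what to delegate.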
- Trace understanding becomes a first-order product need: LangChain argues that at 100K+ traces per day, traditional monitoring and manual log review no longer work; traces need clustering/pattern discovery via an "insights agent" (LangChain, hwchase17). A meta-lesson practitioners keep repeating: evals are like unit tests, useful but limited, while production traces surface the unknown unknowns (samecrowder).
- The agent "swarm fallacy" and structured execution: AI21 notes that parallel agents are only easy in read-only settings; once agents modify files or act in the real world, coordination/consistency becomes the hard problem, arguing for structured execution and test-time compute rather than simply "adding more agents" (AI21Labs).
- Framework/toolchain movement and interoperability: A batch of infrastructure/toolchain notes: Artificial Analysis updated Stirrup with browser use and Open Responses compatibility (a vendor-agnostic agent client) (ArtificialAnlys). CopilotKit added frontend middleware for LangChain "deep agents" (human-in-the-loop, generative UI, shared state), bringing agent backends into full-stack apps (CopilotKit). FastMCP shipped a major refactor supporting "next-generation MCP applications" (jlowin).
- The pragmatist take: "agents work if the codebase isn't a mess": A crisp production heuristic: AI coding tools amplify existing engineering discipline; teams with tests/docs get more leverage, while messy codebases get messier faster (svpino). Another observation from enterprise adoption: year-two-and-beyond buyers are re-evaluating ROI; "the worst engineers have the highest AI bills" and ship more bugs (TheEthanDing).
Small models & edge deployment: on-device reasoning, browser voice, OCR, and Jetson CLIP
- Liquid AI’s LFM2.5-1.2B-Thinking: Liquid releases an on-device reasoning model positioned around concise reasoning traces and ~900MB memory footprint (i.e., phone-class hardware), emphasizing tool use/math/instruction-following (liquidai, maximelabonne). Ollama quickly adds it to their model library for broad integration (ollama).
- Kyutai voice model in-browser: A notable “deployment feat” demo: running a ~100M parameter voice model in the browser with pure JavaScript + WebGPU (jax-js), highlighting low dependency friction and practical voice cloning flexibility (ekzhang1).
- OCR and document agents continue to get cheaper: LightOn releases a 1B OCR model under Apache-2.0, claiming strong speed/cost characteristics (e.g., “
GLM 4.7 Flash official support merged in llama.cpp (Activity: 477): The llama.cpp repository has merged support for the GLM 4.7 Flash model, specifically the Glm4MoeLiteForCausalLM, which is a renamed and restructured version of DeepseekV3. This integration was a community-driven effort, not directly from Z.ai developers, and it enhances the framework’s capabilities by incorporating references to Hugging Face’s GLM-4.7-Flash model. The model is available on Hugging Face. The community appreciates the quick integration into llama.cpp, noting it was faster than attempts with VLLm. There is also a clarification that the term ‘official’ refers to the model’s proper functionality within llama.cpp, not an endorsement by Z.ai.
- The integration of GLM 4.7 Flash into llama.cpp is a community-driven effort, not an official release by Z.ai developers. This highlights the collaborative nature of open-source projects where community contributions play a significant role in enhancing software capabilities.
- A user reported that using flash-attention with GLM 4.7 Flash on CUDA results in slower performance, suggesting that disabling flash-attention (-fa 0) can lead to a 3x speed improvement. This indicates potential performance issues with flash-attention in certain configurations, prompting users to experiment with settings for optimal performance.
- The model’s response time is criticized for being excessively slow, with one user noting that it takes several minutes to generate a simple response. This suggests potential inefficiencies in the model’s processing or implementation that could be addressed to improve usability.
Unsloth GLM 4.7-Flash GGUF (Activity: 314): The release of GLM-4.7-Flash GGUF on Hugging Face is accompanied by specific recommendations for optimal performance, such as using UD-Q4_K_XL quantization and specific parameters like --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 --dry-multiplier 1.1 to reduce repetition. Lower quantizations like UD-Q2_K_XL have been removed due to performance issues. The model still faces challenges, particularly with llama.cpp integration, where issues like segmentation faults and V cache quantization requirements are noted, despite the merging of PR #18936. The model is tested on high-end hardware (RTX 4090, 125 GB RAM) but remains unstable. There is a technical debate on the effectiveness of the --dry-multiplier parameter to reduce repetition, with suggestions to increase it to 1.5 if issues persist. Additionally, there is a consensus that the model’s stability is not fully resolved, despite improvements.
- danielhanchen provides specific configuration recommendations for using the GLM 4.7-Flash model, emphasizing the use of UD-Q4_K_XL and above quantizations. They suggest parameters like --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 --dry-multiplier 1.1 to reduce repetition, with a note to increase --dry-multiplier if issues persist. Lower quantizations like UD-Q2_K_XL are removed due to performance issues, and non-UD-Q versions are discouraged. More details are available in their documentation.
- bobeeeeeeeee8964 reports a critical issue with running GLM-4.7-Flash on llama.cpp (commit 6df686bee), specifically with V cache quantization requiring flash_attn, which contradicts the model’s requirement to disable flash_attn to avoid CPU fallback. This results in segmentation faults and instability, even after PR #18936. Tests with various configurations, including self-converted Q8_0 and evilfreelancer IQ4_XS, result in crashes or garbled output, indicating unresolved compatibility issues.
- danielhanchen acknowledges ongoing issues with looping in quantized versions of the model, suggesting BF16 for optimal results until fixes are finalized. This aligns with SM8085’s announcement of the BF16 release, which is expected to improve stability and performance.
zai-org/GLM-4.7-Flash · Hugging Face (Activity: 1169): GLM-4.7-Flash is a 30B-A3B Mixture of Experts (MoE) model released by zai-org on Hugging Face. It is optimized for efficient deployment, leveraging MLA to minimize KV cache memory usage, allowing many users to run it at the full 200k context length. The model demonstrates superior performance on benchmarks like AIME and GPQA and supports local inference through frameworks such as vLLM and SGLang. Detailed installation and evaluation instructions are provided to ensure optimal performance. Commenters express enthusiasm for the model’s efficiency and memory management, particularly appreciating the ability to run it at full context length due to its low memory footprint. There is also a sentiment of anticipation for larger models, such as 70B, indicating a demand for even more powerful models.
- The GLM-4.7-Flash model utilizes MLA (Multi-head Latent Attention), which significantly reduces the memory footprint of the KV cache by caching a compressed latent instead of full per-head keys and values. This optimization allows many users to run the model at its full 200k context length, making it more accessible for those with limited hardware resources.
- A user highlights the model’s architecture, noting a discrepancy in the model’s description as a ‘30b’ model, which actually refers to a ‘3B thinking model’ as per the code reference in the Hugging Face Transformers repository. This suggests a potential misunderstanding or mislabeling in the model’s specifications.
- There is a desire for performance comparisons with larger models, as one user mentions the lack of direct benchmarks against much larger models, which would provide clearer insights into the model’s relative performance and capabilities.
2. Deepseek Model and System Builds
- 768Gb Fully Enclosed 10x GPU Mobile AI Build (Activity: 903): The post describes a custom-built mobile AI system designed for running large Mixture of Experts (MoE) models like Deepseek and Kimi K2, as well as for high-detail image and video generation. The system features a Threadripper Pro 3995WX CPU, 512GB DDR4 RAM, and a combination of 8x RTX 3090 and 2x RTX 5090 GPUs, housed in a Thermaltake Core W200 case. The build prioritizes mobility and enclosure, using a dual-system case to accommodate the GPUs with risers, and is powered by EVGA 1600W and Asrock 1300W PSUs. Benchmarks show impressive token generation rates, such as 31.54 tokens per second for the Qwen 235b model. The system’s total cost was approximately $17,000, with a focus on balancing performance and budget constraints.
- It’s been one year since the release of Deepseek-R1 (Activity: 364): The image marks the one-year anniversary of the release of DeepSeek-R1, a model that reportedly performs on par with OpenAI-o1. The model is fully open-source, with both the code and models available under the MIT License, allowing free use and modification. The announcement highlights the availability of a live website and API for users to interact with the model at chat.deepseek.com. The image also includes a snippet of a chat interface, suggesting practical applications of the model in problem-solving scenarios. Comments reflect on the impact of DeepSeek-R1, suggesting it significantly influenced the AI landscape by forcing competitors to adapt, such as by reducing prices and increasing transparency in reasoning outputs. The release is considered a pivotal moment in AI development, second only to the original LLaMA release.
- Cuplike highlights the impact of Deepseek-R1 on the AI landscape, noting that it forced competitors to lower prices and reveal reasoning outputs. This suggests that Deepseek-R1 set a new standard in transparency and cost-effectiveness, making it a pivotal release in AI history, second only to the original LLaMA model.
- SubstantialSock8002 raises an interesting point about the progress in AI models by questioning which smaller models currently match the performance of Deepseek-R1 and their sizes. This inquiry suggests a focus on efficiency and the evolution of model capabilities over time, indicating a trend towards more compact yet powerful models.
- Lan_BobPage comments on the significant impact of Deepseek-R1 on major tech companies, specifically mentioning how it led to strategic shifts at Meta. This underscores the model’s disruptive influence, causing major players to reassess their AI strategies and operations.
768Gb Fully Enclosed 10x GPU Mobile AI Build (Activity: 195): The post details a custom-built mobile AI system designed for running large Mixture of Experts (MoE) models like Deepseek and Kimi K2, as well as for high-detail image and video generation. The system features a Threadripper Pro 3995WX CPU, 512GB DDR4 RAM, and a combination of 8x RTX 3090 and 2x RTX 5090 GPUs, housed in a Thermaltake Core W200 case. The build is powered by EVGA 1600W and Asrock 1300W PSUs, running on Ubuntu. The system’s design prioritizes mobility and enclosure, using the W200 case to avoid the aesthetic and structural issues of mining frames. Benchmarks show impressive token generation rates, e.g., 24.92 tps for Deepseek V3.1 and 31.54 tps for Qwen 235b, with the system maintaining good airflow and acoustics despite its high power and density. Commenters raised concerns about the power requirements, questioning if the PSUs are run on separate circuits due to the high power draw of the system. This highlights the practical challenges of operating such a high-performance build in a typical residential setting.
3. AI Hardware and System Configuration
- LLM Sovereignty For 3 Years. (Activity: 101): The user is seeking advice on setting up a local environment to run Large Language Models (LLMs) for the next three years with a budget of approximately $10,000. Concerns include rising compute costs, increasing cloud service prices, and potential censorship. Suggestions include purchasing an Apple M3 Ultra with 80 GPU cores and 512 GB of memory, which may outperform traditional GPU cards in some tasks. Another recommendation is a setup with 128 GB RAM and a Ryzen AI 395 or Mac for a balanced start. Additionally, investing in a tower with an RTX GPU and 128 GB DDR RAM is advised for a robust local setup. There is a consensus that while local AI setups are improving, they still cannot fully compete with cloud AI, which utilizes multiple $50k GPUs and models with hundreds of billions of parameters. However, a local setup with sufficient RAM and GPU capabilities is considered a solid starting point for personal use.
- Caprichoso1 highlights the potential of the Apple M3 Ultra with 80 GPU cores and 512 GB of memory, priced under $10k. This setup may outperform traditional GPU cards in certain tasks due to its extensive memory, though GPU cards might excel in others, emphasizing the importance of task-specific hardware selection.
- TheAussieWatchGuy contrasts cloud AI, which utilizes multiple $50k GPUs and handles hundreds of billions of parameters, with local AI setups. They suggest that while local AI is improving, it remains limited compared to cloud solutions. A local setup with 128GB of RAM, such as a RyzenAI 395 or Mac, is recommended as a solid starting point for those exploring local AI capabilities.
- Vegetable-Score-3915 discusses the feasibility of using second-hand workstations for AI inference tasks. They note that PCIe count is less critical for inference, suggesting that a workstation with PCIe 3 x 16 slots and DDR4 ECC RAM (32GB or 64GB) can be cost-effective. This approach allows for gradual upgrades, such as adding more GPUs, without the immediate need for PCIe4 or PCIe5 slots.
Can I add a second GPU to use it’s vram in addition of the vram of my main GPU to load bigger models? (Activity: 44): The user inquires about combining VRAM from multiple GPUs to load larger models, specifically using a 5070 Ti 16GB with a potential second GPU like a 24GB RTX 3090 or a 16GB RTX 5060 Ti. The consensus is that VRAM cannot be directly combined across GPUs for a single model, but multiple GPUs can be used for parallel processing. The RTX 3090 is recommended over the 5060 Ti due to its 24GB VRAM and higher memory bandwidth, which are crucial for AI tasks. The 3090 is noted for its superior performance in AI workloads despite lacking newer features like fp8 or nvfp4 support. The 5070 Ti is comparable to the 3090 in compute power but has less VRAM, making the 3090 a better choice for larger models. Commenters suggest that for AI tasks, more VRAM is generally better, and the RTX 3090 offers the best value despite being older. Some recommend selling the 5070 Ti to invest in multiple 3090s for increased VRAM capacity. The trade-off between using multiple GPUs for faster processing versus a unified memory system for larger models is also discussed.
- The discussion highlights the advantages of the RTX 3090 over the 5060Ti for AI model inference, particularly due to its higher VRAM and memory bandwidth. The 3090 offers 50% more VRAM and 100% more memory bandwidth, which is crucial for loading larger models and ensuring efficient compute access. The lack of native support for formats like fp8 or nvfp4 in Ampere is noted, but the 3090’s overall performance benefits outweigh these limitations for most users.
- For large language model (LLM) inference, the RTX 3090 is considered superior due to its 24GB VRAM, which is essential for running larger models. Tools like llama.cpp and LM Studio are mentioned as being compatible with multi-GPU setups, enhancing their utility. The comment also suggests that while GPUs provide better tokens per second, systems with high unified memory, like those with Ryzen AI 395 and 128GB+ DDR5, can run larger models albeit with slower token output.
- The feasibility of using multiple GPUs, such as the 5060Ti, is discussed in terms of cost-effectiveness and availability. While a single RTX 3090 with 24GB VRAM is priced around $850, two 5060Tis with a combined 32GB VRAM could theoretically match this price point, assuming availability. However, the 3090 is still favored for its superior value and performance, despite being an older model.
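Since VRAM cannot be pooled into one address space, inference frameworks instead split the model across cards, for example assigning consecutive layers to each GPU until its memory budget is filled (roughly what llama.cpp's layer/tensor-split options do at a high level). A minimal greedy sketch with made-up layer sizes:

```python
def split_layers(layer_bytes, gpu_capacities):
    """Greedily assign consecutive layers to GPUs in order; each GPU holds
    as many layers as fit in its capacity. Raises if the model doesn't fit.
    Sizes and capacities are in the same unit (e.g. GB)."""
    plan = [[] for _ in gpu_capacities]
    gpu, used = 0, 0
    for layer, size in enumerate(layer_bytes):
        while gpu < len(gpu_capacities) and used + size > gpu_capacities[gpu]:
            gpu, used = gpu + 1, 0  # current card full; move to the next one
        if gpu == len(gpu_capacities):
            raise MemoryError("model does not fit on the given GPUs")
        plan[gpu].append(layer)
        used += size
    return plan

# Hypothetical: six 4 GB layers across a 16 GB card and a 24 GB card.
print(split_layers([4] * 6, [16, 24]))  # -> [[0, 1, 2, 3], [4, 5]]
```

This is why a mixed 16 GB + 24 GB pair can load a model neither card holds alone, while each token still flows through both cards sequentially rather than the VRAM acting as one unified pool.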
AMD Ryzen AI Halo for AI Developers (Activity: 72): The post discusses the AMD Ryzen AI Halo, highlighting its potential to challenge NVIDIA’s dominance in AI hardware. However, technical issues with AMD’s ROCm drivers are a significant barrier, as they are described as unreliable and difficult to work with, especially on Linux. The post criticizes AMD’s claims of optimized applications and full ROCm support, noting that many features, such as FP8 support and integrated NPU, are not functioning as advertised. The only feature that reportedly works as intended is the 128GB unified memory for large AI models. Commenters express skepticism about AMD’s ability to compete with NVIDIA, citing the poor state of ROCm drivers and lack of reliable support for AI workloads. There is a consensus that AMD’s software support is inadequate, with some users having to manually compile and fix issues themselves.
- A significant issue highlighted is the lack of robust ROCm driver support for AMD hardware, particularly for AI development. Users report that the drivers are unreliable, with one user mentioning they had to compile raw GitHub code and reimplement closed components to make it functional. This suggests a gap between AMD’s claims of optimized applications and the reality of their software support, especially on Linux.
- There is criticism regarding AMD’s claims of ‘Day-0 Support for leading AI Models.’ Users report that certain operations, such as using fp8, are not supported internally by ROCm, forcing them to use alternatives like bf16. This indicates a discrepancy between AMD’s marketing and the actual capabilities of their hardware and software stack.
- Despite the criticisms, one feature that reportedly works as advertised is the ‘Up to 128GB unified memory for running large generative AI models.’ This suggests that while there are significant software support issues, some hardware capabilities are being effectively utilized.
dev here - has anyone thought on training a model on your own codebase? (Activity: 42): A Laravel developer is experimenting with training a model on their own codebase using a 5060 16GB setup and the Qwen2.5 Coder model. The developer plans to use older branches of their codebase and iterate over them incrementally. This approach is intended to explore the potential benefits of customizing a model specifically for their codebase. Commenters suggest that using a more modern model like Qwen3-Coder or Devstral-2 would yield better results, as Qwen2.5 Coder is considered outdated. They also recommend using Retrieval-Augmented Generation (RAG) or codebase indexing features from tools like Roo/Kilo Code for more effective results.
- iMrParker suggests using Retrieval-Augmented Generation (RAG) instead of training a model on your own codebase for creating a promptable knowledge base. RAG can efficiently handle large datasets by retrieving relevant information, which might be more effective than fine-tuning a model on a specific codebase.
- noctrex recommends using more modern models like Qwen3-Coder or Devstral-2 for better results, as older models may be limited. They also suggest using RAG or the Codebase Indexing feature from Roo/Kilo Code, which can provide more efficient and accurate codebase management and querying.
- HonestoJago proposes an alternative approach to fine-tuning by training a model on pairs of questions and answers that reflect the developer’s coding style and techniques. This method could potentially personalize the model’s responses, although it might risk overfitting or breaking the model. They mention that tools like Unsloth make fine-tuning more accessible and quicker.
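The RAG alternative the commenters recommend boils down to: index code snippets, retrieve the most similar ones for each question, and paste them into the prompt instead of baking them into the weights. A dependency-free bag-of-words cosine sketch (real setups use embedding models and a vector store, and the snippets here are invented):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, snippets, top_n=2):
    """Rank snippets by bag-of-words similarity to the query."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(s.lower().split())), s) for s in snippets]
    return [s for score, s in sorted(scored, reverse=True)[:top_n] if score > 0]

docs = ["def parse_invoice(path): ...",
        "class UserController handles login and sessions",
        "invoice totals are computed in billing.py"]
print(retrieve("how are invoice totals computed", docs, top_n=1))
```

Because retrieval happens at query time, the "knowledge base" tracks the codebase as it changes, which is the main practical advantage over fine-tuning on stale branches.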
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo
1. Claude Code and AI Coding Tools
- Microsoft pauses Claude Code rollout after Satya intervention (Activity: 1367): Microsoft has paused the deployment of Claude Code internally after intervention from CEO Satya Nadella and senior leadership, redirecting employees to use GitHub Copilot instead. The internal communication suggests that Copilot has “mostly closed the gaps” with Claude Code. However, exceptions are made for “high-priority R&D” projects, which can still access the Anthropic API with proper justification. Existing users retain access, but new invitations have been rescinded. Commenters express skepticism about Microsoft’s claim that Copilot has “closed the gap” with Claude Code, suggesting it may be a strategic move to improve their own product by forcing internal use. Some find it notable that Microsoft admitted to using a competitor’s tool over their own.
- DestroyAllBacteria highlights the strategic importance of Microsoft using its own products, like Copilot, to improve them. This approach, often referred to as ‘eating their own dog food,’ can lead to better product development and a more competitive landscape. By focusing on internal tools, Microsoft can potentially enhance the quality and capabilities of Copilot, making it a stronger competitor in the AI space.
- Inside-Yak-8815 points out the surprising admission by Microsoft that they were using Claude Code instead of their own tools. This revelation suggests that Claude Code might have had superior features or performance that Microsoft found valuable, which could be a driving factor for them to improve their own offerings like Copilot.
- Foreign_Coat_7817 suggests using Sonnet through GitHub Copilot as an alternative, indicating that there are multiple ways to leverage AI tools within Microsoft’s ecosystem. This comment implies that while Claude Code might be paused, there are still robust options available for developers within the Microsoft suite.
Tried Claude Cowork last night, and it was a top 3 most exciting moments I’ve ever had with technology. (Activity: 483): The post describes a user’s experience with Claude Cowork, a tool that appears to enhance the functionality of Claude Code by leveraging internet search capabilities to solve complex problems. The user highlights that Cowork demonstrated superior common sense compared to Claude Code, particularly in identifying and correcting errors in a project related to building a ‘wispr flow app’. The user attributes Cowork’s effectiveness to its ability to search the internet more efficiently, suggesting it retains more information than Claude Code, which relies on MCPs (Model Checkpoints). One commenter questions the necessity of Cowork given that Claude Code can already search the internet, while another expresses skepticism about the user’s claims, suggesting they might be experiencing ‘AI psychosis’. A third commenter reports difficulty in getting Cowork to access certain features, indicating potential limitations in its integration with Claude Code.
- Prize-Individual4729 highlights a technical limitation of Claude Cowork, noting that attempts to access the Claude Code terminal or Code tab in Claude for Mac were unsuccessful due to the sandbox/VM restrictions. This suggests that certain functionalities are isolated and not directly accessible, which could impact workflows that rely on integrated development environments.
- deific_ provides a perspective on the utility of Claude Cowork, emphasizing its ability to produce polished products despite not adhering to ‘perfect Sr Dev codebase’ standards. They argue that in corporate environments, the focus is often on delivering useful products rather than perfect ones, and Claude Cowork’s auditing capabilities contribute to this goal. This reflects a broader discussion on the balance between code quality and practical utility in software development.
has anyone tried Claude Code with local model? Ollama just drop an official support (Activity: 421): The post discusses the integration of Claude Code with local models, specifically mentioning Ollama’s official support for this setup. The image shows a coding interface for creating a simple HTML website, indicating the potential for using Claude Code in local development tasks. The post highlights the use of GLM 4.7 flash 30B for small tasks, suggesting that this setup could allow for unlimited iterations without usage limits. A key point from the comments is the comparison between local models and cloud-based models like Claude and GPT, noting that local models require more explicit instructions and prompt engineering. The comments also discuss the performance of models based on VRAM availability, suggesting that at least 24GB of VRAM is needed for effective tool calls and context management. Commenters suggest that while Claude Code can be useful for initial prompt building, local models require more detailed instructions and context management compared to cloud models. They also recommend using llamacpp for better performance and control over model selection, advising against using Ollama models for high-intelligence tasks.
- Prof_ChaosGeography discusses using Claude with local models via a llamacpp server and a litellm proxy. They emphasize that local models, especially those from Ollama, don’t match the intelligence of cloud-based Claude or GPT models. They recommend using llamacpp for better performance and control over model selection and quantization, advising not to go below q6 for monitoring and q8 for autonomous operation. They also highlight the need for explicit instructions and effective prompt engineering when using non-Anthropic and non-OpenAI models.
- onil34 points out the limitations of models with different VRAM capacities. They note that models with 8GB VRAM struggle with tool calls, while 16GB models perform better but have limited context windows (4k). They suggest that at least 24GB of VRAM is necessary for optimal performance, indicating the trade-offs between VRAM capacity and model capabilities.
- SatoshiNotMe shares their experience using ~30B models with Claude Code via llama-server on an M1 MacBook Pro Max with 64GB RAM. They report good performance in terms of TPS and work quality, particularly for sensitive document work. They provide a guide for running local LLMs like Qwen3, Nemotron, and GPT-OSS with Claude Code, and mention settling on Qwen3-30B-A3B without exhaustive comparison.
Are we sure this is 100% allowed by Anthropic? (Activity: 313): The image and post discuss the integration of Ollama with Anthropic’s Claude messages API, allowing users to utilize Claude code with open-source models. This setup supports advanced features like agentic loops, tool use, and coding workflows powered by private LLMs. The comments clarify that this functionality is similar to how large corporations use proxy layers to access Claude on platforms like Amazon Bedrock. Anthropic’s main restriction is against using their APIs for unlimited access under fixed-price plans, not against using their harness with other LLMs. The official documentation supports using gateways to other LLMs, indicating that this practice is legitimate. Commenters agree that using Anthropic’s harness with other LLMs is legitimate, as long as it doesn’t involve exploiting fixed-price subscription plans. The official documentation from Anthropic supports this use case, and Ollama’s recent support for this integration further legitimizes it.
- The use of Claude Code through proxy layers to access services like Amazon Bedrock is a common practice among large corporations, and Anthropic has limited means to detect if their tool is being used with a non-Anthropic model. The main restriction is on using non-Claude Code harnesses to access models on Pro/MAX plans, which is not allowed by Anthropic.
- Anthropic provides documentation on using gateways to other LLMs, indicating that they permit the use of their harness with other LLMs. The primary restriction is against using Claude LLM APIs with fixed-price monthly subscriptions, which led to the OpenCode controversy. This suggests that while using the API is allowed, it must adhere to Anthropic’s acceptable use terms.
- The recent concern about Claude Code/OpenCode was related to the use of Claude subscriptions in third-party tools. API key-based calls have always been functional across platforms, and the introduction of support by Ollama is not a new development. Users must still comply with Anthropic’s acceptable use terms, which prohibit activities like building competing products or exfiltrating data for model training.
[P] I Gave Claude Code 9.5 Years of Health Data to Help Manage My Thyroid Disease (Activity: 207): The user utilized Claude, an AI model, to analyze 9.5 years of personal health data from Apple Watch and Whoop to manage episodic Graves’ disease. By employing XGBoost after testing various ML models, the user achieved approximately 98% validation accuracy in predicting disease phases, providing alerts 3-4 weeks before symptom onset. This model was backtested successfully, predicting an episode weeks before lab confirmation. The user developed an iOS app for ongoing monitoring and open-sourced the project, including the Claude code setup, on Medium. Comments raised concerns about potential data leakage due to the high accuracy rate, suggesting the need for out-of-time testing to validate predictive utility. Additionally, there was skepticism about sharing medical data with Anthropic.
- Stereoisomer raises a critical point about the reported 98% accuracy in the predictive model for managing thyroid disease, suggesting the possibility of data leakage. Data leakage occurs when the model has access to information during training that it wouldn’t have in a real-world scenario, leading to overly optimistic performance metrics. This highlights the importance of ensuring that the model’s training and testing datasets are properly separated to avoid such issues.
- GreatBigBagOfNope emphasizes the importance of out-of-time testing for evaluating the predictive utility of the model. While backtesting can provide insights into past performance, real-world effectiveness is best assessed through continuous, real-time testing. This approach helps in understanding how well the model adapts to new, unseen data, which is crucial for its practical application in managing health conditions.
- grimmwerks shares a personal experience with Hashimoto’s disease and related symptoms, noting a potential link between sugar intake and inflammation. This anecdotal evidence suggests that personalized data-driven approaches, like the one discussed in the post, could be valuable for managing complex health conditions by identifying individual triggers and patterns.
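The leakage concern in the comments above comes down to evaluation design: a random train/test split lets the model train on records that come *after* the events it is asked to predict. A minimal sketch of the difference between a random split and the out-of-time split the commenters ask for (hypothetical data, not the poster's actual pipeline):

```python
import random

# Hypothetical daily records: (day_index, feature_1, feature_2, label)
records = [(day, day * 0.1, day % 7, int(day > 600)) for day in range(1000)]

# Random split: training rows can come from AFTER test rows -> leakage risk.
shuffled = records[:]
random.Random(0).shuffle(shuffled)
random_train, random_test = shuffled[:800], shuffled[800:]

# Out-of-time split: train strictly on the past, test on the future.
records.sort(key=lambda r: r[0])          # order by time
cutoff = int(len(records) * 0.8)
oot_train, oot_test = records[:cutoff], records[cutoff:]

# Only the out-of-time split guarantees every training day precedes
# every test day; the random split interleaves past and future.
assert max(r[0] for r in oot_train) < min(r[0] for r in oot_test)
assert max(r[0] for r in random_train) > min(r[0] for r in random_test)
```

Any model (XGBoost included) evaluated on the random split can look far better than it will perform prospectively; the out-of-time split is what "alerts 3-4 weeks before symptom onset" actually requires.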
The creator of Node.js says the era of writing code is over (Activity: 309): Ryan Dahl, the creator of Node.js, has suggested that the traditional era of writing code is ending, indicating a shift towards AI-driven development. This perspective is shared by other prominent figures like Karpathy and Stroustrup, who foresee a future where software engineering focuses more on problem-solving rather than manual coding. The discussion highlights the potential for AI to automate many coding tasks, fundamentally changing the skills required in the industry. For more details, see the original article. Comments reflect a divide between coders and engineers, emphasizing that engineering is about problem-solving, not just coding. There’s also a recognition that many companies lag in AI adoption due to security and policy constraints, limiting the use of advanced AI tools in corporate environments.
- MR_PRESIDENT__ highlights the lag in AI adoption within large corporations, noting that many are 4-5 years behind current AI capabilities. This delay is attributed to stringent security and responsibility protocols, which restrict the use of advanced tools like CLI tools, MCP servers, and AI models such as Claude Code. The commenter contrasts this with the more advanced capabilities available to individuals outside these corporate environments, suggesting a significant gap in AI utilization between personal and corporate settings.
2. Gemini and Google AI Developments
- Rumors of Gemini 3 PRO GA being “far better”, “like 3.5” (Activity: 657): The image discusses rumors about a new version of Google’s AI model, referred to as “Gemini 3 PRO GA,” which is reportedly undergoing A/B testing in an AI studio. This version is rumored to be significantly improved, potentially comparable to a hypothetical version 3.5. The community post suggests that the current 3.0 model has a strong base intelligence but lacks fine-tuning, indicating that the new version might address these issues. The term “GA” is questioned in the comments, possibly referring to “General Availability.” Commenters express skepticism about the new version’s capabilities, noting that the current model makes frequent typos in coding tasks and suggesting that significant improvements are needed for it to surpass existing models like Opus.
- Gemini integration into Chrome browser is just too darn good and useful (Activity: 178): The image illustrates the integration of the Gemini tool into the Chrome browser, which enhances the browsing experience by providing real-time context and information about media content being viewed. This feature allows users to gain additional insights and background information on videos or images they are watching, directly within the browser. The tool is particularly noted for its ability to offer context that users might not initially be aware of, thereby enriching their understanding and engagement with the content. Commenters express a desire for the Gemini integration to be available outside the US, highlighting its potential utility in other regions. There is also curiosity about how to activate this feature, indicating interest in its practical application.
- Even Gemini 3 Pro is acting stupid lately (Activity: 54): The user reports issues with the Gemini 3 Pro model, specifically its tendency to generate unwanted images and videos, despite being on the Ultra tier for higher quality. The model appears to misinterpret user requests, such as creating a storyboard when only ideas were solicited. This suggests potential flaws in the model’s prompt interpretation or execution logic, possibly due to an overzealous attempt to anticipate user needs. The user suggests a rule change to ensure the model only creates content explicitly requested by the user. One commenter speculates that a new model is in development, which may address these issues. Another suggests that the model’s behavior is due to its design to fulfill the ‘ultimate objective’ of a task, implying a need for clearer user instructions or model adjustments.
- Gemini Live preps big upgrades with ‘Thinking Mode’ and ‘Experimental Features’ (Activity: 170): Google is preparing to enhance its Gemini Live app with new features like ‘Thinking Mode’ and ‘Experimental Features’ as part of its ‘Labs’ initiative. These features, expected to be powered by the upcoming Gemini 3 model, include ‘Live Thinking Mode’ for more detailed responses and ‘Live Experimental Features’ such as multimodal memory, improved noise handling, and personalized results. The app currently runs on Gemini 2.5 Flash, but the new updates suggest a shift to Gemini 3. Additionally, features like ‘UI Control’ and ‘Deep Research’ are being developed, potentially integrating with Android’s ‘Computer Use’. There is a technical debate on the availability of these features, with some users speculating they might be limited to the United States. The community is also intrigued by the potential of ‘Agent controls phone to complete tasks’ and improved noise handling.
- The introduction of ‘Live Thinking Mode’ in Gemini 3 Pro is designed to enhance the AI’s response quality by allowing it more time to process and generate detailed answers. This feature is part of Google’s ‘Labs’ initiative, which lets users test upcoming functionalities. The mode may utilize either the Thinking or Pro models to achieve these detailed responses, indicating a potential shift towards more sophisticated AI processing capabilities.
- The ‘Live Experimental Features’ in Gemini 3 Pro include advancements like multimodal memory and improved noise handling. These features aim to enhance the AI’s interaction by integrating data from various Google apps to provide personalized results. The mention of ‘responding when it sees something’ suggests a visual recognition capability, possibly linked to Project Astra, which could significantly improve context-aware responses.
- Gemini 3 Pro’s ‘UI Control’ feature allows the AI agent to control the phone to complete tasks, indicating a move towards more integrated and autonomous device management. This aligns with the broader trend of AI systems taking on more complex roles, such as ‘Deep Research,’ which involves delegating intricate research tasks, potentially transforming how users interact with their devices for productivity.
BabyVision: A New Benchmark for Human-Level Visual Reasoning (Activity: 574): The image presents a bar chart from the BabyVision-Mini benchmark, which evaluates the visual reasoning capabilities of large language models (LLMs) compared to humans of various ages. The chart highlights that human performance, particularly that of 12-year-olds, surpasses that of LLMs, with the Gemini3-Pro-Preview model achieving the highest accuracy among the LLMs. This benchmark underscores the current limitations of LLMs in visual reasoning tasks, suggesting that advancements in multi-modal pretraining and reinforcement learning could enhance their performance in the future. Commenters note the potential for future improvements in LLMs’ visual reasoning through scaling multi-modal pretraining and reinforcement learning, which could significantly benefit fields like robotics.
- The discussion highlights that current models are still limited in visual reasoning, which is a significant challenge for achieving ARC AGI. The commenter suggests that scaling multi-modal pretraining and reinforcement learning (RL) for vision tasks could improve performance to near 100% in the future, unlocking new applications, particularly in robotics.
- The commenter references a specific paper on arXiv, which likely provides detailed insights or data related to the benchmark or model performance discussed in the post. This suggests that the community is actively engaging with academic research to understand and improve visual reasoning capabilities in AI models.
The Thinking Game documentary is sitting at 305M views on Youtube in less than 2 months. Ridiculous numbers. (Activity: 545): The image highlights the extraordinary viewership of “The Thinking Game,” a documentary by Google DeepMind that has reached over 305 million views on YouTube in less than two months. This documentary, an official selection of the Tribeca Film Festival, explores an AI breakthrough that won a Nobel Prize, reflecting the growing public interest in AI topics. The rapid accumulation of views is contrasted with the earlier AlphaGo documentary, which has 37 million views over six years, indicating a significant increase in public engagement with AI content. The documentary’s focus is noted to be more on human endeavor than the technology itself, which has resonated with viewers. There is skepticism about the authenticity of the view count, as the ratio of views to likes suggests possible artificial inflation. Typically, a video with such high viewership would have millions of likes, but this video has only 190K likes, leading to speculation about the use of bots.
- The documentary ‘The Thinking Game’ has achieved over 305 million views on YouTube in less than two months, which is significantly higher than the 37 million views of the ‘AlphaGo’ documentary released in 2020. This rapid accumulation of views suggests a growing public interest in AI-related content. However, some users suspect that the view count may be artificially inflated due to the disproportionate number of likes (190K) and comments (4000) compared to typical engagement metrics for videos with similar view counts.
- There is skepticism about the authenticity of the view count for ‘The Thinking Game’ documentary. A typical video with over 300 million views would generally have millions of likes, yet this video only has 190K likes, suggesting potential use of bots to inflate views. The expected ratio of likes to views is approximately 1:100, indicating that the current engagement does not align with organic growth patterns.
- One user noted an unusual pattern in YouTube’s recommendation algorithm, stating that ‘The Thinking Game’ was persistently suggested on their homepage and sidebar for two weeks, which is atypical for YouTube’s recommendation system. This could imply an aggressive promotion strategy or algorithmic anomaly contributing to the high view count.
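The bot-suspicion argument above is simple ratio arithmetic on the quoted numbers (the 1:100 likes-to-views baseline is the commenters' rule of thumb, not an official YouTube statistic):

```python
views = 305_000_000
likes = 190_000

expected_likes = views / 100   # commenters' rough 1:100 baseline -> ~3M likes
actual_ratio = views / likes   # views per like

print(f"expected likes at 1:100 ratio: {expected_likes:,.0f}")
print(f"actual views per like:         {actual_ratio:,.0f}")
```

At roughly 1,600 views per like versus an expected ~100, the engagement is an order of magnitude below what the commenters consider organic, which is the whole basis of the inflation claim.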
3. DeepSeek AI Impact and Developments
- One Year Since the “DeepSeek Moment”: The Impact is Still Real. (Activity: 204): The “DeepSeek Moment” marks the anniversary of the release of DeepSeek-R1, a significant reasoning model that has influenced the AI industry by emphasizing reasoning as a core capability, promoting efficient training methods, and encouraging the development of smaller, smarter models. This release has also led to broader adoption in emerging markets and a shift towards modular, tool-aware AI systems. The impact of DeepSeek-R1 is seen as a pivotal change in the industry, comparable to major releases from other leading AI companies. Commenters highlight that DeepSeek’s impact was not about surpassing competitors like OpenAI but demonstrating capability, especially from a non-Western entity. Some users express disappointment with the transition from R1 to the MoE model, preferring open-source alternatives. Others note DeepSeek’s contributions to fine-grained sparsity and RLVR, suggesting its techniques may become standard in the industry.
- DeepSeek’s release was a significant event in the AI landscape, challenging the dominance of Western LLMs by demonstrating China’s capability in this field. The initial model, R1, was impactful, but the transition to a Mixture of Experts (MoE) model was seen as a downgrade by some users due to slower updates and less appealing performance for specific use cases. This shift led some users to prefer open-source alternatives, which they find more aligned with their needs and values.
- DeepSeek’s major contributions include advancing fine-grained sparsity techniques, particularly with its V3 model and predecessors, and introducing a straightforward method for achieving Reinforcement Learning with Verifiable Rewards (RLVR) through the GRPO algorithm. These innovations have influenced the broader AI community, with DeepSeek’s Sparse Attention potentially becoming a standard approach, similar to how Multi-head Latent Attention (MLA) has been widely adopted in open models.
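GRPO's core trick is replacing a learned value baseline with a group-relative one: sample several completions per prompt, score each with a verifiable reward, and normalize every reward against its own group. A minimal sketch of that advantage computation (an illustration of the idea, not DeepSeek's actual implementation):

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - mean) / std over one prompt's samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled completions for one prompt, scored by a verifiable reward
# (e.g. 1.0 if the final answer checks out, 0.0 otherwise).
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # correct samples score positive, incorrect negative
```

Because the baseline is just the group mean, no critic network is needed, which is why the approach reads as "straightforward" relative to PPO-style RLHF.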
The Race to Build the DeepSeek of Europe Is On (Activity: 181): The article discusses Europe’s strategic push to develop its own AI capabilities, aiming to reduce dependency on US technologies and establish technological sovereignty. This initiative is partly inspired by China’s success with DeepSeek and involves significant government investment and open collaboration among European AI labs. Key players include DeepMind in the UK and Mistral in France, highlighting a competitive landscape as Europe seeks to become an AI superpower. The effort underscores AI’s role as critical infrastructure, necessitating a shift towards self-sufficiency in the sector. Read more. Commenters express skepticism about Europe’s ability to compete with US AI firms, citing regulatory and taxation challenges. There is also a sentiment that European governments’ demands on companies, such as producing affordable electric cars, may hinder AI innovation.
- The discussion highlights the strategic importance of Europe developing its own AI capabilities, particularly in light of its changing relationship with the US. The urgency for Europe to become a self-sufficient AI superpower is underscored by the need to reduce dependency on US-based technologies, as detailed in the Wired article.
- The comment by No_You3985 points out the significant contributions of European-born scientists to major AI advancements, such as OpenAI’s GPT models. This underscores the potential talent pool within Europe that could be leveraged if these individuals were incentivized to return and contribute to European AI initiatives.
- Rojeitor’s comment critiques the regulatory and economic environment in Europe, suggesting that over-regulation and high taxation could hinder the development of competitive AI technologies. This reflects a broader concern about the balance between regulation and innovation in the tech industry.
What do you mainly use DeepSeek for? (Activity: 49): DeepSeek is primarily utilized for tasks such as development and architectural analysis of applications, as well as generating documentation, leveraging its capabilities through a paid API. Users also explore its performance in areas like math and statistics, and engage it in more casual interactions such as discussing life topics and recipes. The model is noted for its versatility in handling diverse tasks, though specific benchmarks or comparative performance metrics against other LLMs are not detailed in the discussion. Some users highlight DeepSeek’s effectiveness in technical domains like application development and documentation, suggesting it may excel in structured, technical tasks. However, there is also interest in its ability to handle more general conversational topics, indicating a broad range of applications.
- Meca0x highlights the use of DeepSeek for development purposes, specifically mentioning its application in architectural analysis of applications and documentation. This is facilitated through the paid API, suggesting a focus on leveraging DeepSeek’s capabilities for professional and technical tasks.
- Sparklypain discusses the use of AI for complex communication and analysis tasks. They emphasize the need for AI to understand and translate unusual syntax and ideas, as well as perform multivariable and high-level regressive analysis. This involves asking iterative ‘why’ questions to uncover deeper insights, which is challenging for human counterparts.
- Sparklypain also notes the necessity of AI in facilitating high-level regressive analysis due to the complexity of their ideas and sentence structures. This involves iterative questioning to explore unknowns and feelings, which is a task that requires significant time and cognitive effort, often beyond the capability of their human friends.
GLM-4.7-Flash Adoption: Prompts, Quantization, and the "Thinking" Toggle
- Claude prompt gives GLM a big lift: Unsloth users report that porting a modified Claude Sonnet 4.5 system prompt from Anthropic's docs, via the Claude system prompt release notes, markedly improved GLM-4.7-Flash's coherence and capability ("skill diff").
- The discussion takes this as evidence that system-prompt scaffolding can dominate perceived model quality, especially for instruction following and style control, even with the underlying weights unchanged.
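The claim is that swapping only the system message changes perceived quality. A minimal sketch of what that A/B comparison looks like as two chat-completions payloads (the prompts and model id here are placeholders, not the actual Unsloth setup):

```python
import json

# Placeholder stand-ins -- not the real ported Sonnet 4.5 prompt.
CLAUDE_STYLE_PROMPT = "You are a careful, concise assistant. Think step by step."
DEFAULT_PROMPT = "You are a helpful assistant."

def build_request(system_prompt: str, user_msg: str) -> dict:
    """Chat-completions payload; only the system message differs between arms."""
    return {
        "model": "glm-4.7-flash",  # hypothetical model id
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
    }

a = build_request(DEFAULT_PROMPT, "Refactor this function.")
b = build_request(CLAUDE_STYLE_PROMPT, "Refactor this function.")

# The two arms are identical except for the system message.
assert a["messages"][1:] == b["messages"][1:]
print(json.dumps(b, indent=2))
```

Holding everything but the system message constant is what lets users attribute the "skill diff" to prompt scaffolding rather than weights.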
- Anomaly at higher quant levels: Q2 beats Q6 (???), confusion ensues: Multiple users observed GLM-4.7-Flash performing worse at higher quantization levels — preferring Q2KXL over Q6KL — and tied it to a possible quantization-tooling issue in llama.cpp/Ollama, referencing the related ggml-org/llama.cpp PR discussion thread.
- Community consensus: this is rare ("first time a model does worse at higher quants") and likely involves quantization artifacts or production-pipeline issues rather than simple sampler settings.
- Chat templates eating your reasoning: LM Studio users argue that chat templates can strip or suppress reasoning in models like Qwen3, breaking "interleaved thinking," and note that GLM4.7-Flash ships a clear_thinking-style template flag that removes thinking content unless explicitly disabled.
- The discussion ties these template behaviors to agentic coding extensions and tool workflows, suggesting that "model regression" reports sometimes stem from template defaults rather than the model weights themselves.
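The clear_thinking behavior described above can be illustrated with a toy post-processor: when the flag is on, anything inside `<think>…</think>` is dropped before the text reaches the client. A sketch assuming that tag format — the real GLM logic lives in the model's Jinja chat template, not in code like this:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def render(content: str, clear_thinking: bool = True) -> str:
    """Drop <think> blocks when clear_thinking is enabled (the reported default)."""
    return THINK_RE.sub("", content) if clear_thinking else content

raw = "<think>User wants a diff; check edge cases first.</think>Here is the patch."
print(render(raw))                        # reasoning stripped
print(render(raw, clear_thinking=False))  # interleaved thinking preserved
```

If a frontend or agent extension applies the default without exposing the toggle, the model appears to have "stopped thinking" even though the weights never changed — exactly the failure mode the thread describes.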
2. MCP and Agent Tooling: Ecosystem Growing Pains (and New Toys)
- MCP Inspector vs. 401: the re-auth boss fight: MCP contributors report that MCP Inspector fails to re-authenticate after hitting a 401, suggesting it should parse the resource metadata in the 401 response and attempt re-authorization; they also flagged a known SDK bug around persisting resource metadata across redirects, tracked in inspector issue #576.
- Members observed that VS Code appears to use the Inspector only on initial connection (not on subsequent 401s), suggesting the failure mode may originate inside the SDK; the server-side fix is already done and an SDK update is pending.
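The "parse the resource metadata in the 401" step refers to the OAuth protected-resource-metadata hint (RFC 9728 style) carried in the WWW-Authenticate header. A sketch of just that parsing step, assuming that header format — this is not the Inspector's actual code:

```python
import re

def resource_metadata_url(www_authenticate: str):
    """Extract the resource_metadata URL from a 401's WWW-Authenticate header —
    the hint a client can follow to discover how to re-authorize."""
    m = re.search(r'resource_metadata="([^"]+)"', www_authenticate)
    return m.group(1) if m else None

# Example header shape on a 401 from a protected MCP server (hypothetical host).
header = ('Bearer error="invalid_token", '
          'resource_metadata="https://mcp.example.com/.well-known/oauth-protected-resource"')
print(resource_metadata_url(header))
```

The bug report amounts to: the Inspector receives a header like this on a mid-session 401 but never re-runs the discovery/authorization flow it would have run on first connect.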
- LM Studio calls the MCP SDK a house of cards: LM Studio users criticize their MCP backend (built on the official SDK) for serious security issues and "zero consideration for developer UX," yet concede it remains "the best option available" compared with other agent frameworks.
- The pragmatic takeaway: developers want MCP, but current implementations feel fragile, so teams should expect churn in the SDK, auth flows, and tool-calling experience.
- OpenRouter ships more clients: OkeyBot + Inforno: OpenRouter users showcased OkeyBot, a bring-your-own-key Discord chat tool via OpenRouter with per-thread usage/cost estimates, at okeybot.ai; and Inforno, an open-source desktop multi-LLM chat app supporting OpenRouter + Ollama that saves history to .rno files, with an Inforno intro video and the alexkh/inforno repo.
- Meanwhile, users asked OpenRouter for a batch API for providers like Google/OpenAI, citing demand in an X post and tying it to cost/control needs for agent workloads.
3. Performance Engineering: Kernels, Collectives, and CUDA Micro-optimizations
- YALI takes a swing at NCCL (with tail-latency receipts): GPU MODE users launched YALI, a 2-GPU NVLink AllReduce library claiming 1.2×–2.4× throughput over NVIDIA NCCL and "50×+ better tail-latency stability," published on GitHub: Venkat2811/yali.
- The author emphasizes aggressive overlap of communication with compute (flash/stream modes), and even removed the project mascot after feedback that AI-flavored marketing made it look less serious — a classic open-source marketing adjustment.
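For context, the AllReduce that YALI targets is the standard sum-then-share collective, typically structured as reduce-scatter followed by all-gather. A pure-Python simulation of the two-rank semantics (the contract such a library must satisfy — no NVLink, no overlap, nothing of YALI's actual implementation):

```python
def allreduce_2rank(a, b):
    """Two-rank AllReduce(sum) as reduce-scatter + all-gather:
    each rank reduces one half of the buffer, then the halves are exchanged."""
    half = len(a) // 2
    # Phase 1 (reduce-scatter): rank 0 owns the low half, rank 1 the high half.
    low = [x + y for x, y in zip(a[:half], b[:half])]    # reduced "on rank 0"
    high = [x + y for x, y in zip(a[half:], b[half:])]   # reduced "on rank 1"
    # Phase 2 (all-gather): exchange reduced halves so both ranks hold the sum.
    result = low + high
    return result[:], result[:]

out0, out1 = allreduce_2rank([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
print(out0)  # both ranks end with the elementwise sum
```

The performance game (and the overlap YALI advertises) is entirely in *how* those two phases are scheduled against compute; the output contract above never changes.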
- One PTX suffix saves seven instructions: GPU MODE notes that rcp.approx.ftz.f32 compiles to a single MUFU.RCP instruction, while rcp.approx.f32 can emit roughly 7 extra instructions, referencing NVIDIA's PTX documentation.
- They also note that without the ftz (flush-to-zero) flag, the reciprocal of a subnormal can overflow to INF — so .ftz should be treated as both a performance optimization and a numerical-behavior choice.
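The numerical point above is easy to check with float32 magnitudes: the reciprocal of a sufficiently small subnormal exceeds the largest finite float32, so it overflows to infinity — and under .ftz the subnormal input is flushed to 0 first, making the result inf for *any* subnormal. A quick host-side arithmetic check (float32 limit constants; this is not the MUFU.RCP path itself):

```python
FLT_MAX = 3.4028234663852886e38        # largest finite float32
FLT_TRUE_MIN = 1.401298464324817e-45   # smallest positive subnormal float32
FLT_MIN = 1.1754943508222875e-38       # smallest positive *normal* float32

recip = 1.0 / FLT_TRUE_MIN             # computed in float64 here: ~7.1e44
print(f"1/smallest subnormal = {recip:.3e}")
print("would overflow float32:", recip > FLT_MAX)   # True -> inf in f32

# Not every subnormal overflows: 1/(FLT_MIN/2) ~ 1.7e38 still fits in f32.
# Under .ftz that distinction disappears: input flushed to 0, 1/0 -> inf.
print("borderline case fits:", 1.0 / (FLT_MIN / 2) < FLT_MAX)
```

Hence the framing in the thread: choosing .ftz trades subnormal handling for the single-instruction MUFU.RCP path.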
- Flash-Attention stride bug: divisibility constraint gone missing: GPU MODE users flagged a flash-attention stride-divisibility regression, saying it "boils down to a bug that removed certain stride-divisibility constraints," linking the report in a flash-attention issue comment.
- The discussion treats this as a reminder that high-performance kernels often depend on fragile shape/stride assumptions — a single changed constraint can surface as a correctness or performance cliff.
4. Coding Workflows and Model Economics: IDE Telemetry, Search, and "Cheap Model" Math
- Cursor counts your AI lines of code (enterprise spreadsheets, assemble!): Cursor users say the enterprise plan can now show the share of a codebase written by AI vs. humans, implemented via the Opus 4.5 API (distinct from Claude Code), though the prompt behind the feature is not public.
- Reactions mix curiosity with skepticism: without prompt transparency, teams can't judge measurement bias — or whether the metric is more a sales dashboard than an engineering signal.
- mgrep declares the death of grep: Cursor users discussed mgrep as a grep replacement, claiming 95% better relevance and token efficiency in LLM workflows because it returns less useless context.
- Others countered that Cursor already uses rgrep plus internal semantic search (just without a marketing name), suggesting the real difference is packaging and defaults rather than underlying ideas.
- Search engines and model pricing: Searxng, Kagi, and Grok's "cheap but chatty" tax: Unsloth members argue Google struggles to find things and recommend Searxng, while others praise Kagi's privacy and web scraping, linking a demo video: YouTube: ThgVTNVOZ7g.
- Meanwhile, Cursor users say Grok can be cheaper than Opus/Sonnet/GPT but usually needs extra iterations, so the "cheap" option can get expensive unless you tune prompts and control context usage.
5. Benchmarks, Evals, and the Reality of "Community Ground Truth"
- LMArena hits 5 million votes, posts leaderboard shakeups: LMArena announced the Text Arena has passed 5 million comparisons, and its text-to-image leaderboard update puts GLM-Image at #8 among open models and #35 overall with a score of 1018.
- Users simultaneously complain about image-model quality drops and reliability problems (captcha loops, "something went wrong" errors), suggesting the platform's measurement value is fighting ongoing product-stability issues.
- Eleuther wants agentic evals: fewer vibes, more judge pipelines: Eleuther engineers discussed automating agent evaluation to cut human-review costs, centering on "LLM-as-judge" workflows while cautioning that data quality still needs validating and agent success criteria must be defined first.
- Another Eleuther thread asks for repeated multiple-choice evals of open-weight models (e.g., Llama 7B/13B/70B) with 100 runs per question to estimate answer probabilities, emphasizing pre-written answer choices rather than model-generated ones.
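The protocol in that thread — fixed answer choices, many runs per question, empirical answer distribution — is straightforward to sketch. Here the model is a stub sampler; swapping in real model calls is the only change (all names are hypothetical, not an Eleuther harness API):

```python
import random
from collections import Counter

def estimate_answer_probs(ask_model, question, choices, runs=100):
    """Repeat a multiple-choice question and return empirical pick frequencies."""
    counts = Counter(ask_model(question, choices) for _ in range(runs))
    return {c: counts[c] / runs for c in choices}

# Stub "model": picks "B" ~70% of the time, otherwise uniform over the rest.
rng = random.Random(42)
def stub_model(question, choices):
    if rng.random() < 0.7:
        return "B"
    return rng.choice([c for c in choices if c != "B"])

probs = estimate_answer_probs(stub_model, "2+2=?", ["A", "B", "C", "D"], runs=100)
print(probs)  # e.g. heavily concentrated on "B"
```

Because the answer set is pre-written, every run lands on a known choice and the 100-run frequencies are directly interpretable as answer probabilities — which is exactly why the thread insists on fixed rather than model-generated answers.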
