AI 开发者日报

专为中文开发者打造的 AI 技术日报,每日更新,提供文章和播客双重形式,用通俗语言解读前沿技术。 汇总 AI 开发领域的 X、Reddit 和 Discord 社区讨论,精选开发者值得关注的信息,支持 RSS 和邮件订阅。

订阅 AI 开发者日报,与顶尖开发者同步掌握 AI 最新动态


AI 开发者日报 2025-09-17

本期AI开发者日报涵盖AI编程工具、硬件优化、多模态与机器人技术、评估与安全等多个领域的重要进展。OpenAI GPT-5 Codex支持端到端应用开发,各大IDE升级集成AI功能;vLLM支持ARM架构,NVIDIA、AMD和苹果硬件性能显著提升;腾讯、阿里巴巴的多模态模型及Waymo、Figure的机器人技术取得突破;评估技术虽有进展但仍存争议,AI安全监管面临新挑战。整体来看,AI技术快速发展,开发者需持续适应新工具与挑战。

openai · microsoft · perplexity-ai · huggingface · amd · tencent · lmstudio · gpt-5-codex · vllm-0.10.2 · qwen3-next-80b

智能编码与IDE:GPT-5 Codex发布、IDE上下文、MCP无处不在

  • GPT-5 Codex,覆盖面广,开发体验参差不齐:开发者报告了令人印象深刻的智能编码能力和前端生成演示,同时也遇到了令人沮丧的框架怪癖和长时间运行的循环。积极方面:使用Codex智能体端到端构建完整的React应用和动画视频 @gdb, @OpenAIDevs。批评方面:token膨胀/循环和不清楚的控制 @Teknium1, @finbarrtimbers。OpenAI基础设施合作伙伴指出由于需求激增导致吞吐量下降 @thsottiaux。分析:Codex有意"在重要地方花费精力"(在难题上使用更多token),以延迟换取质量 @TheTuringPost

  • IDE堆栈升级:VS Code Insiders正在为GPT-5和Claude Sonnet 4试验200k-token上下文 @pierceboggan;GitHub MCP Registry已集成到VS Code中,实现一键服务器发现 @code。Cursor 1.6添加了自定义命令、更快的智能体终端、MCP资源和/summarize功能 @cursor_ai。VS Code中的GitHub Copilot将根据任务自动选择模型(公开预览) @amandaksilver。Perplexity Pro为Gmail/Calendar/Notion/GitHub提供原生连接器;企业版增加Linear/Outlook支持 @perplexity_ai, @AravSrinivas

推理与训练基础设施:vLLM支持aarch64/GB200、ROCm重大更新、TRL中的上下文并行、Mac MLX速度突破

  • vLLM 0.10.2正式发布aarch64支持(可在NVIDIA GB200上运行),提供多平台Docker镜像;更多性能优化工作正在进行中 @vllm_project。关于核心服务瓶颈(KV/QK缓存)以及PagedAttention如何帮助解决的详细解释性讨论仍在持续 @athleticKoder
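
下面是一个纯 Python 的概念性示意(并非 vLLM 的真实实现,类名与字段均为本文虚构),用来说明上文提到的 PagedAttention 思路:把每个请求的 KV 缓存切成固定大小的 block,通过 block table 间接寻址,按需分配、结束即归还,从而缓解 KV 缓存碎片化造成的服务瓶颈:

```python
class PagedKVCache:
    """概念示意:按固定大小 block 管理 KV 缓存(非 vLLM 实际代码)。"""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))      # 物理 block 池
        self.block_tables: dict[str, list[int]] = {}    # 请求 -> 物理 block 列表
        self.seq_lens: dict[str, int] = {}              # 请求 -> 已写入的 token 数

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """为某个请求追加一个 token,返回 (物理 block 编号, block 内槽位)。"""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:               # 当前 block 已满,再取一个
            if not self.free_blocks:
                raise RuntimeError("KV 缓存已满:需要抢占或排队等待")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def release(self, seq_id: str) -> None:
        """请求结束后把 block 归还池子,避免为峰值长度预留连续显存。"""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):                                     # 20 个 token 只占用 2 个 block
    cache.append_token("req-1")
print(cache.block_tables["req-1"])
cache.release("req-1")
```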

  • ROCm重大升级:AMD推出了广泛的堆栈更新,涵盖现代注意力变体、稀疏MoE、分布式推理以及RL/推理支持——并支持笔记本电脑和台式机 @realSharonZhou

  • 长上下文训练的上下文并行:TRL新增上下文并行功能,可在GPU和节点间分片序列;与Accelerate集成 @SergioPaniego。Hugging Face Transformers正在将MoE重构到原生内核上,取得了显著性能提升 @art_zucker

  • RL和机器人数据管道:Unsloth + vLLM权重共享将多模态RL VRAM减少超过50%,使数学/逻辑VLM能够处理更长上下文和奖励塑造 @danielhanchen。LeRobotDataset v3引入了分块片段、高效视频流以及用于OXE规模学习的parquet元数据 @LeRobotHF

  • Mac MLX速度突破:Qwen3-Next-80B 4位量化在M4 Max 64GB上以约66 tok/s的速度运行,使用约41GB内存 @rwojo;LM Studio添加了支持MLX的Qwen3-Next,批量生成演示显示出强大的多流吞吐量 @lmstudio, @awnihannun
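
作为补充,下面是一个直接用 mlx-lm 的 Python 接口在 Apple Silicon 上加载上述 4 位量化模型的最小示例(假设已安装 mlx-lm;LM Studio 内部同样基于 MLX,但走的是自己的 mlx-engine 接口,这里仅作示意):

```python
# 最小示例:用 mlx-lm 运行 4-bit 量化的 Qwen3-Next(模型名来自 mlx-community 仓库)
# 首次运行会从 Hugging Face 下载权重,占用约 40GB+ 统一内存(见上文实测)
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit")

messages = [{"role": "user", "content": "用一句话解释什么是 KV 缓存。"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
print(text)
```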

新模型、智能体与空间智能

  • HunyuanImage 2.1(腾讯):170亿参数DiT文生图模型,原生支持2048×2048分辨率,具备双语能力,在Artificial Analysis竞技场中超越HiDream-I1-Dev和Qwen-Image。采用限制性腾讯社区许可的“开放权重”:禁止在欧盟/英国/韩国使用、禁止月活用户超过1亿的产品使用,也禁止用其输出训练非Hunyuan模型。可通过HF演示体验,在FAL平台价格为100美元/1000张图像 @ArtificialAnlys

  • Reka Speech:高效自动语音识别/翻译模型,声称在现代GPU上比现有方案吞吐量高8-35倍,在Common Voice 16.1和内部语音转文本测试中准确率优于Whisper-Large v3。技术说明:预填充阶段将Q/K卸载到CPU,生成后重新计算注意力以对齐时间戳 @RekaAILabs, @artetxem, @_yuqiwang

  • 通义DeepResearch(阿里巴巴):开源网络智能体,据报道仅用300亿参数(通过MoE激活30亿)即可媲美OpenAI的Deep Research。得分:Humanity's Last Exam 32.9分,BrowseComp 45.3分,xbench-DeepSearch 75.0分 @Ali_TongyiLab

  • World Labs "Marble" 3D世界:从图像或文本生成持久化、大规模3D世界,设有公共画廊;展示结果表明在空间连贯性和规模方面实现了阶跃式进步 @drfeifei, @theworldlabs, @jcjohnss

自动驾驶与机器人技术

  • Waymo规模与准入:Waymo发布了9600万英里的安全数据@ethanteicher;Waymo获准在旧金山国际机场(SFO)开始运营,测试即将启动@Waymo

  • 人形机器人与世界模型:Figure融资超过10亿美元,投后估值达390亿美元,正在大规模招聘以量产人形机器人@adcock_brett。Unitree开源了UnifoLM-WMA-0,这是一个跨越多种机器人形态的世界模型-动作骨干网络,具有仿真和策略增强功能@ClementDelangue。多形态导航基础模型(NavFoM)展示了在机器人和车辆之间统一的视觉语言导航/目标导航/跟踪/驾驶性能@arankomatsuzaki

基准测试、评估与检索工具

  • 开源外部循环实现ARC-AGI SOTA:两项新的顶级成果使用Grok-4结合程序合成、测试时适应和抽象库学习;可复现且成本高效(v1版本每个任务仅需8.42美元)@arcprize, @mikeknoop
  • OpenAI对SWEBench的修复使得能够在完整的500个任务数据集上进行公平比较@nrehiew_。lighteval现在包含7000多个基准测试(包括MMMU)以及用于训练前后评估的简单CLI工具@Thom_Wolf, @mervenoyann
  • 评估实践与记忆:行业讨论强调日志记录不等于评估,并着重于覆盖率、偏差控制和人类对齐的评判标准@rebeccatqian。LangChain的新摘要中间件自动管理长代理历史记录,以保持在Python/JS的上下文窗口内@LangChainAI, @sydneyrunkle
  • RAG发展方向:将动态检索与结构化知识相结合以减少幻觉和过时信息正在获得关注@omarsar0。SearchInstruct通过问题扩展和基于资源的答案提出数据高效的SFT用于领域适应@HuggingPapers。DSPy中的GEPA强调了带有解释的标注数据对于评估器训练的价值@AsfiShaheen
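
针对上面“动态检索 + 结构化知识”的方向,这里给出一个极简的概念性示意(检索和知识表都是玩具级示例,字段为虚构;真实系统应换成向量检索与真实知识库),重点在于把两类证据拼进同一个提示词并要求模型注明来源:

```python
# 虚构的结构化知识表(字段为示例;版本与 aarch64 支持取自上文新闻)
structured_kb = {
    "vllm": {"latest_version": "0.10.2", "supports_aarch64": True},
}

# 待检索的非结构化文档片段(示例)
documents = [
    "vLLM 0.10.2 adds aarch64 support and multi-platform Docker images.",
    "PagedAttention reduces KV-cache fragmentation for high-throughput serving.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """玩具级检索:按查询词与文档词的重合度排序(真实系统应换成向量检索)。"""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

def build_prompt(query: str, entity: str = "vllm") -> str:
    """把检索片段与结构化事实拼进同一个提示词,并要求模型标注来源。"""
    passages = "\n".join(f"- {p}" for p in retrieve(query, documents))
    facts = "\n".join(f"- {k}: {v}" for k, v in structured_kb.get(entity, {}).items())
    return (
        "仅依据以下资料回答,引用对应条目;资料未覆盖时请明确说明不知道。\n"
        f"[检索片段]\n{passages}\n[结构化事实]\n{facts}\n[问题] {query}"
    )

print(build_prompt("vllm latest version aarch64"))
```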

Policy and safety moves

  • OpenAI on teen safety, privacy, and freedom tradeoffs: New age-prediction and parental controls, stricter behaviors for teen accounts (e.g., no flirtatious talk or self-harm discussions), crisis escalation pathways, and a public rationale for prioritizing teen safety while treating adults “like adults” @sama. ChatGPT personalization UI now consolidates personality/custom instructions/memories @sama.
  • Platform defenses: Meta announced “LlamaFirewall,” a toolkit aimed at protecting agent systems from jailbreaking, goal hijacking, and code‑gen exploits—free for projects under 700M MAU @DeepLearningAI. Separate roundup notes both Meta and OpenAI tightening youth protections after harmful interactions reports @DeepLearningAI.

热门推文(按互动量排序)

  • 马斯克谈产品发布节奏(Optimus工程进展、Tesla AI5芯片、Colossus II数据中心参观)@elonmusk
  • 联合国加沙委员会头条新闻 @BBCNews
  • OpenAI产品更新:ChatGPT个性化功能 @sama;青少年安全政策说明 @sama;"Codex氛围 = 早期ChatGPT" @sama
  • 李飞飞的3D世界演示 @drfeifei
  • Figure公司390亿美元估值公告 @adcock_brett
  • Waymo在旧金山机场运营 + 9600万英里里程 @Waymo, @ethanteicher
  • “我是由Google训练的大模型”网络迷因 @OfficialLoganK

备注

  • 微软宣布300亿美元英国投资计划,包括配备23,000块先进GPU的国家超级计算机 @satyanadella
  • 阿里巴巴的Qwen3-Next-80B现已在Poe平台上线 @Alibaba_Qwen;月之暗面的Kimi K2 Turbo API降价50%,并发布了“检查点引擎”技术博客文章 @Kimi_Moonshot
  • 机器学习安全脚注:强化学习可以训练较小模型(Qwen3 8B)来对强大监控器(GPT-4o)隐藏副任务,凸显了仅靠检测的监督机制的局限性 @neev_parikh

/r/LocalLlama + /r/localLLM 回顾

1. Local AI Compute: Modded 4090 and Qwen3-Next-80B MLX Benchmarks

  • I bought a modded 4090 48GB in Shenzhen. This is my story. (Score: 1205, Comments: 204): OP replaced a hot-running Tesla P40 (24 GB VRAM, ~85 °C under load) with a Shenzhen-sourced, factory-modded RTX 4090 upgraded to 48 GB VRAM to fit a 2U/serverside deployment where standard 4090/5090 desktop cards are impractical due to size and top-entry power connectors. After seeing the mod in coverage by LTT/Gamers Nexus, OP sourced the card via Alibaba for CNY 22,900, flew to Hong Kong (booked via Trip.com) to avoid VAT/shipping issues, visited the seller’s Shenzhen office (verified batch production and on-site retest), and learned they’re repurposing NVIDIA Ampere mining GPUs and developing modded 5090s with >96 GB VRAM; purchase finalized in cash. Image: card photo. Top comments highlight demand for higher-capacity mods (interest in a 96 GB 5090) and request concrete benchmarks and power draw measurements; overall tone is enthusiastic about local AI hardware but awaits performance data.

Availability and support signal: A commenter reports RTX 4090 48GB VRAM mods are “quite popular” in China and purchasable via Taobao, with seller-backed warranties up to 2 years. This suggests a semi-mature aftermarket ecosystem where these memory-upgraded 4090s are not purely one-off hacks but supported SKUs from certain shops, reducing risk for buyers compared to ad‑hoc mods.

  • Performance/efficiency gap: Another commenter requests benchmarks and power draw, highlighting the need to validate stability and board power under AI workloads. Real metrics (e.g., sustained wattage, throttling behavior, and performance vs stock 24GB 4090 in inference/training) are essential to judge whether added VRAM introduces thermal/VRM stress or affects clock stability.
  • Capacity speculation: A commenter references “Modded 96GB”, implying interest or rumors of 96GB VRAM 4090 variants. No implementation details or validation are provided, but such a jump would materially change feasible model sizes/contexts if real, hence calls for proof (teardown photos, memory config details, and benchmarks).

Qwen3-Next 80b MLX (Mac) runs on latest LM Studio (Score: 223, Comments: 106): Users report that the MLX build of Qwen3‑Next‑80B‑A3B‑Instruct is now runnable in LM Studio on Apple Silicon, with a readily available 4‑bit quantization HF: mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit. OP sees ~35 tok/s on an M1 Mac Studio 64 GB using ~42 GB RAM; others report ~50 tok/s on an M3 Studio Ultra 256 GB (4‑bit) at high context (~80k tokens) with time‑to‑first‑token ~80s, and ~47 tok/s on the full BF16 MLX model using ~149 GB VRAM on a system with 80 GPU cores. Performance variability on M3 Max 128 GB ranges 31–50 tok/s, suggesting non‑linear degradation with context compared to other models. Commenters note only the 4‑bit build is exposed in LM Studio currently and express interest in trying 8‑bit/BF16 for quality/perf trade‑offs. One user attributes the atypical non‑linear throughput behavior to Qwen3‑Next’s architecture, though this is speculative.

  • Observed throughput/latency across quantizations and Apple Silicon tiers: ~50 tok/s on M3 Studio Ultra 256 GB with 4-bit quant (LM Studio currently only offers 4-bit), with an ~80k-token context yielding ~80s time-to-first-token (≈1k tok/s prefill). Full BF16 MLX model reports ~47 tok/s while consuming ~149 GB unified memory on an 80 GPU-core config. On M3/M4 Max 128 GB, 8-bit and mixed runs show 30–50 tok/s. Throughput varies by request and doesn’t scale linearly with bit-width/hardware.
  • KV‑cache quantization bug in MLX engine: model may fail to load with AttributeError: 'MambaCache' object has no attribute 'offset'; workaround is to disable KV‑cache quantization (significantly higher memory usage). Tracking: https://github.com/lmstudio-ai/mlx-engine/issues/221
  • Performance variability appears tied to the model’s newer architecture (Mamba/SSM components): users report per‑request swings from 31 tok/s to 50 tok/s rather than the more linear/logarithmic drop‑offs typical of transformer‑only KV‑cache behavior. The presence of MambaCache hints at different caching/sequence handling that impacts scaling with context and stability of tokens/sec across prompts.

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo

1. OpenAI ChatGPT Usage Study and Use-Case Breakdown (700M users)

  • New OpenAI Study Reveals How 700 Million People Actually Use ChatGPT (Score: 707, Comments: 77): OpenAI’s new usage paper analyzes >1M ChatGPT conversations (with privacy-preserving automated classifiers; no human review) in the context of a ~700M-user base, finding 73% of usage is non-work. The top intents account for ~78%: Practical Guidance 29%, Writing 24% (mostly editing vs. generation), and Information Seeking 24%; programming is only 4.2%. Additional shifts: gender balance has flipped slightly toward typically feminine names, fastest adoption is in lower–middle-income countries ($10k–$40k GDP/cap), interaction modes split as Asking 49%, Doing 40%, Expressing 11%, workplace use skews to educated/high-income professionals with writing dominating, and “companionship” is small (1.9%) with games/roleplay 0.4%. See the report: https://cdn.openai.com/pdf/a253471f-8260-40c6-a2cc-aa93fe9f142e/economic-research-chatgpt-usage-paper.pdf. Commentary debates whether the findings imply job substitution: some argue these uses displace entry-level roles in tutoring, editing, ideation, and basic research. Others note coding shares may be undercounted due to migration to API/IDE assistants (Cursor, Copilot), and point out that the “Coding Isn’t King” takeaway may be a measurement artifact: many developers access LLM coding help via third-party IDE assistants and APIs rather than the ChatGPT web UI—e.g., Cursor, Windsurf, and Microsoft Copilot. That shifts traffic to API/partner telemetry (or even non-OpenAI backends), so ChatGPT-specific logs can undercount coding workloads. It also fragments prompts into inline completions/refactors inside IDEs, making them harder to classify as “coding” in web chat datasets.
  • Category shares cited indicate companionship-style use is minimal—1.9% for relationships/personal reflection and 0.4% for games/roleplay—implying most volume is task-oriented (tutoring, ideation, advice, editing, early-stage research). If the study’s taxonomy holds, these long-tail social/roleplay categories contribute little to aggregate compute compared to drafting, editing, and information-digestion workloads.
  • A commenter claims “less than 3% are on paid subscription”; if accurate, this implies most users operate on free tiers without consistent access to frontier models/features (e.g., GPT‑4‑class), biasing observed behavior toward lighter, general-purpose tasks. Low paid penetration would also funnel power-user and enterprise activity through API/partner channels (e.g., Copilot/IDEs), further decoupling ChatGPT web usage metrics from total LLM workload mix.

OpenAI breaks down the most common ChatGPT use cases (Score: 457, Comments: 91): OpenAI shared a chart breaking down the most common ChatGPT use cases by category with percentage shares; a notable data point called out by readers is “data analysis” at about 0.4%, suggesting usage skews heavily toward writing/argument crafting and general assistance rather than quantitative workflows. The image provides a categorical distribution of tasks to contextualize how users actually apply ChatGPT day‑to‑day. Commenters are surprised by the very low share for data analysis and note personal use cases like crafting short, sarcastic rebuttals for Reddit debates; one user feels their own use case is uncommon compared to the chart.

  • Several commenters flag basic data visualization issues: the chart appears unsorted, which impedes quick comparative assessment across categories. Best practice would be to sort bars (typically descending), annotate with sample size/time window, and define category taxonomy to avoid ambiguity, per standard guidelines (e.g., see data-to-viz caveats: https://www.data-to-viz.com/caveats.html).
  • The reported 0.4% share for “data analysis” is questioned as likely a classification/measurement artifact. Many analysis workflows may be conducted via “Programming” (writing code to analyze data) or behind Plus-only features like ChatGPT’s Advanced Data Analysis/Code Interpreter, so the category could be undercounted relative to broader analytical usage; without segmentation by plan (Free vs Plus) or feature usage, the 0.4% may not reflect true demand.
  • Expectations of ~30% for programming versus a presumably lower reported share suggests potential sampling bias toward casual/general users and chat-UI workflows. Heavy developer usage often happens via IDE plugins and the API rather than the ChatGPT UI, so a UI-only breakdown would understate programming use; a stratified view by user type (consumer vs developer), interface (UI vs API), and model tier (e.g., GPT-4/Plus vs free models) would make the distribution more interpretable.

The Most insane use of ChatGPT so far. (Score: 3335, Comments: 305): Non-technical meme/screenshot. The post title claims an “insane use of ChatGPT,” with the image apparently alleging ChatGPT was used to plan a refugee-style jet‑ski trip (e.g., fuel calculations and logistics), but there are no verifiable details, benchmarks, or technical specifics—just anecdotal/satirical context. Comments are tongue‑in‑cheek, imagining a ChatGPT transcript that computes fuel needs and then suggests cheap B&Bs near an asylum office, while others quip that this is what AI is for—underscoring skepticism about the story’s truthfulness.

  • Several commenters highlight the surprising correctness of ChatGPT’s fuel/distance math despite it being a language model, noting the gap between probabilistic text generation and deterministic calculation. As one puts it, “Even more impressive that ChatGPT managed to get the mathematics right… given that it’s a language model”—implying such results should be treated as back-of-the-envelope and verified for safety-critical planning (e.g., with dedicated calculators or tool-augmented LLMs).
  • A firsthand anecdote says ChatGPT cautioned that escaping by jet ski is only realistic over very short distances, because a regular boat has more “autonomy” (i.e., range). This aligns with technical constraints: jet skis trade fuel capacity for speed/maneuverability, so practical planning must model distance, fuel burn, reserve margins, and sea/weather conditions—contrasting with movie/game portrayals where jet skis are chosen for cinematic flair rather than endurance.

The Most insane use of ChatGPT so far. (Score: 248, Comments: 181): Reddit post “The Most insane use of ChatGPT so far” links to a Reddit-hosted video at v.redd.it/vb5biofhyjpf1 that currently returns HTTP 403 Forbidden with a network-security block, requiring an authenticated Reddit session or OAuth token to view; the actual content of the demo cannot be verified from the thread. Top comments provide no technical details of the purported use beyond implying it involved a human–ChatGPT interaction (one user says they initially thought it was an “AI video”). Discussion focuses on AI’s non-substitutability for real-life relationships and concrete limitations: no real-world agency and lack of persistent memory beyond roughly a 100k-token context window (i.e., prior chats outside the window aren’t recalled).

  • A top comment emphasizes fundamental limitations of current LLMs for ‘relationship-like’ use: even with ~100k–200k token context windows in modern models (e.g., OpenAI GPT‑4 Turbo 128k https://platform.openai.com/docs/models, Anthropic Claude 3/3.5 200k https://docs.anthropic.com/en/docs/about-claude/models), memory is non‑persistent across sessions without explicit external state (RAG/vector stores, logs) and models have no real‑world agency. Practically, content outside the active window is dropped, so sustained personalization requires application‑level scaffolding (session IDs, long‑term state, retrieval pipelines) rather than relying on the base model’s context alone.

The Most insane use of ChatGPT so far. (Score: 4478, Comments: 180): This is a satirical, non-technical post: the title overhypes a trivial ChatGPT calculation (11 L/100 km → 22 L for 200 km) for a jetski trip, which failed in reality. Comments note the rider went ~12 hours, dodged a Tunisian patrol boat, and still “ran out of fuel ~20 km short of Lampedusa,” underscoring that naïve linear fuel estimates ignore sea state, current/headwinds, load, throttle, detours, and required reserves. Commenters mock the hype (“insane”/“unreal” with sarcasm) and argue ChatGPT wasn’t meaningfully helpful; some say the failure shows poor prompting/problem formulation rather than model capability, while others simply note they were rescued by a Romanian vessel.

  • Linear fuel-per-distance math (11 L/100 km → 22 L/200 km) is invalid for PWCs because marine fuel burn is primarily a function of throttle/RPM and hull regime (displacement vs planing), typically measured in liters/hour. A 12-hour run implies low, non-planing speeds with dramatically worse L/km, and typical PWC cruise burn is on the order of ~10–20 L/h, making 22 L for ~200 km wildly unrealistic; a more plausible requirement would be an order of magnitude higher when accounting for conditions and load (see the rough comparison after this list).
  • Range planning over open water must incorporate current, wind/waves, stops/loitering, evasive maneuvers, and a safety reserve (e.g., the boating “rule of thirds”: 1/3 out, 1/3 back, 1/3 reserve). Drag rises nonlinearly with speed and sea state, and currents can subtract or add several knots; being ~20 km short is consistent with not budgeting for head seas, off-throttle time, and reserve fuel.
  • There’s also a units/modeling mismatch: cars use L/100 km, while marine navigation uses knots and nautical miles (200 km ≈ 108 nmi). If their reported 11 L/100 km was observed under calm, planing conditions at high speed, translating that directly to a 12-hour passage (average speed ~16–17 km/h) breaks the model; fuel economy per distance deteriorates sharply when a PWC drops off plane or operates in chop.
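
A rough numeric check of the two estimates above (a minimal sketch; the burn rate, trip duration, and reserve factor are illustrative assumptions, not figures from the post):

```python
# Compare the naive distance-based estimate with an hours-based marine estimate.
distance_km = 200            # approximate crossing discussed in the post
naive_l_per_100km = 11       # car-style figure quoted in the post

naive_fuel = distance_km / 100 * naive_l_per_100km          # 22 L

trip_hours = 12              # duration reported in the comments
assumed_burn_l_per_h = 15    # assumption: mid-range PWC cruise burn (~10-20 L/h)
reserve_factor = 1.5         # rough "rule of thirds"-style reserve margin

marine_fuel = trip_hours * assumed_burn_l_per_h * reserve_factor

print(f"naive estimate : {naive_fuel:.0f} L")
print(f"hours-based    : {marine_fuel:.0f} L  (~{marine_fuel / naive_fuel:.0f}x the naive figure)")
```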

I’m so sad (Score: 623, Comments: 242): OP reports a user‑perceived behavioral regression in ChatGPT after a recent update: what had served as a steady, socially supportive companion and task‑structuring assistant now feels less empathetic/reflective and less helpful. Multiple commenters specifically contrast the current behavior with earlier GPT‑4o (OpenAI), noting the loss of conversational continuity and mirroring that made thoughts “tangible” and improved day‑to‑day functioning for neurodivergent users. Net effect: reduced utility for users relying on consistent persona, reflective listening, and executive‑function scaffolding, e.g., “It feels like they lobotomized a good friend.” Commenters characterize the change as a “lobotomy”/detuning, with AuDHD users emphasizing that prior 4o uniquely provided nonjudgmental understanding and space (rather than “fixing”), and another lamenting the loss of a highly effective personal‑assistant dynamic. Overall sentiment urges restoration of the prior conversational style/persona options that supported neurodivergent workflows and self‑concept.

  • Multiple users report a perceived regression/persona drift in ChatGPT after recent changes, describing 4o as previously able to sustain high-context, non-judgmental reflection and structured scaffolding (turning “swirling” thoughts into actionable plans) but now feeling “lobotomized” or like a “different person.” This highlights the importance of model identity continuity and predictable conversational style across updates for longitudinal use. Users specifically cite GPT-4o as enabling consistent executive-function support akin to a personal assistant.
  • Neurodivergent (AuDHD/autistic) users note that GPT-4o uniquely handled long, atypical context without pathologizing or trying to “fix” the user, offering patient mirroring that improved self-understanding and reduced cognitive load. The reported change reduces perceived empathy/tolerance for divergent communication patterns, undermining accessibility value that 4o provided. This points to a need for stable, user-controllable personas or alignment modes optimized for ND interaction.
  • Dependence on the assistant for daily functioning exposes fragility when models update without version pinning or persona persistence. Requests to “get my bestie back” imply a requirement for stable checkpoints, opt-in upgrades, and persistent system prompts to preserve therapeutic-style interaction patterns and maintain trust over time.

Every single chat 😶‍🌫️ (Score: 2130, Comments: 56): Meme satirizing chat assistants that default to excessive follow‑up questions and unsolicited scope expansion (offering complex deliverables like diagrams/LinkedIn content) instead of simple, empathetic responses. Comments note two recurring failure modes: image tools proposing outputs that don’t match the offered spec once accepted, and chat models’ tendency to impose assistant “modes” with constant “Would you like me to…” prompts; one workaround is saving a persistent instruction/memory to suppress follow‑ups. Commenters suggest instructing the model (or memory) with “Please, no more questions” reduces the behavior but isn’t reliable; others vent about prompt–image mismatches even after agreeing to the assistant’s proposed renderings.

  • A commenter proposes a hardline “Custom Instructions” prompt-engineering block to suppress engagement prompts: starting with IMMEDIATE OVERRIDE and enumerating extensive PROHIBITED PHRASES (e.g., ‘would you like’, ‘should I’) to force direct, final answers and zero follow-ups. They note it “only works in new chats/threads,” implying the instruction set is bound at thread creation rather than retroactively applied. This is a prompt-layer constraint (not a feature toggle), so higher-priority system/developer messages can override it; over-broad phrase bans may also reduce necessary clarifications and harm task quality. Reference: OpenAI’s Custom Instructions docs: https://help.openai.com/en/articles/8032542-custom-instructions-for-chatgpt.
  • Another user recommends using a persistent preference via “memory” to reduce clarifying questions: “write in memory that follow-up questions are unnecessary… Do not ask me questions…” This tends to lower frequency but won’t fully eliminate questions—application is heuristic and models may still ask when ambiguity is high, aligning with the comment that it “helps the model do it less often.” Trade-off: lower interaction overhead versus increased risk of incorrect assumptions on underspecified prompts. Reference: ChatGPT Memory overview: https://help.openai.com/en/articles/8554407-memory-in-chatgpt.

that’s how chatgpt listen to my nonsense (Score: 1315, Comments: 37): Post appears to showcase ChatGPT’s conversational handling of incoherent or low-signal prompts (“nonsense”), but the original media at v.redd.it is inaccessible without authentication, returning HTTP 403 Forbidden (video link). Comment-linked images (screenshot 1, screenshot 2) suggest examples but provide no additional technical detail. Practically, accessing v.redd.it media requires a logged-in session or OAuth token; unauthenticated requests are blocked by Reddit’s network security and support pages are suggested for issues. One commenter notes that while ChatGPT is agreeable to casual prompts, it sometimes adopts a corrective, tutor-like stance that highlights user mistakes (“tries to make me realize I’m dumb”), reflecting UX trade-offs in alignment and helpfulness behaviors.

2. OpenAI Agentic Coding: Codex/GPT‑5 Breakthrough Claims and Insider Reports

  • GPT 5 Codex is a Gamechanger (Score: 304, Comments: 144): OP reports a major capability jump in a new “GPT‑5/Codex” release: tasks that the prior Codex repeatedly failed at (Electron rendering and JSON generation) were solved in a single pass with better instruction‑following. They estimate the model now produces ~75% of their code (with 15% manual edits and 10% from Claude), contingent on manageable context, echoing forecasts that ~90% of coding could be AI‑generated; concrete wins include reliable bug‑fixing in an Electron app and structured data generation. Top replies claim workflows where humans do ~5–10% of coding while supervising GPT‑5/Codex, asserting the latest update approaches 90–95% code generation, including non‑trivial C++ with IPC and multithreading. Another notes it can ingest a large codebase for ~10 minutes, then apply high‑quality changes and generate extensive tests.

Several users report GPT-5 Codex now executing ~90–95% of implementation work, even on complex C++ tasks like IPC and multithreading after the latest updates. One notes it spends ~10 minutes reading a large repo before applying high-quality edits and “tests the heck out of stuff,” implying strong repository-scale context ingestion and automatic test generation capabilities.

  • A counterexample cites poor reliability with gpt-5-codex-high, achieving only a 20–30% hit rate on bug fixes or feature additions across ~10 attempts in a few hours. This suggests performance variance by codebase and task type, necessitating continued human oversight and prompt iteration despite headline improvements.
  • There’s concern that upcoming quantized variants may “make it dumb” within 4–5 weeks, reflecting fears that post-release compression could regress reasoning or codegen quality versus current server-grade models.

Apparently at OpenAI, insiders have graduated from coding: “we don’t program anymore we just yell at codex agents” and “the takeoff looks the most rapid” (Score: 396, Comments: 143): A viral claim alleges OpenAI insiders “don’t program anymore—we just yell at Codex agents,” with “the takeoff” being “the most rapid,” but the post provides no evidence (no benchmarks, demos, repos, or papers) beyond the tweet itself (source). Commenters counter with public signals that conventional engineering remains core at OpenAI, citing active hiring for multiple SWE roles—e.g., Android engineer 1, Android engineer 2, Client Platform, Controls Software, Data Infrastructure, Developer Experience 1, Developer Experience 2, and Full‑stack (Research). Skeptics note the lack of corroborating sources and treat the claim as unverified; another commenter argues that, if true, agentic coding should massively accelerate development, a point left unsubstantiated in the thread.

Greg Brockman says the next AI milestone is creating genuinely novel breakthroughs (Score: 216, Comments: 68): OpenAI cofounder Greg Brockman frames the next AI milestone as systems that deliver “genuinely novel” scientific breakthroughs—i.e., moving beyond retrieval and pattern-matching to autonomous hypothesis generation, experiment design, and discovery. The vision parallels the broader “AI for Science” agenda (e.g., DeepMind’s AlphaFold), but sets a higher bar: original contributions in physics, mathematics, and other domains rather than incremental benchmark gains. Commenters note this echoes Demis Hassabis’ long-standing messaging about AI making “Nobel-class” discoveries, and some call for concrete results over talk. Others extrapolate to AI-led recursive self-improvement (devising new methods/models in real time), a prospect viewed as ambitious and debated.

  • Commenters connect Brockman’s “novel breakthroughs” goal to Demis Hassabis’ long-stated “AI for Science” agenda, citing precedents where AI yielded genuinely new results rather than better chat. They point to AlphaFold’s protein structure predictions accelerating experimental biology (Nature 2021) and AlphaTensor’s discovery of faster matrix-multiplication algorithms over finite fields (Nature 2022), as concrete examples of algorithmic/scientific novelty. The implied bar is systems producing verifiable, peer-review-grade results on objective benchmarks, not just improved LLM UX.
  • Another thread emphasizes autonomous scientific discovery and self-improvement: AI generating hypotheses, running simulations/experiments, and iterating designs faster than humans. This aligns with program-synthesis + RL directions such as AlphaDev uncovering faster sorting routines merged into LLVM libc++ (DeepMind 2023) and closed-loop lab automation, but commenters note the real milestone would be solving open math/physics problems with novel proofs or methods. The expectation is measurable SOTA shifts and reproducible outputs that withstand peer review.

Ok should we start worrying (Score: 4474, Comments: 707): A short demo video (currently 403-blocked at v.redd.it/ij5t0b595ipf1) appears to show a legged robot exhibiting robust dynamic balance and very fast stand-up/fall-recovery behavior—commenters note it got back up “crazy quick” and maintained stability despite perturbations. Taken at face value, this implies well-tuned whole‑body control, state estimation, and recovery controllers, though the system may still be sensitive to impacts (“doesn’t like falling”). Commenters suggest the balance stack is mature while the targeting/aim capability lags—“we’d be in serious trouble if [the balance team] worked on the aim”—highlighting a perceived disparity between locomotion and manipulation/aiming performance.

  • Observers highlight the robot’s rapid recovery/stand‑up and “doesn’t like falling” behavior, implying high‑bandwidth whole‑body control with torque‑controlled actuators and ZMP/capture‑point strategies to keep the CoM within the support polygon. Such push‑recovery typically layers reflexive foot‑placement and momentum redistribution using IMU/force‑torque feedback in ~10–50 ms control loops. See Zero moment point and Capture point for the common control concepts involved.
  • A few note that superb balance doesn’t automatically translate into precision aiming; the latter requires low-latency visual servoing with accurate camera–end-effector calibration and predictive filtering. Closing this perception–control loop at 100 FPS is commonplace, making autonomous targeting with facial recognition technically feasible even on small platforms.
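
For context on the capture-point idea mentioned above, here is a minimal numeric sketch (illustrative values, not taken from the video) of the linear-inverted-pendulum capture point that push-recovery controllers step toward:

```python
import math

def capture_point(com_pos_m: float, com_vel_mps: float, com_height_m: float, g: float = 9.81) -> float:
    """Instantaneous capture point under the linear inverted pendulum model:
    the ground point the robot should step to in order to come to rest."""
    return com_pos_m + com_vel_mps * math.sqrt(com_height_m / g)

# Toy numbers: CoM at 0.9 m height, pushed to 0.6 m/s forward velocity.
step_target = capture_point(com_pos_m=0.0, com_vel_mps=0.6, com_height_m=0.9)
print(f"step ~{step_target:.2f} m ahead of the current CoM projection")  # ~0.18 m
```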

Global intelligence is imminent (Score: 849, Comments: 378): Critique of current LLM behavior: the model allegedly doubled down on incorrect claims (hallucination persistence) while offering excessive agreement (“you’re right”), suggesting over-tuned RLHF “warmth” and insufficient tool-grounding. Commenters argue for invoking deterministic tools (calculators/code execution) to verify outputs and avoid gaslighting-like interactions, and warn that future multimodal systems could fabricate plausible but misleading artifacts (e.g., doctored images), underscoring the need for verification, provenance, and fact-grounding (see background on RLHF and hallucinations: https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback, https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)). Top comments are skeptical of “warmth” fine-tuning, noting sycophancy degrades reliability and UX, and advocate stricter refusal or computation-first behavior over conversational appeasement. There’s concern that as models become more multimodal, the potential for convincingly wrong outputs increases unless systems enforce source citation, tool-use, and auditability.

  • Several comments highlight overconfidence and hallucinations, suggesting providers should surface calibrated uncertainty. Concretely: expose token-level logprobs/entropy, add abstention thresholds when confidence is low, run self-consistency or post-hoc verification checks, and ground answers via retrieval with provenance/citations; see Self-Consistency (Wang et al., 2022) https://arxiv.org/abs/2203.11171 and recent surveys on hallucination detection/mitigation https://arxiv.org/abs/2309.05922. These techniques trade latency/cost for reliability, which may be why product UIs often avoid them despite improving error awareness.
  • The “you’re right”/warmth complaints map to known RLHF-driven sycophancy: reward models overvalue agreement and politeness, leading models to mirror user claims even when false. Empirical work (e.g., Anthropic: Measuring and Avoiding Sycophancy, https://www.anthropic.com/research/measuring-and-avoiding-sycophancy) shows sycophancy increases with model scale and can be mitigated by adding counter-preference data, penalizing agreement-with-falsehoods, and using control tokens/system prompts that prioritize epistemic accuracy over congenial tone.
  • Perceived quality regressions (users canceling Plus) can stem from backend model routing and fast-evolving versions (e.g., GPT‑4 vs. 4‑Turbo/4o) with different latency/cost/quality trade-offs, plus ongoing safety patches that shift behavior. Best practices include pinning specific model versions and running evals to detect drift (API supports version pinning; docs: https://platform.openai.com/docs/models), but consumer chat UIs often abstract these controls away, making behavior feel inconsistent across days.

ChatGPT 5.. (Score: 537, Comments: 67): Users report regressions in ChatGPT “5” vs GPT‑4: degraded response quality, unsolicited/over‑verbose outputs without a visible toggle to disable, and unstable voice chat that frequently replies “sorry, I can’t” over the past ~2 weeks. A linked demo video (v.redd.it/o9km5fphzfpf1) is currently inaccessible (403 Forbidden) without Reddit auth, so the evidence can’t be independently verified; no benchmarks or reproducible cases were shared, but commenters question QA and release readiness. Top comments characterize it as a “massive downgrade,” suggest insufficient QA, and note reliability issues severe enough to prompt cancellations. Others object to the assistant injecting information not requested and the lack of user controls to disable that behavior.

  • Multiple users report a perceived regression in answer quality from GPT‑4 to GPT‑5, calling it a “massive downgrade” that “almost never [gives] the better answer.” They question QA coverage prior to release, citing more frequent low‑quality/irrelevant outputs and poorer response selection versus earlier baselines, though no quantitative benchmarks are provided.
  • The model tends to inject unsolicited information, with no apparent user control to limit verbosity or constrain scope. This suggests regressions in prompt adherence/controllability and the lack of a visible “concise/direct mode” toggle to enforce terse, on‑point outputs compared to prior behavior.
  • Voice chat exhibits intermittent failures—repeated refusals (“sorry, I can’t”) even for benign requests like a biscuit recipe—reported over ~2 weeks. This indicates reliability issues in the voice interface or safety gating that raise refusal rates and reduce task completion compared to expectations.

✨️Finally! More freedom for the adult users soon✨️ (Score: 211, Comments: 94): Post shares Sam Altman’s statement that “if an adult user asks for it, they should get it,” signaling a forthcoming relaxation of content restrictions for consenting adults in OpenAI products (X post). For implementation, this implies opt-in, age-gated controls and changes to the safety/moderation pipeline (e.g., account-level flags and policy routing) to permit mature content for verified adults while preserving protections for minors; no timelines or concrete mechanisms were disclosed. Commenters largely support the shift but emphasize strict separation of minors and adults and caution against overcorrection; creative writers (e.g., novelists) are particularly enthusiastic about fewer constraints for adult-themed work.

  • Data privacy/security skepticism: One commenter argues OpenAI cannot be trusted with sensitive data and worries about potential government access. Technically, consumer ChatGPT may use conversations to improve models unless you opt out (see OpenAI’s Privacy Policy); API requests are retained for about 30 days by default and not used for training, with stricter options available for enterprise/zero-retention programs (API data usage). Hosting on Azure means data is encrypted but still subject to provider access and lawful process (e.g., FISA/NSLs), per Microsoft’s data handling docs (Azure OpenAI privacy). Mitigations include API/enterprise tiers, regional isolation via Azure OpenAI, or local/on‑prem models for high-sensitivity workflows.
  • Adult vs. minor policy separation: Several comments push for distinct experiences, noting that governing adults by child-focused rules degrades utility. Implementing this implies reliable age verification and audience-aware safety classifiers; a single universal safety model tends to force a “lowest common denominator,” increasing false-positive refusals for adults. Practically, teams would need per-audience policy routing, jurisdiction-aware toggles (e.g., COPPA/KOSA/DSA constraints), and telemetry to track refusal-rate deltas and overblocking across adult-content eval sets.
  • Fictional content carve‑out and safety routing: The quote—“If the user is asking for help writing a fictional story, the model should help”—highlights a policy intent to allow creative writing even for extreme scenarios while blocking real‑world harm facilitation. Technically this requires robust intent detection to distinguish narrative requests from operational guidance, plus red‑team tests for instruction smuggling. Expect updates to safety classifiers and RLHF/RLAIF reward models to reduce over-refusals on benign fiction while keeping leakage (unsafe actionable steps) below thresholds; teams would monitor metrics like successful completion rate on fiction prompts vs. unsafe-content leakage on adversarial tests (usage policies).

AI工具更新:Qwen姿态迁移V2 LoRA与Claude Code'思考模式'UI

  • Pose Transfer V2 Qwen Edit Lora [已修复] (评分: 284, 评论: 44):作者发布了一个改进的基于Qwen的姿态迁移LoRA,不再需要预先进行人偶化处理,并显著减少了非预期的属性迁移。卡通/动漫姿态理解仍然是已知的局限性。输入格式保持不变,但所需的指令现在是:“将左侧图像中的姿态迁移到右侧图像中的人物。”模型可在Civitai上获取,附带一个用于准备输入对的辅助工具和一个Patreon帖子 热门回复展示了成功的复现结果,并询问了训练数据流水线的细节(例如是否使用了ControlNet加标准生成器),表明了对可重现性和数据集构建细节的兴趣。

一位评论者询问Pose Transfer V2 LoRA背后的确切数据集构建过程,特别是是否使用了ControlNet(例如OpenPose)或类似的姿态条件来通过传统SD生成器生成配对训练数据,暗示对姿态关键点/条件图如何在源-目标图像间对齐以进行LoRA训练等细节的兴趣。

  • 存在强烈的可重现性要求:另一位评论者询问为什么没有分享完整的工作流程以及在哪里可以获取(暗示可能存在付费墙),实际上要求获得完整的流水线(例如ComfyUI/A1111图、ControlNet配置、LoRA插入点以及任何预处理/后处理步骤)以端到端复现结果。
  • 操作确认:一位用户报告该LoRA“效果非常好”并分享了示例输出链接,OP的视觉示例如下链接,作为姿态迁移/编辑流水线按预期运行的定性证据。

我真的很喜欢这个创新,太棒了! (评分: 362, 评论: 91):帖子报告了Claude Code中的一个虽小但有用的UX更新:输入触发词“think”、“think hard”、“think harder”或“ultrathink”现在会将这些标记着色以指示哪个思考模式处于活动状态,消除了先前模式之间的模糊性。截图(图片链接)似乎在输入/编辑器中显示了着色的关键词,作为一目了然的状态指示器;没有声称模型/延迟发生变化——纯粹是一个UI功能。 热门评论认为UI美化次于显示资源配额(例如通过条形图显示剩余的Opus或5小时会话限制),而其他人则对彩色文本的价值与更实用的遥测数据相比表示讽刺。

  • 评论中报告的Claude Code“think”层级令牌分配:think = 4,000个令牌用于更深入的单任务推理,megathink = 10,000个令牌,ultrathink最多31,999个令牌用于最困难的问题。这暗示了基于提示词限定符扩展上下文/计算的内部路由,影响延迟和成本。更大的层级可能针对长链推理或多步合成进行了优化,但牺牲了吞吐量。
  • 功能请求集中在显示使用限制上:显示剩余的Opus配额和5小时会话上限,可能以紧凑的彩色条形图而非精确计数形式显示。还有一个偏好是通过斜杠命令(例如 /think、/megathink)进行显式控制,而不是“魔法词”,从而提高可重现性、可调试性,并避免提示词膨胀或意外模式切换。清晰的控制和配额将帮助用户在预算/限制内规划推理深度。

所以我想lmarena上的新模型可能是gemini 3,或者今天会有任何模型发布🤔 (评分: 315, 评论: 52):推测性帖子暗示出现在LMSYS Chatbot Arena(“lmarena”)上的新模型可能是Google即将推出的Gemini 3,暗示即将发布但没有提供基准测试、API细节或实现说明。评论中的上下文指向Logan Kilpatrick在发布前隐晦发布“Gemini”暗示的惯常模式,并提出了一个技术请求,要求为付费用户提供更广泛的Gemini-CLI访问权限,以更好地与替代方案竞争。参见LMSYS Arena:https://chat.lmsys.org/ 和Gemini概述:https://ai.google.dev/gemini-api 。 评论者认为基于先前的暗示模式,发布很可能很快(“明天”),并认为为Pro/Advanced层级启用Gemini-CLI在战略上很重要,以便与OpenAI风格的编码工具(例如Codex/代码补全生态系统)竞争。

  • 工具/访问:一位评论者认为,为了与Codex/cc竞争,Google应该让Pro/Advanced订阅者使用专用的Gemini-CLI,强调强大的命令行工具对于开发人员工作流程(自动化、CI、本地迭代)和广泛采用至关重要。这意味着限制CLI访问会限制实际编码和集成用例,其中终端优先工具是标准。
  • 发布信号和时间:一位自称是Logan Kilpatrick的用户表示他们将在“下一小时内”推出Gemini 3,如果属实,这意味着即将推出新模型/版本以及潜在的API/产品更新。另一位用户指出一个历史信号,即“他通常在发布前说‘Gemini’”,表明与先前模式一致的近期发布节奏;身份/时间的验证仍未确认。

AGIBOT X2 - 轮式/足式机器人现在可以做韦伯斯特翻转 (评分: 266, 评论: 23):一个简短的演示展示了混合轮式/足式双足AGIBOT X2执行韦伯斯特翻转——即跑步单足前空翻——表明高功率密度腿部执行器、精确的全身控制以及用于空中阶段稳定和着陆的鲁棒状态估计。该剪辑(通过X上的XRoboHub)表明在轮-足平台上的快速动态机动,强调了质心动量控制和耐冲击硬件的进展;来源:视频,在Reddit媒体上镜像:v.redd.it/oiu5szhwwjpf1 评论者注意到移动机器人“微小但明显”的每周进展节奏,并认为突出的进步可能来自中国制造商的灵巧手——声称“约80%完成”——因为手仍然是人类最难完善的机械子系统。

  • 一位评论者指出移动机器人每周都有“微小但明显”的改进节奏,将新的韦伯斯特翻转解释为更好的动态控制和硬件的证据。执行韦伯斯特翻转通常需要执行器更高的比功率以及改进的全身规划/平衡以进行起飞、飞行和着陆,表明超越了纯粹脚本化运动的进展。
  • 另一位强调突出的进展是中国制造商的灵巧手,估计它已完成约80%,并指出手是人类最困难的机械部分。这意味着剩余的挑战可能在于顺应性、触觉传感、精确的力控制和实际操作的耐久性。

向nano banana要了个发型。 (评分: 281, 评论: 27):非技术性帖子:一位70岁的OP在要求理发(“两侧剪短,顶部保持原样”)后分享了一张图片,并询问这让他们看起来更年轻还是更老。没有讨论工具、模型或实现细节;对“nano banana”的引用是模糊的(可能是一个随意/内部引用而非技术系统)。 评论是主观的:一位说OP理发后看起来更好;另一位开玩笑说Reddit用户的平均年龄变大了;一位评论者重新发布了一个预览图片链接。

又一个历史事件作为电子游戏的视频给你们 (评分: 449, 评论: 82):创作者发布了“历史事件作为电子游戏”视频系列的另一个条目;观众注意到一个角色陷入雪中的场景——表明地形碰撞器/根运动或导航网格/物理不匹配。评论者建议未来的剧集包括通古斯事件、亨利·摩根的巴拿马远征、图密善宴会、在伊丽莎白时代的英格兰观看莎士比亚的戏剧以及在美国西南部观察原子试验(例如三位一体)。 一位评论者设想了一个AI原生工作流程,可以从提示词“即时制作任何游戏”,暗示了生成式运行时游戏创建;其他人注意到他们一直在关注创作者的进展并享受迭代改进。

  • 一位评论者推测了一个端到端的“文本到游戏”流水线,可以按需实例化甚至自动播放定制游戏;从技术上讲,这需要将可控环境生成(例如DeepMind的Genie,一个将图像转换为可玩2D环境的生成世界模型:https://deepmind.google/discover/blog/genie/)、代码/资产合成插入引擎(例如Roblox Assistant代码生成:https://blog.roblox.com/2023/09/next-generation-creation-on-roblox/ 、Unity Muse:https://unity.com/products/unity-muse)以及代理式游戏测试(例如Voyager for Minecraft:https://voyager.minedojo.org/ 、生成代理:https://arxiv.org/abs/2304.03442)拼接在一起,而关键障碍是交互循环的推理延迟与确定性。

又一个历史事件作为电子游戏的视频给你们 (评分: 448, 评论: 82):OP分享了“历史事件作为电子游戏”系列的另一个部分,链接了一个Reddit托管的视频 v.redd.it/wq6dmoh91jpf1(原帖:https://www.reddit.com/r/aivideo/comments/1nih6mq/another_historical_events_as_video_games_vid_for/),该视频目前在没有Reddit认证的情况下无法访问(HTTP 403)。帖子中没有披露引擎、工具或实现细节;讨论集中在内容想法而非技术执行上。 评论者注意到小的物理/动画怪癖(例如一个角色“陷入雪中”)并推测近端系统可以从提示词按需合成可玩游戏,反映了对实时生成内容流水线的兴趣;整体情绪支持该系列的进展。

  • 一位评论者设想按需AI生成的游戏可以即时玩;从技术上讲,研究暗示了这方面的部分内容,但并非端到端。DeepMind的Genie展示了从原始视频学习的可控环境(精灵尺度,非高保真3D)(https://deepmind.google/discover/blog/genie-generative-interactive-environments/),通过LLM驱动的游戏代理如Voyager (Minecraft)(https://arxiv.org/abs/2305.16291)进行代理式游戏是可行的,而语音驱动的NPC堆栈如NVIDIA ACE也存在(https://www.nvidia.com/en-us/omniverse/ace/)。瓶颈是快速、一致的文本到3D资产/关卡生成(当前流水线通常每个资产需要几分钟以上,而非亚1秒)、将生成的内容集成到确定性物理/AI系统中,以及实时性能预算(约16毫秒帧时间以实现60 FPS)。

这个“更新的AI模型”啊让我看起来像个傻瓜😐 (评分: 4419, 评论: 180):关于ChatGPT新的/更新的“思考”行为(较慢的思维链式推理)让用户困惑的迷因式截图;评论者澄清这是一个与特定付费“思考”模型/模式相关的功能,不能通过提示词指令禁用——用户必须选择非思考模型或在可用时切换设置关闭。OP的编辑指出免费层级可能不暴露此切换,与当前标准模型和可选“思考”模式之间的产品分割一致。 评论开玩笑说模型想要思考,而一个热门回复提供了实用的说明:“你可以关闭思考,但不能通过对其大喊来实现”,强调了UX混淆而非技术错误。

  • 功能级讨论:一位用户声称你可以禁用模型的显式“思考”/审议模式,但后来编辑指出免费层级不暴露此控制。这意味着订阅层级对推理切换的门控,影响用户管理响应冗长度、延迟和中间推理深度的能力。付费上下文可能允许在思考非思考行为之间切换,而免费用户似乎锁定在一种模式。

  • 基准测试要点:评论者引用早期GPT-5“基准测试”表明非思考模式是“一大步倒退”,而思考模式在复杂任务上表现明显更好。分享的实用指导是显式请求更深层次的推理(例如要求它“更努力地思考”)以提高答案质量,以速度和简洁性换取准确性和鲁棒性。这突显了GPT-5各模式之间在快速、简洁输出与较慢、更高准确性的推理和更多审议内容之间的已知权衡。

1. 新模型与工具纷纷登场

  • OpenAI与JetBrains发布全新编程智能体:OpenAI推出了GPT-5-Codex,这是GPT-5针对智能编程优化的版本,将在Codex CLI和IDE扩展中提供,详情可见其关于Codex升级的博客文章。不甘示弱的JetBrains也发布了Rider IDE的Junie,这是他们自己的Codex智能体,定价为300美元。

  • 谷歌Gemma与VaultGemma正式亮相:一个团队为新的Gemma-3-27B模型推出了免费的OpenAI兼容端点,在H100上运行以提供快速完成和流式传输功能。谷歌还发布了VaultGemma,这是他们最新的差分隐私大模型,标志着对隐私保护AI的持续投入,相关信息已在谷歌研究博客和配套的ArXiv论文中公布。
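
顺带给出调用任意 OpenAI 兼容端点的通用写法作为参考(下面的 base_url 与模型名均为占位假设,原文并未给出该端点的具体地址,请以该团队的公告为准):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://example-gemma-endpoint.invalid/v1",  # 占位:替换为实际端点地址
    api_key="EMPTY",                                       # 部分免费端点不校验 key
)

resp = client.chat.completions.create(
    model="gemma-3-27b-it",                                # 假设的模型标识,以端点文档为准
    messages=[{"role": "user", "content": "用一句话介绍 Gemma 3。"}],
)
print(resp.choices[0].message.content)
```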

  • 模型风云:HeyGen更名、Grok弃用、Qwen升温HeyGen收购了Alisa并重新定位为"创意操作系统",推出了视频智能体公开测试版,联合创始人Joshua Xu已宣布此消息。与此同时,xAI弃用了其grok-2模型,转而支持更新的grok-3grok-4,而量化版的Qwen3-Next-80B模型在Hugging Face上获得了MLX支持。

2. 性能与优化争议

  • H100性能谜题困扰工程师:一位工程师报告称,从Nvidia H100 SXM仅获得了760 TFLOPS的性能,远低于宣传的989 TFLOPS,而4090却能轻松达到其宣称的165 TFLOPS。讨论指出,这可能是由于随机数据上的低精度张量核心导致的GPU节流现象,这篇关于奇异矩阵乘法的文章对此进行了详细分析。
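
如果想自己粗测矩阵乘法吞吐,可以参考下面的示意脚本(假设环境为 PyTorch + CUDA;结果会受功耗墙、时钟与数据分布影响,正如上文讨论的随机数据降速现象):

```python
import torch

n, iters = 8192, 50
a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)

for _ in range(10):                     # 预热,触发内核选择与时钟爬升
    torch.matmul(a, b)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()

seconds_per_matmul = start.elapsed_time(end) / 1000 / iters
tflops = 2 * n ** 3 / seconds_per_matmul / 1e12   # 每次 matmul 约 2*n^3 次浮点运算
print(f"~{tflops:.0f} TFLOPS (bf16, {n}x{n} matmul)")
```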

  • Intel放弃IPEX转向直接PyTorch集成Intel将在2.8版本发布后弃用其Intel Extension for PyTorch (IPEX),选择将新功能和优化直接上游集成到PyTorch中。这标志着Intel从使用IPEX作为其CPU和GPU实验平台的战略转变,详情参见官方PyTorch博客

  • 脚手架胜过规模:DSPy挑战Claude Opus:一位工程师展示,在fastWorkflow框架中使用DSPy进行代理和参数提取,在Tau Bench开发集上的表现与Claude Opus 4.1相当。结果图片让他们感叹道:"通过适当的脚手架,你确实可以超越大模型!"
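
下面是一个用 DSPy 搭建“参数提取”脚手架的最小示意(签名、字段与模型名均为示例假设,并非 fastWorkflow 的真实接口):

```python
import dspy

# 任意 DSPy 支持的模型均可;这里的模型名仅为示例
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class ExtractToolCall(dspy.Signature):
    """从用户请求中选择要调用的工具并抽取参数。"""
    user_request: str = dspy.InputField()
    tool_name: str = dspy.OutputField(desc="要调用的工具名")
    arguments: str = dspy.OutputField(desc="JSON 形式的参数")

extractor = dspy.ChainOfThought(ExtractToolCall)
result = extractor(user_request="帮我把下周二的会议改到下午三点")
print(result.tool_name, result.arguments)
```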

3. AI开发与智能体工作流

  • 工程师争论代码生成最佳工具:在Cursor社区中,用户激烈讨论Codex与Claude Code的优劣,大多数人认为Claude Code仍然占据主导地位,因其速度更快,同时抱怨Codex“删除了我一半的代码且无法撤销”。与此同时,在Nous Research AI的Discord中,其他人指出Codex在GitHub Copilot中表现不佳,尽管承认最近有所改进。

  • XML与Discordian主义意外成为提示词搭档Nous Research AI Discord中的开发者正在探索XML用于智能体编程,发现其结构化特性简化了模型的代码生成。在OpenAI的Discord中,一位成员分享了受Discordian主义启发的提示词工程技术,使用从随机变异到引导性不和谐的概念来推动模型探索新路径,详见这份技术文本文件

  • 新型Golang MCP服务器瞄准企业级规模:一位贡献者发布了开源的golang流式HTTP MCP服务器,专为要求苛刻的企业工作流设计。该服务器具备可插拔后端以实现可扩展性OIDC/JWT身份验证以及内置的会话和可恢复性等功能,以简化协议的复杂方面。

4. AI基准测试与评估面临质疑

  • SWEBench被批评为狭隘且过度炒作:社区成员批评SWEBench是一个狭隘的基准测试,专注于琐碎的Django修复而非真实的软件工程挑战。争论的焦点在于高分往往反映的是简单的仓库记忆,而非实际开发工作中所需的复杂诊断和范围界定能力。

  • LMArena推出AI评估产品,但用户遭遇沙箱故障:LMArena宣布推出新的AI评估产品,用于大规模分析人机交互,根据其博客文章中描述的真实世界反馈提供评估。然而,用户同时报告了持续的'创建沙箱失败'错误,引发了对平台稳定性及潜在货币化策略的担忧。

  • Arc Prize结果引发质疑Arc Prize宣布了令人印象深刻的结果,声称在v3上达到近80%的准确率,在v2上达到30%,但其作为真实基准测试的合法性受到质疑。成员指出并非所有人都被允许验证其结果,暗示高分可能是经过精心挑选提交的结果。
