With Nothing but Prompts, Gemini Takes Gold at IMO 2025 | Prompts Included

AI中国

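Just yesterday, two UCLA researchers, Yichen Huang (黄溢辰) and Lin Yang (杨林), did something that stunned the AI community. Using Google's Gemini 2.5 Pro model, they reached a gold-medal-level result on the 2025 International Mathematical Olympiad (IMO), solving 5 of the 6 problems correctly. This is no gimmick: the IMO is widely regarded as the ultimate touchstone for AI reasoning, because it demands not just calculation but creative thinking and rigorous logic. More striking still, the entire process is driven by prompts alone, including the most important step, verification: a solution is accepted only after it passes verification five times in a row, and it is rejected if problems remain after 10 rounds of iteration. There is a lot here for anyone building agents.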




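From 32768 to 65536


You may not know that Gemini 2.5 Pro has a limit: it can spend at most 32768 "thinking tokens" reasoning about a problem. It is like giving a mathematician a fixed amount of thinking time, after which the paper must be handed in. The UCLA team found a clever workaround: split the solving process into several steps, each with its own independent thinking budget. As a result, a model that could originally spend only 32768 thinking tokens on a problem can now spend 65536 or more, doubling its depth of reasoning.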
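
The budget arithmetic can be sketched in a few lines. This is a minimal illustration that assumes, as described above, that every step receives a fresh 32768-token budget; the constant and function names are hypothetical and are not part of any Gemini API.

```python
# Per-step cap on thinking tokens, as quoted in the article.
THINKING_BUDGET_PER_STEP = 32768


def total_thinking_budget(num_steps: int) -> int:
    """Upper bound on thinking tokens when every step gets a fresh budget."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return num_steps * THINKING_BUDGET_PER_STEP


print(total_thinking_budget(1))  # 32768: a single call
print(total_thinking_budget(2))  # 65536: generation plus one self-improvement step
```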


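A Six-Step Pipeline: The AI Mathematician's Workflow


The overall system is quite elegantly designed. It is not a simple linear flow but a decision process with branches and loops. Verification in Step 3 is the key fork: if the verifier finds problems, the system enters the correction loop (Steps 4 and 5) and then verifies again; if the solution passes verification five times in a row, it is accepted; and if significant problems still remain after 10 rounds of iteration, it is rejected.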
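
The control flow just described might be sketched as follows. This is not the authors' code: the five callables are hypothetical stand-ins for prompted model calls, and counting failed rounds cumulatively (rather than consecutively) is an assumption of this sketch. Only the thresholds 5 and 10 come from the article.

```python
from typing import Callable, Optional

PASSES_TO_ACCEPT = 5   # consecutive clean verifications needed (from the article)
MAX_ITERATIONS = 10    # failed rounds after which the solution is rejected (from the article)


def run_pipeline(
    generate: Callable[[], str],
    improve: Callable[[str], str],
    verify: Callable[[str], Optional[str]],
    review: Callable[[str], str],
    correct: Callable[[str, str], str],
) -> Optional[str]:
    """Return an accepted solution, or None if it is rejected.

    `verify` returns None when it finds no issue, otherwise a bug report.
    All five callables are hypothetical placeholders for model calls.
    """
    solution = improve(generate())              # Steps 1 and 2
    consecutive_passes = 0
    failed_rounds = 0
    while True:
        report = verify(solution)               # Step 3
        if report is None:
            consecutive_passes += 1
            if consecutive_passes >= PASSES_TO_ACCEPT:
                return solution                 # Step 6: accept
            continue
        consecutive_passes = 0                  # any failure resets the streak
        failed_rounds += 1
        if failed_rounds >= MAX_ITERATIONS:
            return None                         # Step 6: reject
        reviewed = review(report)               # Step 4
        solution = correct(solution, reviewed)  # Step 5


# Toy run: the verifier flags the draft twice, then passes it five times.
reports = iter(["gap in Lemma 1", "sign error", None, None, None, None, None])
result = run_pipeline(
    generate=lambda: "draft",
    improve=lambda s: s + "+improved",
    verify=lambda s: next(reports),
    review=lambda r: r,
    correct=lambda s, r: s + "+fixed",
)
print(result)  # draft+improved+fixed+fixed
```

Resetting the streak on any failure is what makes the acceptance rule "five times in a row" rather than "five times in total".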



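Step 1: Initial Solution Generation


The point of this step is not to get a perfect answer but to produce a promising starting point. The researchers designed a dedicated prompt whose core principle is "Rigor is Paramount": better to admit that a solution is incomplete than to make something up. It is a smart design because it goes straight at the model's biggest weakness, hallucination, by explicitly requiring the model to report a partial result rather than fabricate a solution that only looks correct.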


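Step 2: Self-Improvement


This step is the key innovation of the whole system. Because the first step uses up nearly the entire thinking budget, the second step in effect gives the model a fresh 32768 thinking tokens. It is like handing a tired brain a cup of coffee so that it can think clearly again. The researchers observed that solution quality improved noticeably after this step, with much of the previously vague reasoning becoming clear.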


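Step 3: Solution Verification


Here another carefully designed prompt has the model play the role of an IMO grader. The verifier sorts issues into two classes: fatal logical errors (Critical Errors) and arguments that are tolerable but insufficiently rigorous (Justification Gaps). The classification is interesting; it resembles a medical diagnosis, where some conditions need immediate surgery and others can simply be monitored. For a Critical Error, the verifier states it bluntly and stops checking the steps that depend on it; for a Justification Gap, it assumes the step's conclusion is true and goes on to check the reasoning that follows.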


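Step 4: Bug Report Review


Even the verifier itself makes mistakes. So the researchers added a review stage dedicated to checking whether the verifier's judgments are reasonable. It is like finding a reviewer for the reviewer, so that quality control does not slip.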


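Step 5: Solution Correction


Based on the reviewed bug report, the model revises the solution in a targeted way. The process is much like revising an academic paper, where the author responds to and addresses the reviewers' comments point by point. There is a neat touch here: even when the reviewer (the verifier) is wrong, the author (the solving model) still tries to clarify its wording to reduce misunderstanding.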


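Step 6: Accept or Reject


The bar at this final step is strict: a solution must pass verification five times in a row to be accepted. The design borrows the idea of stability checks from statistics, making sure the result is not a fluke. If critical errors or major gaps persist, the solution is rejected.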
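
A rough way to see why repeated passes matter, under an assumption the article does not state: suppose each verification run independently lets a flawed solution through with probability $p$ (a hypothetical parameter). Then

$$
P(\text{five consecutive passes of a flawed solution}) = p^{5}, \qquad p = 0.5 \;\Rightarrow\; p^{5} = 0.03125 \approx 3\%.
$$

Real verifier runs on the same text are unlikely to be fully independent, so this is only an intuition for the design, not a guarantee.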


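The Prompts and the Six Problems


Below are the prompts and the six problems published in the paper. The authors give the full text only for Step 1 (initial solution generation) and Step 3 (verification); feel free to study them or try them out.


The other steps mentioned in the paper, such as self-improvement and correction, also require instructions (prompts) to be sent to the model, but the authors do not provide the verbatim text of those prompts; they only describe each step's function and goal. So if you run a test, what you are testing is the single-shot performance of a strong model given a professional, rigorous set of instructions, whereas the paper demonstrates what a strong model can reach inside a carefully built, iterative, self-correcting system. Please keep your expectations for the test results reasonable.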


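Prompt 1 (solution generation):


This prompt is for solving a problem from scratch. If you want to test a model's problem-solving ability, use this one: send the prompt text to the model together with a specific math problem.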


# Step 1 Prompt

### Core Instructions ###

**Rigor is Paramount:** Your primary goal is to produce a complete and rigorously justified solution. Every step in your solution must be logically sound and clearly explained. A correct final answer derived from flawed or incomplete reasoning is considered a failure.

**Honesty About Completeness:** If you cannot find a complete solution, you must **not** guess or create a solution that appears correct but contains hidden flaws or justification gaps. Instead, you should present only significant partial results that you can rigorously prove. A partial result is considered significant if it represents a substantial advancement toward a full solution. Examples include:

* Proving a key lemma.
* Fully resolving one or more cases within a logically sound case-based proof.
* Establishing a critical property of the mathematical objects in the problem.
* For an optimization problem, proving an upper or lower bound without proving that this bound is achievable.

**Use TeX for All Mathematics:** All mathematical variables, expressions, and relations must be enclosed in TeX delimiters (e.g., `Let $n$ be an integer.`).

### Output Format ###

Your response MUST be structured into the following sections, in this exact order.

**1. Summary**

Provide a concise overview of your findings. This section must contain two parts:

* **a. Verdict:** State clearly whether you have found a complete solution or a partial solution.

  * **For a complete solution:** State the final answer, e.g., "I have successfully solved the problem. The final answer is..."
  * **For a partial solution:** State the main rigorous conclusion(s) you were able to prove, e.g., "I have not found a complete solution, but I have rigorously proven that ..."

* **b. Method Sketch:** Present a high-level, conceptual outline of your solution. This sketch should allow an expert to understand the logical flow of your argument without reading the full detail. It should include:

  * A narrative of your overall strategy.
  * The full and precise mathematical statements of any key lemmas or major intermediate results.
  * If applicable, describe any key constructions or case splits that form the backbone of your argument.

**2. Detailed Solution**

Present the full, step-by-step mathematical proof. Each step must be logically justified and clearly explained. The level of detail should be sufficient for an expert to verify the correctness of your reasoning without needing to fill in any gaps. This section must contain ONLY the complete, rigorous proof, free of any internal commentary, alternative approaches, or failed attempts.

### Self-Correction Instruction ###

Before finalizing your output, carefully review your "Method Sketch" and "Detailed Solution" to ensure they are clean, rigorous, and strictly adhere to all instructions provided above. Verify that every statement contributes directly to the final, coherent mathematical argument.

### Problem ###

[Paste the TeX for the problem statement here]
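
Filling in the template can be as simple as a string substitution. The sketch below is only an illustration: `fill_prompt` is a hypothetical helper, the template is abbreviated, and the sample problem is a made-up toy statement rather than an IMO problem. How the resulting string is sent to a model depends on the API you use and is left out.

```python
PLACEHOLDER = "[Paste the TeX for the problem statement here]"


def fill_prompt(template: str, problem_tex: str) -> str:
    """Substitute the problem statement into the prompt template."""
    if PLACEHOLDER not in template:
        raise ValueError("template does not contain the problem placeholder")
    return template.replace(PLACEHOLDER, problem_tex.strip())


# Abbreviated template: in practice, paste the full Step 1 prompt text here.
template = (
    "### Core Instructions ###\n"
    "(full instructions go here)\n\n"
    "### Problem ###\n\n" + PLACEHOLDER
)

prompt = fill_prompt(template, "Prove that $n^2 + n$ is even for every integer $n$.")
print(prompt)
```

The same helper works for the verification prompt if `PLACEHOLDER` is swapped for that prompt's own placeholder line.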

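Prompt 2 (verification)


This prompt reviews an answer that has already been written. Its job is to act as a strict grader and find logical fallacies or gaps in the argument. Unless you already have a solution that you want the model to judge, this prompt is not suitable for solving problems directly.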


# Step 3 Prompt

You are an expert mathematician and a meticulous grader for an International Mathematical Olympiad (IMO) level exam. Your primary task is to rigorously verify the provided mathematical solution. A solution is to be judged correct **only if every step is rigorously justified.** A solution that arrives at a correct final answer through flawed reasoning, educated guesses, or with gaps in its arguments must be flagged as incorrect or incomplete.

---

### Instructions ###

**1. Core Instructions**

* Your sole task is to find and report all issues in the provided solution. You must act as a **verifier**, NOT a solver. **Do NOT attempt to correct the errors or fill the gaps you find.**
* You must perform a **step-by-step** check of the entire solution. This analysis will be presented in a **Detailed Verification Log**, where you justify your assessment of each step: for correct steps, a brief justification suffices; for steps with errors or gaps, you must provide a detailed explanation.

**2. How to Handle Issues in the Solution**
When you identify an issue in a step, you MUST first classify it into one of the following two categories and then follow the specified procedure.

**a. Critical Error:**
This is any error that breaks the logical chain of the proof. This includes both **logical fallacies** (e.g., claiming that `A > B, C > D` implies `A - C > B - D`) and **factual errors** (e.g., a calculation error like `2 + 3 = 6`).

**Procedure:**

* Explain the specific error and state that it **invalidates the current line of reasoning**.
* Do NOT check any further steps that rely on this error.
* You MUST, however, scan the rest of the solution to identify and verify any fully independent parts. For example, if a proof is split into multiple cases, an error in one case does not prevent you from checking the other cases.

**b. Justification Gap:**
This is for steps where the conclusion may be correct, but the provided argument is incomplete, hand-wavy, or lacks sufficient rigor.

**Procedure:**

* Explain the gap in the justification.
* State that you will **assume the step’s conclusion is true** for the sake of argument.
* Then, proceed to verify all subsequent steps to check if the remainder of the argument is sound.

---

### Output Format ###

Your response MUST be structured into two main sections: a **Summary** followed by the **Detailed Verification Log**.

**a. Summary**
This section MUST be at the very beginning of your response. It must contain two components:

* **Final Verdict**: A single, clear sentence declaring the overall validity of the solution. For example: "The solution is correct," "The solution contains a Critical Error and is therefore invalid," or "The solution’s approach is viable but contains several Justification Gaps."
* **List of Findings**: A bulleted list that summarizes **every** issue you discovered. For each finding, you must provide:

* **Location:** A direct quote of the key phrase or equation where the issue occurs.
* **Issue:** A brief description of the problem and its classification (**Critical Error** or **Justification Gap**).

**b. Detailed Verification Log**
Following the summary, provide the full, step-by-step verification log as defined in the Core Instructions. When you refer to a specific part of the solution, **quote the relevant text** to make your reference clear before providing your detailed analysis of that part.

---

**Example of the Required Summary Format**
*This is a generic example to illustrate the required format. Your findings must be based on the actual solution provided below.*

**Final Verdict:** The solution is **invalid** because it contains a Critical Error.

**List of Findings:**

* **Location:** "By interchanging the limit and the integral, we get ..."
  * **Issue:** Justification Gap - The solution interchanges a limit and an integral without providing justification, such as proving uniform convergence.
* **Location:** "From $A > B$ and $C > D$, it follows that $A - C > B - D$"
  * **Issue:** Critical Error - This step is a logical fallacy. Subtracting inequalities in this manner is not a valid mathematical operation.

---

### Solution ###

[Paste the TeX for the solution to be verified here]

---

### Verification Task Reminder ###

Your task is to act as an IMO grader. Now, generate the **summary** and the **step-by-step verification log** for the solution above. In your log, justify each correct step and explain in detail any errors or justification gaps you find, as specified in the instructions above.
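
If you want to use the verifier's output programmatically, the Location/Issue pairs in its Summary can be pulled out with a regular expression. This is a sketch under assumptions: the model actually follows the summary format requested above, and `Finding` and `parse_findings` are hypothetical names introduced here, not something from the paper.

```python
import re
from typing import List, NamedTuple


class Finding(NamedTuple):
    location: str
    category: str     # "Critical Error" or "Justification Gap"
    description: str


# A "**Location:**" line followed by an "**Issue:**" line, with or without bullets.
FINDING_RE = re.compile(
    r"\*\*Location:\*\*\s*(?P<location>.+?)\s*\n"
    r"\s*\*?\s*\*\*Issue:\*\*\s*(?P<category>Critical Error|Justification Gap)"
    r"\s*-\s*(?P<description>.+)"
)


def parse_findings(summary: str) -> List[Finding]:
    """Extract every finding listed in a verifier summary."""
    return [
        Finding(m["location"], m["category"], m["description"].strip())
        for m in FINDING_RE.finditer(summary)
    ]


sample = '''**Final Verdict:** The solution is **invalid** because it contains a Critical Error.

**List of Findings:**

* **Location:** "By interchanging the limit and the integral, we get ..."
  * **Issue:** Justification Gap - The limit and the integral are interchanged without justification.
* **Location:** "From $A > B$ and $C > D$, it follows that $A - C > B - D$"
  * **Issue:** Critical Error - This step is a logical fallacy.
'''

findings = parse_findings(sample)
print([f.category for f in findings])  # ['Justification Gap', 'Critical Error']
```

A caller could then, for example, treat any `Critical Error` as a failed verification; how the authors turn the verifier's report into a pass/fail decision is not spelled out in the article.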




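A Walkthrough of the Six Problems


Problem 1: An Induction Breakthrough in Combinatorial Geometry



For this combinatorics problem about "sunny lines", the authors mention in the paper that they gave a hint: "Let's try mathematical induction." You might think this counts as cheating, but induction is a standard tool for this kind of problem, as natural as reaching for a screwdriver when you see a screw. The point is that the hint saves the model exploration time and puts it directly on the right path. In the end the model proved that the only possible values are k ∈ {0, 1, 3}.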


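Problem 2: The Computational Advantage of Analytic Geometry



Geometry problems usually admit two approaches: a purely synthetic proof or a coordinate computation. The authors chose the latter, giving the hint "Let's try analytic geometry." It was a smart choice, because models have a natural advantage in direct computation, and complicated algebra comes easily to them. The outcome bears out what the authors call "the easiest IMO problem for AI."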


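Problem 3: A Sampling Strategy for Number Theory



This number theory problem about "bonza" functions shows the power of the whole system. The authors mention in the paper that they needed only 20 samples to find a workable solution, whereas other systems needed 32 samples to reach a 50% success rate. The key difference lies in the subsequent iterative refinement: as long as the initial attempt substantially overlaps with a correct solution, the system can refine it step by step into a complete one. The smallest constant was determined to be c = 4.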


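Problem 4: Number-Theoretic Techniques for Sequence Analysis



This problem about a sequence defined through proper divisors tests the model's deep understanding of number-theoretic properties. The system successfully determined all possible initial values a₁, showing its ability to analyze complex mathematical structures. The paper analyzes the sequence's convergence behavior and number-theoretic constraints in detail.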


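Problem 5: Strategy Analysis in Game Theory



The two-player "inekoalaty game" is an advanced test of logical reasoning. The model has to analyze the optimal strategies of Alice and Bazza and determine the critical value of the parameter λ. The final result: Alice wins when λ > √2/2, Bazza wins when λ < √2/2, and the game is a draw when λ = √2/2. This kind of precise mathematical analysis shows the depth of the model's reasoning.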


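Problem 6: The Challenge of Combinatorial Optimization



The only problem not fully solved is this tiling problem on a 2025×2025 grid, where the model obtained only the trivial upper bound of 4048. The result is instructive in its own right: it exposes the limits of the current approach on combinatorial optimization problems and points to where future improvement is needed. The authors candidly acknowledge in the paper that this kind of problem remains challenging for existing systems.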
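
The article does not restate the problem or say which construction yields 4048. Assuming the usual statement of IMO 2025 Problem 6 (non-overlapping rectangular tiles placed so that every row and every column has exactly one uncovered cell, minimizing the number of tiles), one simple construction that matches the number is to leave the main diagonal uncovered and cover each row with at most two horizontal strips, which uses 2n - 2 tiles. The sketch below builds that tiling and checks it; the names are illustrative.

```python
from typing import List, Set, Tuple

Tile = Tuple[int, int, int]  # (row, first_col, last_col): a 1-by-k horizontal strip


def diagonal_strip_tiling(n: int) -> List[Tile]:
    """Leave the main diagonal uncovered and cover the rest with row strips."""
    tiles: List[Tile] = []
    for row in range(n):
        if row > 0:
            tiles.append((row, 0, row - 1))      # cells left of the diagonal
        if row < n - 1:
            tiles.append((row, row + 1, n - 1))  # cells right of the diagonal
    return tiles


def uncovered_cells(n: int, tiles: List[Tile]) -> Set[Tuple[int, int]]:
    """Cells of the n-by-n grid that no tile covers."""
    covered = {(r, c) for r, c0, c1 in tiles for c in range(c0, c1 + 1)}
    return {(r, c) for r in range(n) for c in range(n)} - covered


n = 6
tiles = diagonal_strip_tiling(n)
# Exactly the diagonal is uncovered: one cell per row and per column.
assert uncovered_cells(n, tiles) == {(i, i) for i in range(n)}
print(len(tiles))                        # 10, i.e. 2 * 6 - 2
print(len(diagonal_strip_tiling(2025)))  # 4048, i.e. 2 * 2025 - 2
```

This only shows that 4048 tiles suffice; it says nothing about the true minimum, which is the part the system did not resolve.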


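Closing Thoughts


There is one wrinkle around this "gold-medal result" for now. Someone on X asked paper author Lin Yang about grading: "How do you score problems on the 7-point scale? Is there a standard rubric?" The author replied: "We checked all the details of the proofs. Indeed, we currently lack official grading; perhaps the IMO committee can help us." This does not diminish the value of the research, and the authors also mention that they will release an automated agent soon.


This article comes from the WeChat official account "AI修猫Prompt".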

