feat: refactor API key configuration and enhance application initialization
- Renamed `check_environment` to `check_api_key_configured` for clarity, simplifying the API key validation logic.
- Removed the blocking behavior of the API key check during application startup, allowing the app to run while providing a prompt for configuration.
- Updated `LocalAgentApp` to accept an `api_configured` parameter, enabling conditional messaging for API key setup.
- Enhanced the `SandboxRunner` to support backup management and improved execution result handling with detailed metrics.
- Integrated data governance strategies into the `HistoryManager`, ensuring compliance and improved data management.
- Added privacy settings and metrics tracking across various components to enhance user experience and application safety.
245
docs/P1-02_重试策略修复说明.md
Normal file
@@ -0,0 +1,245 @@
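# P1-02 Retry Policy Fix Notes

## Problem Description

**Issue title**: Declared retry policy does not match actual behavior
**Issue type**: Technical / stability
**Location**: `llm/client.py:68, 149, 218`

### Core Problem

Network exceptions (`Timeout`, `ConnectionError`) were first wrapped in `LLMClientError`, so the downstream `_should_retry` method could only decide whether to retry by matching on the message string. As a result, most network exceptions were not recognized as retryable, and stability degraded on weak networks.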
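
To make the failure mode concrete, here is a minimal sketch of a message-based retry check. The pre-fix matching rules are not shown in this document, so `WrappedError`, `should_retry_by_message`, and the keyword list are all hypothetical; the point is only that a string match depends on how each message happens to be worded.

```python
class WrappedError(Exception):
    """Stand-in for an error that carries only a message (no error type)."""


def should_retry_by_message(exc: Exception) -> bool:
    # Hypothetical keyword list; the real pre-fix matching rules are not shown here.
    keywords = ("timeout", "connection error")
    text = str(exc).lower()
    return any(keyword in text for keyword in keywords)


# All three describe network failures, but only one wording happens to match.
samples = [
    WrappedError("Request timed out (30 s)"),
    WrappedError("Network connection failed"),
    WrappedError("Read timeout"),
]
decisions = [should_retry_by_message(exc) for exc in samples]
print(decisions)  # [False, False, True]
```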

### Scope of Impact

- Intent recognition module
- Plan generation module
- Code generation module
- All LLM call sites

Under network jitter, the failure rate of these modules rises significantly.

---

## Fix

### 1. Exception Classification

Added an error-type classification to `LLMClientError`:

```python
from typing import Optional


class LLMClientError(Exception):
    # Error type categories
    TYPE_NETWORK = "network"  # Network errors (timeout, connection failure, etc.)
    TYPE_SERVER = "server"    # Server errors (5xx)
    TYPE_CLIENT = "client"    # Client errors (4xx)
    TYPE_PARSE = "parse"      # Parsing errors
    TYPE_CONFIG = "config"    # Configuration errors

    def __init__(self, message: str, error_type: str = TYPE_CLIENT,
                 original_exception: Optional[Exception] = None):
        super().__init__(message)
        self.error_type = error_type
        self.original_exception = original_exception
```

### 2. Unified Retry Decision Logic

Refactored `_should_retry` to decide based on the error type rather than string matching:

```python
def _should_retry(self, exception: Exception) -> bool:
    """
    Decide whether the request should be retried.

    Retryable errors:
    - Network errors (timeout, connection failure)
    - Server errors (5xx)
    - Rate limiting (429, classified as TYPE_SERVER)
    """
    # For LLMClientError, decide based on the error type
    if isinstance(exception, LLMClientError):
        # Network and server errors are retryable
        if exception.error_type in (LLMClientError.TYPE_NETWORK,
                                    LLMClientError.TYPE_SERVER):
            return True

        # Otherwise fall back to inspecting the original exception
        if exception.original_exception:
            if isinstance(exception.original_exception,
                          (requests.exceptions.ConnectionError,
                           requests.exceptions.Timeout,
                           requests.exceptions.ChunkedEncodingError)):
                return True

    return False
```

### 3. Preserve the Original Exception

The original exception is preserved at every point where an exception is wrapped:

**Non-streaming requests (chat)**:
```python
except requests.exceptions.Timeout as e:
    raise LLMClientError(
        f"Request timed out ({timeout} s)",
        error_type=LLMClientError.TYPE_NETWORK,
        original_exception=e
    )
```

**Streaming requests (chat_stream)**:
```python
except requests.exceptions.ConnectionError as e:
    raise LLMClientError(
        "Network connection failed",
        error_type=LLMClientError.TYPE_NETWORK,
        original_exception=e
    )
```

### 4. Status Code Classification

The error type is assigned automatically from the HTTP status code:

```python
if response.status_code >= 500:
    error_type = LLMClientError.TYPE_SERVER  # Retryable
elif response.status_code == 429:
    error_type = LLMClientError.TYPE_SERVER  # Rate limited, retryable
else:
    error_type = LLMClientError.TYPE_CLIENT  # Not retried
```
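
The same branching can be pulled into a small pure function so it is easy to check against a list of status codes. This is only a sketch: `classify_status` is a hypothetical helper name, and the two constants mirror the string values of `LLMClientError.TYPE_SERVER` and `LLMClientError.TYPE_CLIENT` so the snippet runs on its own.

```python
TYPE_SERVER = "server"  # mirrors LLMClientError.TYPE_SERVER
TYPE_CLIENT = "client"  # mirrors LLMClientError.TYPE_CLIENT


def classify_status(status_code: int) -> str:
    """Map an HTTP error status code to an error type (hypothetical helper)."""
    # 5xx and 429 (rate limiting) are both treated as retryable server-side errors.
    if status_code >= 500 or status_code == 429:
        return TYPE_SERVER
    # Every other error status is treated as a non-retryable client error.
    return TYPE_CLIENT


for code in (400, 404, 429, 500, 503):
    print(code, classify_status(code))
```

Note that 429 deliberately shares `TYPE_SERVER` with 5xx, which is why the retry check needs no separate rate-limit branch.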

### 5. Enhanced Retry Metrics

Metric recording in `_do_request_with_retry` now covers:

- Number of retries
- Error type
- Whether the request succeeded or failed after retrying
- More detailed retry logs
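
The body of `_do_request_with_retry` is not reproduced in this document, so the following is only a rough sketch of how a retry loop could record the items listed above. `RetryMetrics`, `do_request_with_retry`, the `max_retries` and `base_delay` parameters, and the exponential backoff are all assumptions made for illustration; `LLMClientError` is a trimmed copy of the class from section 1 so that the snippet runs standalone.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class LLMClientError(Exception):
    """Trimmed copy of the class from section 1."""
    TYPE_NETWORK = "network"
    TYPE_SERVER = "server"
    TYPE_CLIENT = "client"

    def __init__(self, message: str, error_type: str = TYPE_CLIENT,
                 original_exception: Optional[Exception] = None):
        super().__init__(message)
        self.error_type = error_type
        self.original_exception = original_exception


@dataclass
class RetryMetrics:  # hypothetical container, not part of llm/client.py
    retries: int = 0
    error_types: List[str] = field(default_factory=list)
    succeeded: bool = False


def do_request_with_retry(request: Callable[[], T], metrics: RetryMetrics,
                          max_retries: int = 3, base_delay: float = 0.0) -> T:
    """Call `request`, retrying retryable errors and recording what happened."""
    attempt = 0
    while True:
        try:
            result = request()
            metrics.succeeded = True
            return result
        except LLMClientError as exc:
            metrics.error_types.append(exc.error_type)
            retryable = exc.error_type in (LLMClientError.TYPE_NETWORK,
                                           LLMClientError.TYPE_SERVER)
            if not retryable or attempt >= max_retries:
                raise  # metrics.succeeded stays False for the caller to inspect
            attempt += 1
            metrics.retries = attempt
            print(f"retry {attempt}/{max_retries} after {exc.error_type} error: {exc}")
            time.sleep(base_delay * 2 ** (attempt - 1))  # assumed backoff


calls = {"n": 0}


def flaky() -> str:
    # Fails twice with a network error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise LLMClientError("Network connection failed", LLMClientError.TYPE_NETWORK)
    return "ok"


metrics = RetryMetrics()
result = do_request_with_retry(flaky, metrics)
print(result, metrics.retries, metrics.error_types, metrics.succeeded)
```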

---

## Test Verification

### Test Results

✅ **All tests passed**

```
Test 1: Exception classification
✓ Network error type: network
✓ Server error type: server
✓ Client error type: client

Test 2: Retry decision logic
✓ Network error should be retried: True
✓ Timeout error should be retried: True
✓ Server error should be retried: True
✓ Client error should not be retried: False
✓ Parsing error should not be retried: False
✓ Configuration error should not be retried: False
✓ Network error with original exception should be retried: True

Test 3: Error type preservation
✓ Status codes 500-504 (server error): server
✓ Status code 429 (rate limiting): server
✓ Status codes 400-404 (client error): client
```
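
---

## Effect of the Fix

### Retryable Exception Types

| Exception type | Before fix | After fix |
|---------|--------|--------|
| Network timeout (Timeout) | ❌ Not retried | ✅ Retried |
| Connection failure (ConnectionError) | ❌ Not retried | ✅ Retried |
| Server error (5xx) | ⚠️ Partially retried | ✅ Retried |
| Rate limiting (429) | ❌ Not retried | ✅ Retried |
| Client error (4xx other than 429) | ❌ Not retried | ❌ Not retried |
| Parsing error | ❌ Not retried | ❌ Not retried |
| Configuration error | ❌ Not retried | ❌ Not retried |

### Expected Improvements

1. **Stability**: The request success rate on weak networks is expected to rise significantly
2. **User experience**: Requests recover automatically from network jitter, with no manual retry needed
3. **Observability**: More detailed retry logs and metrics
4. **Accuracy**: Only genuinely recoverable errors are retried, avoiding pointless retries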
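
---

## Metrics

### Recommended Metrics to Monitor

1. **LLM request success rate**: successful requests / total requests
2. **Average retries per request**: total retries / total requests
3. **Recovery rate after timeout**: requests that succeeded on retry / timeouts
4. **Network error distribution**: share of each kind of network error
5. **Retry latency**: extra latency introduced by retries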
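
The three ratio metrics above can be derived from a handful of raw counters. The sketch below assumes such counters are available; the `RetryCounters` field names are hypothetical and are not taken from `llm/config_metrics.py`, and the sample numbers are made up purely to exercise the arithmetic.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RetryCounters:  # hypothetical counter names
    total_requests: int
    successful_requests: int
    total_retries: int
    timeouts: int
    recovered_after_timeout: int


def ratio(numerator: int, denominator: int) -> float:
    # Guard against division by zero when nothing has been recorded yet.
    return numerator / denominator if denominator else 0.0


def summarize(c: RetryCounters) -> Dict[str, float]:
    return {
        "success_rate": ratio(c.successful_requests, c.total_requests),
        "avg_retries": ratio(c.total_retries, c.total_requests),
        "timeout_recovery_rate": ratio(c.recovered_after_timeout, c.timeouts),
    }


# Made-up sample values, for illustration only.
summary = summarize(RetryCounters(
    total_requests=200,
    successful_requests=190,
    total_retries=30,
    timeouts=20,
    recovered_after_timeout=15,
))
print(summary)
```

With these sample counters the success rate is 190/200, the average retry count is 30/200, and the timeout recovery rate is 15/20.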

### Where Metric Data Lives

- Configuration metrics: `workspace/.metrics/config_metrics.json`
- Retry logs: console output
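
---

## Backward Compatibility

✅ **Fully backward compatible**

- `LLMClientError` is still a standard exception and can be caught as before
- The new `error_type` and `original_exception` constructor arguments are optional (they default to `TYPE_CLIENT` and `None`)
- Existing code benefits from the fix without modification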
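
As a quick check of these compatibility claims, the sketch below raises the error the old way (message only) and catches it with a broad handler. `legacy_call` is a hypothetical function name, and `LLMClientError` is again a trimmed copy of the class from section 1 so that the snippet is self-contained.

```python
from typing import Optional


class LLMClientError(Exception):
    """Trimmed copy of the class from section 1."""
    TYPE_CLIENT = "client"

    def __init__(self, message: str, error_type: str = TYPE_CLIENT,
                 original_exception: Optional[Exception] = None):
        super().__init__(message)
        self.error_type = error_type
        self.original_exception = original_exception


def legacy_call() -> None:
    # Old-style raise: message only, no error type or original exception.
    raise LLMClientError("invalid request")


try:
    legacy_call()
except Exception as exc:  # handlers that catch broadly keep working
    caught = exc

# The new attributes exist and fall back to their defaults.
print(type(caught).__name__, str(caught), caught.error_type, caught.original_exception)
```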

---

## Usage Examples

### Catching Specific Error Types

```python
from llm.client import get_client, LLMClientError

try:
    client = get_client()
    response = client.chat(messages=[...], model="...")
except LLMClientError as e:
    if e.error_type == LLMClientError.TYPE_NETWORK:
        print("Network error; automatic retries were already attempted")
    elif e.error_type == LLMClientError.TYPE_CONFIG:
        print("Configuration error; please check the .env file")
    else:
        print(f"Other error: {e}")
```

### Inspecting the Original Exception

```python
try:
    response = client.chat(...)
except LLMClientError as e:
    if e.original_exception:
        print(f"Original exception: {type(e.original_exception).__name__}")
```

---

## Related Files

- `llm/client.py`: main file changed by the fix
- `llm/config_metrics.py`: metrics enhancements
- `test_retry_fix.py`: verification test script
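
---

## Summary

This fix resolves the core mismatch between the declared retry policy and its actual behavior. By introducing exception classification and preserving the original exception, it ensures that network exceptions are correctly recognized and retried. System stability on weak networks is expected to improve significantly.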