Unify thinking budget-to-effort conversion in a shared helper, handle disabled/default thinking cases in translators, adjust zero-budget mapping, and drop the old OpenAI-specific helper with updated tests.
Expose thinking/effort normalization helpers from the executor package
so conversion tests use production code and stay aligned with runtime
validation behavior.
When budget 0 maps to "none" for models that use thinking levels
but don't support that effort level, strip thinking fields instead
of setting an invalid reasoning_effort value.
Tests now expect removal for this edge case.
- **Thinking Support**:
- Enabled thinking support for all Kiro Claude models, including Haiku 4.5 and agentic variants.
- Updated `model_definitions.go` with thinking configuration (Min: 1024, Max: 32000, ZeroAllowed: true).
- Fixed `extended_thinking` field names in `model_registry.go` (from `min_budget`/`max_budget` to `min`/`max`) to comply with Claude API specs, enabling thinking control in clients like Claude Code.
- **Kiro Executor Fixes**:
- Fixed `budget_tokens` handling: explicitly disable thinking when budget is 0 or negative.
- Removed aggressive duplicate content filtering logic that caused truncation/data loss.
- Enhanced thinking tag parsing with `extractThinkingFromContent` to correctly handle interleaved thinking/text blocks.
- Added EOF handling to flush pending thinking tag characters, preventing data loss at stream end.
- **Performance**:
- Optimized Claude stream handler (v6.2) with reduced buffer size (4KB) and faster flush interval (50ms) to minimize latency and prevent timeouts.
Ensure thinking settings translate correctly across providers:
- Only apply reasoning_effort to level-based models and derive it from numeric
budget suffixes when present
- Strip effort string fields for budget-based models and skip Claude/Gemini
budget resolution for level-based or unsupported models
- Default Gemini include_thoughts when a nonzero budget override is set
- Add cross-protocol conversion and budget range tests
When using OpenAI-compatible providers with model aliases (e.g., glm-4.6-zai -> glm-4.6),
the alias resolution was correctly applied but then immediately overwritten by
ResolveOriginalModel, causing 'Unknown Model' errors from upstream APIs.
This fix skips the ResolveOriginalModel override when a model alias has already
been resolved, ensuring the correct model name is sent to the upstream provider.
Co-authored-by: Amp <amp@ampcode.com>
Add package and constructor documentation for AI Studio, Antigravity,
Gemini CLI, Gemini API, and Vertex executors to describe their roles and
inputs.
Introduce a shared stream scanner buffer constant in the Gemini API
executor and reuse it in Gemini CLI and Vertex streaming code so stream
handling uses a consistent configuration.
Update Refresh implementations for AI Studio, Gemini CLI, Gemini API
(API key), and Vertex executors to short‑circuit and simply return the
incoming auth object, while keeping Antigravity token renewal as the
only executor that performs OAuth refresh.
Remove OAuth2-based token refresh logic and related dependencies from
the Gemini API executor, since it now operates strictly with API key
credentials.
Align thinking suffix handling on a single bracket-style marker.
NormalizeThinkingModel strips a terminal `[value]` segment from
model identifiers and turns it into either a thinking budget (for
numeric values) or a reasoning effort hint (for strings). Emission
of `ThinkingIncludeThoughtsMetadataKey` is removed.
Executor helpers and the example config are updated so their
comments reference the new `[value]` suffix format instead of the
legacy dash variants.
BREAKING CHANGE: dash-based thinking suffixes (`-thinking`,
`-thinking-N`, `-reasoning`, `-nothinking`) are no longer parsed
for thinking metadata; only `[value]` annotations are recognized.
Add fallback parsing for quota reset delay when RetryInfo is not present:
- Try ErrorInfo.metadata.quotaResetDelay (e.g., "373.801628ms")
- Parse from error.message "Your quota will reset after Xs."
This ensures proper cooldown timing for rate-limited requests.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>