Moonshot AI’s Kimi K2.5 is an open‑source, trillion‑parameter multimodal model that combines vision and language in a single backbone. It introduces native visual perception, a “Coding with Vision” workflow that turns images into executable code, and a research‑preview Agent Swarm mode that coordinates up to 100 sub‑agents to tackle complex tasks in parallel.
Native Multimodal Perception in Kimi K2.5
Scale and Architecture
- Mixture‑of‑Experts transformer with 1 trillion total parameters, activating 32 billion parameters per token across 384 experts.
- An integrated MoonViT vision encoder adds roughly 400 million parameters.
- Trained on approximately 15 trillion mixed visual‑text tokens, creating a unified vision‑language backbone (see the parameter sketch after this list).
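To put those figures in perspective, the sketch below relates the headline parameter counts to one another. The dataclass and its field names are hypothetical; only the numbers come from the published specifications.

```python
# Illustrative summary of the published Kimi K2.5 architecture figures.
# The dataclass and field names are hypothetical; only the numbers are
# taken from the announcement.
from dataclasses import dataclass

@dataclass
class KimiK25Config:
    total_params: float = 1.0e12          # 1 trillion total parameters (MoE)
    active_params: float = 32e9           # ~32B parameters activated per token
    num_experts: int = 384                # expert count in the MoE layers
    vision_encoder_params: float = 400e6  # MoonViT vision encoder
    pretraining_tokens: float = 15e12     # ~15T mixed visual-text tokens

cfg = KimiK25Config()
# Roughly 3.2% of the weights are active for any given token.
print(f"Active fraction per token: {cfg.active_params / cfg.total_params:.1%}")
```

The sparse activation ratio is what lets a trillion-parameter model keep per-token compute closer to that of a 32B dense model.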
Benchmark Performance
- MMMU‑Pro: 78.5
- OCRBench: 92.3
- VideoMMMU: 86.6
- MMLU‑Pro: 87.1
- AIME: 96.1
Vision‑Driven Coding with “Coding with Vision”
The “Coding with Vision” feature enables Kimi K2.5 to generate functional HTML, CSS, and JavaScript directly from UI mock‑ups or short interaction videos. By aligning visual patterns with programming constructs during joint pre‑training, the model can synthesize code, solve visual puzzles, and produce interactive web components in a single pass.
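As a rough illustration of how such a workflow could be driven programmatically, the snippet below assumes an OpenAI-compatible chat endpoint. The base URL and the "kimi-k2.5" model identifier are placeholders; consult Moonshot AI's API documentation for the exact values.

```python
# Minimal sketch: ask the model to turn a UI mock-up into HTML/CSS/JS.
# Assumes an OpenAI-compatible chat endpoint; the base URL and the
# "kimi-k2.5" model identifier are placeholders -- check Moonshot AI's
# API docs for the official values.
import base64
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",
    base_url="https://api.moonshot.ai/v1",  # placeholder endpoint
)

with open("mockup.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="kimi-k2.5",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Generate a single-file HTML/CSS/JS implementation "
                     "of this mock-up, including the hover states shown."},
        ],
    }],
)
print(response.choices[0].message.content)
```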
Agent Swarm Mode: Parallel AI Orchestration
Agent Swarm is a research‑preview mode in which Kimi K2.5 acts as an orchestrator, spawning up to 100 sub‑agents and managing up to 1,500 execution steps. This parallel‑agent framework cuts end‑to‑end execution time by roughly 4.5× compared with single‑agent operation, making tasks such as large‑scale data mining or multi‑domain research tractable at scale.
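Moonshot AI has not published the internals of Swarm mode, so the snippet below is only a conceptual sketch of the orchestrator/sub‑agent pattern it describes; run_subagent is a hypothetical stand‑in for a single model call working on its own sub‑task.

```python
# Conceptual sketch of an orchestrator fanning work out to parallel
# sub-agents, in the spirit of Agent Swarm. This is NOT Moonshot AI's
# implementation; run_subagent() is a hypothetical stand-in for one
# model call with its own sub-task and tool budget.
import asyncio

async def run_subagent(task: str) -> str:
    # Placeholder for a real model/tool-calling loop on one sub-task.
    await asyncio.sleep(0.1)
    return f"result for: {task}"

async def orchestrate(tasks: list[str], max_agents: int = 100) -> list[str]:
    # Cap concurrency at the swarm size the orchestrator may spawn.
    semaphore = asyncio.Semaphore(max_agents)

    async def bounded(task: str) -> str:
        async with semaphore:
            return await run_subagent(task)

    # Sub-agents run concurrently; the orchestrator gathers and merges results.
    return await asyncio.gather(*(bounded(t) for t in tasks))

results = asyncio.run(orchestrate([f"subtask {i}" for i in range(20)]))
print(len(results), "sub-tasks completed")
```

The semaphore mirrors the 100‑agent ceiling, while the reported speedup comes from sub‑tasks that would otherwise run sequentially executing side by side.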
Open‑Source Availability and Community Support
Kimi K2.5 is released under a modified MIT license and hosted on a public model repository. Comprehensive documentation provides step‑by‑step guidance for loading the model, configuring vision inputs, and enabling Swarm mode via API calls, empowering developers to integrate the model into their own applications without licensing constraints.
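For a sense of what loading the checkpoint might look like, the sketch below assumes the weights are published on the Hugging Face Hub with custom modeling code. The repo id is a placeholder, and a trillion‑parameter MoE realistically requires a multi‑GPU serving stack, so follow the official documentation for real deployments.

```python
# Minimal loading sketch, assuming a Hugging Face-hosted checkpoint with
# custom modeling code. The repo id is a hypothetical placeholder, and a
# trillion-parameter MoE realistically needs a multi-GPU serving stack --
# follow the official docs for production deployment.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "moonshotai/Kimi-K2.5"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,   # needed for custom MoE/vision modeling code
    device_map="auto",        # shard across available GPUs
    torch_dtype="auto",
)

inputs = tokenizer("Explain what MoonViT contributes to Kimi K2.5.",
                   return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```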
Implications for AI Development
The combination of native multimodality, vision‑driven coding, and parallel agent orchestration positions Kimi K2.5 as a benchmark for open‑source AI innovation. Its strong visual reasoning scores suggest that joint vision‑language training can outperform many closed‑source alternatives, potentially reshaping workflows in UI automation, visual data extraction, and collaborative AI research.
Future Roadmap
Moonshot AI hints at a next‑generation model, Kimi K5, featuring larger context windows (up to 256K tokens) and an expanded expert count. Meanwhile, K2.5 offers a ready‑to‑use platform for developers to explore vision‑centric coding and agentic AI, laying the groundwork for more advanced perception‑aware systems.
