AI that thinks. AI that sees. AI that acts. AI that reasons like a brain. The model landscape in 2026 is more complex – and strategically more important – than ever before. Within just three months (February to April 2026), Anthropic, OpenAI, Google, Alibaba, and NVIDIA have released new flagship models. Alibaba's Qwen team alone has established a complete model family with Qwen3.5, Qwen3.5-Omni, and Qwen3.6 that directly challenges closed-source competition across many fronts. At the same time, open source is catching up so fast that choosing the right model has become a genuine business decision.
This article provides a structured overview of the eight core LLM types, their current representatives, and the implications for enterprise AI architectures in the DACH market.
Why LLM Types Are Strategically Relevant
Back in 2023, model selection was simple: ChatGPT or LLaMA. Today there are eight architecturally distinct categories, each solving different problems in different ways. Deploying a Large Action Model for a medical diagnostic process, or a Large Reasoning Model for routine GUI automation, is a fundamentally wrong match that hurts both performance and cost.
According to a recent market report (WhatLLM, October 2025), open-source models already account for 62.8% of all available LLMs and now deliver SOTA performance for around 80% of real-world use cases at a fraction of the cost – on average 7.3× cheaper.
1. GPT – Generative Pre-trained Transformer
The standard that started it all
The GPT type refers to decoder-only Transformer models with autoregressive token prediction and massive pretraining on web data, refined through RLHF and DPO. This architecture is the foundation of virtually all current frontier models.
Closed Source – Current SOTA Models
GPT-5.4 (OpenAI, March 5, 2026)
The most capable general-purpose frontier model to date, featuring native Computer Use, a 1-million-token context window via API, and integrated Codex coding capabilities. It achieves 83% on the GDPval benchmark (professional knowledge work across 44 occupational fields), surpassing human experts in many scenarios. Hallucinations have been reduced by 33% compared to GPT-5.2. On March 17, 2026, GPT-5.4 mini and nano followed for sub-agents and high-volume workloads ($0.75/1M input tokens, 400K context).
Claude Opus 4.7 (Anthropic, April 16, 2026)
87.6% on SWE-bench Verified (+6.8 pts vs. Opus 4.6), 64.3% on SWE-bench Pro (industry leader, +10.9 pts), 78.0% on OSWorld-Verified. Vision resolution tripled to 3.75 megapixels (2,576 px), visual acuity up from 54.5% to 98.5%. New xhigh effort level is the default in Claude Code; /ultrareview command for multi-stage code reviews. Pricing unchanged at $5/$25 per million tokens. Note: new tokenizer generates 1.0–1.35× more tokens. Terminal-Bench 2.0 slightly down (69.4% vs. GPT-5.4 at 75.1%).
Claude Sonnet 4.6 (Anthropic, February 17, 2026)
Achieves 79.6% on SWE-bench Verified – just 1.2 percentage points behind Opus 4.6, at one-fifth of the price ($3/$15 per million tokens). Developers preferred it in 59% of coding sessions over the previous Opus 4.5. Effectively the best price-performance ratio on the market.
Gemini 3 Pro (Google)
1M-token context, 100% on AIME 2025 (with code execution), 80.6% on SWE-bench Verified.
Open Source – Current SOTA Models
| Model | Parameters | License | Strength |
|---|---|---|---|
| Qwen3-235B | 235B / 22B active | Apache 2.0 | AIME 89.2%, Chatbot Arena 1422 |
| Qwen3.6-35B-A3B | 35B / 3B active | Apache 2.0 | 73.4% SWE-bench Verified, 92.7% AIME 2026, beats Gemma 4-31B |
| GPT-OSS-120B | 117B / 5.1B active | Apache 2.0 | OpenAI's first open weights since GPT-2 |
| GLM-5.1 (Zhipu AI) | 744B / 40B active | MIT | SWE-Bench Pro: surpasses Claude Opus 4.6 and GPT-5.4; $3/month in GLM Coding Plan |
| Mistral Large 3 | 675B MoE | Apache 2.0 | 92% GPT-5.2 performance, 15% of the cost |
| DeepSeek-V3.2 | 671B / 37B active | Open | Fine-Grained Sparse Attention, $0.07/MTok |
Enterprise Recommendation: Claude Opus 4.7 as the default for demanding coding and agent workflows (SWE-bench Pro #1, Vision 3.75MP). Sonnet 4.6 for standard tasks at 5× lower cost. Hybrid routing saves 60–80% of budget at near-identical quality.
2. LRM – Large Reasoning Model
AI that thinks before it answers
LRMs extend standard LLMs with explicit chain-of-thought phases trained through Reinforcement Learning (GRPO/PPO). The model "thinks" – in visible or hidden reasoning traces – before responding. Inference-time scaling rather than parameter scaling is the key concept: more compute at inference time rather than more parameters during training.
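One widely used inference-time scaling technique – self-consistency, i.e. sampling several reasoning paths and majority-voting the final answer – can be sketched as follows. The "model" here is a stubbed noisy solver; no vendor's actual reasoning mechanism is implied:

```python
import random
from collections import Counter

def sample_answer(question, rng):
    """Stub for one sampled reasoning trace: a real LRM would return the
    final answer of a full chain of thought. Right ~70% of the time here."""
    return 42 if rng.random() < 0.7 else rng.randint(0, 100)

def self_consistency(question, n_samples=25, seed=0):
    """More samples = more inference-time compute = a more reliable answer."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(question, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # the majority vote settles on 42
```

This is the essence of "more compute at inference time": `n_samples` is the dial, directly trading cost for reliability – the same trade the commercial effort levels expose.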
Closed Source
- GPT-5.4 Thinking – Upfront Planning: users can view and correct the reasoning process before answer generation
- Claude Opus 4.7 Adaptive Thinking + xhigh – 5 effort levels (low/medium/high/xhigh/max); xhigh is the new default in Claude Code; Opus 4.7 at high surpasses Opus 4.6 at max using fewer tokens
- Gemini 3 Deep Think – 2.5× reasoning improvement over its predecessor, 45.1% on ARC-AGI-2
Open Source
- DeepSeek-R1 / R1-0528 – 671B MoE, MIT license; AIME score improved from 70% to 87.5%
- QwQ-32B – Alibaba, 32B, RL-trained, Apache 2.0
- Sky-T1-32B – UC Berkeley, trained for ~$450, fully open source
- Qwen3 (Thinking Mode) – Hybrid: thinking and non-thinking modes switchable via toggle
Enterprise Recommendation: For compliance reviews, legal analysis, financial modeling, and medical decision support, LRMs are the first choice. Costs remain manageable through effort-level control.
3. MoE – Mixture of Experts
The efficiency revolution of the Transformer era
MoE models activate only a small fraction of all parameters per token: a router network selects 2–8 from thousands of expert sub-networks. DeepSeek-V3, for example, activates only 37 out of 671 billion parameters per forward pass. This architecture has established itself as the de facto standard for all major frontier models in 2025/26.
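The routing mechanism can be sketched in plain Python: a gating function scores every expert for the current token, but only the top-k are ever executed. Expert count, dimensions, and weights below are toy values:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token_vec, gate_weights, experts, top_k=2):
    """Score all experts, run ONLY the top-k, mix their outputs by
    renormalized gate probability. The unselected experts never execute."""
    scores = [sum(w * x for w, x in zip(row, token_vec)) for row in gate_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in chosen)
    return sum(probs[i] / total * experts[i](token_vec) for i in chosen)

# Toy setup: 4 "experts" are scalar functions; only 2 run per token.
experts = [lambda v: sum(v), lambda v: max(v), lambda v: min(v), lambda v: 0.0]
gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
print(moe_forward([2.0, 1.0], gate, experts))
```

Real MoE layers do the same per token per layer, with experts as feed-forward networks – which is why a 671B model can run with only 37B parameters active.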
Open Source – Current SOTA Models
| Model | Total / Active Params | Highlight |
|---|---|---|
| DeepSeek-V3.2 | 671B / 37B | Fine-Grained Sparse Attention, 50% efficiency↑ |
| Kimi K2.5 (Moonshot) | 1T / 32B | HumanEval 99.0%, MATH-500 98.0% |
| Nemotron 3 Super (NVIDIA) | 120B / 12B | Hybrid Mamba-Transformer + Latent MoE, 1M Ctx, 5× throughput, agentic-optimized |
| Gemma 4 26B MoE (Google) | 26B / 3.8B active | Apache 2.0, #6 Arena AI, 97% of 31B performance, 256K Ctx, ollama-ready |
| GLM-5.1 (Zhipu AI) | 744B / 40B active | MIT license, 200K Ctx, SWE-Bench Pro above Claude Opus 4.6 level; $3/month Coding Plan |
| GLM-4.7 (Z.ai) | 355B MoE | #1 Open-Source Leaderboard early 2026 |
| Qwen3.5-397B-A17B | 397B / 17B | Apache 2.0, 8.6–19× higher decode throughput; basis of the Qwen3.5 family |
| Qwen3.5-122B-A10B | 122B / 10B | Apache 2.0, Feb 2026; balance of capacity and on-prem efficiency |
| Qwen3.6-35B-A3B | 35B / 3B | Apache 2.0, Apr 2026; 73.4% SWE-bench Verified, 262K Ctx, consumer-GPU-capable |
| Mixtral 8x22B | 141B / 39B | Apache 2.0, proven in production |
Enterprise Recommendation: For co-location deployments and on-premises setups with NVIDIA infrastructure, MoE models are the most cost-efficient option. DeepSeek-V3.2 for batch workloads ($0.07/MTok with cache), Qwen3.5-397B and Qwen3.6-35B-A3B for coding agents. GLM-5.1 (MIT, 744B/40B active) is the most aggressive price-performance attack on closed source in the current cycle – 94.6% of Opus 4.6 coding quality at a fraction of the cost. For high-volume multi-agent pipelines on NVIDIA Blackwell hardware: Nemotron 3 Super – the only open-source model explicitly designed to address context explosion and the thinking-tax effect in agentic workflows.
4. VLM – Vision-Language Model
AI that sees and understands
VLMs combine a vision encoder (typically ViT-based) with an LLM backbone through cross-attention fusion layers. They process text, images, documents, and video within a unified framework. Open-source VLMs have reduced inference costs by up to 60% compared to commercial APIs while maintaining competitive benchmark scores.
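The fusion step can be illustrated with a minimal cross-attention sketch: a text-side query attends over patch embeddings from the vision encoder and returns their weighted mix. All dimensions and values are toy stand-ins:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def cross_attention(text_query, image_patches):
    """One fusion step: dot-product scores between the text query and each
    patch embedding become attention weights; output is the weighted mix."""
    scores = [sum(q * p for q, p in zip(text_query, patch)) for patch in image_patches]
    weights = softmax(scores)
    dim = len(image_patches[0])
    return [sum(w * patch[d] for w, patch in zip(weights, image_patches))
            for d in range(dim)]

# Toy: 3 patch embeddings (as a ViT might emit), one text query.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = cross_attention([2.0, 0.0], patches)
print(fused)  # the mix leans toward the first patch, which matches the query
```

Production VLMs stack many such layers with learned query/key/value projections; the principle – text tokens selectively reading from visual features – is the same.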
Closed Source
- GPT-5.4 Vision – 81.2% on MMMU-Pro without tools, up to 10.24 megapixels (original resolution), native Computer Use for screenshot-based GUI automation
- Claude Opus 4.7 – Vision at 3.75MP (3× vs. 4.6), 82.1% visual reasoning without tools, 98.5% visual acuity; pixel-precise coordinates eliminate scaling errors in Computer Use
- Claude Sonnet 4.6 – 94% accuracy on insurance benchmarks (highest measured value)
- Gemini 3 Pro – Video-native approach, 1M-token context, Pan & Scan for dynamic resolution
Open Source
- Gemma 4 31B / 26B MoE (Google) – Natively multimodal (text, image, video up to 60 s), audio in E2B/E4B; 140+ languages; Apache 2.0; ollama run gemma4; 256K Ctx
- Qwen3.5-Omni (Alibaba, March 30, 2026) – The most advanced fully omnimodal open-weight model to date: processes text, images, audio, and video natively in a single inference call. Thinker-Talker architecture with Hybrid-Attention MoE. 256K context corresponds to >10 hours of audio or ~400 seconds of 720p video. Speech recognition in 113 languages/dialects, speech output in 36 languages. 215 SOTA benchmark results; surpasses Gemini 3.1 Pro on audio benchmarks. Emergent capability: Audio-Visual Vibe Coding – code generation directly from audio/video instructions without text input. Three variants: Plus (flagship), Flash (latency-optimized), Light (edge/on-device). Realtime API with semantic interrupt detection and ARIA technology (Adaptive Rate Interleave Alignment) for natural speech flow control. Apache 2.0.
- Qwen3.5-VL – Video, images, documents; 200+ languages; Apache 2.0
- LLaMA 4 Scout / Maverick – Meta, 109B/400B MoE, natively multimodal
- GLM-4.7V (Z.ai) – Computer Vision + Video Understanding, Open
- DeepSeek-OCR – Document OCR specialist, up to 20× token compression at 97% accuracy
- DeepSeek-VL (1.3B) – Smallest VLM with strong reasoning results
Enterprise Recommendation: For insurance documents, medical image analysis, and automated invoice processing (IDP), VLMs are the key building block. Qwen3.5-Omni is the first genuine open-source alternative to proprietary omni models – particularly for voice AI applications, multilingual customer communication (113 ASR languages), and multimodal agents that must process text, image, audio, and video in a single pipeline. Privacy-sensitive deployments: Gemma 4 26B MoE or Qwen3.5-VL locally via Ollama – both Apache 2.0, single-GPU-capable.
5. SRM – Small Reasoning Model
Frontier reasoning in edge format
SRMs are compact reasoning models under ~15 billion parameters, derived from large reasoning models through knowledge distillation. Microsoft Phi-4-mini-reasoning (3.8B) was distilled from DeepSeek-R1 and achieves 88.6% on MATH-500 – nearly on par with significantly larger models. RL fine-tuning on synthetic mathematics data is the decisive training step.
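The distillation objective itself is compact: the student is trained to match the teacher's temperature-softened output distribution, typically via a KL-divergence term. A minimal sketch (logit values are illustrative):

```python
import math

def softmax_t(logits, T):
    """Temperature-softened softmax: T > 1 flattens the distribution and
    exposes the teacher's 'dark knowledge' about near-miss classes."""
    m = max(logits)
    e = [math.exp((x - m) / T) for x in logits]
    s = sum(e)
    return [v / s for v in e]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) over softened distributions — the core term
    of the distillation objective (illustrative, not any vendor's recipe)."""
    p = softmax_t(teacher_logits, T)
    q = softmax_t(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.5]
print(distill_loss(teacher, [3.9, 1.1, 0.4]))  # near-match: loss close to 0
print(distill_loss(teacher, [0.0, 4.0, 0.0]))  # mismatch: much larger loss
```

In practice this term is combined with a standard cross-entropy loss on ground-truth labels, and for reasoning SRMs the "labels" are the large model's sampled reasoning traces.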
Open Source
| Model | Params | Benchmark | Highlight |
|---|---|---|---|
| Phi-4-mini-reasoning (Microsoft) | 3.8B | MATH-500: 88.6% | 128K Ctx, 20+ languages, Apache-like |
| DeepSeek-R1-Distill-Qwen3-8B | 8B | AIME: 87.5% | Beats Gemini 2.5 Flash! Single-GPU |
| Gemma 4 E2B / E4B (Google) | 2.3B / 4.5B eff. | MMLU 85.2% (31B) | Apache 2.0, Audio+Image+Text, on-device, 128K Ctx, ollama run gemma4 |
| Qwen3 1.7B–8B | 1.7–8B | Best in class | Apache 2.0, Ollama-ready |
| SmolLM3-3B (HuggingFace) | 3B | Beats Llama-3.2-3B | Fully transparent (data, methodology) |
Closed Source (SRM tier)
- GPT-5.4 mini / nano – 400K context, $0.75/1M input, 2× faster than GPT-5.4
- Claude Sonnet 4.6 – Effectively positioned in the SRM price segment ($3/$15) with flagship quality
- Gemini 3 Flash – 78% SWE-bench Verified, fastest closed-source tier
Enterprise Recommendation: For local AI workstations (ASUS/NVIDIA with Ollama), GDPR-compliant on-device deployments, and offline scenarios in healthcare facilities, SRMs are the first choice. Gemma 4 E4B runs on smartphones, Gemma 4 26B MoE on a consumer GPU – both Apache 2.0. Phi-4-mini-reasoning runs on a single NVIDIA RTX 4090.
6. LAM – Large Action Model
From answer to action
LAMs combine language understanding with an execution layer for real-world actions: calling APIs, filling out forms, controlling software, managing files. The defining characteristic: LAMs learn from human action sequences and can autonomously execute multi-step plans without asking for confirmation at each step.
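The plan-act-observe loop at the heart of a LAM can be sketched with a stubbed planner and a tool registry. Both the tool names and the scripted "model" below are illustrative, not any vendor's API:

```python
# Minimal agent loop: the model proposes actions, a registry executes them,
# observations are fed back. Tool names and the planner are illustrative.
TOOLS = {
    "search": lambda arg: f"results for {arg!r}",
    "write_file": lambda arg: f"wrote {arg!r}",
}

def fake_model(history):
    """Scripted stub: a real LAM would emit the next action from context."""
    if not history:
        return ("search", "quarterly report")
    if len(history) == 1:
        return ("write_file", "summary.md")
    return ("done", None)

def run_agent(max_steps=5):
    history = []
    for _ in range(max_steps):
        action, arg = fake_model(history)
        if action == "done":
            break
        observation = TOOLS[action](arg)   # execute, then feed back
        history.append((action, observation))
    return history

for step in run_agent():
    print(step)
```

Everything that distinguishes real LAMs – learned planners, MCP-based tool discovery, budget limits on the loop – slots into this skeleton.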
GPT-5.4 marks a milestone in 2026: it is the first general-purpose frontier model with native Computer Use – 75% on OSWorld-Verified, the benchmark for desktop automation. Claude Opus 4.6 introduces Agent Teams: multiple parallel Claude instances coordinate on complex projects.
NVIDIA Nemotron 3 Super (120B, 12B active, March 11, 2026) plays a special role: multi-agent systems generate up to 15× more tokens than standard chats, as history, tool outputs, and reasoning traces are resent at every turn. Over long tasks, this leads to "context explosion" and "goal drift" – the model gradually loses alignment with the original objective. Nemotron 3 Super directly addresses this "thinking tax" through its 1M-token context with linear Mamba scaling, achieving up to 2.2× higher inference throughput than GPT-OSS-120B.
Open Source
- Nemotron 3 Super (NVIDIA) – 120B / 12B active, Hybrid Mamba-Transformer + Latent MoE, 1M Ctx, 85.6% on PinchBench (agentic benchmark), #1 open model in its class; via build.nvidia.com, OpenRouter and Hugging Face
- Qwen3-Coder-Next (Alibaba, Feb 2026) – 80B MoE / 3B active; specialized coding-agent model for multi-file refactoring, repo-level tasks, and autonomous debugging. 70.6–71.3% on SWE-bench; 262K Ctx; $0.12/$0.75 per million tokens via API. Local deployment requires at least 46 GB of RAM (Mac Studio 64 GB+ recommended).
- xLAM (Salesforce) – #2 on Berkeley Function Calling Leaderboard V1
- OpenHands (All-Hands AI) – Open-source framework for software engineering agents
- Qwen3-Agent – MCP-ready, tool-use-optimized, Apache 2.0
Closed Source
- GPT-5.4 + Computer Use – Tool Search API reduces token consumption in multi-tool setups by up to 47%
- Claude Opus 4.7 Agent Teams – 87.6% SWE-bench Verified, MCP-Atlas industry leader (+9.2 pts vs. GPT-5.4); Task Budgets (Beta) for controlled agent loops
- Qwen3.6-Plus (Alibaba, April 2, 2026) – Proprietary agentic coding flagship with 1M-token context and always-on chain-of-thought. 78.8% on SWE-bench Verified, 61.6% on Terminal-Bench 2.0. Preserve-Thinking parameter for consistent agent loops. Approximately 12× cheaper than Claude Opus 4.6 ($0.29/$1.65 per million tokens). Via OpenRouter and Alibaba Cloud Bailian. Important: closed source, no on-premises deployment.
- Google Agentspace – Enterprise integration in Google Workspace
Enterprise Recommendation: LAMs are the most direct path to replacing RPA systems (UiPath, Automation Anywhere) with language-driven agents. MCP (Model Context Protocol, now under the Linux Foundation) is the emerging industry standard for tool and data access. For on-premises multi-agent deployments on NVIDIA DGX or Blackwell hardware: Nemotron 3 Super as a powerful open-source alternative to Claude Opus 4.7 Agent Teams.
7. HRM – Hierarchical Reasoning Model
The paradigm shift: thinking in latent space
The Hierarchical Reasoning Model is the most spectacular architectural innovation of 2025. Developed by Sapient Intelligence (Singapore, July 2025), it is inspired by the hierarchical, multi-timescale processing of the human brain.
The architecture consists of two interdependent recurrent modules:
- High-level module – slow, abstract planning (corresponds to Kahneman's System 2)
- Low-level module – fast, detailed computation (System 1)
The key distinction: HRM does not think in token space, but in latent space – no chain-of-thought, no verbalization of the reasoning process, no pretraining, no CoT data required.
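The two-timescale idea can be caricatured numerically: a slow high-level state proposes a subgoal, a fast low-level state iterates toward it, and the result feeds back into the plan. The dynamics below are invented toy updates that illustrate the timescale split, not Sapient's actual equations:

```python
def hrm_sketch(x, outer_steps=3, inner_steps=4):
    """Two coupled recurrences: a slow high-level loop sets a subgoal,
    a fast low-level loop refines toward it. Toy dynamics only -
    this shows the timescale structure, not the published architecture."""
    high, low = 0.0, 0.0
    for _ in range(outer_steps):           # slow, abstract planning (System 2)
        subgoal = 0.5 * (x - high)         # high level proposes a direction
        for _ in range(inner_steps):       # fast, detailed computation (System 1)
            low += 0.5 * (subgoal - low)   # low level chases the subgoal
        high += low                        # low-level result updates the plan
    return high

print(hrm_sketch(1.0))  # successive outer steps converge toward the target
```

The point of the real architecture is that both loops run over learned latent states, so "reasoning" happens without ever verbalizing intermediate steps as tokens.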
Benchmark Results
With just 27 million parameters and approximately 1,000 training examples, Sapient HRM outperforms models such as OpenAI o3-mini-high, DeepSeek-R1 (671B!) and Claude 3.7 Sonnet on specific reasoning benchmarks:
| Benchmark | HRM (27M) | o3-mini-high | DeepSeek-R1 |
|---|---|---|---|
| ARC-AGI-2 | 5% | 0% | 0% |
| Sudoku-Extreme | ~100% | 0% | 0% |
| Maze-Hard (30×30) | ~100% | 0% | 0% |
Important caveat: HRM is not a generalist model. It cannot converse, generate code, or summarize. It is a specialized reasoning system that must be trained directly for each new task.
Open source on GitHub: github.com/sapientinc/HRM
Enterprise Recommendation: Not yet production-ready as a standalone system. Highly relevant for research collaborations, specialized constraint-satisfaction problems (routing, planning, optimization), and as a foundation for hybrid architectures. Observation horizon: 12–24 months.
8. LCM – Large Concept Model
Beyond the token: thinking in sentences
Meta's Large Concept Model represents the most conceptually radical approach in this comparison: instead of processing text token by token, the LCM operates at the sentence level within the SONAR embedding space.
The three core components:
- Concept Encoder – converts input into a semantic embedding space
- Core (Inference) – operates on abstract concept representations
- Concept Decoder – transforms abstractions back into natural language
The SONAR space supports 200 text languages and 76 languages for audio – without language-specific retraining. A single LCM model is therefore natively multilingual to a degree that token-based models can barely match.
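The three-stage pipeline can be sketched with stub components – sentences in, concept vectors through the core, language out. The embeddings here are trivial hand-rolled stand-ins; SONAR itself is a learned encoder/decoder:

```python
# Sketch of the encode -> reason -> decode pipeline at sentence granularity.
# All three components are stubs standing in for learned models.
def concept_encoder(sentence):
    """Stub: map a whole sentence to one 'concept' vector.
    (SONAR would produce a language-agnostic learned embedding.)"""
    return [float(len(sentence)), float(sentence.count(" ") + 1)]

def concept_core(vectors):
    """Stub reasoner: operates purely on concept vectors (here: averages
    them into one summary concept). No tokens are ever seen at this stage."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def concept_decoder(vector):
    """Stub: render the abstract concept back into natural language."""
    return f"<summary concept: chars~{vector[0]:.0f}, words~{vector[1]:.0f}>"

doc = ["The quarter closed strongly.", "Revenue grew in all regions."]
print(concept_decoder(concept_core([concept_encoder(s) for s in doc])))
```

Because only the encoder and decoder touch surface language, swapping the input or output language leaves the reasoning core untouched – the source of the native multilinguality described above.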
Current Implementation
Meta LCM 7B – Surpasses LLaMA-3.1-8B on multilingual summarization (XLSum); modularly extensible; open source: github.com/facebookresearch/large_concept_model
Meta positions LCM as scientific diversification – explicitly not as a direct competitor to current frontier LLMs, but as a long-term research path away from the token paradigm.
Enterprise Recommendation: For multilingual DACH deployments and scenarios with 20+ languages (European financial institutions, international insurance groups), LCM is a fascinating research candidate. Production deployment: 12–24 month horizon.
Market Landscape: Open vs. Closed Source in April 2026
The balance of power has shifted fundamentally. Three key messages for enterprise AI decision-makers:
1. Performance parity for 80% of use cases
Open-source models such as DeepSeek V3.2, Qwen3-235B, GLM-5.1, and Qwen3.6-35B-A3B offer pricing from $0.07–0.29 per million tokens at quality scores of 56–58 (on a 0–70 scale) – compared to $15–30 for comparable closed-source models. GLM-5.1 (MIT, 744B/40B active) is the most aggressive assault: 94.6% of Claude Opus 4.6 coding performance at a fraction of the cost. For compliance-intensive industries (insurance, banking, healthcare), on-premises deployments of these models are more economically attractive than ever.
2. Closed source retains the lead for the hardest 20% of tasks
GPT-5.4 leads in native computer-use workflows and professional knowledge-work benchmarks. Claude Opus 4.7 dominates in complex reasoning chains and terminal-based coding agents (69.5% on Terminal-Bench 2.0). For these premium use cases, the quality advantage justifies the higher cost.
3. Hybrid multi-model routing as best practice
The strategically optimal architecture in April 2026 is not a single model, but an intelligent routing system:
- 80–90% of requests → Cost-efficient tier (Sonnet 4.6, GPT-5.4 mini, Qwen3.6-35B-A3B)
- 10–20% of requests → Escalation to Opus 4.7 or GPT-5.4 for complex tasks
- Coding agents (cloud) → Qwen3.6-Plus (1M Ctx, $0.29/$1.65, 78.8% SWE-bench) as a cost-efficient alternative to Opus 4.7
- Agentic pipelines on-prem → Nemotron 3 Super on NVIDIA hardware (eliminates context-explosion costs)
- Multimodal voice pipelines → Qwen3.5-Omni (113 ASR languages, Apache 2.0, on-prem-capable)
- Savings: 60–80% of budget at near-identical quality
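Such a routing tier can be sketched as a thin dispatch layer: a cheap heuristic estimates request complexity and escalates only when needed. Both the heuristic and the model identifiers below are illustrative placeholders:

```python
# Two-tier router sketch: cheap model by default, escalate on complexity
# signals. Heuristic, threshold, and model IDs are illustrative only.
CHEAP, PREMIUM = "sonnet-4.6", "opus-4.7"   # placeholder model identifiers

def estimate_complexity(prompt):
    """Crude proxy: long prompts and agent/coding keywords escalate.
    Production routers use a small classifier model instead."""
    score = len(prompt) / 1000
    for kw in ("refactor", "prove", "multi-step", "debug"):
        if kw in prompt.lower():
            score += 1
    return score

def route(prompt, threshold=1.0):
    return PREMIUM if estimate_complexity(prompt) >= threshold else CHEAP

print(route("Summarize this paragraph."))                 # stays on the cheap tier
print(route("Refactor the billing module, multi-step."))  # escalates to premium
```

The economics follow directly: if 85% of traffic lands on a tier that costs a fifth as much, blended spend drops by roughly two-thirds before any quality trade-off is visible.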
Conclusion: Strategic Recommendations for the DACH Market
The eight LLM types are not competitors – they are a toolkit. A well-considered enterprise AI architecture combines:
| Type | Recommended Use | Model (Recommendation) |
|---|---|---|
| GPT / LRM (Closed) | Frontier coding, knowledge work, vision | Claude Opus 4.7 / Sonnet 4.6 |
| MoE / SRM (Open) | Scalable batch workloads, GDPR, on-prem | DeepSeek-V3.2, Qwen3.6-35B-A3B, GLM-5.1, Gemma 4 26B |
| VLM Omni (Open) | Voice AI, multilingual agents, audio-video analysis | Qwen3.5-Omni (Apache 2.0, 113 languages) |
| VLM (Mix) | Document and image processing, IDP | Opus 4.7 (API) / Gemma 4 / Qwen3.5-VL (local) |
| SRM Edge / On-Device | Mobile, offline, on-device, GDPR | Gemma 4 E2B/E4B / Phi-4-mini / Qwen3.5-4B |
| LAM – Agentic Coding (Open) | Repo-level coding, multi-file refactoring | Qwen3-Coder-Next / Nemotron 3 Super |
| LAM – Agentic (Cloud) | RPA replacement, cost-efficient agent loops via API | Qwen3.6-Plus (API) / xLAM |
| LAM – Agentic (Closed) | Parallel agent teams, Computer Use API | Opus 4.7 Agent Teams / GPT-5.4 |
| HRM / LCM | Research collaborations, future planning | Sapient HRM / Meta LCM (Open) |
For companies in the DACH market: now is the right time to define an AI model strategy – not as a one-time decision, but as a living architecture capable of keeping pace with the rapid rate of innovation.
This article was produced by the AI Data Center Practice at GlobalCore Consulting. GlobalCore supports companies in the DACH market in selecting, architecting, and implementing enterprise AI systems – from local workstation deployments to co-location data center strategies.
For an individual LLM architecture analysis, contact us via our contact form.

