zeb labs

A comparative study of standard Bedrock Converse and Mantle inference under stress conditions

Empirical findings from a controlled stress-test study.

24
stress-condition runs
19,212
total requests
2
serving paths compared
4
concurrency levels

Abstract

Modern LLM serving systems often expose multiple inference paths that differ in latency behavior, throughput scaling, and reliability under load. Choosing the wrong path can degrade user experience or violate service-level objectives (SLOs), especially during traffic surges. We compare two production-relevant options, Standard Bedrock Converse and Mantle OpenAI-compatible serving, under controlled stress conditions spanning architecture type (Dense, MoE), prompt-length bands, and low-to-high concurrency regimes. The benchmark is centered on a translation use case to keep task semantics stable across all test conditions.

The study uses a full-factorial design with run configurations, i.e., each configuration combines architecture, input-length band, and concurrency under one unified protocol. Results show a clear crossover: Standard is more often faster on median latency, while Mantle is stronger under high-load reliability and throughput criteria. This pattern is strongest in MoE experiments, where Mantle maintains substantially higher weighted success.

These findings imply that inference-path selection should be objective-driven: latency-first routing can prefer Standard in lighter regimes, whereas scale-stability routing should prefer Mantle under sustained stress.

Introduction

Large-scale inference deployments are now routinely exposed to heterogeneous serving paths, each optimized for different operational goals. In this setting, platform teams must decide whether to prioritize median response speed, stable throughput at saturation, or reliability under concurrency spikes. This paper studies that decision problem in the context of translation-style LLM workloads routed through two serving concepts: Standard Bedrock Converse and Mantle inference.

Throughout this paper, a run configuration means one exact test condition formed by architecture, input-length band, and concurrency level.

Comparing these two concepts is practically important because both can appear competitive when evaluated with a single metric, yet exhibit different failure modes under stress. A latency-only perspective can favor one path, while a reliability-weighted throughput perspective can favor the other. As a result, decision quality depends on evaluation design, not just raw score tables.

Prior internal and public evaluations in this domain often focus on one serving concept at a time, use narrow load bands, or report non-unified metric suites. Such setups make cross-path interpretation difficult when stress conditions vary across studies. Our work closes this gap with a unified, controlled comparison across architectures, prompt lengths, and concurrency levels.

Main contributions

  • Systematic comparison of Standard and Mantle serving across 24 stress-condition runs and 19,212 total requests.
  • Joint metric framework spanning latency, throughput, reliability, error dynamics, and reliability-weighted throughput.
  • Empirical identification of crossover behavior: Standard is frequently latency-favored, while Mantle is more robust in high-load regimes.

At a high level, we find that Standard outperforms Mantle on median-latency winner counts, whereas Mantle outperforms Standard on stress-regime reliability and effective throughput. This confirms that path selection should be conditioned on service objective and traffic regime rather than treated as a static global choice.

Related work

Why existing evaluations fall short

Background on Standard Bedrock Converse Serving

Standard Bedrock Converse deployments are commonly evaluated through request-level latency and aggregate throughput summaries, with emphasis on end-user responsiveness. This literature and practice generally capture successful-response timing well, but often provide limited characterization of overload-phase reliability decay.

Background on Mantle-Style OpenAI-Compatible Serving

OpenAI-compatible gateway layers, including Mantle-like serving paths, are typically motivated by compatibility, routing flexibility, and operational control. Evaluations in this family frequently emphasize sustained throughput and production stability, especially where admission control and queue behavior influence observed outcomes.

Comparative and Robustness-Focused Evaluation Work

Comparative studies in adjacent systems domains show that methodology strongly affects conclusions: single-metric leaderboards can mis-rank systems when error rates diverge under stress. Robustness-oriented benchmarking therefore recommends multi-condition stress matrices, repeated protocol controls, and reliability-aware scoring.

Existing work remains insufficient for this decision context because most comparisons are not jointly systematic across architecture, prompt scale, and concurrency using one protocol and one metric vocabulary. Our study addresses that limitation with a unified A/B evaluation under controlled stress conditions and harmonized reporting.

Problem Setup and Concepts

Task setting and formalization

Let x denote an input prompt and y the generated translation output. For each experiment configuration c (one exact combination of architecture, input-token band, and concurrency), the system dispatches requests through one of two serving concepts and records latency, throughput, and reliability telemetry. We compare concepts under identical task semantics and evaluate both request-level behavior and aggregate service behavior.

For clarity, we define experiment-configuration notation as:

c = (a, ℓ, q)

where a is architecture, ℓ is input-length band, and q is concurrency level.

TermMeaning
xInput prompt for a translation request.
yModel output generated for prompt x.
aModel architecture slice (Dense or MoE).
Input-length band (200, 600, 1000 tokens).
qConcurrency level (1, 100, 500, 1000).
cExperiment configuration, defined as one unique triple (a, ℓ, q).
pServing path under evaluation (Standard or Mantle).

Concept A: Standard Bedrock Converse

Concept A uses Amazon Bedrock Runtime converse() calls through boto3. Its key characteristics in this benchmark are direct runtime invocation, per-run controlled execution, and latency-centric winner scoring compatibility with existing scripts.

Concept B: Mantle Inference Serving

Concept B uses Mantle through an OpenAI-compatible chat-completions interface. Under the same run configurations, Mantle is evaluated with identical prompt families and concurrency targets so that differences can be attributed to serving behavior rather than workload mismatch.

Illustrative Behavioral Difference

Across the full condition grid (model architecture, input-length band, and concurrency level), both concepts can look similar in easier runs, but they separate clearly in harder runs. As architecture complexity increases, input length grows, and concurrency rises, one concept can maintain completion reliability while the other degrades faster. This is why latency-only interpretation can be misleading and why we use a dual-view evaluation.

Benchmark attributes and measured variables

Core variables, grouped by role

Core benchmark attributes used in this study, grouped by their role in design and analysis.

AttributeCategoryShort description
RegionEnvironmentAWS region where benchmark requests were executed.
ArchitectureWorkload sliceModel family under test (Dense or MoE).
APIPath variableInference path: Standard Converse or Mantle endpoint.
Input tokensLoad variablePrompt size bucket controlling input-length pressure.
ConcurrencyLoad variableNumber of in-flight requests generated per test run.
Total requestsRun volumeTotal requests sent for a test run.
Prompt task typePrompt controlFixed translation task to keep workload semantics stable.
Success, errors, success rateReliability metricsCompletion count, failure count, and completion percentage.
Median ms, p95 ms, p99 msLatency metricsCentral and tail latency behavior of successful requests.
RPSThroughput metricAchieved successful requests per second.
Queue factorCongestion proxyRatio of max latency to median latency in a run.
WinnerDecision metricPer-run winner under script rule (lower median latency).
Effective RPSComposite metricThroughput adjusted by success rate for scale evaluation.

Experimental setup

Measured variables and protocol

The benchmark task is fixed to multilingual translation so stress effects read as serving-path behavior rather than prompt drift. Architectures are Dense (Qwen3 32B) and MoE (Qwen3 Next 80B A3B), two dominant scaling paradigms in the Bedrock catalog.

Data and stress conditions

Stress axisLevelsWhy it matters for production
ArchitectureDense, MoECaptures architecture-specific compute and routing behavior under load.
Prompt length200, 600, 1000 input tokensTests short-to-long context handling and token-processing pressure.
Concurrency1, 100, 500, 1000Represents progression from light traffic to stress-regime saturation.

Rationale for Selecting Dense and MoE Architectures

We focus on Dense and Mixture-of-Experts (MoE) because they are two primary open-model architecture families represented in the Bedrock model catalog and directly relevant to production text inference trade-offs. In this study, the representative Dense model is Qwen3 32B, and the representative MoE model is Qwen3 Next 80B A3B.

This selection should be interpreted as representative rather than exhaustive. Bedrock supports a broader model ecosystem across multiple providers and modalities (text, image, video, speech, and embeddings), but this study isolates text-generation serving behavior across two dominant scaling paradigms so that latency-throughput-reliability trade-offs can be compared under a controlled protocol.

Methods Compared

Both methods are evaluated under identical run configurations.

Method A (Standard)

Bedrock Runtime converse() through boto3.

Method B (Mantle)

OpenAI-compatible chat completions through Mantle.

No additional external baseline is introduced in this report; the study is an A/B serving-path comparison under one unified protocol.

Evaluation metrics and protocol

Region
us-east-1
Models
Dense: qwen.qwen3-32b-v1:0; MoE: qwen.qwen3-next-80b-a3b-v1:0
Input lengths
200, 600, 1000
Concurrencies
1, 100, 500, 1000
Request multiplier
1 (requests = concurrency)
Prompt task
Multilingual translation
Run order per experiment
Standard first, then Mantle

Workload volume and scoring

Workload volume is determined by the concurrency set,

Σc ∈ {1, 100, 500, 1000}c = 1,601
  • Per architecture per API path: 3 × 1,601 = 4,803 requests.
  • Across both architectures per API path: 9,606 requests.
  • Across both paths: 19,212 requests.
  • Recorded metrics include median ms (P50), p95 ms, p99 ms, success rate, error count, RPS, and queue factor.

In this report, an error denotes a failed request event; the dominant observed case is Read timeout on endpoint URL during Bedrock Converse requests.

Primary winner rule (script-native)

lower median ms wins per experiment run.

Secondary robustness-aware score
effective rps = rps ×success rate100

This metric captures throughput and completion reliability jointly.

Results / overall comparison

Averages across all stress runs

Aggregated per architecture and path.

ArchPathAvg P50 (ms)Avg P95 (ms)Avg P99 (ms)Avg RPSMax RPSAvg QueueMax QueueAvg Success (%)Weighted Success (%)
DenseStandard8,167.9410,647.0313,139.0115.7635.842.144.80100.00100.00
DenseMantle8,719.6411,013.4011,386.6825.2747.301.281.73100.00100.00
MoEStandard11,849.6821,105.2722,408.105.969.331.833.2972.8344.31
MoEMantle18,036.0030,890.6334,644.066.4812.342.073.4993.0582.64

Key deltas (Mantle vs Standard).

Dense

Latency percentiles are nearly unchanged (about −1% on average), while average throughput increases by about +60%.

MoE

Latency percentiles increase (about +51% on average), but average throughput is still higher (about +9%).

Reliability (MoE)

Success-rate measures improve by about +29 percentage points.

Across all stress runs, Standard more often minimizes median latency, while Mantle delivers better reliability-weighted scaling in high-load regimes.

Condition cluster 1

Performance and scaling

This cluster examines throughput response as concurrency rises.

Conc.PathP50 (ms)RPSWeighted Success (%)
1Standard5,229.450.19100.00
Mantle6,110.940.16100.00
100Standard7,953.747.50100.00
Mantle8,003.727.15100.00
500Standard12,946.5514.0280.50
Mantle14,655.7428.07100.00
1000Standard13,905.5021.7365.17
Mantle24,740.8728.1286.10

Observed from pooled concurrency slices:

  • Mantle has higher pooled throughput (Dense+MoE) at high concurrency: 28.07 vs 14.02 RPS at c=500, and 28.12 vs 21.73 RPS at c=1000.
  • Mantle RPS is nearly flat from 500 to 1000 (28.07 → 28.12), indicating an early throughput ceiling.
  • Standard scales more gradually in RPS (7.50→14.02→21.73), but with worsening reliability.

Pooled throughput vs concurrency

Figure 1: RPS vs concurrency for Standard and Mantle. Mantle saturates near 28 RPS at higher concurrency, while Standard scales more gradually.

Condition cluster 2

Latency behavior

Pooled P50 latency vs concurrency

Figure 2. Pooled median latency in milliseconds. The Mantle to Standard gap grows from +0.63% at c=100 to +77.93% at c=1000.

  • Pooled P50 gap (Mantle vs Standard) is +0.63% at c=100, +13.20% at c=500, and +77.93% at c=1000.
  • Dense carries a tail nuance: Mantle is worse on average P50 and P95 but better on average P99.
  • Mantle has higher median latency in 17 of 24 experiment runs.

Condition cluster 3

Reliability and error dynamics

Weighted success rate

Figure 3. Completion reliability vs concurrency. Higher is better.

Error growth

Figure 4. Error count vs concurrency. Lower is better.

  • At low/moderate load (c=1,100), both paths achieve 100% weighted success.
  • At c=500, Standard drops to 80.50% weighted success while Mantle remains at 100%.
  • At c=1000, Standard falls to 65.17% while Mantle is 86.10%.
  • Total errors: Standard = 2675, Mantle = 834 (Standard is 3.21× higher).
  • Dominant observed error type: Bedrock Converse endpoint read timeout (Read timeout on endpoint URL).
  • Standard errors are strongly non-linear: 585 at c=500 rising to 2090 at c=1000.

Results

System behavior inference from measurements

The following statements are inferential, based on observed metric patterns:

Mantle

Mantle likely uses stronger admission control or batching behavior, reflected by early throughput saturation and higher high-load completion rates.

Standard

Standard behaves more like a direct endpoint under overload: lower latency among successful requests, but sharper reliability collapse and larger error growth at high concurrency.

Winner counts by scoring view

Figure 5. Run wins under each scoring rule.

Median-latency winners by architecture

Figure 6. Standard leads in both Dense and MoE under the median-latency rule.

Results / efficiency and operational complexity

Efficiency and operational complexity

Summarizes stress-regime efficiency trade-offs. Memory footprint was not instrumented in this run; therefore, complexity analysis focuses on latency-throughput-reliability behavior measured directly from execution telemetry.

Data and Stress Conditions

MetricStandardMantleMantle vs Standard
Effective RPS at c=50011.2928.07+148.69%
Effective RPS at c=100014.1624.21+70.96%
Total errors (all runs)2,675834−68.82%
Dense avg queue factor2.141.28−40.19%

System behavior inference from measurements

Load bandPrimary SLORecommended pathWhy
Low (c=1 to 100)Lowest latencyStandardLower pooled P50, both paths at 100% weighted success.
Medium/high (c=500 to 1000), DenseThroughput + queue stabilityMantleMuch higher RPS and lower queue factor at equal 100% success.
Medium/high (c=500 to 1000), MoECompletion reliabilityMantleHigher weighted success and lower total errors.
Any load bandMedian latency onlyStandardMore median-latency wins overall (17/24).

Conclusion

Route by workload regime, not by default

This comparative study shows that no single serving path dominates every objective under stress. Standard is frequently favorable in latency-first scoring, while Mantle is consistently stronger in high-load reliability and effective throughput. The key implication is methodological and operational: evaluation frameworks must jointly model latency, completion reliability, and throughput to avoid biased path selection. For production deployment, the most effective policy is conditional routing aligned to workload regime and SLO priority rather than a one-size-fits-all endpoint choice.

Appendix A

Per-run results

Every run in the grid: architecture, input length, concurrency, latency percentiles, success rate, throughput, queue factor, and the median-latency winner.

InputConc.Std P50Mnt P50Std P95Mnt P95Std P99Mnt P99Std SuccMnt SuccStd RPSMnt RPSStd QMnt QWinner
20014,916.585,519.994,916.585,519.994,916.585,519.99100%100%0.20.1811Standard
2001008,040.37,896.538,753.188,521.548,831.859,313.08100%100%11.1910.71.11.18Mantle
2005009,087.889,157.1810,229.819,872.8115,965.7310,744.92100%100%18.844.682.911.2Standard
200100011,021.0113,685.3718,320.3620,954.8519,023.5421,245.57100%100%35.8443.662.511.61Standard
60014,810.525,388.894,810.525,388.894,810.525,388.89100%100%0.20.1911Standard
6001005,364.576,123.419,054.588,858.4825,744.528,957.7100%100%3.8711.054.81.46Standard
6005009,734.829,585.1510,678.8710,261.2813,892.9110,652.45100%100%19.0742.312.671.22Mantle
600100011,326.8612,344.1218,602.0919,729.619,064.5420,347.63100%100%34.8546.182.471.71Standard
100014,492.545,431.234,492.545,431.234,492.545,431.23100%100%0.210.1811Standard
10001008,175.867,913.68,751.078,453.839,890.99,218.9100%100%9.9310.751.211.16Mantle
10005009,666.379,162.3510,800.099,836.6112,294.1110,119.78100%100%19.3147.32.611.14Mantle
1000100011,377.9712,427.8618,354.6319,331.6618,740.3619,699.96100%100%35.5946.112.441.73Standard

Ready to transform
your enterprise?

Let's build something that lasts. Our team is ready to talk.