Every figure below comes from our reproducible official benchmark suite. We report cloud and on-device modes side by side, and we don't round up. Where a comparison is like-for-like, we say so; where it isn't, we don't.
Share of queries where the correct passage is ranked first, on the same retrieval test (n=658). This is a like-for-like lead over leading embedding models on the identical test.
| Question type | Cloud | On-device |
|---|---|---|
| Single-session assistant | 100% | 100% |
| Single-session user | 94.3% | 94.3% |
| Knowledge update | 89.7% | 89.7% |
| Temporal reasoning | 92.2% | 88.3% |
| Multi-session | 82.7% | 82.3% |
| Single-session preference | 66.7% | 63.3% |
| Overall (official, both modes) | 86.7% | 86.1% |
The per-type figures are from our full official run; the overall row is the LongMemEval-S score of record (both modes), not a naive mean of the rows above (which would be ≈87.6 / 86.3). We deliberately do not present this as “beating” any named system on LongMemEval: LLM-judge scoring varies between systems, so cross-system leaderboard comparisons on this benchmark are not apples-to-apples. We publish our own per-type results, both modes, so you can see exactly where we are strong and where we are not.
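As a quick check on the parenthetical above, the naive (unweighted) mean of the six per-type rows can be recomputed directly from the table. This is only a minimal sketch: the variable and function names are illustrative, the numbers are copied from the table, and it does not reproduce how the official overall score is calculated.

```python
# Per-type accuracy (%) copied from the table above: (cloud, on-device).
# Names here are illustrative; they are not part of any official tooling.
per_type = {
    "Single-session assistant": (100.0, 100.0),
    "Single-session user": (94.3, 94.3),
    "Knowledge update": (89.7, 89.7),
    "Temporal reasoning": (92.2, 88.3),
    "Multi-session": (82.7, 82.3),
    "Single-session preference": (66.7, 63.3),
}


def naive_mean(values):
    """Unweighted arithmetic mean: every question type counts equally."""
    values = list(values)
    return sum(values) / len(values)


cloud_mean = naive_mean(cloud for cloud, _ in per_type.values())
device_mean = naive_mean(device for _, device in per_type.values())

print(f"{cloud_mean:.1f} / {device_mean:.1f}")  # prints "87.6 / 86.3"
```

Both naive means come out above the official overall row (86.7% / 86.1%), which is why we report the score of record rather than an average of the rows.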
Linking the same entity across different sources.
When it does link, it is almost always right.
Questions that require fusing two sources to answer.
answers are grounded in retrieved sources
adversarial-injection test suite
no data crossed user boundaries