Measured, not marketed

The numbers, in the open.

Every figure below comes from our reproducible official benchmark suite. We report cloud and on-device modes side by side, and we don't round up. Where a comparison is like-for-like, we say so; where it isn't, we don't make the comparison.

Passage retrieval — Hit@1

96.2%

Share of queries where the correct passage is ranked first, on the same retrieval test (n=658). This is a like-for-like lead over leading embedding models on the identical test.

Anvric	96.2%
OpenAI text-embedding-3-large	95.7%
Cohere embed-v4	95.6%
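For readers who want to check a Hit@1 figure against their own retrieval output, the metric as defined above (correct passage ranked first) can be sketched as follows. This is a minimal illustration, not the benchmark harness: the function name `hit_at_1`, the dictionary layout, and the query and passage IDs are all assumptions made for the example.

```python
from typing import Mapping, Sequence


def hit_at_1(rankings: Mapping[str, Sequence[str]], gold: Mapping[str, str]) -> float:
    """Fraction of queries whose top-ranked passage is the correct one.

    rankings: query id -> passage ids, best first (hypothetical layout).
    gold:     query id -> id of the single correct passage.
    """
    if not gold:
        raise ValueError("no queries to score")
    hits = sum(
        1
        for qid, correct in gold.items()
        # A query with no results at all counts as a miss.
        if rankings.get(qid) and rankings[qid][0] == correct
    )
    return hits / len(gold)


# Toy data with made-up IDs; these are not benchmark queries.
rankings = {"q1": ["p7", "p2"], "q2": ["p4", "p9"], "q3": ["p1", "p3"], "q4": []}
gold = {"q1": "p7", "q2": "p9", "q3": "p1", "q4": "p5"}

score = hit_at_1(rankings, gold)
print(f"Hit@1 = {score:.1%}")  # Hit@1 = 50.0%
```

In the toy data, q2 has the correct passage in second place, so it counts as a miss: Hit@1 gives no credit below rank one.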

LongMemEval-S — accuracy by question type

Question type	Cloud	On-device
Single-session assistant	100%	100%
Single-session user	94.3%	94.3%
Knowledge update	89.7%	89.7%
Temporal reasoning	92.2%	88.3%
Multi-session	82.7%	82.3%
Single-session preference	66.7%	63.3%
Overall (official, both modes)	86.7%	86.1%

The per-type figures are from our full official run. The overall row is the LongMemEval-S score of record for each mode, not a naive mean of the six rows above (which would be ≈87.6% cloud / 86.3% on-device). We deliberately do not present this as “beating” any named system on LongMemEval: LLM-judge scoring varies between systems, so cross-system leaderboard comparisons on this benchmark are not apples-to-apples. We publish our own per-type results, in both modes, so you can see exactly where we are strong and where we are not.
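The naive means quoted above can be reproduced directly from the table. The short sketch below takes an unweighted average of the six per-type rows for each mode; the variable names are illustrative. One likely reason the official overall differs is that the question types contain different numbers of questions, but that is an assumption here, since per-type question counts are not listed on this page.

```python
# Per-type accuracy in percent, copied from the table: (cloud, on-device).
per_type = {
    "Single-session assistant": (100.0, 100.0),
    "Single-session user": (94.3, 94.3),
    "Knowledge update": (89.7, 89.7),
    "Temporal reasoning": (92.2, 88.3),
    "Multi-session": (82.7, 82.3),
    "Single-session preference": (66.7, 63.3),
}

# Unweighted ("naive") mean: every question type counts equally,
# regardless of how many questions it contains.
cloud_mean = sum(cloud for cloud, _ in per_type.values()) / len(per_type)
device_mean = sum(device for _, device in per_type.values()) / len(per_type)

print(f"naive mean, cloud:     {cloud_mean:.1f}%")   # 87.6%
print(f"naive mean, on-device: {device_mean:.1f}%")  # 86.3%
```

Both naive means sit above the official overall figures (86.7% and 86.1%), which is why the overall row should be read as its own measurement and not as a summary of the rows.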

77.5%

Cross-source resolution F1

Linking the same entity across different sources.

99.4%

Cross-source precision

When it does link, it is almost always right.

100%

Cross-source fusion QA

Questions that require fusing two sources to answer.
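The cross-source resolution F1 and precision figures above can be read together. Assuming F1 is the standard harmonic mean of precision and recall, and that both figures come from the same evaluation set (neither is stated explicitly on this page), the implied recall follows from rearranging F1 = 2PR / (P + R). The helper name `recall_from_f1` is made up for this sketch, and because the inputs are rounded to one decimal place the result is approximate.

```python
def recall_from_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1 as fractions in (0, 1]."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall for this precision/F1 pair")
    return f1 * precision / denom


precision = 0.994  # cross-source precision, as reported
f1 = 0.775         # cross-source resolution F1, as reported

recall = recall_from_f1(precision, f1)
print(f"implied recall ≈ {recall:.1%}")  # roughly 63.5%

# Sanity check: plugging the recall back in recovers the reported F1.
f1_check = 2 * precision * recall / (precision + recall)
```

Under those assumptions, the pairing of very high precision with a lower F1 means that links, when made, are almost always right, while a share of true cross-source matches are not linked.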

Safety & integrity

Hallucinated / false answers

answers are grounded in retrieved sources

100%

Prompt-injection resistance

adversarial-injection test suite

0 leaks / 50

Cross-user isolation

no data crossed user boundaries

How we measure

All figures are produced by our locked, reproducible official benchmark recipe — the same method, the same judge, run in both cloud and on-device modes.
Retrieval Hit@1 is measured on the identical test (n=658) used for the embedding-model comparison above, which is what makes that one a like-for-like lead.
LongMemEval-S results are our own per-type scores. We do not claim a head-to-head win against other systems on this benchmark, because LLM-judge scoring is not consistent across systems.
Numbers are reported as measured. We update this page when the benchmark is re-run, and we don't round up.
