Assessment Methodology

How Vannus evaluates AI tools for enterprise procurement

The Wrapper Problem

Thousands of AI tools are indistinguishable apps built as thin pass-throughs to upstream LLM APIs with zero proprietary logic. For SMBs, the cost of choosing a wrapper over a sovereign platform is measured in wasted procurement cycles, vendor dependency, and stranded training investment.

Vannus addresses this with an elimination-first scoring framework. We evaluate every tool against nine trust dimensions, synthesize the signals into a composite Resilience Score (0–100), and assign each tool to a resilience tier. Tools that fail on critical dimensions are removed from default search results rather than down-ranked.

Resilience Tiers

Every tool is classified into one of five tiers based on its composite Resilience Score.

Sovereign

Score 80+ — The gold standard

Sovereign tools own their core technology stack. They have independent model weights, proprietary algorithms, or custom training pipelines that cannot be replicated by switching upstream providers. They offer real data residency, local execution options, and deep integrations.

What qualifies a tool as Sovereign:

Strong native IP and proprietary tech
Deep integration footprint
Verified compliance (SOC 2 / HIPAA / ISO)
Enterprise pricing tier
Multi-platform support
Open data export

Examples: Datadog, Zapier, Salesforce, HubSpot, Monday.com

Durable

Score 65–79 — Defensible architecture

Durable tools have defensible workflows and proprietary data moats that create genuine switching costs. They may rely on some upstream infrastructure but add substantial value through custom logic, integrations, or domain expertise.

What qualifies a tool as Durable:

Defensible native IP
Mid-deep integration footprint
At least one compliance cert
Sustainable pricing model
Multiple documented use cases

Examples: Canva AI, Jasper, Notion AI, Loom

Moderate

Score 50–64 — Adequate but watch closely

Moderate tools are functional but have gaps in compliance documentation, limited integrations, or unclear differentiation from competitors. Usable for non-critical workflows, but require active monitoring.

Typical characteristics:

Some native IP
Limited integrations
Sparse compliance docs
Functional pricing

Fragile

Score 35–49 — Significant dependency risk

Fragile tools carry significant upstream dependency. They may function today but face existential risk from API pricing changes, provider policy shifts, or competitive obsolescence. Evaluate alternatives before committing.

Low native IP
Minimal integrations
No compliance certs
Thin-margin pricing
Single-ecosystem lock-in

Wrapper

Score <35 — Thin UI, zero data moat

Wrappers are thin pass-throughs to upstream LLM APIs (OpenAI, Anthropic, etc.) with zero proprietary logic, no data moat, and no defensible competitive position. Any investment in a wrapper tool is at maximum vendor risk.

No native IP
Pure API arbitrage
No integrations
No compliance
Description signals dependency

The Nine Trust Dimensions

Every tool in our catalog is evaluated against the following dimensions. The signals listed under each dimension are the raw inputs we examine; the way they synthesize into a final score is described in the next section.

1. Data Sovereignty

Backend hosting jurisdiction. Corporate ownership and US CLOUD Act exposure. EU Data Governance Act compliance. Data residency options for the customer.

2. Allied Infrastructure

Vendor headquarters in CFIUS-exempt or allied jurisdictions. Critical dependencies on adversarial infrastructure or sanctioned entities.

3. Training Privacy

Contractual guarantees that customer data is not used for model training. Default-on vs. default-off settings. Audit-log availability.

4. Conditional Privacy

Verifiable opt-out mechanisms for data usage. Granularity of controls. User-level vs. account-level toggles.

5. Compliance Standard

Active certifications and attestations: SOC 2 Type II, ISO/IEC 42001, FedRAMP authorization. Where applicable, GDPR Article 35 data protection impact assessments and HIPAA compliance. Currency of audit reports.

6. Operational Resilience

Native IP and proprietary algorithms vs. thin upstream wrappers. Uptime Institute tier classification. Multi-region redundancy. Stated SLA.

7. Exit Portability

Export formats (open vs. proprietary). Migration tools and documentation. Off-ramp guarantees. Data lock-in patterns.

8. Real-World Utility

Empirical Prompt Success Rate where measurable. Documented use cases. Customer-reported outcomes. Integration depth and ecosystem breadth.

9. Caution Flag

Material risks that prospective buyers should be warned about: known unremediated breaches, vendor under sanctions, sustained noncompliance, predatory pricing patterns. Each Caution Flag is evidence-based and time-stamped; tools may be re-evaluated as circumstances change.

How signals become scores

Signals from each dimension are synthesized into a composite Resilience Score on a 0–100 scale. The exact weighting is proprietary — publishing it would let vendors game the score rather than improve their products. What we will disclose:

The score is weighted toward verifiable performance, longevity, and sovereignty. Marketing-led signals (popularity counts, review averages, sponsored placements) are not inputs.
Compliance and Caution Flag dimensions act as hard gates. A tool may be excluded from default search results regardless of its scores on other dimensions if a critical compliance gap or caution flag is detected. Excluded tools remain accessible by direct search; we surface the reasoning, not a verdict.
Every score is reproducible from publicly cited signals. If you question why a tool received its score, contact us — we will walk you through every signal that contributed.
Affiliate relationships are architecturally separated from scoring. The evaluation engine has no access to partnership data. This is verifiable in the runtime architecture, not just policy.
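
The hard-gate behavior described above can be sketched roughly as follows. This is a minimal illustration, not the production engine: the `Evaluation` fields, the function names, and the tool names in the usage note are all hypothetical, and the composite score is taken as a given input because its weighting is not published.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Evaluation:
    """Hypothetical record for one evaluated tool."""
    tool: str
    resilience_score: int                       # composite score, 0-100
    critical_compliance_gap: bool = False       # Compliance hard gate
    caution_flag_reason: Optional[str] = None   # Caution Flag hard gate


def gate_reasons(ev: Evaluation) -> List[str]:
    """Return the reasons a tool is gated; an empty list means no gate applies."""
    reasons = []
    if ev.critical_compliance_gap:
        reasons.append("critical compliance gap")
    if ev.caution_flag_reason is not None:
        reasons.append(f"caution flag: {ev.caution_flag_reason}")
    return reasons


def default_results(evals: List[Evaluation]) -> List[Evaluation]:
    """Gated tools are dropped regardless of score; the rest sort by score."""
    visible = [ev for ev in evals if not gate_reasons(ev)]
    return sorted(visible, key=lambda ev: ev.resilience_score, reverse=True)


def direct_search(
    evals: List[Evaluation], name: str
) -> Optional[Tuple[Evaluation, List[str]]]:
    """Gated tools stay reachable by name, returned with the gating reasons."""
    for ev in evals:
        if ev.tool.lower() == name.lower():
            return ev, gate_reasons(ev)
    return None
```

In this sketch, a hypothetical `Evaluation("ToolA", 90, critical_compliance_gap=True)` never appears in `default_results`, while `direct_search(evals, "toola")` still returns it together with the reason it was gated.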

Why we don't publish the weights: the dimension names are public so users can verify what we evaluate; the synthesis formula is private so vendors cannot reverse-engineer a box-checking strategy without earning the underlying property. Google applies the same logic to search ranking: the kinds of signals it considers are widely known, the weighting is not. Specificity reads as rigor; mystery reads as fakery. We aim for rigor by publishing the inputs in enough detail that any individual score can be audited end to end.

Grading Scale

Grade	Score Range	Tier	Procurement Guidance
A+ / A / A-	80–100	Sovereign	Strong vendor profile. Standard review sufficient.
B+ / B / B-	65–79	Durable	Good profile. Minor areas for vendor clarification.
C+ / C / C-	50–64	Moderate	Adequate. Request additional documentation on flagged areas.
D	35–49	Fragile	Below standard. Enhanced due diligence required.
F	0–34	Wrapper	Significant gaps. Escalate to security/legal review.
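
For readers who want to apply the scale programmatically, the table above can be expressed as a simple lookup. This is a sketch under stated assumptions: the `GradeBand` and `band_for_score` names are hypothetical, the cut-offs for the plus and minus sub-grades are not published and so are not modeled, and a fractional score is assumed to fall into the band whose floor it has reached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradeBand:
    floor: int      # lowest score in the band (inclusive)
    grades: str     # grade labels covered by the band
    tier: str
    guidance: str


# Rows copied from the grading scale, highest band first.
GRADING_SCALE = (
    GradeBand(80, "A+ / A / A-", "Sovereign",
              "Strong vendor profile. Standard review sufficient."),
    GradeBand(65, "B+ / B / B-", "Durable",
              "Good profile. Minor areas for vendor clarification."),
    GradeBand(50, "C+ / C / C-", "Moderate",
              "Adequate. Request additional documentation on flagged areas."),
    GradeBand(35, "D", "Fragile",
              "Below standard. Enhanced due diligence required."),
    GradeBand(0, "F", "Wrapper",
              "Significant gaps. Escalate to security/legal review."),
)


def band_for_score(score: float) -> GradeBand:
    """Return the grading-scale row for a Resilience Score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for band in GRADING_SCALE:
        if score >= band.floor:
            return band
    raise AssertionError("unreachable: the last band has floor 0")
```

Keeping the rows as data rather than a chain of conditionals makes it easy to compare the code against the table line by line.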

Automated Risk Flags

The scoring engine automatically generates contextual flags when it detects patterns that warrant procurement attention:

Identified as thin API wrapper — Native IP score penalized, description signals upstream dependency
No documented integrations — Siloed tool, higher switching cost
No compliance certifications — May not meet enterprise security requirements
Very low pricing — Margin structure suggests pure API pass-through, not sustainable R&D
Single-provider dependency — Name or description implies reliance on specific upstream infrastructure
Disproportionate data exposure — Vendor data access exceeds the value proposition
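
A rule-based pass over a tool's profile is one way flags like these could be generated. The sketch below is illustrative only: the `ToolProfile` fields, the keyword list, and the price threshold are assumptions made for the example, not Vannus's actual rules. It covers four of the six flags; the wrapper and data-exposure flags depend on signals that are not described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ToolProfile:
    """Hypothetical subset of the signals the engine examines."""
    name: str
    description: str
    integrations: Tuple[str, ...] = ()
    compliance_certs: Tuple[str, ...] = ()
    monthly_price_usd: float = 0.0


# Assumed values for illustration; the real heuristics are not published.
UPSTREAM_HINTS = ("gpt", "openai", "claude", "anthropic")
LOW_PRICE_THRESHOLD_USD = 5.0


def risk_flags(tool: ToolProfile) -> List[str]:
    """Return the flag labels whose (assumed) trigger conditions are met."""
    flags = []
    text = f"{tool.name} {tool.description}".lower()
    if not tool.integrations:
        flags.append("No documented integrations")
    if not tool.compliance_certs:
        flags.append("No compliance certifications")
    if tool.monthly_price_usd < LOW_PRICE_THRESHOLD_USD:
        flags.append("Very low pricing")
    if any(hint in text for hint in UPSTREAM_HINTS):
        flags.append("Single-provider dependency")
    return flags
```

Each rule is independent, so a tool can collect several flags at once, and new rules can be added without touching the existing ones.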

Procurement Tools

Tuesday Test ROI Calculator

Quantify the cost of tool failure for your specific organization. Input your team parameters and get a personalized risk/savings projection.

RFP Builder

Generate procurement-ready RFP documents with resilience criteria baked in. Select tools, define requirements, and export a structured evaluation framework.

Our Principles

Evaluation Independence

Scores, rankings, and eliminations are computed independently of commercial relationships. Affiliate partnerships exist in a separate layer and never touch the evaluation engine.

Auditable Methodology

Every score is derived from documented signals. The dimensions we evaluate and the inputs per dimension are public. The synthesis weighting is proprietary, so vendors can't reverse-engineer it; any specific score can still be walked through signal by signal on request.

Vendor-Neutral

We evaluate open source and proprietary tools using the same framework and criteria.

Regular Updates

We re-evaluate tools when vendors change practices, pricing, or ownership. Assessments reflect the latest information we have.

Self-Assessment

We apply the same methodology to ourselves. Our practices and limitations are disclosed.

Actionable Output

Every report includes specific due diligence questions and procurement checkpoints.

Why curation, not enumeration

By 2028, the universe of named, procurement-relevant AI tools will likely stabilize at three to five thousand — consolidation and attrition catching up to launch volume, while regulatory enforcement (EU AI Act, US state-level AI laws) raises the bar for any tool that wants serious enterprise consideration. Catalogs that compete on raw count miss the point. The buyer's question is not "show me every tool"; it is "which of these can I trust with my data, my workflow, and my regulatory posture?"

Vannus is built around that question. The catalog is curated — not encyclopedic — because curation requires editorial judgment that a longer list cannot provide. The methodology is published — not proprietary in framework, only in weighting — because customers verifying our work is part of how trust gets earned. The Caution Flag is named accordingly: it warrants buyer caution based on uniformly applied criteria, not eliminationist verdicts on individual vendors.

Independence is structural, not aspirational. Vannus does not run paid placements, sponsored rankings, or vendor advertising. We charge customers (subscriptions, audits) instead of vendors (placements). When a buyer reads a Vannus evaluation, they can trust that the recommendation reflects the methodology applied to public evidence — not what a vendor paid to see.

Commercial Independence

Vannus maintains affiliate partnerships with some tools in our database. These partnerships are structurally separated from the evaluation engine.

The wall: Affiliate relationships exist in a separate layer from scoring. The evaluation engine does not receive, process, or consider any information about which tools have commercial agreements with Vannus. Scores are computed from technical criteria only. This separation is architectural, not policy-based — the scoring code literally does not have access to partnership data.
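
One way to picture this kind of separation is a scoring function whose input type simply has no partnership field. The sketch below is an illustration under assumptions, not a description of Vannus's runtime: `TechnicalSignals`, `PartnerDirectory`, and `listing` are hypothetical names, and the unweighted mean is a placeholder for the unpublished synthesis formula.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TechnicalSignals:
    """Everything the scorer can see: technical inputs only."""
    tool: str
    dimension_scores: Dict[str, float]  # dimension name -> 0-100


def score(signals: TechnicalSignals) -> float:
    """Placeholder synthesis (plain mean); the real weighting is proprietary."""
    values = list(signals.dimension_scores.values())
    if not values:
        raise ValueError("at least one dimension score is required")
    return sum(values) / len(values)


class PartnerDirectory:
    """Separate layer: knows about commercial agreements, never about scores."""

    def __init__(self, partners: Iterable[str]) -> None:
        self._partners = frozenset(partners)

    def is_partner(self, tool: str) -> bool:
        return tool in self._partners


def listing(signals: TechnicalSignals, partners: PartnerDirectory) -> dict:
    """Presentation layer: joins the score with a disclosure label after scoring."""
    return {
        "tool": signals.tool,
        "score": score(signals),
        "partner_disclosed": partners.is_partner(signals.tool),
    }
```

Because `score` accepts only `TechnicalSignals`, changing the contents of the partner directory cannot change a score in this sketch; partnership status is attached afterwards for disclosure.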

Verification: Every score Vannus produces can be traced to the signals and criteria documented above. If you question whether a partner tool received favorable treatment, you can audit the score against those criteria or ask us for a signal-by-signal walkthrough.

Full disclosure: See our Partners & Transparency page for a complete list of commercial relationships and our commitments around evaluation integrity.