How Vannus evaluates AI tools for enterprise procurement
Thousands of AI tools are indistinguishable apps built as thin pass-throughs to upstream LLM APIs with zero proprietary logic. For SMBs, the cost of choosing a wrapper over a sovereign platform is measured in wasted procurement cycles, vendor dependency, and stranded training investment.
Vannus addresses this with an elimination-first scoring framework. We evaluate every tool against nine trust dimensions, synthesize the signals into a composite Resilience Score (0–100), and assign each tool to a resilience tier. Tools that fail on critical dimensions are eliminated from search results entirely — not down-ranked.
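To make the difference between elimination and down-ranking concrete, here is a minimal sketch in Python. It is illustrative only: the `ToolEvaluation` type, the `search_results` function, and the dimension names in `CRITICAL_DIMENSIONS` are hypothetical, since which dimensions count as critical is not spelled out here.

```python
from dataclasses import dataclass, field

# Hypothetical: the set of "critical" dimensions is an assumption for this sketch.
CRITICAL_DIMENSIONS = {"data_training_guarantees", "compliance_certifications"}


@dataclass
class ToolEvaluation:
    name: str
    resilience_score: int  # composite Resilience Score, 0-100
    failed_dimensions: set = field(default_factory=set)


def search_results(evaluations):
    """Drop tools that fail any critical dimension, then rank the survivors."""
    survivors = [
        e for e in evaluations
        if not (e.failed_dimensions & CRITICAL_DIMENSIONS)
    ]
    return sorted(survivors, key=lambda e: e.resilience_score, reverse=True)


tools = [
    ToolEvaluation("tool_a", 82),
    ToolEvaluation("tool_b", 91, {"compliance_certifications"}),  # critical failure
    ToolEvaluation("tool_c", 58, {"export_formats"}),             # non-critical failure
]
print([t.name for t in search_results(tools)])  # ['tool_a', 'tool_c']
```

In this sketch `tool_b` has the highest score but never appears in the results, which is the behavior the elimination-first approach describes; a down-ranking system would instead keep it in the list at a lower position.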
Every tool is classified into one of five tiers based on its composite Resilience Score:
Sovereign tools own their core technology stack. They have independent model weights, proprietary algorithms, or custom training pipelines that cannot be replicated by switching upstream providers. They offer real data residency, local execution options, and deep integrations.
What qualifies a tool as Sovereign:
Examples: Datadog, Zapier, Salesforce, HubSpot, Monday.com
Durable tools have defensible workflows and proprietary data moats that create genuine switching costs. They may rely on some upstream infrastructure but add substantial value through custom logic, integrations, or domain expertise.
What qualifies a tool as Durable:
Examples: Canva AI, Jasper, Notion AI, Loom
Moderate tools are functional but have gaps in compliance documentation, limited integrations, or unclear differentiation from competitors. Usable for non-critical workflows, but require active monitoring.
Typical characteristics:
Fragile tools carry significant upstream dependency. They may function today but face existential risk from API pricing changes, provider policy shifts, or competitive obsolescence. Evaluate alternatives before committing.
Wrappers are thin pass-throughs to upstream LLM APIs (OpenAI, Anthropic, etc.) with zero proprietary logic, no data moat, and no defensible competitive position. Any investment in a wrapper tool is at maximum vendor risk.
Every tool in our catalog is evaluated against the following dimensions. The signals listed under each dimension are the raw inputs we examine; how they are synthesized into a final score is described in the next section.
Backend hosting jurisdiction. Corporate ownership and US CLOUD Act exposure. EU Data Governance Act compliance. Data residency options for the customer.
Vendor headquarters in CFIUS-excepted or allied jurisdictions. Critical dependencies on adversarial infrastructure or sanctioned entities.
Contractual guarantees that customer data is not used for model training. Default-on vs. default-off settings. Audit-log availability.
Verifiable opt-out mechanisms for data usage. Granularity of controls. User-level vs. account-level toggles.
Active certifications and attestations: SOC 2 Type II, ISO/IEC 42001, FedRAMP authorization, GDPR Article 35 data protection impact assessments, HIPAA compliance where applicable. Currency of audit reports.
Native IP and proprietary algorithms vs. thin upstream wrappers. Uptime Institute tier classification. Multi-region redundancy. Stated SLA.
Export formats (open vs. proprietary). Migration tools and documentation. Off-ramp guarantees. Data lock-in patterns.
Empirical Prompt Success Rate where measurable. Documented use cases. Customer-reported outcomes. Integration depth and ecosystem breadth.
Material risks that warrant cautioning prospective buyers: known unremediated breaches, vendor under sanctions, sustained noncompliance, predatory pricing patterns. Each Caution Flag is evidence-based and time-stamped when applied; tools may be re-evaluated as circumstances change.
Signals from each dimension are synthesized into a composite Resilience Score on a 0–100 scale. The exact weighting is proprietary — publishing it would let vendors game the score rather than improve their products. What we will disclose:
Why we don't publish the weights: the dimension names are public so users can verify what we evaluate; the synthesis formula is private so vendors cannot reverse-engineer a box-checking strategy without earning the underlying property. This is the same logic Google applies to its search ranking: the broad categories of signals are public knowledge, the weighting is not. Specificity reads as rigor; mystery reads as fakery. We aim for the first by publishing the inputs in enough detail that any individual score can be audited signal by signal.
| Grade | Score Range | Tier | Procurement Guidance |
|---|---|---|---|
| A+ / A / A- | 80–100 | Sovereign | Strong vendor profile. Standard review sufficient. |
| B+ / B / B- | 65–79 | Durable | Good profile. Minor areas for vendor clarification. |
| C+ / C / C- | 50–64 | Moderate | Adequate. Request additional documentation on flagged areas. |
| D | 35–49 | Fragile | Below standard. Enhanced due diligence required. |
| F | 0–34 | Wrapper | Significant gaps. Escalate to security/legal review. |
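For teams that want to apply the table programmatically, the mapping can be sketched in Python as follows. The `Band` type and `classify` function are hypothetical names, and the sketch assumes integer scores; the table does not give boundaries between the plus and minus grades within a band, so the function returns the grade band as a whole.

```python
from typing import NamedTuple


class Band(NamedTuple):
    floor: int      # lowest score in the band (inclusive)
    grades: str
    tier: str
    guidance: str


# Ordered from highest floor to lowest, mirroring the table above.
BANDS = [
    Band(80, "A+ / A / A-", "Sovereign", "Strong vendor profile. Standard review sufficient."),
    Band(65, "B+ / B / B-", "Durable", "Good profile. Minor areas for vendor clarification."),
    Band(50, "C+ / C / C-", "Moderate", "Adequate. Request additional documentation on flagged areas."),
    Band(35, "D", "Fragile", "Below standard. Enhanced due diligence required."),
    Band(0, "F", "Wrapper", "Significant gaps. Escalate to security/legal review."),
]


def classify(score: int) -> Band:
    """Return the grade band, tier, and guidance for a 0-100 Resilience Score."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for band in BANDS:
        if score >= band.floor:
            return band
    raise AssertionError("unreachable: the last band has floor 0")


print(classify(72).tier)  # Durable
```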
The scoring engine automatically generates contextual flags when it detects patterns that warrant procurement attention:
Quantify the cost of tool failure for your specific organization. Input your team parameters and get a personalized risk/savings projection.
Generate procurement-ready RFP documents with resilience criteria baked in. Select tools, define requirements, and export a structured evaluation framework.
Scores, rankings, and eliminations are computed independently of commercial relationships. Affiliate partnerships exist in a separate layer and never touch the evaluation engine.
Every score is derived from documented signals. The dimensions we evaluate and the inputs per dimension are public. The synthesis weighting is proprietary, so vendors can't reverse-engineer it; any specific score can still be walked through signal by signal on request.
We evaluate open source and proprietary tools using the same framework and criteria.
We re-evaluate tools when vendors change practices, pricing, or ownership. Assessments reflect the latest information we have.
We apply the same methodology to ourselves. Our practices and limitations are disclosed.
Every report includes specific due diligence questions and procurement checkpoints.
By 2028, the universe of named, procurement-relevant AI tools will likely stabilize at three to five thousand — consolidation and attrition catching up to launch volume, while regulatory enforcement (EU AI Act, US state-level AI laws) raises the bar for any tool that wants serious enterprise consideration. Catalogs that compete on raw count miss the point. The buyer's question is not "show me every tool"; it is "which of these can I trust with my data, my workflow, and my regulatory posture?"
Vannus is built around that question. The catalog is curated, not encyclopedic, because curation requires editorial judgment that a longer list cannot provide. The methodology is published (the framework is public; only the weighting is proprietary) because customers verifying our work is part of how trust gets earned. The Caution Flag is named accordingly: it signals that buyer caution is warranted under uniformly applied criteria, and it is not a blanket verdict on an individual vendor.
Independence is structural, not aspirational. Vannus does not run paid placements, sponsored rankings, or vendor advertising. We charge customers (subscriptions, audits) instead of vendors (placements). When a buyer reads a Vannus evaluation, they can trust that the recommendation reflects the methodology applied to public evidence — not what a vendor paid to see.
Vannus maintains affiliate partnerships with some tools in our database. These partnerships are structurally separated from the evaluation engine.
The wall: Affiliate relationships exist in a separate layer from scoring. The evaluation engine does not receive, process, or consider any information about which tools have commercial agreements with Vannus. Scores are computed from technical criteria only. This separation is architectural, not policy-based — the scoring code literally does not have access to partnership data.
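One way to picture an architectural (rather than policy-based) separation is a scoring function whose inputs simply cannot carry partnership data. The Python sketch below is an illustration under assumptions, not a description of the actual Vannus codebase: `TechnicalSignals`, `Partnership`, `score_tool`, and `render_listing` are hypothetical names, and the unweighted mean stands in for the proprietary weighting.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TechnicalSignals:
    tool_id: str
    dimension_scores: Mapping[str, float]  # per-dimension scores, 0-100


@dataclass(frozen=True)
class Partnership:
    tool_id: str
    affiliate_url: str


def score_tool(signals: TechnicalSignals) -> float:
    """Scoring layer: receives technical signals only.

    Placeholder synthesis (unweighted mean); the real weighting is not public.
    """
    values = list(signals.dimension_scores.values())
    if not values:
        raise ValueError("no dimension scores supplied")
    return sum(values) / len(values)


def render_listing(signals: TechnicalSignals,
                   partnerships: Mapping[str, Partnership]) -> dict:
    """Presentation layer: the only place partnership data is read."""
    score = score_tool(signals)  # computed before any partnership lookup
    partner: Optional[Partnership] = partnerships.get(signals.tool_id)
    return {
        "tool_id": signals.tool_id,
        "score": score,
        "affiliate_url": partner.affiliate_url if partner else None,
    }
```

Because `score_tool` accepts only `TechnicalSignals`, the same tool produces the same score whether or not a `Partnership` record exists for it. In a production system the same idea would more likely be enforced at a module or service boundary than by a function signature.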
Verification: Every score Vannus produces can be traced to the criteria documented above. If you question whether a partner tool received favorable treatment, you can request a signal-by-signal walkthrough of the score and check it against those criteria.
Full disclosure: See our Partners & Transparency page for a complete list of commercial relationships and our commitments around evaluation integrity.