USE - Usability & Output Quality
Is it actually useful?
Transparency and control mean nothing if the system doesn't deliver value. The USE criterion cuts through vendor promises to focus on what actually matters: Does it work? Do people use it? Does it deliver ROI?
What "USE" Means
The USE criterion evaluates:
- Output Quality: Are responses accurate, relevant, and helpful?
- User Adoption: Do people actually use it, or do they find workarounds?
- Consistency: Does it perform reliably or produce erratic results?
- Practical Value: Does it solve real problems or just create busywork?
- User Experience: Is it intuitive or frustrating to use?
- ROI Evidence: Can the vendor demonstrate actual value, not just claims?
Why Usability and Quality Matter
Nothing Else Matters if It Doesn't Work
Direct Version: Perfect transparency into a system that produces garbage is worthless. Full control over a tool nobody wants to use is pointless. If the output quality is bad or the UX is terrible, the other criteria are academic. This is the only metric that actually determines ROI.
Suitable for Work Version: Output quality and usability are the primary determinants of AI system value. Without demonstrated effectiveness:
- User adoption fails regardless of technical capabilities
- Promised productivity gains don't materialize
- Implementation costs exceed realized benefits
- Strategic objectives remain unmet
User Adoption is the Real Test
Direct Version: Vendors will show you cherry-picked demos that look amazing. What matters is: Do actual users, doing actual work, choose to use this tool? Or do they avoid it, complain about it, and find workarounds? Users vote with their behavior, and that vote is usually accurate.
Suitable for Work Version: Sustained user adoption indicates genuine utility. Low adoption reveals:
- Output quality insufficient for real-world tasks
- User experience friction exceeding perceived benefits
- Mismatch between tool capabilities and actual needs
- Inadequate training or change management
Quality is Subjective but Measurable
Direct Version: AI outputs aren't right or wrong—they're useful or not useful for specific tasks. A vendor saying "our accuracy is 95%" is meaningless without context. Accurate at what? Measured how? Useful for whom? Demand evidence based on your use cases, not theirs.
Suitable for Work Version: Output quality must be evaluated in context:
- Task-specific accuracy and relevance
- Consistency across different input types
- Performance on edge cases and challenging queries
- Alignment with organizational standards and voice
What Good Usability and Quality Look Like
Excellent (Green)
A vendor with strong usability and quality provides:
✅ Demonstrable Quality: Evidence-based metrics for accuracy, relevance, and helpfulness
✅ High User Adoption: Usage data showing sustained, growing engagement
✅ Consistent Performance: Reliable outputs across different queries and contexts
✅ Real-World Validation: Customer references, case studies with measurable outcomes
✅ Intuitive UX: Users need minimal training to be productive
✅ Continuous Improvement: Quality metrics improve over time based on feedback
✅ ROI Evidence: Documented productivity gains, cost savings, or revenue impact
Example: "87% of users engage daily after 90 days. Customer reference: 'Reduced research time from 45 min to 8 min per query, 35% productivity increase.' A/B testing shows 4.2/5 average helpfulness rating. Quality dashboard tracks accuracy trends over time."
Acceptable with Caveats (Yellow)
A vendor with partial quality/usability:
⚠️ Quality is acceptable but inconsistent across use cases
⚠️ User adoption exists but growth is slow or plateauing
⚠️ Some customer references but lacking detailed metrics
⚠️ UX requires significant training or has known friction points
⚠️ ROI claims are directional rather than quantified
Example: "Users report general satisfaction. Most queries produce helpful results. Training program reduces time-to-productivity. Some use cases require prompt engineering. Customer feedback is positive but anecdotal."
Unacceptable (Red)
A vendor with poor quality/usability:
❌ No objective quality metrics, only vague claims
❌ Low user adoption or high abandonment rates
❌ Inconsistent or unreliable outputs
❌ Customer references are vague testimonials without data
❌ Users complain about accuracy or relevance issues
❌ No evidence of ROI—just theoretical benefits
❌ Complex UX requiring extensive training and support
Example: "Our AI delivers powerful insights. Customers love it. We don't track usage metrics—privacy reasons. Quality varies by use case but continuously improving. Some users need coaching to get good results."
Evaluation Questions
When evaluating usability and output quality, ask:
Output Quality
- Q: What metrics do you use to measure output quality?
- Q: Can I see quality benchmarks specific to my use case?
- Q: How do you handle hallucinations or inaccurate responses?
- Q: What's your approach to improving quality over time?
User Adoption
- Q: What percentage of licensed users are active monthly?
- Q: What's your user retention rate at 30/60/90 days? (A sketch for spot-checking these figures against raw usage logs follows these questions.)
- Q: Do you have usage data showing engagement trends?
- Q: What are common reasons for low adoption?
Consistency
- Q: How consistent are outputs for similar queries?
- Q: How do you detect and address quality regressions?
- Q: What happens when model versions change?
- Q: Can I test consistency against my own queries?
Practical Value
- Q: Can you share case studies with quantified outcomes?
- Q: What specific tasks does this solve better than alternatives?
- Q: What are the most common user complaints?
- Q: How do you measure and track ROI?
User Experience
- Q: How long does training typically take?
- Q: What's your user satisfaction (NPS or CSAT) score?
- Q: Can I talk to actual users, not just executives?
- Q: What are known UX pain points?
ROI Evidence
- Q: Can you provide customer references with measurable results?
- Q: What productivity gains do customers typically see?
- Q: How long until customers realize value?
- Q: What percentage of pilots convert to full deployments?
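Vendor answers to the adoption and retention questions above are easy to spot-check if you can get a raw usage export during the pilot. The sketch below is a minimal example, assuming a hypothetical session log of (user_id, timestamp) rows; the field names, dates, and retention windows are illustrative, not any vendor's actual schema.

```python
# Sketch: verifying vendor adoption claims against a raw usage export.
# Assumes one row per session: (user_id, timestamp). All names, dates,
# and windows below are illustrative placeholders.
from datetime import datetime, timedelta

sessions = [
    ("alice", datetime(2024, 5, 2)),
    ("alice", datetime(2024, 7, 15)),
    ("bob",   datetime(2024, 5, 3)),
    ("carol", datetime(2024, 7, 20)),
]
licensed_users = {"alice", "bob", "carol", "dave"}

def monthly_active_pct(sessions, licensed, year, month):
    """Share of licensed seats with at least one session in the given month."""
    active = {u for u, ts in sessions
              if u in licensed and ts.year == year and ts.month == month}
    return 100 * len(active) / len(licensed)

def returned_after(sessions, day_zero, window_days):
    """Users with at least one session after `window_days` from day zero."""
    cutoff = day_zero + timedelta(days=window_days)
    return {u for u, ts in sessions if ts >= cutoff}

pilot_start = datetime(2024, 5, 1)
# Cohort: everyone who tried the tool in the first week of the pilot.
cohort = {u for u, ts in sessions if ts < pilot_start + timedelta(days=7)}
for days in (30, 60, 90):
    kept = cohort & returned_after(sessions, pilot_start, days)
    print(f"{days}-day retention: {100 * len(kept) / len(cohort):.0f}%")
print(f"July MAU: {monthly_active_pct(sessions, licensed_users, 2024, 7):.0f}% of licensed seats")
```

Comparing numbers like these against the vendor's claimed adoption figures is usually enough to tell whether their metrics describe your users or their best customer's.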
Red Flags
Watch out for vendors who:
🚩 Refuse to provide usage or adoption metrics
🚩 Show only perfect demo scenarios, never edge cases
🚩 Provide customer testimonials but no quantified outcomes
🚩 Blame users for quality issues ("they need better prompts")
🚩 Can't explain how they measure or improve quality
🚩 Have impressive technology but no evidence of practical value
🚩 Avoid letting you talk to actual users during evaluation
🚩 Can't provide references relevant to your use case
Why Vendors Avoid Quality Discussions
What they say: "Quality is highly subjective and varies by use case."
What it often means:
- They don't track quality metrics systematically
- The metrics they have would look bad
- User adoption is lower than they'd like to admit
- They're betting on sales momentum, not product quality
The truth: Quality is measurable if vendors choose to measure it. If they won't show you data, assume it's not flattering.
Best Practices for Procurement
During Evaluation
- Pilot with Real Users: Test with actual end-users doing real work, not just IT evaluators
- Track Usage Metrics: Measure how often users choose to use the tool during the pilot (see the scorecard sketch after this list)
- Collect Feedback: Survey users about quality, relevance, and UX satisfaction
- Benchmark Against Alternatives: Compare to existing tools or manual processes
- Test Edge Cases: Don't just test the happy path—find the failure modes
- Request References: Talk to users (not executives) at reference accounts
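A pilot produces a lot of anecdotes, so it helps to roll the signals above (voluntary usage, satisfaction, edge-case performance) into one simple scorecard. The sketch below assumes you collect per-user session frequency, survey ratings, and a judged pass rate on hard queries; the thresholds are placeholders to tune against your own baseline, not a standard.

```python
# Sketch of a pilot scorecard: adoption, satisfaction, and edge-case
# performance rolled into a go/no-go summary. Field names and thresholds
# are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PilotUser:
    user_id: str
    sessions_per_week: float     # how often they chose the tool unprompted
    avg_rating: float            # 1-5 satisfaction from in-pilot surveys
    edge_case_pass_rate: float   # share of their hard queries judged useful

pilot = [
    PilotUser("u1", 6.0, 4.3, 0.70),
    PilotUser("u2", 0.5, 2.1, 0.40),
    PilotUser("u3", 4.0, 4.0, 0.65),
]

adopters = [u for u in pilot if u.sessions_per_week >= 3]   # used it by choice
adoption_rate = len(adopters) / len(pilot)
avg_rating = sum(u.avg_rating for u in pilot) / len(pilot)
edge_pass = sum(u.edge_case_pass_rate for u in pilot) / len(pilot)

print(f"Adoption: {adoption_rate:.0%}  Avg rating: {avg_rating:.1f}/5  Edge-case pass: {edge_pass:.0%}")
# Example gate; tune thresholds to your own manual-process baseline.
go = adoption_rate >= 0.6 and avg_rating >= 3.5 and edge_pass >= 0.6
print("Recommendation:", "proceed" if go else "do not proceed / renegotiate")
```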
In Contracts
- Quality SLAs: Minimum accuracy/relevance thresholds if possible
- Usage Guarantees: Right to pause or exit if adoption stays below X%
- Reference Rights: Right to contact customer references throughout relationship
- Improvement Commitments: Vendor obligation to address systematic quality issues
Post-Deployment
- Monitor Usage: Track active users, session frequency, query volume
- Measure Quality: Implement user feedback mechanisms such as thumbs up/down and ratings (see the monitoring sketch after this list)
- Survey Regularly: Collect structured user feedback quarterly
- Calculate ROI: Track time savings, cost reductions, or revenue impact
- Compare Alternatives: Periodically test competing solutions
- Address Adoption Barriers: Investigate and resolve causes of low usage
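Here is a minimal sketch of the usage and feedback monitoring above, assuming a hypothetical event log of queries and thumbs-up/down ratings; substitute whatever your telemetry actually records.

```python
# Sketch: post-deployment monitoring of active usage and thumbs-up/down
# feedback, bucketed by ISO week. The event log format is a hypothetical
# example, not a specific product's schema.
from collections import defaultdict
from datetime import date

# (user_id, day, event_type) where event_type is "query", "thumbs_up", or "thumbs_down"
events = [
    ("alice", date(2024, 9, 2), "query"),
    ("alice", date(2024, 9, 2), "thumbs_up"),
    ("bob",   date(2024, 9, 3), "query"),
    ("bob",   date(2024, 9, 3), "thumbs_down"),
    ("carol", date(2024, 9, 4), "query"),
]

weekly = defaultdict(lambda: {"users": set(), "up": 0, "down": 0})
for user, day, kind in events:
    week = day.isocalendar()[:2]          # (year, ISO week number)
    bucket = weekly[week]
    if kind == "query":
        bucket["users"].add(user)
    elif kind == "thumbs_up":
        bucket["up"] += 1
    elif kind == "thumbs_down":
        bucket["down"] += 1

for week, b in sorted(weekly.items()):
    rated = b["up"] + b["down"]
    helpful = b["up"] / rated if rated else float("nan")
    print(f"Week {week}: {len(b['users'])} active users, helpful rate {helpful:.0%}")
```

A weekly table like this is often all it takes to spot the adoption and quality trends that determine whether the deployment is paying off.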
Real-World Impact
Case Study: Demo vs. Reality
Scenario: AI research tool delivered perfect answers in demos. In production, 40% of queries produced irrelevant results.
Root Cause: Demo used curated test queries and cherry-picked document set. Real data was messier and queries more varied.
Outcome: User adoption crashed after 2 weeks. Tool was abandoned. $200K implementation cost lost.
Lesson: Always pilot with real users and real data, not sanitized demos.
Case Study: Adoption as Signal
Scenario: Two AI tools piloted side-by-side for customer support. Tool A had better technology on paper. Tool B had simpler UX.
Tool A Results: 25% active users after 30 days. Average session: 2 minutes. Users complained it was "too complicated."
Tool B Results: 78% active users after 30 days. Average session: 15 minutes. Users said it "just works."
Decision: Deployed Tool B. Technology specs don't matter if people won't use it.
Case Study: Quality Degradation
Scenario: AI summary tool worked great for 6 months. Then summaries became verbose and less focused.
Root Cause: Vendor switched underlying model without notice. New model had different behavior.
With Monitoring: Team detected quality drop within 1 week via user ratings. Escalated to vendor. Model reverted within 2 days.
Without Monitoring: Would have taken months to discover. User trust would have eroded silently. Tool adoption would have collapsed.
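The kind of check that caught this regression can be very simple: compare the recent average user rating against a longer baseline window and alert on a drop. The sketch below shows the idea; the window sizes, alert threshold, and sample ratings are illustrative assumptions.

```python
# Sketch: detect a quality regression by comparing recent average user
# ratings to a baseline window. Windows, threshold, and data are
# illustrative placeholders.
from statistics import mean

daily_avg_rating = [4.2, 4.3, 4.1, 4.2, 4.3, 4.2, 4.1,   # stable baseline
                    3.4, 3.3, 3.5]                        # after a silent model swap

BASELINE_DAYS, RECENT_DAYS, MAX_DROP = 7, 3, 0.5

baseline = mean(daily_avg_rating[-(BASELINE_DAYS + RECENT_DAYS):-RECENT_DAYS])
recent = mean(daily_avg_rating[-RECENT_DAYS:])

if baseline - recent > MAX_DROP:
    print(f"ALERT: rating dropped from {baseline:.2f} to {recent:.2f}, escalate to vendor")
else:
    print(f"OK: baseline {baseline:.2f}, recent {recent:.2f}")
```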
Quality Evaluation Framework
Output Quality Dimensions
Accuracy: Is the information factually correct?
- Test: Compare outputs to ground truth for verifiable queries
Relevance: Does it answer the actual question?
- Test: User ratings on "was this helpful?"
Completeness: Does it provide sufficient depth?
- Test: Do users need follow-up queries, or is the first response enough?
Conciseness: Is it appropriately brief?
- Test: Can users quickly extract value or do they have to read pages?
Consistency: Are similar queries answered similarly?
- Test: Submit the same query multiple times and measure variation (see the sketch after these dimensions)
Source Quality: Are citations credible and authoritative?
- Test: Review source documents for quality and relevance
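The consistency test above is easy to script. The sketch below sends the same query several times and measures how much the answers vary using token-overlap (Jaccard) similarity; ask_model is a stand-in for whatever API the vendor actually exposes, and the similarity measure is one simple, dependency-free option rather than the only reasonable choice.

```python
# Sketch: consistency test. Submit the same query N times and measure how
# much the answers vary. ask_model is a placeholder for the real vendor API.
import random
from itertools import combinations

def ask_model(query: str) -> str:
    # Placeholder: replace with the real vendor API call.
    return random.choice([
        "Revenue grew 12% year over year, driven by the enterprise segment.",
        "Year-over-year revenue growth was 12%, mostly from enterprise sales.",
        "Revenue increased about 12%, led by enterprise customers.",
    ])

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity: 1.0 means identical word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

query = "Summarize last quarter's revenue trend."
answers = [ask_model(query) for _ in range(5)]
scores = [jaccard(a, b) for a, b in combinations(answers, 2)]
print(f"Mean pairwise similarity: {sum(scores) / len(scores):.2f} (1.0 = identical answers)")
```

Run the same harness against your own queries before and after any announced model change to catch regressions like the one in the case study above.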
User Experience Dimensions
Ease of Use: Can users be productive without extensive training?
- Measure: Time to first successful query
Speed: Are responses fast enough for the workflow?
- Measure: P50, P95, P99 response times (see the sketch after these dimensions)
Reliability: Does it work consistently or fail unpredictably?
- Measure: Error rate, timeout frequency
Learnability: Do users improve with experience?
- Measure: Quality of queries over time
Satisfaction: Do users enjoy using it, or merely tolerate it?
- Measure: NPS, CSAT surveys
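For the speed and reliability measures above, a request log is enough to get started. The sketch below computes P50/P95/P99 response times with a nearest-rank percentile and a simple error rate; the log format and percentile method are assumptions, and in production you would pull these from real telemetry.

```python
# Sketch: latency percentiles and error rate from a hypothetical request log.
latencies_ms = [420, 510, 390, 2300, 460, 480, 530, 610, 450, 5200]
errors = 3          # failed or timed-out requests
total = len(latencies_ms) + errors

def percentile(values, p):
    """Nearest-rank percentile over a sorted copy of the values."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
print(f"Error rate: {errors / total:.1%}")
```

Tail latencies (P95/P99) matter more than averages here: a tool that is usually fast but occasionally hangs for seconds is exactly the kind of friction that quietly kills adoption.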
Key Takeaway
Quality and usability are the only metrics that actually matter for ROI.
You can have perfect transparency into a terrible system. You can have complete control over a tool nobody wants to use. You can have an exit strategy from a product that delivers no value.
None of that matters if the tool doesn't work.
During evaluation:
- Demand evidence, not promises
- Pilot with real users doing real work
- Track adoption and usage rigorously
- Talk to actual users at reference accounts
- Test edge cases and failure modes
If it doesn't deliver value in the pilot, it won't deliver value in production. Vendor promises won't change that.