USE - Usability & Output Quality
Is it actually useful?
Transparency and control mean nothing if the system doesn't deliver value. The USE criterion cuts through vendor promises to focus on what actually matters: Does it work? Do people use it? Does it deliver ROI?
What "USE" Means
The USE criterion evaluates:
- Output Quality: Are responses accurate, relevant, and helpful?
- User Adoption: Do people actually use it, or do they find workarounds?
- Consistency: Does it perform reliably or produce erratic results?
- Practical Value: Does it solve real problems or just create busywork?
- User Experience: Is it intuitive or frustrating to use?
- ROI Evidence: Can the vendor demonstrate actual value, not just claims?
Why Usability and Quality Matter
Nothing Else Matters if It Doesn't Work
Direct Version: Perfect transparency into a system that produces garbage is worthless. Full control over a tool nobody wants to use is pointless. If the output quality is bad or the UX is terrible, the other criteria are academic. This is the only metric that actually determines ROI.
Suitable for Work Version: Output quality and usability are the primary determinants of AI system value. Without demonstrated effectiveness:
- User adoption fails regardless of technical capabilities
- Promised productivity gains don't materialize
- Implementation costs exceed realized benefits
- Strategic objectives remain unmet
User Adoption is the Real Test
Direct Version: Vendors will show you cherry-picked demos that look amazing. What matters is: Do actual users, doing actual work, choose to use this tool? Or do they avoid it, complain about it, and find workarounds? Users vote with their behavior, and that vote is usually accurate.
Suitable for Work Version: Sustained user adoption indicates genuine utility. Low adoption reveals:
- Output quality insufficient for real-world tasks
- User experience friction exceeding perceived benefits
- Mismatch between tool capabilities and actual needs
- Inadequate training or change management
Quality is Subjective but Measurable
Direct Version: AI outputs aren't right or wrong—they're useful or not useful for specific tasks. A vendor saying "our accuracy is 95%" is meaningless without context. Accurate at what? Measured how? Useful for whom? Demand evidence based on your use cases, not theirs.
Suitable for Work Version: Output quality must be evaluated in context:
- Task-specific accuracy and relevance
- Consistency across different input types
- Performance on edge cases and challenging queries
- Alignment with organizational standards and voice
What Good Usability and Quality Look Like
Excellent (Green)
A vendor with strong usability and quality provides:
✅ Demonstrable Quality: Evidence-based metrics for accuracy, relevance, and helpfulness
✅ High User Adoption: Usage data showing sustained, growing engagement
✅ Consistent Performance: Reliable outputs across different queries and contexts
✅ Real-World Validation: Customer references, case studies with measurable outcomes
✅ Intuitive UX: Users need minimal training to be productive
✅ Continuous Improvement: Quality metrics improve over time based on feedback
✅ ROI Evidence: Documented productivity gains, cost savings, or revenue impact
Example: "87% of users engage daily after 90 days. Customer reference: 'Reduced research time from 45 min to 8 min per query, 35% productivity increase.' A/B testing shows 4.2/5 average helpfulness rating. Quality dashboard tracks accuracy trends over time."
Acceptable with Caveats (Yellow)
A vendor with partial quality/usability:
⚠️ Quality is acceptable but inconsistent across use cases
⚠️ User adoption exists but growth is slow or plateauing
⚠️ Some customer references but lacking detailed metrics
⚠️ UX requires significant training or has known friction points
⚠️ ROI claims are directional rather than quantified
Example: "Users report general satisfaction. Most queries produce helpful results. Training program reduces time-to-productivity. Some use cases require prompt engineering. Customer feedback is positive but anecdotal."
Unacceptable (Red)
A vendor with poor quality/usability:
❌ No objective quality metrics, only vague claims
❌ Low user adoption or high abandonment rates
❌ Inconsistent or unreliable outputs
❌ Customer references are vague testimonials without data
❌ Users complain about accuracy or relevance issues
❌ No evidence of ROI—just theoretical benefits
❌ Complex UX requiring extensive training and support
Example: "Our AI delivers powerful insights. Customers love it. We don't track usage metrics—privacy reasons. Quality varies by use case but continuously improving. Some users need coaching to get good results."
Evaluation Questions
When evaluating usability and output quality, ask:
Output Quality
- Q: What metrics do you use to measure output quality?
- Q: Can I see quality benchmarks specific to my use case?
- Q: How do you handle hallucinations or inaccurate responses?
- Q: What's your approach to improving quality over time?
User Adoption
- Q: What percentage of licensed users are active monthly?
- Q: What's your user retention rate at 30/60/90 days? (A sketch for spot-checking these figures against raw usage logs follows these questions.)
- Q: Do you have usage data showing engagement trends?
- Q: What are common reasons for low adoption?
Consistency
- Q: How consistent are outputs for similar queries?
- Q: How do you detect and address quality regressions?
- Q: What happens when model versions change?
- Q: Can I test consistency against my own queries?
Practical Value
- Q: Can you share case studies with quantified outcomes?
- Q: What specific tasks does this solve better than alternatives?
- Q: What are the most common user complaints?
- Q: How do you measure and track ROI?
User Experience
- Q: How long does training typically take?
- Q: What's your user satisfaction (NPS or CSAT) score?
- Q: Can I talk to actual users, not just executives?
- Q: What are known UX pain points?
ROI Evidence
- Q: Can you provide customer references with measurable results?
- Q: What productivity gains do customers typically see?
- Q: How long until customers realize value?
- Q: What percentage of pilots convert to full deployments?
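Vendor answers to the adoption and retention questions above are easy to spot-check if you can get a raw usage export during the pilot. The sketch below is a minimal example, assuming a hypothetical session log of (user_id, timestamp) rows; the field names, dates, and retention windows are illustrative, not any vendor's actual schema.

```python
# Sketch: verifying vendor adoption claims against a raw usage export.
# Assumes one row per session: (user_id, timestamp). All names, dates,
# and windows below are illustrative placeholders.
from datetime import datetime, timedelta

sessions = [
    ("alice", datetime(2024, 5, 2)),
    ("alice", datetime(2024, 7, 15)),
    ("bob",   datetime(2024, 5, 3)),
    ("carol", datetime(2024, 7, 20)),
]
licensed_users = {"alice", "bob", "carol", "dave"}

def monthly_active_pct(sessions, licensed, year, month):
    """Share of licensed seats with at least one session in the given month."""
    active = {u for u, ts in sessions
              if u in licensed and ts.year == year and ts.month == month}
    return 100 * len(active) / len(licensed)

def returned_after(sessions, day_zero, window_days):
    """Users with at least one session after `window_days` from day zero."""
    cutoff = day_zero + timedelta(days=window_days)
    return {u for u, ts in sessions if ts >= cutoff}

pilot_start = datetime(2024, 5, 1)
# Cohort: everyone who tried the tool in the first week of the pilot.
cohort = {u for u, ts in sessions if ts < pilot_start + timedelta(days=7)}
for days in (30, 60, 90):
    kept = cohort & returned_after(sessions, pilot_start, days)
    print(f"{days}-day retention: {100 * len(kept) / len(cohort):.0f}%")
print(f"July MAU: {monthly_active_pct(sessions, licensed_users, 2024, 7):.0f}% of licensed seats")
```

Comparing numbers like these against the vendor's claimed adoption figures is usually enough to tell whether their metrics describe your users or their best customer's.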
Red Flags
Watch out for vendors who:
🚩 Refuse to provide usage or adoption metrics
🚩 Show only perfect demo scenarios, never edge cases
🚩 Provide customer testimonials but no quantified outcomes
🚩 Blame users for quality issues ("they need better prompts")
🚩 Can't explain how they measure or improve quality
🚩 Have impressive technology but no evidence of practical value
🚩 Avoid letting you talk to actual users during evaluation
🚩 Can't provide references relevant to your use case
Why Vendors Avoid Quality Discussions
What they say: "Quality is highly subjective and varies by use case."
What it often means:
- They don't track quality metrics systematically
- The metrics they have would look bad
- User adoption is lower than they'd like to admit
- They're betting on sales momentum, not product quality
The truth: Quality is measurable if vendors choose to measure it. If they won't show you data, assume it's not flattering.
Best Practices for Procurement
During Evaluation
- Pilot with Real Users: Test with actual end-users doing real work, not just IT evaluators
- Track Usage Metrics: Measure how often users choose to use the tool during the pilot (see the scorecard sketch after this list)
- Collect Feedback: Survey users about quality, relevance, and UX satisfaction
- Benchmark Against Alternatives: Compare to existing tools or manual processes
- Test Edge Cases: Don't just test the happy path—find the failure modes
- Request References: Talk to users (not executives) at reference accounts
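A pilot produces a lot of anecdotes, so it helps to roll the signals above (voluntary usage, satisfaction, edge-case performance) into one simple scorecard. The sketch below assumes you collect per-user session frequency, survey ratings, and a judged pass rate on hard queries; the thresholds are placeholders to tune against your own baseline, not a standard.

```python
# Sketch of a pilot scorecard: adoption, satisfaction, and edge-case
# performance rolled into a go/no-go summary. Field names and thresholds
# are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PilotUser:
    user_id: str
    sessions_per_week: float     # how often they chose the tool unprompted
    avg_rating: float            # 1-5 satisfaction from in-pilot surveys
    edge_case_pass_rate: float   # share of their hard queries judged useful

pilot = [
    PilotUser("u1", 6.0, 4.3, 0.70),
    PilotUser("u2", 0.5, 2.1, 0.40),
    PilotUser("u3", 4.0, 4.0, 0.65),
]

adopters = [u for u in pilot if u.sessions_per_week >= 3]   # used it by choice
adoption_rate = len(adopters) / len(pilot)
avg_rating = sum(u.avg_rating for u in pilot) / len(pilot)
edge_pass = sum(u.edge_case_pass_rate for u in pilot) / len(pilot)

print(f"Adoption: {adoption_rate:.0%}  Avg rating: {avg_rating:.1f}/5  Edge-case pass: {edge_pass:.0%}")
# Example gate; tune thresholds to your own manual-process baseline.
go = adoption_rate >= 0.6 and avg_rating >= 3.5 and edge_pass >= 0.6
print("Recommendation:", "proceed" if go else "do not proceed / renegotiate")
```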
In Contracts
- Quality SLAs: Minimum accuracy/relevance thresholds if possible
- Usage Guarantees: Right to pause or exit if adoption stays below X%
- Reference Rights: Right to contact customer references throughout relationship
- Improvement Commitments: Vendor obligation to address systematic quality issues
Post-Deployment
- Monitor Usage: Track active users, session frequency, query volume
- Measure Quality: Implement user feedback mechanisms such as thumbs up/down and ratings (see the monitoring sketch after this list)
- Survey Regularly: Collect structured user feedback quarterly
- Calculate ROI: Track time savings, cost reductions, or revenue impact
- Compare Alternatives: Periodically test competing solutions
- Address Adoption Barriers: Investigate and resolve causes of low usage
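Here is a minimal sketch of the usage and feedback monitoring above, assuming a hypothetical event log of queries and thumbs-up/down ratings; substitute whatever your telemetry actually records.

```python
# Sketch: post-deployment monitoring of active usage and thumbs-up/down
# feedback, bucketed by ISO week. The event log format is a hypothetical
# example, not a specific product's schema.
from collections import defaultdict
from datetime import date

# (user_id, day, event_type) where event_type is "query", "thumbs_up", or "thumbs_down"
events = [
    ("alice", date(2024, 9, 2), "query"),
    ("alice", date(2024, 9, 2), "thumbs_up"),
    ("bob",   date(2024, 9, 3), "query"),
    ("bob",   date(2024, 9, 3), "thumbs_down"),
    ("carol", date(2024, 9, 4), "query"),
]

weekly = defaultdict(lambda: {"users": set(), "up": 0, "down": 0})
for user, day, kind in events:
    week = day.isocalendar()[:2]          # (year, ISO week number)
    bucket = weekly[week]
    if kind == "query":
        bucket["users"].add(user)
    elif kind == "thumbs_up":
        bucket["up"] += 1
    elif kind == "thumbs_down":
        bucket["down"] += 1

for week, b in sorted(weekly.items()):
    rated = b["up"] + b["down"]
    helpful = b["up"] / rated if rated else float("nan")
    print(f"Week {week}: {len(b['users'])} active users, helpful rate {helpful:.0%}")
```

A weekly table like this is often all it takes to spot the adoption and quality trends that determine whether the deployment is paying off.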
Real-World Impact
Case Study: Demo vs. Reality
Scenario: AI research tool delivered perfect answers in demos. In production, 40% of queries produced irrelevant results.
Root Cause: Demo used curated test queries and cherry-picked document set. Real data was messier and queries more varied.
Outcome: User adoption crashed after 2 weeks. Tool was abandoned. $200K implementation cost lost.
Lesson: Always pilot with real users and real data, not sanitized demos.
Case Study: Adoption as Signal
Scenario: Two AI tools piloted side-by-side for customer support. Tool A had better technology on paper. Tool B had simpler UX.
Tool A Results: 25% active users after 30 days. Average session: 2 minutes. Users complained it was "too complicated."
Tool B Results: 78% active users after 30 days. Average session: 15 minutes. Users said it "just works."
Decision: Deployed Tool B. Technology specs don't matter if people won't use it.
Case Study: Quality Degradation
Scenario: AI summary tool worked great for 6 months. Then summaries became verbose and less focused.
Root Cause: Vendor switched underlying model without notice. New model had different behavior.
With Monitoring: Team detected quality drop within 1 week via user ratings. Escalated to vendor. Model reverted within 2 days.
Without Monitoring: Would have taken months to discover. User trust would have eroded silently. Tool adoption would have collapsed.
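The kind of check that caught this regression can be very simple: compare the recent average user rating against a longer baseline window and alert on a drop. The sketch below shows the idea; the window sizes, alert threshold, and sample ratings are illustrative assumptions.

```python
# Sketch: detect a quality regression by comparing recent average user
# ratings to a baseline window. Windows, threshold, and data are
# illustrative placeholders.
from statistics import mean

daily_avg_rating = [4.2, 4.3, 4.1, 4.2, 4.3, 4.2, 4.1,   # stable baseline
                    3.4, 3.3, 3.5]                        # after a silent model swap

BASELINE_DAYS, RECENT_DAYS, MAX_DROP = 7, 3, 0.5

baseline = mean(daily_avg_rating[-(BASELINE_DAYS + RECENT_DAYS):-RECENT_DAYS])
recent = mean(daily_avg_rating[-RECENT_DAYS:])

if baseline - recent > MAX_DROP:
    print(f"ALERT: rating dropped from {baseline:.2f} to {recent:.2f}, escalate to vendor")
else:
    print(f"OK: baseline {baseline:.2f}, recent {recent:.2f}")
```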
Quality Evaluation Framework
Output Quality Dimensions
Accuracy: Is the information factually correct?
- Test: Compare outputs to ground truth for verifiable queries
Relevance: Does it answer the actual question?
- Test: User ratings on "was this helpful?"
Completeness: Does it provide sufficient depth?
- Test: Do users need follow-up queries, or is the first response enough?
Conciseness: Is it appropriately brief?
- Test: Can users quickly extract value or do they have to read pages?
Consistency: Are similar queries answered similarly?
- Test: Submit the same query multiple times and measure variation (see the sketch after these dimensions)
Source Quality: Are citations credible and authoritative?
- Test: Review source documents for quality and relevance
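The consistency test above is easy to script. The sketch below sends the same query several times and measures how much the answers vary using token-overlap (Jaccard) similarity; ask_model is a stand-in for whatever API the vendor actually exposes, and the similarity measure is one simple, dependency-free option rather than the only reasonable choice.

```python
# Sketch: consistency test. Submit the same query N times and measure how
# much the answers vary. ask_model is a placeholder for the real vendor API.
import random
from itertools import combinations

def ask_model(query: str) -> str:
    # Placeholder: replace with the real vendor API call.
    return random.choice([
        "Revenue grew 12% year over year, driven by the enterprise segment.",
        "Year-over-year revenue growth was 12%, mostly from enterprise sales.",
        "Revenue increased about 12%, led by enterprise customers.",
    ])

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity: 1.0 means identical word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

query = "Summarize last quarter's revenue trend."
answers = [ask_model(query) for _ in range(5)]
scores = [jaccard(a, b) for a, b in combinations(answers, 2)]
print(f"Mean pairwise similarity: {sum(scores) / len(scores):.2f} (1.0 = identical answers)")
```

Run the same harness against your own queries before and after any announced model change to catch regressions like the one in the case study above.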
User Experience Dimensions
Ease of Use: Can users be productive without extensive training?
- Measure: Time to first successful query
Speed: Are responses fast enough for the workflow?
- Measure: P50, P95, P99 response times (see the sketch after these dimensions)
Reliability: Does it work consistently or fail unpredictably?
- Measure: Error rate, timeout frequency
Learnability: Do users improve with experience?
- Measure: Quality of queries over time
Satisfaction: Do users enjoy using it, or merely tolerate it?
- Measure: NPS, CSAT surveys
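For the speed and reliability measures above, a request log is enough to get started. The sketch below computes P50/P95/P99 response times with a nearest-rank percentile and a simple error rate; the log format and percentile method are assumptions, and in production you would pull these from real telemetry.

```python
# Sketch: latency percentiles and error rate from a hypothetical request log.
latencies_ms = [420, 510, 390, 2300, 460, 480, 530, 610, 450, 5200]
errors = 3          # failed or timed-out requests
total = len(latencies_ms) + errors

def percentile(values, p):
    """Nearest-rank percentile over a sorted copy of the values."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
print(f"Error rate: {errors / total:.1%}")
```

Tail latencies (P95/P99) matter more than averages here: a tool that is usually fast but occasionally hangs for seconds is exactly the kind of friction that quietly kills adoption.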
Key Takeaway
Quality and usability are the only metrics that actually matter for ROI.
You can have perfect transparency into a terrible system. You can have complete control over a tool nobody wants to use. You can have an exit strategy from a product that delivers no value.
None of that matters if the tool doesn't work.
During evaluation:
- Demand evidence, not promises
- Pilot with real users doing real work
- Track adoption and usage rigorously
- Talk to actual users at reference accounts
- Test edge cases and failure modes
If it doesn't deliver value in the pilot, it won't deliver value in production. Vendor promises won't change that.