How do you prototype an AI service?

Building a functional AI prototype to test a caregiving service for working family caregivers, and learning that the medium itself is the design problem.

AgeTech · Prototyping AI · Service Blueprint · User Research

Client: Fortune 500 insurer (innovation team)
Year: 2026
Role: UX & service design lead

50M+ U.S. family caregivers

$7K Average annual out-of-pocket

12 Caregivers tested across two structurally different versions

8 mo Concept already in development when I joined

A concept with eight months of work behind it, and a real decision point ahead.

An innovation team inside a conservative Fortune 500 company had spent eight months developing a concept for an AI-powered caregiving service. The idea had legs. But a concept is not a service, and positive reactions in concept testing are not the same as signal from real users.

This case study covers one part of a months-long engagement: how I designed and tested a functional AI prototype of the service’s core experience, what failed in the first version, and how a mid-study redesign rebuilt the underlying interaction model rather than just the user flow.

The team had a concept for a service aimed at employed family caregivers: people working full time while also managing the care of an aging parent or other elderly family member. At its center was an AI that could onboard a caregiver, generate a personalized multi-year forecast of care needs and costs, and serve as an always-available caregiving coach.

Foundational research was done. Archetypes existed. High-level concept testing had shown that caregivers responded positively to the idea. But the team was approaching a real decision point: was there enough signal to justify the next step, investing in the early technology needed to run a beta?

To answer that, we needed something closer to the actual experience. We needed to know whether people would engage with the AI, whether the tone would feel useful rather than cold or overwhelming, and whether a forward-looking care forecast would mean anything to someone managing a parent’s decline.

I joined eight months in. Scope was: build the thing, then learn from it.

A note on where I was coming in. I’d been using AI tools heavily for a couple of years, both for client work and as a thinking partner across most of what I do. I had strong instincts for what these models do well, where they fail, and how to push them past surface-level output. What I hadn’t done was build a custom GPT. Designing a system prompt, writing behavioral rules, structuring knowledge files, governing how a model interacts with users at scale: that was new. I was working at the edge of what I knew, with enough fluency in the underlying tools to know what questions to ask.

I joined well after the foundational work was complete. For this experiment, my scope included designing and building a prototype functional enough to generate real signal, then testing it with caregivers and synthesizing what we learned.

What I owned:

The interaction model: how the service should ask, listen, surface insight, and produce artifacts
The full questionnaire architecture: what to ask, in what order, with what branching logic, and why
A modular knowledge file system: condition trees, modifier stacking rules, multi-condition routing, forecast specifications, and compassionate language rules, all consulted by the system prompt at the right phase
The system prompt, behavioral rules, crisis protocols, and knowledge file orchestration that governed how the GPT interacted with users
A compassionate language framework: what the AI could and couldn’t say, how to handle emotionally charged moments, how to stay helpful without drifting clinical or patronizing
The design and facilitation of a moderated prototype testing study with 12 caregivers across two cohorts
The mid-study redesign that rebuilt the interaction model based on what the first cohort revealed
Research logistics and client management

I didn’t know I’d have to create all of these things when I started. I was developing my own deliverables and methods as I went.

Prototyping a digital service is solved. Prototyping an AI service isn’t.

Wireframes, clickable prototypes, Wizard of Oz methods: the field has good tools for simulating a service that hasn’t been built yet.

Prototyping an AI service is different. The medium is the message in a way that’s hard to fake. A caregiver interacting with a static mockup of an AI interface isn’t learning whether they trust it, whether the tone lands, whether the questions feel invasive or genuinely useful, whether they’d come back to it. They’re just looking at boxes on a screen.

For this concept to be testable, the prototype had to actually behave like the service. It had to ask questions, adapt to answers, handle emotional content gracefully, and produce output that felt credible enough to react to. The forecast didn’t need to be numerically accurate. The data science team would build the real model later. But it needed to feel real. Only then can you learn what’s actually wrong with it.

I built a custom GPT through the client’s enterprise OpenAI account.

Two intertwined problems: what to ask, and how to ask it.

What to ask, and the rabbit hole I climbed out of

The forecast hadn’t actually been defined when I joined. I led the working sessions that defined what a useful forecast should include: care trajectory, cost over time, signs to watch for in the care recipient, signs to watch for in the caregiver. Once we had a target output, I started reverse-engineering the inputs.

The first instinct was to go deep on medical conditions. People over 65 rarely have a single condition. They have combinations: diabetes and heart disease, hypertension and early cognitive change, cancer plus everything else. I started building a knowledge base around the most common combinations and their projected impact on care trajectory and cost.

A few iterations in, I stopped. The custom GPT couldn’t actually do that computation. It was a predictive language model sitting on top of ChatGPT, not a statistical engine, and trying to encode that level of medical specificity was both technically wrong for the medium and unnecessary for what we were trying to learn. The forecast didn’t need to be statistically precise. That’s the data science team’s job in the next phase. It needed to be close enough to be reactable: real enough that a caregiver could tell us whether the output meant something to them.

I cut the comorbidity layer and refocused on the questions themselves: what to ask, in what order, with what rules governing how the GPT asked them. That’s where the leverage was for this prototype.

I organized the question set into tiers because a flat question list couldn’t do two things at once: collect enough to make the forecast feel real, and respect the limits of someone already at their limit.

v1 questionnaire architecture

A flat question list couldn't do two things at once. Tiers separated the goals.

Tier 1 · Required · every user

Care recipient relationship, age, U.S. state, primary condition. Geographic state mattered because care costs vary 3x across regions and the forecast was unusable without it.

Tier 2 · High-value · most users

Living arrangement, current help in place, recent changes in status. The questions that take a generic forecast and make it specific to this family.

Tier 3 · Enrichment · if willing to go deeper

Financial picture, family dynamics, what's working and what isn't. Optional. If answered, the artifact got meaningfully sharper.

How to ask it

These are not easy questions. Describing a parent’s cognitive decline, financial stress, what someone is most afraid of: that’s a lot to hand to an interface you just met. The GPT needed adaptive skip logic so it wouldn’t re-ask what it already knew, judgment about when to acknowledge an answer emotionally and when to simply move forward, and a tightly bounded vocabulary.
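
The tiering and the skip logic described above can be sketched in a few lines. This is an illustration only, not the prototype's implementation: the prototype expressed these rules in a system prompt and knowledge files, not in code. The field keys, question wording, and the `next_question` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    key: str   # hypothetical field name for the answer
    tier: int  # 1 = required, 2 = high-value, 3 = enrichment
    text: str  # hypothetical wording, not the prototype's


QUESTIONS = [
    Question("relationship", 1, "Who are you caring for?"),
    Question("age", 1, "How old are they?"),
    Question("state", 1, "Which U.S. state do they live in?"),
    Question("primary_condition", 1, "What is the main health concern?"),
    Question("living_arrangement", 2, "Where do they live, and with whom?"),
    Question("current_help", 2, "What help is already in place?"),
    Question("recent_changes", 2, "Has anything changed recently?"),
    Question("financial_picture", 3, "What does the financial picture look like?"),
]


def next_question(known: dict, max_tier: int):
    """Return the next unanswered question at or below max_tier.

    Lower tiers come first. Anything already in `known` is skipped,
    which is the "don't re-ask what it already knows" rule.
    """
    for question in sorted(QUESTIONS, key=lambda q: q.tier):
        if question.tier <= max_tier and question.key not in known:
            return question
    return None


# The caregiver already mentioned the relationship and age in passing,
# so the next prompt is the first Tier 1 question still unanswered.
known = {"relationship": "father", "age": 82}
print(next_question(known, max_tier=2).key)  # state
```

Raising `max_tier` to 3 only when a caregiver signals willingness to go deeper keeps the enrichment questions optional.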

The vocabulary work came out of testing with friends and colleagues before we ever showed the prototype to a caregiver. One colleague hit the words “vulnerable” and “decline” and stopped. She wasn’t ready to hear them. They read as clinical and final, and the framing made her feel worse about a parent who was actually doing fine. Another tester didn’t identify as a caregiver yet, even though she was managing her parent’s finances and starting to oversee medical appointments. The word “caregiver” itself was a barrier.

Two boundaries got hard-coded into the system after that. The GPT could not give medical advice or try to diagnose anyone based on described behaviors, full stop. And it could not use clinical language that named what someone was afraid of before they were ready to name it themselves. I built a boundaries-and-guidelines document to govern tone and word choice, and I learned through restructuring that the GPT was more reliable when fed several short reference files rather than a few long ones.

Over-acknowledging after every response also started to feel patronizing fast. We learned that by watching colleagues react to it. One internal tester, several questions in, typed mid-session: “Don’t give me a full summary after every question.” The system was reflecting too much, and the reflection itself was the problem.

Caregiving onboarding · prototype

Coach

Got it, thanks. Before we go further, can you tell me a little about who you're caring for, and how this past week has been?

Caregiver

My dad. We can't get him to stop driving. He hides the keys.

Coach

Driving is one of the most common conflict points when an older parent is showing early cognitive change. Was there a specific moment that brought it up today, or has it been building for a while?

Caregiver

It's been building. I just don't know how to take the keys without making him feel like he's lost everything.

A constraint worth naming about the testing itself.

Client privacy and security policies meant participants couldn’t use the GPT directly. They watched the conversation on a shared screen and spoke their answers aloud while a facilitator typed them in. That changes the interaction. People dictating to another human edit themselves more and take fewer risks with what they share than they would typing into a private chat window alone at their kitchen table.

The signal on flow, language, and engagement was strong enough to act on. But it’s worth being honest about what that data was and wasn’t, especially heading into a beta where real users will interact with the system on their own.

Seven participants in, I started building the next version after hours.

We ran twelve caregivers in March 2026, recruited through an online research platform. All were employed full-time and actively caring for an aging family member. The original plan was to run all twelve through a single version of the prototype. Several sessions in, it was clear that wasn’t going to be a productive use of the remaining slots. The flow was failing in a way facilitation tweaks weren’t going to fix. People were worn down by the time they got to the forecast, and the forecast was landing flat because it followed a long stretch of effort with no payoff in between.

I couldn’t stop the study cleanly. The remaining cohort 1 sessions were already on the calendar and the v2 build wasn’t ready, so seven participants finished on the original flow while I worked on the redesign in the off hours. The remaining five participants ran on v2. The split turned out to be useful: I had a clean comparison between two structurally different versions of the same concept, tested with comparable participants, in the same study.

What the first cohort showed

The v1 system was a sequential planning interview. It moved through six phases: welcome, care receiver questionnaire, caregiver questionnaire, forecast generation, situational guidance, return. The architecture was a linear pipeline. The user answered, the system reflected, the user answered again. After all of that, they were shown a profile for their loved one, then a profile for themselves, then the forecast, then a dashboard. The feature they were most excited about, situational guidance where they could ask the chatbot about their specific caregiving challenges, came last.

The structure was wrong in two ways at once.

People were spent before they got to the value. One of our explicit research goals was to understand how long an onboarding flow caregivers would tolerate. There’s evidence from Noom and other health and fitness services that long onboarding flows aren’t necessarily a barrier. We learned that ours was. By the time participants reached the forecast, they had given a lot and received nothing. One participant said it out loud at the end of the questionnaire, before the forecast even rendered: “I just gave you all that information.” The disappointment was already in the room.

I just gave you all that information.

v1 participant

The conversational format had a structural problem. During one mode, the GPT presented a numbered list of ten yes/no screening questions. The user worked through them. The conversation continued. A second numbered list appeared further down the thread. By that point the first list had scrolled off screen. When the user said “1,” they were responding to the second list, but the GPT interpreted it against the first and pulled a condition that didn’t apply.

Conversational stack failure · v1, cohort 1

The GPT held the whole thread in memory. The user could only see what was on screen.

What the GPT had in memory · First numbered list · earlier in thread

Coach

Quick screening — for each, just answer yes or no:
1. Diabetes (Type 2)
2. Heart failure or coronary artery disease
3. COPD or another chronic lung condition
4. Cancer, current or in the past five years
5. Stroke or significant cognitive decline
… continues to 10

Caregiver

No, no, yes, no, no, yes, no, no, no, no.

… many turns of probes, follow-ups, and reflection in between …

What the user could see on screen · Second numbered list · current view

Coach

A few more for the daily-living picture:
1. Has Robert had any falls in the past six months?
2. Any new hospitalizations?
3. Any new medications added recently?

Caregiver

1

Coach

Got it — confirming diabetes. A few follow-ups about Robert's blood sugar management…

User answered the falls question. The GPT matched against the first list.
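
One way to state the missing rule is that a bare number should only ever resolve against the list the user can currently see. The sketch below illustrates that rule under assumptions: the `Thread` class and its methods are hypothetical, and a custom GPT has no such application layer, so in the prototype this could only be approximated through prompt instructions.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    # Every numbered list the coach has shown, oldest first.
    lists: list = field(default_factory=list)

    def show_list(self, items):
        """Record a numbered list as the one now on screen."""
        self.lists.append(list(items))

    def resolve(self, reply: str):
        """Map a bare number to an item in the most recent list only.

        Returns None when the reply is not a number or is out of range,
        so the caller can ask for clarification instead of guessing.
        """
        if not self.lists:
            return None
        cleaned = reply.strip().rstrip(".")
        if not cleaned.isdecimal():
            return None
        current = self.lists[-1]
        index = int(cleaned) - 1
        if 0 <= index < len(current):
            return current[index]
        return None


thread = Thread()
thread.show_list(["Diabetes (Type 2)", "Heart failure or coronary artery disease"])
thread.show_list(["Falls in the past six months", "New hospitalizations", "New medications"])
print(thread.resolve("1"))  # Falls in the past six months
```

Earlier lists stay in the record but are never consulted, which mirrors what the user can actually see.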

Then the last cohort 1 session pointed at something deeper. The participant’s mother was in the final stage of dementia and probably had only months to live. About fifteen minutes in, the system asked her to describe her mother’s personality. She answered: “Because she is end-stage it’s hard to describe. She’s not coherent and is muttering a lot to herself. She isn’t herself.” The system absorbed it gracefully and continued. A question about what kinds of help her mother was getting. Another about who else was involved. Another about what had changed recently. Each one took her further from what she actually needed in that moment. We aborted the session and ran her through a hypothetical version where her mother was earlier-stage instead.

The redesign decision had already been made by then. What this session showed was that the failure went beyond what we’d diagnosed. v1 had no graceful path for someone in the late stages of caregiving. The design assumed a user looking ahead at a journey, not a user already most of the way through one. That assumption needed to change before any version of this service shipped.

The redesign: from sequential interview to modal tool.

The v2 system wasn’t a re-sequenced v1. It was a different interaction model.

v1 asked. v2 helped.

v1 · Sequential interview

The user answers, the system reflects, the user answers again.

Welcome
Care receiver questionnaire
Caregiver questionnaire
Forecast generation
Situational guidance
Return
User

Knowledge files consulted by phase — condition trees, modifier stacking rules, compassionate language rules, forecast templates

v2 · Modal tool

The user picks a task. The system produces an artifact.

The Dispatcher

Triage report

The Checkup

Gap report by priority

The Planner

Filled-in planning document

The Pressure Valve

Sorted priority list

User

The Money Map

Scenario cost summary

The Family Sync

Family alignment document

The Rehearsal Studio

Conversation prep sheet

The Translator

Doctor prep sheet

Forecast · Global action, available at any point, not the destination of a sequence

Three of the v2 modes, each producing a different kind of artifact. Lo-fi by design — these are illustrations of an interface, not a working product.

Mode 01 · Situational Guidance

Care Guide

Ask Care Guide

Edit preferences

Caregiving guidance personalized for you

Not sure where to start? You can ask about…

Understanding memory changes · Setting up scam protection · Prepping for cognitive evaluation · Building a support network · Talking to your employer · + More

Start the conversation…

Artifact: a personalized answer to a caregiver's specific question.

Mode 02 · Dashboard

Care Guide

Dashboard

Action items

Top priority

Install call-blocking and scam protection
Schedule cognitive evaluation
Arrange physical therapy for knee strength

View full list

Caregiver

The caregiver

Adult child, primary caregiver

Profile complete: 100%

Current caregiving tasks: 20–40 hrs/week

View full profile

Forecast summary

Forecast completeness: 80%

Care outlook

Mobility support needs may gradually increase
Memory changes may require structured oversight
Family involvement supports staying at home

View full forecast

Care recipient

The care recipient

Parent, age 82

Profile complete: 50%

Lives with adult child, multigenerational household

View full profile

Artifact: prioritized action items, drawn from the caregiver and care-recipient profiles.

Mode 03 · Care Forecast

Care Guide

Care forecast

Next 0–3 months · Next 3–12 months · Next 1–3 years · Next 4+ years

Based on the most recent profile and forecast questionnaire. Reminder: this is planning guidance, not a prediction.

Watch out for

Increased unsteadiness or near-falls
More frequent repeated questions
Sharing personal information with phone callers
Withdrawing from daily activities

Recommended action items

Install call-blocking and scam protection
Schedule cognitive evaluation
Arrange physical therapy
Review home fall-prevention plan

Financial forecasting

Potential care costs · next 5 years

$122,000

Caregivers in similar situations spend an average of $7,000 out-of-pocket per year.

Year 1: $8,500
Year 2: $12,500
Year 3: $20,000
Year 4: $32,500
Year 5: $48,500

Part-time companion care, 5–10 hrs/week
Physical therapy and specialist copays
Medication copays and monitoring visits
Home safety upgrades and equipment
Increased supervision as memory changes progress

Artifact: a multi-year care projection, with clinical watch-outs and a cost estimate.
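
As a quick arithmetic check on the mockup above, the five yearly figures should add up to the headline number. The snippet below is only that check. The dollar values are the illustrative ones from the mockup, not output from a real forecast model.

```python
# Yearly cost figures as shown in the Care Forecast mockup (illustrative values).
YEARLY_COSTS = {1: 8_500, 2: 12_500, 3: 20_000, 4: 32_500, 5: 48_500}

total = sum(YEARLY_COSTS.values())
average = total / len(YEARLY_COSTS)

print(total)    # 122000
print(average)  # 24400.0
```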

The new version replaced the six-phase pipeline with eight task-shaped modes, each designed to deliver an actionable artifact within about ten minutes. The Dispatcher: triage what needs attention first. The Checkup: scan for gaps. The Planner: build a coordination system. The Money Map: model what care will cost. The Translator: build a doctor prep sheet or a hard message to a sibling. The Rehearsal Studio: practice a difficult conversation. The Family Sync: align everyone on the same picture. The Pressure Valve: process the emotional weight of a hard moment when nothing else can move yet.

The forecast still existed, but it moved from being the destination of a long pipeline to being a global action available at any point. A caregiver could complete two modes, type “forecast,” and get the same comprehensive output the v1 system had built toward, but only after the system had earned the right to ask for the additional inputs it needed.
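
The structural difference from v1 can be sketched as a routing table: each mode maps to one artifact, and the forecast sits outside the modes as a global action. This is a sketch under assumptions, not the prototype's mechanism. The prototype did this through its system prompt, and the `route` function and its return shape are hypothetical. The artifact names come from the v2 diagram.

```python
# Mode name -> artifact it produces, taken from the v2 diagram.
MODES = {
    "dispatcher": "Triage report",
    "checkup": "Gap report by priority",
    "planner": "Filled-in planning document",
    "pressure valve": "Sorted priority list",
    "money map": "Scenario cost summary",
    "family sync": "Family alignment document",
    "rehearsal studio": "Conversation prep sheet",
    "translator": "Doctor prep sheet",
}

# Actions available from anywhere, regardless of the active mode.
GLOBAL_ACTIONS = {"forecast": "Care forecast"}


def route(user_input: str, active_mode=None):
    """Return (kind, name, artifact) for a request.

    Global actions win over everything, then an explicit mode name,
    then the mode already in progress. Otherwise show the menu.
    """
    command = user_input.strip().lower()
    if command.startswith("the "):
        command = command[4:]
    if command in GLOBAL_ACTIONS:
        return ("global", command, GLOBAL_ACTIONS[command])
    if command in MODES:
        return ("mode", command, MODES[command])
    if active_mode in MODES:
        return ("mode", active_mode, MODES[active_mode])
    return ("menu", None, None)


print(route("forecast", active_mode="checkup"))
# ('global', 'forecast', 'Care forecast')
```

The sketch does not model the "earned the right to ask" condition. A fuller version would check how much the caregiver has already completed before requesting the forecast's additional inputs.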

Five non-negotiable rules governed every mode, each one operationalized from a specific v1 observation.

v2 design rules

Each one operationalized from a specific v1 observation.

No Mirror.

Never reorganize a user's information and present it back as output. If the artifact doesn't contain something the user didn't already know, it has failed.

v1 observation
At the end of the v1 questionnaire, before the forecast even rendered, a participant said out loud: "I just gave you all that information." The disappointment was already in the room.

No Pamphlet Content.

Nothing findable in 30 seconds online. Every piece of content has to demonstrate expertise the user lacks or produce an artifact they couldn't create alone.

v1 observation
v1's situational guidance had a tendency to default to general advice when probes weren't tight enough. Generic advice failed the value test that "I just gave you all that information" had set.

Artifact-First.

The artifact is the deliverable, not the conversation. Every mode ends with a tangible output produced in full.

v1 observation
v1 produced its outputs (profiles, forecast) only at the end of a long pipeline. By the time they appeared, expectations had inflated past what any output could meet.

No Freeform.

Every interaction follows a structured path with probes and outputs. Structure is the value.

v1 observation
When v1 left interaction shape to the GPT, the conversation drifted. The numbered-list scrolling failure was one symptom of that drift: the user lost context the model didn't know was gone.

Invisible Emotional Support.

Never ask "How are you feeling?" If a user shares something heavy, validate with one sentence, then continue the work. The work is the emotional container.

v1 observation
v1's mid-questionnaire check-ins felt patronizing fast. An internal tester typed mid-session: "Don't give me a full summary after every question." The reflection itself was the problem.

The conceptual rules above only mattered if they survived translation into language the GPT could actually run against. Excerpts from the v7 system prompt:

system-prompt-v7.md · 7,974 / 8,000 chars · 2026-03-14

Three excerpts. Each one is the encoded form of a rule the v1 sessions surfaced.

§ 1 · Medical boundary (first instruction)

Never provide differential diagnoses, interpret symptoms, or suggest medical tests. If you find yourself listing possible causes of a symptom, STOP. Acknowledge briefly. Continue the work.

Why first
During pre-cohort testing, the GPT started analyzing symptoms when a caregiver described them during setup. The fix that finally held was a script, not a prohibition: stop, acknowledge, continue. It became the opening line of the prompt so the model can't get to anything else without crossing it.

§ 2 · Voice and language

Never use first person (no "I", no "me"). Banned vocabulary: decline, worsening, severe, critical, stage, impaired, vulnerable, burnout. Every output must contain something the user did not already know. If it could have been written by someone who Googled the condition, it fails.

Why these words
One internal tester hit "vulnerable" and "decline" and stopped: they read as clinical and final, and made her feel worse about a parent who was actually doing fine. The Googled-it-already test came out of forecast sections that read like blog posts when modifier stacking didn't fire.

§ 3 · Listening, not mirroring

Recognition is selective: acknowledge roughly one in four to five user inputs, not every one. Limit validation to a single sentence before continuing the work. Caregivers engage through problem-solving, not comfort. Never reorganize the user's information back to them as the deliverable.

Why selective
An internal tester typed mid-session: "Don't give me a full summary after every question." Over-acknowledgment read as patronizing fast. The fix wasn't less warmth. It was less reflection. The work itself is the emotional container.
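
Rules like the voice and vocabulary constraints in § 2 are easy to state and hard to verify by eye across many transcripts. Below is a minimal sketch of a checker that could be run over saved model outputs. It assumes exact word matching is good enough for a first pass. The function names are hypothetical and nothing like this was part of the prototype. The banned list and the 8,000-character limit come from the excerpt header above.

```python
import re

# Banned vocabulary from § 2 of the system prompt excerpt.
BANNED = ["decline", "worsening", "severe", "critical", "stage",
          "impaired", "vulnerable", "burnout"]

# Character budget shown in the prompt file header (7,974 / 8,000).
PROMPT_CHAR_LIMIT = 8000

# Whole-word matches only. Inflected forms such as "declining" are NOT caught.
BANNED_RE = re.compile(r"\b(" + "|".join(BANNED) + r")\b", re.IGNORECASE)

# "I" is matched in upper case only, "me" in either case.
FIRST_PERSON_RE = re.compile(r"\b(?:I|[Mm]e)\b")


def lint_output(text: str) -> list:
    """Return a list of rule violations found in one model output."""
    problems = []
    for match in BANNED_RE.finditer(text):
        problems.append(f"banned word: {match.group(0).lower()}")
    if FIRST_PERSON_RE.search(text):
        problems.append("first person")
    return problems


def prompt_fits(prompt: str, limit: int = PROMPT_CHAR_LIMIT) -> bool:
    """Check that a system prompt stays within the character budget."""
    return len(prompt) <= limit


print(lint_output("Memory changes may require structured oversight."))  # []
print(lint_output("I think this is a severe decline."))
# ['banned word: severe', 'banned word: decline', 'first person']
```

A check like this only catches surface violations. Whether an output "contains something the user did not already know" still needs a human reader.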

The compassionate language layer carried over. The condition-specific logic lived in the probes inside each mode rather than in a central questionnaire architecture. The system prompt got smaller, the behavioral rules got sharper, and the modular knowledge file structure shifted from “consult by phase” to “consult by mode.”

What the second cohort showed

The difference was immediate. Participants leaned in. In almost every session, we had to tell them we needed to move on because they kept wanting to ask the AI more questions. Several said they learned something they hadn’t known before. Nobody asked what they were getting in exchange for sharing so much personal information. That last point mattered as much as anything else: the interaction felt worth it to them, so the value exchange resolved itself.

The unplanned session that mattered most.

After the study was over and we had presented our findings, the executive stakeholder, the person who would ultimately decide whether the concept justified further investment, asked to try the redesigned prototype. He used his own family’s caregiving experience from several years earlier, working through it as the person he was at the time.

Within minutes he stopped evaluating and started using it. He moved through one of the modes, asked for a forecast, then started asking follow-up questions: what kind of doctor to look for, what to do if his mother-in-law wouldn’t let anyone into her appointments, whether the tool could handle situations beyond memory loss. We had to tell him we needed to wrap up.

We can’t get him to stop.

Note from the session

A recruited participant engaging deeply is a good sign. The person who controls whether a concept gets funded engaging with it as a real user, unable to disengage even after the scheduled session was over: that’s a different kind of evidence. He wasn’t reacting to a concept. He was experiencing a service that was solving a problem he had actually had.

He also exposed the forecast’s biggest content problem. His mother-in-law had been showing early signs of confusion around time and money, and at the time the family hadn’t known whether that was the start of cognitive decline or a temporary response to losing her husband. The forecast didn’t make that distinction. It took her ambiguous early signals and projected a five-year trajectory of progressive deterioration: escalating costs, increasing dependency, eventual facility care.

His reaction was immediate: this would have freaked the family out at the time, and it felt like the tool was making a big assumption about a diagnosis. He was right. Forecasts have to acknowledge ambiguity in the inputs, not paper over it. I flagged it as a content problem the next phase has to solve at the model level.

The client got the signal they needed. The next phase has user testing built in.

The testing gave them enough confidence in the concept’s desirability and the interaction model to justify moving forward. The next phase is substantially larger: building the actual GPT for production, developing the real forecast model with a data science team, and running a two-month beta this fall that will include the AI, the forecast, human care-guide calls, and task-management features. That beta is a meaningful investment. The prototype is what made it defensible.

Three things came out of this work that weren’t in the brief:

Strategic direction. The interaction model from v2 (modal, artifact-first, value before intake) is now a design principle for the service going forward, not a one-time fix to a flow.
Production specification. The guidance content, mode definitions, and example outputs I created are serving as the experience specification for the production AI system.
Testing methodology. When the client suggested evaluating the next GPT internally rather than with users, I argued for continued user testing. The next phase has user testing built in.

Four things that traveled with me out of this project.

Match the build to the question. If the service is AI-delivered, a static mockup won’t tell you whether people trust it, engage with it, or find it useful. The prototype has to behave like the thing. But fidelity has a ceiling, and chasing it past what the question requires is its own failure mode. The comorbidity rabbit hole I climbed out of cost time I could have spent on the parts that mattered.

Sometimes the structure is wrong, not the steps. v2 wasn’t a re-sequenced v1. The questions were mostly fine. The forecast structure was mostly fine. What was wrong was a layer underneath: the assumption that the right way to deliver an AI caregiving service was to interview the caregiver. Diagnosing the failure at that layer, not at the flow layer, is what made the redesign work. Re-sequencing v1 would have produced a slightly less frustrating version of the same wrong thing.

Design for the edges from the start. The participant whose mother was dying wasn’t an outlier to route around. Her session showed that v1 had no graceful path for the late stages of caregiving, and that the design had assumed a user looking forward rather than a user already deep in. The most important design problems often live at the edges of the user population, not in the middle.

Stop the study when the study is wrong. The original plan was twelve participants on one version. After several sessions it was obvious we were burning slots on a flow that was going to keep failing the same way. Redesigning mid-study wasn’t an easy call, but it was the right one. Five more sessions on a known-broken flow would have produced more confirmation, not more learning.
