How do you prototype an AI service?
Building a functional AI prototype to test a caregiving service for working family caregivers, and learning that the medium itself is the design problem.
- Client: Fortune 500 insurer (innovation team)
- Year: 2026
- Role: UX & service design lead
A concept with eight months of work behind it, and a real decision point ahead.
An innovation team inside a conservative Fortune 500 company had spent eight months developing a concept for an AI-powered caregiving service. The idea had legs. But a concept is not a service, and positive reactions in concept testing are not the same as signal from real users.
This case study covers one part of a months-long engagement: how I designed and tested a functional AI prototype of the service’s core experience, what failed in the first version, and how a mid-study redesign rebuilt the underlying interaction model rather than just the user flow.
The team had a concept for a service aimed at employed family caregivers: people working full time while also managing the care of an aging parent or other elderly family member. At its center was an AI that could onboard a caregiver, generate a personalized multi-year forecast of care needs and costs, and serve as an always-available caregiving coach.
Foundational research was done. Archetypes existed. High-level concept testing had shown that caregivers responded positively to the idea. But the team was approaching a real decision point: was there enough signal to justify the next step, investing in the early technology needed to run a beta?
To answer that, we needed something closer to the actual experience. We needed to know whether people would engage with the AI, whether the tone would feel useful rather than cold or overwhelming, and whether a forward-looking care forecast would mean anything to someone managing a parent’s decline.
I joined eight months in. Scope was: build the thing, then learn from it.
A note on where I was coming in. I’d been using AI tools heavily for a couple of years, both for client work and as a thinking partner across most of what I do. I had strong instincts for what these models do well, where they fail, and how to push them past surface-level output. What I hadn’t done was build a custom GPT. Designing a system prompt, writing behavioral rules, structuring knowledge files, governing how a model interacts with users at scale: that was new. I was working at the edge of what I knew, with enough fluency in the underlying tools to know what questions to ask.
I joined well after the foundational work was complete. For this experiment, my scope included designing and building a prototype functional enough to generate real signal, then testing it with caregivers and synthesizing what we learned.
What I owned:
- The interaction model: how the service should ask, listen, surface insight, and produce artifacts
- The full questionnaire architecture: what to ask, in what order, with what branching logic, and why
- A modular knowledge file system: condition trees, modifier stacking rules, multi-condition routing, forecast specifications, and compassionate language rules, all consulted by the system prompt at the right phase
- The system prompt, behavioral rules, crisis protocols, and knowledge file orchestration that governed how the GPT interacted with users
- A compassionate language framework: what the AI could and couldn’t say, how to handle emotionally charged moments, how to stay helpful without drifting clinical or patronizing
- The design and facilitation of a moderated prototype testing study with 12 caregivers across two cohorts
- The mid-study redesign that rebuilt the interaction model based on what the first cohort revealed
- Research logistics and client management
I didn’t know I’d have to create all of these things when I started. I was developing my own deliverables and methods as I went.
Prototyping a digital service is solved. Prototyping an AI service isn’t.
Wireframes, clickable prototypes, Wizard of Oz methods: the field has good tools for simulating a service that hasn’t been built yet.
Prototyping an AI service is different. The medium is the message in a way that’s hard to fake. A caregiver interacting with a static mockup of an AI interface isn’t learning whether they trust it, whether the tone lands, whether the questions feel invasive or genuinely useful, whether they’d come back to it. They’re just looking at boxes on a screen.
For this concept to be testable, the prototype had to actually behave like the service. It had to ask questions, adapt to answers, handle emotional content gracefully, and produce output that felt credible enough to react to. The forecast didn’t need to be numerically accurate. The data science team would build the real model later. But it needed to feel real. Only then can you learn what’s actually wrong with it.
I built the prototype as a custom GPT on the client’s enterprise OpenAI account.
Two intertwined problems: what to ask, and how to ask it.
What to ask, and the rabbit hole I climbed out of
The forecast hadn’t actually been defined when I joined. I led the working sessions that defined what a useful forecast should include: care trajectory, cost over time, signs to watch for in the care recipient, signs to watch for in the caregiver. Once we had a target output, I started reverse-engineering the inputs.
The first instinct was to go deep on medical conditions. People over 65 rarely have a single condition. They have combinations: diabetes and heart disease, hypertension and early cognitive change, cancer plus everything else. I started building a knowledge base around the most common combinations and their projected impact on care trajectory and cost.
A few iterations in, I stopped. The custom GPT couldn’t actually do that computation. A custom GPT is a language model configured with instructions and reference files, not a statistical engine, and trying to encode that level of medical specificity was both technically wrong for the medium and unnecessary for what we were trying to learn. The forecast didn’t need to be statistically precise. That’s the data science team’s job in the next phase. It needed to be close enough to be reactable: real enough that a caregiver could tell us whether the output meant something to them.
I cut the comorbidity layer and refocused on the questions themselves: what to ask, in what order, with what rules governing how the GPT asked them. That’s where the leverage was for this prototype.
I organized the question set into tiers because a flat question list couldn’t do two things at once: collect enough to make the forecast feel real, and respect the limits of someone already at their limit.
A flat question list couldn't do two things at once. Tiers separated the goals.
- Tier 1 · Required · every user: Care recipient relationship, age, U.S. state, primary condition. Geographic state mattered because care costs vary 3x across regions and the forecast was unusable without it.
- Tier 2 · High-value · most users: Living arrangement, current help in place, recent changes in status. The questions that take a generic forecast and make it specific to this family.
- Tier 3 · Enrichment · if willing to go deeper: Financial picture, family dynamics, what's working and what isn't. Optional. If answered, the artifact got meaningfully sharper.
How to ask it
These are not easy questions. Describing a parent’s cognitive decline, financial stress, what someone is most afraid of: that’s a lot to hand to an interface you just met. The GPT needed adaptive skip logic so it wouldn’t re-ask what it already knew, judgment about when to acknowledge an answer emotionally and when to simply move forward, and a tightly bounded vocabulary.
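As a rough illustration of the tiering and skip logic described above: in the prototype this behavior lived in the system prompt and knowledge files, not in code, so the sketch below is only a model of the rule. The field names, question wording, and the `next_question` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    key: str   # hypothetical field name for the stored answer
    tier: int  # 1 = required, 2 = high-value, 3 = enrichment
    text: str  # hypothetical wording, not the prototype's actual copy


QUESTIONS = [
    Question("relationship", 1, "Who are you helping?"),
    Question("age", 1, "How old are they?"),
    Question("state", 1, "Which U.S. state do they live in?"),
    Question("primary_condition", 1, "What is the main health concern?"),
    Question("living_arrangement", 2, "Where do they live now?"),
    Question("current_help", 2, "What help is already in place?"),
    Question("finances", 3, "Would you like to share the financial picture?"),
]


def next_question(known, max_tier):
    """Return the first unanswered question at or below max_tier, else None."""
    for q in sorted(QUESTIONS, key=lambda q: q.tier):
        # Skip logic: never re-ask something the user already told us.
        if q.tier <= max_tier and q.key not in known:
            return q
    return None


# The user volunteered three answers earlier, so those questions are skipped.
known = {"relationship": "mother", "age": 82, "state": "OH"}
print(next_question(known, max_tier=2).key)
```

Raising `max_tier` only when the user is willing to go deeper keeps the enrichment questions optional, which is the point of separating the tiers.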
The vocabulary work came out of testing with friends and colleagues before we ever showed the prototype to a caregiver. One colleague hit the words “vulnerable” and “decline” and stopped. She wasn’t ready to hear them. They read as clinical and final, and the framing made her feel worse about a parent who was actually doing fine. Another tester didn’t identify as a caregiver yet, even though she was managing her parent’s finances and starting to oversee medical appointments. The word “caregiver” itself was a barrier.
Two boundaries got hard-coded into the system after that. The GPT could not give medical advice or try to diagnose anyone based on described behaviors, full stop. And it could not use clinical language that named what someone was afraid of before they were ready to name it themselves. I built a boundaries-and-guidelines document to govern tone and word choice, and I learned through restructuring that the GPT was more reliable when fed several short reference files rather than a few long ones.
Over-acknowledging after every response also started to feel patronizing fast. We learned that by watching colleagues react to it. One internal tester, several questions in, typed mid-session: “Don’t give me a full summary after every question.” The system was reflecting too much, and the reflection itself was the problem.
A constraint worth naming about the testing itself.
Client privacy and security policies meant participants couldn’t use the GPT directly. They watched the conversation on a shared screen and spoke their answers aloud while a facilitator typed them in. That changes the interaction. People dictating to another human edit themselves more and take fewer risks with what they share than they would typing into a private chat window alone at their kitchen table.
The signal on flow, language, and engagement was strong enough to act on. But it’s worth being honest about what that data was and wasn’t, especially heading into a beta where real users will interact with the system on their own.
Several sessions in, I started building the next version after hours.
We ran twelve caregivers in March 2026, recruited through an online research platform. All were employed full-time and actively caring for an aging family member. The original plan was to run all twelve through a single version of the prototype. Several sessions in, it was clear that wasn’t going to be a productive use of the remaining slots. The flow was failing in a way facilitation tweaks weren’t going to fix. People were worn down by the time they got to the forecast, and the forecast was landing flat because it followed a long stretch of effort with no payoff in between.
I couldn’t stop the study cleanly. The remaining cohort 1 sessions were already on the calendar and the v2 build wasn’t ready, so seven participants finished on the original flow while I worked on the redesign in the off hours. The remaining five participants ran on v2. The split turned out to be useful: I had a clean comparison between two structurally different versions of the same concept, tested with comparable participants, in the same study.
What the first cohort showed
The v1 system was a sequential planning interview. It moved through six phases: welcome, care receiver questionnaire, caregiver questionnaire, forecast generation, situational guidance, return. The architecture was a linear pipeline. The user answered, the system reflected, the user answered again. After all of that, they were shown a profile for their loved one, then a profile for themselves, then the forecast, then a dashboard. The feature they were most excited about, situational guidance where they could ask the chatbot about their specific caregiving challenges, came last.
The structure was wrong in two ways at once.
People were spent before they got to the value. One of our explicit research goals was to understand how long an onboarding caregivers would tolerate. There’s evidence from Noom and other health and fitness services that long onboarding flows aren’t necessarily a barrier. We learned that ours was. By the time participants reached the forecast, they had given a lot and received nothing. One participant said it out loud at the end of the questionnaire, before the forecast even rendered: “I just gave you all that information.” The disappointment was already in the room.
The conversational format had a structural problem. During one phase, the GPT presented a numbered list of ten yes/no screening questions. The user worked through them. The conversation continued. A second numbered list appeared further down the thread. By that point the first list had scrolled off screen. When the user said “1,” they were responding to the second list, but the GPT interpreted it against the first and pulled a condition that didn’t apply.
The GPT held the whole thread in memory. The user could only see what was on screen.
First list (scrolled off screen):
1. Diabetes (Type 2)
2. Heart failure or coronary artery disease
3. COPD or another chronic lung condition
4. Cancer, current or in the past five years
5. Stroke or significant cognitive decline
… continues to 10
Second list (still on screen):
1. Has Robert had any falls in the past six months?
2. Any new hospitalizations?
3. Any new medications added recently?
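The mismatch can be modeled in a few lines. This is a sketch of the failure, not of anything the GPT actually ran: the `Thread` class and both resolver methods are hypothetical, and resolving against the most recently shown list is one possible mitigation I am assuming for illustration.

```python
class Thread:
    """Tracks the numbered lists shown in a conversation, oldest first."""

    def __init__(self):
        self.lists = []

    def show_list(self, items):
        self.lists.append(list(items))

    def resolve_first(self, reply):
        # Models the v1 failure: the reply is read against the earliest list.
        return self._pick(self.lists[0], reply)

    def resolve_latest(self, reply):
        # Assumed mitigation: read the reply against what the user can see.
        return self._pick(self.lists[-1], reply)

    @staticmethod
    def _pick(items, reply):
        index = int(reply.strip()) - 1
        if not 0 <= index < len(items):
            raise ValueError(f"no item {reply!r} in this list")
        return items[index]


thread = Thread()
thread.show_list(["Diabetes (Type 2)", "Heart failure or coronary artery disease"])
thread.show_list(["Has Robert had any falls in the past six months?",
                  "Any new hospitalizations?"])

print(thread.resolve_first("1"))   # the condition the user never meant
print(thread.resolve_latest("1"))  # the question the user was answering
```

Giving each list distinct labels (letters for one, numbers for another) would be another way to remove the ambiguity.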
Then the last cohort 1 session pointed at something deeper. The participant’s mother was in the final stage of dementia and probably had only months to live. About fifteen minutes in, the system asked her to describe her mother’s personality. She answered: “Because she is end-stage it’s hard to describe. She’s not coherent and is muttering a lot to herself. She isn’t herself.” The system absorbed it gracefully and continued. A question about what kinds of help her mother was getting. Another about who else was involved. Another about what had changed recently. Each one took her further from what she actually needed in that moment. We aborted the session and ran her through a hypothetical version where her mother was earlier-stage instead.
The redesign decision had already been made by then. What this session showed was that the failure went beyond what we’d diagnosed. v1 had no graceful path for someone in the late stages of caregiving. The design assumed a user looking ahead at a journey, not a user already most of the way through one. That assumption needed to change before any version of this service shipped.
The redesign: from sequential interview to modal tool.
The v2 system wasn’t a re-sequenced v1. It was a different interaction model.
v1 asked. v2 helped.
v1: The user answers, the system reflects, the user answers again.
- Welcome
- Care receiver questionnaire
- Caregiver questionnaire
- Forecast generation
- Situational guidance
- Return
Knowledge files consulted by phase: condition trees, modifier stacking rules, compassionate language rules, forecast templates
v2: The user picks a task. The system produces an artifact.
Ask Care Guide
Caregiving guidance personalized for you
Not sure where to start? You can ask about…
Artifact: a personalized answer to a caregiver's specific question.
Dashboard
Action items · top priority
- Install call-blocking and scam protection
- Schedule cognitive evaluation
- Arrange physical therapy for knee strength
Caregiver: adult child, primary caregiver · profile 100% complete
Forecast summary · forecast completeness 80%
Care outlook
- Mobility support needs may gradually increase
- Memory changes may require structured oversight
- Family involvement supports staying at home
Care recipient: parent, age 82 · profile 50% complete
Artifact: prioritized action items, drawn from the caregiver and care-recipient profiles.
Care forecast
Based on the most recent profile and forecast questionnaire. Reminder: this is planning guidance, not a prediction.
Watch out for
- Increased unsteadiness or near-falls
- More frequent repeated questions
- Sharing personal information with phone callers
- Withdrawing from daily activities
Recommended action items
- Install call-blocking and scam protection
- Schedule cognitive evaluation
- Arrange physical therapy
- Review home fall-prevention plan
Financial forecasting
Potential care costs · next 5 years
$122,000
Caregivers in similar situations spend an average of $7,000 out-of-pocket per year.
- Year 1: $8,500
- Year 2: $12,500
- Year 3: $20,000
- Year 4: $32,500
- Year 5: $48,500
- Part-time companion care, 5–10 hrs/week
- Physical therapy and specialist copays
- Medication copays and monitoring visits
- Home safety upgrades and equipment
- Increased supervision as memory changes progress
Artifact: a multi-year care projection, with clinical watch-outs and a cost estimate.
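A quick arithmetic check on the example forecast shown above, using only the yearly figures from the mockup. The numbers are illustrative prototype output, not a real cost model; the variable names are mine.

```python
# Yearly figures copied from the example forecast (illustrative values).
yearly_costs = {1: 8_500, 2: 12_500, 3: 20_000, 4: 32_500, 5: 48_500}

total = sum(yearly_costs.values())
print(total)  # 122000, matching the five-year headline figure

# Share of the five-year total that falls in the final year.
final_year_share = yearly_costs[5] / total
print(f"{final_year_share:.0%}")

# Year-over-year growth factors: every year costs more than the one before.
growth = [yearly_costs[y] / yearly_costs[y - 1] for y in range(2, 6)]
print(all(g > 1 for g in growth))
```

The yearly breakdown sums to the $122,000 headline, and roughly two fifths of it lands in year five, which is why the shape of the curve matters as much as the total.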
The new version replaced the six-phase pipeline with eight task-shaped modes, each designed to deliver an actionable artifact within about ten minutes:
- The Dispatcher: triage what needs attention first.
- The Checkup: scan for gaps.
- The Planner: build a coordination system.
- The Money Map: model what care will cost.
- The Translator: build a doctor prep sheet or a hard message to a sibling.
- The Rehearsal Studio: practice a difficult conversation.
- The Family Sync: align everyone on the same picture.
- The Pressure Valve: process the emotional weight of a hard moment when nothing else can move yet.
The forecast still existed, but it moved from being the destination of a long pipeline to being a global action available at any point. A caregiver could complete two modes, type “forecast,” and get the same comprehensive output the v1 system had built toward, but only after the system had earned the right to ask for the additional inputs it needed.
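The structural difference from v1 can be sketched as a router: modes are picked by the user, and the forecast is a global action rather than the end of a pipeline. This is a hypothetical model of the interaction, not the prototype's implementation (which was a system prompt); the `route` function and its return values are assumptions.

```python
from typing import Optional, Tuple

# Mode names and purposes as listed in the text above.
MODES = {
    "dispatcher": "triage what needs attention first",
    "checkup": "scan for gaps",
    "planner": "build a coordination system",
    "money map": "model what care will cost",
    "translator": "build a doctor prep sheet or a hard message to a sibling",
    "rehearsal studio": "practice a difficult conversation",
    "family sync": "align everyone on the same picture",
    "pressure valve": "process the emotional weight of a hard moment",
}

# Actions available from anywhere, regardless of the current mode.
GLOBAL_ACTIONS = {"forecast"}


def route(user_input: str) -> Tuple[str, Optional[str]]:
    """Classify input as a global action, a mode choice, or neither."""
    command = user_input.strip().lower()
    if command in GLOBAL_ACTIONS:
        return ("global", command)
    if command in MODES:
        return ("mode", command)
    # Anything else falls back to showing the mode menu.
    return ("menu", None)


print(route("Money Map"))
print(route("forecast"))
```

Checking global actions before modes is what makes the forecast reachable at any point instead of only at the end.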
Five non-negotiable rules governed every mode, each one operationalized from a specific v1 observation.
- No Mirror. Never reorganize a user's information and present it back as output. If the artifact doesn't contain something the user didn't already know, it has failed.
- No Pamphlet Content. Nothing findable in 30 seconds online. Every piece of content has to demonstrate expertise the user lacks or produce an artifact they couldn't create alone.
- Artifact-First. The artifact is the deliverable, not the conversation. Every mode ends with a tangible output produced in full.
- No Freeform. Every interaction follows a structured path with probes and outputs. Structure is the value.
- Invisible Emotional Support. Never ask "How are you feeling?" If a user shares something heavy, validate with one sentence, then continue the work. The work is the emotional container.
The conceptual rules above only mattered if they survived translation into language the GPT could actually run against. Excerpts from the v7 system prompt:
Three excerpts. Each one is the encoded form of a rule the v1 sessions surfaced.
- § 1 · Medical boundary (first instruction)
Never provide differential diagnoses, interpret symptoms, or suggest medical tests. If you find yourself listing possible causes of a symptom, STOP. Acknowledge briefly. Continue the work.
- § 2 · Voice and language
Never use first person (no "I", no "me"). Banned vocabulary: decline, worsening, severe, critical, stage, impaired, vulnerable, burnout. Every output must contain something the user did not already know. If it could have been written by someone who Googled the condition, it fails.
- § 3 · Listening, not mirroring
Recognition is selective: acknowledge roughly one in four to five user inputs, not every one. Limit validation to a single sentence before continuing the work. Caregivers engage through problem-solving, not comfort. Never reorganize the user's information back to them as the deliverable.
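Rules like the voice and vocabulary constraints in § 2 lend themselves to an automated check on sample outputs. The sketch below is a hypothetical review aid, not part of the prototype: the `lint` function is my own, and it only covers the banned-word list and the first-person rule quoted above.

```python
import re

# Banned vocabulary copied from the § 2 excerpt.
BANNED = ["decline", "worsening", "severe", "critical",
          "stage", "impaired", "vulnerable", "burnout"]

# Whole-word, case-insensitive match. Inflected forms such as "declining"
# are NOT caught by this simple pattern.
BANNED_RE = re.compile(r"\b(" + "|".join(BANNED) + r")\b", re.IGNORECASE)

# First-person pronouns named in the excerpt: "I" and "me".
FIRST_PERSON_RE = re.compile(r"\b(I|[Mm]e)\b")


def lint(text):
    """Return a list of rule violations found in one model output."""
    problems = [f"banned word: {m.group(0).lower()}"
                for m in BANNED_RE.finditer(text)]
    if FIRST_PERSON_RE.search(text):
        problems.append("first person")
    return problems


print(lint("Memory changes may require structured oversight."))
print(lint("I think this is a severe decline."))
```

A check like this only catches surface violations. Whether an output contains something the user did not already know still needs a human reviewer.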
The compassionate language layer carried over. The condition-specific logic lived in the probes inside each mode rather than in a central questionnaire architecture. The system prompt got smaller, the behavioral rules got sharper, and the modular knowledge file structure shifted from “consult by phase” to “consult by mode.”
What the second cohort showed
The difference was immediate. Participants leaned in. In almost every session, we had to tell them we needed to move on because they kept wanting to ask the AI more questions. Several said they learned something they hadn’t known before. Nobody asked what they were getting in exchange for sharing so much personal information. That last point mattered as much as anything else: the interaction felt worth it to them, so the value exchange resolved itself.
The unplanned session that mattered most.
After the study was over and we had presented our findings, the executive stakeholder, the person who would ultimately decide whether the concept justified further investment, asked to try the redesigned prototype. He used his own family’s caregiving experience from several years earlier, working through it as the person he was at the time.
Within minutes he stopped evaluating and started using it. He moved through one of the modes, asked for a forecast, then started asking follow-up questions: what kind of doctor to look for, what to do if his mother-in-law wouldn’t let anyone into her appointments, whether the tool could handle situations beyond memory loss. We had to tell him we needed to wrap up.
We can’t get him to stop.
A recruited participant engaging deeply is a good sign. The person who controls whether a concept gets funded engaging with it as a real user, unable to disengage even after the scheduled session was over: that’s a different kind of evidence. He wasn’t reacting to a concept. He was experiencing a service that was solving a problem he had actually had.
He also exposed the forecast’s biggest content problem. His mother-in-law had been showing early signs of confusion around time and money, and at the time the family hadn’t known whether that was the start of cognitive decline or a temporary response to losing her husband. The forecast didn’t make that distinction. It took her ambiguous early signals and projected a five-year trajectory of progressive deterioration: escalating costs, increasing dependency, eventual facility care.
His reaction was immediate. He said this would have freaked the family out at the time, and that it felt like the tool was making a big assumption about a diagnosis. He was right. Forecasts have to acknowledge ambiguity in the inputs, not paper over it. I flagged it as a content problem the next phase has to solve at the model level.
The client got the signal they needed. The next phase has user testing built in.
The testing gave them enough confidence in the concept’s desirability and the interaction model to justify moving forward. The next phase is substantially larger: building the actual GPT for production, developing the real forecast model with a data science team, and running a two-month beta this fall that will include the AI, the forecast, human care-guide calls, and task-management features. That beta is a meaningful investment. The prototype is what made it defensible.
Three things came out of this work that weren’t in the brief:
- Strategic direction. The interaction model from v2 (modal, artifact-first, value before intake) is now a design principle for the service going forward, not a one-time fix to a flow.
- Production specification. The guidance content, mode definitions, and example outputs I created are serving as the experience specification for the production AI system.
- Testing methodology. When the client suggested evaluating the next GPT internally rather than with users, I argued for continued user testing. The next phase has user testing built in.
Four things that traveled with me out of this project.
Match the build to the question. If the service is AI-delivered, a static mockup won’t tell you whether people trust it, engage with it, or find it useful. The prototype has to behave like the thing. But fidelity has a ceiling, and chasing it past what the question requires is its own failure mode. The comorbidity rabbit hole I climbed out of cost time I could have spent on the parts that mattered.
Sometimes the structure is wrong, not the steps. v2 wasn’t a re-sequenced v1. The questions were mostly fine. The forecast structure was mostly fine. What was wrong was a layer underneath: the assumption that the right way to deliver an AI caregiving service was to interview the caregiver. Diagnosing the failure at that layer, not at the flow layer, is what made the redesign work. Re-sequencing v1 would have produced a slightly less frustrating version of the same wrong thing.
Design for the edges from the start. The participant whose mother was dying wasn’t an outlier to route around. Her session showed that v1 had no graceful path for the late stages of caregiving, and that the design had assumed a user looking forward rather than a user already deep in. The most important design problems often live at the edges of the user population, not in the middle.
Stop the study when the study is wrong. The original plan was twelve participants on one version. After several sessions it was obvious we were burning slots on a flow that was going to keep failing the same way. Redesigning mid-study wasn’t an easy call, but it was the right one. Five more sessions on a known-broken flow would have produced more confirmation, not more learning.