Inclusion Criteria: a Clinical Research podcast
Thank you for joining Inclusion Criteria: a Clinical Research podcast hosted by me, John Reites. This is an inclusive, non-corporate podcast focused on the people and topics that matter to developing treatments for everyone. It’s my personal project intended to support you in your career, connect with industry experts and contribute to the ideas that advance clinical research.
Inclusion Criteria is the clinical research podcast exploring global clinical trials, drug development, and life‑science innovation. We cover everything in clinical research to deepen your industry knowledge, further your career and help you stay current on the market responsible for the future of medicine.
Our episodes discuss current industry headlines, career tips, trending topics, lessons learned, and candid conversations with clinical research experts working to impact our industry every day.
Watch on YouTube and listen on your favorite podcast app. Thank you for supporting and sharing the show.
Please connect with me (John Reites) at www.linkedin.com/in/johnreites or www.johnreites.com.
The views and opinions expressed by John Reites and guests are provided for informational purposes only. Nothing discussed constitutes medical, legal, regulatory, or financial advice.
AI in Clinical Research: Four Listener Questions w/ Jeremy Franz
John Reites and Jeremy Franz dive into the practical applications of AI in clinical research, focusing on large language models (LLMs), how to select the right model for specific tasks, and the challenges posed by AI hallucinations. They address common questions from listeners, providing insights into the workings of LLMs, the importance of model testing, and strategies to ensure data integrity in clinical trials.
I wish there was just one model that we could pick and that was the best for everything and we could move on with our lives. But the reality is, especially when you're building products on top of models, is that you have to test the task that you're asking the model to do and evaluate its performance on that task. So when we do this, we...
SPEAKER_01:Hey Jeremy, we're back again.
SPEAKER_00:Hey John, good to be back.
SPEAKER_01:Yeah, so we did this topic a few weeks ago on practical AI in clinical research and got some great feedback. Appreciate everybody who listened and who sent us a direct message, a comment, or just sent me a note in general. The funny thing is, in response, we actually got some impractical questions around AI. So I think there are definitely a lot of individuals interested in AI, interested in how it works, interested in how it applies to clinical research. There are actually a lot of questions about more foundational things: how does it work on the back end, and how do you deal with the challenges and gaps they've maybe heard about or seen, but aren't really sure how they practically play out in tools like what we're doing today. So I thought we could take four of those questions and unpack them live here together, and hopefully they'll help our listeners think about, and maybe understand, just a small segment of AI and how it applies to research a little differently. With that, one of the key questions we got was: what are LLMs and how do they work? So Jeremy, start by unpacking that.
SPEAKER_00:Yeah, absolutely. LLM stands for large language model. So think of it as a big computer program that ingests tons of data. And by tons of data, I mean all the text on the internet. It takes in all that data and it is trained to predict the next word in whatever string of text you provide. So given a sentence, it will predict the next word. And so this model has been trained on massive amounts of data, and it's all been trained to predict the next word. And on top of that, we've been able to build a lot of interesting applications using that relatively simple goal or task for the program.
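For readers who want to see the "predict the next word" idea made concrete, here is a toy sketch. It is not how production LLMs are built (those use large neural networks trained on enormous corpora); it only illustrates the core idea of learning from text and guessing a likely next word, and the tiny corpus and function names are made up for the example.

```python
# Toy illustration of "predict the next word" -- NOT how real LLMs are trained,
# just the core idea: learn word-following patterns from text, then guess the
# most likely next word given the current one.
from collections import Counter, defaultdict

corpus = (
    "clinical research improves patient outcomes . "
    "clinical research requires careful data collection . "
    "clinical trials collect patient data ."
)

# Count how often each word follows each other word (a bigram table).
follow_counts = defaultdict(Counter)
words = corpus.split()
for current_word, next_word in zip(words, words[1:]):
    follow_counts[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the most frequent word seen after `word` in the toy corpus."""
    if word not in follow_counts:
        return "<unknown>"
    return follow_counts[word].most_common(1)[0][0]

print(predict_next("clinical"))  # -> "research" (seen twice vs. "trials" once)
print(predict_next("patient"))   # -> "outcomes" (ties resolved by first occurrence)
```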
SPEAKER_01:Yeah. And then LLMs, like you think about the brands of LLMs, can you talk a little bit about what the brands are so people can start to put the names in the right order?
SPEAKER_00:Yeah. So, you know, you kind of start with the model providers, the big three being OpenAI, Anthropic, and Google. They are the companies building these new LLMs from scratch or building them on top of earlier ones. And the actual model itself is what you may know as GPT-4, Claude 3.5, Gemini 2.5. They typically have a family name and then a version number, essentially. OpenAI is famously really bad at naming them, and the version number doesn't always mean what you think it would mean. GPT-4.1 is actually more advanced than GPT-4.5. I don't know why they chose that naming scheme, but the model name is the thing that you're interacting with the most.
SPEAKER_01:Yeah, it makes sense. So you've got these LLM providers, and then you have the models. The second question that came up quite a bit was, you know, what are the models, but it was more around how do you pick the best model? Like, what's the number one model to use, and do you use it?
SPEAKER_00:Yeah, absolutely. I wish there was just one model that we could pick that was the best for everything and we could move on with our lives. But the reality, especially when you're building products on top of models, is that you have to test the task that you're asking the model to do and evaluate its performance on that task. So when we do this, we specify a task. Let's say it's a classification problem. So we have a task that is predicting the quality of someone's response on a one to five scale, with a rubric describing what is a one, what is a five, and everything in between. And we test that on new models as they come out. So for example, o3 is the largest, most advanced reasoning model that OpenAI has released to date. We tested it on this quality score task and it actually performed really badly, worse than GPT-4.1, worse than even GPT-4. And when we dug into the data and looked at the results, the reason it was performing so poorly was that it was making up what it thought was the best possible answer to a question and saying, they didn't talk about this, this, and this, therefore it's not a good answer. So it just proves that just because it's the most advanced model doesn't mean it's actually the best at performing the tasks that you want it to perform.
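A minimal sketch of the kind of task-level evaluation Jeremy describes, under stated assumptions: the rubric text, the example data, and the `dummy_grader` function are hypothetical placeholders, and a real harness would swap in a call to whichever model is being compared (GPT-4.1, o3, and so on) against a much larger human-labeled set.

```python
# Sketch of a task-level evaluation harness, assuming you already have a small
# set of responses with trusted human "gold" quality scores (1-5).
# `grade_response` is a hypothetical stand-in for a call to the model under test.
from statistics import mean
from typing import Callable

RUBRIC = "Score the response 1 (off-topic) to 5 (complete, specific, on-topic)."

labeled_examples = [
    {"question": "How was your sleep this week?", "response": "Fine.", "gold": 2},
    {"question": "How was your sleep this week?",
     "response": "I woke up twice a night and felt tired most mornings.", "gold": 5},
]

def evaluate(grade_response: Callable[[str, str, str], int]) -> dict:
    """Run the grader over the labeled set and report simple agreement metrics."""
    predictions = [
        grade_response(RUBRIC, ex["question"], ex["response"]) for ex in labeled_examples
    ]
    golds = [ex["gold"] for ex in labeled_examples]
    return {
        "exact_match": mean(p == g for p, g in zip(predictions, golds)),
        "mean_abs_error": mean(abs(p - g) for p, g in zip(predictions, golds)),
    }

# Dummy grader so the sketch runs end to end; a real grader would prompt a model
# with RUBRIC plus the question/response and parse the numeric score it returns.
def dummy_grader(rubric: str, question: str, response: str) -> int:
    return 5 if len(response.split()) > 5 else 2

print(evaluate(dummy_grader))  # e.g. {'exact_match': 1.0, 'mean_abs_error': 0.0}
```

Running the same labeled set against each new model release, and comparing the resulting metrics, is what lets you say a "more advanced" model is actually worse on your specific task.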
SPEAKER_01:So the model was smarter than the task it had to complete, by a long shot. Wow, that's crazy. Obviously this is an evolving space. By the time we publish this, it's going to have changed 10 more times at this rate, I'm sure, or some new companies are going to pop up. Thanks, and I hope that helps people, because this is a really dynamic question. And I think the key is that "best model," a phrase we hear a lot, isn't really the right frame. It's models for a purpose. It's fit-for-purpose use. One of the other questions that we got, and maybe it's just because people like using the word hallucination, but this question came up: what are these hallucinations and how do I avoid them? I can't have any of them in clinical research. Too risky. So how do you make sure there are zero hallucinations? So when someone asks you that, Jeremy, or the word hallucination comes up, how do you answer that? How do you unpack that for people?
SPEAKER_00:Yeah, I mean, it's a real challenge. These models are so good at confidently responding to whatever input you provide. It's going to give you a confident answer more often than not, and it's really easy to just trust the confidence. You know, as humans, we're used to expressing uncertainty or hedging or things like that, whereas these models are trained to always provide the right answer, and where things can go wrong is when they're overconfident in the answer they're providing to you. And I think for us, that means having to build specific steps within the process to validate the output from an LLM. So it's a validator step that takes something the model said was part of a response from a study, in our case, looks at the source data, and ensures: does this string of text actually exist in our source data? If it doesn't, we need to escalate it and get it corrected. And that's one example of working within the limitations of the model and adding some extra checks to reduce the likelihood that a hallucination would be shown to an end user.
SPEAKER_01:Yeah, I like that. So let me say it in kind of, um, non-AI English. I'm going to try my best here. If you have participants, like representative patients for a study, and you have some principal investigators and study coordinators, and you screen them, you go out and recruit them, you ask them to answer a series of questions, whether it's Likert scales or talking to their phone with their voice, and they give feedback. And then that data is in the system and you're using AI to actually process that data, because there's a lot of it. The way you're making sure you don't have hallucinations in the mix is because you have a process and a model that references everything, to make sure that that voice came from patient 001 and the AI didn't make up what it thought the voice could say, which is where you're going to get the hallucination. So it, in effect, removes that and adds a reference, so that somebody can trust and know that that's the data that came directly out of their study. Did I get that right?
SPEAKER_00:Exactly. No, that's exactly right. We have a ground truth of what exactly was said in the study, and we're able to compare that with what the AI references or cites. And that's how we maintain integrity with our systems.
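A rough sketch of the validator idea discussed above: check whether each quoted span in a model's output actually appears in the source data, and flag anything that does not for human review. The function names, the normalization rule, and the sample data are assumptions for illustration, not a description of any specific product.

```python
# Sketch of a "does this quote actually exist in the source data?" check.
# Normalization (lowercasing, collapsing whitespace) is a simplifying assumption;
# a production system might also handle punctuation differences or fuzzy matching.
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.lower()).strip()

def validate_quotes(model_quotes: list[str], source_transcript: str) -> list[str]:
    """Return the quotes that could NOT be found in the source.

    Anything returned here should be escalated for human review instead of
    being shown to an end user.
    """
    source = normalize(source_transcript)
    return [q for q in model_quotes if normalize(q) not in source]

source = "Participant 001 said: I had trouble sleeping after the second dose."
quotes = [
    "I had trouble sleeping after the second dose.",  # real quote -> passes
    "The medication completely cured my insomnia.",   # fabricated -> flagged
]
print(validate_quotes(quotes, source))  # -> ['The medication completely cured my insomnia.']
```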
SPEAKER_01:Yeah, it's really interesting. I don't know if I told you, but this actually happened to me a couple of weeks ago. I was doing some research on decentralized clinical trials and I was asking it for some good quotes of the week. And it gave me these three quotes, and one of the quotes was from me. And I read the quote and I was like, that sounds awesome. I didn't say that at all. And I think I messaged you right away. I was like, this looks like a hallucination. And you were like, yeah, you probably said it. And I literally searched and couldn't find it. And so, funny enough, I just left the chat window open and asked it for the reference. I said, can you reference this for me? And long story short, I'm pretty sure ChatGPT just said, hey, if you post it on LinkedIn today, it'll be your quote and then I can reference it for you. And I thought, no, that's called a hallucination. That's incorrect data, and there was no reference point for it. So I think you're right, these systems and these processes to manage the data and make sure it's referenceable and doesn't have hallucinations are so important. So great question, and I appreciate those who offered it. So Jeremy, those were the four questions. There are lots more we'll take, and maybe we'll add them to a future session. So if someone wants to get a hold of you, learn more about what you're up to in AI or just ask you a really specific question, what's the best way they can reach you?
SPEAKER_00:Yeah, just find me on LinkedIn. Message me there or comment on this video, and I'd be happy to answer any questions you might have.
SPEAKER_01:That's great. Well, hey Jeremy, thanks for your time, and if we get more of these questions, I'll see you again real soon. Thanks.