The rapid advancement of generative artificial intelligence (AI) and large language models (LLMs) in clinical medicine is quickly reshaping patient care. The impact is most visible in everyday clinical tasks, where LLMs designed to summarize clinical notes, medications, and other patient data are nearing deployment without oversight from the United States Food and Drug Administration (FDA). The ability of LLMs to create concise, up-to-date clinical snapshots from a range of data sources within electronic health records (EHRs) promises more efficient patient care. However, it also raises concerns about the safety and efficacy of these AI tools in clinical settings, given their capacity to bypass FDA medical device oversight. The prospect of LLMs entering clinical practice without rigorous safety assessment is worrying, and it is a reminder that careful implementation and evaluation remain essential to ensure these tools serve as beneficial aids rather than sources of inadvertent harm.
Summarizing clinical data is an inherently complex task, and LLM outputs for it are highly variable. While these models promise improved clinical documentation and decision support, they also present challenges in ensuring the accuracy, consistency, and reliability of their summaries. Variations in the length, organization, and tone of LLM-generated summaries can influence clinician interpretation and decision-making in subtle but substantial ways. This variability, together with the probabilistic nature of LLMs, raises concerns that these tools could introduce biases or errors into clinical decisions. For example, differences between summaries can emphasize certain patient conditions over others or frame clinical histories in ways that sway diagnostic or treatment pathways. Such nuances underscore the importance of developing comprehensive standards and rigorous testing protocols for LLM-generated clinical summaries to ensure they contribute positively to patient care.
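The probabilistic behavior described above can be illustrated with a deliberately toy sketch. The next-token distribution below is invented for illustration only (real LLMs sample from learned distributions over tens of thousands of tokens); the point is simply that repeated runs of the same prompt can yield summaries with different clinical framings.

```python
import random

# Hypothetical toy next-token distribution, for illustration only.
NEXT_TOKEN = {
    "patient has": [("stable", 0.5), ("worsening", 0.3), ("improving", 0.2)],
}

def sample_summary(prompt: str, seed: int) -> str:
    """Sample one continuation; different seeds stand in for different runs."""
    rng = random.Random(seed)
    tokens, weights = zip(*NEXT_TOKEN[prompt])
    choice = rng.choices(tokens, weights=weights, k=1)[0]
    return f"{prompt} {choice} condition"

# The same input can produce different summaries across runs.
runs = {sample_summary("patient has", seed) for seed in range(10)}
print(runs)
```

Each candidate continuation here corresponds to a materially different clinical picture, which is why run-to-run variability matters more for summarization than for tasks where any fluent output is acceptable.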
The phenomenon of “sycophancy” bias in LLM-generated summaries illustrates the complexity of AI’s interaction with clinical decision-making. LLMs may produce summaries that align too closely with clinicians’ preexisting beliefs, potentially exacerbating confirmation bias and diagnostic error. This issue is particularly pertinent when subtle prompt variations can lead to markedly different summaries, emphasizing the need for clinicians to approach LLM-assisted decision-making with a critical eye. Small errors in summaries, though seemingly minor, can have profound implications for clinical judgments and patient outcomes. These considerations underscore the need for transparency, rigorous testing, and robust standards to mitigate the risks associated with LLM use.
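One form the rigorous testing called for above could take is a consistency check across prompt variants. The sketch below is a minimal, hypothetical harness: the summarize() function is a deterministic mock standing in for an LLM call, wired so that a leading prompt’s framing leaks into the summary, mimicking the sycophancy effect; all names and the sample note are invented for illustration.

```python
def summarize(note: str, prompt: str) -> str:
    # Hypothetical mock of an LLM summarizer: a leading prompt biases
    # the output, mimicking sycophancy. A real harness would call a model.
    if "confirm" in prompt:
        return "findings consistent with suspected pneumonia"
    return "fever and cough; differential includes pneumonia"

def consistency_report(note: str, prompts: list[str]) -> dict:
    """Summarize the same note under each prompt and flag divergence."""
    summaries = {p: summarize(note, p) for p in prompts}
    return {"summaries": summaries,
            "consistent": len(set(summaries.values())) == 1}

report = consistency_report(
    "Pt with fever, cough x3 days.",
    ["Summarize this note.", "Summarize; confirm pneumonia is likely."],
)
print(report["consistent"])  # the neutral and leading prompts diverge here
```

A production version of such a harness would compare many prompt paraphrases and repeated samples per prompt, scoring divergence rather than exact string equality; the structure of the check, however, is the same.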
The path forward for integrating LLMs into clinical settings requires a deliberate approach that combines regulatory oversight, clinical validation, and the development of comprehensive standards for AI-generated summaries. Although current statutes give the FDA no clear legal authority to regulate most LLMs, industry-wide collaboration is needed to establish guidelines that ensure the safety, accuracy, and utility of AI tools in healthcare. This effort should extend beyond large technology companies to include stakeholders from the clinical and scientific communities. By prioritizing standards that address accuracy, bias, and the potential for clinical error, alongside proactive regulatory clarification, healthcare providers can realize the benefits of LLMs while mitigating the risks.