Clinical Review: ChatGPT-4.0 demonstrated comparable performance to gastroenterologists in IBD treatment recommendations.
Background & Rationale
The increasing complexity of inflammatory bowel disease (IBD) management necessitates continuous evaluation of clinical decision-making processes. Artificial intelligence (AI), specifically large language models (LLMs), offers a potential tool to assist clinicians. However, the performance of these LLMs in the context of nuanced IBD treatment scenarios requires rigorous assessment. Currently, LLMs are being explored for a wide range of medical applications, including summarising patient data and generating treatment plans. This study aimed to compare the recommendations generated by ChatGPT-4.0 with those provided by experienced gastroenterologists when presented with identical IBD case scenarios.
Study Design
Researchers conducted a comparative study evaluating therapeutic recommendations for IBD. Fifteen complex clinical cases – covering Crohn’s disease and ulcerative colitis, across varying disease severities and with multiple comorbidities – were constructed by an expert panel. These cases were presented independently to ChatGPT-4.0 and a panel of 12 experienced gastroenterologists (mean 15 years of practice). Participants were asked to provide a treatment plan for each case including initial drug choice, escalation strategy and duration of therapy. Responses were assessed by the expert panel based on alignment with current IBD guidelines and clinical best practice. Each response was scored from 0 to 5, with 5 representing complete agreement.
Patient Population
The case scenarios represented a diverse cohort of patients with IBD. Seven cases involved Crohn’s disease and eight involved ulcerative colitis. Patient ages ranged from 22 to 78 years. Disease presentation ranged from mild-to-moderate disease requiring initial induction therapy to severe disease necessitating hospitalisation and advanced therapies. Cases included patients with both luminal and extraintestinal manifestations, as well as relevant comorbidities such as depression, cardiovascular disease and prior surgery.
Key Findings
Across the 15 cases, ChatGPT-4.0 achieved an average score of 4.1 out of 5, against a mean score of 4.2 for the gastroenterologists. Agreement between ChatGPT-4.0 and the gastroenterologist panel was high, with a median inter-rater reliability of 0.83. In 11 of the 15 cases (73%), the expert panel judged the recommendations from ChatGPT-4.0 to be fully in line with accepted clinical practice. The LLM's ability to select appropriate initial therapies, including aminosalicylates, corticosteroids, thiopurines and biologic agents, was comparable to that of the gastroenterologists. ChatGPT-4.0 consistently recommended a step-up approach to therapy, escalating treatment intensity based on disease response. The LLM also suggested anti-TNF agents in 8 of the 10 cases (80%) in which the gastroenterologists recommended them.
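The headline proportions in these findings can be checked with simple arithmetic. The short Python sketch below does so using only the counts and scores reported above. The variable names and the `percentage` helper are illustrative assumptions, not part of the study, and the sketch performs no statistical testing.

```python
from fractions import Fraction

# Figures taken from the Key Findings and Study Design text (names are illustrative).
TOTAL_CASES = 15
CASES_FULLY_ALIGNED = 11     # ChatGPT-4.0 judged fully in line with accepted practice
ANTI_TNF_EXPERT_CASES = 10   # cases where gastroenterologists recommended anti-TNF agents
ANTI_TNF_LLM_MATCHES = 8     # of those, cases where ChatGPT-4.0 also suggested them
LLM_SCORE = 4.1              # average score for ChatGPT-4.0
EXPERT_SCORE = 4.2           # mean score for the gastroenterologists
MAX_SCORE = 5                # top of the 0 to 5 scoring scale


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(float(Fraction(part, whole) * 100), 1)


fully_aligned_pct = percentage(CASES_FULLY_ALIGNED, TOTAL_CASES)
anti_tnf_pct = percentage(ANTI_TNF_LLM_MATCHES, ANTI_TNF_EXPERT_CASES)

# Absolute gap between the two reported scores, and that gap as a share of the scale.
score_gap = round(EXPERT_SCORE - LLM_SCORE, 1)
score_gap_pct_of_scale = round(score_gap / MAX_SCORE * 100, 1)

print(f"Fully aligned cases: {fully_aligned_pct}%")
print(f"Anti-TNF concordance: {anti_tnf_pct}%")
print(f"Score gap: {score_gap} points ({score_gap_pct_of_scale}% of the scale)")
```

These are descriptive proportions from a 15-case sample. They restate the reported results in percentage terms and should not be read as evidence of statistical equivalence.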
Discussion
This study demonstrates that ChatGPT-4.0 can generate therapeutic recommendations for IBD that are highly concordant with those of experienced gastroenterologists. The high level of agreement suggests the LLM has effectively internalised current IBD management guidelines and can apply this knowledge to complex clinical scenarios. Potential implications for clinical practice include augmenting physician decision-making, streamlining treatment protocols and assisting clinicians in resource-limited settings. LLMs are not without limitations, however: risks stemming from algorithmic bias and a lack of contextual understanding must be addressed. The study did not assess response time or the cost-effectiveness of using LLMs in clinical decision-making.
Authors’ Conclusions
The authors concluded that ChatGPT-4.0 demonstrates a promising capacity to provide IBD treatment recommendations comparable to those of medical experts. They propose that such LLMs could be valuable tools for supporting clinicians in the management of IBD, but emphasise the need for ongoing evaluation and cautious implementation with appropriate oversight.
Reference
Loncour H, Aoun J, Hoyois A, Muls V. Therapeutic Decisions in Inflammatory Bowel Disease: ChatGPT-4.0 Compared to Medical Expertise. JGH Open: An Open Access Journal of Gastroenterology and Hepatology. 2026;10(1):1–11. DOI: 10.1002/jgh3.70386.