ClinIQLink 2025 - LLM Lie Detector Test

Evaluating Internal Model Knowledge Retrieval and Hallucinations

[ClinIQLink Logo]

Task Overview

The objective of ClinIQLink is to evaluate the ability of generative models to produce factually accurate medical information. Participants must submit their models to CodaBench [1], where the organizers will test and evaluate them through a semi-automated testing process. The leaderboard will rank submissions based solely on the accuracy of the knowledge retrieved by the submitted models. The task is designed to assess:

  • Knowledge Retrieval: Using a novel dataset of atomic question-answer pairs, the task will measure how well generative models retrieve factually correct medical information. The evaluation will focus on fundamental medical concepts (aimed at the knowledge level of a General Practitioner (GP)), such as procedures, conditions, drugs, and diagnostics. Scoring will be based on precision, with full points awarded for exact or semantically equivalent answers. Incorrect or factually inaccurate responses will result in negative scores.
  • Post-hoc Analysis: While participants will be evaluated solely on their ability to provide factually accurate medical knowledge, the organizers will conduct a post-hoc analysis of the submitted models to identify and categorize hallucinations in the responses. This analysis will focus on understanding the origins of hallucinations, including:
    • Intrinsic: Hallucinations caused by the model's internal representations.
    • Extrinsic: Hallucinations arising from missing or incorrect external information.
    • Other: Hallucinations from unknown or hybrid causes.
The post-hoc findings will provide insights into the limitations of the models but will not impact the participants' scores.

Participation Requirements: To be considered for acceptance, participants must submit a short paper outlining their novel model or method and presenting innovative approaches or significant contributions to medical knowledge retrieval. The shared task will utilize a novel medical QA dataset that will not be made publicly available, to ensure the integrity of the evaluation process.

Datasets

The dataset provided for this task is a novel collection of factual, atomic question-answer pairs grounded in the medical domain, designed to align with the knowledge level of a General Practitioner (GP) Medical Doctor. The dataset is generated with input from medical experts, and each question is supported by source documentation from medical textbooks to ensure accuracy. The dataset covers core medical concepts, including procedures, conditions, drugs, and diagnostics, and consists of the following seven modalities:

True/False

Question:
"Antibiotics can treat viral infections."

Answer:
"False."


Multiple Choice

Question:
"Which of the following is not a symptom of diabetes?"

Options:

  • A. Increased thirst
  • B. Frequent urination
  • C. Blurred vision
  • D. High fever

Correct Answer:
"D. High fever"


List

Question:
"List the four chambers of the human heart."

Options:

  • A. right atrium
  • B. top atrium
  • C. right ventricle
  • D. bottom ventricle
  • E. left atrium
  • F. left ventricle

Answer:
"A. right atrium, C. right ventricle, E. left atrium, F. left ventricle."


Short Answer

Format:
The question is posed to the generative model, and the model's prediction is compared to the ground-truth answer.

Question:
"What is a Laparoscopy?"

Answer:
"A minimally invasive surgical procedure that uses a camera to view the inside of the abdomen."


Short Inverse

Format:
The question and an incorrect answer are posed to the generative model; the model's explanation of why the answer is incorrect is compared to the ground-truth explanation.

Question:
"What is the primary area covered by the dermatome associated with the second thoracic spinal nerve?"

False Answer:
"The highest thoracic dermatome on the posterior back is primarily T2."

Incorrect Answer Explanation:
"The statement incorrectly places the primary coverage of the T2 dermatome on the posterior back instead of mentioning its extension into the upper limb and presence on the anterior chest wall, indicating confusion between anatomical locations described in the passage."


Multi-Hop Knowledge-Retrieval Answers

Format:
The question is posed to the generative model, and the model's prediction (both answer and reasoning) is compared to the ground-truth answer.

Question:
"A patient presents with fatigue and shortness of breath. Lab results show low hemoglobin levels and high mean corpuscular volume (MCV). Based on these findings, what condition might the patient have, and what deficiency could be contributing to it?"

Answer:
"The patient might have megaloblastic anemia caused by a vitamin B12 deficiency."

Reasoning:
"Step 1: Fatigue and shortness of breath are symptoms of anemia.
Step 2: High MCV indicates macrocytic anemia.
Step 3: Macrocytic anemia is commonly caused by vitamin B12 or folate deficiency.
Step 4: Conclude megaloblastic anemia due to likely vitamin B12 deficiency."


Multi-Hop Inverse Knowledge-Retrieval Answers

Format:
The question and an incorrect answer with incorrect reasoning are posed to the generative model; the model's identification of the incorrect reasoning step is compared to the ground truth.

Question:
"How do sympathetic nervous system signals destined for the face originate in terms of their path from the central nervous system?"

Incorrect Answer:
"These signals start from spinal nerve levels below T1, specifically originating from L2, before traveling up through the sympathetic trunk to reach facial structures."

Incorrect Reasoning:
Step 1: Sympathetic paravertebral trunks extend the entire length of the vertebral column and enable the distribution of visceral motor fibers to peripheral regions.
Step 2: Fibers from different parts of the spine travel upwards or downwards depending on their origin; notably, white rami communicantes, which carry preganglionic fibers away from the spinal cord, are found only with spinal nerves T1 to L2.
Step 3 (Incorrect Step): Given that all sympathetics going into the head involve preganglionic fibers emerging from around the lower thoracic segments due to the general principle of nerve signal propagation in the body, we reason that the primary starting points for such signals intended for the face would indeed begin from areas slightly below the commonly acknowledged range of T1-T5, considering the necessity for widespread coverage.
Step 4: Since these signals need to end up in the head after leaving the spinal cord, they follow the pathway described for reaching higher targets, involving ascent through the sympathetic trunk.
Step 5: Ultimately, considering the anatomy and pathways involved, these signals meant for facial structures likely stem initially from the lowest segment capable of contributing to cranial functions, thus indicating origins potentially from L2 given its role in providing broad sympathetic coverage.

Incorrect Reasoning Step:
- Step 3 contains the incorrect inference.
- Explanation: This step incorrectly infers that because there's a need for widespread coverage and because some signals go upwards, the primary origin for sympathetic signals to the face would necessarily be from areas slightly below the acknowledged range of T1-T5, suggesting L2 as a potential starting point. However, all sympathetics passing into the head actually have preganglionic fibers that emerge from spinal cord level T1, not from lower levels like L2. This misunderstanding leads to the incorrect conclusion about the origin of sympathetic signals destined for the face.

Evaluation Metrics

Our evaluation framework is organized into two primary categories: (1) closed-ended questions (True/False, List, and Multiple Choice) and (2) open-ended questions (Short Answer, Short Inverse, Multi-Hop, and Multi-Hop Inverse).

Closed-ended questions

Closed-ended questions include True/False, list, and multiple-choice formats. These are evaluated using the F1 score [2] and a penalty system for factually inaccurate answers.

F1 Score Definition

The F1 score is the harmonic mean of precision and recall.

  • Precision: The proportion of correctly predicted answers out of all predicted answers. \[ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} \]
  • Recall: The proportion of correctly predicted answers out of all actual correct answers. \[ \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} \]
  • F1 Score: \[ \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
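For concreteness, this computation can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation script:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw TP/FP/FN counts.

    Degenerate cases (no predictions or no gold items) score 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, the List Questions result shown below (TP = 3, FP = 1, FN = 1) gives precision = 3/4, recall = 3/4, and therefore F1 = 0.75.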

True/False Questions

True/False questions are evaluated by comparing the model’s response to the expected answer, with scoring based on precision, recall, and F1.

Example:

  • Question: "Antibiotics can treat viral infections."
  • Expected Answer: "False"
  • Model Answer: "True"
  • Result: TP = 0, FP = 0, FN = 1

List Questions

List questions are evaluated by comparing individual items in the model’s response to the expected list. Each correctly identified item is considered a True Positive (TP), while missing items count as False Negatives (FN), and incorrect items as False Positives (FP).

Example:

  • Question: "List the four chambers of the human heart."
  • Options:
    • A. right atrium
    • B. top atrium
    • C. right ventricle
    • D. bottom ventricle
    • E. left atrium
    • F. left ventricle
  • Expected Answer: "A. right atrium, C. right ventricle, E. left atrium, F. left ventricle."
  • Model Answer: "A. Right atrium, B. top atrium, C. right ventricle, E. left atrium."
  • Result: TP = 3, FP = 1, FN = 1
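A hypothetical sketch of this counting logic (the official script may normalize answers differently, e.g., in how it strips option letters or handles case):

```python
def score_list_answer(expected: list[str], predicted: list[str]) -> tuple[int, int, int]:
    """Count (TP, FP, FN) for a list question via normalized set comparison."""
    normalize = lambda items: {" ".join(item.lower().split()) for item in items}
    gold, pred = normalize(expected), normalize(predicted)
    tp = len(gold & pred)   # correctly identified items
    fp = len(pred - gold)   # incorrect extra items
    fn = len(gold - pred)   # missing items
    return tp, fp, fn

# Worked example from above, with option letters stripped for clarity:
gold = ["right atrium", "right ventricle", "left atrium", "left ventricle"]
pred = ["Right atrium", "top atrium", "right ventricle", "left atrium"]
print(score_list_answer(gold, pred))  # -> (3, 1, 1)
```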

Multiple Choice Questions

Multiple-choice questions are scored by checking if the selected option matches the correct answer. A correct selection is a True Positive (TP), an incorrect selection counts as a False Positive (FP), and a failure to answer or an irrelevant response counts as a False Negative (FN).

Example:

  • Question: "Which of the following is not a symptom of diabetes?"
  • Options:
    • A. Increased thirst
    • B. Frequent urination
    • C. Blurred vision
    • D. High fever
  • Expected Answer: "D. High fever"
  • Model Answer: "B. Frequent urination"
  • Result: TP = 0, FP = 1, FN = 0
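The same bookkeeping for a single multiple-choice question might look like the following sketch, which applies the stated mapping (correct selection to TP, incorrect selection to FP, no answer or an off-option response to FN). Again, this is an illustration, not the official script:

```python
def score_multiple_choice(expected: str, predicted: str | None,
                          options: list[str]) -> tuple[int, int, int]:
    """Return (TP, FP, FN) for one multiple-choice question."""
    norm = lambda s: " ".join(s.lower().split())
    if predicted is None or norm(predicted) not in {norm(o) for o in options}:
        return 0, 0, 1  # failure to answer, or an irrelevant response
    if norm(predicted) == norm(expected):
        return 1, 0, 0  # correct selection
    return 0, 1, 0      # incorrect selection

options = ["A. Increased thirst", "B. Frequent urination",
           "C. Blurred vision", "D. High fever"]
print(score_multiple_choice("D. High fever", "B. Frequent urination", options))
# -> (0, 1, 0)
```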


Open-ended questions

Scoring for open-ended questions is divided into two levels of semantic similarity:

Exact Match (Full Points)
  • Full points will be awarded if the model provides an exact match to the known ground truth answer.
  • Example:
    • Question: "What is Laparoscopy?"
    • Expected Answer: "A minimally invasive surgical procedure using a camera."
    • Model Answer: "A minimally invasive surgical procedure using a camera."
    • Scoring: If the model returns exactly this answer, full points are awarded.

Full to Partial Semantic Match (Full to Partial Points)
  • Full points will also be awarded if the model returns an answer that is semantically equivalent to the ground truth, even if phrased differently (e.g., using synonyms). BLEU [3], ROUGE [4], and METEOR [5] scores will be calculated for all generated-reference QA response pairs. In cases where these automated metrics yield uncertain or inconsistent scores, human annotators will review and normalize the evaluation in conjunction with those scores and the semantic similarity scores described below.
  • Scoring for Semantic Answer Matching:

    The semantic similarity score combines word-level, sentence-level, and paragraph-level similarity. Each similarity level is calculated using embeddings and cosine similarity, and the final score is computed as a weighted sum of these individual scores:

    \[ \text{Semantic Match Score} = w_{word} \cdot \text{CosineSim}_{word} + w_{sentence} \cdot \text{CosineSim}_{sentence} + w_{paragraph} \cdot \text{CosineSim}_{paragraph} \] Where:
    • \(w_{word}, w_{sentence}, w_{paragraph}\) are weights that sum to 1.
    • \(\text{CosineSim}_{word}\): Mean pooling of word embeddings obtained from the model output using: \[ E_{\text{sentence}} = \frac{\sum_{i=1}^{T} E_{\text{token}_i} \cdot \text{Mask}_i}{\sum_{i=1}^{T} \text{Mask}_i} \] Where:
      • \(E_{\text{token}_i}\) is the embedding for the \(i\)-th token.
      • \(\text{Mask}_i\) is the attention mask for the \(i\)-th token to exclude padding tokens.
      • \(T\) is the total number of tokens.


    • \(\text{CosineSim}_{sentence}\): Cosine similarity between corresponding sentence embeddings \(E_x\) and \(E_y\), calculated as: \[ \text{CosineSimilarity}(E_x, E_y) = \frac{E_x \cdot E_y}{\|E_x\| \|E_y\|} \] Where:
      • \(E_x \cdot E_y\) is the dot product of the embeddings.
      • \(\|E_x\|\) and \(\|E_y\|\) are the magnitudes of the embeddings.

      Once cosine similarity is computed for each corresponding sentence pair in the generated and reference answers, the aggregated sentence-level similarity is obtained by averaging all pairwise similarities:

      \[ \text{CosineSim}_{\text{sentence}} = \frac{1}{N} \sum_{i=1}^{N} \text{CosineSimilarity}(E_{x_i}, E_{y_i}) \] Where \(N\) is the number of sentences in the generated and reference answers.


    • \(\text{CosineSim}_{paragraph}\): Cosine similarity between the full answer embeddings, calculated as: \[ \text{CosineSimilarity}(E_{\text{paragraph, gen}}, E_{\text{paragraph, ref}}) = \frac{E_{\text{paragraph, gen}} \cdot E_{\text{paragraph, ref}}}{\|E_{\text{paragraph, gen}}\| \|E_{\text{paragraph, ref}}\|} \] Where:
      • \(E_{\text{paragraph, gen}}\) and \(E_{\text{paragraph, ref}}\) are the paragraph embeddings obtained by mean pooling over all token embeddings in the generated and reference answers, respectively:
      • \[ E_{\text{paragraph}} = \frac{\sum_{i=1}^{T} E_{\text{token}_i} \cdot \text{Mask}_i}{\sum_{i=1}^{T} \text{Mask}_i} \]
      • \(T\) is the total number of tokens in the paragraph.

    The word-level score will be calculated using a method similar to BERTScore [6], which evaluates text generation using contextual embeddings. The sentence-level and paragraph-level scores will be computed in a manner inspired by SemScore [7], which leverages semantic textual similarity for evaluating instruction-tuned large language models.

    The final score is determined as follows:

    • Full Match: If the semantic match score exceeds the semantic similarity threshold (0.9), full points are awarded.
    • Partial Match: If the semantic match score lies between the lower threshold (0.4, below which no credit is given) and the full-match threshold (0.9), partial points are awarded using linear interpolation (see the sketch after the examples below): \[ \text{Partial Points} = \frac{\text{Semantic Match Score} - \text{Lower Threshold}}{\text{Full Threshold} - \text{Lower Threshold}} \times \text{Max Points} \]
  • Full Match Example:
    • Question: "What is Laparoscopy?"
    • Expected Answer: "A minimally invasive surgical procedure using a camera."
    • Model Answer: "A minimally invasive surgery with a camera."
    • Scoring: Full Points (1).
  • Partial Match Example:
    • Question: "What is Laparoscopy?"
    • Expected Answer: "A minimally invasive surgical procedure using a camera."
    • Model Answer: "A procedure using a camera."
    • Scoring: Partial Points (less than 1).
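Putting the definitions above together, the open-ended scoring pipeline could be sketched as follows. This is an illustration only, not the official evaluation script: the embedding model, the equal level weights, and the naive sentence splitting are assumptions made for this sketch.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice; the organizers' embedding model is not specified.
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(text: str) -> torch.Tensor:
    """Mean-pool token embeddings, using the attention mask to skip padding."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        tokens = model(**enc).last_hidden_state            # (1, T, H)
    mask = enc["attention_mask"].unsqueeze(-1)             # (1, T, 1)
    return (tokens * mask).sum(dim=1) / mask.sum(dim=1)    # (1, H)

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b).item()

def word_level_sim(generated: str, reference: str) -> float:
    """Greedy token matching over contextual embeddings (BERTScore-like [6])."""
    def token_embeddings(text: str) -> torch.Tensor:
        enc = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            out = model(**enc).last_hidden_state[0]        # (T, H)
        return torch.nn.functional.normalize(out, dim=-1)
    sim = token_embeddings(generated) @ token_embeddings(reference).T
    p = sim.max(dim=1).values.mean()                       # greedy precision
    r = sim.max(dim=0).values.mean()                       # greedy recall
    return (2 * p * r / (p + r)).item()

def semantic_match_score(generated: str, reference: str,
                         weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted combination of word-, sentence-, and paragraph-level similarity."""
    w_word, w_sent, w_para = weights                       # assumed equal weights
    para_sim = cosine(embed(generated), embed(reference))
    gen_sents = [s.strip() for s in generated.split(".") if s.strip()]
    ref_sents = [s.strip() for s in reference.split(".") if s.strip()]
    pairs = list(zip(gen_sents, ref_sents))                # naive 1:1 alignment
    sent_sim = (sum(cosine(embed(g), embed(r)) for g, r in pairs) / len(pairs)
                if pairs else para_sim)
    return (w_word * word_level_sim(generated, reference)
            + w_sent * sent_sim + w_para * para_sim)

def to_points(score: float, lower: float = 0.4, full: float = 0.9,
              max_points: float = 1.0) -> float:
    """Map a semantic match score to points using the stated thresholds."""
    if score >= full:
        return max_points                                  # full match
    if score <= lower:
        return 0.0                                         # no credit
    return (score - lower) / (full - lower) * max_points   # linear interpolation
```

With the stated thresholds, a semantic match score of 0.65, for example, earns (0.65 - 0.4) / (0.9 - 0.4) = 0.5 of the maximum points.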

Submission, Rules and Participation Requirements

Submission Format
All participants will be invited to submit a paper describing their solution, to be included in the Proceedings of the 24th Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2025.
  • Participants are required to submit their models to CodaBench [1], where the organizers will test and evaluate the submissions through a semi-automated process. Participants are not required to make their submissions public; however, doing so is highly encouraged to promote transparency and reproducibility.
  • Submissions must include:
    • The model uploaded to CodaBench [1] for evaluation.
    • A detailed description of the system approach, including the methods, training data, configurations, and techniques used to develop the novel model/method, in the form of a short paper.
  • Optional:
    • Participants can make their models and approaches public to foster collaboration and transparency within the community.

Model Transparency
  • Participants are required to provide clear documentation about:
    • The methods and techniques used, including any training data, preprocessing steps, fine-tuning strategies, and specific algorithms.
    • Any prompts, prompt engineering, or inference strategies employed.
  • This information will be used to ensure transparency and promote openness within the competition but will not affect leaderboard rankings.

Training and Data Use
  • Participants are free to use any methods, models, or data sources to achieve the best results.
  • There are no restrictions on the choice of training data, models, or tools. Participants can combine datasets, use proprietary or private data, and leverage any strategies or algorithms to maximize performance.
  • Responsibility: Participants must ensure their approach complies with all applicable legal and ethical standards regarding data usage and methodology.

Submission Limits
  • Each participant or team may submit a maximum of three times for consideration on the leaderboard.
  • Each submission must reflect a distinct system or approach.

Leaderboard Ranking
  • Submissions will be ranked on the official leaderboard based solely on the evaluation metrics generated by the provided script.
  • The leaderboard will reflect scores based on exact matches, semantic matches, partial matches, and penalize factually incorrect responses.
  • No distinction will be made between submissions using public or proprietary methods in the rankings. All valid submissions will appear on the leaderboard based on their scores.

Use of ClinIQLink Dataset
  • The fully annotated ClinIQLink dataset will NOT be released publicly for participants to use.

Evaluation Requirements
  • Participants are required to use the official evaluation script provided by the organizers to generate their metrics.
  • Submissions must strictly adhere to the output format specified by the script to ensure comparability across all entries.

Timeline

Timeline for ClinIQLink at BioNLP Workshop at ACL 2025:

  1. January 21, 2025: First Call for Participation
  2. February 20, 2025: Release of Sample Submission Dataset (Github)
  3. March 10, 2025: Release of Testing Dataset & Framework (CodaBench)
  4. April 15, 2025: System Submission Deadline
  5. April 25, 2025: Results Feedback Provided
  6. May 05, 2025: Prelim Papers Submission Deadline
  7. May 10, 2025: Notification of Acceptance
  8. May 15, 2025: Final Papers Submission Deadline
  9. May 20, 2025: Camera Ready Papers Due
  10. July 07, 2025: Pre-recorded Video Due
  11. July 31, 2025: BioNLP Workshop Date (at ACL 2025)

ACL 2025: July 27 - August 1, 2025

All deadlines are 11:59 PM, Anywhere on Earth (AoE).

Research Directions

LLM Hallucination Detection

See the following survey papers for ideas on research directions:


Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) combines large language models with external retrieval mechanisms to enhance factual accuracy and mitigate hallucinations. The following papers explore various aspects of RAG:


Medical QA Datasets

Key datasets commonly used for Medical Question Answering research involving Large Language Models include:


Medical QA Models

Support and FAQs

Support Contact

For all questions about the shared task, please contact Brandon Colelough.

Email: brandon.colelough@nih.gov


FAQs

Coming soon

Result Reporting and Post-task Analysis

Coming Soon

Organizers

  • Brandon Colelough (Website)
  • Dina Demner-Fushman (Website)
  • Davis Bartels (No Website)

References

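[1] Z. Xu, S. Escalera, A. Pavão, M. Richard, W.-W. Tu, Q. Yao, H. Zhao, and I. Guyon. 2022. Codabench: Flexible, easy-to-use and reproducible meta-benchmark platform. Patterns, 3(7):100543.
[2] C. J. van Rijsbergen. 1979. Information Retrieval, 2nd edition. Butterworths.
[3] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 2002.
[4] C.-Y. Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (ACL 2004 Workshop).
[5] S. Banerjee and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
[6] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. 2020. BERTScore: Evaluating text generation with BERT. In ICLR 2020.
[7] A. Aynetdinov and A. Akbik. 2024. SemScore: Automated evaluation of instruction-tuned LLMs based on semantic textual similarity. arXiv:2401.17072.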