Evidence Quality and Validity Framework and Metrics for Large Language Models
Document Type
Article
Publication Date
5-15-2024
Abstract
Generative AI systems and their underlying language models (referred to as Large Language Models, or LLMs) are able to perform human-like tasks such as answering questions and performing text-to-text, text-to-speech, text-to-image, and text-to-code generation on topics and images on which they have been pre-trained using real-world data. Because LLMs are trained on many different kinds of data sources, inherent biases and imperfections present in these sources may cause the LLMs to produce unintended results. LLMs are also prone to generating incorrect responses, or hallucinating, which may call into question the accuracy, validity, and reliability of LLM-generated responses.
Ascertaining the validity and quality of LLMs, and comparing the results generated by different LLMs, are formidable tasks because there is no de facto gold-standard LLM against which to measure existing models. Factors such as the huge corpus of information and data used to pre-train the models, the engineered prompts, the vector embeddings, and the fine-tuning of the models vary across different LLMs. Given the popularity, acceptance, and productivity gains that come with using LLMs, as well as their reported biases and misuses, it is imperative that there be an objective way to ascertain and validate the quality and truthfulness of LLM outputs.
Meaningfulness and interpretability of LLM outputs are important for trust, transparency, and for informing human decision-making. There is ongoing research to develop metrics for determining the accuracy and reliability of LLM outputs. We propose using evaluation metrics commonly applied in statistics, medicine, and the social sciences to assess the quality and reliability of outputs generated by LLMs. The metrics are the repeatability score, reproducibility score, generalizability score, robustness score, and replicability score. Using these metrics and scores when designing LLM-based projects can help to ensure their trustworthiness and reliability.
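The abstract names the proposed scores but does not give their formulas. As one hedged illustration only, a repeatability score could be operationalized as the mean pairwise similarity across repeated runs of the same prompt against the same model; the similarity measure (here, `difflib.SequenceMatcher`) and the sample responses are assumptions for the sketch, not the author's definitions:

```python
from itertools import combinations
from difflib import SequenceMatcher


def repeatability_score(responses):
    """Mean pairwise text similarity across repeated runs of the SAME
    prompt on the SAME model (1.0 means identical output every time).

    SequenceMatcher's ratio is an illustrative stand-in for any
    text-similarity measure, e.g. embedding cosine similarity."""
    pairs = list(combinations(responses, 2))
    if not pairs:  # zero or one response: trivially repeatable
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


# Hypothetical outputs from three repeated calls with one prompt.
runs = [
    "Aspirin reduces fever and relieves mild pain.",
    "Aspirin reduces fever and relieves minor pain.",
    "Aspirin lowers fever and eases mild pain.",
]
score = repeatability_score(runs)
print(f"repeatability score: {score:.2f}")
```

A reproducibility score could follow the same pattern but compare outputs across different models or deployments rather than repeated runs of one model.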
Recommended Citation
Olaleye, David Ph.D, "Evidence Quality and Validity Framework and Metrics for Large Language Models" (2024). N.C. A&T Faculty Enrichment Workshop - Aligning AI Themes of Future Teaching, Research, and Interdisciplinary Collaboration. 3.
https://digital.library.ncat.edu/facenrichwrkshopai/3