Artificial Intelligence and Data Science

This section of the Glossary contains terms relating to Artificial Intelligence and Data Science.

Advanced Patient Similarity Tool

An algorithm or piece of software to be developed as part of the study that can be used by healthcare professionals to predict health outcomes and optimal treatments for patients with Multiple Long-Term Conditions [OPTIMAL, Project Staff and Public Advisory Group, 2022].

Algorithm

A set of rules that a machine (such as a computer) can follow to learn how to do a task [Denny, 2020].
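
As a minimal sketch, an algorithm can be as simple as a short, fixed set of rules written in code. The rule and threshold below are invented for illustration and are not taken from any real clinical guideline:

```python
# A toy algorithm: a fixed set of rules a computer can follow.
# The rule and its thresholds are invented, for illustration only.
def flag_high_risk(age, systolic_bp):
    """Return True if a (hypothetical) risk rule is met."""
    if age >= 65 and systolic_bp >= 140:
        return True
    return False

print(flag_high_risk(age=70, systolic_bp=150))  # True
print(flag_high_risk(age=40, systolic_bp=120))  # False
```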

Artificial Intelligence (AI)

Also known as AI. An Artificially Intelligent computer system makes predictions or takes actions based on patterns in existing data. For example, an AI computer system may be able to notify a GP when disease risk factors have been identified in a patient’s records.

Autonomous

A machine is described as autonomous if it can perform its task or tasks without needing human intervention [TELUS International, 2021].

Causal Inference

A method used in data analysis. Causal Inference involves looking for patterns in large data sets for variables that might be linked together by “cause and effect”. It is very easy to analyse data to look for things that occur together (i.e., variables that are correlated), but it is harder to tell whether one thing might be causing or affecting the other [OPTIMAL, Project Staff and Public Advisory Group, 2022].
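
A minimal sketch of why correlation alone is not enough, using entirely invented data: a hidden third variable drives two others, so they are strongly correlated even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hidden third variable drives both x and y, so x and y are
# strongly correlated even though neither causes the other.
hidden = rng.normal(size=1000)
x = hidden + rng.normal(scale=0.5, size=1000)
y = hidden + rng.normal(scale=0.5, size=1000)

print(np.corrcoef(x, y)[0, 1])  # high correlation, but no causal link
```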

Confound

Also called Confounding Variable or Confounder. A factor that is associated with both an intervention and the outcome of interest to the researchers. For example, if people in the experimental group of a controlled trial are younger than those in the control group, it will be difficult to decide whether a lower risk of death in one group is due to the intervention or the difference in age. Age is then said to be a confound. Randomisation is typically used by researchers as a technique to minimise the existence of confounding variables between experimental and control groups. Confounding is a major concern in non-randomised trials [NIHR, 2022].
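
The age example above can be shown with a small, entirely hypothetical simulation: the treatment does nothing, yet the treated group still shows a lower death rate simply because it is younger.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical simulation: the treatment has NO real effect, but the
# treated group happens to be younger, and the risk of death rises
# with age (an invented relationship, for illustration only).
age_treated = rng.normal(55, 8, n)   # younger group
age_control = rng.normal(70, 8, n)   # older group

def death_rate(ages):
    risk = np.clip((ages - 40) / 100, 0, 1)  # risk depends on age alone
    return (rng.random(len(ages)) < risk).mean()

print(death_rate(age_treated))  # lower death rate...
print(death_rate(age_control))  # ...despite an ineffective treatment
```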

Data

Data is information that has been collected through research. It can include written information, numbers, sounds, and pictures [NIHR, 2022].

Data Analysis

Data analysis involves examining and processing research data, in order to answer the questions that the project is trying to address. It involves identifying patterns and drawing out the main themes, and is often done with specialist computer software [NIHR, 2022].
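
A minimal sketch of one common analysis step, using the pandas library; the column names and values are invented: summarising an outcome for each group to look for patterns worth investigating further.

```python
import pandas as pd

# Hypothetical research data (invented values).
data = pd.DataFrame({
    "group":   ["control", "control", "treated", "treated"],
    "outcome": [4.1, 3.8, 5.2, 5.6],
})

# Summarise the outcome for each group to look for patterns.
print(data.groupby("group")["outcome"].mean())
```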

Data Mining

The process of analysing datasets in order to discover new patterns that might improve a model.
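
A very simple sketch of pattern discovery, using invented records: counting which pairs of items occur together most often.

```python
from collections import Counter
from itertools import combinations

# Hypothetical patient records: each entry lists recorded conditions.
records = [
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension", "asthma"},
    {"asthma"},
    {"diabetes", "hypertension"},
]

# Count how often each pair of conditions occurs together -
# a very simple form of pattern discovery.
pairs = Counter()
for record in records:
    for pair in combinations(sorted(record), 2):
        pairs[pair] += 1

print(pairs.most_common(1))  # [(('diabetes', 'hypertension'), 3)]
```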

Data Science

Drawing from statistics, computer science and information science, this interdisciplinary field aims to use a variety of scientific methods, processes and systems to solve problems involving data. Data Science often involves working with complex, ‘noisy’, and unstructured data to develop insights and knowledge. Data science is related to data mining, machine learning and big data [OPTIMAL, Project Staff and Public Advisory Group, 2022].

Dataset

A collection of related data points, usually organised with a uniform structure and accompanying tags.

Effect size

A term for the statistical estimate of treatment effect for a study. Effect size is used to calculate the magnitude of the difference between groups.
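
One widely used effect size measure is Cohen’s d: the difference between two group means divided by their pooled standard deviation. A minimal sketch with invented measurements:

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d: difference in group means divided by the pooled
    standard deviation, giving a scale-free measure of effect size."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    pooled_var = (((len(a) - 1) * a.var(ddof=1)
                   + (len(b) - 1) * b.var(ddof=1))
                  / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Invented measurements for two groups.
print(cohens_d([5.2, 5.6, 5.1, 5.9], [4.1, 3.8, 4.4, 4.0]))
```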

GitHub

An online platform used to store and collaborate on complex, data-heavy projects. Projects hosted on GitHub can also be made publicly available, to share them with the wider research community and anyone else who is interested [OPTIMAL, Project Staff and Public Advisory Group, 2022].

Human-Computer Interaction

A type of research that considers the ways in which humans and computers interact and what this means. It draws on computer science, behavioural science and design, and can be used to understand things like how someone responds emotionally to technology, or how easy they find a design to understand and interact with [OPTIMAL, Project Staff and Public Advisory Group, 2022].

Label

A part of training data that identifies the desired output for that particular piece of data.
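
A minimal sketch with invented values: each data point has a label giving the output a model should learn to produce for it.

```python
# A tiny labelled training set (all values invented for illustration).
# Each row of `features` is one data point; the matching entry in
# `labels` is the desired output for that data point.
features = [
    [63, 1, 0],   # e.g. age, smoker, prior condition
    [42, 0, 0],
    [71, 1, 1],
]
labels = ["high risk", "low risk", "high risk"]
```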

Machine Intelligence

An umbrella term for various types of learning algorithms, including machine learning and deep learning.

Machine Learning

The study and development of algorithms (computer code) that can learn and change in response to new data, without the help of a human being. Machine Learning is important where it is unfeasible or impossible to develop a static algorithm that would be able to cope with the variability or complexity of the data it is used on [OPTIMAL, Project Staff and Public Advisory Group, 2022].
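
A minimal sketch using the scikit-learn library and invented data: the learning algorithm fits a model to example data, rather than a human writing the decision rule by hand.

```python
from sklearn.linear_model import LogisticRegression

# Tiny, invented training data: features and known outcomes.
X = [[63, 1], [42, 0], [71, 1], [35, 0], [68, 1], [50, 0]]
y = [1, 0, 1, 0, 1, 0]  # 1 = outcome occurred, 0 = it did not

# The algorithm adjusts the model to fit the data; no human
# writes the decision rule directly.
model = LogisticRegression().fit(X, y)
print(model.predict([[66, 1]]))  # predicted outcome for a new case
```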

Manifold Learning

A method for simplifying complex data to make it more usable, while retaining information about the underlying ‘structure’ of the original data. It is useful where a single data point can legitimately be in several categories at once [OPTIMAL, Project Staff and Public Advisory Group, 2022].
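
A minimal sketch using Isomap from scikit-learn (one of several manifold learning methods); the data here is random and purely illustrative.

```python
import numpy as np
from sklearn.manifold import Isomap

# 100 random data points with 10 measurements each (invented data).
rng = np.random.default_rng(0)
high_dimensional = rng.normal(size=(100, 10))

# Reduce each point to 2 numbers while trying to preserve the
# underlying 'structure' (distances along the data's manifold).
embedding = Isomap(n_components=2).fit_transform(high_dimensional)
print(embedding.shape)  # (100, 2)
```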

Metadata

Data that summarizes basic information about a dataset (for example, author, date created, date modified, file size). Metadata helps researchers decide whether that specific dataset would be useful for their research project [NIHR, 2022].
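
A minimal sketch of what metadata might look like in code; the field names and values are invented for illustration.

```python
# Metadata describes a dataset without containing the data itself.
# All field names and values here are invented.
metadata = {
    "title": "Example patient survey responses",
    "author": "Research Team",
    "date_created": "2022-03-01",
    "date_modified": "2022-06-15",
    "file_size_mb": 12.4,
    "number_of_records": 5_000,
}

print(metadata["date_created"])
```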

Model

A broad term referring to the product of AI training, created by running a machine learning algorithm on training data.

Natural Language Generation (NLG)

This refers to the process by which a machine turns structured data into text or speech that humans can understand. Essentially, NLG is concerned with what a machine writes or says as the end part of the communication process.

Natural Language Processing (NLP)

The umbrella term for any machine’s ability to perform conversational tasks, such as recognizing what is said to it, understanding the intended meaning and responding intelligibly.

Natural Language Understanding (NLU)

As a subset of natural language processing, natural language understanding deals with helping machines to recognize the intended meaning of language, taking into account its subtle nuances and any grammatical errors.
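
As a deliberately toy illustration of the goal (real NLU systems are far more sophisticated than a keyword lookup), the sketch below maps different phrasings, including a misspelling, to the same intended meaning; all intents and keywords are invented.

```python
# A toy illustration of the GOAL of NLU, not a real NLU system:
# different phrasings (including a misspelling) map to one meaning.
INTENT_KEYWORDS = {
    "book_appointment": {"appointment", "booking", "apointment"},
    "request_results": {"results", "test"},
}

def guess_intent(sentence):
    words = set(sentence.lower().split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return "unknown"

print(guess_intent("I need an apointment please"))  # book_appointment
```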

Python

A popular programming language used for general-purpose programming [Perez et al., 2011].

Statistics

The study of the collection, organization, analysis, interpretation, and presentation of data. It is used to understand things like whether data is representative, whether measurements are as expected, whether groups of data are different from one another, and how reliable comparisons of groups of data are [OPTIMAL, Project Staff and Public Advisory Group, 2022].
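
A minimal sketch of one common statistical question, using the SciPy library and invented measurements: are two groups of data different from one another, or could the difference be due to chance?

```python
from scipy import stats

# Two invented groups of measurements.
group_a = [5.2, 5.6, 5.1, 5.9, 5.4]
group_b = [4.1, 3.8, 4.4, 4.0, 4.3]

# A t-test asks whether the difference between the group means is
# larger than would be expected by chance alone.
result = stats.ttest_ind(group_a, group_b)
print(result.pvalue)  # a small p-value: unlikely to be chance alone
```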

Synthetic Data

Synthetic data is artificial data generated by a computer program. It does not represent real events or people, but it often replicates the structure and patterns of real-world data.
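
A minimal sketch using NumPy, with invented summary figures: generating artificial values that mimic the pattern of real data without describing any real person.

```python
import numpy as np

rng = np.random.default_rng(42)

# Suppose the real (confidential) data had mean 70 and standard
# deviation 10 - summary figures invented for this sketch.
real_mean, real_sd = 70, 10

# Generate artificial values that mimic that pattern without
# describing any real person.
synthetic_ages = rng.normal(real_mean, real_sd, size=1000)
print(round(synthetic_ages.mean(), 1), round(synthetic_ages.std(), 1))
```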

Trusted Research Environments

Highly secure computing spaces that provide remote access to health data for approved researchers. Trusted Research Environments (TREs) can also be called Data Safe Havens and Secure Data Environments [Swansea University, 2022].