The T1 – Performance of AI research indicator evaluates technical progress on an illustrative set of AI tasks from different subareas (including image classification, face recognition, speech recognition, text summarisation, etc.), using a combination of quantitative measurements, such as those from popular AI benchmarks and prize challenges. The indicator reports the performance of AI as measured by different evaluation metrics (e.g., (mean) accuracy, score, error rate, BLEU score, etc.), depending on the task at hand. Performance is expressed as a value between 0 and 100. The Natural Language Processing – Speech task, originally measured through an error rate (where 0 is the highest performance and 100 the lowest), is transformed as 100 − error rate to obtain a comparable measure of performance between 0 (lowest) and 100 (highest).
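A minimal sketch of this transformation, assuming the error rate is already expressed on a 0–100 scale (the input value below is purely illustrative, not an actual benchmark result):

```python
def error_rate_to_performance(error_rate: float) -> float:
    """Convert an error rate on a 0-100 scale (0 = best) into a
    performance score on the same 0-100 scale (100 = best)."""
    if not 0.0 <= error_rate <= 100.0:
        raise ValueError("error rate must lie in [0, 100]")
    return 100.0 - error_rate

# Illustrative word error rate of 5.8 -> performance close to 94.2
print(error_rate_to_performance(5.8))
```

This keeps all tasks on a common "higher is better" 0–100 scale, which is what allows the per-task trends to be plotted together.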
The datasets and metrics used to measure performance for each benchmark are:
- Computer Vision – Image: dataset ImageNet, metric top-1 accuracy;
- Computer Vision – Video: dataset ActivityNet, metric mean average precision;
- Natural Language Processing – Language Understanding: dataset SuperGLUE, metric score;
- Natural Language Processing – Language Reasoning: dataset VQA, metric accuracy;
- Natural Language Processing – Speech: dataset LibriSpeech, metric word error rate (WER).
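For reference, the benchmark-to-dataset mapping above can be captured in a simple lookup structure; a sketch in Python, transcribing only the names listed above:

```python
# Benchmark tasks mapped to (dataset, evaluation metric), as listed above.
BENCHMARKS = {
    "Computer Vision - Image": ("ImageNet", "top-1 accuracy"),
    "Computer Vision - Video": ("ActivityNet", "mean average precision"),
    "NLP - Language Understanding": ("SuperGLUE", "score"),
    "NLP - Language Reasoning": ("VQA", "accuracy"),
    "NLP - Speech": ("LibriSpeech", "word error rate (WER)"),
}

for task, (dataset, metric) in BENCHMARKS.items():
    print(f"{task}: {dataset} ({metric})")
```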
Recently, research activities carried out by the global AI community on the technical progress of various AI tasks (including computer vision, language modelling and processing, speech, machine translation, videogames, etc.) have used a combination of quantitative measurements, such as those from popular AI benchmarks and prize challenges.
AI measurement is any activity that estimates the attributes (measures, metrics) of an AI system or some of its components, either abstractly or in particular contexts of operation. These attributes, if well estimated, can be used to explain and predict the behaviour of the system. This can be approached from an engineering perspective, asking whether a particular AI system meets its specification or the intention of its designers, known respectively as "verification" and "validation". However, AI systems exhibit extremely complex adaptive behaviour and, in many cases, lack a written specification. What the system is expected to do depends on feedback from the user or the environment (in the form of rewards) or is specified by example (from which the system has to learn a model).
The tradition in AI measurement has focused on task-oriented evaluation. For instance, given a scheduling problem, a board game or a classification problem, systems are evaluated according to some metric of task performance. Performance metrics are thus figures and data representative of an AI system's actions, abilities and overall quality. Performance metrics take many different forms depending on the task being addressed, and their values depend on how they are calculated.
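Two of the metrics mentioned above can be illustrated with a short sketch: classification accuracy (as used for ImageNet or VQA) and word error rate (as used for LibriSpeech), the latter computed here as a word-level Levenshtein distance. This is a simplified illustration, not the exact evaluation code of any benchmark:

```python
def accuracy(predictions, labels):
    """Fraction of correct predictions, scaled to a 0-100 range."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)

def word_error_rate(reference, hypothesis):
    """WER as a percentage: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming (one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return 100.0 * prev[-1] / len(ref)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))            # 3 of 4 correct -> 75.0
print(word_error_rate("the cat sat", "the cat sits"))  # 1 substitution in 3 words
```

Note the opposite orientations: accuracy is "higher is better" while WER is "lower is better", which is exactly why the indicator inverts the Speech error rate before comparing tasks.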
Analysis of the research reveals different levels of performance across the specific AI tasks assessed. It is important to note that the performance reached on a specific task cannot be directly compared with that on other tasks, as tasks tend to be very different in nature. Nevertheless, these metrics provide, independently for each task, a performance value from 0 to 100, and we study the evolution of this performance over time.
In the graph below it can be seen that the best results are obtained in Natural Language Processing – Speech, which shows very high performance throughout the assessed period. Natural Language Processing – Language Understanding also shows good performance. We can also note that Computer Vision – Image follows a trend that is relatively flat, but with an overall performance above 80 over the entire period. The progress in Natural Language Processing – Language Reasoning shows a clear improvement over time, especially between 2016 and 2018. Finally, Computer Vision – Video, with a level of performance well below that of other AI tasks, has recently experienced the largest improvement, more than doubling its performance in only four years.
Detailed analysis of the previous graph and of the approaches used to address each task indicates that deep neural network-based approaches have performed at the top of most AI competition leaderboards (mostly in machine learning), even surpassing human-level performance on several specific tasks involving image classification or natural language understanding. However, it should be emphasised that the previous graph only explores the evolution of the top-performing systems and the raw improvement in accuracy (and other performance measures) over time. To obtain a more global evaluation of AI performance, it is useful to consider other factors as well, such as energy consumption trends, or advances in algorithms and infrastructure that have enabled researchers and practitioners to increase the efficiency of their training and inference phases. Although these considerations are outside the scope of this collection of indicators, they are covered in several works in the literature (Amodei and Hernandez, 2018; Canziani et al., 2017; Desislavov et al., 2021; Gholami et al., 2021). These sources state that, in general terms, the progress of some AI paradigms (such as the state-of-the-art deep-learning approaches used in most AI tasks) stems from exponential growth in algorithmic complexity (e.g., the number of parameters in neural networks), which typically results in higher energy consumption.