Research Program: AKieZ

German version is here.

Overview

For a powerful AI system, model size, computational effort and data volume must stand in a sensible ratio during training. Scaling up usually fails due to insufficient or erroneous training data. This project will develop technologies to improve the quality of data collected in real-world use for AI systems. Domain-specific expert knowledge and contextual information will be used to check time series for plausibility and correctness. We want to demonstrate AI tools that use generative and physically informed AI to augment training data and improve its quality. We will evaluate these concepts using an AI system for handwriting recognition with a sensor-equipped ballpoint pen. The consortium can already build on several years of experience in this area.

Our project therefore aims to enhance training data by incorporating specialist and contextual knowledge into the training process. In other words, it is not about replacing data with expert knowledge, but about tapping into low-quality data for training that could not be used meaningfully without this curation. At the same time, we want to automate this approach as much as possible to make it economical and manageable.
To achieve this, criteria must be defined for the data in advance, explicitly formulated from physical or statistical boundary conditions so that they can be applied automatically during training. Training data could then be filtered on the basis of its plausibility and correctness. The results of such a technology would be widely applicable and would also allow ethical principles to be taken into account during training, as long as they can be expressed algorithmically.

Partners

STABILO International GmbH
Schwanweg 1
90562 Heroldsberg

STABILO is responsible for project management, defining the boundary conditions, part of the data collection and all hardware-related tasks. The design, implementation and operation of the data curation – always equipped with the latest research results from the chair – also take place at STABILO. Demonstrator apps, their interaction with the handwriting recognition system and rigorous software tests are also part of STABILO's responsibility.

and

Lehrstuhl für Maschinelles Lernen und Datenanalytik
Carl-Thiersch-Straße 2b
91052 Erlangen

The Chair of Machine Learning and Data Analytics conducts research into transfer learning, data augmentation, natural language processing and active learning, cooperates with the industrial partner on issues relating to implementation in production software and carries out scientific evaluations of the research results.

Both partners are responsible for recording training data, with STABILO providing the necessary apps and servers.

Structure of Tasks

Work Package 1: Definition of Requirements

Develop initial set of coarse requirements:

  • e.g. agree on a standardized data format

Use these to define Bachelor's and Master's theses

  • Our goal: four theses, Bachelor's or Master's, over the duration of the project

Refine the requirements as work progresses

Work Package 2: Automatic data evaluation

WP 2.1: Use Clustering to find peculiarities in the data

  • Allows clear statements about the connection between the metadata and the annotation quality for different groups.
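
As an illustration of the clustering idea, the sketch below groups recordings by two hypothetical per-recording features and then compares annotation quality per group. The features, the two-cluster k-means variant and the example data are our own assumptions, not the project's actual method.

```python
def split_two_clusters(points):
    """Two-cluster k-means with farthest-point initialisation.

    points: list of equal-length feature tuples (e.g. per-recording
    mean writing speed and sensor noise level -- illustrative features,
    not the project's real ones). Returns a cluster label per point.
    """
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # initialise the two centres with a pair of far-apart points
    c0 = points[0]
    c1 = max(points, key=lambda p: d2(p, c0))
    centers = [c0, c1]
    labels = [0] * len(points)
    for _ in range(10):
        # assign each point to its nearest centre, then recompute centres
        labels = [0 if d2(p, centers[0]) <= d2(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return labels

def error_rate_per_cluster(labels, is_wrong):
    """Fraction of wrongly annotated recordings per cluster."""
    rates = {}
    for k in set(labels):
        flags = [w for lab, w in zip(labels, is_wrong) if lab == k]
        rates[k] = sum(flags) / len(flags)
    return rates
```

A markedly higher error rate in one cluster would be exactly the kind of "clear statement" about a metadata group that this WP targets.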

WP 2.2: Statistical analysis for data quality control

  • Find outliers and low-quality data through statistical analysis.
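
A minimal sketch of such a statistical filter, assuming a simple z-score criterion on one per-sample statistic (the statistics and thresholds actually used in the project may differ):

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose z-score exceeds the threshold.

    values could be, for example, per-recording stroke durations;
    the 3-sigma threshold is a conventional default, not a project
    specification.
    """
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing to flag
    return [i for i, v in enumerate(values)
            if abs(v - mean) / stdev > threshold]
```

For heavily skewed sensor data, a robust variant (median and median absolute deviation) would follow the same pattern.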

WP 2.3: Use physical boundary conditions for quality control

  • Supports the techniques in WPs 2.1 and 2.2.
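
How such physical constraints might be checked can be sketched as below; the concrete limits (maximum sampling gap, maximum acceleration) are illustrative placeholders, not the sensor pen's real specifications.

```python
def plausible(sample, max_accel=50.0, max_gap=0.02):
    """Check one recording against simple physical constraints.

    sample: list of (timestamp_s, accel_ms2) tuples.
    Constraints (illustrative values only):
      - timestamps strictly increasing, with no gap above max_gap seconds
      - acceleration magnitude within what a writing hand can produce
    """
    for (t0, _), (t1, _) in zip(sample, sample[1:]):
        if not (0 < t1 - t0 <= max_gap):
            return False  # dropped samples or non-monotonic clock
    return all(abs(a) <= max_accel for _, a in sample)
```

Recordings failing such checks can be excluded or down-weighted before the statistical analyses of WPs 2.1 and 2.2 see them.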

WP 2.4: Integrate all three in a demonstrator

Work Package 3: Automatic assessment of ground truth annotations

WP 3.1: Comparison with pen trace reconstruction.

  • Can build on work done in earlier projects and MyScript software.

WP 3.2: Comparison with recognizer ensemble.

  • Comparison of the results of several recognizers, truth is the majority decision.
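
A minimal sketch of the majority decision, with an agreement score that could flag low-consensus samples for review (the recognizers themselves are assumed to exist elsewhere):

```python
from collections import Counter

def majority_label(predictions):
    """Majority decision over several recognizers' outputs.

    Returns (label, agreement), where agreement is the fraction of
    recognizers that voted for the winning label; a low agreement
    marks the sample as having an uncertain ground truth.
    """
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)
```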

WP 3.3: Develop Bayesian approaches to identify uncertain annotations.

  • Uncertainty information, calculated through multiple evaluations of a neural network in a Monte Carlo simulation with dropout, is used to verify annotations for time series data.
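
The MC-dropout idea can be illustrated with a toy model: run many stochastic forward passes with random dropout masks and treat the spread of the outputs as an uncertainty estimate. The one-layer "network" below is purely illustrative; the project would apply the same scheme to its real recognizer.

```python
import random
import statistics

def mc_dropout_predict(forward, x, passes=100, p=0.5, seed=0):
    """Monte Carlo dropout: run `passes` stochastic forward passes and
    return (mean, stdev) of the outputs. A high stdev flags an input
    whose annotation the model is uncertain about."""
    rnd = random.Random(seed)
    outputs = [forward(x, rnd, p) for _ in range(passes)]
    return statistics.fmean(outputs), statistics.pstdev(outputs)

# Toy one-layer "network" with dropout on its inputs (illustrative only).
WEIGHTS = [0.5, -0.2, 0.8]

def toy_forward(x, rnd, p):
    # Bernoulli dropout mask, rescaled so the expected output is unchanged
    kept = [(v / (1 - p) if rnd.random() > p else 0.0) for v in x]
    return sum(w * v for w, v in zip(WEIGHTS, kept))
```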

WP 3.4: Integrate all three in a demonstrator.

Work Package 4: Automatic annotation of data without ground truth

WP 4.1: Assignment of multiple annotations with recognizer confidence, using approaches from WP 3

  • Extend the developed methods to unannotated data

WP 4.2: Comparison with complex language models.

  • Use next-word prediction to statistically identify mislabeled data
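
One possible shape of such a check, sketched with a simple bigram model standing in for a complex language model: words that never follow their predecessor in a reference corpus become candidates for annotation errors. Corpus and threshold are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count which word follows which in a reference corpus."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            follows[a][b] += 1
    return follows

def surprising_words(sentence, follows, min_count=1):
    """Return words that follow their predecessor less often than
    min_count in the corpus -- candidates for a mislabeled annotation."""
    words = sentence.split()
    return [b for a, b in zip(words, words[1:])
            if follows[a][b] < min_count]
```

A real setup would replace the bigram counts with a neural language model's next-word probabilities, but the filtering logic stays the same.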

WP 4.3: Evaluation of the quality of the new annotations.

  • Compare the effectiveness of the different methods and use the best output in a positive feedback loop

WP 4.4: Integrate all three in a demonstrator

Work Package 5: Active learning with user input

WP 5.1: Development of active learning and human-in-the-loop mechanisms for semi-automated annotation.

  • Comprises both active input and indirect inference based on user reactions.
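
A minimal sketch of the active-input side, assuming least-confidence sampling, one of several standard active-learning query strategies (the mechanism actually developed in this WP may differ):

```python
def select_for_annotation(samples, confidences, budget=2):
    """Least-confidence sampling: pick the `budget` samples the
    recognizer is least sure about and send only those to the user,
    keeping the required amount of user input small."""
    ranked = sorted(zip(confidences, samples))
    return [s for _, s in ranked[:budget]]
```

The indirect-inference side (e.g. treating an immediate correction by the user as a negative label) would feed the same annotation store.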

WP 5.2: Evaluation of the acceptance of the mechanisms.

  • The focus here is on the quantity of user input, while the quality is to be evaluated in the following WP.

WP 5.3: Evaluation of the quality of the annotations determined by active learning.
WP 5.4: Integrate all three in a demonstrator.

Work Packages 6 and 7: Validation

WP 6: Collection of handwriting samples
WP 7: Testing the demonstrators

This work is supported by a grant from the Bavarian government in the context of the Bayerisches Verbundforschungsprogramm (BayVFP) des Freistaates Bayern, funding line "Digitalisierung".