Overview
For a powerful AI system, model size, computational effort and data volume must be in a sensible ratio during training. Scaling up usually fails because of insufficient or erroneous training data. This project will develop technologies to improve the quality of data collected in vivo for AI systems. Domain-specific expert knowledge and contextual information will be used to check time series for plausibility and correctness. We want to demonstrate AI tools that use generative and physics-informed AI to enlarge the training data and improve its quality. We will evaluate these concepts on an AI system for handwriting recognition with a sensor-equipped ballpoint pen. The consortium can already build on several years of experience in this area.
Our project therefore aims to enhance training data by incorporating specialist and contextual knowledge into the training process. In other words, the goal is not to replace data with expert knowledge, but to make low-quality data usable for training that could not be exploited meaningfully without such curation. At the same time, we want to automate this approach as far as possible to make it economical and manageable.
To achieve this, criteria for the data must be defined in advance that are explicitly formulated from physical or statistical boundary conditions and can be applied automatically during training. Training data can then be filtered on the basis of its plausibility and correctness. The results of such a technology would be widely applicable and would also allow ethical principles to be taken into account during training, as long as they can be expressed algorithmically.
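To illustrate the idea, the following is a minimal sketch of how such explicitly formulated criteria could be applied automatically while assembling a training set; the criterion names, the sample structure and the concrete checks are illustrative assumptions, not the project's implementation.

```python
import numpy as np

# Each criterion is an explicit, named predicate over one recorded sample.
# The concrete checks below are illustrative placeholders only.

def timestamps_monotonic(sample: dict) -> bool:
    """Statistical/structural criterion: timestamps must strictly increase."""
    return bool(np.all(np.diff(sample["t_ms"]) > 0))

def label_not_empty(sample: dict) -> bool:
    """Correctness criterion: an annotated sample must have a non-empty label."""
    return bool(sample.get("label"))

CRITERIA = [timestamps_monotonic, label_not_empty]

def plausible(sample: dict) -> bool:
    """A sample enters the training set only if every criterion holds."""
    return all(criterion(sample) for criterion in CRITERIA)

# Applied automatically while building the training set:
# train_set = [s for s in raw_recordings if plausible(s)]
```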
Partners
STABILO International GmbH
Schwanweg 1
90562 Heroldsberg
STABILO is responsible for project management, defining the boundary conditions, part of the data collection and all hardware-related tasks. The design, implementation and operation of the data curation – always equipped with the latest research results from the chair – also take place at STABILO. Demonstrator apps, their interaction with the handwriting recognition system, and rigorous software tests are also part of STABILO's responsibility.
and
Lehrstuhl für Maschinelles Lernen und Datenanalytik
Carl-Thiersch-Straße 2b
91052 Erlangen
The Chair of Machine Learning and Data Analytics conducts research into transfer learning, data augmentation, natural language processing and active learning, cooperates with the industrial partner on issues relating to implementation in production software and carries out scientific evaluations of the research results.
Both partners are responsible for recording training data, with STABILO providing the necessary apps and servers.
Structure of Tasks
Work Package 1: Definition of Requirements
Develop initial set of coarse requirements:
- e.g. agree on a standardized data format (a sketch follows after this list)
Use these requirements to define Bachelor's and Master's theses
- Our goal: four of these over the duration of the project
Refine the requirements as work progresses
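To illustrate the kind of agreement WP 1 aims for, the following is a minimal sketch of what a standardized data format could look like; the field names and sensor channels are assumptions for illustration, not the format the partners will actually agree on.

```python
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class PenRecording:
    """Hypothetical standardized record for one handwriting sample (illustrative only)."""
    recording_id: str
    writer_id: str                      # pseudonymized writer identifier
    t_ms: np.ndarray                    # timestamps in milliseconds, shape (N,)
    acc: np.ndarray                     # accelerometer readings, shape (N, 3)
    gyro: np.ndarray                    # gyroscope readings, shape (N, 3)
    force: np.ndarray                   # tip force sensor, shape (N,)
    label: Optional[str] = None         # ground-truth transcription, if available
    metadata: dict = field(default_factory=dict)  # device, app version, recording context
```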
Work Package 2: Automatic data evaluation
WP 2.1: Use clustering to find peculiarities in the data
- Allows clear statements about the connection between the metadata and the annotation quality for different groups.
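A minimal sketch of how such a clustering could be related to annotation quality, assuming per-recording feature vectors and quality scores that are placeholders here:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-recording feature matrix (e.g. duration, mean force, device type
# encoded numerically) and a per-recording annotation-quality score from manual review.
features = np.random.rand(500, 4)           # placeholder features
quality = np.random.rand(500)               # placeholder quality scores in [0, 1]

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)

# Relate each cluster (i.e. each group of similar metadata) to its annotation quality.
for c in range(5):
    mask = clusters == c
    print(f"cluster {c}: n={mask.sum():4d}, mean annotation quality={quality[mask].mean():.2f}")
```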
WP 2.2: Statistical analysis for data quality control
- Find outliers and low-quality data through statistical analysis.
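A minimal sketch of a robust statistical outlier check, assuming one summary statistic per recording; the statistic and threshold are illustrative:

```python
import numpy as np

def robust_z_scores(values: np.ndarray) -> np.ndarray:
    """Median/MAD-based z-scores, less sensitive to the outliers we are looking for."""
    median = np.median(values)
    mad = np.median(np.abs(values - median)) + 1e-9
    return 0.6745 * (values - median) / mad

# Placeholder per-recording summary statistic, e.g. mean absolute acceleration per sample.
rng = np.random.default_rng(0)
stats = rng.normal(loc=1.0, scale=0.2, size=500)
stats[:5] += 5.0                                     # inject a few artificial outliers

outlier_mask = np.abs(robust_z_scores(stats)) > 3.5  # common MAD-based threshold
print(f"{outlier_mask.sum()} recordings flagged as potential low-quality data")
```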
WP 2.3: Use physical boundary conditions for quality control
- Supports the techniques in WPs 2.1 and 2.2.
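A minimal sketch of one such physical boundary condition, assuming tri-axial accelerometer data in units of g and an illustrative measuring range of ±16 g:

```python
import numpy as np

ACC_RANGE_G = 16.0   # assumed measuring range of the pen's accelerometer

def within_sensor_range(acc_xyz: np.ndarray) -> bool:
    """No axis may exceed the physical measuring range; saturated samples are implausible."""
    return bool(np.all(np.abs(acc_xyz) <= ACC_RANGE_G))

def gravity_consistent(acc_xyz: np.ndarray, tol_g: float = 0.3) -> bool:
    """Illustrative check: the mean acceleration magnitude should stay close to 1 g."""
    magnitude = np.linalg.norm(acc_xyz, axis=1)
    return bool(abs(np.mean(magnitude) - 1.0) <= tol_g)

# Example: a recording that saturates the sensor would be rejected.
acc = np.full((100, 3), 20.0)
print(within_sensor_range(acc))   # False -> hand the sample to the checks of WPs 2.1 and 2.2
```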
WP 2.4: Integrate all three in a demonstrator.
Work Package 3: Automatic assessment of ground truth annotations
WP 3.1: Comparison with pen trace reconstruction.
- Can build on work done in earlier projects and MyScript software.
WP 3.2: Comparison with recognizer ensemble.
- Compare the results of several recognizers; the majority decision is taken as the truth.
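A minimal sketch of the majority decision, with illustrative recognizer outputs:

```python
from collections import Counter

def majority_label(predictions: list[str]) -> tuple[str, float]:
    """Return the most frequent recognition result and its share of the votes."""
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)

# Illustrative outputs of three recognizers for the same pen stroke sequence.
label, agreement = majority_label(["Hello", "Hello", "Hallo"])
print(label, agreement)   # "Hello", ~0.67 -> low agreement can flag a doubtful annotation
```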
WP 3.3: Develop Bayesian approaches to identify uncertain annotations.
- Uncertainty information, calculated through multiple evaluations of a neural network in a Monte Carlo simulation with dropout, is used to verify annotations for time series data.
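A minimal sketch of such a Monte Carlo dropout estimate, assuming a PyTorch classifier with dropout layers; the model and input here are placeholders for the handwriting recognizer and a time-series sample:

```python
import torch

def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor, n_runs: int = 30):
    """Evaluate the network several times with dropout active and return mean and spread."""
    model.eval()
    for m in model.modules():                     # keep dropout stochastic at test time
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_runs)])
    return probs.mean(dim=0), probs.std(dim=0)    # high std -> uncertain annotation

# Placeholder model and input, standing in for the handwriting recognizer and a time series.
model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                            torch.nn.Dropout(0.5), torch.nn.Linear(64, 10))
mean_probs, uncertainty = mc_dropout_predict(model, torch.randn(1, 16))
```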
WP 3.4: Integrate all three in a demonstrator.
Work Package 4: Automatic annotation of data without ground truth
WP 4.1: Assignment of multiple annotations with recognizer confidence using approaches from WP 3
- Extend the developed methods to unannotated data
WP 4.2: Comparison with complex language models.
- Use next-word prediction to statistically identify mislabeled data
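A minimal sketch of this idea using perplexity under a pretrained language model, assuming the Hugging Face transformers library with GPT-2 as a stand-in model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def annotation_perplexity(text: str) -> float:
    """Perplexity of an annotation under the language model; high values suggest mislabeling."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return float(torch.exp(loss))

# An annotation whose text the language model finds very unlikely is flagged for review.
for label in ["the quick brown fox", "the quick brown fxo"]:
    print(label, annotation_perplexity(label))
```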
WP 4.3: Evaluation of the quality of the new annotations.
- Compare the effectiveness of the different methods and use the best output in a positive feedback loop.
WP 4.4: Integrate all three in a demonstrator.
Work Package 5: Active learning with user input
WP 5.1: Development of active learning and human-in-the-loop mechanisms for semi-automated annotation.
- Comprises both active input and indirect inference based on user reactions.
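A minimal sketch of uncertainty-based selection of samples for user input, assuming per-sample predictive probabilities (for example from the Monte Carlo dropout estimates of WP 3.3); the probabilities here are placeholders:

```python
import numpy as np

def least_confident_indices(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the samples whose top-class probability is lowest and ask the user about them."""
    confidence = probs.max(axis=1)            # probability of the predicted class
    return np.argsort(confidence)[:budget]

# Placeholder predictive probabilities for 1000 unannotated samples over 10 classes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=1000)

ask_user = least_confident_indices(probs, budget=20)
# These 20 samples would be presented in the app for active user annotation;
# indirect feedback (e.g. corrections during normal use) could be folded in the same way.
```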
WP 5.2: Evaluation of the acceptance of the mechanisms.
- The focus here is on the quantity of user input, while the quality is to be evaluated in the following WP.
WP 5.3: Evaluation of the quality of the annotations determined by active learning.
WP 5.4: Integrate all three in a demonstrator.
Work Packages 6 and 7: Validation
WP 6: Collection of handwriting samples
WP 7: Testing the demonstrators