“Core Tokensets for Data-efficient Sequential Training of Transformers” has been accepted to the ICCV Workshop on Continual Learning in Computer Vision

Our paper “Core Tokensets for Data-efficient Sequential Training of Transformers” has been accepted for publication in the ICCV-W proceedings as part of the Workshop on Continual Learning in Computer Vision. The paper will be presented at the workshop in October at ICCV in Hawaii.

In the paper, we introduce the concept of core tokensets. Whereas traditional coresets identify a subset of a dataset that yields results comparable to the full dataset, core tokensets go one level deeper and additionally find the subset of relevant tokens within each selected sample. For images, this means going beyond ranking one image as more important than another and instead focusing on the precise regions or patches that carry the important content. We show that this reduces memory cost substantially (by up to a factor of 10) compared to traditional coresets across several applications: image classification, multi-modal image captioning, and visual question answering.
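To make the two-level idea concrete, here is a minimal sketch of the selection step. It is not the implementation from the paper: the function names `top_k_indices` and `build_core_tokenset` are hypothetical, and the importance scores are made-up toy numbers standing in for whatever sample-selection and feature-attribution methods are actually used.

```python
from typing import Dict, List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def build_core_tokenset(
    sample_scores: Sequence[float],
    token_scores: Sequence[Sequence[float]],
    num_samples: int,
    tokens_per_sample: int,
) -> Dict[int, List[int]]:
    """Map each retained sample index to the token (patch) indices kept for it.

    Step 1 (coreset level): keep the highest-scoring samples.
    Step 2 (token level): within each kept sample, keep the highest-scoring tokens.
    """
    kept_samples = top_k_indices(sample_scores, num_samples)
    return {
        s: top_k_indices(token_scores[s], tokens_per_sample)
        for s in kept_samples
    }


# Toy data (assumed values): 4 images, each split into 6 patches (tokens).
sample_scores = [0.2, 0.9, 0.5, 0.7]
token_scores = [
    [0.1, 0.3, 0.2, 0.0, 0.4, 0.1],
    [0.0, 0.8, 0.1, 0.7, 0.2, 0.1],
    [0.5, 0.1, 0.1, 0.1, 0.1, 0.6],
    [0.3, 0.2, 0.9, 0.1, 0.6, 0.0],
]

tokenset = build_core_tokenset(
    sample_scores, token_scores, num_samples=2, tokens_per_sample=2
)

# Memory comparison: tokens stored by the tokenset versus a coreset that
# stores every token of the same two images.
stored_tokens = sum(len(tokens) for tokens in tokenset.values())
coreset_tokens = len(tokenset) * len(token_scores[0])
print(tokenset, stored_tokens, coreset_tokens)
```

In this toy setting, the tokenset keeps 4 tokens where a coreset of the same two images would keep 12. The actual savings depend on how many tokens per sample are retained and on the quality of the attribution scores.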

For more information, read the full paper. The abstract is provided below:

Deep networks are frequently tuned to novel tasks and continue learning from ongoing data streams. Such sequential training requires consolidation of new and past information, a challenge predominantly addressed by retaining the most important data points – formally known as coresets. Traditionally, these coresets consist of entire samples, such as images or sentences. However, recent transformer architectures operate on tokens, leading to the famous assertion that an image is worth 16×16 words. Intuitively, not all of these tokens are equally informative or memorable. Going beyond coresets, we thus propose to construct a deeper-level data summary on the level of tokens. Our respectively named core tokensets both select the most informative data points and leverage feature attribution to store only their most relevant features. We demonstrate that core tokensets yield significant performance retention in incremental image classification, open-ended visual question answering, and continual image captioning with significantly reduced memory. In fact, we empirically find that a core tokenset of 1% of the data performs comparably to at least a twice as large and up to 10 times larger coreset.