What is being transferred in transfer learning



This post covers the paper “What is being transferred in transfer learning?” by Google / Google Brain (NeurIPS 2020).


What is being transferred in transfer learning?

  • Abstract

    • Question: What enables a successful transfer, and which parts of the network are responsible for it?
    • Separate the effect of feature reuse from that of learning low-level statistics of the data
      • Show that some of the benefit of transfer learning comes from low-level statistics of the data
  • Introduction

    • Target Domains

      • CheXpert: AAAI 2019, a dataset of 224,316 chest radiographs from 65,240 patients
      • A labeler detects the presence of 14 observations in the free-text radiology reports and captures the uncertainty in those reports with an uncertainty label
      • A CNN outputs the probability of each of the 14 observations given frontal and lateral radiographs

      • DomainNet: ICCV 2019, about 0.6 million images in 345 categories across six distinct domains (sketch, real, quickdraw, painting, infograph, clipart)
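    • [Figure: sample images from the downstream datasets. Panels: CX (CheXpert), DN Real, DN Real shuffle 8x8, DN Real shuffle 1x1, DN Clipart, DN Quickdraw, DN Quickdraw shuffle 32x32, where DN = DomainNet]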
  • Investigated feature reuse by shuffling the data

    • Method:
      • Partition each image of the downstream task into equal-sized blocks and shuffle the blocks randomly
      • Shuffling disrupts the visual features, especially when the block size is small
    • Result:
      • Low-level statistics of the data that are not disturbed by shuffling the pixels also play a role in successful transfer
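
The block-shuffling step described above can be sketched with NumPy. This is only an illustration under assumptions, not the authors' code: the function name `shuffle_blocks`, the (H, W, C) image layout, and the requirement that both image sides are divisible by the block size are choices made here.

```python
import numpy as np

def shuffle_blocks(image, block_size, rng=None):
    """Cut an (H, W, C) image into block_size x block_size tiles and permute them.

    Hypothetical helper for illustration; H and W must be divisible by block_size.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    if h % block_size or w % block_size:
        raise ValueError("image sides must be divisible by block_size")
    rows, cols = h // block_size, w // block_size

    # (rows, bs, cols, bs, c) -> (rows, cols, bs, bs, c) -> (rows * cols, bs, bs, c)
    blocks = image.reshape(rows, block_size, cols, block_size, c).swapaxes(1, 2)
    blocks = blocks.reshape(rows * cols, block_size, block_size, c)

    # Randomly reorder the tiles; pixels inside each tile keep their layout.
    blocks = blocks[rng.permutation(rows * cols)]

    # Undo the reshaping to get back an (H, W, C) image.
    blocks = blocks.reshape(rows, cols, block_size, block_size, c).swapaxes(1, 2)
    return blocks.reshape(h, w, c)
```

With `block_size=1` every pixel is moved independently, so visual features are destroyed while per-channel pixel statistics (for example the channel means) stay exactly the same. With a block size equal to the image size the image is returned unchanged.
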
  • Agreements/disagreements between models trained from pre-trained weights vs. from scratch

    • Observation
      • Two instances of models trained from pre-trained weights make similar mistakes. However, a model trained from pre-trained weights and a model trained from random initialization have fewer mistakes in common.
    • Result
      • Instances of models trained from pre-trained weights are more similar in feature space than instances trained from random initialization
    • Additional Observation
      • Feature similarity, measured by Centered Kernel Alignment (CKA) between the corresponding modules of two model instances, supports this result
      • Two instances trained from pre-trained weights are also much closer in L2 distance than two instances trained from random initialization, which further supports the result
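
The three comparisons in this section (common mistakes, feature similarity, parameter distance) can be sketched as follows. This is a minimal sketch under assumptions: it uses the linear variant of CKA, which may differ from the exact variant used in the paper, and the function names and the "list of arrays" parameter format are hypothetical.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_examples, n_features).

    Both matrices must describe the same n examples; feature widths may differ.
    """
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

def common_mistakes(pred_a, pred_b, labels):
    """Number of examples that both models classify incorrectly."""
    return int(np.sum((pred_a != labels) & (pred_b != labels)))

def l2_distance(params_a, params_b):
    """L2 distance between two models given as lists of same-shaped arrays."""
    squared = sum(np.sum((a - b) ** 2) for a, b in zip(params_a, params_b))
    return float(np.sqrt(squared))
```

Linear CKA equals 1 for identical features and is unchanged by rotating or uniformly rescaling one of the feature matrices, which is why it is suited to comparing the same module across two independently trained instances.
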
  • Loss landscape of models trained from pre-trained weights and from random initialization

    • Observation
      • No performance barrier is observed between two instances of models trained from pre-trained weights
      • Barriers are observed between the solutions of two instances trained from randomly initialized weights, even when the same random weights are used for initialization
    • Result
      • Pre-trained weights guide the optimization to a flat basin of the loss landscape
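
One common way to check for a barrier between two solutions is to evaluate the loss along the straight line that connects them in parameter space. The sketch below assumes flattened parameter vectors and defines the barrier as the highest loss on the path minus the larger endpoint loss; both choices, as well as the toy `double_well` loss, are assumptions made here for illustration and are not taken from the paper.

```python
import numpy as np

def interpolation_losses(loss_fn, theta_a, theta_b, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the line from theta_a to theta_b."""
    lambdas = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [loss_fn((1.0 - lam) * theta_a + lam * theta_b) for lam in lambdas]
    )
    return lambdas, losses

def barrier_height(losses):
    """Highest loss on the path minus the larger endpoint loss (0 means no barrier)."""
    return float(losses.max() - max(losses[0], losses[-1]))

def double_well(theta):
    """Toy stand-in loss with two basins, centered at -1 and +1 per coordinate."""
    return float(np.sum((theta ** 2 - 1.0) ** 2))
```

In the toy loss, interpolating between `np.array([-1.0])` and `np.array([1.0])` crosses the bump at zero and yields a positive barrier, while interpolating between `np.array([0.9])` and `np.array([1.1])` stays inside one basin and yields a barrier of zero. For a real network, `loss_fn` would load the interpolated weights into the model and return its loss or error on held-out data.
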
  • Where feature reuse is happening

    • Different modules of the network differ in their robustness to parameter perturbation, as quantified by the module criticality measure
    • Observation
      • An analysis of the criticality of different modules shows that higher layers in the network have tighter valleys
    • Result
      • Features become more specialized deeper in the network, and feature reuse happens in the layers that are closer to the input.
    • Observation
      • Models trained from random initialization have a transition point in their valleys, which may be due to changing basins during training
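
The "valley" of a single module can be pictured by moving only that module's weights along the line from their initial to their final values while every other module stays at its final values. The sketch below is a simplified illustration of that idea: it omits the noise term used in the full module criticality measure, and the dictionary format, the function names, and the toy loss are all hypothetical.

```python
import numpy as np

def module_valley(loss_fn, init_params, final_params, module, num_points=11):
    """Loss along the init-to-final path of one module, others fixed at final values.

    init_params and final_params map module names to NumPy arrays.
    """
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = []
    for alpha in alphas:
        params = dict(final_params)  # shallow copy; other modules stay final
        params[module] = (
            (1.0 - alpha) * init_params[module] + alpha * final_params[module]
        )
        losses.append(loss_fn(params))
    return alphas, np.array(losses)

def toy_loss(params):
    """Toy stand-in loss in which the 'upper' module is 10x more sensitive."""
    lower_term = np.sum((params["lower"] - 1.0) ** 2)
    upper_term = 10.0 * np.sum((params["upper"] - 1.0) ** 2)
    return float(lower_term + upper_term)
```

A module whose loss rises steeply as soon as its weights move away from their final values has a tight valley and is more critical. In the toy loss the `"upper"` module plays that role by construction.
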
  • Basin of loss landscape

    • Looking at different checkpoints from the training of the pre-trained model shows that one can start fine-tuning from earlier checkpoints without losing accuracy in the target domain

  • Main Contributions

    • Both feature reuse and low-level statistics of the data are important for successful transfer
    • Models trained from pre-trained weights:
      • make similar mistakes on the target domain,
      • have similar features,
      • are close in L2 distance in the parameter space, and
      • are in the same basin of the loss landscape.
    • Models trained from random initialization:
      • do not live in the same basin,
      • make different mistakes,
      • have different features, and
      • are farther apart in L2 distance in the parameter space.
    • Modules in the lower layers are in charge of general features, and modules in higher layers are more sensitive to perturbation of their parameters.
    • One can start from earlier checkpoints of the pre-trained model without losing accuracy of the fine-tuned model. The point from which this holds depends on when the pre-trained model enters its final basin.