This post covers the paper “What is being transferred in transfer learning?” by Google / Google Brain (NeurIPS 2020).
What is being transferred in transfer learning?
- Question: What enables a successful transfer, and which parts of the network are responsible for it?
- Separate the effect of feature reuse from that of learning low-level statistics of the data
- Show that some of the benefit of transfer learning comes from low-level statistics of the data
- CheXpert: AAAI 2019, dataset of 224,316 chest radiographs of 65,240 patients
- A labeler detects the presence of 14 observations in free-text radiology reports and captures the uncertainty in the reports with an uncertainty label
- A CNN outputs the probability of each of the 14 observations given frontal and lateral radiographs
- DomainNet: ICCV 2019, 0.6 million images across 345 categories in six distinct domains (sketch, real, quickdraw, painting, infograph, clipart)
[Figure panels: CheXpert (CX), DomainNet (DN) Real, DN Real shuffled 8x8, DN Real shuffled 1x1, DN Clipart, DN Quickdraw, DN Quickdraw shuffled 32x32]
Investigated feature reuse by shuffling the data
- Partition each image of the downstream task into equal-sized blocks and shuffle the blocks randomly
- Shuffling disrupts the visual features, especially when the block size is small
- Low-level statistics of the data, which are not disturbed by shuffling the pixels, also play a role in successful transfer
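
The block shuffling described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's own code: the function name `shuffle_blocks`, its parameters, and the requirement that the image size be divisible by the block size are assumptions made for the example.

```python
import numpy as np

def shuffle_blocks(image, block_size, rng=None):
    """Split an H x W (x C) image into square blocks and permute them randomly.

    Hypothetical helper: assumes H and W are divisible by block_size.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if h % block_size or w % block_size:
        raise ValueError("image height and width must be divisible by block_size")
    rows, cols = h // block_size, w // block_size
    extra = image.shape[2:]  # channel axis, if present

    # View the image as a (rows, cols) grid of blocks.
    tiles = image.reshape(rows, block_size, cols, block_size, *extra)
    tiles = tiles.swapaxes(1, 2)  # -> (rows, cols, block, block, ...)
    flat = tiles.reshape(rows * cols, block_size, block_size, *extra)

    # Permute the blocks; the pixels inside each block keep their layout.
    flat = flat[rng.permutation(rows * cols)]

    # Reassemble an image with the original shape.
    tiles = flat.reshape(rows, cols, block_size, block_size, *extra)
    return tiles.swapaxes(1, 2).reshape(image.shape)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
shuffled_8x8 = shuffle_blocks(image, block_size=8, rng=rng)
shuffled_1x1 = shuffle_blocks(image, block_size=1, rng=rng)  # pixel-level shuffle
```

With `block_size=1` every pixel moves independently, so the spatial structure is destroyed while the per-channel pixel values (and hence statistics such as the mean and histogram) are unchanged. This is the sense in which low-level statistics survive the shuffle.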
Agreements/Disagreements between models trained from pre-trained weights vs. from scratch
- Two instances of models trained from pre-trained weights make similar mistakes. However, when a model trained from pre-trained weights is compared with one trained from scratch, the two have fewer mistakes in common.
- Instances of models trained from pre-trained weights are more similar in feature space than instances trained from random initialization
- Additional Observation
- Feature similarity of different modules of two model instances, measured by Centered Kernel Alignment (CKA), supports this result
- Two instances of models trained from pre-trained weights are also much closer in L2 distance than instances trained from random initialization, which points to the same conclusion
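
The three comparisons above (shared mistakes, feature similarity, and parameter distance) can each be written as a small function. The sketch below is illustrative only. It uses the linear variant of CKA, which is an assumption, since the notes do not say which kernel is used. The function names and the representation of a model as a list of weight arrays are likewise made up for the example.

```python
import numpy as np

def common_mistakes(pred_a, pred_b, labels):
    """Number of examples that both models misclassify."""
    return int(np.sum((pred_a != labels) & (pred_b != labels)))

def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_examples, n_features).

    The two matrices must cover the same examples; feature counts may differ.
    """
    # Center every feature so the similarity ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

def l2_distance(params_a, params_b):
    """L2 distance between two models given as lists of same-shaped arrays."""
    squared = sum(np.sum((a - b) ** 2) for a, b in zip(params_a, params_b))
    return float(np.sqrt(squared))
```

Linear CKA lies between 0 and 1 and is unchanged by orthogonal transformations and isotropic scaling of the features, which makes it suitable for comparing the same module across two independently trained instances.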
Loss landscape of models trained from pre-trained weights and from random initialization
- No performance barrier between two instances of models trained from pre-trained weights
- Barriers observed between the solutions from two instances trained from randomly initialized weights, even when the same random weights are used for initialization
- Pre-trained weights guide the optimization to a flat basin of the loss landscape
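
One common way to check for a barrier between two solutions is to evaluate the loss along the straight line connecting their weights. The sketch below follows that idea on a toy loss. The names `interpolate` and `performance_barrier` are hypothetical, as is the convention of measuring the barrier as the largest rise above the line joining the two endpoint losses. The paper may define the barrier slightly differently.

```python
import numpy as np

def interpolate(params_a, params_b, lam):
    """Weights at position lam on the line from params_a (0) to params_b (1)."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(params_a, params_b)]

def performance_barrier(params_a, params_b, loss_fn, num_points=11):
    """Largest rise of the loss above the line joining the two endpoint losses."""
    lams = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn(interpolate(params_a, params_b, lam)) for lam in lams])
    baseline = (1.0 - lams) * losses[0] + lams * losses[-1]
    return float(np.max(losses - baseline)), losses

# Toy double-well loss with minima at w = -1 and w = +1 (assumption for illustration).
def toy_loss(params):
    return float(np.sum((params[0] ** 2 - 1.0) ** 2))

different_basins, _ = performance_barrier([np.array([-1.0])], [np.array([1.0])], toy_loss)
same_basin, _ = performance_barrier([np.array([0.9])], [np.array([1.0])], toy_loss)
```

In the toy example, the two solutions in opposite wells are separated by a bump in the loss, while the two solutions in the same well are not. For a real network, `loss_fn` would load the interpolated weights into the model and evaluate it on held-out data.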
Where feature reuse is happening
- Different modules of the network differ in their robustness to parameter perturbation; this is quantified by the module criticality measure
- Analysing the criticality of different modules shows that higher layers in the network have tighter valleys
- Features become more specialized deeper in the network, and feature reuse happens in the layers closer to the input
- Models trained from random initialization have a transition point in their valleys, which may be due to changing basins during training
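
The "valley" of a module can be pictured by moving only that module's weights along the line from their initial to their final values, while every other module stays at its trained weights. The sketch below implements just this simplified path on a toy loss. It leaves out the noise perturbation that the full criticality measure also considers, and the function name, the dict-of-arrays model representation, and the toy loss are all assumptions.

```python
import numpy as np

def module_valley(init_params, final_params, module, loss_fn, num_points=11):
    """Loss along the init-to-final path of one module, others kept at final weights."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = []
    for alpha in alphas:
        params = dict(final_params)  # shallow copy: other modules stay trained
        params[module] = (1.0 - alpha) * init_params[module] + alpha * final_params[module]
        losses.append(loss_fn(params))
    return alphas, np.array(losses)

# Toy model with two "modules"; the loss is far more sensitive to the higher one.
init = {"low": np.zeros(2), "high": np.zeros(2)}
final = {"low": np.ones(2), "high": np.ones(2)}

def toy_loss(params):
    low_term = 0.1 * np.sum((params["low"] - 1.0) ** 2)
    high_term = 10.0 * np.sum((params["high"] - 1.0) ** 2)
    return float(low_term + high_term)

_, low_valley = module_valley(init, final, "low", toy_loss)
_, high_valley = module_valley(init, final, "high", toy_loss)
```

Here the loss rises much faster when the "high" module is moved back toward its initial weights than when the "low" module is, which is the toy analogue of a tighter valley for a more critical module.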
Basin of loss landscape
- Looking at different checkpoints from the training of the pre-trained model shows that one can start fine-tuning from the earlier checkpoints without losing accuracy in the target domain
- Both feature reuse and low-level statistics of the data are important for successful transfer
- Models trained from pre-trained weights:
  - make similar mistakes on the target domain,
  - have similar features,
  - are close in L2 distance in the parameter space, and
  - are in the same basin of the loss landscape.
- Models trained from random initialization:
  - do not live in the same basin,
  - make different mistakes,
  - have different features, and
  - are farther apart in L2 distance in the parameter space.
- Modules in the lower layers are in charge of general features and modules in higher layers are more sensitive to perturbation of their parameters.
- One can start from earlier checkpoints of the pre-trained model without losing accuracy in the fine-tuned model. The point from which this holds depends on when the pre-trained model enters its final basin.