Deep Network Optimization

PRUNING / KNOWLEDGE DISTILLATION / LOTTERY TICKET

  1. Teacher-student knowledge distillation focusing on Knowledge & Ranking distillation

  1. Lottery ticket on BERT: magnitude vs. structured pruning on various metrics, i.e., the Lottery Ticket Hypothesis works on BERT. The classical Lottery Ticket Hypothesis was mostly tested with unstructured pruning, specifically magnitude pruning (m-pruning), where the weights with the lowest magnitude are pruned irrespective of their position in the model. We iteratively prune 10% of the least-magnitude weights across the entire fine-tuned model (except the embeddings) and evaluate on the dev set, for as long as the performance of the pruned subnetwork stays above 90% of the full model.

We also experiment with structured pruning (s-pruning) of entire components of the BERT architecture based on their importance scores: specifically, we 'remove' the least important self-attention heads and MLPs by applying a mask. In each iteration, we prune 10% of BERT heads and 1 MLP, for as long as the performance of the pruned subnetwork stays above 90% of the full model. To determine which heads/MLPs to prune, we use a loss-based approximation: the importance scores proposed by Michel, Levy and Neubig (2019) for self-attention heads, which we extend to MLPs. Please see our paper and the original formulation for more details.
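
A minimal sketch of that iterative m-pruning loop using PyTorch's pruning utilities; `evaluate_on_dev` and `full_model_score` are hypothetical stand-ins for your dev-set metric and the unpruned model's score, and only Linear weights are targeted so embeddings stay untouched:

```python
import torch
import torch.nn.utils.prune as prune

def iterative_magnitude_pruning(model, evaluate_on_dev, full_model_score,
                                step=0.10, threshold=0.90):
    # Globally prune the lowest-magnitude weights in all Linear layers
    # (embeddings are nn.Embedding modules, so they are not selected here).
    params = [(m, "weight") for m in model.modules() if isinstance(m, torch.nn.Linear)]
    while True:
        prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=step)
        # Stop once the pruned subnetwork drops below 90% of the full model's dev score.
        if evaluate_on_dev(model) < threshold * full_model_score:
            break
    return model
```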

  1. Troubleshooting Neural Nets

(37 reasons, plus 10 more) - copied and lightly rewritten here for convenience. It is pretty thorough, but long and extensive; you should build some intuition rather than going through all of these every time. The article itself has much more insight and information on each item.

The author of the original article suggests turning everything off and then building your network back up step by step, i.e., a “divide and conquer” debugging method.

Dataset Issues

1. Check your input data - for stupid mistakes

2. Try random input - if the error behaves the same on random data, there is a problem in the net. Debug layer by layer

3. Check the data loader - your data may be fine but the code feeding it to the net may be broken; print the first layer's input and check it.

4. Make sure input is connected to output - do samples have correct labels, even after shuffling?

5. Is the relationship between input and output too random? - the inputs may not be sufficiently related to the outputs. It's pretty amorphous; just look at the data.

6. Is there too much noise in the dataset? - badly labelled datasets.

7. Shuffle the dataset - useful to counteract order in the DS, always shuffle input and labels together.

8. Reduce class imbalance - imbalanced datasets may bias the model toward the majority classes. Balance your classes, weight your loss, do something (see the weighted-loss sketch after this list).

9. Do you have enough training examples? - if training from scratch, you probably need ~1000 images per class, and probably similar numbers for other types of samples.

10. Make sure your batches don’t contain a single label - this is probably something you won’t notice and will waste a lot of time figuring out! In certain cases, shuffle the DS to prevent batches from having the same label.

11. Reduce batch size - this paper points out that having a very large batch can reduce the generalization ability of the model. However, note that I found other references claiming that a batch that is too small will also hurt performance.

Bonus: test on well-known datasets (e.g., MNIST, CIFAR-10).
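
For item 8, one common way to “balance your loss” is to weight it inversely to class frequency. A minimal PyTorch sketch with made-up class counts:

```python
import torch
import torch.nn as nn

# Hypothetical counts for a 3-class dataset with heavy imbalance.
class_counts = torch.tensor([900.0, 90.0, 10.0])
weights = class_counts.sum() / (len(class_counts) * class_counts)  # rarer class -> larger weight

criterion = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(8, 3)            # fake model outputs
targets = torch.randint(0, 3, (8,))   # fake labels
loss = criterion(logits, targets)
```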

Data Normalization/Augmentation

12. Standardize the features - zero mean and unit variance, i.e., input normalization (see the sketch after item 15).

13. Do you have too much data augmentation?

Augmentation has a regularizing effect. Too much of this combined with other forms of regularization (weight L2, dropout, etc.) can cause the net to underfit.

14. Check the preprocessing of your pretrained model - when using a pretrained model, make sure your input data is in a similar range to what it was trained on: [0, 1], [-1, 1] or [0, 255].

15. Check the preprocessing for train/validation/test set - CS231n points out a common pitfall:

Any preprocessing should be computed ONLY on the training data, then applied to val/test
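
Putting items 12 and 15 together, a minimal numpy sketch: the mean/std are computed on the training split only and then reused for val/test (the arrays here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train, X_val, X_test = (rng.normal(5.0, 2.0, (100, 4)),
                          rng.normal(5.0, 2.0, (20, 4)),
                          rng.normal(5.0, 2.0, (20, 4)))

# Statistics come from the TRAINING data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8      # epsilon avoids division by zero

X_train = (X_train - mean) / std
X_val = (X_val - mean) / std          # reuse the training statistics
X_test = (X_test - mean) / std        # never recompute on val/test
```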

Implementation issues

16. Try solving a simpler version of the problem - divide and conquer the prediction, e.g., if predicting both class and box coordinates, start with just one of them.

17. Look for correct loss “at chance” - calculate the loss at chance level, e.g., for a 10-class baseline it is -ln(0.1) ≈ 2.3, since softmax loss is the negative log probability. Afterwards, increase the regularization strength, which should increase the loss.

18. Check your custom loss function.

19. Verify loss input - parameter confusion, e.g., passing probabilities where the loss expects raw logits, or mixing up the prediction/target argument order.

20. Adjust loss weights - if your loss is composed of several smaller loss functions, make sure their magnitudes relative to each other are correct. This might involve testing different combinations of loss weights.

21. Monitor other metrics - like accuracy.

22. Test any custom layers by debugging them in isolation.

23. Check for “frozen” layers or variables - were any accidentally frozen? (See the requires_grad check after this list.)

24. Increase network size - more layers, more neurons.

25. Check for hidden dimension errors - confusion is easy with shapes like (64, 64, 64).

26. Explore gradient checking - does your backprop work for custom gradients? See the CS231n notes on [gradient checking](http://cs231n.github.io/neural-networks-3/#gradcheck); a sketch follows after this list.
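
For item 23, a quick way to spot accidentally frozen parameters in PyTorch is to list which ones do not require gradients (a small sketch; `model` can be any nn.Module):

```python
import torch.nn as nn

def report_frozen(model: nn.Module):
    # Print every parameter that will NOT receive gradient updates.
    for name, param in model.named_parameters():
        if not param.requires_grad:
            print(f"frozen: {name}  shape={tuple(param.shape)}")
```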
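
For item 26, PyTorch ships a numerical gradient checker; a small sketch with a stand-in custom op (double precision is recommended for stability):

```python
import torch

def my_custom_op(x):
    # Stand-in for a custom layer whose backward pass you want to verify.
    return torch.sin(x) * x ** 2

# Compares analytic (autograd) gradients against finite differences.
x = torch.randn(6, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(my_custom_op, (x,)))  # prints True if they match
```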

Training issues

27. Solve for a really small dataset - can you overfit on 2 samples? If not, something is broken.

28. Check weights initialization - use Xavier or He initialization, or stick with the framework defaults for architectures such as RNNs (see the init sketch after this list).

29. Change your hyperparameters - grid search

30. Reduce regularization - too much may cause underfitting; try reducing dropout, batch norm, and weight/bias L2 regularization.

31. Give it more training time as long as the loss is decreasing.

32. Switch from Train to Test mode - networks behave differently in train vs. eval mode (dropout, batch norm), so make sure you switch appropriately when evaluating.

33. Visualize the training - activations, weights, layer updates, biases. Use TensorBoard and Crayon. Tips from Deeplearning4j: expect a roughly Gaussian distribution for weights; biases start at 0 and usually end up approximately Gaussian. Keep an eye out for parameters diverging to +/- infinity, and for biases that become very large; the latter can sometimes occur in the output layer for classification if the distribution of classes is very imbalanced.

34. Try a different optimizer - check this excellent post about gradient descent optimizers.

35. Exploding / vanishing gradients - gradient clipping may help (see the sketch after this list). Tips from Deeplearning4j: "A good standard deviation for the activations is on the order of 0.5 to 2.0. Significantly outside of this range may indicate vanishing or exploding activations."

36. Increase/decrease the learning rate, or use an adaptive learning rate.

37. Overcoming NaNs - a big issue for RNNs. Decrease the learning rate, check for divisions by zero and logs of zero or negative numbers, and evaluate layer by layer to find where the NaN first appears (see the sketch below).
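
For item 28, a minimal sketch of applying He (Kaiming) or Xavier initialization to the linear layers of a PyTorch model:

```python
import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")  # He init
        # Alternative: nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.apply(init_weights)
```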
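
For item 35, a minimal sketch of gradient clipping in a PyTorch training step (the model, data, and max_norm=1.0 are made up for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(16, 10), torch.randint(0, 2, (16,))
loss = criterion(model(x), y)
loss.backward()
# Rescale gradients so their global norm is at most 1.0, then step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```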
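
For item 37, two PyTorch tools that help localize where a NaN first appears (anomaly detection slows training, so enable it only while debugging); the toy model below is just for illustration:

```python
import torch
import torch.nn as nn

# Make autograd raise an error at the op that produced a NaN/Inf gradient.
torch.autograd.set_detect_anomaly(True)

def nan_hook(module, inputs, output):
    # Forward hook: flag the first layer whose output contains NaNs.
    if isinstance(output, torch.Tensor) and torch.isnan(output).any():
        print(f"NaN in output of {module.__class__.__name__}")

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
for layer in model.modules():
    layer.register_forward_hook(nan_hook)

_ = model(torch.randn(4, 8))  # hooks fire during the forward pass
```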
