Datasets

Structured / Unstructured data

BIAS / VARIANCE

  1. Various Bias types by queue.acm

  1. Understanding the next stage in DL (& ML) algorithm development: the basic approach - Andrew Ng on YouTube

    Terms: training, validation, test.

    Split: training & validation 70%, test 30%

    Procedure: cross-fold training and validation, or further split the 70% into training and validation sets.

    BIAS - Situation 1 - doing much worse than human:

    Human expert: 1% error

    Training set error: 5% error (test on train)

    Validation set error: 6% error (test on validation or CFV)

    Conclusion: there is a BIAS between human expert and training set

    Solution: 1. Train deeper or larger networks, 2. train longer, 3. possibly get more data to reach the human-expert level, or 4. try a new model architecture.

    VARIANCE - Situation 2 - validation set not close to training set error:

    Human expert: 1% error

    Training set error: 2% error

    Validation set error: 6% error

    Conclusion: there is a VARIANCE problem, i.e. OVERFITTING, between training and validation.

    Solution: 1. Early stopping, 2. Regularization or 3. get more data, or 4. New model architecture.

    Situation 3 - both:

    Human expert: 1% error

    Training set error: 5% error

    Validation set error: 10% error

    Conclusion: both problems occur, i.e., BIAS and VARIANCE.

    Solution: apply all of the above (a small diagnostic sketch follows below).
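
A minimal sketch of the diagnostic logic above, assuming an arbitrary threshold for what counts as a meaningful gap:

```python
def diagnose(human_error, train_error, valid_error, gap=0.02):
    """Rough bias/variance diagnosis from the three error rates discussed above.
    `gap` is an arbitrary threshold for a meaningful difference."""
    problems = []
    if train_error - human_error > gap:
        problems.append("BIAS: bigger/deeper model, train longer, new architecture, maybe more data")
    if valid_error - train_error > gap:
        problems.append("VARIANCE (overfitting): early stopping, regularization, more data, new architecture")
    return problems or ["roughly at human level on both sets"]

# Situation 3 from above: human 1%, train 5%, validation 10% -> both problems.
print(diagnose(0.01, 0.05, 0.10))
```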

  • Underfitting = bigger model / train longer (and possibly more data).

  • Overfitting = Early stopping, regularization; reason: the model fits detail & noise.

  • Overfitting happens more in non-parametric (and non-linear) algorithms such as decision trees.

  • Bottom line, bigger model or more data will solve most issues.

  • In practice, the advice is to start with regularized linear regression.

IMPORTANT! Train/test methodology when the data comes from different distributions:

E.g., TRAIN: 50K hours of generic voice chatter as the train set for a deep network; TEST: 10 hours of the specific voice-based problem, e.g., taxi chatter.

Best practice: draw the validation & test sets from the same distribution as the problem, i.e., the 10-hour set.

Reason: improving scores on a validation set drawn from a different distribution does not yield the same quality of improvement as improving scores on a validation set drawn from the actual distribution of the problem’s data, i.e., the 10-hour set.

NOTE: this is unlike the usual supervised setting, where all the data comes from the same distribution and we simply split the training set into train and validation (or use cross validation).

Situation 4: however, when there are 2 distributions, it is possible to extend the split: the training set is divided into training and valid_train, and the test set into validation and test.

Split: Train / Valid_Train = 48K / 2K hours, and Valid / Test = 5H / 5H (a minimal split sketch follows the situation mapping below).

So Situation 1 stays the same: training error vs. human-level error (BIAS).

Situation 2 is now the Valid_Train (train-dev) error vs. the training error (VARIANCE).

Situation 3 is now the Validation error vs. the Valid_Train error - a data-mismatch problem; solutions: get more data from the target distribution, data synthesis - make the training data more similar to the test data, or a new architecture.

Situation 4 is now the Test set error vs. the Validation error - get more (validation) data.
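
A minimal sketch of this four-way split using scikit-learn's train_test_split; the array names and sizes are illustrative stand-ins for the two distributions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the two distributions (names and sizes are illustrative).
general_X = np.random.randn(50_000, 20)    # e.g. generic voice-chatter features
general_y = np.random.randint(0, 2, 50_000)
target_X = np.random.randn(1_000, 20)      # e.g. taxi-chatter features (small, on-distribution)
target_y = np.random.randint(0, 2, 1_000)

# Large, off-distribution set -> Train + Valid_Train (train-dev): 48K / 2K.
X_train, X_vtrain, y_train, y_vtrain = train_test_split(
    general_X, general_y, test_size=2_000 / 50_000, random_state=0)

# Small, on-distribution set -> Valid + Test, half and half.
X_valid, X_test, y_valid, y_test = train_test_split(
    target_X, target_y, test_size=0.5, random_state=0)

print(len(X_train), len(X_vtrain), len(X_valid), len(X_test))  # 48000 2000 500 500
```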

SPARSE DATASETS

Sparse matrices in ML - typically produced by one-hot / TF-IDF encodings; common storage formats: dictionary of keys, list of lists, coordinate list.
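
A small illustration of these storage formats using scipy.sparse (the numbers are arbitrary):

```python
import numpy as np
from scipy import sparse

# A tiny "one-hot / TF-IDF-like" matrix: mostly zeros.
dense = np.array([[0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 2.5],
                  [3.0, 0.0, 0.0, 0.0]])

coo = sparse.coo_matrix(dense)   # coordinate list: (row, col, value) triplets
lil = sparse.lil_matrix(dense)   # list of lists: efficient incremental construction
dok = sparse.dok_matrix(dense)   # dictionary of keys: {(row, col): value}
csr = coo.tocsr()                # compressed sparse row: what most ML libraries expect

print(coo.nnz, "non-zeros out of", dense.size)   # 3 non-zeros out of 12
print((csr @ csr.T).toarray())                   # sparse ops stay sparse until converted
```

Most scikit-learn estimators accept CSR input directly, so it is the usual format to convert to before fitting.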

TRAINING METHODOLOGIES

  1. Train test split

  2. Cross validation

  3. Transfer learning - using a pre-existing classifier from a domain similar to yours, usually trained on millions of samples, and fine-tuning it on new data in order to create a new classifier that utilizes that information in the new domain. Examples: w2v or classic ResNet fine-tuning (a minimal fine-tuning sketch follows this list).

  4. Bootstrapping training - using a similar dataset, such as Yelp with its 5-star ratings, to create a pos/neg sentiment classifier based on 1-star and 5-star reviews, then using that classifier to label, or select samples from, an unlabelled dataset, in order to create a new classifier or simply to choose samples for annotation, etc.

  5. Student-teacher paradigm (Facebook): train a teacher classifier on a big labelled dataset, predict on unlabelled data, choose the best-classified examples based on predicted probability, use those to train a new student model, and finally fine-tune on the labelled dataset. The result is a more robust model that is expected to know both the unlabelled and the labelled data, with higher accuracy with respect to the fully supervised teacher model / baseline.

  6. Yoav’s method for transfer learning across languages - train a classifier on labelled data from English and Spanish, fine-tune using held-out Spanish data, and stop before overfitting. This can be generalized to other domains.
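
A minimal sketch of classic ResNet fine-tuning, as mentioned in item 3, assuming PyTorch/torchvision (>= 0.13) are available; the number of target classes and the dummy batch are hypothetical:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 pre-trained on ImageNet (the "source" domain).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the backbone so only the new head is trained at first.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with one for the new domain.
num_classes = 3  # hypothetical number of classes in the target domain
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One dummy training step to illustrate the fine-tuning loop.
images = torch.randn(8, 3, 224, 224)             # stand-in for a real batch
labels = torch.randint(0, num_classes, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```

After the new head converges, a common follow-up is to unfreeze some or all backbone layers and continue training with a lower learning rate.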

TRANSFER LEARNING

TRAIN / TEST / CROSS VALIDATION

Scikit-lego on group-based splitting and transformation

Images from here.

Train Test methodology -

“The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis.”

  • Random split tests (66/33) - problem: variance each time we rerun.

  • Multiple random split tests - problem: samples may not be included in train/test, or may be selected multiple times.

  • Cross validation - pretty good; a different random seed results in a different mean accuracy, i.e., variance due to randomness.

  • Multiple cross validation - accounts for the randomness of the CV

  • Statistical significance (t-test) on multiple CV runs - are the two samples drawn from the same population (no difference)? If “yes”, the difference is not significant, even if the means and standard deviations differ.

Finally, when in doubt, use k-fold cross validation (k=10), with multiple runs and statistical significance tests (see the sketch below).

Out-of-fold - leave some data unseen and do cross-fold on the rest; good for ensembles.
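
A minimal sketch of the recommendation above: repeated k-fold cross validation for two models on the same splits, followed by a paired t-test (toy data, arbitrary model choices):

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)   # toy data

# 10-fold CV repeated 5 times; the same splits are reused for both models,
# so the per-fold scores are paired.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

# Paired t-test: is the per-fold difference consistent with "no difference"?
t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"A={scores_a.mean():.3f}  B={scores_b.mean():.3f}  p={p_value:.4f}")
```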

VARIOUS DATASETS

  1. Eu-es, 2

  2. 50K - ModelDepot alone has over 50,000 freely accessible pre-trained models with search functionality to

IMBALANCED DATASETS

  1. (the BEST resource and a great api for python) with visual samples - it actually works well on clustering.

  2. Mastery on cost sensitive sampling

Systematic investigation of imbalance effects in CNNs, with several observations. This is crucial when training networks, because in real life you don’t always get a balanced dataset.

They recommend the following:

  1. (i) the effect of class imbalance on classification performance is detrimental;

  2. (ii) the method of addressing class imbalance that emerged as dominant in almost all analyzed scenarios was oversampling;

  3. (iii) oversampling should be applied to the level that totally eliminates the imbalance, whereas undersampling can perform better when the imbalance is only removed to some extent;

  4. (iv) as opposed to some classical machine learning models, oversampling does not necessarily cause overfitting of CNNs;

  5. (v) thresholding should be applied to compensate for prior class probabilities when overall number of properly classified cases is of interest.
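
One common way to implement the thresholding in (v) is to divide the predicted class probabilities by the training-set priors and renormalize; a small sketch with made-up numbers:

```python
import numpy as np

# Hypothetical softmax outputs from a CNN trained on imbalanced data (95% / 5% priors).
probs = np.array([[0.70, 0.30],
                  [0.55, 0.45]])
priors = np.array([0.95, 0.05])   # class frequencies in the training set

# Divide out the training priors and renormalize, so predictions are no longer
# dominated by the majority-class prior; then take the argmax as usual.
adjusted = probs / priors
adjusted /= adjusted.sum(axis=1, keepdims=True)

print(probs.argmax(axis=1))      # [0 0] - raw predictions
print(adjusted.argmax(axis=1))   # [1 1] - after compensating for the priors
```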

General Rules:

  1. Many samples - undersampling

  2. Few samples - over sampling

  3. Consider random and non-random schemes

  4. Consider different sampling ratios, instead of strictly 1:1 (proof? papers?)

Balancing data sets (wiki, scikit-learn & examples in SKLEARN; a short imbalanced-learn sketch follows this list):

  1. Oversampling the minority class

    1. (Random) duplication of samples

    2. SMOTE (in weka + needs to be installed & paper) - find the k nearest neighbours of a minority-class sample and interpolate:

New_sample = current_sample + (random number in [0,1]) * (k_i − current_sample)

  • (in weka) The nearestNeighbors parameter says how many nearest neighbor instances (surrounding the currently considered instance) are used to build an in between synthetic instance. The default value is 5. Thus the attributes of 5 nearest neighbors of a real existing instance are used to compute a new synthetic one.

  • (in weka) The percentage parameter says how many synthetic instances are created based on the number of instances in the class with fewer instances (by default - you can also use the majority class by setting the -C option). The default value is 100. This means if you have 25 instances in your minority class, another 25 instances are created synthetically from these (using their nearest neighbours' values). With 200%, 50 synthetic instances are created, and so on.

    3. ADASYN - adaptively shifts the focus toward minority samples near the classification boundary (the harder ones to learn); synthetic data is generated for the minority class.

  2. Undersampling the majority class

    1. Remove samples

    2. Cluster centroids - replaces a cluster of samples (k-means) with a centroid.

    3. Tomek links - removes majority-class samples that form overlapping pairs (Tomek links) with minority-class samples, cleaning the class boundary.

    4. Penalizing the majority class during training

  3. Combined over- and under-sampling (hybrid) - e.g., SMOTE + Tomek links or SMOTE + ENN

  4. Ensemble sampling

    1. EasyEnsemble

    2. BalanceCascade

  5. Don't balance; try algorithms that perform well with unbalanced datasets

    1. Decision trees - C4.5 / C5.0 / CART / Random Forest

    2. SVM

  6. Penalize Models -

    1. added costs for misclassification on the minority class during training such as penalized-SVM

    2. a CostSensitiveClassifier meta-classifier in Weka that wraps classifiers and applies a custom penalty matrix for misclassification.

    3. Setting up the penalty matrix can be complex.
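
A short sketch of a combined over/under-sampling scheme using the imbalanced-learn API; the dataset and the 1:2 / 1:1 ratios are illustrative, not a recommendation:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Toy imbalanced dataset: roughly 95% / 5%.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# Hybrid scheme: SMOTE the minority class up to a 1:2 ratio,
# then randomly undersample the majority class down to 1:1.
resampler = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=0)),
    ("under", RandomUnderSampler(sampling_strategy=1.0, random_state=0)),
])
X_res, y_res = resampler.fit_resample(X, y)
print("after:", Counter(y_res))
```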

SAMPLE SELECTION

Gibbs sampling: Gibbs sampling is an MCMC method to draw samples from a potentially very complicated, high-dimensional distribution that is hard to sample from analytically. The usual suspects are the nasty integrals involved in computing the normalizing constant of the distribution, especially in Bayesian inference. A Gibbs sampler can draw samples from any distribution, provided you can provide all of the conditional distributions of the joint distribution analytically.
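
A minimal sketch of a Gibbs sampler for a toy case where the conditionals are known in closed form (a standard bivariate normal with correlation rho):

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_samples=5000, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.
    Each step draws from the exact conditional distributions:
      x | y ~ N(rho * y, 1 - rho^2)
      y | x ~ N(rho * x, 1 - rho^2)
    """
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0                      # arbitrary starting point
    samples = []
    for i in range(n_samples + burn_in):
        x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
        y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
        if i >= burn_in:                 # discard burn-in draws
            samples.append((x, y))
    return np.array(samples)

draws = gibbs_bivariate_normal()
print(draws.mean(axis=0))                # ~ [0, 0]
print(np.corrcoef(draws.T)[0, 1])        # ~ 0.8
```

In practice one would also check mixing and autocorrelation of the chain before trusting the draws.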

LEARNING CURVES

  1. Unread - learning curve sampling applied to model-based clustering - seems like active learning, i.e., sample using EM/clustering to achieve nearly the same accuracy as using all the data

  2. Predicting sample size required for training

This is a really wonderful study with far-reaching implications that could even impact company strategies in some cases. It starts with a simple question: “how can we improve the state of the art in deep learning?” We have three main lines of attack:

  1. We can search for improved model architectures.

  2. We can scale computation.

  3. We can create larger training data sets.
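
A small sketch, using scikit-learn's learning_curve, of how validation accuracy scales with training-set size, in the spirit of predicting the sample size required for training (toy data, arbitrary model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=3000, random_state=0)   # toy data

# Cross-validated accuracy as a function of training-set size.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:5d}  train={tr:.3f}  validation={va:.3f}")
```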

DISTILLING DATA

  1. Medium article on Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics. What I found interesting about this paper is that it challenges the common “the more the merrier” approach to training data and shifts the focus from the quantity of the data to its quality.

DATASET SELECTION
