CROSS-VALIDATION SKLEARN PYTHON (Techniques explained in French)

In this French-language Python tutorial, I describe the cross-validation techniques that are most useful in machine learning and show you how to implement them with Sklearn (Python).

The main cross-validation techniques are:
1) KFold
2) Leave One Out
3) ShuffleSplit
4) StratifiedKFold
5) GroupKFold

To use them in Python with Sklearn, import them from the sklearn.model_selection module. For example:

    from sklearn.model_selection import KFold, cross_val_score
    cv = KFold(n_splits=5)
    cross_val_score(model, X, y, cv=cv)

1) KFold Cross-Validation
The dataset is divided into K equal parts (the K folds), ideally after shuffling it. Note that in Sklearn, KFold only shuffles if you pass shuffle=True. For example, if the dataset contains 100 samples and K=5, we get 5 sets of 20 samples. The machine trains on 4 sets, evaluates itself on the remaining set, and then rotates so that each set serves as the validation set exactly once. In total, it performs K training sessions (5 in this example).
This technique is widely used, but it has a slight disadvantage: if the dataset is heterogeneous and its classes are unbalanced, some cross-validation splits may not contain the minority classes. For example, if a dataset of 100 samples contains only 10 samples from class 0 and 90 samples from class 1, it is possible that some of the 5 folds contain no samples from class 0 at all.

2) Leave One Out Cross-Validation
This technique is a special case of K-Fold: the case where K = "number of samples in the dataset". For example, if a dataset contains 100 samples, then K = 100. The machine trains on 99 samples and evaluates itself on the one remaining sample. It therefore performs 100 training sessions (one for each sample that can be left out), which can take a considerable amount of time.
This technique is NOT RECOMMENDED.

3) ShuffleSplit Cross-Validation
This technique consists of shuffling the dataset and then splitting it into two parts: a training part and a test part.
Once the training and evaluation are complete, we put the data back together, reshuffle it, and split the dataset again in the same proportions as before. We repeat this for as many cross-validation iterations as desired. As a result, the same sample can end up in the validation set several times across the iterations.
This technique is a GOOD ALTERNATIVE to K-FOLD, but it has the same disadvantage: if the classes are unbalanced, the validation set may be missing samples from the minority classes!

4) STRATIFIED K-FOLD
This technique is the default choice (but it consumes slightly more resources than K-FOLD). The machine sorts the data into "strata" (i.e. into the different classes) and then forms K groups (K folds), each containing roughly the same proportion of every stratum (every class) as the full dataset. As with KFold, Sklearn only shuffles the data beforehand if you pass shuffle=True.

5) GROUP K-FOLD
This cross-validation technique is VERY IMPORTANT TO KNOW! In data science, we often assume that our data are independent and drawn from the same distribution. For example, the apartments in a real estate dataset are assumed to be independent of each other and identically distributed.
But this isn't always the case! For example, data in a medical dataset can be interdependent: if several people in the same family are diagnosed with cancer, the genetic factor creates a dependency between those samples.
It's therefore necessary to divide the dataset into groups of mutually dependent samples and to keep each group entirely on one side of the split, so that the same group never appears in both the training set and the validation set. This is why GROUP K-FOLD exists.

    GroupKFold(5).split(X, y, groups)

► My website, as a complement to this video: https://machinelearnia.com/
► Get my free book, LEARN MACHINE LEARNING IN ONE WEEK: https://machinelearnia.com/apprendre-...
► Download my code for free on GitHub: https://github.com/MachineLearnia
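
The import example near the top of this description uses model, X and y without defining them. Here is a minimal runnable sketch of the same call. The synthetic dataset (make_classification) and the LogisticRegression model are assumptions chosen only for illustration; they are not the data or model used in the video.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: the tutorial's own dataset is not shown here).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)  # any Sklearn estimator would do

# shuffle=True because KFold does not shuffle by default.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)  # one score per fold

print(scores)
print("mean score:", scores.mean())
```

cross_val_score returns one score per fold (accuracy by default for a classifier), so averaging the array gives a single cross-validated estimate.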
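
The class-imbalance problem described under 1) KFold, and the way 4) STRATIFIED K-FOLD avoids it, can be checked with a small sketch. It reuses the numbers from the text (10 samples of class 0, 90 of class 1); the feature array X is a placeholder and the helper minority_counts is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Labels from the example in the text: 10 samples of class 0, 90 of class 1.
y = np.array([0] * 10 + [1] * 90)
X = np.arange(100).reshape(-1, 1)  # placeholder features, values are irrelevant


def minority_counts(cv):
    """Return the number of class-0 samples in each validation fold."""
    return [int(np.sum(y[val_idx] == 0)) for _, val_idx in cv.split(X, y)]


kfold_counts = minority_counts(KFold(n_splits=5))            # no shuffling by default
strat_counts = minority_counts(StratifiedKFold(n_splits=5))  # keeps class ratios

print("KFold:          ", kfold_counts)
print("StratifiedKFold:", strat_counts)
```

Because the labels are sorted and KFold does not shuffle by default, all ten class-0 samples land in the first validation fold and the other four folds contain none. StratifiedKFold places two class-0 samples in every fold.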
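
To make the cost of 2) Leave One Out and the repeated splitting of 3) ShuffleSplit concrete, here is a short sketch on 100 placeholder samples. The parameter values (4 iterations, 20% validation size, random_state=0) are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, ShuffleSplit

X = np.arange(100).reshape(-1, 1)  # 100 placeholder samples

# Leave One Out: one split (so one training session) per sample.
n_loo = LeaveOneOut().get_n_splits(X)
print("Leave One Out splits:", n_loo)

# ShuffleSplit: reshuffle and re-split in the same proportions at each iteration.
ss = ShuffleSplit(n_splits=4, test_size=0.2, random_state=0)
sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in ss.split(X)]
print("ShuffleSplit (train, validation) sizes:", sizes)
```

With 100 samples, Leave One Out requires 100 training sessions, whereas ShuffleSplit lets you choose the number of iterations independently of the dataset size; each iteration here uses 80 training samples and 20 validation samples.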
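
The GroupKFold call in section 5) can be tried on a toy dataset. This is a sketch under assumptions: the 12 "patients" and their 6 "families" are invented purely to illustrate the groups argument.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 12 patients from 6 families (2 patients per family).
X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
groups = np.repeat(np.arange(6), 2)  # family id of each patient

cv = GroupKFold(n_splits=3)
overlaps = []
for train_idx, val_idx in cv.split(X, y, groups):
    # Families present on both sides of the split (should be none).
    shared = set(groups[train_idx].tolist()) & set(groups[val_idx].tolist())
    overlaps.append(shared)
    print("validation families:", sorted(set(groups[val_idx].tolist())))

print("families shared between train and validation:", overlaps)
```

Each family appears either in the training set or in the validation set of a given split, never in both, which is what prevents the dependency between related samples from leaking into the evaluation.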