“Garbage in, garbage out”: how a single parameter can decrease the quality of our analysis.

Hi, all.

In many studies, we as data scientists will not be able to work with all the data from the domain under study – also called the population – in our analyses. Among the many reasons for this, ethical, economic, and geographical factors stand out. To minimize this gap, we carry out a well-defined selection process, the result of an activity known as sampling.

Sampling can be defined as a technique, or a set of procedures and activities, used to describe and select samples, either randomly or not. When the process is done correctly, it is the factor responsible for determining the representativeness of the sample.

When we speak of representativeness, we are saying that the sample used in our experiment should have the same characteristics as the observed population. Here, there are two aspects to consider: quantity and quality. According to the law of large numbers, the larger the sample, the better. However, if the population has heterogeneous characteristics, your sample should be heterogeneous as well.

Of course! If you have a population of 1,000 people, of which 500 are female and 500 are male, and your sample is composed of 500 males, the female half of the population (quite representative) will not be observed. Therefore, your conclusions cannot be generalized to the population – a process called inference. This example illustrates “Garbage in, garbage out”: nonsense input data produces nonsense output.

Understand: although we use samples in our experiments, we want to provide solutions for the entire population, not just for the sampled group. I like the example of a blood test. When we perform a blood glucose test, only a small sample of our blood is drawn, not all of it. However, the result of the exam applies to us (all of our blood) and not just to that specific sample.

There are basically two types of sampling: probabilistic and non-probabilistic. In probabilistic sampling, each element of the population has a known, non-zero chance of being selected to compose the sample. In non-probabilistic sampling, the selection of the population elements to compose the sample often depends on the researcher’s judgment. We will discuss the probabilistic type here because it is the predominant approach in most data science courses and examples found on the web.

Among the most common designs applied to the sampling process, three stand out: simple, systematic, and stratified.

In simple random sampling, a sample of size n is randomly selected from the N elements of the population, such that each individual i has the same probability of being selected as any other individual in the population.
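Just to make the idea concrete (this snippet is not from the original figures), a minimal sketch of simple random sampling with NumPy could look like the following; the population of 1,000 identifiers and the sample size of 100 are made up for the illustration.

```python
import numpy as np

# Hypothetical population of N = 1000 element identifiers
population = np.arange(1000)

# Simple random sample of n = 100 elements, drawn without replacement:
# every individual has the same probability of being selected
rng = np.random.default_rng(42)
sample = rng.choice(population, size=100, replace=False)
```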

Systematic sampling is a probabilistic process in which only the first sample unit is chosen at random; the remaining units are selected according to a rigid, pre-established systematization scheme (for example, every k-th element), with the purpose of covering the entire population and obtaining a simple, uniform systematic design.
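A minimal sketch of systematic sampling, again with made-up numbers: only the starting point is random, and the remaining units follow the fixed scheme of taking every k-th element.

```python
import numpy as np

population = np.arange(1000)   # hypothetical population of N = 1000 elements
n = 100                        # desired sample size
k = len(population) // n       # sampling interval (here k = 10)

# Random start within the first interval, then every k-th element after it
rng = np.random.default_rng(42)
start = rng.integers(0, k)
systematic_sample = population[start::k]
```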

Finally, in stratified sampling, existing information about the population is used to make the sampling process more efficient. Going back to the earlier example, if the population we wish to study consists of 800 women and 200 men, we want our sample to preserve that proportion between the classes.
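As a sketch of the idea, assuming we want a total sample of 100 elements from that hypothetical 800/200 population, the proportional allocation per stratum could be computed like this:

```python
# Hypothetical strata sizes from the example above
strata = {"women": 800, "men": 200}
population_size = sum(strata.values())   # 1000
sample_size = 100                        # total sample we want to draw

# Proportional allocation: each stratum contributes according to its share
allocation = {group: round(sample_size * count / population_size)
              for group, count in strata.items()}
print(allocation)   # {'women': 80, 'men': 20}
```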

Even with the use of several techniques to obtain a sample that is representative of the desired population (beyond those presented here), no sample represents the population perfectly. Using a sample implies accepting a margin of error, called the sampling error: the difference between a sample result and the true population result.

Most of the examples we find on the Internet teach us how to make a simple sample selection, even when using very interesting libraries for this process, such as the train_test_split function from sklearn.model_selection. For our experiment, we will use the Iris dataset.

The Iris dataset is one of the most well-known in the world of machine learning. It is composed of three classes and four descriptor attributes, with 150 elements, 50 for each class. Note that the Iris dataset is naturally balanced. However, if we make a simple sample selection, the samples used for training the model and for testing will not be balanced. Let’s put it into practice.

Figure 1 shows the libraries I used in the example. In addition to sklearn, I also imported numpy and the itemfreq function from scipy.stats. The itemfreq function will be used to observe the distribution of sample elements.

Figure 1: Loading libraries.
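The figure itself is not reproduced here, but the imports described above would look roughly like this. Note that itemfreq was deprecated and later removed from SciPy, so numpy.unique with return_counts=True is a common substitute.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from scipy.stats import itemfreq   # deprecated/removed in newer SciPy releases
# Equivalent on newer setups: np.unique(labels, return_counts=True)
```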

In Figure 2, I created a variable named data and assigned the Iris dataset to it. In addition, I separated the descriptors (the data that will be used for machine learning) and the labels (the classes).

Figure 2: Loading dataset.
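A sketch of what this step likely looks like, continuing the imports from the Figure 1 sketch; the variable names descriptors and labels are my assumption, guided by the text.

```python
# Load the Iris dataset and separate descriptors (features) from labels (classes)
data = load_iris()
descriptors = data.data    # the four descriptor attributes
labels = data.target       # the three classes, encoded as 0, 1 and 2
```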

The data selection process for training the model and then testing it is shown in Figure 3. If you have already taken a web course on data science or machine learning, this method will not be strange to you. The train_test_split function receives the descriptors and the labels and separates them into training and test sets according to the parameter test_size (defined here as 30% for testing) and the parameter random_state, which is used as a seed for the random separation process.

Figure 3: Splitting dataset.
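Continuing the sketch, the split described above would be something like the following; the random_state value of 42 is my assumption, and any fixed seed plays the same role.

```python
# 70% of the data for training, 30% for testing;
# random_state is the seed of the random separation process
X_train, X_test, y_train, y_test = train_test_split(
    descriptors, labels, test_size=0.3, random_state=42)
```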

The next step is to see how train_test_split split the samples between training and testing. Figure 4 shows the command executed for this operation.

Figure 4: Showing frequency.
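The command is presumably along these lines, using the itemfreq function imported earlier (or numpy.unique on newer setups).

```python
# Observe how many elements of each class ended up in the training set
print(itemfreq(y_train))
# Equivalent on newer SciPy/NumPy:
# print(np.unique(y_train, return_counts=True))
```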

The result is shown in Figure 5. The first column contains the classes, represented by the numbers 0, 1 and 2; the second column contains the frequency of items in each one.

Figure 5: Showing result.

This is exactly what I would like to show you. Do you remember that the Iris dataset has a balanced distribution (50 elements for each class)? With the parameters set in the train_test_split call above, the training and test samples are not fully balanced – only approximately so.

To separate the training and test samples in a balanced way, we only need to add the stratify parameter to the train_test_split call, passing it the variable that holds the label values (Figure 6).

Figure 6: Adding new parameter.
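The call with the extra parameter would look roughly like this, keeping the same test_size and seed as before.

```python
# Same split, now stratified on the labels so class proportions are preserved
X_train, X_test, y_train, y_test = train_test_split(
    descriptors, labels, test_size=0.3, random_state=42, stratify=labels)
```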

Now execute the script again and notice what happened to your sample (Figure 7). The data were separated in a balanced way (35 elements for each class in the training set). You can also inspect the data that will be used in the test step. You will see that it is balanced too – of course, right?

Figure 7: Showing new result.
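To check both splits, the same frequency command can be run on y_train and y_test; the counts in the comments follow from 150 elements, a 30% test split, and three balanced classes.

```python
# Class distribution after the stratified split
print(itemfreq(y_train))   # expected: 35 elements per class (105 training samples)
print(itemfreq(y_test))    # expected: 15 elements per class (45 test samples)
```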

But what happens if the dataset you use in your experiment is not balanced like the Iris dataset? Do not worry. Stratification will split the samples while preserving the class proportions, so that they represent the analyzed population well.

Important: it is essential for your experiment that the selected sample has the same characteristics as the population. This way, you will reduce the sampling error and will be able to perform inference about the population.

Best regards.

Reading suggestions

Choosing a sampling method: http://changingminds.org/explanations/research/sampling/choosing_sampling.htm

Law of large numbers: https://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter8.pdf

Population Sampling Techniques: https://explorable.com/population-sampling

6 Sampling Techniques: How to Choose a Representative Subset of the Population: https://blog.socialcops.com/academy/resources/6-sampling-techniques-choose-representative-subset/