In this tutorial, we’ll talk about missing and sparse data: what they are, how to represent and deal with them, and in what ways they differ.
2. Missing Data
When we talk of missing data, we’re referring to the data not appearing in a dataset even though we expect them to be present. For example, let’s suppose we have demographic data on 10 individuals from a company’s database, but there’s no information on the ages of some employees. We’d say that the age data for those employees are missing:
There are three main types of missing data.
Missing-at-random (MAR) data are those whose missingness we can predict using other observed variables. For example, if we’re missing the arrival times of some employees on a certain day but see that the variable indicating they’re on a business trip is set to True, we’ll say that the times are missing at random: the fact that they’re missing is explained by an observed variable (the business trip), not by the arrival times themselves.
The term Missing Completely at Random (MCAR) refers to data whose missingness is independent of both the values we observe and the values we don’t. For example, let’s say we have an automated system that records the temperature three times each day. On a certain day, the thermometer malfunctioned, so we got no readings that day. The cause of the missing data isn’t related to the values that weren’t recorded or to any other variable. Therefore, they represent MCAR data.
Lastly, Missing Not at Random (MNAR) refers to the data that are missing for reasons related to the variable under observation. For instance, some participants may not want to disclose how many times they smoke a day in an addiction study because that number is high.
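To make the three mechanisms concrete, here is a minimal Python sketch that hides values under each of them. The function names, the sample values, the probability of 0.3, and the threshold of 20 are illustrative assumptions, and `None` stands in for a missing value.

```python
import random

def hide_mcar(values, p, rng):
    """MCAR: each value is hidden with the same probability p,
    regardless of any observed or unobserved value."""
    return [None if rng.random() < p else v for v in values]

def hide_mar(values, on_trip):
    """MAR: a value is hidden whenever another, fully observed
    variable (here, the business-trip flag) says so."""
    return [None if trip else v for v, trip in zip(values, on_trip)]

def hide_mnar(values, threshold):
    """MNAR: a value is hidden because of the value itself
    (here, counts above a threshold go unreported)."""
    return [None if v > threshold else v for v in values]

rng = random.Random(0)  # fixed seed so the run is repeatable

temperatures = [18.5, 19.0, 21.2, 20.4, 17.8, 18.1]  # made-up readings
arrivals = ["08:55", "09:10", "08:40", "09:05"]       # made-up arrival times
on_trip = [False, True, False, True]                  # observed flag
smoked_per_day = [3, 25, 10, 40, 0]                   # made-up survey answers

print(hide_mcar(temperatures, 0.3, rng))
print(hide_mar(arrivals, on_trip))
print(hide_mnar(smoked_per_day, 20))
```

In practice, we only see the output lists, not the mechanism. That is why telling MAR from MNAR usually requires knowledge about how the data were collected.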
2.2. How to Handle Missing Data
Missing data can bias our datasets and any analysis based on them. For instance, if we were to analyze the employee dataset from above, the analysis would consider only the records with age present. The danger lies in drawing erroneous conclusions because we didn’t consider all the employees. So, it’s important to know how to handle the missing data.
A common technique is to remove the records or features with missing data. For example, we can remove the employees whose age we don’t know:
If the age is missing for many employees, we can even remove the entire feature:
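Both deletion strategies can be sketched in a few lines of Python. The records below are hypothetical (the names, IDs, and ages are invented for illustration), and `None` marks a missing age.

```python
# Hypothetical employee records; None marks a missing age.
employees = [
    {"id": 1, "name": "Ana", "age": 29},
    {"id": 2, "name": "Ben", "age": None},
    {"id": 3, "name": "Cho", "age": 41},
    {"id": 4, "name": "Dev", "age": None},
]

def drop_incomplete_records(records, field):
    """Record deletion: keep only the records where `field` is known."""
    return [r for r in records if r[field] is not None]

def drop_feature(records, field):
    """Feature deletion: remove `field` from every record."""
    return [{k: v for k, v in r.items() if k != field} for r in records]

complete = drop_incomplete_records(employees, "age")
without_age = drop_feature(employees, "age")

print([r["id"] for r in complete])  # ids of the employees with a known age
print(without_age[0])               # first record without the age field
```

Record deletion keeps the feature but loses half of the employees here, whereas feature deletion keeps every employee but loses the age entirely.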
Alternatively, we can impute the missing data. Imputation means replacing missing values with substitute values. There are a variety of ways to do that. A frequently used method is imputation by mean. It replaces the missing values with the mean of the variable in question. For instance, since the mean age of the employees in the above example is 38.375, we can replace the NA values with 38:
Similarly, imputation by median or mode refers to substituting the absent values with the median or mode of the variable from which the data are missing.
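As a rough sketch, all three variants fit in a few lines with Python’s built-in `statistics` module. The ages below are invented and chosen only so that their mean matches the 38.375 mentioned above. They aren’t the values from the original table.

```python
import statistics

# Hypothetical ages for 10 employees; None marks a missing value.
ages = [24, None, 29, 29, 35, 41, None, 44, 50, 55]
observed = [a for a in ages if a is not None]

def impute(values, fill):
    """Replace every missing value with the same substitute."""
    return [fill if v is None else v for v in values]

mean_fill = round(statistics.mean(observed))  # mean is 38.375, rounded to 38
median_fill = statistics.median(observed)     # middle of the sorted ages
mode_fill = statistics.mode(observed)         # most frequent age

print(impute(ages, mean_fill))
print(impute(ages, median_fill))
print(impute(ages, mode_fill))
```

The median is less sensitive to outliers than the mean, and the mode is the usual choice for categorical variables.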
2.3. Multiple Imputation and Interpolation
In multiple imputation, we estimate the missing values from the distribution of the observed data, and we do so several times rather than once. The first step is to create several copies of the dataset. Then, in each copy, we impute the missing values using random values drawn from a distribution of the observed values. This could be a uniform distribution or another type of distribution. Finally, we run the same analysis on every copy and pool the results into a single estimate. The variation between the copies reflects our uncertainty about the missing values.
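The steps above can be sketched as follows. This is a deliberately simplified version: it draws replacements directly from the observed values instead of fitting a model, the analysis is just the mean age, and pooling is a plain average of the per-copy results. The function name, the sample ages, and the default of five copies are assumptions made for the example.

```python
import random
import statistics

# Hypothetical ages; None marks a missing value.
ages = [24, None, 29, 29, 35, 41, None, 44, 50, 55]

def multiple_imputation_mean(values, m=5, seed=0):
    """Impute m copies with random draws, analyze each, pool the results."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = [v for v in values if v is not None]
    estimates = []
    for _ in range(m):
        # Copy the data and fill each gap with a random observed value.
        completed = [rng.choice(observed) if v is None else v for v in values]
        # Analyze this completed copy (here: compute the mean age).
        estimates.append(statistics.mean(completed))
    # Pool the per-copy results into one estimate.
    pooled = statistics.mean(estimates)
    # The spread between copies hints at the uncertainty due to missingness.
    spread = statistics.stdev(estimates) if m > 1 else 0.0
    return pooled, spread, estimates

pooled, spread, estimates = multiple_imputation_mean(ages)
print(pooled, spread)
```

Unlike single imputation by the mean, each copy gets different substitutes, so the result comes with a measure of how much it depends on the imputed values.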
We can also use interpolation. It’s a form of imputation that attempts to deduce the missing values. Interpolation functions compute the relationship between the known values and output a new value within the range of the known values. For instance, in the employee database, we can interpolate the age of employee 2022001 using the observed values for employees 2022002 and 2022003.
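As an illustration, here is a minimal sketch of linear interpolation, one common interpolation function. It assumes the values form an ordered, evenly spaced sequence and that each gap has a known value on both sides. The function name and the sample series are hypothetical.

```python
def interpolate_linear(values):
    """Fill None gaps that lie between two known values.

    Gaps at the start or end are left as None, because filling them
    would be extrapolation rather than interpolation.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over each pair of neighboring known positions.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

series = [None, 1.0, None, None, 4.0, None]  # made-up ordered values
print(interpolate_linear(series))
```

Every filled value lies between its two known neighbors, which is what distinguishes interpolation from other imputation methods.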
3. Sparse Data
Sparse data are those in which most values are zero or empty. The term typically refers to a dataset in which values that carry useful information are rare. For instance, suppose we have a dataset of readings from six rain gauges over a period of time. There was no rainfall in some months, so the recorded values are zero. This creates a sparse dataset:
It is important to note that the values in sparse data are usually known but occur infrequently. So, a zero in a set with sparse data isn’t a missing value.
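Because the zeros are known values, one common way to represent sparse data is to store only the non-zero cells and read everything else as zero. The sketch below uses a plain dictionary keyed by position. The rainfall numbers and function names are made up for the example.

```python
# Rows are months, columns are rain gauges; made-up rainfall in mm.
rainfall = [
    [0.0, 0.0, 12.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [3.2, 0.0, 0.0, 0.0],
]

def to_sparse(matrix):
    """Keep only the non-zero cells as {(row, col): value}."""
    return {
        (i, j): v
        for i, row in enumerate(matrix)
        for j, v in enumerate(row)
        if v != 0
    }

def sparsity(matrix):
    """Fraction of cells that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for v in row if v == 0)
    return zeros / total

sparse = to_sparse(rainfall)
print(sparse)                   # only the two non-zero cells are stored
print(sparse.get((1, 1), 0.0))  # an unstored cell reads as a known zero
print(sparsity(rainfall))       # share of zero cells
```

A missing reading would need a different marker, such as `None`, precisely because a zero here means "no rain was measured", not "we don’t know".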
There are two main types of data sparsity: controlled and random. Controlled sparsity refers to the cases where a range of values of one or more variables has no data. For example, in an HR dataset containing new employees, there might be no values for February and March 2022 because the company hired no one during those two months.
On the other hand, random sparsity occurs when the empty or zero values are scattered randomly throughout the dataset.
3.2. How to Deal With Sparse Data
A common way to deal with sparse data is to remove the features without actual values while retaining the ones containing data. However, we must do this with caution so that we don’t remove important features. For instance, we can remove Variable 6 from the above sparse dataset because it contains only zeroes:
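A minimal sketch of this clean-up step might look like the following. The column names and values are hypothetical, and the function drops only the columns in which every value is zero.

```python
def drop_all_zero_columns(rows, names):
    """Return the rows and column names without the all-zero columns."""
    keep = [j for j in range(len(names)) if any(row[j] != 0 for row in rows)]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_rows, kept_names

names = ["Variable 1", "Variable 2", "Variable 3"]  # hypothetical columns
rows = [
    [0.0, 4.1, 0.0],
    [2.3, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

kept_rows, kept_names = drop_all_zero_columns(rows, names)
print(kept_names)  # the all-zero third column is gone
print(kept_rows)
```

A column that is zero in our sample may still matter later (for example, a gauge in a dry region), so it’s worth checking with domain knowledge before deleting it.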
Alternatively, we can apply dimensionality-reduction techniques such as Principal Component Analysis (PCA). PCA projects the data onto a small number of new features (components) that capture most of the variance in the original ones. That way, we usually reduce the number of features to a manageable one.
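To give a feel for what PCA does, here is a small pure-Python sketch that finds only the first principal component using power iteration on the covariance matrix. A full PCA would extract several components, typically with a linear-algebra library. The function name, the start vector, and the sample data are assumptions made for the example.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    # Center every column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiply by cov and normalize.
    # Caveat: it stalls if the start vector is orthogonal to the answer.
    vec = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    # Project each centered row onto the component: one number per row.
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, scores

# Made-up data: the second column is twice the first, the third is all zeros.
data = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
direction, scores = first_principal_component(data)
print(direction)  # weight of each original column in the component
print(scores)     # the data reduced from three features to one
```

In this toy example, the all-zero column receives a weight of zero, and the two correlated columns collapse into a single component.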
4. Differences Between Missing and Sparse Data
The main differences between missing data and sparse data lie in how the data are represented and in the techniques used to address them. These are summed up in the table below.
In this tutorial, we reviewed missing and sparse data. Missing data are unknown and absent from a dataset, whereas sparse data are usually known but are rarely present.