Missing data in columns

The purpose of this post is to show an easy way to find data frame columns which have all missing data.

We live in a world of big data. Often our data frames have several hundred if not thousands of variables. Not all variables are useful in that they may not contain any actual values. This is often the case with data exported from various business software. I like to get a list of these columns as part of the data processing step in a research project.

# Setting up an example data frome.
customer_id <- c("aid123","aid123","aid123", "aid123",
                 "aid432","aid432",
                 "aid887","aid887","aid887","aid887","aid887",
                 "aid98","aid98", "aid98")
dat_miss1 <- rep(NA, 14) # Create a vector with 14 missing values 
visit_num <- c("1", "1", "1", "2",
               "1", "1",
               "1", "2", "2", "3", "4",
               "1", "1", "1")
item <- c("socks", "socks", "shoes", "shoes",
          "tires", "brakes", 
          "shoes","shoes", "shoes", "shoes", "shoes",
          "trimmer", "trimmer", "trimmer")
dat_miss2 <- rep(NA, 14)
df <- data.frame(customer_id, dat_miss1, visit_num, item, dat_miss2)

library(dplyr) #Load dplyr here for the tbl_df call.
print(tbl_df(df), n = 14)
## # A tibble: 14 x 5
##    customer_id dat_miss1 visit_num    item dat_miss2
##         <fctr>     <lgl>    <fctr>  <fctr>     <lgl>
##  1      aid123        NA         1   socks        NA
##  2      aid123        NA         1   socks        NA
##  3      aid123        NA         1   shoes        NA
##  4      aid123        NA         2   shoes        NA
##  5      aid432        NA         1   tires        NA
##  6      aid432        NA         1  brakes        NA
##  7      aid887        NA         1   shoes        NA
##  8      aid887        NA         2   shoes        NA
##  9      aid887        NA         2   shoes        NA
## 10      aid887        NA         3   shoes        NA
## 11      aid887        NA         4   shoes        NA
## 12       aid98        NA         1 trimmer        NA
## 13       aid98        NA         1 trimmer        NA
## 14       aid98        NA         1 trimmer        NA

You can clearly see which columns are filled with missing data but this would not be the case if you were dealing with millions of rows across hundreds of variables. The code below will return a list of variables with all missing values.

# This returns a list of column names which are totally blank 
# and their location in the data frame.

miss_cols <- which(sapply(df, function(x)all(is.na(x))) == TRUE)
miss_cols
## dat_miss1 dat_miss2 
##         2         5

Changing the value to FALSE gives you the columns which are not blank and have information useful for your analyses.

# This returns a list of column names which are not blank 
# and their location in the data frame.

nonmiss_cols <- which(sapply(df, function(x)all(is.na(x))) == FALSE)
nonmiss_cols
## customer_id   visit_num        item 
##           1           3           4