Preparing Data for Modelling

1 Introduction

There are three parts to this talk:

  • Three parts to the talk:
    • Getting data, deciding what to do
    • Cleaning data
    • Handling issues/getting more data
    • Repeat until performance acceptable

The goal (in R code) is to be able to do (lm(repsonse~predictiors, data=mydata)) and have it all work.

This never, ever works without lots of work, and that's part of preparing data for modelling.

2 Asking Questions

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of - R.A. Fisher

a week in the lab will save you a day in the library, - unknown

  • Figure out what you want to do
  • Figure out how you will measure the result
  • Figure out what data you have (or can get)
  • Do EDA, find weirdness
  • Remove weirdness
  • Attempt to fit a model
  • Fail, look closer at data
  • Repeat until better performance is obtained/you give up in frustration

3 What data

  • Granularity: hourly, daily, weekly monthly
  • Effective granularity: how similar are two observations "next" to each other


  • ggplot2 is now your god and saviour
  • We'll show some examples below
  • eye is optimised for finding patterns
  • make use of this

5 What to look for

  • Inconsistencies!
  • Very domain-dependent
  • Not a lot of actionable advice 1

6 Example Data

  • Property Price Register (available here)
  • Because I'm irish, and therefore obsessed with property ;)
  • Available in the repository for this course

7 Read in data

ppr <- read_excel("~/Dropbox/PPR/PPR-ALL.xlsx", sheet="PPR-ALL")
Date of Sale (dd/mm/yyyy)
Postal Code
Price ()
Not Full Market Price
VAT Exclusive
Description of Property
Property Size Description
  • Note that this appears to be optimised for reading, not coding

8 Data Cleaning

normalise_names <- function(df) {
    nms <- names(df)
    normed <- iconv(#covert from one encoding to another
            gsub("([[:space:]]|[[:punct:]])+", "_", #regular expressions
        "latin1",#from encoding
        "ASCII",#to encoding
    drop_usc <- gsub("([a-z_])_*$", "\\1", x=normed) #drop extra underscores
    names(df) <- drop_usc #update names
    df #return dataframe
ppr2 <- normalise_names(ppr)


  • You always want to check what is actually in your data, as sometimes this may suprise you
  • You need to check:
    • class of data (numeric, character, etc)
    • range of data (summary)
    • distribution of data (plots, mostly histograms or densities)

10 Example: Price

  • This is our outcome variable
  • We assume this is numeric
with(ppr2, mean(price))

[1] NA
Warning message:
In mean.default(price) : argument is not numeric or logical: returning NA

  • Huh?

11 Example Price, Continued

date_of_sale_dd_mm_yyyy price postal_code
01/01/2010 343,000.00 nil
03/01/2010 185,000.00 nil
04/01/2010 438,500.00 nil
04/01/2010 400,000.00 nil
04/01/2010 160,000.00 nil
04/01/2010 425,000.00 nil

12 WTF?

  • This is pretty common, Excel inserts currency symbols willy-nilly, and treats the columns as numeric
  • R, on the other hand, thinks that this is not acceptable, and treats such columns as character, and hence no calling mean
  • We can resolve this with a little code

13 Fixing Price

fix_price <- function(x) {
    nopunct <- gsub(",|\\.", "", x=x)
    nums <- as.numeric(
        iconv(nopunct, "latin1",
ppr2$price <- with(ppr2, fix_price(price))
with(ppr2, mean(price))

14 More WTF

  • That doesn't look like a house price
  • Do we really believe that the average house in the dataset sold for €23,754,443?
  • Clearly not (right?)
  • This is actually a good example of where domain knowledge (or common sense) comes into play

15 Well Actually,


  • You'd look at a plot and go what the hell?
  • Then look at the summary values
with(ppr2, summary(price))

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
5.079e+05 1.010e+07 1.780e+07 2.375e+07 2.800e+07 1.392e+10

16 Dividing by 100

with(ppr2, summary(price/100))
Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
5079    101000    178000    237544    280000 139165000

17 More Plots


  • Still pretty crazy
  • Let's ignore everything greater than 2mn
  • So all the big ones are pretty close to 2mn.
  • Still a pretty crazy distribution of prices

18 Modelling Price

  • Always greater than 0
  • Massively, massively skewed (like most interesting response variables)
  • What else do we need to think about?
  • Does a linear model make sense?
  • Should we just say screw it, and use a neural net?
  • Are there any transformations that spring to mind?

19 You're not modelling in a vacuum

  • You have (presumably) some idea of why you're analysing the data
  • Other people have probably done similar things, maybe look at how they did it?
  • The solution to this problem is Google, judiciously.
  • Honestly, i find that I get a lot more out of a few good books but Google and short articles are super useful when you need more specifics (or working code)
  • For positive variables, we can do log or square root
  • Square root tends to be preferred for counts, and log is preferred for things that grow in a multiplicative fashion (rich get richer)
  • Mostly for reasons of computability, I went with a log transform.

20 Transforming Price


  • Much better, from a modelling point of view
  • Why is this?
  • But the log is kinda annoying
  • Dunno about you, but I can't do powers of 2.7 in my head.
  • On the other hand, powers of 10 are pretty easy


21 Response Variables

  • Given that we know what we're doing with the predictor, we need to examine the response variables.
  • These can range from the very useful to the very useless.

22 More EDA

  • This is all part of the process
  • Try to love looking around
  • First step, what have we got? 2

23 What kind of variables are they?

sapply(ppr4, class)
date_of_sale_dd_mm_yyyy character
address character
postal_code character
county character
price numeric
not_full_market_price character
vat_exclusive character
description_of_property character
property_size_description character
is_big character
log_price numeric

24 Date of Sale


  • Some days have way, way more observations than others
  • Some days are just odd (that point in the upper right is really weird ) 3

25 What to do with date columns?

  • They can be handled in many different ways:
    • as a number of days since some threshold
    • as a day, week and month within a year
    • as the primary unit of analysis (time series)
    • You can also just ignore it, which may suit your purposes also

26 Tradeoffs in Feature Generation

  • So assuming that we want day, week, month and year in the model
  • Given that our dataset covers approximately 8 years, that would 4 require 8*365=2920 parameters
  • We have 331k observations
  • Is estimating all of these parameters worth it?
  • How do you make that decision?
  • Should you, or just generate all the features and let God 5 sort it out?
  • Different situations require different approaches
  • Is it better to have a simple, robust model or a more flexible, fragile model?

27 Process

  • Review each of your variables individually first
  • Look for weird crap
  • Some weird crap can include:
    • numeric missing values (like 99, 999, or 9998, or -1)
    • characters as factors
    • unique identifiers
    • wrong type of data (like price above)

28 Address

  • Something that can be really important in modelling categorical features is the dimensionality, i.e. the number of distinct values
  • High dimensionality can cause significant problems for statistical/ML models (why?)
date_of_sale_dd_mm_yyyy 3023
address 316658
postal_code 32
county 26
price 21834
not_full_market_price 2
vat_exclusive 2
description_of_property 5
property_size_description 7
is_big 2
log_price 21834

29 Postal Code

post_codes_table <-
    with(ppr4, table(postal_code, useNA="always")) %>% %>%
head(post_codes_table, n=12)

postal_code Freq
nil 270264
Dublin 15 7435
Dublin 18 4387
Dublin 24 3863
Dublin 8 3707
Dublin 4 3674
Dublin 7 3411
Dublin 9 3314
Dublin 11 3196
Dublin 13 3093
Dublin 12 2954
Dublin 14 2912
  • Hmmm, pretty strange that all of the postcodes are in Dublin, right?
  • What's the deal with all those missings?
ppr5  <- mutate(ppr4, no_postcode=ifelse(, "Yes", "No"))
no_postcode_county <-
    filter(ppr5, no_postcode=="Yes") %>%
    group_by(county) %>%
    summarise(count=n()) %>%

postcode_county <-
    filter(ppr5, no_postcode=="No") %>%
    group_by(county) %>%
    summarise(count=n())  %>%
Dublin 60690
Cork 8
Wexford 8
Wicklow 8
Meath 6
Kildare 5
Louth 5
Limerick 4
Mayo 4
Carlow 3
Galway 3
Westmeath 3
Roscommon 2
Tipperary 2
Waterford 2
Clare 1
Kerry 1
Kilkenny 1
Leitrim 1
Longford 1
Offaly 1
head(postcode_county, n=5)
county count
Dublin 60690
Cork 8
Wexford 8
Wicklow 8
Meath 6
  • So almost every property with a postcode is in Dublin
  • Why?

30 not full market price

with(ppr4, table(not_full_market_price, useNA="always"))
not_full_market_price Freq
No 316286
Yes 14737
nil 0
  • This is a super confusing variable
  • If No, then it is full market price, else it's not
  • We should reframe this to make it easier to interpret
ppr5 <- mutate(ppr4,
                   "No" )) %>%
    select(-not_full_market_price) #remove old var - what happens if we don't?

31 vat exclusive

with(ppr5, table(vat_exclusive, useNA="always"))
vat_exclusive Freq
No 281909
Yes 49114
nil 0
  • In general, secondhand properties are not liable for VAT, while new properties are
  • However, the price reported on the register is not inclusive of VAT, which means that the overall prices are not comparable for new houses (relative to second-hand houses)
  • This variable doesn't really need much treatment, it can be used as is

32 description of property

with(ppr5, table(description_of_property))
description_of_property Freq
New Dwelling house /Apartment 50343
Second-Hand Dwelling house /Apartment 280651
Teach/?ras?n C?naithe Nua 1
Teach/?ras?n C?naithe Ath?imhe 26
Teach/?ras?n C?naithe Nua 2
  • Here we have a different issue
  • Irish being the first language of the State, everyone has the right to put in their records as Gaeilge, and some people have
  • In general, you want to normalise all text to have one language

33 Normalising Text

ppr6 <- mutate(ppr5, description_of_property=
                             grepl("Nua", x=description_of_property),
                             "New Dwelling house /Apartment",
                             grepl("Ath", x=description_of_property),
                             "Second-Hand Dwelling house /Apartment",
                             ##nested ifelse are pretty useful

with(ppr6, table(description_of_property, useNA="always")) 

34 Property Size Description

with(ppr6, table(property_size_description, useNA="always"))
property_size_description Freq
greater than 125 sq metres 6886
greater than or equal to 125 sq metres 4197
greater than or equal to 38 sq metres and less than 125 sq metres 36097
less than 38 sq metres 3163
n?os l? n? 38 m?adar cearnach 1
n?os m? n? n? cothrom le 38 m?adar cearnach agus n?os l? n? 125 m?adar cearnach 2
nil 280677
  • So we have both a missing value problem, and a language problem here
  • we can resolve the language problem with regular expressions
ppr7 <- mutate(ppr6, prop_description=ifelse(
                         grepl("cothrom", x=property_size_description),
                         "greater than or equal to 38 sq metres and less than 125 sq metres",
                                          grepl("cearnach", x=property_size_description),
                                          "less than 38 sq metres",
  • Interestingly enough, after fixing this I noticed an inconsistency in the data
  • For some properties we have >125m^2, while for others we have >=125m^2
  • This isn't a massive difference, but we'll get better results if we don't have two redundant levels

35 Shortening Variable Names

  • If we don't do anything, it will be awkward to print any models using property size description
  • This is worth fixing, as it makes life easier
ppr8 <- mutate(ppr7, prop_description=ifelse(
                         prop_description=="greater than 125 sq metres",
                                        "greater than or equal to 125 sq metres",

ppr9 <- mutate(ppr8, property_size_description=ifelse(
                                           prop_description=="less than 38 sq metres",
                                           "greater than or equal to 38 sq metres and less than 125 sq metres",
                                           prop_description))) %>%
    select(-prop_description) #pointless now
ppr10 <- mutate(ppr9, 
  • This is pretty gnarly code
  • Unfortunately, much data cleaning code is not automatable as it tends to be really specific, and so much of the code DS's write is like this

36 Missing Data

  • Missing data is the bane of every analyst's life
  • We have two variables with large amounts of missing data
    • property size description
    • postcode
  • There are a number of ways to treat missing data
    • casewise deletion (delete any observations with missing data on the particular variable)
    • listwise deletion (delete any observations with missing data on any variable)
  • These are the most common strategies, and they are pretty dumb

37 Missing Data Taxonomy

  • Research goes back to Rubin, 1979, Missing Data Analysis (link)
  • Distinguishes between three types of missingness
    • Missing Completely at Random (MCAR, all missingness is completely independent of everything else)
    • Missing At Random (MAR, missingness dependent on observed variables)
    • Missing not at Random (MNAR, missingness dependent on unobserved variables)
  • Casewise and Listwise Deletion are only appropriate for MCAR data 6

38 Missing Data Methods

  • Very brief treatment, consult Harrell or Kuhn for lots more
  • Simple imputation (mean/median): reduces variance, increases false precision
  • Predictive mean matching (use regression to estimate likely values): good, but time consuming
  • multiple imputation: run PMM for each variable in turn, return multiple datasets
  • Multiple imputation is the best idea, but is computationally expensive 7
  • Also MI will return multiple datasets, so you need to aggregate the results on each dataset, which requires extra coding (in many cases)
  • Nonetheless its the best approach, especially if you have a lot of missing data

39 Automated Variable treatment

  • In general, you want to be able to script all of your variable treatment
  • This is important for a bunch of reasons:
    1. Reproducibility
    2. Cross-Validation
    3. Avoiding ad-hoc mistakes (very, very common)
  • There are a number of packages available for this in R, amongst them caret and vtreat

40 Reproducibility Problems

  • In general, reproducibility is the most important thing we can get from a model
  • if we can't get similar results on unseen data then the modelling process was a waste of time
  • The best way (the only way, really) to assess reproducibility is with independent samples
  • independent samples can be hard to come by 8
  • There are two main methods for simulating independent samples:
    1. Cross-validation
      1. K-fold CV
      2. Leave-one-out
    2. Bootstrapping:
      1. parametric
      2. Non-parametric

41 Cross Validation

  • Whenever you plan to fit a model, do this first!
  • Cross-validation is probably the biggest contribution of ML practicioners to statistical practice
  • What is cross-validation?
  • Two main types:
    • K-fold: split the data into K pieces, fit model on K-1 parts, then validate on part K
    • Leave-one-out: for each datapoint, fit one model on N-1 parts, validate on observation N.
  • Asymptotically, these behave the same :P 9
  • I prefer K-fold, as it's easier to implement practically

42 Bootstrapping

  • You have one dataset, of size N
  • Sample 10,100,1000 datasets from the original, each of size N (with replacement, obviously)
  • Calculate your statistic/fit a model on each subsample
  • Average the results
  • Bootstrapping provides a method for getting confidence/prediction/uncertainty intervals for anything

43 Code (K-Fold CV)

#set.seed ensures that future samples are consistent so this is a bad
## generated by taking the integer part of set.seed(rnorm(1, mean=50, sd=100)) 
#create sample of 70% of rows, weighted by quantiles of y (i.e.
ppr_train_indices <- with(ppr10,

#apply this sample to the dataset
ppr_train <- ppr10[ppr_train_indices,]
#get not in this sample (i.e. test)
ppr_not_train <- ppr10[-ppr_train_indices,]
#create another sample of 50% size
ppr_test_indices <- with(ppr_not_train,
                                             list=FALSE)) #get a 50% sample

#generate test dataset
ppr_test <- ppr_not_train[ppr_test_indices,]
#generate validation dataset (hide this under a rock until the very,
#very end)
ppr_validation <- ppr_not_train[ppr_test_indices,]
write_csv(x=ppr_validation, path="ppr_validation_set.csv")
  • Do this before you do anything else!!

Exercise: Create a script that will perform cross-validation and all feature transforms described in this document thus far, on a new dataset.

44 Questions

  • Is the cross-validation method shown above appropriate for the problem of predicting house prices given a date? Why or why not?
  • What is missing from the above treatment of the data for modelling?

45 Code: Bootstrapping

ppr_train2 <- rename(ppr_train, date_of_sale=date_of_sale_dd_mm_yyyy ) %>%
ppr_bootstrap_indices <- with(ppr_train, createResample(log_price, times=100))

generate_bootstrap_results <- function(df, ind) {
    #results list (always generate something to hold your results first)
    trainresults <- vector(mode="list",length=10)
    for (i in 1:length(ind)) {
        nm <- names(ind[i])
        train <- df[ind[[i]],]
        test <- df[-ind[[i]],]
    model <- lm(log_price~property_size_description+year,
    preds <- predict(model,
                     newdata=test, type="response",
    trainresults[[i]] <- preds
    names(trainresults) <- names(ind)
g <- generate_bootstrap_results(ppr_train2, ppr_bootstrap_indices)
hist(sapply(g, mean)^10)


  • We need to transform back to the original scale to get useful information
  • This model is pretty crap, and normally you'd want the 5th and 95th quantiles to produce confidence intervals
  • Nonetheless, this is how you can bootstrap
  • In general, you probably want to write some functions to automate parts of this

46 Cross Validation vs the Bootstrap

In general, I tend to use cross-validation pretty much everywhere. My reasons are the following:

  • It's much easier to get started with
  • It's much easier to avoid polluting your validation data
  • This makes your models more likely to replicate on new(er) samples

However, my approach has the following problems:

  • Much higher variance: I could get really terrible splits or really good splits
  • Not as rigorous
  • Not as statistically efficient

In general, I strongly recommend splitting out a final sample for analysis at the end of the project. In some cases, you may not need to do this, especially if it's log data and being generated daily. You should still do this as a best practice though.

As a complement to this, bootstrapping is an absolutely fabulous technique that should be used with abandon. Many people tend to make lots of unrealistic assumptions (normality, indpendence, homogenity of variances ) to get confidence intervals. That's mostly a waste of time in practice, as you can just boostrap those intervals. In essence, we trade flexibility for computation. This is (often) a good trade. When you wait hours to fit a model, bootstrapping is not going to be fun or effective, but this is rarely the case (and in this case, you can just sample).

On the other hand, if you have very little data (less than 1000 observations), then the variance of small test sets can cause real problems. In situations like this the bootstrap excels.

47 Different Models

  • Need to decide what model to fit
  • At base, all models take a matrix
  • R does this for you in many cases, but with vtreat, you need to pass a model matrix

48 Design matrix

  • A matrix which can be passed directly to a modelling function
  • Can be created from a factor and a data-frame in R, like so()
ppr_mm <- model.matrix(price~., data=ppr)

  • This one-hot encodes factors
  • This allows them to be used in a regression model
  • Ultimately, we are doing matrix multiplication here, so everything must be represented as a number
  • This is mostly hidden from the casual user

49 Recommended Reading

  • Harrell, RMS (see here)
  • Gelman and Hill, Applied Regression (here)
  • Intro Stat Learning (free copy here)
  • vtreat paper (here)



unless you work in digital marketing :(


in reality you'd have done this way before now


what's up with that?


assuming independence, which is of course ridiculous


that is, your statistical model


which is untestable, unfortunately


which is often worse than merely being hard


this varies widely by field and industry, but it's best to plan for the worst case


nobody has infinite data

Author: Richie Morrisroe

Created: 2020-04-29 Wed 11:45