Preparing Data for Modelling

1 Introduction

There are three parts to this talk:

Three parts to the talk:
- Getting data, deciding what to do
- Cleaning data
- Handling issues/getting more data
- Repeat until performance acceptable

The goal (in R code) is to be able to do (lm(repsonse~predictiors, data=mydata)) and have it all work.

This never, ever works without lots of work, and that's part of preparing data for modelling.

2 Asking Questions

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of - R.A. Fisher

a week in the lab will save you a day in the library, - unknown

Figure out what you want to do
Figure out how you will measure the result
Figure out what data you have (or can get)
Do EDA, find weirdness
Remove weirdness
Attempt to fit a model
Fail, look closer at data
Repeat until better performance is obtained/you give up in frustration

3 What data

Granularity: hourly, daily, weekly monthly
Effective granularity: how similar are two observations "next" to each other

4 EDA

ggplot2 is now your god and saviour
We'll show some examples below
eye is optimised for finding patterns
make use of this

5 What to look for

Inconsistencies!
Very domain-dependent
Not a lot of actionable advice ¹

6 Example Data

Property Price Register (available here)
Because I'm irish, and therefore obsessed with property ;)
Available in the repository for this course

7 Read in data

library(tidyverse)
library(readxl)
ppr <- read_excel("~/Dropbox/PPR/PPR-ALL.xlsx", sheet="PPR-ALL")
names(ppr)

Date of Sale (dd/mm/yyyy)

Address

Postal Code

County

Price ()

Not Full Market Price

VAT Exclusive

Description of Property

Property Size Description

Note that this appears to be optimised for reading, not coding

8 Data Cleaning

normalise_names <- function(df) {
    nms <- names(df)
    normed <- iconv(#covert from one encoding to another
        tolower(
            gsub("([[:space:]]|[[:punct:]])+", "_", #regular expressions
                 x=nms)),
        "latin1",#from encoding
        "ASCII",#to encoding
        sub="")
    drop_usc <- gsub("([a-z_])_*$", "\\1", x=normed) #drop extra underscores
    names(df) <- drop_usc #update names
    df #return dataframe
}
ppr2 <- normalise_names(ppr)

9 EDA

You always want to check what is actually in your data, as sometimes this may suprise you
You need to check:
- class of data (numeric, character, etc)
- range of data (summary)
- distribution of data (plots, mostly histograms or densities)

10 Example: Price

This is our outcome variable
We assume this is numeric

with(ppr2, mean(price))


[1] NA
Warning message:
In mean.default(price) : argument is not numeric or logical: returning NA

Huh?

11 Example Price, Continued

date_of_sale_dd_mm_yyyy	price	postal_code
01/01/2010	343,000.00	nil
03/01/2010	185,000.00	nil
04/01/2010	438,500.00	nil
04/01/2010	400,000.00	nil
04/01/2010	160,000.00	nil
04/01/2010	425,000.00	nil

12 WTF?

This is pretty common, Excel inserts currency symbols willy-nilly, and treats the columns as numeric
R, on the other hand, thinks that this is not acceptable, and treats such columns as character, and hence no calling mean
We can resolve this with a little code

13 Fixing Price

fix_price <- function(x) {
    nopunct <- gsub(",|\\.", "", x=x)
    nums <- as.numeric(
        iconv(nopunct, "latin1",
              "ASCII",
              sub=""))
}
ppr2$price <- with(ppr2, fix_price(price))
with(ppr2, mean(price))

14 More WTF

That doesn't look like a house price
Do we really believe that the average house in the dataset sold for €23,754,443?
Clearly not (right?)
This is actually a good example of where domain knowledge (or common sense) comes into play

15 Well Actually,

You'd look at a plot and go what the hell?
Then look at the summary values

with(ppr2, summary(price))


     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
5.079e+05 1.010e+07 1.780e+07 2.375e+07 2.800e+07 1.392e+10

16 Dividing by 100

with(ppr2, summary(price/100))

Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
5079    101000    178000    237544    280000 139165000

17 More Plots

Still pretty crazy
Let's ignore everything greater than 2mn
So all the big ones are pretty close to 2mn.
Still a pretty crazy distribution of prices

18 Modelling Price

Always greater than 0
Massively, massively skewed (like most interesting response variables)
What else do we need to think about?
Does a linear model make sense?
Should we just say screw it, and use a neural net?
Are there any transformations that spring to mind?

19 You're not modelling in a vacuum

You have (presumably) some idea of why you're analysing the data
Other people have probably done similar things, maybe look at how they did it?
The solution to this problem is Google, judiciously.
Honestly, i find that I get a lot more out of a few good books but Google and short articles are super useful when you need more specifics (or working code)
For positive variables, we can do log or square root
Square root tends to be preferred for counts, and log is preferred for things that grow in a multiplicative fashion (rich get richer)
Mostly for reasons of computability, I went with a log transform.

20 Transforming Price

Much better, from a modelling point of view
Why is this?
But the log is kinda annoying
Dunno about you, but I can't do powers of 2.7 in my head.
On the other hand, powers of 10 are pretty easy

21 Response Variables

Given that we know what we're doing with the predictor, we need to examine the response variables.
These can range from the very useful to the very useless.

22 More EDA

This is all part of the process
Try to love looking around
First step, what have we got? ²

names(ppr4)

date_of_sale_dd_mm_yyyy

address

postal_code

county

price

not_full_market_price

vat_exclusive

description_of_property

property_size_description

is_big

log_price

23 What kind of variables are they?

sapply(ppr4, class)

date_of_sale_dd_mm_yyyy	character
address	character
postal_code	character
county	character
price	numeric
not_full_market_price	character
vat_exclusive	character
description_of_property	character
property_size_description	character
is_big	character
log_price	numeric

24 Date of Sale

Some days have way, way more observations than others
Some days are just odd (that point in the upper right is really weird ) ³

25 What to do with date columns?

They can be handled in many different ways:
- as a number of days since some threshold
- as a day, week and month within a year
- as the primary unit of analysis (time series)
- You can also just ignore it, which may suit your purposes also

26 Tradeoffs in Feature Generation

So assuming that we want day, week, month and year in the model
Given that our dataset covers approximately 8 years, that would ⁴ require 8*365=2920 parameters
We have 331k observations
Is estimating all of these parameters worth it?
How do you make that decision?
Should you, or just generate all the features and let God ⁵ sort it out?
Different situations require different approaches
Is it better to have a simple, robust model or a more flexible, fragile model?

27 Process

Review each of your variables individually first
Look for weird crap
Some weird crap can include:
- numeric missing values (like 99, 999, or 9998, or -1)
- characters as factors
- unique identifiers
- wrong type of data (like price above)

28 Address

Something that can be really important in modelling categorical features is the dimensionality, i.e. the number of distinct values
High dimensionality can cause significant problems for statistical/ML models (why?)

date_of_sale_dd_mm_yyyy	3023
address	316658
postal_code	32
county	26
price	21834
not_full_market_price	2
vat_exclusive	2
description_of_property	5
property_size_description	7
is_big	2
log_price	21834

29 Postal Code

post_codes_table <-
    with(ppr4, table(postal_code, useNA="always")) %>%
    as.data.frame() %>%
    arrange(desc(Freq))
head(post_codes_table, n=12)

postal_code	Freq
nil	270264
Dublin 15	7435
Dublin 18	4387
Dublin 24	3863
Dublin 8	3707
Dublin 4	3674
Dublin 7	3411
Dublin 9	3314
Dublin 11	3196
Dublin 13	3093
Dublin 12	2954
Dublin 14	2912

Hmmm, pretty strange that all of the postcodes are in Dublin, right?
What's the deal with all those missings?

ppr5  <- mutate(ppr4, no_postcode=ifelse(is.na(postal_code), "Yes", "No"))
no_postcode_county <-
    filter(ppr5, no_postcode=="Yes") %>%
    group_by(county) %>%
    summarise(count=n()) %>%
    arrange(desc(count))

postcode_county <-
    filter(ppr5, no_postcode=="No") %>%
    group_by(county) %>%
    summarise(count=n())  %>%
    arrange(desc(count))

Dublin	60690
Cork	8
Wexford	8
Wicklow	8
Meath	6
Kildare	5
Louth	5
Limerick	4
Mayo	4
Carlow	3
Galway	3
Westmeath	3
Roscommon	2
Tipperary	2
Waterford	2
Clare	1
Kerry	1
Kilkenny	1
Leitrim	1
Longford	1
Offaly	1

head(postcode_county, n=5)

county	count
Dublin	60690
Cork	8
Wexford	8
Wicklow	8
Meath	6

So almost every property with a postcode is in Dublin
Why?

30 not full market price

with(ppr4, table(not_full_market_price, useNA="always"))

not_full_market_price	Freq
No	316286
Yes	14737
nil	0

This is a super confusing variable
If No, then it is full market price, else it's not
We should reframe this to make it easier to interpret

ppr5 <- mutate(ppr4,
               is_full_market_price=ifelse(
                   not_full_market_price=="No",
                   "Yes",
                   "No" )) %>%
    select(-not_full_market_price) #remove old var - what happens if we don't?

31 vat exclusive

with(ppr5, table(vat_exclusive, useNA="always"))

vat_exclusive	Freq
No	281909
Yes	49114
nil	0

In general, secondhand properties are not liable for VAT, while new properties are
However, the price reported on the register is not inclusive of VAT, which means that the overall prices are not comparable for new houses (relative to second-hand houses)
This variable doesn't really need much treatment, it can be used as is

32 description of property

with(ppr5, table(description_of_property))

description_of_property	Freq
New Dwelling house /Apartment	50343
Second-Hand Dwelling house /Apartment	280651
Teach/?ras?n C?naithe Nua	1
Teach/?ras?n C?naithe Ath?imhe	26
Teach/?ras?n C?naithe Nua	2

Here we have a different issue
Irish being the first language of the State, everyone has the right to put in their records as Gaeilge, and some people have
In general, you want to normalise all text to have one language

33 Normalising Text

ppr6 <- mutate(ppr5, description_of_property=
                         ifelse(
                             grepl("Nua", x=description_of_property),
                             "New Dwelling house /Apartment",
                         ifelse(
                             grepl("Ath", x=description_of_property),
                             "Second-Hand Dwelling house /Apartment",
                             ##nested ifelse are pretty useful
                             description_of_property))) 

with(ppr6, table(description_of_property, useNA="always"))

34 Property Size Description

with(ppr6, table(property_size_description, useNA="always"))

property_size_description	Freq
greater than 125 sq metres	6886
greater than or equal to 125 sq metres	4197
greater than or equal to 38 sq metres and less than 125 sq metres	36097
less than 38 sq metres	3163
n?os l? n? 38 m?adar cearnach	1
n?os m? n? n? cothrom le 38 m?adar cearnach agus n?os l? n? 125 m?adar cearnach	2
nil	280677

So we have both a missing value problem, and a language problem here
we can resolve the language problem with regular expressions

ppr7 <- mutate(ppr6, prop_description=ifelse(
                         grepl("cothrom", x=property_size_description),
                         "greater than or equal to 38 sq metres and less than 125 sq metres",
                                      ifelse(
                                          grepl("cearnach", x=property_size_description),
                                          "less than 38 sq metres",
                                          property_size_description)))

Interestingly enough, after fixing this I noticed an inconsistency in the data
For some properties we have >125m^2, while for others we have >=125m^2
This isn't a massive difference, but we'll get better results if we don't have two redundant levels

35 Shortening Variable Names

If we don't do anything, it will be awkward to print any models using property size description
This is worth fixing, as it makes life easier

ppr8 <- mutate(ppr7, prop_description=ifelse(
                         prop_description=="greater than 125 sq metres",
                         "ge_125_square_meters",
                                    ifelse(
                                        prop_description==
                                        "greater than or equal to 125 sq metres",
                                        "ge_125_square_meters",
                                        prop_description)))

ppr9 <- mutate(ppr8, property_size_description=ifelse(
                                           prop_description=="less than 38 sq metres",
                                           "lt_38_square_meters",
                                       ifelse(
                                           prop_description==
                                           "greater than or equal to 38 sq metres and less than 125 sq metres",
                                           "ge_38_lt_125_square_meters",
                                           prop_description))) %>%
    select(-prop_description) #pointless now
ppr10 <- mutate(ppr9, 
                property_size_description=fct_explicit_na(property_size_description))

This is pretty gnarly code
Unfortunately, much data cleaning code is not automatable as it tends to be really specific, and so much of the code DS's write is like this

36 Missing Data

Missing data is the bane of every analyst's life
We have two variables with large amounts of missing data
- property size description
- postcode
There are a number of ways to treat missing data
- casewise deletion (delete any observations with missing data on the particular variable)
- listwise deletion (delete any observations with missing data on any variable)
These are the most common strategies, and they are pretty dumb

37 Missing Data Taxonomy

Research goes back to Rubin, 1979, Missing Data Analysis (link)
Distinguishes between three types of missingness
- Missing Completely at Random (MCAR, all missingness is completely independent of everything else)
- Missing At Random (MAR, missingness dependent on observed variables)
- Missing not at Random (MNAR, missingness dependent on unobserved variables)
Casewise and Listwise Deletion are only appropriate for MCAR data ⁶

38 Missing Data Methods

Very brief treatment, consult Harrell or Kuhn for lots more
Simple imputation (mean/median): reduces variance, increases false precision
Predictive mean matching (use regression to estimate likely values): good, but time consuming
multiple imputation: run PMM for each variable in turn, return multiple datasets
Multiple imputation is the best idea, but is computationally expensive ⁷
Also MI will return multiple datasets, so you need to aggregate the results on each dataset, which requires extra coding (in many cases)
Nonetheless its the best approach, especially if you have a lot of missing data

39 Automated Variable treatment

In general, you want to be able to script all of your variable treatment
This is important for a bunch of reasons:
1. Reproducibility
2. Cross-Validation
3. Avoiding ad-hoc mistakes (very, very common)
There are a number of packages available for this in R, amongst them caret and vtreat

40 Reproducibility Problems

In general, reproducibility is the most important thing we can get from a model
if we can't get similar results on unseen data then the modelling process was a waste of time
The best way (the only way, really) to assess reproducibility is with independent samples
independent samples can be hard to come by ⁸
There are two main methods for simulating independent samples:
1. Cross-validation
  1. K-fold CV
  2. Leave-one-out
2. Bootstrapping:
  1. parametric
  2. Non-parametric

41 Cross Validation

Whenever you plan to fit a model, do this first!
Cross-validation is probably the biggest contribution of ML practicioners to statistical practice
What is cross-validation?
Two main types:
- K-fold: split the data into K pieces, fit model on K-1 parts, then validate on part K
- Leave-one-out: for each datapoint, fit one model on N-1 parts, validate on observation N.
Asymptotically, these behave the same :P ⁹
I prefer K-fold, as it's easier to implement practically

42 Bootstrapping

You have one dataset, of size N
Sample 10,100,1000 datasets from the original, each of size N (with replacement, obviously)
Calculate your statistic/fit a model on each subsample
Average the results
Bootstrapping provides a method for getting confidence/prediction/uncertainty intervals for anything

43 Code (K-Fold CV)

library(caret)
#set.seed ensures that future samples are consistent so this is a bad
#idea
## generated by taking the integer part of set.seed(rnorm(1, mean=50, sd=100)) 
set.seed(34)
#create sample of 70% of rows, weighted by quantiles of y (i.e.
#log_price)
ppr_train_indices <- with(ppr10,
                          createDataPartition(log_price,
                                              times=1,
                                              p=0.7,
                                              list=FALSE)) 

#apply this sample to the dataset
ppr_train <- ppr10[ppr_train_indices,]
#get not in this sample (i.e. test)
ppr_not_train <- ppr10[-ppr_train_indices,]
#create another sample of 50% size
ppr_test_indices <- with(ppr_not_train,
                         createDataPartition(log_price,
                                             times=1,
                                             p=0.5,
                                             list=FALSE)) #get a 50% sample

#generate test dataset
ppr_test <- ppr_not_train[ppr_test_indices,]
#generate validation dataset (hide this under a rock until the very,
#very end)
ppr_validation <- ppr_not_train[ppr_test_indices,]
write_csv(x=ppr_validation, path="ppr_validation_set.csv")
rm(ppr_validation)

Do this before you do anything else!!

Exercise: Create a script that will perform cross-validation and all feature transforms described in this document thus far, on a new dataset.

44 Questions

Is the cross-validation method shown above appropriate for the problem of predicting house prices given a date? Why or why not?
What is missing from the above treatment of the data for modelling?

45 Code: Bootstrapping

ppr_train2 <- rename(ppr_train, date_of_sale=date_of_sale_dd_mm_yyyy ) %>%
    mutate(year=lubridate::year(date_of_sale),
           month=lubridate::month(date_of_sale),
           quarter=lubridate::quarter(date_of_sale)
           )
ppr_bootstrap_indices <- with(ppr_train, createResample(log_price, times=100))

generate_bootstrap_results <- function(df, ind) {
    #results list (always generate something to hold your results first)
    trainresults <- vector(mode="list",length=10)
    for (i in 1:length(ind)) {
        nm <- names(ind[i])
        train <- df[ind[[i]],]
        test <- df[-ind[[i]],]
    model <- lm(log_price~property_size_description+year,
                data=train)
    preds <- predict(model,
                     newdata=test, type="response",
                     na.action=na.exclude)
    trainresults[[i]] <- preds
}
    names(trainresults) <- names(ind)
    trainresults
}
g <- generate_bootstrap_results(ppr_train2, ppr_bootstrap_indices)
hist(sapply(g, mean)^10)

We need to transform back to the original scale to get useful information
This model is pretty crap, and normally you'd want the 5th and 95th quantiles to produce confidence intervals
Nonetheless, this is how you can bootstrap
In general, you probably want to write some functions to automate parts of this

46 Cross Validation vs the Bootstrap

In general, I tend to use cross-validation pretty much everywhere. My reasons are the following:

It's much easier to get started with
It's much easier to avoid polluting your validation data
This makes your models more likely to replicate on new(er) samples

However, my approach has the following problems:

Much higher variance: I could get really terrible splits or really good splits
Not as rigorous
Not as statistically efficient

In general, I strongly recommend splitting out a final sample for analysis at the end of the project. In some cases, you may not need to do this, especially if it's log data and being generated daily. You should still do this as a best practice though.

As a complement to this, bootstrapping is an absolutely fabulous technique that should be used with abandon. Many people tend to make lots of unrealistic assumptions (normality, indpendence, homogenity of variances ) to get confidence intervals. That's mostly a waste of time in practice, as you can just boostrap those intervals. In essence, we trade flexibility for computation. This is (often) a good trade. When you wait hours to fit a model, bootstrapping is not going to be fun or effective, but this is rarely the case (and in this case, you can just sample).

On the other hand, if you have very little data (less than 1000 observations), then the variance of small test sets can cause real problems. In situations like this the bootstrap excels.

47 Different Models

Need to decide what model to fit
At base, all models take a matrix
R does this for you in many cases, but with vtreat, you need to pass a model matrix

48 Design matrix

A matrix which can be passed directly to a modelling function
Can be created from a factor and a data-frame in R, like so()

ppr_mm <- model.matrix(price~., data=ppr)

org_babel_R_eoe

This one-hot encodes factors
This allows them to be used in a regression model
Ultimately, we are doing matrix multiplication here, so everything must be represented as a number
This is mostly hidden from the casual user

49 Recommended Reading

Harrell, RMS (see here)
Gelman and Hill, Applied Regression (here)
Intro Stat Learning (free copy here)
vtreat paper (here)

Footnotes:

unless you work in digital marketing :(

in reality you'd have done this way before now

what's up with that?

⁴

assuming independence, which is of course ridiculous

⁵

that is, your statistical model

⁶

which is untestable, unfortunately

⁷

which is often worse than merely being hard

⁸

this varies widely by field and industry, but it's best to plan for the worst case

⁹

nobody has infinite data