Preparing Data for Modelling
1 Introduction
There are three parts to this talk:
- Getting data, deciding what to do
- Cleaning data
- Handling issues/getting more data
- Repeat until performance is acceptable
The goal (in R code) is to be able to run lm(response ~ predictors, data=mydata) and have it all work.
This never, ever works without lots of work, and that's part of preparing data for modelling.
2 Asking Questions
"To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." - R.A. Fisher
"A week in the lab will save you a day in the library." - unknown
- Figure out what you want to do
- Figure out how you will measure the result
- Figure out what data you have (or can get)
- Do EDA, find weirdness
- Remove weirdness
- Attempt to fit a model
- Fail, look closer at data
- Repeat until better performance is obtained/you give up in frustration
3 What data
- Granularity: hourly, daily, weekly, monthly
- Effective granularity: how similar are two observations "next" to each other
4 EDA
- ggplot2 is now your god and saviour
- We'll show some examples below
- The eye is optimised for finding patterns
- make use of this
5 What to look for
- Inconsistencies!
- Very domain-dependent
- Not a lot of actionable advice [1]
6 Example Data
- Property Price Register
- Because I'm Irish, and therefore obsessed with property ;)
- Available in the repository for this course
7 Read in data
library(tidyverse)
library(readxl)
ppr <- read_excel("~/Dropbox/PPR/PPR-ALL.xlsx", sheet="PPR-ALL")
names(ppr)
Date of Sale (dd/mm/yyyy) |
Address |
Postal Code |
County |
Price (€) |
Not Full Market Price |
VAT Exclusive |
Description of Property |
Property Size Description |
- Note that this appears to be optimised for reading, not coding
8 Data Cleaning
normalise_names <- function(df) {
    nms <- names(df)
    normed <- iconv( #convert from one encoding to another
        tolower(
            gsub("([[:space:]]|[[:punct:]])+", "_", #regular expressions
                 x=nms)),
        "latin1", #from encoding
        "ASCII",  #to encoding
        sub="")
    drop_usc <- gsub("([a-z_])_*$", "\\1", x=normed) #drop trailing underscores
    names(df) <- drop_usc #update names
    df #return dataframe
}
ppr2 <- normalise_names(ppr)
9 EDA
- You always want to check what is actually in your data, as sometimes this may surprise you
- You need to check:
- class of data (numeric, character, etc)
- range of data (summary)
- distribution of data (plots, mostly histograms or densities)
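In code terms, those three checks might look something like this (a generic sketch, not from the original slides):

sapply(ppr2, class)   #what type is each column?
summary(ppr2)         #ranges for numeric columns; character columns just report their class
#once a column is confirmed numeric, plot its distribution, e.g.
#ggplot(ppr2, aes(x=price)) + geom_histogram(bins=50)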
10 Example: Price
- This is our outcome variable
- We assume this is numeric
with(ppr2, mean(price))
[1] NA
Warning message:
In mean.default(price) : argument is not numeric or logical: returning NA
- Huh?
11 Example Price, Continued
date_of_sale_dd_mm_yyyy | price | postal_code |
---|---|---|
01/01/2010 | €343,000.00 | nil |
03/01/2010 | €185,000.00 | nil |
04/01/2010 | €438,500.00 | nil |
04/01/2010 | €400,000.00 | nil |
04/01/2010 | €160,000.00 | nil |
04/01/2010 | €425,000.00 | nil |
12 WTF?
- This is pretty common, Excel inserts currency symbols willy-nilly, and treats the columns as numeric
- R, on the other hand, thinks that this is not acceptable and treats such columns as character, so mean() cannot be called on them
- We can resolve this with a little code
13 Fixing Price
fix_price <- function(x) {
    nopunct <- gsub(",|\\.", "", x=x)
    nums <- as.numeric(
        iconv(nopunct, "latin1", "ASCII", sub=""))
    nums #return the cleaned numeric vector
}
ppr2$price <- with(ppr2, fix_price(price))
with(ppr2, mean(price))
14 More WTF
- That doesn't look like a house price
- Do we really believe that the average house in the dataset sold for €23,754,443?
- Clearly not (right?)
- This is actually a good example of where domain knowledge (or common sense) comes into play
15 Well Actually,
- You'd look at a plot and go what the hell?
- Then look at the summary values
with(ppr2, summary(price))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
5.079e+05 1.010e+07 1.780e+07 2.375e+07 2.800e+07 1.392e+10
16 Dividing by 100
- Why 100? fix_price stripped the decimal point along with the commas, so every price picked up two extra trailing digits.
with(ppr2, summary(price/100))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
     5079    101000    178000    237544    280000 139165000
17 More Plots
- Still pretty crazy
- Let's ignore everything greater than 2mn
- So all the big ones are pretty close to 2mn.
- Still a pretty crazy distribution of prices
18 Modelling Price
- Always greater than 0
- Massively, massively skewed (like most interesting response variables)
- What else do we need to think about?
- Does a linear model make sense?
- Should we just say screw it, and use a neural net?
- Are there any transformations that spring to mind?
19 You're not modelling in a vacuum
- You have (presumably) some idea of why you're analysing the data
- Other people have probably done similar things, maybe look at how they did it?
- The solution to this problem is Google, judiciously.
- Honestly, I find that I get a lot more out of a few good books, but Google and short articles are super useful when you need more specifics (or working code)
- For positive variables, we can do log or square root
- Square root tends to be preferred for counts, and log is preferred for things that grow in a multiplicative fashion (rich get richer)
- Mostly for reasons of computability, I went with a log transform.
20 Transforming Price
- Much better, from a modelling point of view
- Why is this?
- But the log is kinda annoying
- Dunno about you, but I can't do powers of 2.7 in my head.
- On the other hand, powers of 10 are pretty easy
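The intermediate steps that produce ppr3/ppr4 are not shown on these slides; a minimal sketch of what they might have looked like, assuming price is rescaled to euro, is_big flags the sales above 2mn, and the log is taken base 10:

#hypothetical reconstruction of the unshown ppr3/ppr4 steps
ppr3 <- mutate(ppr2, price=price/100)                   #undo the lost decimal point
ppr4 <- mutate(ppr3,
               is_big=ifelse(price > 2e6, "Yes", "No"), #flag the huge sales
               log_price=log10(price))                  #powers of 10 are easy to read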
21 Predictor Variables
- Given that we know what we're doing with the response, we need to examine the predictor variables.
- These can range from the very useful to the very useless.
22 More EDA
- This is all part of the process
- Try to love looking around
- First step, what have we got? [2]
names(ppr4)
date_of_sale_dd_mm_yyyy |
address |
postal_code |
county |
price |
not_full_market_price |
vat_exclusive |
description_of_property |
property_size_description |
is_big |
log_price |
23 What kind of variables are they?
sapply(ppr4, class)
variable | class |
---|---|
date_of_sale_dd_mm_yyyy | character |
address | character |
postal_code | character |
county | character |
price | numeric |
not_full_market_price | character |
vat_exclusive | character |
description_of_property | character |
property_size_description | character |
is_big | character |
log_price | numeric |
24 Date of Sale
- Some days have way, way more observations than others
- Some days are just odd (that point in the upper right is really weird) [3]
25 What to do with date columns?
- They can be handled in many different ways:
- as a number of days since some threshold
- as a day, week and month within a year
- as the primary unit of analysis (time series)
- You can also just ignore it, which may suit your purposes also
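As an illustration of the first two options (a sketch, not from the slides; the column names are my own), lubridate makes these features easy to generate, assuming the dd/mm/yyyy strings parse cleanly:

library(lubridate)
ppr_dates <- mutate(ppr4,
                    sale_date=dmy(date_of_sale_dd_mm_yyyy),  #parse "01/01/2010"-style strings
                    sale_year=year(sale_date),
                    sale_month=month(sale_date),
                    sale_week=isoweek(sale_date),
                    days_since_start=as.numeric(sale_date - min(sale_date)))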
26 Tradeoffs in Feature Generation
- So assuming that we want day, week, month and year in the model
- Given that our dataset covers approximately 8 years, that would require 8*365=2920 parameters [4]
- We have 331k observations
- Is estimating all of these parameters worth it?
- How do you make that decision?
- Should you, or should you just generate all the features and let God sort it out? [5]
- Different situations require different approaches
- Is it better to have a simple, robust model or a more flexible, fragile model?
27 Process
- Review each of your variables individually first
- Look for weird crap
- Some weird crap can include:
- numeric missing values (like 99, 999, or 9998, or -1)
- characters as factors
- unique identifiers
- wrong type of data (like price above)
28 Address
- Something that can be really important in modelling categorical features is the dimensionality, i.e. the number of distinct values
- High dimensionality can cause significant problems for statistical/ML models (why?)
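The counts below can be reproduced with something along these lines (the original call is not shown on the slide):

sapply(ppr4, dplyr::n_distinct) #number of distinct values per column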
variable | distinct_values |
---|---|
date_of_sale_dd_mm_yyyy | 3023 |
address | 316658 |
postal_code | 32 |
county | 26 |
price | 21834 |
not_full_market_price | 2 |
vat_exclusive | 2 |
description_of_property | 5 |
property_size_description | 7 |
is_big | 2 |
log_price | 21834 |
29 Postal Code
post_codes_table <- with(ppr4, table(postal_code, useNA="always")) %>%
    as.data.frame() %>%
    arrange(desc(Freq))
head(post_codes_table, n=12)
postal_code | Freq |
---|---|
nil | 270264 |
Dublin 15 | 7435 |
Dublin 18 | 4387 |
Dublin 24 | 3863 |
Dublin 8 | 3707 |
Dublin 4 | 3674 |
Dublin 7 | 3411 |
Dublin 9 | 3314 |
Dublin 11 | 3196 |
Dublin 13 | 3093 |
Dublin 12 | 2954 |
Dublin 14 | 2912 |
- Hmmm, pretty strange that all of the postcodes are in Dublin, right?
- What's the deal with all those missings?
ppr5 <- mutate(ppr4, no_postcode=ifelse(is.na(postal_code), "Yes", "No"))
no_postcode_county <- filter(ppr5, no_postcode=="Yes") %>%
    group_by(county) %>%
    summarise(count=n()) %>%
    arrange(desc(count))
postcode_county <- filter(ppr5, no_postcode=="No") %>%
    group_by(county) %>%
    summarise(count=n()) %>%
    arrange(desc(count))
county | count |
---|---|
Dublin | 60690 |
Cork | 8 |
Wexford | 8 |
Wicklow | 8 |
Meath | 6 |
Kildare | 5 |
Louth | 5 |
Limerick | 4 |
Mayo | 4 |
Carlow | 3 |
Galway | 3 |
Westmeath | 3 |
Roscommon | 2 |
Tipperary | 2 |
Waterford | 2 |
Clare | 1 |
Kerry | 1 |
Kilkenny | 1 |
Leitrim | 1 |
Longford | 1 |
Offaly | 1 |
head(postcode_county, n=5)
county | count |
---|---|
Dublin | 60690 |
Cork | 8 |
Wexford | 8 |
Wicklow | 8 |
Meath | 6 |
- So almost every property with a postcode is in Dublin
- Why?
30 not full market price
with(ppr4, table(not_full_market_price, useNA="always"))
not_full_market_price | Freq |
---|---|
No | 316286 |
Yes | 14737 |
nil | 0 |
- This is a super confusing variable
- If No, then it is full market price, else it's not
- We should reframe this to make it easier to interpret
ppr5 <- mutate(ppr4, is_full_market_price=ifelse(
                         not_full_market_price=="No", "Yes", "No")) %>%
    select(-not_full_market_price) #remove old var - what happens if we don't?
31 vat exclusive
with(ppr5, table(vat_exclusive, useNA="always"))
vat_exclusive | Freq |
---|---|
No | 281909 |
Yes | 49114 |
nil | 0 |
- In general, secondhand properties are not liable for VAT, while new properties are
- However, the price reported on the register is not inclusive of VAT, which means that the overall prices are not comparable for new houses (relative to second-hand houses)
- This variable doesn't really need much treatment, it can be used as is
32 description of property
with(ppr5, table(description_of_property))
description_of_property | Freq |
---|---|
New Dwelling house /Apartment | 50343 |
Second-Hand Dwelling house /Apartment | 280651 |
Teach/Árasán Cónaithe Nua | 1 |
Teach/Árasán Cónaithe Atháimhe | 26 |
Teach/Árasán Cónaithe Nua | 2 |
- Here we have a different issue
- Irish being the first language of the State, everyone has the right to enter their records as Gaeilge (in Irish), and some people have
- In general, you want to normalise all text to have one language
33 Normalising Text
ppr6 <- mutate(ppr5, description_of_property=ifelse(
                         grepl("Nua", x=description_of_property),
                         "New Dwelling house /Apartment",
                         ifelse( ##nested ifelses are pretty useful
                             grepl("Ath", x=description_of_property),
                             "Second-Hand Dwelling house /Apartment",
                             description_of_property)))
with(ppr6, table(description_of_property, useNA="always"))
34 Property Size Description
with(ppr6, table(property_size_description, useNA="always"))
property_size_description | Freq |
---|---|
greater than 125 sq metres | 6886 |
greater than or equal to 125 sq metres | 4197 |
greater than or equal to 38 sq metres and less than 125 sq metres | 36097 |
less than 38 sq metres | 3163 |
níos lú ná 38 méadar cearnach | 1 |
níos mó ná nó cothrom le 38 méadar cearnach agus níos lú ná 125 méadar cearnach | 2 |
nil | 280677 |
- So we have both a missing value problem, and a language problem here
- We can resolve the language problem with regular expressions
ppr7 <- mutate(ppr6, prop_description=ifelse(
                         grepl("cothrom", x=property_size_description),
                         "greater than or equal to 38 sq metres and less than 125 sq metres",
                         ifelse(
                             grepl("cearnach", x=property_size_description),
                             "less than 38 sq metres",
                             property_size_description)))
- Interestingly enough, after fixing this I noticed an inconsistency in the data
- For some properties we have >125m^2, while for others we have >=125m^2
- This isn't a massive difference, but we'll get better results if we don't have two redundant levels
35 Shortening Variable Names
- If we don't do anything, it will be awkward to print any models using property size description
- This is worth fixing, as it makes life easier
ppr8 <- mutate(ppr7, prop_description=ifelse(
                         prop_description=="greater than 125 sq metres",
                         "ge_125_square_meters",
                         ifelse(
                             prop_description=="greater than or equal to 125 sq metres",
                             "ge_125_square_meters",
                             prop_description)))
ppr9 <- mutate(ppr8, property_size_description=ifelse(
                         prop_description=="less than 38 sq metres",
                         "lt_38_square_meters",
                         ifelse(
                             prop_description=="greater than or equal to 38 sq metres and less than 125 sq metres",
                             "ge_38_lt_125_square_meters",
                             prop_description))) %>%
    select(-prop_description) #pointless now
ppr10 <- mutate(ppr9,
                property_size_description=fct_explicit_na(property_size_description))
- This is pretty gnarly code
- Unfortunately, much data cleaning code is not automatable, as it tends to be really specific, so a lot of the code data scientists write looks like this
36 Missing Data
- Missing data is the bane of every analyst's life
- We have two variables with large amounts of missing data
- property size description
- postcode
- There are a number of ways to treat missing data
- casewise deletion (delete any observations with missing data on the particular variable)
- listwise deletion (delete any observations with missing data on any variable)
- These are the most common strategies, and they are pretty dumb
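For completeness, the two deletion strategies in code (illustrative only; the column choice is just an example):

ppr_complete <- drop_na(ppr10)                     #listwise: drop rows missing any variable
ppr_has_pc   <- filter(ppr10, !is.na(postal_code)) #casewise: drop rows missing the one variable in use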
37 Missing Data Taxonomy
- Research goes back to Rubin (1976), Inference and Missing Data
- Distinguishes between three types of missingness
- Missing Completely at Random (MCAR, all missingness is completely independent of everything else)
- Missing At Random (MAR, missingness dependent on observed variables)
- Missing not at Random (MNAR, missingness dependent on unobserved variables)
- Casewise and listwise deletion are only appropriate for MCAR data [6]
38 Missing Data Methods
- Very brief treatment, consult Harrell or Kuhn for lots more
- Simple imputation (mean/median): reduces variance, increases false precision
- Predictive mean matching (use regression to estimate likely values): good, but time consuming
- multiple imputation: run PMM for each variable in turn, return multiple datasets
- Multiple imputation is the best idea, but is computationally expensive [7]
- Also MI will return multiple datasets, so you need to aggregate the results on each dataset, which requires extra coding (in many cases)
- Nonetheless it's the best approach, especially if you have a lot of missing data
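As a flavour of the workflow (a sketch, not from the slides; the mice package, the column choice, and the seed are my assumptions, and on the full dataset this will be slow):

library(mice)
ppr_mi <- ppr10 %>%
    select(log_price, county, vat_exclusive, postal_code) %>%
    mutate(across(where(is.character), as.factor)) #mice wants factors, not characters
imp <- mice(ppr_mi, m=5, seed=34)    #defaults: pmm for numeric, logistic/polytomous models for factors
fits <- with(imp, lm(log_price ~ county + vat_exclusive))
summary(pool(fits))                  #pool the estimates across the 5 imputed datasets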
39 Automated Variable treatment
- In general, you want to be able to script all of your variable treatment
- This is important for a bunch of reasons:
- Reproducibility
- Cross-Validation
- Avoiding ad-hoc mistakes (very, very common)
- There are a number of packages available for this in R, amongst them caret and vtreat
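As a flavour of the caret approach (a sketch under my own assumptions, using the train/test split created in the cross-validation section below), the key idea is that the treatment is learned on the training data only and then applied unchanged everywhere else:

library(caret)
#learn centring, scaling and median imputation from the training data only
pp <- preProcess(ppr_train[, c("price", "log_price")],
                 method=c("center", "scale", "medianImpute"))
train_scaled <- predict(pp, ppr_train[, c("price", "log_price")]) #apply the learned treatment
test_scaled  <- predict(pp, ppr_test[, c("price", "log_price")])  #same treatment, never re-estimated on test data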
40 Reproducibility Problems
- In general, reproducibility is the most important thing we can get from a model
- if we can't get similar results on unseen data then the modelling process was a waste of time
- The best way (the only way, really) to assess reproducibility is with independent samples
- independent samples can be hard to come by [8]
- There are two main methods for simulating independent samples:
- Cross-validation
- K-fold CV
- Leave-one-out
- Bootstrapping:
- parametric
- Non-parametric
41 Cross Validation
- Whenever you plan to fit a model, do this first!
- Cross-validation is probably the biggest contribution of ML practitioners to statistical practice
- What is cross-validation?
- Two main types:
- K-fold: split the data into K pieces, fit the model on K-1 of them, then validate on the held-out piece (rotating through all K)
- Leave-one-out: for each data point, fit the model on the other N-1 observations and validate on the held-out observation
- Asymptotically, these behave the same :P [9]
- I prefer K-fold, as it's easier to implement practically
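For concreteness, a minimal K-fold sketch using caret's createFolds (my own illustration; the model formula is just a placeholder):

library(caret)
set.seed(34)
folds <- createFolds(ppr10$log_price, k=5) #five sets of held-out row indices
cv_rmse <- sapply(folds, function(held_out) {
    fit <- lm(log_price ~ property_size_description, data=ppr10[-held_out, ])
    preds <- predict(fit, newdata=ppr10[held_out, ])
    sqrt(mean((ppr10$log_price[held_out] - preds)^2, na.rm=TRUE))
})
mean(cv_rmse) #average held-out error across the five folds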
42 Bootstrapping
- You have one dataset, of size N
- Sample 10, 100, or 1,000 datasets from the original, each of size N (with replacement, obviously)
- Calculate your statistic/fit a model on each subsample
- Average the results
- Bootstrapping provides a method for getting confidence/prediction/uncertainty intervals for anything
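At its simplest, a non-parametric bootstrap of a single statistic (here the mean of log_price; an illustrative sketch, not from the slides) looks like this:

set.seed(34)
boot_means <- replicate(1000, {
    resample <- sample(ppr10$log_price, size=nrow(ppr10), replace=TRUE)
    mean(resample, na.rm=TRUE)
})
quantile(boot_means, probs=c(0.025, 0.975)) #bootstrap 95% interval for the mean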
43 Code (K-Fold CV)
library(caret)
#set.seed ensures that future samples are consistent - skipping it is a bad idea
#(the seed itself is the integer part of one draw from rnorm(1, mean=50, sd=100))
set.seed(34)
#create sample of 70% of rows, weighted by quantiles of y (i.e. log_price)
ppr_train_indices <- with(ppr10,
                          createDataPartition(log_price, times=1,
                                              p=0.7, list=FALSE))
#apply this sample to the dataset
ppr_train <- ppr10[ppr_train_indices,]
#get everything not in this sample (i.e. test and validation)
ppr_not_train <- ppr10[-ppr_train_indices,]
#create another sample of 50% size
ppr_test_indices <- with(ppr_not_train,
                         createDataPartition(log_price, times=1,
                                             p=0.5, list=FALSE))
#generate test dataset
ppr_test <- ppr_not_train[ppr_test_indices,]
#generate validation dataset (hide this under a rock until the very,
#very end) - note the minus sign, so validation and test don't overlap
ppr_validation <- ppr_not_train[-ppr_test_indices,]
write_csv(x=ppr_validation, path="ppr_validation_set.csv")
rm(ppr_validation)
- Do this before you do anything else!!
Exercise: Create a script that will perform cross-validation and all feature transforms described in this document thus far, on a new dataset.
44 Questions
- Is the cross-validation method shown above appropriate for the problem of predicting house prices given a date? Why or why not?
- What is missing from the above treatment of the data for modelling?
45 Code: Bootstrapping
ppr_train2 <- rename(ppr_train,
                     date_of_sale=date_of_sale_dd_mm_yyyy) %>%
    mutate(date_of_sale=lubridate::dmy(date_of_sale), #parse the dd/mm/yyyy strings first
           year=lubridate::year(date_of_sale),
           month=lubridate::month(date_of_sale),
           quarter=lubridate::quarter(date_of_sale))
ppr_bootstrap_indices <- with(ppr_train2,
                              createResample(log_price, times=100))
generate_bootstrap_results <- function(df, ind) {
    #results list (always generate something to hold your results first)
    trainresults <- vector(mode="list", length=length(ind))
    for (i in 1:length(ind)) {
        train <- df[ind[[i]],]
        test <- df[-ind[[i]],]
        model <- lm(log_price~property_size_description+year,
                    data=train)
        preds <- predict(model, newdata=test, type="response",
                         na.action=na.exclude)
        trainresults[[i]] <- preds
    }
    names(trainresults) <- names(ind)
    trainresults
}
g <- generate_bootstrap_results(ppr_train2, ppr_bootstrap_indices)
hist(10^sapply(g, mean)) #back-transform to euro (assuming log_price is base 10)
- We need to transform back to the original scale to get useful information
- This model is pretty crap, and normally you'd want the 5th and 95th quantiles to produce confidence intervals
- Nonetheless, this is how you can bootstrap
- In general, you probably want to write some functions to automate parts of this
46 Cross Validation vs the Bootstrap
In general, I tend to use cross-validation pretty much everywhere. My reasons are the following:
- It's much easier to get started with
- It's much easier to avoid polluting your validation data
- This makes your models more likely to replicate on new(er) samples
However, my approach has the following problems:
- Much higher variance: I could get really terrible splits or really good splits
- Not as rigorous
- Not as statistically efficient
In general, I strongly recommend splitting out a final sample for analysis at the end of the project. In some cases, you may not need to do this, especially if it's log data and being generated daily. You should still do this as a best practice though.
As a complement to this, bootstrapping is an absolutely fabulous technique that should be used with abandon. Many people tend to make lots of unrealistic assumptions (normality, independence, homogeneity of variances) to get confidence intervals. That's mostly a waste of time in practice, as you can just bootstrap those intervals. In essence, we trade computation for flexibility. This is (often) a good trade. When you have to wait hours to fit a model, bootstrapping is not going to be fun or effective, but this is rarely the case (and in that case, you can just sample).
On the other hand, if you have very little data (less than 1000 observations), then the variance of small test sets can cause real problems. In situations like this the bootstrap excels.
47 Different Models
- Need to decide what model to fit
- At base, all models take a matrix
- R does this for you in many cases, but with vtreat, you need to pass a model matrix
48 Design matrix
- A matrix which can be passed directly to a modelling function
- Can be created from a formula and a data frame in R, like so:
#the raw ppr still has the messy Excel names, and columns like address have
#hundreds of thousands of levels, so build the matrix from a few cleaned columns
ppr_mm <- model.matrix(log_price ~ county + property_size_description + vat_exclusive,
                       data=ppr10)
- This one-hot encodes factors
- This allows them to be used in a regression model
- Ultimately, we are doing matrix multiplication here, so everything must be represented as a number
- This is mostly hidden from the casual user
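To see the one-hot encoding directly, look at the generated column names (illustrative):

head(colnames(ppr_mm), n=10) #each non-reference factor level becomes its own 0/1 column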
49 Recommended Reading
Footnotes:
[1] unless you work in digital marketing :(
[2] in reality you'd have done this way before now
[3] what's up with that?
[4] assuming independence, which is of course ridiculous
[5] that is, your statistical model
[6] which is untestable, unfortunately
[7] which is often worse than merely being hard
[8] this varies widely by field and industry, but it's best to plan for the worst case
[9] nobody has infinite data