Feature Engineering
1 Introduction
- Feature engineering:
- finding/creating new datasets and attributes
- improving current features
- representing features appropriately
2 Transformations
- Handle skew
- Center (subtract the variable's mean from each value)
- Scale (divide the centered values by the standard deviation)
- Results in numeric variables with mean 0 and standard deviation 1 (see the sketch below)
- Ultimately you're solving a bunch of simultaneous equations, so the scale of the variables can matter
- Other transformations can be chosen using the Box-Cox procedure
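- As a minimal sketch (assuming the ppr_train data frame used on the next slide), caret's preProcess can estimate and apply both steps:
library(caret)
library(dplyr)
#estimate centering/scaling parameters on the numeric columns only
num_cols <- select_if(ppr_train, is.numeric)
pp <- preProcess(num_cols, method = c("center", "scale"))
#apply them; each column now has mean 0 and standard deviation 1
num_scaled <- predict(pp, num_cols)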
3 Box-Cox
- Essentially produces a value, lambda, which is used as a power for the original variable
- If the value is 0, take the log
with(ppr_train, BoxCoxTrans(price))
- So, nicely enough, it supports my decision to choose a log-transform
- If we look at the counts from daily_data earlier, the suggested lambda is 0.4, which is close to a square-root transform (a sketch of applying a transform follows)
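- A minimal sketch of applying the estimated transform (again assuming ppr_train$price): calling predict() on a BoxCoxTrans object applies it to new values
library(caret)
bc <- BoxCoxTrans(ppr_train$price)
#transform the original prices using the estimated lambda
price_bc <- predict(bc, ppr_train$price)
head(price_bc)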
4 Categorical data
- Often many choices in how to represent
- How many levels should one allow?
- Do these categories have any hierarchical structure?
- Should all levels of a categorical variable be included? (yes for trees; no for linear models, which drop one reference level; see the sketch below)
- Some categories may define the unit of analysis (i.e. customer_id, order_id, time)
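- A minimal sketch of the two dummy-coding choices in base R, using a made-up factor:
#hypothetical three-level factor
f <- factor(c("apartment", "house", "house", "bungalow"))
#full-rank coding with an intercept: one level becomes the reference (linear models)
model.matrix(~ f)
#one indicator per level, no intercept (the representation trees effectively use)
model.matrix(~ f - 1)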
5 Quick model
library(dplyr)
library(caret)
#drop identifiers and the untransformed outcome
ppr_train3 <- select(ppr_train2, -postal_code, -address, -price, -date_of_sale)
#sample to speed up iteration
ppr_train_s <- sample_frac(ppr_train3, size=0.1)
ppr_test_samp <- sample_frac(ppr_train3, size=0.05)
train_ctrl <- trainControl(method="repeatedcv", repeats=5)
ppr_glmnet <- train(log_price~., data=ppr_train_s, method="glmnet", trControl=train_ctrl)
preds <- predict(ppr_glmnet, newdata=ppr_test_samp)
ppr_test_samp$glmnet_preds <- preds
6 Sampling
- The first rule of practical model-building is to sample judiciously
- This is entirely a device to speed up iteration, and can often be a necessity forced upon the project by a lack of resources
#back-transform from log10 to the original price scale
ggplot(ppr_test_samp, aes(x=10^log_price, y=10^glmnet_preds))+geom_point()
- The disparity between the two sets of data (individual vs commercial) is very, very apparent
- Do we care about the extremely large examples?
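- The is_big flag dropped on the next slide isn't created in the snippets shown; a plausible sketch (an assumption, not the original code) would flag the €1mn-plus sales:
library(dplyr)
#log(1e6, base=10) = 6, matching the filter threshold used on the next slide
ppr_train_s <- mutate(ppr_train_s, is_big = log_price > log(1e6, base=10))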
7 Modelling Objectives
- We need to be very clear with what we are trying to accomplish
- With this data, I was trying to assess where I should buy a house, and how I should value it
- For that purpose, the huge (€1mn-plus) sales add absolutely no value to me, because I can't possibly afford them
8 Recontextualising the Problem
ppr_train_s2 <- filter(ppr_train_s, log_price<=log(1e6, base=10)) %>%
  select(-is_big)
ppr_glmnet2 <- train(log_price~., data=ppr_train_s2, method="glmnet", trControl=train_ctrl)
preds2 <- predict(ppr_glmnet2, newdata=ppr_test_samp)
ppr_test_samp$glmnet_preds2 <- preds2
- Interestingly enough, the model containing the large prices does better
- I don't really have a good explanation for this
9 Feature Engineering
- The first step is to examine what data we have
- Are there any sources of data which could be used better?
- The obvious idea here is to geocode the addresses
- This will give us exact location, and allow us to examine more CSO data
10 Python Geocoding Script
- This is the sad truth about data science:
- you'll be using multiple tools in order to get your work done
- In this case, I actually didn't write the script myself
- I got the script from Shane Lynn, who had geocoded the data up to 2016
- I used all my free credits from Google Cloud Platform to get this updated to 2018
11 Extra Variables from geocoding
ppr_gc <- read_csv("~/Dropbox/PPR/ppr_geocoded_till_oct2018.csv")
#github doesn't like large files, so split before writing
ppr_gc1 <- filter(ppr_gc, year<=2016)
ppr_gc2 <- filter(ppr_gc, year>2016)
write_csv(ppr_gc1, path="ppr_gecoded1.csv")
write_csv(ppr_gc2, path="ppr_gecoded2.csv")
names(ppr_gc)[12:24]
variable |
---|
date |
formatted_address |
accuracy |
latitude |
longitude |
postcode |
type |
geo_county |
electoral_district |
electoral_district_id |
region |
small_area |
month |
- To be fair, this isn't everything, but it's a lot more than I had earlier
- This is the key point of feature engineering:
- You are much more likely to be successful by adding new datasets than by deriving new features from existing ones
- This is worth remembering in future research
12 Geocoded Data
- Note that we have the electoral districts, and the small areas encoded into the file
- This provides us with a link into the CSO census data
- This data is all stored in shapefiles, which also encode the area and size of each unit (electoral district/small area/constituency)
- What do we get from this?
13 Shape Files!
library(sp)
library(rgdal)
shp <- readOGR("~/Dropbox/PPR/electoral_divisions_gps.shp")
#the attribute data.frame is stored in the @data slot
names(shp@data[,13:length(shp@data)])
variable |
---|
MALE2011 |
FEMALE2011 |
TOTAL2011 |
PPOCC2011 |
UNOCC2011 |
VACANT2011 |
HS2011 |
PCVAC2011 |
LAND_AREA |
TOTAL_AREA |
CREATEDATE |
- So this is really, really useful
- Note that HS2011 is the number of households
- This provides a denominator for the sales of houses
- We also have vacancies and land area, which we can use to calculate density and other attributes (sketched below)
- Note that despite the names, these are 2016 figures
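- A minimal sketch of deriving such attributes from the shapefile's attribute table, assuming the columns listed above are numeric (LAND_AREA is in whatever units the shapefile uses):
library(dplyr)
ed_attrs <- shp@data %>%
  mutate(pop_density = TOTAL2011 / LAND_AREA,       #people per unit of land area
         household_density = HS2011 / LAND_AREA)    #households per unit of land area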
14 Pobal Data
15 Pobal Data Details
- Note that there is a data dictionary
- These are always worth looking for; if you can't find one, it's well worth documenting what you find in a dataset yourself, especially if you'll ever need it again
- You'll need to join your datasets together, obviously (a join sketch follows the variable list below)
- When that's done, you'll gain the following variables:
pobal <- read_csv("~/Dropbox/PPR/pobal_deprivation_index_ed.csv")
names(pobal)[25:45]
variable |
---|
share |
POPCHG06 |
POPCHG11 |
POPCHG16 |
AGEDEP06 |
AGEDEP11 |
AGEDEP16 |
LONEPA06 |
LONEPA11 |
LONEPA16 |
EDLOW_06 |
EDLOW_11 |
EDLOW_16 |
EDHIGH06 |
EDHIGH11 |
EDHIGH16 |
HLPROF06 |
HLPROF11 |
HLPROF16 |
LSKILL06 |
LSKILL11 |
- Note that the HP variables are indices of absolute and relative deprivation, and I believe they are the result of PCA
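- A minimal sketch of the join; the key mapping shown is a guess (use whichever electoral-district identifier the two files actually share):
library(dplyr)
#hypothetical key mapping between the geocoded sales and the Pobal file
ppr_pobal <- left_join(ppr_gc, pobal, by = c("electoral_district_id" = "ID06"))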
16 Reassessing the Situation
- So, sometimes the best feature engineering you can do is find other data :)
- However, we now have a dataset on which we can run interesting variable treatments (i.e. we have lots of continuous variables)
17 Transformations
- So the best thing to do is to examine the data again, as we did before
sapply(ppr_pobal, n_distinct) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  arrange(desc(`.`))
variable | n_distinct |
---|---|
geometry | 298501 |
rowid | 298501 |
input_string | 263158 |
longitude | 180897 |
latitude | 177022 |
postcode | 113999 |
price | 18119 |
small_area | 18019 |
electoral_district_id | 3352 |
ID06 | 3322 |
electoral_district | 3066 |
ED_Name | 3035 |
sale_date | 2337 |
TOTPOP16 | 1751 |
popdep_T | 1751 |
TOTPOP11 | 1739 |
TOTPOP06 | 1704 |
TOTPOP02 | 1673 |
popdep_L | 1304 |
popdep_M | 902 |
OHOUSE16 | 596 |
OHOUSE11 | 585 |
EDHIGH16 | 547 |
HLPROF16 | 529 |
popdep_H | 523 |
HLPROF11 | 518 |
EDHIGH11 | 505 |
OHOUSE06 | 500 |
HLPROF06 | 497 |
EDHIGH06 | 493 |
PRRENT16 | 460 |
LONEPA11 | 458 |
PRRENT11 | 456 |
LONEPA06 | 453 |
LONEPA16 | 437 |
EDLOW_06 | 430 |
UNEMPM11 | 419 |
EDLOW_11 | 391 |
PRRENT06 | 383 |
LSKILL06 | 379 |
HP2016abs | 360 |
HP2011rel | 360 |
LSKILL11 | 355 |
HP2016rel | 352 |
HP2006abs | 351 |
HP2006rel | 351 |
LSKILL16 | 351 |
HP2011abs | 350 |
EDLOW_16 | 350 |
UNEMPM16 | 347 |
UNEMPF11 | 321 |
type | 320 |
LARENT16 | 307 |
UNEMPF16 | 300 |
AGEDEP06 | 295 |
AGEDEP11 | 294 |
AGEDEP16 | 294 |
LARENT11 | 286 |
LARENT06 | 284 |
UNEMPM06 | 271 |
UNEMPF06 | 232 |
share_H | 60 |
POPCHG11 | 60 |
share_T | 46 |
share | 45 |
share_M | 44 |
share_L | 42 |
POPCHG06 | 39 |
NUTS4 | 35 |
geo_county | 34 |
ppr_county | 26 |
PEROOM11 | 19 |
PEROOM16 | 17 |
POPCHG16 | 15 |
month | 12 |
NUTS3 | 9 |
region | 8 |
year | 7 |
PEROOM06 | 7 |
description_of_property | 6 |
NUTS2 | 3 |
NUTS1 | 2 |
- We can see that the top few variables have far too many levels, and that the bottom few will be problematic as they have only 1-5 levels [1]
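- A minimal sketch of flagging the problematic low-variance predictors with caret's nearZeroVar (the geometry column is dropped first, since nearZeroVar only wants ordinary predictor columns):
library(caret)
library(dplyr)
nzv <- as.data.frame(ppr_pobal) %>%
  select(-geometry) %>%
  nearZeroVar(saveMetrics=TRUE)
#predictors dominated by a single value are flagged in the nzv column
head(nzv[order(-nzv$nzv), ])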
18 Missing Data
sapply(ppr_pobal, function(x) sum(is.na(x))/length(x)) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  dplyr::arrange(desc(`.`)) %>%
  mutate_if(is.numeric, round, 4) %>%
  head(n=5)
 | variable | prop_missing |
---|---|---|
1 | postcode | 0.5491 |
2 | ID06 | 0.0012 |
3 | ED_Name | 0.0012 |
4 | NUTS4 | 0.0012 |
5 | NUTS3 | 0.0012 |
- As we can see, it's mostly postcode that's missing
- Here, it's probably a better idea to just drop it
19 BoxCox parameters
num_vars <- as.data.frame(ppr_pobal) %>%
  na.omit() %>%
  select_if(is.numeric)
bc2 <- lapply(num_vars, BoxCoxTrans, na.rm=TRUE)
lambdas <- sapply(bc2, function (x) x[["lambda"]])
head(lambdas)
variable | lambda |
---|---|
year | 2 |
price | 0.1 |
latitude | 2 |
longitude | nil |
month | 0.8 |
ID06 | 0.1 |
- So we have a bunch of potential transformations to apply
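- A minimal sketch of applying the Box-Cox transforms in bulk via caret's preProcess (it should skip columns, such as longitude, where no lambda can be estimated):
library(caret)
pp_bc <- preProcess(num_vars, method = "BoxCox")
#apply the estimated transforms to the numeric variables
num_vars_bc <- predict(pp_bc, num_vars)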
20 PCA
- Principal Components Analysis (PCA) is a data reduction method
- It essentially examines the covariance between the variables, and returns linear combinations of them (the components) ordered by how much of the variance they capture
- The first component explains the largest share of the variance
- And each additional component is uncorrelated with all of the previous ones
- This makes it much easier to model the variables
- However, it can make them much harder to interpret
21 Practical PCA
- Keeping all the components just gives us uncorrelated predictors (which makes the fitted model easier to interpret, even if the individual components are not)
- However, we normally want to describe as much variance as possible with as few variables as possible
- This necessitates choosing K components to keep
- How do we choose K?
22 This is a very very hard problem
- Because it's unsupervised
- If we had an outcome metric, we could cross-validate
- This is much harder without a response variable
- If we do have an outcome variable, then we would need to repeat the selection of K for each fold (to avoid bias)
- There are lots of methods, many of which don't agree
23 One method to rule them all
pobal_vars <- dplyr::select(num_vars, 26:64) %>%
  lapply(function (x) scale(x, center=TRUE, scale=TRUE))
pca <- princomp(~., data=pobal_vars)
24 Screeplot
screeplot(pca, type="lines")
25 How to use the scree plot
- Keep the components until the line levels off
- That's about five here (a cumulative-variance check is sketched below)
- You should also look at the variable loadings, described below
- If you want approximately 100 pages of this, consult my thesis
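- As a complement to the scree plot, a short sketch of the cumulative proportion of variance explained by the leading components:
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
#cumulative variance explained by the first ten components
round(cumsum(var_explained)[1:10], 3)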
26 Loadings
loadings <- pca$loadings
loadings[,1:6] %>%
  round(3) %>%
  head()
Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | |
---|---|---|---|---|---|---|
POPCHG06 | 0.004 | 0.012 | 0.338 | 0.37 | 0.245 | 0.092 |
POPCHG11 | -0.006 | -0.02 | -0.336 | -0.442 | -0.23 | -0.252 |
POPCHG16 | -0.007 | -0.114 | -0.229 | -0.138 | -0.068 | -0.222 |
AGEDEP06 | 0.024 | 0.266 | 0.14 | -0.181 | -0.016 | 0.293 |
AGEDEP11 | 0.004 | 0.289 | 0.053 | -0.215 | -0.076 | 0.287 |
AGEDEP16 | -0.012 | 0.288 | 0.018 | -0.188 | -0.094 | 0.255 |
- Examining the loadings allows you to name the components, which makes them more comfortable to model with
27 Y-aware Variable treatment
- Recently, a wonderful package for R has become available
- vtreat is a package for y-aware variable selection
- It does a whole host of variable selection and preparation techniques, and can significantly improve outcomes of modelling processes
- However, it is y-aware, which means it must be placed inside the cross-validation process
- This can add significant compute requirements to your project
- Even if you don't use it, the paper is a great source of references and ideas for variable treatment
28 Impact Coding
- Terminology developed by the authors of the vtreat package (explanation available here)
- Essentially the per-level averages are smoothed towards the grand average (a toy illustration is sketched below)
- In many ways, this is what a Bayesian hierarchical model will do, but this approach is much faster
- You must use a distinct set of data to prepare this coding, otherwise you will overfit
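- A toy illustration of smoothing a per-level mean towards the grand mean (this is not vtreat's exact formula, and the pseudo-count k is a made-up tuning value):
library(dplyr)
k <- 20  #hypothetical pseudo-count controlling the strength of smoothing
ppr_df <- as.data.frame(ppr_pobal)
grand_mean <- mean(log(ppr_df$price))
ed_impact <- ppr_df %>%
  group_by(electoral_district_id) %>%
  summarise(n = n(), level_mean = mean(log(price))) %>%
  mutate(smoothed = (n * level_mean + k * grand_mean) / (n + k),
         impact = smoothed - grand_mean)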
29 Informative Missingness
- Sometimes, the fact that a variable is missing is itself informative
- A good example is income:
- higher income people are less likely to report their income
- this introduces a bias based on whether or not this variable is missing
- We can create an indicator such as variable_is_missing, which allows us to include this information in a regression model (see the sketch below)
- Note that trees and other methods do not have as much trouble with these variables, so vtreat may be less useful there
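- A minimal sketch with a hypothetical income column (the column name and the median imputation are illustrative only):
library(dplyr)
df <- df %>%
  mutate(income_is_missing = as.integer(is.na(income)),   #informative-missingness flag
         income = if_else(is.na(income),
                          median(income, na.rm=TRUE),      #simple imputation so the row stays usable
                          income))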
30 Example: Electoral Districts
- There are 3k electoral districts in the data
ed_count <- ppr_pobal %>%
  as.data.frame() %>%
  group_by(electoral_district_id) %>%
  summarise(count=n(), median=median(price))
ggplot(ed_count, aes(x=count))+geom_density()
31 Modelling without Vtreat
- Most of these have very few observations
#otherwise the below errors out
ppr_pobal_s <- sample_frac(ppr_pobal, size=0.1)
##takes a loooooonnnnng time
lm_ed1 <- lm(log(price)~electoral_district_id, data=ppr_pobal_s)
- We get an adjusted R^2 of 0.49
- This is pretty good, but we need to estimate a lot of parameters
- Can we do better with vtreat?
32 Vtreat Approach
require(vtreat)
#not a perfectly independent sample
##don't do it like this
ppr_train_s2 <- sample_frac(ppr_pobal, size=0.05)
treat <- designTreatmentsN(ppr_train_s2,
                           varlist=c("electoral_district_id"),
                           outcomename = "price")
train_prep <- prepare(treat, dframe=ppr_pobal_s)
33 Vtreat In practice
library(glmnet)
ppr_pobal_test <- sample_frac(ppr_pobal, size=0.1)
test_vtreat <- prepare(treat, ppr_pobal_test)
#glmnet_ed1 is a glmnet model fitted on the prepared training data (the fit itself isn't shown on this slide)
glmnet_preds <- predict(glmnet_ed1, newx=as.matrix(test_vtreat), type="response")
predsvobs <- data.frame(preds=exp(glmnet_preds), obs=ppr_pobal_test$price)
ggplot(predsvobs, aes(x=obs, y=preds.s0))+geom_point()
ggplot(predsvobs, aes(x=obs, y=preds.s60))+geom_point()
with(predsvobs, cor(preds.s60, obs))
- We get a correlation around the same as the linear model achieves
- But the joy of vtreat is that we can easily apply the same approach to multiple variables, and it will handle missing values and encodings for us automatically
34 Vtreat on multiple variables
treat_full <- designTreatmentsN(ppr_train_s2,
                                varlist= setdiff(names(ppr_train_s2), "price"),
                                outcomename = "price")
ppr_pobal_treat <- prepare(treat_full, dframe=ppr_pobal_s)
ppr_glmnet_full <- cv.glmnet(x=as.matrix(ppr_pobal_treat), y=log(ppr_pobal_s$price))
test_vtreat_full <- prepare(treat_full, ppr_pobal_test)
35 Vtreat Results
- When we use all of the variables, we get pretty good results
- The predictions are pretty accurate for most of the data
36 Predictions vs Observed
37 Is this model accurate?
- The most likely residual is approx 1000, which is very very good
- In fact, it's probably too good
- Examine the variables and residuals found from this model
- Discuss potential issues which may be flattering this model
38 Conclusions
- Sometimes, the best way to engineer features is to get more data
- Scaling and centering are good default transformations for numeric variables
- If you have lots of correlated variables, PCA can be a good choice
- If you want to transform effectively, then use Box-Cox transformations
- vtreat can automate lots of common transformations, and if you can spare the data, you should definitely use it
Footnotes:
[1] Particularly an issue when using cross-validation.