# Feature Engineering

## 1 Introduction

- Feature engineering:
  - finding/creating new datasets and attributes
  - improving current features
  - representing features appropriately

## 2 Transformations

- Handle skew
- Center (subtract the mean from each value)
- Scale (divide each centered value by the standard deviation), as sketched below
- This results in numeric variables with mean 0 and standard deviation 1
- Ultimately you're solving a set of simultaneous equations, so the scale of the variables can matter
- Other transformations can be chosen using the Box-Cox procedure
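Centering and scaling are a one-liner in practice. A minimal sketch using caret's `preProcess` (here `num_df` is a stand-in for any all-numeric data frame):

```
library(caret)

# center and scale every numeric column (mean 0, sd 1)
pp <- preProcess(num_df, method = c("center", "scale"))
num_scaled <- predict(pp, num_df)

# base R equivalent
num_scaled_base <- as.data.frame(scale(num_df, center = TRUE, scale = TRUE))
```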

## 3 Box-Cox

- Essentially produces a value, lambda, which is used as a power for the original variable (see the formula below)
- If lambda is 0, take the log instead
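For reference, the Box-Cox transform of a positive variable $y$ with parameter $\lambda$ is:

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\
\log y & \lambda = 0
\end{cases}
$$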

```
with(ppr_train, BoxCoxTrans(price))
```

- So, nicely enough, it supports my decision to choose a log-transform
- If we look at the counts from daily_data earlier, it suggests a lambda of 0.4, which is close to a square-root transform

## 4 Categorical data

- Often many choices in how to represent
- How many levels should one allow?
- Do these categories have any hierarchical structure?
- Should all levels of a categorical variable be included? (yes for trees, no for linear models; see the encoding sketch below)
- Some categories may define the unit of analysis (e.g. customer_id, order_id, time)
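A minimal sketch of both representations using caret's `dummyVars` (the `county` column and its values are made up for illustration):

```
library(caret)

df <- data.frame(county = factor(c("Dublin", "Cork", "Galway", "Dublin")))

# one indicator column per level (full set) -- fine for tree-based models
full <- dummyVars(~ county, data = df, fullRank = FALSE)
predict(full, df)

# drop a reference level to avoid collinearity -- what a linear model needs
reduced <- dummyVars(~ county, data = df, fullRank = TRUE)
predict(reduced, df)
```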

## 5 Quick model

```
ppr_train3 <- select(ppr_train2, -postal_code, -address, -price, -date_of_sale)
ppr_train_s <- sample_frac(ppr_train3, size=0.1)
ppr_test_samp <- sample_frac(ppr_train3, size=0.05)

train_ctrl <- trainControl(method="repeatedcv", repeats=5)
ppr_glmnet <- train(log_price~., data=ppr_train_s, method="glmnet", trControl=train_ctrl)

preds <- predict(ppr_glmnet, newdata=ppr_test_samp)
ppr_test_samp$glmnet_preds <- preds
```

## 6 Sampling

- The first rule of practical model-building is to sample judiciously
- This is entirely a device to speed up iteration, and can often be a necessity forced upon the project by a lack of resources

```
# back-transform the base-10 logs to the original price scale
ggplot(ppr_test_samp, aes(x=10^log_price, y=10^glmnet_preds)) + geom_point()
```

- The disparity between the two sets of data (individual vs commercial) is very, very apparent
- Do we care about the extremely large examples?

## 7 Modelling Objectives

- We need to be very clear with what we are trying to accomplish
- With this data, I was trying to assess where I should buy a house, and how I should value it
- For that purpose, the very large (€1m+) sales add absolutely no value to me, because I can't possibly afford them

## 8 Recontextualising the Problem

```
ppr_train_s2 <- filter(ppr_train_s, log_price <= log(1e6, base=10)) %>% select(-is_big)
ppr_glmnet2 <- train(log_price~., data=ppr_train_s2, method="glmnet", trControl=train_ctrl)

preds2 <- predict(ppr_glmnet2, newdata=ppr_test_samp)
ppr_test_samp$glmnet_preds2 <- preds2
```

- Interestingly enough, the model containing the large prices does better
- I don't really have a good explanation for this

## 9 Feature Engineering

- The first step is to examine what data we have
- Are there any sources of data which could be used better?
- The obvious idea here is to geocode the addresses
- This will give us exact location, and allow us to examine more CSO data

## 10 Python Geocoding Script

- This is the sad truth about data science:
  - you'll be using multiple tools in order to get your work done

- In this case, I actually didn't write the script myself
- I got the script from Shane Lynn, who did this up to 2016
- I used all my free credits from Google Cloud Platform to get this updated to 2018

## 11 Extra Variables from geocoding

```
ppr_gc <- read_csv("~/Dropbox/PPR/ppr_geocoded_till_oct2018.csv")

# github doesn't like large files, so split before writing out
ppr_gc1 <- filter(ppr_gc, year <= 2016)
ppr_gc2 <- filter(ppr_gc, year > 2016)
write_csv(ppr_gc1, path="ppr_gecoded1.csv")
write_csv(ppr_gc2, path="ppr_gecoded2.csv")

names(ppr_gc)[12:24]
```

| x |
|---|
| date |
| formatted_address |
| accuracy |
| latitude |
| longitude |
| postcode |
| type |
| geo_county |
| electoral_district |
| electoral_district_id |
| region |
| small_area |
| month |

- To be fair, this isn't everything, but it's a lot more than I had earlier
- This is the key point of feature engineering:
- You are much more likely to be successful by adding new datasets rather than new features
- This is worth remembering in future research

## 12 Geocoded Data

- Note that we have the electoral districts, and the small areas encoded into the file
- This provides us with a link into the CSO census data
- This data is all stored in shapefiles, which also encode the area and size of each unit (electoral district/small area/constituency)
- What do we get from this?

## 13 Shape Files!

```
library(sp)
library(rgdal)

shp <- readOGR("~/Dropbox/PPR/electoral_divisions_gps.shp")
# the attribute data frame is stored in the @data slot
names(shp@data[, 13:length(shp@data)])
```

| x |
|---|
| MALE2011 |
| FEMALE2011 |
| TOTAL2011 |
| PPOCC2011 |
| UNOCC2011 |
| VACANT2011 |
| HS2011 |
| PCVAC2011 |
| LAND_AREA |
| TOTAL_AREA |
| CREATEDATE |

- So this is really, really useful
- Note that HS2011 is the number of households
- This provides a denominator for the sales of houses
- We also have vacancies and land area, which we can use to calculate density and other attributes
- Note that despite the names, these are 2016 figures

## 14 Pobal Data

## 15 Pobal Data Details

- Note that there is a data dictionary
- These are always worth looking for; if you can't find one, it's well worth documenting what you find in a dataset yourself, especially if you'll ever need it again
- You'll need to join your datasets together, obviously
- When that's done, you'll gain the following variables:

```
pobal <- read_csv("~/Dropbox/PPR/pobal_deprivation_index_ed.csv")
names(pobal)[25:45]
```

| x |
|---|
| share |
| POPCHG06 |
| POPCHG11 |
| POPCHG16 |
| AGEDEP06 |
| AGEDEP11 |
| AGEDEP16 |
| LONEPA06 |
| LONEPA11 |
| LONEPA16 |
| EDLOW_06 |
| EDLOW_11 |
| EDLOW_16 |
| EDHIGH06 |
| EDHIGH11 |
| EDHIGH16 |
| HLPROF06 |
| HLPROF11 |
| HLPROF16 |
| LSKILL06 |
| LSKILL11 |
- Note that the HP variables are indexes of absolute and relative deprivation, and I believe that they are the result of PCA

## 16 Reassessing the Situation

- So, sometimes the best feature engineering you can do is find other data :)
- However, we now have a dataset where we can run interesting variable treatments (i.e. we have lots of continuous variables)

## 17 Transformations

- So the best thing to do is to examine the data again, as we did before

```
sapply(ppr_pobal, n_distinct) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  arrange(desc(`.`))
```

| rowname | n_distinct |
|---|---|
| geometry | 298501 |
| rowid | 298501 |
| input_string | 263158 |
| longitude | 180897 |
| latitude | 177022 |
| postcode | 113999 |
| price | 18119 |
| small_area | 18019 |
| electoral_district_id | 3352 |
| ID06 | 3322 |
| electoral_district | 3066 |
| ED_Name | 3035 |
| sale_date | 2337 |
| TOTPOP16 | 1751 |
| popdep_T | 1751 |
| TOTPOP11 | 1739 |
| TOTPOP06 | 1704 |
| TOTPOP02 | 1673 |
| popdep_L | 1304 |
| popdep_M | 902 |
| OHOUSE16 | 596 |
| OHOUSE11 | 585 |
| EDHIGH16 | 547 |
| HLPROF16 | 529 |
| popdep_H | 523 |
| HLPROF11 | 518 |
| EDHIGH11 | 505 |
| OHOUSE06 | 500 |
| HLPROF06 | 497 |
| EDHIGH06 | 493 |
| PRRENT16 | 460 |
| LONEPA11 | 458 |
| PRRENT11 | 456 |
| LONEPA06 | 453 |
| LONEPA16 | 437 |
| EDLOW_06 | 430 |
| UNEMPM11 | 419 |
| EDLOW_11 | 391 |
| PRRENT06 | 383 |
| LSKILL06 | 379 |
| HP2016abs | 360 |
| HP2011rel | 360 |
| LSKILL11 | 355 |
| HP2016rel | 352 |
| HP2006abs | 351 |
| HP2006rel | 351 |
| LSKILL16 | 351 |
| HP2011abs | 350 |
| EDLOW_16 | 350 |
| UNEMPM16 | 347 |
| UNEMPF11 | 321 |
| type | 320 |
| LARENT16 | 307 |
| UNEMPF16 | 300 |
| AGEDEP06 | 295 |
| AGEDEP11 | 294 |
| AGEDEP16 | 294 |
| LARENT11 | 286 |
| LARENT06 | 284 |
| UNEMPM06 | 271 |
| UNEMPF06 | 232 |
| share_H | 60 |
| POPCHG11 | 60 |
| share_T | 46 |
| share | 45 |
| share_M | 44 |
| share_L | 42 |
| POPCHG06 | 39 |
| NUTS4 | 35 |
| geo_county | 34 |
| ppr_county | 26 |
| PEROOM11 | 19 |
| PEROOM16 | 17 |
| POPCHG16 | 15 |
| month | 12 |
| NUTS3 | 9 |
| region | 8 |
| year | 7 |
| PEROOM06 | 7 |
| description_of_property | 6 |
| NUTS2 | 3 |
| NUTS1 | 2 |

- We can see that the top few variables have far too many levels, and that the bottom few will be problematic because they have very few levels^{1}

## 18 Missing Data

```
sapply(ppr_pobal, function(x) sum(is.na(x)) / length(x)) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  dplyr::arrange(desc(`.`)) %>%
  mutate_if(is.numeric, round, 4) %>%
  head(n=5)
```

| rowname | proportion missing |
|---|---|
| postcode | 0.5491 |
| ID06 | 0.0012 |
| ED_Name | 0.0012 |
| NUTS4 | 0.0012 |
| NUTS3 | 0.0012 |

- As we can see, it's mostly postcode that's missing
- Here, it's probably a better idea to just drop it
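One way to do that with dplyr (a one-line sketch, reusing the `ppr_pobal` frame from above):

```
ppr_pobal <- dplyr::select(ppr_pobal, -postcode)
```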

## 19 BoxCox parameters

```
num_vars <- as.data.frame(ppr_pobal) %>%
  na.omit() %>%
  select_if(is.numeric)

bc2 <- lapply(num_vars, BoxCoxTrans, na.rm=TRUE)
lambdas <- sapply(bc2, function(x) x[["lambda"]])
head(lambdas)
```

| variable | lambda |
|---|---|
| year | 2 |
| price | 0.1 |
| latitude | 2 |
| longitude | nil |
| month | 0.8 |
| ID06 | 0.1 |

- So we have a bunch of potential transformations to apply
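To apply them, each fitted `BoxCoxTrans` object transforms its own column via `predict()`. A sketch reusing `bc2` and `num_vars` from the chunk above:

```
# transform each numeric column with its own fitted Box-Cox parameters
num_vars_bc <- as.data.frame(Map(predict, bc2, num_vars))
```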

## 20 PCA

- Principal Components Analysis (PCA) is a data reduction method
- It essentially examines the covariance between the variables, and returns linear combinations of them (the components), ordered by how much of the variance each one explains
- The first component explains the most variance
- And each extra component is uncorrelated with the previous
- This makes it much easier to model the variables
- However, it can make them much harder to interpret

## 21 Practical PCA

- Keeping all components just gives us uncorrelated predictors (which makes each model easier to interpret)
- However, we normally want to describe as much variance as possible with as few variables as possible
- This necessitates choosing K components to keep
- How do we choose K?
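One simple heuristic is a cumulative variance-explained threshold. A minimal sketch with `prcomp` (`X` stands in for a scaled numeric matrix):

```
pca_fit <- prcomp(X, center = TRUE, scale. = TRUE)

# proportion of variance explained by each component
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)

# smallest K that explains at least 90% of the total variance
K <- which(cumsum(var_explained) >= 0.90)[1]
```

As the next slide notes, this is only one of many options, and they often disagree.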

## 22 This is a very very hard problem

- Because it's unsupervised
- If we had an outcome metric, we could cross-validate
- This is much harder without a response variable
- If we do have an outcome variable, then we would need to repeat the selection of K for each fold (to avoid bias)
- There are lots of methods, many of which don't agree

## 23 One method to rule them all

```
# scale the Pobal variables before running PCA
pobal_vars <- dplyr::select(num_vars, 26:64) %>%
  lapply(function(x) scale(x, center=TRUE, scale=TRUE)) %>%
  as.data.frame()

pca <- princomp(~., data=pobal_vars)
```

## 24 Screeplot

```
screeplot(pca, type="lines")
```

## 25 How to use the scree plot

- Keep the components until the line levels off
- That's about five here
- You should also look at the variable loadings, described below
- If you want approximately 100 pages of this, consult my thesis

## 26 Loadings

```
loadings <- pca$loadings
loadings[, 1:6] %>% round(3) %>% head()
```

| | Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 |
|---|---|---|---|---|---|---|
| POPCHG06 | 0.004 | 0.012 | 0.338 | 0.37 | 0.245 | 0.092 |
| POPCHG11 | -0.006 | -0.02 | -0.336 | -0.442 | -0.23 | -0.252 |
| POPCHG16 | -0.007 | -0.114 | -0.229 | -0.138 | -0.068 | -0.222 |
| AGEDEP06 | 0.024 | 0.266 | 0.14 | -0.181 | -0.016 | 0.293 |
| AGEDEP11 | 0.004 | 0.289 | 0.053 | -0.215 | -0.076 | 0.287 |
| AGEDEP16 | -0.012 | 0.288 | 0.018 | -0.188 | -0.094 | 0.255 |

- This will allow you to name the components, which makes them easier to interpret and more comfortable to model with

## 27 Y-aware Variable treatment

- Recently, a wonderful package for R has become available
- **vtreat** is a package for y-aware variable selection
- It does a whole host of variable selection and preparation techniques, and can significantly improve the outcomes of modelling processes
- However, it is y-aware, which means that it **must** be placed inside the cross-validation process
- This can add significant compute requirements to your project
- Even if you don't use it, the paper is a great source of references and ideas for variable treatment

## 28 Impact Coding

- Terminology developed by the authors of the vtreat package (explanation available here)
- Essentially, each level's estimate is smoothed towards the grand mean, as sketched below
- In many ways, this is what a Bayesian hierarchical model would do, but this approach is much faster
- You **must** use a distinct set of data to prepare this coding, otherwise you **will** overfit
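A hand-rolled sketch of the idea (illustrative only, not vtreat's exact formula): each level's mean outcome is shrunk towards the grand mean, with sparse levels shrunk hardest. The function name `impact_code` and the constant `k` are made up here.

```
library(dplyr)

# k controls how strongly small levels are pulled towards the grand mean
impact_code <- function(df, level_col, y_col, k = 20) {
  grand_mean <- mean(df[[y_col]])
  df %>%
    group_by(across(all_of(level_col))) %>%
    summarise(n = n(), level_mean = mean(.data[[y_col]]), .groups = "drop") %>%
    mutate(impact = (n * level_mean + k * grand_mean) / (n + k) - grand_mean)
}

# e.g. impact_code(ppr_pobal_s, "electoral_district_id", "price")
```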

## 29 Informative Missingness

- Sometimes, the fact that a variable is missing is itself informative
- A good example is income:
  - higher-income people are less likely to report their income
  - this introduces a bias based on whether or not this variable is missing

- We can create an indicator variable (e.g. `variable_is_missing`), as sketched below, which allows us to include this information in a regression model
- Note that trees and other methods do not have as much trouble with these variables, so **vtreat** may be less useful there
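A minimal sketch for the income example (`df` and `income` are hypothetical names):

```
library(dplyr)

df <- df %>%
  mutate(
    income_is_missing = as.integer(is.na(income)),                        # the indicator
    income = if_else(is.na(income), median(income, na.rm = TRUE), income) # simple fill
  )
```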

## 30 Example: Electoral Districts

- There are 3k electoral districts in the data

```
ed_count <- ppr_pobal %>%
  as.data.frame() %>%
  group_by(electoral_district_id) %>%
  summarise(count=n(), median=median(price))

ggplot(ed_count, aes(x=count)) + geom_density()
```

## 31 Modelling without Vtreat

- Most of these have very few observations

```
# otherwise the below errors out
ppr_pobal_s <- sample_frac(ppr_pobal, size=0.1)

# takes a loooooonnnnng time
lm_ed1 <- lm(log(price) ~ electoral_district_id, data=ppr_pobal_s)
```
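The adjusted R² quoted below can be read straight off the fit:

```
summary(lm_ed1)$adj.r.squared
```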

- We get an R^2 of .49 (adjusted for the number of parameters)
- This is pretty good, but we need to estimate a lot of parameters
- Can we do better with **vtreat**?

## 32 Vtreat Approach

```
require(vtreat)

# not a perfectly independent sample -- don't do it like this
ppr_train_s2 <- sample_frac(ppr_pobal, size=0.05)

treat <- designTreatmentsN(ppr_train_s2,
                           varlist=c("electoral_district_id"),
                           outcomename="price")
train_prep <- prepare(treat, dframe=ppr_pobal_s)
```
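The `glmnet_ed1` model used in the next chunk isn't defined in the code shown. A plausible sketch, assuming a plain glmnet path fit on the prepared training frame (the `s0`/`s60` prediction columns below are what such a path fit produces):

```
library(glmnet)

# fit a lasso/elastic-net path on the impact-coded predictors
# (if prepare() carried the outcome column through, drop it from this matrix
#  and from the test matrix below before fitting)
glmnet_ed1 <- glmnet(x = as.matrix(train_prep),
                     y = log(ppr_pobal_s$price))
```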

## 33 Vtreat In practice

```
ppr_pobal_test <- sample_frac(ppr_pobal, size=0.1)
test_vtreat <- prepare(treat, ppr_pobal_test)

glmnet_preds <- predict(glmnet_ed1, newx=as.matrix(test_vtreat), type="response")

predsvobs <- data.frame(preds=exp(glmnet_preds), obs=ppr_pobal_test$price)
ggplot(predsvobs, aes(x=obs, y=preds.s0)) + geom_point()
ggplot(predsvobs, aes(x=obs, y=preds.s60)) + geom_point()

with(predsvobs, cor(preds.s60, obs))
```

- We get roughly the same correlation as the lm approach above
- But the joy of vtreat is that we can easily apply the same approach to multiple variables, and it will handle missings and encodings for us automatically

## 34 Vtreat on multiple variables

```
tf <- designTreatmentsN(ppr_train_s2,
                        varlist=setdiff(names(ppr_train_s2), "price"),
                        outcomename="price")
ppr_p_t <- prepare(tf, dframe=ppr_pobal_s)

ppr_glmnet_full <- cv.glmnet(x=as.matrix(ppr_p_t), y=log(ppr_pobal_s$price))
test_vtreat_full <- prepare(tf, ppr_pobal_test)
```

## 35 Vtreat Results

- When we use all of the variables, we get pretty good results

- The predictions are pretty accurate, for most of the data

## 36 Predictions vs Observed

## 37 Is this model accurate?

- The most likely residual is approx 1000, which is very very good
- In fact, it's probably too good
- Examine the variables and residuals found from this model
- Discuss potential issues which may be flattering this model
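A sketch of where to start that examination, reusing the objects from the vtreat chunks above (`ppr_glmnet_full`, `test_vtreat_full`, `ppr_pobal_test`):

```
library(ggplot2)

# predictions and residuals on the log scale
preds_full <- as.numeric(predict(ppr_glmnet_full,
                                 newx = as.matrix(test_vtreat_full),
                                 s = "lambda.min"))
resid_df <- data.frame(fitted = preds_full,
                       resid  = log(ppr_pobal_test$price) - preds_full)

# distribution of residuals, and residuals against fitted values
ggplot(resid_df, aes(x = resid)) + geom_density()
ggplot(resid_df, aes(x = fitted, y = resid)) + geom_point(alpha = 0.2)
```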

## 38 Conclusions

- Sometimes, the best way to engineer features is to get more data
- Scaling and centering are good default transformations for numeric variables
- If you have lots of correlated variables, PCA can be a good choice
- If you want to transform effectively, then use Box-Cox transformations
- **vtreat** can automate lots of common transformations, and if you can spare the data, you should definitely use it

## Footnotes:

^{1} Particularly an issue when using cross-validation.