Feature Engineering
1 Introduction
- Feature engineering:
- finding/creating new datasets and attributes
- improving current features
- representing features appropriately
2 Transformations
- Handle skew
- Center (subtract the variable's mean from each value)
- Scale (divide the centered values by the standard deviation)
- Results in numeric variables with mean 0 and standard deviation 1 (see the sketch below)
- Ultimately you're solving a bunch of simultaneous equations, so the scale of the variables can matter
- Other transformations can be chosen using the Box-Cox procedure
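- As a minimal sketch (assuming the ppr_train data frame used on the next slide), caret's preProcess can estimate and apply both steps:
library(caret)
library(dplyr)
#estimate centering/scaling parameters on the numeric columns only
num_cols <- select_if(ppr_train, is.numeric)
pp <- preProcess(num_cols, method = c("center", "scale"))
#apply them; each column now has mean 0 and standard deviation 1
num_scaled <- predict(pp, num_cols)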
3 Box-Cox
- Essentially produces a value, lambda, which is used as a power for the original variable
- If the value is 0, take the log
with(ppr_train, BoxCoxTrans(price))
- So, nicely enough, it supports my decision to choose a log-transform
- If we look at the counts from daily_data earlier, the suggested lambda is 0.4, which is close to a square-root transform (a sketch of applying a transform follows)
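- A minimal sketch of applying the estimated transform (again assuming ppr_train$price): calling predict() on a BoxCoxTrans object applies it to new values
library(caret)
bc <- BoxCoxTrans(ppr_train$price)
#transform the original prices using the estimated lambda
price_bc <- predict(bc, ppr_train$price)
head(price_bc)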
4 Categorical data
- Often many choices in how to represent
- How many levels should one allow?
- Do these categories have any hierarchical structure?
- Should all levels of a categorical variable be included? (yes for trees; no for linear models, which drop one reference level; see the sketch below)
- Some categories may define the unit of analysis (i.e. customer_id, order_id, time)
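- A minimal sketch of the two dummy-coding choices in base R, using a made-up factor:
#hypothetical three-level factor
f <- factor(c("apartment", "house", "house", "bungalow"))
#full-rank coding with an intercept: one level becomes the reference (linear models)
model.matrix(~ f)
#one indicator per level, no intercept (the representation trees effectively use)
model.matrix(~ f - 1)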
5 Quick model
library(dplyr)
library(caret)
#drop identifiers and the untransformed outcome
ppr_train3 <- select(ppr_train2, -postal_code, -address, -price, -date_of_sale)
#sample to speed up iteration
ppr_train_s <- sample_frac(ppr_train3, size=0.1)
ppr_test_samp <- sample_frac(ppr_train3, size=0.05)
train_ctrl <- trainControl(method="repeatedcv", repeats=5)
ppr_glmnet <- train(log_price~., data=ppr_train_s, method="glmnet", trControl=train_ctrl)
preds <- predict(ppr_glmnet, newdata=ppr_test_samp)
ppr_test_samp$glmnet_preds <- preds
6 Sampling
- The first rule of practical model-building is to sample judiciously
- This is entirely a device to speed up iteration, and can often be a necessity forced upon the project by a lack of resources
#back-transform from log10 to the original price scale
ggplot(ppr_test_samp, aes(x=10^log_price, y=10^glmnet_preds))+geom_point()
- The disparity between the two sets of data (individual vs commercial) is very, very apparent
- Do we care about the extremely large examples?
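- The is_big flag dropped on the next slide isn't created in the snippets shown; a plausible sketch (an assumption, not the original code) would flag the €1mn-plus sales:
library(dplyr)
#log(1e6, base=10) = 6, matching the filter threshold used on the next slide
ppr_train_s <- mutate(ppr_train_s, is_big = log_price > log(1e6, base=10))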
7 Modelling Objectives
- We need to be very clear with what we are trying to accomplish
- With this data, I was trying to assess where I should buy a house, and how I should value it
- For that purpose, the huge (€1mn-plus) sales add absolutely no value to me, because I can't possibly afford them
8 Recontextualising the Problem
ppr_train_s2 <- filter(ppr_train_s, log_price<=log(1e6, base=10)) %>%
  select(-is_big)
ppr_glmnet2 <- train(log_price~., data=ppr_train_s2, method="glmnet", trControl=train_ctrl)
preds2 <- predict(ppr_glmnet2, newdata=ppr_test_samp)
ppr_test_samp$glmnet_preds2 <- preds2
- Interestingly enough, the model containing the large prices does better
- I don't really have a good explanation for this
9 Feature Engineering
- The first step is to examine what data we have
- Are there any sources of data which could be used better?
- The obvious idea here is to geocode the addresses
- This will give us exact location, and allow us to examine more CSO data
10 Python Geocoding Script
- This is the sad truth about data science:
- you'll be using multiple tools in order to get your work done
- In this case, I actually didn't write the script myself
- I got the script from Shane Lynn, who had geocoded the data up to 2016
- I used all my free credits from Google Cloud Platform to get this updated to 2018
11 Extra Variables from geocoding
ppr_gc <- read_csv("~/Dropbox/PPR/ppr_geocoded_till_oct2018.csv")
#github doesn't like large files, so split before writing
ppr_gc1 <- filter(ppr_gc, year<=2016)
ppr_gc2 <- filter(ppr_gc, year>2016)
write_csv(ppr_gc1, path="ppr_gecoded1.csv")
write_csv(ppr_gc2, path="ppr_gecoded2.csv")
names(ppr_gc)[12:24]
variable |
---|
date |
formatted_address |
accuracy |
latitude |
longitude |
postcode |
type |
geo_county |
electoral_district |
electoral_district_id |
region |
small_area |
month |
- To be fair, this isn't everything, but it's a lot more than I had earlier
- This is the key point of feature engineering:
- You are much more likely to be successful by adding new datasets than by deriving new features from existing ones
- This is worth remembering in future research
12 Geocoded Data
- Note that we have the electoral districts, and the small areas encoded into the file
- This provides us with a link into the CSO census data
- This data is all stored in shapefiles, which also encode the area and size of each unit (electoral district/small area/constituency)
- What do we get from this?
13 Shape Files!
library(sp)
library(rgdal)
shp <- readOGR("~/Dropbox/PPR/electoral_divisions_gps.shp")
#the attribute data.frame is stored in the @data slot
names(shp@data[,13:length(shp@data)])
variable |
---|
MALE2011 |
FEMALE2011 |
TOTAL2011 |
PPOCC2011 |
UNOCC2011 |
VACANT2011 |
HS2011 |
PCVAC2011 |
LAND_AREA |
TOTAL_AREA |
CREATEDATE |
- So this is really, really useful
- Note that HS2011 is the number of households
- This provides a denominator for the sales of houses
- We also have vacancies and land area, which we can use to calculate density and other attributes (sketched below)
- Note that despite the names, these are 2016 figures
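- A minimal sketch of deriving such attributes from the shapefile's attribute table, assuming the columns listed above are numeric (LAND_AREA is in whatever units the shapefile uses):
library(dplyr)
ed_attrs <- shp@data %>%
  mutate(pop_density = TOTAL2011 / LAND_AREA,       #people per unit of land area
         household_density = HS2011 / LAND_AREA)    #households per unit of land area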
14 Pobal Data
15 Pobal Data Details
- Note that there is a data dictionary
- These are always worth looking for; if you can't find one, it's well worth documenting what you find in a dataset yourself, especially if you'll ever need it again
- You'll need to join your datasets together, obviously (a join sketch follows the variable list below)
- When that's done, you'll gain the following variables:
pobal <- read_csv("~/Dropbox/PPR/pobal_deprivation_index_ed.csv")
names(pobal)[25:45]
variable |
---|
share |
POPCHG06 |
POPCHG11 |
POPCHG16 |
AGEDEP06 |
AGEDEP11 |
AGEDEP16 |
LONEPA06 |
LONEPA11 |
LONEPA16 |
EDLOW_06 |
EDLOW_11 |
EDLOW_16 |
EDHIGH06 |
EDHIGH11 |
EDHIGH16 |
HLPROF06 |
HLPROF11 |
HLPROF16 |
LSKILL06 |
LSKILL11 |
- Note that the HP variables are indices of absolute and relative deprivation, and I believe they are the result of PCA
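- A minimal sketch of the join; the key mapping shown is a guess (use whichever electoral-district identifier the two files actually share):
library(dplyr)
#hypothetical key mapping between the geocoded sales and the Pobal file
ppr_pobal <- left_join(ppr_gc, pobal, by = c("electoral_district_id" = "ID06"))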
16 Reassessing the Situation
- So, sometimes the best feature engineering you can do is find other data :)
- However, we now have a dataset on which we can run interesting variable treatments (i.e. we have lots of continuous variables)
17 Transformations
- So the best thing to do is to examine the data again, as we did before
sapply(ppr_pobal, n_distinct) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  arrange(desc(`.`))
variable | n_distinct |
---|---|
geometry | 298501 |
rowid | 298501 |
input_string | 263158 |
longitude | 180897 |
latitude | 177022 |
postcode | 113999 |
price | 18119 |
small_area | 18019 |
electoral_district_id | 3352 |
ID06 | 3322 |
electoral_district | 3066 |
ED_Name | 3035 |
sale_date | 2337 |
TOTPOP16 | 1751 |
popdep_T | 1751 |
TOTPOP11 | 1739 |
TOTPOP06 | 1704 |
TOTPOP02 | 1673 |
popdep_L | 1304 |
popdep_M | 902 |
OHOUSE16 | 596 |
OHOUSE11 | 585 |
EDHIGH16 | 547 |
HLPROF16 | 529 |
popdep_H | 523 |
HLPROF11 | 518 |
EDHIGH11 | 505 |
OHOUSE06 | 500 |
HLPROF06 | 497 |
EDHIGH06 | 493 |
PRRENT16 | 460 |
LONEPA11 | 458 |
PRRENT11 | 456 |
LONEPA06 | 453 |
LONEPA16 | 437 |
EDLOW_06 | 430 |
UNEMPM11 | 419 |
EDLOW_11 | 391 |
PRRENT06 | 383 |
LSKILL06 | 379 |
HP2016abs | 360 |
HP2011rel | 360 |
LSKILL11 | 355 |
HP2016rel | 352 |
HP2006abs | 351 |
HP2006rel | 351 |
LSKILL16 | 351 |
HP2011abs | 350 |
EDLOW_16 | 350 |
UNEMPM16 | 347 |
UNEMPF11 | 321 |
type | 320 |
LARENT16 | 307 |
UNEMPF16 | 300 |
AGEDEP06 | 295 |
AGEDEP11 | 294 |
AGEDEP16 | 294 |
LARENT11 | 286 |
LARENT06 | 284 |
UNEMPM06 | 271 |
UNEMPF06 | 232 |
share_H | 60 |
POPCHG11 | 60 |
share_T | 46 |
share | 45 |
share_M | 44 |
share_L | 42 |
POPCHG06 | 39 |
NUTS4 | 35 |
geo_county | 34 |
ppr_county | 26 |
PEROOM11 | 19 |
PEROOM16 | 17 |
POPCHG16 | 15 |
month | 12 |
NUTS3 | 9 |
region | 8 |
year | 7 |
PEROOM06 | 7 |
description_of_property | 6 |
NUTS2 | 3 |
NUTS1 | 2 |
- We can see that the top few variables have far too many levels, and that the bottom few will be problematic as they have only 1-5 levels [1]
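- A minimal sketch of flagging the problematic low-variance predictors with caret's nearZeroVar (the geometry column is dropped first, since nearZeroVar only wants ordinary predictor columns):
library(caret)
library(dplyr)
nzv <- as.data.frame(ppr_pobal) %>%
  select(-geometry) %>%
  nearZeroVar(saveMetrics=TRUE)
#predictors dominated by a single value are flagged in the nzv column
head(nzv[order(-nzv$nzv), ])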
18 Missing Data
sapply(ppr_pobal, function(x) sum(is.na(x))/length(x)) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  dplyr::arrange(desc(`.`)) %>%
  mutate_if(is.numeric, round, 4) %>%
  head(n=5)
 | variable | prop_missing |
---|---|---|
1 | postcode | 0.5491 |
2 | ID06 | 0.0012 |
3 | ED_Name | 0.0012 |
4 | NUTS4 | 0.0012 |
5 | NUTS3 | 0.0012 |
- As we can see, it's mostly postcode that's missing
- Here, it's probably a better idea to just drop it
19 BoxCox parameters
num_vars <- as.data.frame(ppr_pobal) %>%
  na.omit() %>%
  select_if(is.numeric)
bc2 <- lapply(num_vars, BoxCoxTrans, na.rm=TRUE)
lambdas <- sapply(bc2, function (x) x[["lambda"]])
head(lambdas)
variable | lambda |
---|---|
year | 2 |
price | 0.1 |
latitude | 2 |
longitude | nil |
month | 0.8 |
ID06 | 0.1 |
- So we have a bunch of potential transformations to apply
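- A minimal sketch of applying the Box-Cox transforms in bulk via caret's preProcess (it should skip columns, such as longitude, where no lambda can be estimated):
library(caret)
pp_bc <- preProcess(num_vars, method = "BoxCox")
#apply the estimated transforms to the numeric variables
num_vars_bc <- predict(pp_bc, num_vars)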
20 PCA
- Principal Components Analysis (PCA) is a data reduction method
- It essentially examines the covariance between the variables, and returns linear combinations of them (the components) ordered by how much of the variance they capture
- The first component explains the largest share of the variance
- And each additional component is uncorrelated with all of the previous ones
- This makes it much easier to model the variables
- However, it can make them much harder to interpret
21 Practical PCA
- Keeping all the components just gives us uncorrelated predictors (which makes the fitted model easier to interpret, even if the individual components are not)
- However, we normally want to describe as much variance as possible with as few variables as possible
- This necessitates choosing K components to keep
- How do we choose K?
22 This is a very very hard problem
- Because it's unsupervised
- If we had an outcome metric, we could cross-validate
- This is much harder without a response variable
- If we do have an outcome variable, then we would need to repeat the selection of K for each fold (to avoid bias)
- There are lots of methods, many of which don't agree
23 One method to rule them all
pobal_vars <- dplyr::select(num_vars, 26:64) %>%
  lapply(function (x) scale(x, center=TRUE, scale=TRUE))
pca <- princomp(~., data=pobal_vars)
24 Screeplot
screeplot(pca, type="lines")
25 How to use the scree plot
- Keep the components until the line levels off
- That's about five here (a cumulative-variance check is sketched below)
- You should also look at the variable loadings, described below
- If you want approximately 100 pages of this, consult my thesis
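- As a complement to the scree plot, a short sketch of the cumulative proportion of variance explained by the leading components:
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
#cumulative variance explained by the first ten components
round(cumsum(var_explained)[1:10], 3)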
26 Loadings
loadings <- pca$loadings
loadings[,1:6] %>%
  round(3) %>%
  head()
Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | |
---|---|---|---|---|---|---|
POPCHG06 | 0.004 | 0.012 | 0.338 | 0.37 | 0.245 | 0.092 |
POPCHG11 | -0.006 | -0.02 | -0.336 | -0.442 | -0.23 | -0.252 |
POPCHG16 | -0.007 | -0.114 | -0.229 | -0.138 | -0.068 | -0.222 |
AGEDEP06 | 0.024 | 0.266 | 0.14 | -0.181 | -0.016 | 0.293 |
AGEDEP11 | 0.004 | 0.289 | 0.053 | -0.215 | -0.076 | 0.287 |
AGEDEP16 | -0.012 | 0.288 | 0.018 | -0.188 | -0.094 | 0.255 |
- Examining the loadings allows you to name the components, which makes them more comfortable to model with
27 Y-aware Variable treatment
- Recently, a wonderful package for R has become available
- vtreat is a package for y-aware variable selection
- It does a whole host of variable selection and preparation techniques, and can significantly improve outcomes of modelling processes
- However, it is y-aware, which means it must be placed inside the cross-validation process
- This can add significant compute requirements to your project
- Even if you don't use it, the paper is a great source of references and ideas for variable treatment
28 Impact Coding
- Terminology developed by the authors of the vtreat package (explanation available here)
- Essentially the per-level averages are smoothed towards the grand average (a toy illustration is sketched below)
- In many ways, this is what a Bayesian hierarchical model will do, but this approach is much faster
- You must use a distinct set of data to prepare this coding, otherwise you will overfit
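- A toy illustration of smoothing a per-level mean towards the grand mean (this is not vtreat's exact formula, and the pseudo-count k is a made-up tuning value):
library(dplyr)
k <- 20  #hypothetical pseudo-count controlling the strength of smoothing
ppr_df <- as.data.frame(ppr_pobal)
grand_mean <- mean(log(ppr_df$price))
ed_impact <- ppr_df %>%
  group_by(electoral_district_id) %>%
  summarise(n = n(), level_mean = mean(log(price))) %>%
  mutate(smoothed = (n * level_mean + k * grand_mean) / (n + k),
         impact = smoothed - grand_mean)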
29 Informative Missingness
- Sometimes, the fact that a variable is missing is itself informative
- A good example is income:
- higher income people are less likely to report their income
- this introduces a bias based on whether or not this variable is missing
- We can create an indicator such as variable_is_missing, which allows us to include this information in a regression model (see the sketch below)
- Note that trees and other methods do not have as much trouble with these variables, so vtreat may be less useful there
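- A minimal sketch with a hypothetical income column (the column name and the median imputation are illustrative only):
library(dplyr)
df <- df %>%
  mutate(income_is_missing = as.integer(is.na(income)),   #informative-missingness flag
         income = if_else(is.na(income),
                          median(income, na.rm=TRUE),      #simple imputation so the row stays usable
                          income))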
30 Example: Electoral Districts
- There are 3k electoral districts in the data
ed_count <- ppr_pobal %>%
  as.data.frame() %>%
  group_by(electoral_district_id) %>%
  summarise(count=n(), median=median(price))
ggplot(ed_count, aes(x=count))+geom_density()
31 Modelling without Vtreat
- Most of these have very few observations
#otherwise the below errors out
ppr_pobal_s <- sample_frac(ppr_pobal, size=0.1)
##takes a loooooonnnnng time
lm_ed1 <- lm(log(price)~electoral_district_id, data=ppr_pobal_s)
- We get an adjusted R^2 of 0.49
- This is pretty good, but we need to estimate a lot of parameters
- Can we do better with vtreat?
32 Vtreat Approach
require(vtreat)
#not a perfectly independent sample
##don't do it like this
ppr_train_s2 <- sample_frac(ppr_pobal, size=0.05)
treat <- designTreatmentsN(ppr_train_s2,
                           varlist=c("electoral_district_id"),
                           outcomename = "price")
train_prep <- prepare(treat, dframe=ppr_pobal_s)
33 Vtreat In practice
library(glmnet)
ppr_pobal_test <- sample_frac(ppr_pobal, size=0.1)
test_vtreat <- prepare(treat, ppr_pobal_test)
#glmnet_ed1 is a glmnet model fitted on the prepared training data (the fit itself isn't shown on this slide)
glmnet_preds <- predict(glmnet_ed1, newx=as.matrix(test_vtreat), type="response")
predsvobs <- data.frame(preds=exp(glmnet_preds), obs=ppr_pobal_test$price)
ggplot(predsvobs, aes(x=obs, y=preds.s0))+geom_point()
ggplot(predsvobs, aes(x=obs, y=preds.s60))+geom_point()
with(predsvobs, cor(preds.s60, obs))
- We get a correlation around the same as the linear model achieves
- But the joy of vtreat is that we can easily apply the same approach to multiple variables, and it will handle missing values and encodings for us automatically
34 Vtreat on multiple variables
treat_full <- designTreatmentsN(ppr_train_s2,
                                varlist= setdiff(names(ppr_train_s2), "price"),
                                outcomename = "price")
ppr_pobal_treat <- prepare(treat_full, dframe=ppr_pobal_s)
ppr_glmnet_full <- cv.glmnet(x=as.matrix(ppr_pobal_treat), y=log(ppr_pobal_s$price))
test_vtreat_full <- prepare(treat_full, ppr_pobal_test)
35 Vtreat Results
- When we use all of the variables, we get pretty good results
- The predictions are pretty accurate for most of the data
36 Predictions vs Observed
37 Is this model accurate?
- The most likely residual is approx 1000, which is very very good
- In fact, it's probably too good
- Examine the variables and residuals found from this model
- Discuss potential issues which may be flattering this model
38 Conclusions
- Sometimes, the best way to engineer features is to get more data
- Scaling and centering are good default transformations for numeric variables
- If you have lots of correlated variables, PCA can be a good choice
- If you want to transform effectively, then use Box-Cox transformations
- vtreat can automate lots of common transformations, and if you can spare the data, you should definitely use it
Footnotes:
[1] Particularly an issue when using cross-validation.