Visualisation: Modelling the World
1 Structure
- This talk is an approach to visualisation
- Not many absolutes
- Assumptions of Vision
- Assumptions of Statistical Graphics
- Understanding data with Visualisation
- Communicating to others with Visualisation
2 What is Visualisation?
- a tool for understanding the world
- a way to communicate a particular perspective on data
- an adjunct to thought
3 Why Visualisation?
- The eye is really really good at finding patterns in pictures
- in fact, it's so good that it can find patterns that aren't even there
4 The importance of perspective
- You can see one of two things in the previous image
- Which of them can depend on what you expect to see
- It can also depend on what your environment contains
5 Müller-Lyer
Figure 1: Which line is longer?
6 This illusion doesn't affect everyone similarly
- Europeans and Americans are more susceptible
- Africans are less susceptible
- Possibly due to the prevalence of right angles in urban environments
- there appears to be a small difference between urban and rural dwellers
- very very relevant to boxplots (how to lie with boxplots, I guess)
7 Who cares?
- Shows that how we interpret stimuli is not tabula rasa
- When you gaze into the image, the image also gazes into you…
- We bring our own perception and previous associations into any image 1
8 When to use Visualisation?
- Always
9 Running Example
- Property Price Register
- Kinda a crappy dataset
- No cleaning or checking done by the authority
- lots of craziness (1 apartment for 18.6mn)
10 Property Price Register
- We used Google's geocoding service to get more details on each observation
- I updated Shane Lynn's script and ran it on the data up till October 2018
- I also typically break out properties sold for greater than 1e6, as they are often multiple-unit sales (and there's little to no automated way of figuring this out) 2
- Lots of manual fixing required
- the Irish text definitely doesn't help
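As a minimal sketch of the break-out step described above, the snippet below separates sales above the 1e6 cut-off from the rest so they can be checked by hand. The record layout, the addresses and the function name `split_by_price` are assumptions made for illustration; only the threshold and the 18.6mn apartment come from the talk.

```python
# Hypothetical records: (address, price in euro). Addresses are invented.
sales = [
    ("1 Main St", 250_000.0),
    ("Apartment, Dock Rd", 18_600_000.0),
    ("5 Hill View", 1_000_000.0),
    ("12 Strand Rd", 430_000.0),
]

THRESHOLD = 1e6  # cut-off mentioned in the talk


def split_by_price(records, threshold=THRESHOLD):
    """Return (ordinary, high_value); high_value holds prices strictly above threshold."""
    ordinary = [r for r in records if r[1] <= threshold]
    high_value = [r for r in records if r[1] > threshold]
    return ordinary, high_value


ordinary, high_value = split_by_price(sales)
print(len(ordinary), "ordinary sales;", len(high_value), "to review by hand")
```

Keeping the high-value group rather than deleting it means the possible multiple-unit sales stay available for manual inspection.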
11 Assumptions of Statistical Graphics
- there are many
- in this section, I'd like to subvert them, in order to make you think
12 Line Graphs
- Normally represent time
- scatterplots don't (always) have the same assumptions
- what is the deepest assumption?
13 Median Property Price by Day, Ireland 2011-18
14 Flipped Line Chart
15 F-ing Line Chart
- Here, the violence is that we swap the axes in a fashion only a monster would
16 Abusing Standard Assumptions
17 Scatter plot
- Also encodes a set of base assumptions
- points nearer to each other in space are more related
- more orientation issues
18 Standard Scatter
19 Flipped Scatter
20 Other side
21 What does this tell us?
- We have a base level of assumptions that we bring to graphics (especially statistical graphics)
- Most of these appear to have been formed by Descartes
- When these assumptions are subverted, expect problems
22 Simple Statistical Graphics
- Graphs excel at showing relations between things
- Consider the difference between quantiles of a variable, and a density plot
- For example, the price of houses:
| Quantile | Price (EUR) |
| 0%       |       5,079 |
| 10%      |      55,000 |
| 20%      |      85,000 |
| 30%      |     115,000 |
| 40%      |     145,000 |
| 50%      |     175,000 |
| 60%      |     214,000 |
| 70%      |     255,505 |
| 80%      |     315,000 |
| 90%      |     430,000 |
| 100%     | 139,165,000 |
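A decile table like the one above can be produced along these lines. This is a sketch using Python's standard library, not the code behind the talk; the `prices` list is invented and `decile_table` is a hypothetical helper.

```python
import statistics

# Invented prices in euro, for illustration only.
prices = [5_079, 55_000, 85_000, 120_000, 175_000,
          214_000, 260_000, 315_000, 430_000, 2_500_000]


def decile_table(values):
    """Return 11 values: the minimum, the nine decile cut points, the maximum."""
    cuts = statistics.quantiles(values, n=10, method="inclusive")
    return [min(values)] + cuts + [max(values)]


table = decile_table(prices)
for pct, value in zip(range(0, 101, 10), table):
    print(f"{pct:>3}% | {value:>12,.0f}")
```

The table gives eleven numbers to read and compare; a density plot shows the same skew at a glance, which is the contrast this slide is drawing.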
23 Density Plot
24 Better Density Plot
25 Transformations
- Useful to get a better sense of the data
- Have a bunch of assumptions (what's the log of -1?)
- Can be used to deceive very, very easily
- Really really useful in everyday practice
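To make the "log of -1" point concrete, here is a small sketch of a log transform that refuses to hide the values it cannot handle. The function name `log10_prices` and the sample inputs are assumptions for illustration.

```python
import math


def log10_prices(values):
    """Log-transform positive values; return the rest separately instead of dropping them silently."""
    kept, dropped = [], []
    for v in values:
        if v > 0:
            kept.append(math.log10(v))
        else:
            # log10 is undefined for zero and negative numbers
            dropped.append(v)
    return kept, dropped


kept, dropped = log10_prices([100, 1000, -1, 0])
print("transformed:", kept)
print("could not transform:", dropped)
```

Returning the rejected values makes the assumption visible: anything in `dropped` is data that a log-scaled plot would quietly leave out.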
26 Getting the sense of things
- Picking the right visualisation for the data is important
- is this a good plot?
- does this depend on the number of points?
27 Cleaning the Data
- Let's say we remove all properties with prices greater than 2mn
28 More Data Cleaning
- Better or worse?
29 Transformations Help
- Note the log base 10
- Some of you may be able to convert from base 2.718, but I missed that class in school
- Still crap though
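For anyone who also missed that class: converting between the two bases only needs a division by ln(10). A minimal sketch, with `ln_to_log10` as a hypothetical helper name:

```python
import math


def ln_to_log10(ln_value):
    """Convert a natural-log value to base 10: log10(x) = ln(x) / ln(10)."""
    return ln_value / math.log(10)


price = 175_000  # the median from the quantile table
print(ln_to_log10(math.log(price)), math.log10(price))
```

Base 10 is easier to read on an axis because each unit step is a factor of ten in price.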
30 No data is an island
- The first obvious thing is to split by county, right?
- Oh look, it's a lot of little boxes of crap :(
31 Summarisation
- The obvious answer is summarisation
32 Reducing Alpha kinda works…
- But really just washes the whole thing out
33 A redundant faceting variable
- We just group by a higher level variable
- Much clearer :)
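The slide does not say which higher-level variable was used, so the sketch below assumes province as the grouping, purely for illustration. The prices are invented and `median_by_group` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median

# Assumed higher-level grouping: county -> province.
COUNTY_TO_PROVINCE = {
    "Dublin": "Leinster", "Kildare": "Leinster",
    "Cork": "Munster", "Kerry": "Munster",
    "Galway": "Connacht", "Donegal": "Ulster",
}

# Invented (county, price in euro) records.
county_sales = [
    ("Dublin", 350_000), ("Kildare", 250_000),
    ("Cork", 220_000), ("Kerry", 160_000),
    ("Galway", 200_000), ("Donegal", 110_000),
]


def median_by_group(records, lookup):
    """Map each record to its higher-level group and take the median price per group."""
    groups = defaultdict(list)
    for county, price in records:
        groups[lookup[county]].append(price)
    return {group: median(group_prices) for group, group_prices in groups.items()}


by_province = median_by_group(county_sales, COUNTY_TO_PROVINCE)
print(by_province)
```

Fewer groups means fewer panels or lines, which is where the extra clarity comes from.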
34 WTF?
- This is one of the major advantages of visualisation:
- it helps to (dis)confirm your assumptions
- given that we have too many lines in the various groupings, we know that something has gone horribly wrong
- in this case, it's a mismatch between two different types of data
35 Distributions (i.e. boxplots)
36 Faceting, redux
- This actually works (for me, at least)
- can you explain this to a sales-person?
37 Distributions over Time, Redux
- This is much, much better
- I definitely don't think I'd try to explain it to a business/sales person
38 Spatial vs Temporal
- line plots vs maps
- time versus space
- both provide insight into the data
- pick one, difficult to do both
39 Line plots ignore space, maps ignore time
- There's a real problem of scale here: Dublin City accounts for much of the population, yet it is almost invisible on the map
40 Dirty Oul Town
41 Counts tell a different story
- Outliers make the map useless
42 Dublin City (again)
43 Density Plots to help maps
- A tiny proportion of electoral districts drive the uselessness of the maps
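One common way to stop a handful of extreme districts from dominating a map's colour scale is to cap values at a high percentile before plotting. This is a sketch of that general technique, not necessarily what was done for these maps; `cap_at_percentile` and the counts are assumptions.

```python
import statistics


def cap_at_percentile(values, pct=99):
    """Clamp every value above the pct-th percentile down to that percentile."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    cap = cuts[pct - 1]
    return [min(v, cap) for v in values]


# Invented counts per district: 99 ordinary districts and one extreme one.
counts = [1] * 99 + [1000]
capped = cap_at_percentile(counts)
print(max(counts), "->", max(capped))
```

Capping changes what the map says about the extreme districts, so it is worth stating in the legend.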
44 Maps over Time
- Just doesn't work
- Even when I account for the outliers, it still doesn't work.
45 Lines for Time
- This shows the trend plus outliers
- Much more useful
- lose the spatial dimension
46 Interactivity and Dashboards
- Can show both time and space
- for reporting, these are essential
- Much more effort from a software-engineering perspective 3
47 Performative vs Presentation
- Two types of graphs:
- for yourself
- for other people (and different audiences need different things)
48 Performative Graphics
- These are used to help you understand a problem
- typically created in an iterative fashion
- often move from data transformation to visualisation and back again (like this talk)
49 Presentation Graphs
- To some extent, your job with presentation visualisations is to tell a story
- hopefully, it will be nuanced, but that isn't a requirement 4
- Often good to show smooths as opposed to raw data
- raw data is often ugly
- need for care here, as this should only be done where there is a clear effect
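A smooth can be as simple as a moving average. The sketch below is one basic option under that assumption; the talk does not say which smoother was used, and `moving_average` is a hypothetical helper.

```python
def moving_average(values, window=3):
    """Return the mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


smoothed = moving_average([1, 2, 3, 4, 5], window=3)
print(smoothed)
```

A wider window gives a cleaner line but can also manufacture a trend out of noise, which is why the caution above applies.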
50 Advice
- As few as possible
- One clear message
- Repeat yourself
- Remove nuance
51 As few as possible
- There should be no extraneous graphs
- Each graph should have a clear purpose
- Smooths are really effective
52 One Clear Message
- You should only be telling one story at a time
- People are easily confused
- Especially in an oral presentation
- Backup docs should contain nuance
53 Repeat Yourself
- This is the key to helping people retain information
- This is easier once you know the story
- Say what you're going to say, say it, then say what you said
54 Remove Nuance
- This varies by audience
- Salespeople may just want the results
- colleagues may want to see the code
- most people just want a high level explanation
- Nuance should be present, just not in a presentation
55 Conclusions
- Everyone brings assumptions to visualisations
- Make sure that you take advantage of this
- Visualisation is primarily a tool for communicating with yourself
- Iterative process, even bad graphs can teach you something
- Secondarily, it's a tool for communicating with others
- When using visualisations with others, keep it simple
56 More Info
- My property article here
- My repository for this talk
- My crazy long notes file with most of my analyses
- the data itself
57 sessionInfo
Footnotes:
1. anything really, but we're talking about images here.
2. please someone in the audience suggest a better idea
3. for me, at least
4. and in fact, it may be better to remove all nuance from the presentation and provide a longer document with all the failed approaches and hacking needed to actually reproduce your results