Data Classification Methods
Why classify at all
Data classification groups a large number of observations into data ranges, or classes. It is a useful tool for structuring the data of choropleth maps automatically, and it reduces the information content so that values can be presented much more clearly. Individual observations are lost, but small differences can be reduced and large ones can be emphasized and evaluated better. A good classification takes the distributional characteristics of the data and the psychology of perception into account.
What types of classification are distinguished?
- Supervised (number of classes known)
- Unsupervised (number of classes unknown)
How many classes are to be formed?
- The number of classes depends on the legibility of the map
- Class intervals must neither overlap nor form gaps
- Class boundaries should be round numbers (e.g. integer values)
"As many as necessary, as few as possible"
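A common way to keep classes free of gaps is to store a single sorted list of boundaries, so that neighbouring classes share one boundary value and every observation falls into exactly one class. The following is a minimal sketch of that idea; the function name `classify` and the boundary values are hypothetical and chosen only for illustration.

```python
import bisect

def classify(value, breaks):
    """Return the index of the class that contains `value`.

    `breaks` is a sorted list of inner class boundaries. Class i covers
    breaks[i-1] <= value < breaks[i]; the first and last classes are open-ended.
    """
    return bisect.bisect_right(breaks, value)

# Hypothetical round boundaries that define four classes
breaks = [10, 20, 30]
labels = [classify(v, breaks) for v in [5, 10, 19.9, 20, 35]]
print(labels)  # [0, 1, 1, 2, 3]
```

A value that lies exactly on a boundary, such as 20, is assigned to the upper class only, so the classes cannot overlap.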
In general
- Multi-color representation: Maximum of 11 classes recommended
- Single-color representation: Maximum of 7 classes recommended
What classification methods are there?
- Equal Interval
- Quantiles
- Maximum breaks
- Standard deviations
- [...]
The Methods
Equal Interval
- Each class has a constant interval on the scale of common data ranges (e.g. percentage values or temperatures)
- Size of intervals is calculated as follows:
\[ Interval~Size = \frac{maxX_i-minX_i}{Number~of~Classes} \]
- Therefore the number of elements per class can differ
Example:
Advantages | Disadvantages |
Emphasizes the amount of an attribute value relative to the others | Poorly suited to data that is not evenly distributed |
Legend is easy to interpret | Many features may fall into one class and none into another |
\[ Highest~Elevation~per~State \]
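The interval-size formula of the equal interval method can be sketched in a few lines of Python. The sample values and the helper names `equal_interval_breaks` and `count_per_class` are made up for illustration; the sketch assumes half-open classes with the maximum value placed in the last class.

```python
import bisect

def equal_interval_breaks(values, n_classes):
    """Return the n_classes + 1 boundaries of an equal interval classification."""
    lo, hi = min(values), max(values)
    size = (hi - lo) / n_classes  # interval size = (max - min) / number of classes
    return [lo + i * size for i in range(n_classes + 1)]

def count_per_class(values, breaks):
    """Count how many values fall into each class."""
    counts = [0] * (len(breaks) - 1)
    for v in values:
        # Classes are half-open [lower, upper); the maximum goes into the last class
        idx = min(bisect.bisect_right(breaks, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts

values = [2, 3, 5, 8, 9, 11, 40, 100]  # made-up sample data
breaks = equal_interval_breaks(values, 4)
counts = count_per_class(values, breaks)
print(breaks)  # [2.0, 26.5, 51.0, 75.5, 100.0]
print(counts)  # [6, 1, 0, 1]
```

With this skewed sample, six of the eight values land in the first class and the third class stays empty, which illustrates the disadvantage listed in the table.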
Quantiles
- Suited to data that is spread across its range, preferably linearly distributed or ordinal data
- The more classes the better
- The number of elements contained in a class is calculated as follows:
\[ Number~of~Elements~per~Class = \frac{Total~Number~of~Elements}{Number~of~Quantiles} \]
Advantages | Disadvantages |
Emphasizes relative position | Can be misleading, as interval sizes may vary strongly |
No empty classes | Identical values may end up in different classes |
\[ Highest~Elevation~per~State \]
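The quantile rule, an equal number of elements per class, can be sketched as follows. The function name `quantile_classes` and the sample values are hypothetical; when the total does not divide evenly, this sketch simply lets class sizes differ by at most one element.

```python
def quantile_classes(values, n_classes):
    """Split the sorted values into n_classes groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    classes = []
    for i in range(n_classes):
        start = i * n // n_classes        # index of the first element of class i
        end = (i + 1) * n // n_classes    # index just past the last element of class i
        classes.append(ordered[start:end])
    return classes

values = [2, 3, 5, 8, 9, 11, 40, 100]  # made-up sample data
classes = quantile_classes(values, 4)
print(classes)  # [[2, 3], [5, 8], [9, 11], [40, 100]]
```

Every class holds two elements, but the value ranges covered by the classes differ widely: the first spans 2 to 3, the last 40 to 100.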
Maximum breaks
- Class boundaries are placed at the largest gradients (differences) between neighbouring values of the sorted data
- Suited to data that is unevenly distributed but not skewed toward either end
\[ Gradient = value(k-1)~-~value(k) \]
Example:
Advantages | Disadvantages |
No empty classes | If the gradients are rather small, the class boundaries may not be representative |
Highly uneven distribution of class boundaries possible |
\[ Highest~Elevation~per~State \]
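A minimal sketch of the maximum breaks idea is shown below. It assumes the values are sorted in ascending order and uses the absolute size of the gradient between neighbours, so the sign differs from the formula above; the function name `maximum_breaks` and the sample values are made up for illustration.

```python
def maximum_breaks(values, n_classes):
    """Split the sorted values at the n_classes - 1 largest gaps between neighbours."""
    ordered = sorted(values)
    # (gap size, index of the upper neighbour) for each pair of adjacent values
    gaps = [(ordered[i] - ordered[i - 1], i) for i in range(1, len(ordered))]
    # Keep the largest gaps and restore their positional order
    largest = sorted(gaps, reverse=True)[:n_classes - 1]
    split_points = sorted(i for _, i in largest)

    classes, start = [], 0
    for i in split_points:
        classes.append(ordered[start:i])
        start = i
    classes.append(ordered[start:])
    return classes

values = [2, 3, 5, 8, 9, 11, 40, 100]  # made-up sample data
classes = maximum_breaks(values, 3)
print(classes)  # [[2, 3, 5, 8, 9, 11], [40], [100]]
```

The two largest gaps (11 to 40 and 40 to 100) become the class boundaries, which leaves no class empty but distributes the elements very unevenly.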
Standard deviation
- Suited to data that is unevenly distributed but not skewed toward either end; especially appropriate for normally distributed data
- The deviation of the attribute values from the mean value is considered
- Intervals are defined as a fraction \( n \) of the standard deviation \( \sigma \) (e.g. \( n=\frac{1}{2} \), \( n=\frac{1}{4} \))
\[ Interval = n~\cdot~\sigma \]
Advantages | Disadvantages |
Splits values into those above and those below the mean value | Uneven number of elements within the classes |
Depending on the distribution properties of the dataset |
\[ Highest~Elevation~per~State \]
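The interval rule of the standard deviation method can be sketched as below. This is only one possible reading: it assumes the mean itself is a class boundary, uses the population standard deviation (`statistics.pstdev`), and introduces a hypothetical parameter `steps` for the number of intervals on each side of the mean. The sample values are made up.

```python
import statistics

def std_dev_breaks(values, n=0.5, steps=2):
    """Return class boundaries at mean + j * n * sigma for j = -steps .. steps."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    interval = n * sigma               # interval = n * sigma
    return [mean + j * interval for j in range(-steps, steps + 1)]

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample data: mean 5, sigma 2
breaks = std_dev_breaks(values, n=0.5, steps=2)
print(breaks)  # [3.0, 4.0, 5.0, 6.0, 7.0]
```

Values below the first and above the last boundary form two open-ended outer classes, and the number of elements per class depends entirely on how the data is distributed around the mean.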
Hands-On
The following map shows the density of medical practitioners in Munich. Feel free to try out different ways to classify this thematic information.