Data Classification Methods


Why classify at all?
Data classification groups a large number of observations into data ranges, or classes. It is a useful tool for structuring the data of choropleth maps automatically, and it reduces the information content so that values can be presented much more clearly. Individual observations are lost, but small differences can be suppressed and large ones emphasized and evaluated more easily. Distributional characteristics and the psychology of perception are taken into account.



What types of classification are distinguished?

  • Supervised (number of classes known)
  • Unsupervised (number of classes unknown)



How many classes are to be formed?

  • The number of classes depends on the legibility of the map
  • Class intervals must neither overlap nor form gaps
  • Class boundaries should be round numbers (integer values)
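
One way to obtain class intervals that neither overlap nor leave gaps is to treat every class as a half-open interval, so that each boundary value belongs to exactly one class. The following is a minimal sketch of that idea; the function name `class_index` and the boundary values are made up for illustration and are not part of the original material.

```python
from bisect import bisect_right


def class_index(value, boundaries):
    """Return the index of the class that contains value.

    boundaries holds the inner class limits in ascending order. Class i
    covers the half-open range [boundaries[i-1], boundaries[i]), so every
    value lands in exactly one class: no overlaps, no gaps.
    """
    return bisect_right(boundaries, value)


# Hypothetical round class boundaries (three limits give four classes).
boundaries = [500, 1000, 2000]
print([class_index(v, boundaries) for v in [0, 499, 500, 999, 1000, 3500]])
# → [0, 0, 1, 1, 2, 3]
```

A value that sits exactly on a boundary (500 or 1000 above) is assigned to the upper class. The opposite convention works equally well as long as it is applied consistently.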


"As much as necessary, as little as possible"

Dr.-Ing. Christian Murphy in Datenklassifizierung

In general

  • Multi-color representation: Maximum of 11 classes recommended
  • Single-color representation: Maximum of 7 classes recommended



What classification methods are there?

  • Equal Interval
  • Quantiles
  • Maximum breaks
  • Standard deviations
  • [...]

The Methods




Equal Interval

  • Each class has a constant interval width on the scale of common data ranges (e.g. percentage values or temperatures)
  • Size of intervals is calculated as follows:

    \[ Interval~Size = \frac{maxX_i-minX_i}{Number~of~Classes} \]
  • \( maxX_i \), \( minX_i \): maximum and minimum value

  • Therefore, the number of elements per class can differ
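
The interval-size formula above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the sample values are assumptions and do not come from the dataset used on this page.

```python
def equal_interval_breaks(values, n_classes):
    """Return the n_classes + 1 class boundaries of an equal-interval scheme."""
    lo, hi = min(values), max(values)
    size = (hi - lo) / n_classes  # Interval Size = (max - min) / Number of Classes
    return [lo + i * size for i in range(n_classes + 1)]


def count_per_class(values, breaks):
    """Count the values per class; the last class includes the maximum."""
    counts = [0] * (len(breaks) - 1)
    for v in values:
        i = 0
        # Advance while v lies at or above the upper boundary of class i.
        while i < len(counts) - 1 and v >= breaks[i + 1]:
            i += 1
        counts[i] += 1
    return counts


# Hypothetical sample values, not the elevations shown on this page.
elevations = [100, 150, 200, 250, 300, 1200, 3500]
breaks = equal_interval_breaks(elevations, 4)
print(breaks)                               # → [100.0, 950.0, 1800.0, 2650.0, 3500.0]
print(count_per_class(elevations, breaks))  # → [5, 1, 0, 1]
```

With these sample values, five of the seven observations fall into the first class and the third class stays empty, which illustrates why the number of elements per class can differ.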

Example:





Advantages:

  • Emphasizes the amount of an attribute value relative to other values
  • Legend is easy to interpret

Disadvantages:

  • Unsuitable for data that is not regularly distributed
  • Many features may fall into one class and none into another

[Figure: Highest Elevation per State; axis from 0 to 3500 in steps of 500]



Quantiles

  • Suited to data that is distributed across its whole range, preferably linearly distributed or ordinal data
  • The more classes the better
  • The number of elements contained in a class is calculated as follows:

    \[ Number~of~Elements~per~Class = \frac{Total~Number~of~Elements}{Number~of~Quantiles} \]
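
A minimal sketch of this rule, assuming the values are simply sorted and cut into groups of (nearly) equal size. The function name `quantile_classes` and the sample values are hypothetical.

```python
def quantile_classes(values, n_classes):
    """Split the sorted values into n_classes groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    classes = []
    for i in range(n_classes):
        # Integer arithmetic spreads any remainder across the classes.
        start = i * n // n_classes
        end = (i + 1) * n // n_classes
        classes.append(ordered[start:end])
    return classes


# Hypothetical sample values: 8 elements / 4 quantiles = 2 elements per class.
elevations = [100, 150, 200, 250, 300, 1200, 2000, 3500]
for group in quantile_classes(elevations, 4):
    print(group)
# → [100, 150]
# → [200, 250]
# → [300, 1200]
# → [2000, 3500]
```

Every class holds two elements, but the value ranges they cover differ widely (50 in the first class, 1500 in the last). Note also that this simple cut can place identical values in neighbouring classes.
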
Example:





Advantages:

  • Emphasizes relative position
  • No empty classes

Disadvantages:

  • Can be misleading, as interval sizes may vary strongly
  • Identical values might end up in different classes

[Figure: Highest Elevation per State; axis from 0 to 3500 in steps of 500]



Maximum breaks

  • Class boundaries are determined by the maximum gradients of the values
  • Suited to data that is unevenly distributed but not skewed toward either end

\[ Gradient = value(k-1)~-~value(k) \]
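
The idea can be sketched as follows. The sketch makes two assumptions that the text above does not state: the values are sorted in ascending order, so the gradient is taken as the positive difference between neighbours, and each class boundary is placed halfway between the two neighbours it separates. The function name and the sample values are hypothetical.

```python
def maximum_breaks(values, n_classes):
    """Return n_classes - 1 class boundaries placed in the largest gaps."""
    ordered = sorted(values)
    # Gradient between neighbours: (gap size, index of the lower neighbour).
    gaps = [(ordered[k] - ordered[k - 1], k - 1) for k in range(1, len(ordered))]
    # Keep the n_classes - 1 largest gaps.
    largest = sorted(gaps, reverse=True)[: n_classes - 1]
    # Assumption: put each boundary halfway between the two neighbours.
    return sorted((ordered[i] + ordered[i + 1]) / 2 for _, i in largest)


# Hypothetical sample values; the largest gaps are 2000 to 3500 and 300 to 1200.
elevations = [100, 150, 200, 250, 300, 1200, 2000, 3500]
print(maximum_breaks(elevations, 3))  # → [750.0, 2750.0]
```
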
Example:





Advantages:

  • No empty classes

Disadvantages:

  • If the gradients are rather small, the class boundaries may not be representative
  • Highly uneven distribution of class boundaries possible

[Figure: Highest Elevation per State; axis from 0 to 3500 in steps of 500]



Standard deviation

  • Especially appropriate for normally distributed data
  • The deviation of the attribute values from the mean value is considered
  • Intervals are expressed as a fraction of the standard deviation (e.g. \( n=\frac{1}{2} \), \( n=\frac{1}{4} \))

    \[ Interval = n~\cdot~\sigma \]
  • \( n \): multiple of the standard deviation
  • \( \sigma \): standard deviation
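
A minimal sketch of how such boundaries could be computed. It assumes that the boundaries are centred on the mean and that the population standard deviation is used; neither choice is specified above. The function name, the `steps` parameter and the sample values are hypothetical.

```python
import statistics


def std_dev_breaks(values, n=0.5, steps=2):
    """Return class boundaries spaced n * sigma apart, centred on the mean."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    interval = n * sigma               # Interval = n * sigma
    # steps boundaries on each side of the mean, plus the mean itself.
    return [mean + k * interval for k in range(-steps, steps + 1)]


# Hypothetical sample values with mean 5 and standard deviation 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(std_dev_breaks(values, n=0.5, steps=2))  # → [3.0, 4.0, 5.0, 6.0, 7.0]
```

Because the mean is itself a boundary, the classes separate values above the mean from values below it.
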
Example:





Advantages:

  • Splits values into those above and those below the mean value

Disadvantages:

  • Uneven number of elements within the classes
  • Depends on the distribution properties of the dataset

[Figure: Highest Elevation per State; axis from 0 to 3500 in steps of 500]

Hands-On


The following map shows the density of medical practitioners in Munich. Feel free to try out different ways to classify this thematic information.


References