Data Classification Methods

Why classify at all?
Data classification groups a large number of observations into data ranges, or classes. It is a useful tool for structuring the data of choropleth maps automatically, and it reduces the information content, which makes the presentation of values much clearer. Individual observations are lost, but small differences can be reduced and large ones can be emphasized and evaluated more easily. Distributional characteristics and the psychology of perception are taken into account.

What types of classification are distinguished?

Supervised (number of classes known)
Unsupervised (number of classes unknown)

How many classes are to be formed?

The number of classes depends on the legibility of the map
Class intervals must neither overlap nor leave gaps
Class boundaries should be round numbers (e.g. integer values)

"As much as necessary, as little as possible"

Dr.-Ing. Christian Murphy in Datenklassifizierung

In general

Multi-color representation: Maximum of 11 classes recommended
Single-color representation: Maximum of 7 classes recommended

What classification methods are there?

Equal Interval
Quantiles
Maximum breaks
Standard deviations
[...]

The Methods

Equal Interval

Each class spans a constant interval on the scale of common data ranges (e.g. percentage values or temperatures)
Size of intervals is calculated as follows:

\[ Interval~Size = \frac{maxX_i-minX_i}{Number~of~Classes} \]

\( maxX_i \), \( minX_i \): maximum and minimum value of the data

The number of elements per class can therefore differ
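The interval-size formula above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular GIS package: the function names `equal_interval_breaks` and `count_per_class` are hypothetical, and the sample values are invented for the example.

```python
def equal_interval_breaks(values, n_classes):
    """Return the n_classes + 1 class boundaries of an equal-interval scheme."""
    lo, hi = min(values), max(values)
    size = (hi - lo) / n_classes  # Interval Size = (max - min) / Number of Classes
    return [lo + i * size for i in range(n_classes + 1)]


def count_per_class(values, breaks):
    """Count the values per class; the last class also includes the maximum."""
    counts = [0] * (len(breaks) - 1)
    last = len(counts) - 1
    for v in values:
        # Assign v to the first class whose upper boundary it does not reach.
        for i in range(len(counts)):
            if v < breaks[i + 1] or i == last:
                counts[i] += 1
                break
    return counts


# Hypothetical sample values, invented for illustration only.
sample_ei = [0, 1, 2, 3, 10, 11, 40]
breaks_ei = equal_interval_breaks(sample_ei, 4)
counts_ei = count_per_class(sample_ei, breaks_ei)
print(breaks_ei)  # [0.0, 10.0, 20.0, 30.0, 40.0]
print(counts_ei)  # [4, 2, 0, 1]
```

With these skewed sample values, four of the seven elements land in the first class and the third class stays empty, which is the uneven class occupancy that equal intervals can produce.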

Example:

Advantages	Disadvantages
Emphasizes the amount of an attribute value relative to the others	Unsuitable for data that are not regularly distributed
Legend is easy to interpret	Many features may fall into one class and none into another

Figure: Highest elevation per state

Quantiles

Suited to data that are spread across their whole range, preferably linearly distributed or ordinal data
The more classes the better
The number of elements contained in a class is calculated as follows:

\[ Number~of~Elements~per~Class = \frac{Total~Number~of~Elements}{Number~of~Quantiles} \]
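As a rough sketch of the formula above, the following Python snippet sorts the values and cuts them into groups of (nearly) equal size. It is a simple rank-based split under stated assumptions: the function name `quantile_classes` is hypothetical, the sample values are invented, and GIS packages may derive quantile boundaries differently (for example by interpolation).

```python
def quantile_classes(values, n_classes):
    """Split the sorted values into n_classes groups of (nearly) equal size."""
    ordered = sorted(values)
    total = len(ordered)
    classes = []
    for i in range(n_classes):
        # Integer arithmetic spreads any remainder across the classes.
        start = i * total // n_classes
        end = (i + 1) * total // n_classes
        classes.append(ordered[start:end])
    return classes


# Hypothetical sample values, invented for illustration only.
sample_q = [1, 2, 2, 2, 3, 50]
groups_q = quantile_classes(sample_q, 3)
print(groups_q)  # [[1, 2], [2, 2], [3, 50]]
```

Every class holds 6 / 3 = 2 elements, but the value 2 appears in two different classes and the last class spans a far wider value range than the others.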

Example:

Advantages	Disadvantages
Emphasizes relative position	Can be misleading, as interval sizes may vary strongly
No empty classes	Identical values may fall into different classes

Figure: Highest elevation per state

Maximum breaks

Class boundaries are placed where the gradients between neighbouring values are largest
Suited to data that are unevenly distributed but not skewed toward either end

\[ Gradient = value(k-1)~-~value(k) \]
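The gradient rule above can be sketched as follows. Two assumptions are made that the text does not state: the values are sorted in ascending order, so the gradient is taken as the positive difference between neighbours, and each class boundary is placed at the midpoint of its gap. The function name `maximum_breaks` is hypothetical and the sample values are invented.

```python
def maximum_breaks(values, n_classes):
    """Return n_classes - 1 inner class boundaries placed in the largest gaps."""
    ordered = sorted(values)
    # Gap between each pair of neighbouring values, stored with its position k.
    gaps = [(ordered[k] - ordered[k - 1], k) for k in range(1, len(ordered))]
    # Keep the n_classes - 1 largest gaps.
    largest = sorted(gaps, reverse=True)[: n_classes - 1]
    # Assumption: each boundary sits at the midpoint of its gap.
    return sorted((ordered[k - 1] + ordered[k]) / 2 for _, k in largest)


# Hypothetical sample values, invented for illustration only.
sample_mb = [0, 1, 2, 3, 10, 11, 40]
breaks_mb = maximum_breaks(sample_mb, 3)
print(breaks_mb)  # [6.5, 25.5]
```

The two largest gaps lie between 3 and 10 and between 11 and 40, so three classes result: {0, 1, 2, 3}, {10, 11} and {40}.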

Example:

Advantages	Disadvantages
No empty classes	If the gradients are small, the class boundaries may not be representative
	Class boundaries may be distributed very unevenly

Figure: Highest elevation per state

Standard deviation

Especially appropriate for normally distributed data
The deviation of the attribute values from the mean value is considered
Intervals are expressed as a fraction of the standard deviation (e.g. \( n=\frac{1}{2} \), \( n=\frac{1}{4} \))

\[ Interval = n~\cdot~\sigma \]

\( n \): multiple of standard deviation

\( \sigma \): standard deviation
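A minimal sketch of the interval formula above, using Python's standard `statistics` module. Several choices here are assumptions rather than part of the method as described: the boundaries are centred on the mean, the population standard deviation (`statistics.pstdev`) is used, and the hypothetical parameter `steps` sets how many intervals are drawn on each side of the mean. The sample values are invented.

```python
import statistics


def std_dev_breaks(values, n=0.5, steps=2):
    """Return class boundaries at mean + k * n * sigma for k = -steps..steps."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    interval = n * sigma  # Interval = n * sigma
    return [mean + k * interval for k in range(-steps, steps + 1)]


# Hypothetical sample values, invented for illustration only (mean 5, sigma 2).
sample_sd = [2, 4, 4, 4, 5, 5, 7, 9]
breaks_sd = std_dev_breaks(sample_sd, n=0.5, steps=2)
print(breaks_sd)  # [3.0, 4.0, 5.0, 6.0, 7.0]
```

Values below the lowest or above the highest boundary (here 2 and 9) would fall into open-ended outer classes.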

Example:

Advantages	Disadvantages
Splits values into those above and below the mean value	Uneven number of elements within the classes
	Result depends on the distribution properties of the dataset

Figure: Highest elevation per state

Hands-On

The following map shows the density of medical practitioners in Munich. Feel free to try out different ways to classify this thematic information.
