Kennard-Stone algorithm and SPXY algorithm with python implementation

Kennard-Stone (KS) and SPXY are common sample-splitting algorithms in chemometrics. Unlike a random split, both methods are based on the so-called "maximum minimum distance" criterion, which means that the split produced by KS or SPXY is deterministic rather than random. This article briefly introduces the two algorithms and provides a Python implementation.

Background

Multivariate data analysis is a pillar of chemometrics, and multi-dimensional data are easily acquired with modern instruments such as spectrometers. To build chemometric models, such data should be carefully split into training, validation, and test sets. A random split is a common choice, but the resulting models can vary from one split to another. Worse, a random split cannot guarantee a proper result, i.e., the subsets may not represent the original dataset. Many studies have therefore investigated how to select a representative subset from a large dataset.

KS and SPXY are commonly used in spectral analysis. Both methods select samples uniformly from a pool of $n$ samples. KS is based on the distances between samples in the independent variables ( $x$ ), while SPXY combines the independent and dependent variables ( $y$ )^[1]. They are very similar: both select the subset by the maximum minimum distance criterion.

KS split

Assume we want to select $k$ samples from $n$ samples to create a subset. The $n$ samples are represented by the matrix $X$ .

X= \left[ \begin{matrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \\ \end{matrix} \right]

where $m$ is the number of variables of each sample.

The steps of the KS split are as follows:

1. Calculate the distance matrix of the $n$ samples as the similarity matrix. Euclidean distance is often adopted here. The distance matrix $D$ is denoted as below.

D= \left[ \begin{matrix} d_{11} & d_{12} & \cdots & d_{1n} \\ d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & d_{nn} \\ \end{matrix} \right]

where $d_{ij}$ is the distance between the $i$ th and the $j$ th sample. $D$ is symmetric whenever the distance metric is symmetric, as Euclidean distance is.

2. Add the 2 samples with the largest distance between them to the subset $S$ . Now we face the maximum minimum distance problem: there are 2 samples in $S$ (named $a$ and $b$ ) and $n-2$ samples remaining.
3. Select a sample $c$ from the remaining samples and look up its distances to $a$ and $b$ (nothing needs to be recalculated, because all distances were computed in step 1).
4. The shorter of these two distances is called the "minimum distance of sample $c$ ". For example, if the distance between $c$ and $b$ is shorter than the distance between $c$ and $a$ , then the minimum distance of $c$ is its distance to $b$ .
5. Repeat steps 3 and 4 to obtain the minimum distances of all $n-2$ remaining samples.
6. Among these $n-2$ minimum distances, find the largest one (the maximum minimum distance) and add the corresponding sample to $S$ , as shown in the figure below.
7. Repeat steps 3 to 6, taking the minimum over all samples currently in $S$ , until $S$ contains $k$ samples.

The samples in $S$ are spread uniformly over the data space, so a training set selected this way represents the original set to some extent.
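The selection loop described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the full implementation given later in the article: the function name `ks_select` and the toy data are assumptions made for the example.

```python
import numpy as np
from scipy.spatial.distance import cdist


def ks_select(X, k):
    """Return the indices of k samples chosen by the max-min distance rule."""
    D = cdist(X, X, metric="euclidean")          # full distance matrix
    a, b = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(a), int(b)]                  # the two farthest samples
    remaining = [i for i in range(len(X)) if i not in selected]
    while len(selected) < k:
        # minimum distance from each remaining sample to the selected set
        min_dist = D[np.ix_(selected, remaining)].min(axis=0)
        # the sample with the largest minimum distance joins the subset
        chosen = remaining[int(np.argmax(min_dist))]
        selected.append(chosen)
        remaining.remove(chosen)
    return selected


# toy data: five points on a line (illustrative values only)
X = np.array([[0.0], [1.0], [2.0], [6.0], [10.0]])
print(ks_select(X, 3))  # → [0, 4, 3]
```

The two end points are picked first. The point at 6 comes next because its nearest selected neighbour is 4 units away, farther than for the points at 1 or 2.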

SPXY split

As mentioned at the start, the KS split only considers the distance between samples in the independent variables, while SPXY also brings the dependent variables into the distance calculation. The only difference between KS and SPXY is how the distance is calculated. Assume the sample matrices are

X= \left[ \begin{matrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \\ \end{matrix} \right], \qquad Y= \left[ \begin{matrix} y_{11} & y_{12} & \cdots & y_{1s} \\ y_{21} & y_{22} & \cdots & y_{2s} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{ns} \\ \end{matrix} \right]

The distance formulas for KS and SPXY are shown below:

d_{KS}(i,j)=\sqrt{\sum_{t=1}^{m}(x_{it}-x_{jt})^2}

d_{SPXY}(i,j)=\frac{\sqrt{\sum_{t=1}^{m}(x_{it}-x_{jt})^2}}{\max\limits_{p,q\in [1,n]}\sqrt{\sum_{t=1}^{m}(x_{pt}-x_{qt})^2}} +\frac{\sqrt{\sum_{t=1}^{s}(y_{it}-y_{jt})^2}}{\max\limits_{p,q\in [1,n]}\sqrt{\sum_{t=1}^{s}(y_{pt}-y_{qt})^2}}
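As a small numerical check of the SPXY formula, the sketch below builds the combined distance matrix for three toy samples. The helper name `spxy_distance` and the data are assumptions for illustration. Note that it would divide by zero if all rows of $X$ (or all values of $y$) were identical.

```python
import numpy as np
from scipy.spatial.distance import cdist


def spxy_distance(X, Y):
    """Sum of the x- and y-distance matrices, each scaled by its own maximum."""
    Y = np.asarray(Y, dtype=float).reshape(len(Y), -1)   # accept a 1-D y
    d_x = cdist(X, X, metric="euclidean")
    d_y = cdist(Y, Y, metric="euclidean")
    return d_x / d_x.max() + d_y / d_y.max()


# toy data (illustrative values only)
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
y = np.array([1.0, 2.0, 5.0])
D = spxy_distance(X, y)
print(D)
```

Each term lies in $[0, 1]$ after scaling, so $x$ and $y$ contribute with equal weight and no entry of the combined matrix exceeds 2.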

In fact, the core of both KS and SPXY is the maximum minimum distance split, so other distance metrics can be defined to suit the problem at hand. For example, one study proposed a new distance metric based on cosine similarity^[2].
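To see how the choice of metric changes the distance matrix that the split works on, the sketch below compares Euclidean and cosine distance using `scipy.spatial.distance.cdist`. It only illustrates swapping the metric and is not the method proposed in [2]; the toy data are an assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist

# toy data: samples 0 and 1 point in the same direction, sample 2 is orthogonal
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])

d_euclid = cdist(X, X, metric="euclidean")
d_cosine = cdist(X, X, metric="cosine")   # 1 - cosine similarity

# same direction, different magnitude: far apart in Euclidean terms,
# but identical under cosine distance
print(d_euclid[0, 1], d_cosine[0, 1])
# orthogonal samples have cosine distance 1
print(d_cosine[0, 2])
```

Cosine distance ignores the overall intensity of a spectrum and compares only its shape, which can be useful when intensity varies for reasons unrelated to composition.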

Python implementation

# -*- coding=utf-8 -*-

from __future__ import division, print_function
import numpy as np
from sklearn.model_selection import train_test_split
from scipy.spatial.distance import cdist


def random_split(spectra, test_size=0.25, random_state=None, shuffle=True, stratify=None):
    """Random split implemented with sklearn.model_selection.train_test_split. See
    http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    for more information.
    """
    return train_test_split(
        spectra,
        test_size=test_size,
        random_state=random_state,
        shuffle=shuffle,
        stratify=stratify)


def kennardstone(spectra, test_size=0.25, metric='euclidean', *args, **kwargs):
    """Kennard-Stone sample split method
    Parameters
    ----------
    spectra: ndarray, shape of i x j
        i spectra and j variables (wavelength/wavenumber/Raman shift and so on)
    test_size : float, int
        if float, then round(i x (1-test_size)) spectra are selected as train data, by default 0.25
        if int, then test_size is directly used as test data size
    metric : str, optional
        The distance metric to use, by default 'euclidean'
        See scipy.spatial.distance.cdist for more information
    Returns
    -------
    select_pts: list
        index of selected spectra as train data, index is zero-based
    remaining_pts: list
        index of remaining spectra as test data, index is zero-based
    References
    ----------
    Kennard, R. W., & Stone, L. A. (1969). Computer aided design of experiments.
    Technometrics, 11(1), 137-148. (https://www.jstor.org/stable/1266770)
    """

    # int() keeps train_size an integer even where round() returns a float (Python 2)
    if test_size < 1:
        train_size = int(round(spectra.shape[0] * (1 - test_size)))
    else:
        train_size = spectra.shape[0] - int(round(test_size))

    # the two farthest points are always selected, so at least 2 train samples are needed
    if train_size >= 2:
        # metric is passed positionally so that extra *args cannot collide with it
        distance = cdist(spectra, spectra, metric, *args, **kwargs)
        select_pts, remaining_pts = max_min_distance_split(distance, train_size)
    else:
        raise ValueError("train sample size should be at least 2")

    return select_pts, remaining_pts


def spxy(spectra, yvalues, test_size=0.25, metric='euclidean', *args, **kwargs):
    """SPXY sample split method
    Parameters
    ----------
    spectra: ndarray, shape of i x j
        i spectra and j variables (wavelength/wavenumber/Raman shift and so on)
    yvalues: ndarray, shape of i or i x s
        dependent variable(s) of the i spectra
    test_size : float, int
        if float, then round(i x (1-test_size)) spectra are selected as train data, by default 0.25
        if int, then test_size is directly used as test data size
    metric : str, optional
        The distance metric to use, by default 'euclidean'
        See scipy.spatial.distance.cdist for more information
    Returns
    -------
    select_pts: list
        index of selected spectra as train data, index is zero-based
    remaining_pts: list
        index of remaining spectra as test data, index is zero-based
    References
    ----------
    Galvao et al. (2005). A method for calibration and validation subset partitioning.
    Talanta, 67(4), 736-740. (https://www.sciencedirect.com/science/article/pii/S003991400500192X)
    """

    # int() keeps train_size an integer even where round() returns a float (Python 2)
    if test_size < 1:
        train_size = int(round(spectra.shape[0] * (1 - test_size)))
    else:
        train_size = spectra.shape[0] - int(round(test_size))

    # the two farthest points are always selected, so at least 2 train samples are needed
    if train_size >= 2:
        # make yvalues 2-D so that cdist accepts a single target column
        yvalues = yvalues.reshape(yvalues.shape[0], -1)
        # metric is passed positionally so that extra *args cannot collide with it
        distance_spectra = cdist(spectra, spectra, metric, *args, **kwargs)
        distance_y = cdist(yvalues, yvalues, metric, *args, **kwargs)
        # scale both distance matrices to [0, 1] so x and y carry equal weight
        distance_spectra = distance_spectra / distance_spectra.max()
        distance_y = distance_y / distance_y.max()

        distance = distance_spectra + distance_y
        select_pts, remaining_pts = max_min_distance_split(distance, train_size)
    else:
        raise ValueError("train sample size should be at least 2")

    return select_pts, remaining_pts


def max_min_distance_split(distance, train_size):
    """Sample set split method based on maximum minimum distance, which is the core of the
    Kennard-Stone method
    Parameters
    ----------
    distance : distance matrix
        non-negative real symmetric matrix of a certain distance metric
    train_size : train data sample size
        should be at least 2
    Returns
    -------
    select_pts: list
        index of selected spectra as train data, index is zero-based
    remaining_pts: list
        index of remaining spectra as test data, index is zero-based
    """

    select_pts = []
    remaining_pts = list(range(distance.shape[0]))

    # first select the 2 farthest points (cast to plain int so the returned lists hold one type)
    first_2pts = [int(p) for p in np.unravel_index(np.argmax(distance), distance.shape)]
    select_pts.extend(first_2pts)

    # remove the first 2 points from the remaining list
    remaining_pts.remove(first_2pts[0])
    remaining_pts.remove(first_2pts[1])

    for _ in range(train_size - 2):
        # minimum distance from each remaining point to the already selected points
        min_distance = distance[np.ix_(select_pts, remaining_pts)].min(axis=0)

        # select the remaining point whose minimum distance is the largest
        # (np.argmax returns the first one if several distances are the same)
        point = remaining_pts[int(np.argmax(min_distance))]
        select_pts.append(point)
        remaining_pts.remove(point)
    return select_pts, remaining_pts
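The split functions above return index lists rather than the data themselves. A minimal sketch of turning such lists into train and test arrays follows. The helper `apply_split` is hypothetical, and the index lists are hard-coded stand-ins for the output of `kennardstone` or `spxy`.

```python
import numpy as np


def apply_split(X, y, train_idx, test_idx):
    """Index X and y with the lists returned by a split function."""
    X, y = np.asarray(X), np.asarray(y)
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# toy data (illustrative values only)
X = np.arange(10).reshape(5, 2)
y = np.array([10, 11, 12, 13, 14])

# hypothetical index lists, standing in for kennardstone(X, test_size=0.4)
train_idx, test_idx = [0, 4, 3], [1, 2]

X_train, X_test, y_train, y_test = apply_split(X, y, train_idx, test_idx)
print(X_train.shape, X_test.shape, y_train, y_test)
```

Indexing with `y[train_idx]` rather than `y[train_idx, :]` works whether `y` is one-dimensional or has several target columns.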

[1] Galvao, Roberto Kawakami Harrop, et al. "A method for calibration and validation subset partitioning." Talanta 67.4 (2005): 736-740.
[2] Li, Wenze, et al. "HSPXY: A hybrid‐correlation and diversity‐distances based data partition method." Journal of Chemometrics 33.4 (2019): e3109.

dario-passos commented over 4 years ago

Hi! Congratulations on the functions. They look very useful.
I was wondering, how can we use your kennardstone() function to split both the spectra X and a target variable Y like we do in sklearn train_test_split?

Cheers
Dário

hxhc commented over 4 years ago

Hi Dário

Thank you for your comment. The function is a little different from train_test_split in scikit-learn. In scikit-learn, we can split X and y as below (X and y are numpy arrays):

 X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3)

Here X_train and y_train are the training arrays after the split.

But my kennardstone function does not support such a convenient operation, for two reasons:

1. kennardstone is designed to split on the spectra only. If you think the y data are also important for the sample split, you can consider using the SPXY method.
2. All the functions I implemented above return the indices of the train/test samples, instead of the datasets themselves.

So, the way to use kennardstone to achieve the scikit-learn effect is below:

X_train_index, X_test_index = kennardstone(X, test_size=0.3)
X_train, X_test = X[X_train_index, :], X[X_test_index, :]
y_train, y_test = y[X_train_index], y[X_test_index]

If you have any problems, or find any mistakes, please don't hesitate to contact me.

Hi @hxhc
thanks for the reply. When I asked the question I was thinking in terms of the train_test_split function that returns the samples themselves, not the indices. Probably I didn't pay the correct attention to your code. Now that you told me that your functions return the indices of the splits, it makes perfect sense!
Great work and also thanks for sharing the other links on CNNs...
