Stability

Investigate the stability of your clustering algorithm.

Stability Testers

class clusterking.stability.stabilitytester.StabilityTesterResult[source]

Bases: clusterking.result.AbstractResult

Result of an AbstractStabilityTester

class clusterking.stability.stabilitytester.SimpleStabilityTesterResult(df: pandas.DataFrame)[source]

Bases: clusterking.result.AbstractResult

__init__(df: pandas.DataFrame)[source]
write(path: Union[str, pathlib.PurePath]) → None[source]

Save to file.

classmethod load(path: Union[str, pathlib.PurePath]) → clusterking.stability.stabilitytester.SimpleStabilityTesterResult[source]

Load SimpleStabilityTesterResult from file.

Parameters:path – Path to result file
Returns:SimpleStabilityTesterResult object

Example

sstr = SimpleStabilityTesterResult.load("path/to/file")

class clusterking.stability.stabilitytester.AbstractStabilityTester(exceptions='raise')[source]

Bases: clusterking.worker.AbstractWorker

Abstract base class to perform stability tests. This base class is a subclass of clusterking.worker.AbstractWorker and thereby adheres to the Command design pattern: after initialization, several methods can be called to modify internal settings; finally, the run() method is called to perform the actual test.

All current stability tests perform the task at hand (clustering, benchmarking, etc.) for multiple, slightly varied datasets or worker parameters; these runs are called 'experiments'. For each experiment, figures of merit (FOMs) are calculated that compare the outcome with the original outcome (e.g. how many points still lie in the same cluster, or how far the benchmark points diverge). These FOMs are then written out to a StabilityTesterResult object, which provides methods for visualization and further analyses (e.g. histograms).
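
Example (a minimal sketch of this pattern, using the concrete SubSampleStabilityTester documented below; d and c stand for pre-configured Data and Cluster objects):

from clusterking.stability.subsamplestability import SubSampleStabilityTester
from clusterking.stability.fom import MatchingClusters

tester = SubSampleStabilityTester()      # initialize
tester.set_repeat(50)                    # modify internal settings
tester.add_fom(MatchingClusters())       # register a figure of merit
result = tester.run(data=d, cluster=c)   # perform the actual test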

__init__(exceptions='raise')[source]

Initialize AbstractStabilityTester

Parameters:exceptions – What to do if an exception arises while calculating a FOM: 'raise' (raise the exception) or 'print' (return None and print the exception information).
add_fom(fom: clusterking.stability.fom.FOM) → None[source]

Add a figure of merit (FOM).

Parameters:fom – FOM object
Returns:None
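
Example (a sketch; tester stands for any concrete AbstractStabilityTester subclass, and the FOM names are illustrative):

from clusterking.stability.fom import MatchingClusters, DeltaNClusters

tester.add_fom(MatchingClusters(name="MatchingClusters"))
tester.add_fom(DeltaNClusters(name="DeltaNClusters"))
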
run(*args, **kwargs) → clusterking.stability.stabilitytester.StabilityTesterResult[source]

Run the stability test.

Parameters:
  • *args – Positional arguments
  • **kwargs – Keyword arguments
Returns:

StabilityTesterResult object

class clusterking.stability.noisysamplestability.NoisySampleStabilityTesterResult(df, samples=None, **kwargs)[source]

Bases: clusterking.stability.stabilitytester.SimpleStabilityTesterResult

Result of NoisySampleStabilityTester

__init__(df, samples=None, **kwargs)[source]
samples = None

Collected samples

class clusterking.stability.noisysamplestability.NoisySampleResult(samples: Optional[List[clusterking.data.data.Data]] = None)[source]

Bases: clusterking.result.AbstractResult

__init__(samples: Optional[List[clusterking.data.data.Data]] = None)[source]
write(directory: Union[str, pathlib.PurePath], non_empty='add') → None[source]

Write to output directory.

Parameters:
  • directory – Path to directory
  • non_empty – What to do if the directory is not empty: 'raise' (raise FileExistsError), 'ignore' (do nothing and potentially overwrite files), 'add' (add files under new names).
Returns:

None
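
Example (a sketch; the directory name is illustrative):

nsr.write("output/samples", non_empty="raise")  # fail if the directory is not empty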

classmethod load(directory: Union[str, pathlib.PurePath], loader: Optional[Callable] = None) → clusterking.stability.noisysamplestability.NoisySampleResult[source]

Load from output directory.

Parameters:
  • directory – Path to directory to load from
  • loader – Function used to load data (optional).

Example:

import clusterking

def loader(path):
    d = clusterking.DataWithErrors(path)
    d.add_rel_err_uncorr(0.01)
    return d

nsr = NoisySampleResult.load("/path/to/dir/", loader=loader)
class clusterking.stability.noisysamplestability.NoisySample[source]

Bases: clusterking.worker.AbstractWorker

This stability test generates data samples with slightly varied sample points (by applying clusterking.scan.Scanner.add_spoints_noise() to a pre-configured clusterking.scan.Scanner object).

Example:

import clusterking as ck
from clusterking.stability.noisysamplestability import NoisySample

# Set up data object
d = ck.Data()

# Set up scanner
s = ck.scan.Scanner()
s.set_dfunction(...)
s.set_spoints_equidist(...)

# Set up noisysample object
ns = NoisySample()
ns.set_repeat(1)
ns.set_noise("gauss", mean=0., sigma=1/30/4)

# Run and write
nsr = ns.run(scanner=s, data=d)
nsr.write("output/folder")
__init__()[source]
set_repeat(repeat=10) → None[source]

Set number of experiments.

Parameters:repeat – Number of experiments
Returns:None
set_noise(*args, **kwargs) → None[source]

Configure noise, applied to the spoints in each experiment. See clusterking.scan.Scanner.add_spoints_noise().

Parameters:
  • *args – Positional arguments passed on to clusterking.scan.Scanner.add_spoints_noise()
  • **kwargs – Keyword arguments passed on to clusterking.scan.Scanner.add_spoints_noise()
Returns:

None

run(scanner: clusterking.scan.scanner.Scanner, data: Optional[clusterking.data.data.Data] = None) → clusterking.stability.noisysamplestability.NoisySampleResult[source]

Note

This method handles keyboard interrupts and still returns the data collected so far.

Parameters:
  • scanner – Scanner object
  • data – Data object. This does not have to contain any actual sample points, but is used so that you can use data with errors by passing a DataWithErrors object.
Returns:

NoisySampleResult.

class clusterking.stability.noisysamplestability.NoisySampleStabilityTester(*args, keep_samples=False, **kwargs)[source]

Bases: clusterking.stability.stabilitytester.AbstractStabilityTester

This stability test generates data samples with slightly varied sample points (by applying clusterking.scan.Scanner.add_spoints_noise() to a pre-configured clusterking.scan.Scanner object) and compares the resulting clusters and benchmark points.

Example:

import clusterking as ck
from clusterking.stability.fom import DeltaNClusters
from clusterking.stability.noisysamplestability import (
    NoisySampleResult,
    NoisySampleStabilityTester,
)

nsr = NoisySampleResult.load("/path/to/samples/")

c = ck.cluster.HierarchyCluster()
c.set_metric()
c.set_max_d(0.2)

nsst = NoisySampleStabilityTester()
nsst.add_fom(DeltaNClusters(name="DeltaNClusters"))
r = nsst.run(sample=nsr, cluster=c)
__init__(*args, keep_samples=False, **kwargs)[source]

Initialize NoisySampleStabilityTester

Parameters:
  • keep_samples – Keep the Data object of each experiment in the result (available as the samples attribute of the NoisySampleStabilityTesterResult; default False)
  • *args/**kwargs – Passed on to AbstractStabilityTester
run(sample: clusterking.stability.noisysamplestability.NoisySampleResult, cluster: Optional[clusterking.cluster.cluster.Cluster] = None, benchmark: Optional[clusterking.benchmark.abstract_benchmark.AbstractBenchmark] = None) → clusterking.stability.noisysamplestability.NoisySampleStabilityTesterResult[source]

Run stability test.

Parameters:
  • sampleNoisySampleResult
  • clusterCluster object
  • benchmark – Optional: Benchmark object
Returns:

NoisySampleStabilityTesterResult object

class clusterking.stability.subsamplestability.SubSampleStabilityTesterResult(df: pandas.DataFrame)[source]

Bases: clusterking.stability.stabilitytester.SimpleStabilityTesterResult

class clusterking.stability.subsamplestability.SubSampleStabilityTester[source]

Bases: clusterking.stability.stabilitytester.AbstractStabilityTester

Test the stability of clustering algorithms by repeatedly clustering subsamples of data.

Example:

import clusterking as ck
from clusterking.stability.subsamplestability import SubSampleStabilityTester

ssst = SubSampleStabilityTester()
ssst.set_sampling(frac=0.99)
ssst.set_repeat(50)

d = ck.Data(path)

c = ck.cluster.HierarchyCluster()
c.set_metric("euclidean")
c.set_max_d(0.2)
c.run(data=d).write()

b = ck.Benchmark()
b.set_metric("euclidean")
b.run(data=d).write()

ssstr = ssst.run(data=d, cluster=c, benchmark=b)
__init__()[source]
set_sampling(**kwargs) → None[source]

Configure the subsampling of the data. If benchmarking is performed, it is ensured that none of the benchmark points of the original dataframe are removed during subsampling (so that the benchmarking results remain comparable).

Parameters:**kwargs – Keyword arguments to clusterking.data.Data.sample_param_random(), in particular keyword arguments to pandas.DataFrame.sample().
Returns:None

Example:

ssst.set_sampling(n=100)     # Sample 100 points
ssst.set_sampling(frac=0.9)  # Sample 90% of the points
set_repeat(repeat=100) → None[source]
Parameters:repeat – Number of subsamples to test
Returns:None
set_progress_bar(state=True) → None[source]

Set or unset progress bar.

Parameters:state – Bool: Display progress bar?
Returns:None
run(data: clusterking.data.data.Data, cluster: clusterking.cluster.cluster.Cluster, benchmark: Optional[clusterking.benchmark.abstract_benchmark.AbstractBenchmark] = None) → clusterking.stability.subsamplestability.SubSampleStabilityTesterResult[source]

Run test.

Parameters:
  • dataData object
  • cluster – Pre-configured Cluster object
  • benchmark – Optional: Benchmark object
Returns:

SubSampleStabilityTesterResult object

class clusterking.stability.subsamplestability.SubSampleStabilityVsFractionResult(df: pandas.DataFrame)[source]

Bases: clusterking.stability.stabilitytester.SimpleStabilityTesterResult

class clusterking.stability.subsamplestability.SubSampleStabilityVsFraction[source]

Bases: object

Repeatedly run SubSampleStabilityTester for different fractions.

__init__()[source]
run(data: clusterking.data.data.Data, cluster: clusterking.cluster.cluster.Cluster, ssst: clusterking.stability.subsamplestability.SubSampleStabilityTester, fractions: Iterable[float])[source]
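
Example (a sketch, reusing the pre-configured d, c and ssst objects from the SubSampleStabilityTester example above; the fractions are illustrative):

sssvf = SubSampleStabilityVsFraction()
r = sssvf.run(data=d, cluster=c, ssst=ssst, fractions=[0.8, 0.9, 0.95, 0.99])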

Figures of Merit

class clusterking.stability.fom.FOMResult(fom, name)[source]

Bases: clusterking.result.AbstractResult

Object containing the result of a Figure of Merit (FOM) calculation, as performed by a FOM object.

__init__(fom, name)[source]
class clusterking.stability.fom.FOM(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]

Bases: clusterking.worker.AbstractWorker

Figure of Merit, comparing the outcome of two experiments (e.g. the clusters of two very similar datasets).

__init__(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]

Initialize the FOM worker.

Parameters:
  • name – Name of the FOM
  • preprocessorPreprocessor object
name

Name of the FOM

set_name(value: str)[source]
preprocessor
set_preprocessor(preprocessor: clusterking.stability.preprocessor.Preprocessor)[source]
run(data1: clusterking.data.data.Data, data2: clusterking.data.data.Data) → clusterking.stability.fom.FOMResult[source]

Calculate figure of merit.

Parameters:
  • data1 – “original” Data object
  • data2 – “other” Data object
Returns:

FOMResult object
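
Example (a sketch; d1 and d2 stand for two clustered Data objects, e.g. from two experiments):

from clusterking.stability.fom import MatchingClusters
from clusterking.stability.preprocessor import TrivialClusterMatcher

# Match cluster names before comparing, then compute the fraction of
# sample points that stayed in the same cluster
fom = MatchingClusters(preprocessor=TrivialClusterMatcher())
result = fom.run(data1=d1, data2=d2)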

class clusterking.stability.fom.CCFOM(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]

Bases: clusterking.stability.fom.FOM

Cluster Comparison figure of merit (CCFOM), comparing whether the clusters of two experiments match.

class clusterking.stability.fom.MatchingClusters(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]

Bases: clusterking.stability.fom.CCFOM

Fraction of sample points (spoints) that lie in the same cluster, when comparing two clustered datasets with the same number of sample points.

class clusterking.stability.fom.DeltaNClusters(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]

Bases: clusterking.stability.fom.CCFOM

Difference of the number of clusters between two experiments (number of clusters in experiment 1 minus number of clusters in experiment 2).

class clusterking.stability.fom.NClusters(which, **kwargs)[source]

Bases: clusterking.stability.fom.CCFOM

Number of clusters in dataset 1 or 2

__init__(which, **kwargs)[source]
Parameters:
  • which – 1 or 2 for dataset 1 or dataset 2
  • **kwargs – Keyword arguments for CCFOM
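
Example (a sketch; the name is illustrative):

from clusterking.stability.fom import NClusters

n1 = NClusters(which=1, name="nclusters1")  # number of clusters in dataset 1
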
class clusterking.stability.fom.BpointList(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]

Bases: clusterking.stability.fom.FOM

Returns the array of benchmark point (bpoint) coordinates of data2.

class clusterking.stability.fom.BMFOM(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]

Bases: clusterking.stability.fom.FOM

Abstract class: Benchmark Figure of Merit (BMFOM), comparing whether the benchmark points of two experiments match.

class clusterking.stability.fom.AverageBMProximityFOM(*args, **kwargs)[source]

Bases: clusterking.stability.fom.BMFOM

Returns the average distance of benchmark points in parameter space between two experiments.

named_averaging_fcts = dict_keys(['arithmetic', 'max'])
named_metric_fcts = dict_keys(['euclidean'])
__init__(*args, **kwargs)[source]

Initialize the FOM worker.

Parameters: See FOM.__init__()

set_averaging(fct: Union[str, Callable]) → None[source]

Set averaging mode

Parameters:fct – Function of the distances between benchmark points of the same cluster, or the name of a pre-implemented function (see named_averaging_fcts for a list)
Returns:None
set_metric(fct: Union[str, Callable]) → None[source]

Set metric in parameter space

Parameters:fct – Function of a tuple of two points in parameter space, or the name of a pre-implemented function (see named_metric_fcts for a list)
Returns:None
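
Example (a sketch using the pre-implemented function names listed above; the FOM name is illustrative):

from clusterking.stability.fom import AverageBMProximityFOM

fom = AverageBMProximityFOM(name="bm_proximity")
fom.set_averaging("max")       # take the maximum of the benchmark point distances
fom.set_metric("euclidean")    # distance measure in parameter space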

Preprocessors

class clusterking.stability.preprocessor.PreprocessorResult(data1, data2)[source]

Bases: clusterking.result.AbstractResult

__init__(data1, data2)[source]
class clusterking.stability.preprocessor.Preprocessor(name=None)[source]

Bases: clusterking.worker.AbstractWorker

__init__(name=None)[source]
name
run(data1: clusterking.data.data.Data, data2: clusterking.data.data.Data) → clusterking.stability.preprocessor.PreprocessorResult[source]

Run.

Parameters:
  • data1 – “original” Data object
  • data2 – “other” Data object
Returns:

PreprocessorResult

class clusterking.stability.preprocessor.ClusterMatcherResult(data1, data2, rename_dct)[source]

Bases: clusterking.stability.preprocessor.PreprocessorResult

__init__(data1, data2, rename_dct)[source]
class clusterking.stability.preprocessor.ClusterMatcher(*args, cluster_column='cluster', **kwargs)[source]

Bases: clusterking.stability.preprocessor.Preprocessor

Cluster names are arbitrary in general, i.e. when comparing two clustered datasets to calculate a figure of merit, the cluster names first have to be matched up. This is done by this worker class.

__init__(*args, cluster_column='cluster', **kwargs)[source]
run(data1: clusterking.data.data.Data, data2: clusterking.data.data.Data) → clusterking.stability.preprocessor.ClusterMatcherResult[source]
Parameters:
  • data1 – “original” Data object
  • data2 – “other” Data object
Returns:

ClusterMatcherResult

class clusterking.stability.preprocessor.TrivialClusterMatcher(*args, cluster_column='cluster', **kwargs)[source]

Bases: clusterking.stability.preprocessor.ClusterMatcher

This subclass of ClusterMatcher maps each cluster name of the first clustering to the cluster name of the second clustering that maximizes the number of sample points lying in the same cluster. It also returns only the intersection of the indices of both Series.

run(data1: clusterking.data.data.Data, data2: clusterking.data.data.Data) → clusterking.stability.preprocessor.ClusterMatcherResult[source]
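
Example (a sketch; d1 and d2 stand for two clustered Data objects):

from clusterking.stability.preprocessor import TrivialClusterMatcher

matcher = TrivialClusterMatcher()
res = matcher.run(data1=d1, data2=d2)
# res.data2 now carries cluster names matched to those of res.data1
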
class clusterking.stability.preprocessor.FirstComeFirstServe1DClusterMatcher(*args, cluster_column='cluster', **kwargs)[source]

Bases: clusterking.stability.preprocessor.ClusterMatcher

This subclass of ClusterMatcher works only for 1D parameter spaces. It sorts the first point of each cluster and enumerates the clusters in that order, giving each cluster a unique name.

run(data1: clusterking.data.data.Data, data2: clusterking.data.data.Data) → clusterking.stability.preprocessor.ClusterMatcherResult[source]