Stability¶
Investigate the stability of your clustering algorithm.
Stability Testers¶
-
class
clusterking.stability.stabilitytester.
StabilityTesterResult
[source]¶ Bases:
clusterking.result.AbstractResult
Result of an
AbstractStabilityTester
-
class
clusterking.stability.stabilitytester.
SimpleStabilityTesterResult
(df)[source]¶ Bases:
clusterking.result.AbstractResult
-
classmethod
load
(path: Union[str, pathlib.PurePath]) → clusterking.stability.stabilitytester.SimpleStabilityTesterResult[source]¶ Load
SimpleStabilityTesterResult
from file.
Parameters: path – Path to result file
Returns: SimpleStabilityTesterResult
object
Example:
sstr = SimpleStabilityTesterResult.load("path/to/file")
-
-
class
clusterking.stability.stabilitytester.
AbstractStabilityTester
(exceptions='raise')[source]¶ Bases:
clusterking.worker.AbstractWorker
Abstract baseclass to perform stability tests. This baseclass is a subclass of
clusterking.worker.AbstractWorker
and thereby adheres to the Command design pattern: after initialization, several methods can be called to modify internal settings. Finally, the run()
method is called to perform the actual test.
All current stability tests perform the task at hand (clustering, benchmarking, etc.) on multiple, slightly varied datasets or worker parameters (these runs are called ‘experiments’). For each experiment, figures of merit (FOMs) are calculated that compare the outcome with the original outcome (e.g. how many points still lie in the same cluster, or how far the benchmark points diverge). These FOMs are then written to a
StabilityTesterResult
object, which provides methods for visualization and further analysis (e.g. histograms).
-
__init__
(exceptions='raise')[source]¶ Initialize
AbstractStabilityTester
Parameters: exceptions – What to do if an exception is raised while calculating a FOM: ‘raise’ (raise the exception) or ‘print’ (print the exception information and return None).
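The two exception modes can be pictured with a small, self-contained sketch. It does not use ClusterKinG: `evaluate_fom` and `broken_fom` are hypothetical names invented here, and the library's internal handling may differ.

```python
import sys
from typing import Callable, Optional


def evaluate_fom(fom: Callable[[], float], exceptions: str = "raise") -> Optional[float]:
    """Evaluate a FOM callable, handling failures according to `exceptions`."""
    if exceptions not in ("raise", "print"):
        raise ValueError(f"Unknown exceptions mode: {exceptions!r}")
    try:
        return fom()
    except Exception as exc:
        if exceptions == "raise":
            # Propagate the original exception unchanged
            raise
        # 'print' mode: report the problem and signal failure with None
        print(f"FOM calculation failed: {exc!r}", file=sys.stderr)
        return None


def broken_fom() -> float:
    """A FOM that always fails (for demonstration only)."""
    return 1 / 0


result = evaluate_fom(broken_fom, exceptions="print")
```

In this sketch, `result` is `None` in ‘print’ mode, whereas ‘raise’ mode would let the `ZeroDivisionError` propagate to the caller.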
-
add_fom
(fom: clusterking.stability.fom.FOM) → None[source]¶ Add a figure of merit (FOM).
Parameters: fom – FOM
object
Returns: None
-
run
(*args, **kwargs) → clusterking.stability.stabilitytester.StabilityTesterResult[source]¶ Run the stability test.
Parameters: - *args – Positional arguments
- **kwargs – Keyword arguments
Returns: StabilityTesterResult
object
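The configure-then-run workflow of AbstractStabilityTester can be illustrated with a toy example in plain Python. Everything in it (`ToyStabilityTester`, `threshold_cluster`, `matching_fraction`, the Gaussian perturbation) is a hypothetical stand-in chosen for this sketch, not the library's implementation.

```python
import random
from typing import Callable, Dict, List


def threshold_cluster(values: List[float], cut: float = 0.5) -> List[int]:
    """Toy 'clustering': label each value 0 or 1 depending on a cut."""
    return [int(v > cut) for v in values]


def matching_fraction(original: List[int], varied: List[int]) -> float:
    """Toy FOM: fraction of points that keep their cluster label."""
    same = sum(a == b for a, b in zip(original, varied))
    return same / len(original)


class ToyStabilityTester:
    """Hypothetical tester: configure first, then call run()."""

    def __init__(self) -> None:
        self._foms: Dict[str, Callable[[List[int], List[int]], float]] = {}
        self._repeat = 10
        self._sigma = 0.05

    def add_fom(self, name: str, fom: Callable[[List[int], List[int]], float]) -> None:
        self._foms[name] = fom

    def set_repeat(self, repeat: int) -> None:
        self._repeat = repeat

    def run(self, values: List[float], seed: int = 0) -> List[Dict[str, float]]:
        rng = random.Random(seed)
        original = threshold_cluster(values)
        results = []
        for _ in range(self._repeat):
            # One 'experiment': cluster a slightly varied copy of the data
            varied_values = [v + rng.gauss(0.0, self._sigma) for v in values]
            varied = threshold_cluster(varied_values)
            # Compare the varied outcome with the original outcome
            results.append({name: fom(original, varied) for name, fom in self._foms.items()})
        return results


tester = ToyStabilityTester()
tester.add_fom("matching", matching_fraction)
tester.set_repeat(5)
rows = tester.run([0.1, 0.2, 0.45, 0.55, 0.8, 0.9])
```

Each entry of `rows` holds the FOM values of one experiment, which is the kind of table a result object could then histogram.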
-
-
class
clusterking.stability.noisysamplestability.
NoisySampleStabilityTesterResult
(df, samples=None, **kwargs)[source]¶ Bases:
clusterking.stability.stabilitytester.SimpleStabilityTesterResult
Result of
NoisySampleStabilityTester
-
samples
= None¶ Collected samples
-
-
class
clusterking.stability.noisysamplestability.
NoisySampleResult
(samples: Optional[List[clusterking.data.data.Data]] = None)[source]¶ Bases:
clusterking.result.AbstractResult
-
write
(directory: Union[str, pathlib.PurePath], non_empty='add') → None[source]¶ Write to output directory
Parameters: - directory – Path to directory
- non_empty – What to do if directory is not empty:
raise
(raise FileExistsError),
ignore
(do nothing and potentially overwrite files),
add
(add files with a new name).
Returns: None
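The three non_empty modes could be implemented along the following lines. This is only a sketch using the standard library: `resolve_output_path` and the `_1`, `_2` naming scheme are assumptions made for illustration, and the file names the library actually chooses are not documented here.

```python
import tempfile
from pathlib import Path
from typing import Union


def resolve_output_path(directory: Union[str, Path], filename: str, non_empty: str = "add") -> Path:
    """Return the path to write to, honouring the non_empty policy."""
    if non_empty not in ("raise", "ignore", "add"):
        raise ValueError(f"Unknown non_empty mode: {non_empty!r}")
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    if non_empty == "raise" and any(directory.iterdir()):
        raise FileExistsError(f"Directory {directory} is not empty")
    target = directory / filename
    if non_empty == "add":
        # Find an unused name by appending a running index
        stem, suffix = target.stem, target.suffix
        index = 1
        while target.exists():
            target = directory / f"{stem}_{index}{suffix}"
            index += 1
    # In 'ignore' mode the target is returned even if it already exists
    return target


with tempfile.TemporaryDirectory() as tmp:
    first = resolve_output_path(tmp, "data.txt", non_empty="add")
    first.write_text("sample 0")
    second = resolve_output_path(tmp, "data.txt", non_empty="add")
    second.write_text("sample 1")
    overwritten = resolve_output_path(tmp, "data.txt", non_empty="ignore")
    names = sorted(p.name for p in Path(tmp).iterdir())
    try:
        resolve_output_path(tmp, "data.txt", non_empty="raise")
        raised = False
    except FileExistsError:
        raised = True
```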
-
classmethod
load
(directory: Union[str, pathlib.PurePath], loader: Optional[Callable] = None) → clusterking.stability.noisysamplestability.NoisySampleResult[source]¶ Load from output directory
Parameters: - directory – Path to directory to load from
- loader – Function used to load data (optional).
Example:
def loader(path):
    d = clusterking.DataWithErrors(path)
    d.add_rel_err_uncorr(0.01)
    return d

nsr = NoisySampleResult.load("/path/to/dir/", loader=loader)
-
-
class
clusterking.stability.noisysamplestability.
NoisySample
[source]¶ Bases:
clusterking.worker.AbstractWorker
This stability test generates data samples with slightly varied sample points (by adding
clusterking.scan.Scanner.add_spoints_noise()
to a pre-configured clusterking.scan.Scanner
object).
Example:
import clusterking as ck
from clusterking.stability.noisysamplestability import NoisySample

# Set up data object
d = ck.Data()

# Set up scanner
s = ck.scan.Scanner()
s.set_dfunction(...)
s.set_spoints_equidist(...)

# Set up noisysample object
ns = NoisySample()
ns.set_repeat(1)
ns.set_noise("gauss", mean=0., sigma=1/30/4)

# Run and write
nsr = ns.run(scanner=s, data=d)
nsr.write("output/folder")
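To make the idea of slightly varied sample points concrete, the following sketch adds Gaussian noise to an equidistant one-dimensional grid. It is independent of ClusterKinG: `equidistant` and `add_gauss_noise` are hypothetical helpers, and how the scanner applies noise internally is not shown here.

```python
import random
from typing import List


def equidistant(start: float, stop: float, n: int) -> List[float]:
    """Return n equidistant sample points between start and stop (inclusive)."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def add_gauss_noise(spoints: List[float], mean: float, sigma: float, rng: random.Random) -> List[float]:
    """Return a copy of the sample points with Gaussian noise added to each."""
    return [x + rng.gauss(mean, sigma) for x in spoints]


rng = random.Random(42)
grid = equidistant(0.0, 1.0, 5)
# Three 'experiments', each with its own noisy copy of the grid
samples = [add_gauss_noise(grid, mean=0.0, sigma=1 / 30 / 4, rng=rng) for _ in range(3)]
```

Each element of `samples` plays the role of one experiment's sample points. The original `grid` is left untouched for comparison.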
-
set_repeat
(repeat=10) → None[source]¶ Set number of experiments.
Parameters: repeat – Number of experiments Returns: None
-
set_noise
(*args, **kwargs) → None[source]¶ Configure noise, applied to the spoints in each experiment. See
clusterking.scan.Scanner.add_spoints_noise()
.
Parameters: - *args – Positional arguments to
clusterking.scan.Scanner.add_spoints_noise()
- **kwargs – Keyword arguments to
clusterking.scan.Scanner.add_spoints_noise()
Returns: None
-
run
(scanner: clusterking.scan.scanner.Scanner, data: Optional[clusterking.data.data.Data] = None) → clusterking.stability.noisysamplestability.NoisySampleResult[source]¶ Note
This method handles keyboard interrupts and still returns the data collected so far.
Parameters: - scanner –
Scanner
object
- data –
Data
object. This does not have to contain any actual sample points; it is used so that you can work with data with errors by passing a DataWithErrors
object.
Returns: NoisySampleResult
object
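The interrupt behaviour noted for run() can be sketched as a loop that catches KeyboardInterrupt and returns whatever has been collected. `collect_samples` and `flaky` are hypothetical names used only for this illustration, and the interrupt is simulated rather than triggered with Ctrl+C.

```python
from typing import Callable, List


def collect_samples(make_sample: Callable[[int], float], repeat: int) -> List[float]:
    """Run up to `repeat` experiments; on Ctrl+C, keep what was collected."""
    samples: List[float] = []
    try:
        for i in range(repeat):
            samples.append(make_sample(i))
    except KeyboardInterrupt:
        print(f"Interrupted after {len(samples)} of {repeat} experiments.")
    return samples


def flaky(i: int) -> float:
    """Stand-in for an expensive experiment; simulates Ctrl+C on the 4th run."""
    if i == 3:
        raise KeyboardInterrupt
    return i * 0.5


partial = collect_samples(flaky, repeat=10)
```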
-
-
class
clusterking.stability.noisysamplestability.
NoisySampleStabilityTester
(*args, keep_samples=False, **kwargs)[source]¶ Bases:
clusterking.stability.stabilitytester.AbstractStabilityTester
This stability test generates data samples with slightly varied sample points (by adding
clusterking.scan.Scanner.add_spoints_noise()
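to a pre-configured clusterking.scan.Scanner
object) and compares the resulting clusters and benchmark points.
Example:
import clusterking as ck
from clusterking.stability.fom import DeltaNClusters
from clusterking.stability.noisysamplestability import NoisySampleResult, NoisySampleStabilityTester

# load() is a classmethod that returns the loaded result
nsr = NoisySampleResult.load("/path/to/samples/")

c = ck.cluster.HierarchyCluster()
c.set_metric()
c.set_max_d(0.2)

nsst = NoisySampleStabilityTester()
nsst.add_fom(DeltaNClusters(name="DeltaNClusters"))
r = nsst.run(sample=nsr, cluster=c)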
-
__init__
(*args, keep_samples=False, **kwargs)[source]¶ Initialize
NoisySampleStabilityTester
Parameters: - *args – Arguments passed on to
AbstractStabilityTester
- keep_samples – Save clustered/benchmarked samples to
NoisySampleStabilityTester.samples
- **kwargs – Keyword arguments passed on to
AbstractStabilityTester
-
run
(sample: clusterking.stability.noisysamplestability.NoisySampleResult, cluster: Optional[clusterking.cluster.cluster.Cluster] = None, benchmark: Optional[clusterking.benchmark.abstract_benchmark.AbstractBenchmark] = None) → clusterking.stability.noisysamplestability.NoisySampleStabilityTesterResult[source]¶ Run stability test.
Parameters: - sample –
NoisySampleResult
- cluster –
Cluster
object
- benchmark – Optional:
AbstractBenchmark
object
Returns: NoisySampleStabilityTesterResult
object
-
-
class
clusterking.stability.subsamplestability.
SubSampleStabilityTesterResult
(df)[source]¶ Bases:
clusterking.stability.stabilitytester.SimpleStabilityTesterResult
-
class
clusterking.stability.subsamplestability.
SubSampleStabilityTester
[source]¶ Bases:
clusterking.stability.stabilitytester.AbstractStabilityTester
Test the stability of clustering algorithms by repeatedly clustering subsamples of data.
Example:
import clusterking as ck
from clusterking.stability.subsamplestability import SubSampleStabilityTester

ssst = SubSampleStabilityTester()
ssst.set_sampling(frac=0.99)
ssst.set_repeat(50)

# path: location of the data to load
d = ck.Data(path)

c = ck.cluster.HierarchyCluster()
c.set_metric("euclidean")
c.set_max_d(0.2)
c.run(data=d).write()

b = ck.Benchmark()
b.set_metric("euclidean")
b.run(data=d).write()

ssstr = ssst.run(data=d, cluster=c, benchmark=b)
-
set_sampling
(**kwargs) → None[source]¶ Configure the subsampling of the data. If benchmarking is performed, none of the benchmark points of the original dataframe are removed during subsampling, so that the benchmarking results can be compared.
Parameters: **kwargs – Keyword arguments to clusterking.data.Data.sample_param_random()
, in particular keyword arguments to pandas.DataFrame.sample()
.
Returns: None
Example:
ssst.set_sampling(n=100)     # Sample 100 points
ssst.set_sampling(frac=0.9)  # Sample 90% of the points
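One way to subsample while guaranteeing that the benchmark points survive is sketched below in plain Python. `subsample_keeping` is a hypothetical helper, and the exact strategy used by the library (for example, how the kept points count towards the requested fraction) is an assumption here.

```python
import random
from typing import List, Set


def subsample_keeping(indices: List[int], keep: Set[int], frac: float, seed: int = 0) -> List[int]:
    """Randomly keep about frac of the indices, never dropping those in `keep`.

    Assumes that every element of `keep` is contained in `indices`.
    """
    rng = random.Random(seed)
    n_target = round(frac * len(indices))
    removable = [i for i in indices if i not in keep]
    # The protected points count towards the target size
    n_extra = min(max(n_target - len(keep), 0), len(removable))
    chosen = set(rng.sample(removable, n_extra)) | keep
    # Preserve the original ordering of the indices
    return [i for i in indices if i in chosen]


all_points = list(range(20))
bpoints = {3, 11}  # indices of the benchmark points
sub = subsample_keeping(all_points, bpoints, frac=0.5)
```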
-
set_repeat
(repeat=100) → None[source]¶ Parameters: repeat – Number of subsamples to test Returns: None
-
set_progress_bar
(state=True) → None[source]¶ Set or unset progress bar.
Parameters: state – Bool: Display progress bar? Returns: None
-
-
class
clusterking.stability.subsamplestability.
SubSampleStabilityVsFractionResult
(df)[source]¶ Bases:
clusterking.stability.stabilitytester.SimpleStabilityTesterResult
-
class
clusterking.stability.subsamplestability.
SubSampleStabilityVsFraction
[source]¶ Bases:
object
Repeatedly run
SubSampleStabilityTester
for different fractions.
Figures of Merit¶
- class
clusterking.stability.fom.
FOMResult
(fom, name)[source]¶Bases:
clusterking.result.AbstractResult
Object containing the result of a Figure of Merit (FOM), represented by a
FOM
object.
- class
clusterking.stability.fom.
FOM
(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]¶Bases:
clusterking.worker.AbstractWorker
Figure of Merit, comparing the outcome of two experiments (e.g. the clusters of two very similar datasets).
__init__
(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]¶Initialize the FOM worker.
Parameters:
- name – Name of the FOM
- preprocessor –
Preprocessor
object
name
¶Name of the FOM
preprocessor
¶
- class
clusterking.stability.fom.
CCFOM
(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]¶Bases:
clusterking.stability.fom.FOM
Cluster Comparison figure of merit (CCFOM), comparing whether the clusters of two experiments match.
- class
clusterking.stability.fom.
MatchingClusters
(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]¶Bases:
clusterking.stability.fom.CCFOM
Fraction of sample points (spoints) that lie in the same cluster, when comparing two clustered datasets with the same number of sample points.
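As an illustration of this figure of merit, the sketch below computes the matching fraction for two clusterings given as plain dictionaries (sample point name to cluster name). `matching_clusters` is a hypothetical function, and it assumes that cluster names have already been matched between the two datasets.

```python
from typing import Dict, Hashable


def matching_clusters(clusters1: Dict[Hashable, int], clusters2: Dict[Hashable, int]) -> float:
    """Fraction of sample points assigned to the same cluster in both datasets."""
    if not clusters1:
        raise ValueError("Need at least one sample point")
    if clusters1.keys() != clusters2.keys():
        raise ValueError("Both datasets must contain the same sample points")
    same = sum(clusters1[point] == clusters2[point] for point in clusters1)
    return same / len(clusters1)


first = {"p0": 0, "p1": 0, "p2": 1, "p3": 1}
second = {"p0": 0, "p1": 1, "p2": 1, "p3": 1}
fraction = matching_clusters(first, second)  # 3 of 4 points agree
```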
- class
clusterking.stability.fom.
DeltaNClusters
(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]¶Bases:
clusterking.stability.fom.CCFOM
Difference in the number of clusters between two experiments (number of clusters in experiment 1 - number of clusters in experiment 2).
- class
clusterking.stability.fom.
NClusters
(which, **kwargs)[source]¶Bases:
clusterking.stability.fom.CCFOM
Number of clusters in dataset 1 or 2
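The two cluster-counting figures of merit reduce to counting distinct cluster names. A minimal sketch with hypothetical helper functions (`n_clusters`, `delta_n_clusters`) operating on plain lists of cluster labels:

```python
from typing import List


def n_clusters(labels: List[int]) -> int:
    """Number of distinct clusters in one dataset."""
    return len(set(labels))


def delta_n_clusters(labels1: List[int], labels2: List[int]) -> int:
    """Clusters in experiment 1 minus clusters in experiment 2."""
    return n_clusters(labels1) - n_clusters(labels2)


experiment1 = [0, 0, 1, 2]
experiment2 = [0, 1, 1, 1]
delta = delta_n_clusters(experiment1, experiment2)  # 3 - 2
```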
- class
clusterking.stability.fom.
BpointList
(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]¶Bases:
clusterking.stability.fom.FOM
Adds array of bpoint coordinates of data2
- class
clusterking.stability.fom.
BMFOM
(name: Optional[str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]¶Bases:
clusterking.stability.fom.FOM
Abstract class: Benchmark Figure of Merit (BMFOM), comparing whether the benchmark points of two experiments match.
- class
clusterking.stability.fom.
AverageBMProximityFOM
(*args, **kwargs)[source]¶Bases:
clusterking.stability.fom.BMFOM
Returns the average distance of benchmark points in parameter space between two experiments.
named_averaging_fcts
= dict_keys(['arithmetic', 'max'])¶
named_metric_fcts
= dict_keys(['euclidean'])¶
__init__
(*args, **kwargs)[source]¶Initialize the FOM worker.
Parameters: See
FOM.__init__()
set_averaging
(fct: Union[str, Callable]) → None[source]¶Set averaging mode
Parameters: fct – Function of the distances between benchmark points of the same cluster, or the name of a pre-implemented function (check named_averaging_fcts
for a list)
Returns: None
set_metric
(fct: Union[str, Callable]) → None[source]¶Set metric in parameter space
Parameters: fct – Function of a tuple of two points in parameter space, or the name of a pre-implemented function (check named_metric_fcts
for a list)
Returns: None
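How the averaging and metric settings interact can be sketched in plain Python. The function `average_bm_proximity` and the two lookup tables are hypothetical. They mirror the named options listed above ('arithmetic', 'max', 'euclidean') but are not the library's code, and they assume cluster names were already matched between the experiments.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple, Union

Point = Sequence[float]

# Named options, modelled on named_averaging_fcts and named_metric_fcts
AVERAGING: Dict[str, Callable[[List[float]], float]] = {
    "arithmetic": lambda distances: sum(distances) / len(distances),
    "max": max,
}
METRICS: Dict[str, Callable[[Tuple[Point, Point]], float]] = {
    "euclidean": lambda pair: math.dist(pair[0], pair[1]),
}


def average_bm_proximity(
    bpoints1: Dict[int, Point],
    bpoints2: Dict[int, Point],
    averaging: Union[str, Callable[[List[float]], float]] = "arithmetic",
    metric: Union[str, Callable[[Tuple[Point, Point]], float]] = "euclidean",
) -> float:
    """Average distance between benchmark points of the same cluster."""
    avg = AVERAGING[averaging] if isinstance(averaging, str) else averaging
    met = METRICS[metric] if isinstance(metric, str) else metric
    shared = sorted(set(bpoints1) & set(bpoints2))
    # The metric receives a tuple of two points in parameter space
    distances = [met((bpoints1[c], bpoints2[c])) for c in shared]
    return avg(distances)


bp1 = {0: (0.0, 0.0), 1: (1.0, 1.0)}
bp2 = {0: (3.0, 4.0), 1: (1.0, 1.0)}
mean_distance = average_bm_proximity(bp1, bp2)
max_distance = average_bm_proximity(bp1, bp2, averaging="max")
```

Passing a callable instead of a name works the same way, for example `averaging=min` to obtain the smallest distance.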
Preprocessors¶
- class
clusterking.stability.preprocessor.
Preprocessor
(name=None)[source]¶Bases:
clusterking.worker.AbstractWorker
name
¶
- class
clusterking.stability.preprocessor.
ClusterMatcherResult
(data1, data2, rename_dct)[source]¶Bases:
clusterking.stability.preprocessor.PreprocessorResult
- class
clusterking.stability.preprocessor.
ClusterMatcher
(*args, cluster_column='cluster', **kwargs)[source]¶Bases:
clusterking.stability.preprocessor.Preprocessor
Cluster names are arbitrary in general, i.e. when comparing two clustered datasets in order to calculate a figure of merit, we first have to match the cluster names. This is done by this worker class.
- class
clusterking.stability.preprocessor.
TrivialClusterMatcher
(*args, cluster_column='cluster', **kwargs)[source]¶Bases:
clusterking.stability.preprocessor.ClusterMatcher
This subclass of
ClusterMatcher
maps cluster names from the first clustering to the cluster name of the second that maximizes the number of sample points that lie in the same cluster. It also only returns the intersection of the indices of both Series.
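The matching rule described here can be sketched with a counter over pairs of cluster names. `match_cluster_names` is a hypothetical function working on plain dictionaries (index to cluster name) instead of pandas Series, and the tie-breaking behaviour is an arbitrary choice of this sketch.

```python
from collections import Counter
from typing import Dict, List, Tuple


def match_cluster_names(
    clusters1: Dict[int, int], clusters2: Dict[int, int]
) -> Tuple[Dict[int, int], List[int]]:
    """Map each cluster name of clustering 1 to the best-overlapping name of clustering 2.

    Only indices present in both clusterings are considered.
    """
    shared = sorted(set(clusters1) & set(clusters2))
    # Count how often each pair (name in 1, name in 2) occurs
    overlap = Counter((clusters1[i], clusters2[i]) for i in shared)
    rename: Dict[int, int] = {}
    for name1 in sorted({clusters1[i] for i in shared}):
        candidates = {n2: count for (n1, n2), count in overlap.items() if n1 == name1}
        rename[name1] = max(candidates, key=candidates.get)
    return rename, shared


first = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}   # index 5 is missing in `second`
second = {0: 1, 1: 1, 2: 0, 3: 0, 4: 0, 6: 0}  # index 6 is missing in `first`
rename, shared = match_cluster_names(first, second)
```

Note that nothing in this simple rule prevents two clusters of the first clustering from being mapped to the same cluster of the second.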
- class
clusterking.stability.preprocessor.
FirstComeFirstServe1DClusterMatcher
(*args, cluster_column='cluster', **kwargs)[source]¶Bases:
clusterking.stability.preprocessor.ClusterMatcher
This subclass of
ClusterMatcher
works only for 1D parameter spaces. It simply sorts the first points of each cluster and enumerates them in order to get a unique name for each cluster.
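A possible reading of this scheme is sketched below: take the first (here assumed to mean smallest) parameter value of each cluster, sort the clusters by it, and number them in that order. `canonical_names_1d` is a hypothetical helper, and interpreting "first point" as the smallest value is an assumption.

```python
from typing import Dict, List


def canonical_names_1d(points: List[float], labels: List[int]) -> Dict[int, int]:
    """Map arbitrary cluster names to 0, 1, 2, ... ordered by each cluster's first point."""
    first_point: Dict[int, float] = {}
    for x, label in zip(points, labels):
        if label not in first_point or x < first_point[label]:
            first_point[label] = x
    # Sort cluster names by the position of their first point
    ordered = sorted(first_point, key=first_point.get)
    return {old: new for new, old in enumerate(ordered)}


points = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
labels_a = [7, 7, 3, 3, 5, 5]  # same partition as labels_b, different names
labels_b = [1, 1, 0, 0, 2, 2]

map_a = canonical_names_1d(points, labels_a)
map_b = canonical_names_1d(points, labels_b)
renamed_a = [map_a[label] for label in labels_a]
renamed_b = [map_b[label] for label in labels_b]
```

After renaming, both clusterings use identical names, so they can be compared point by point.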