Stability
Stability#
Investigate the stability of your clustering algorithm.
Stability Testers#
- class clusterking.stability.stabilitytester.StabilityTesterResult[source]#
Bases:
clusterking.result.AbstractResult
Result of an AbstractStabilityTester.
- class clusterking.stability.stabilitytester.SimpleStabilityTesterResult(df: pandas.core.frame.DataFrame)[source]#
Bases:
clusterking.result.AbstractResult
- classmethod load(path: Union[str, pathlib.PurePath]) clusterking.stability.stabilitytester.SimpleStabilityTesterResult[source]#
Load SimpleStabilityTesterResult from file.
- Parameters
path – Path to result file
- Returns
SimpleStabilityTesterResult object
Example
sstr = SimpleStabilityTesterResult.load("path/to/file")
- class clusterking.stability.stabilitytester.AbstractStabilityTester(exceptions='raise')[source]#
Bases:
clusterking.worker.AbstractWorker
Abstract base class to perform stability tests. This base class is a subclass of clusterking.worker.AbstractWorker and thereby adheres to the Command design pattern: After initialization, several methods can be called to modify internal settings. Finally, the run() method is called to perform the actual test.
All current stability tests perform the task at hand (clustering, benchmarking, etc.) for multiple, slightly varied datasets or worker parameters (these runs are called 'experiments'). For each experiment, figures of merit (FOMs) are calculated that compare the outcome with the original outcome (e.g. how many points still lie in the same cluster, or how far the benchmark points diverge). These FOMs are then written to a StabilityTesterResult object, which provides methods for visualization and further analysis (e.g. histograms).
- __init__(exceptions='raise')[source]#
Initialize AbstractStabilityTester
- Parameters
exceptions – What to do if an exception arises while calculating a FOM. 'raise': Raise the exception; 'print': Return None and print the exception information.
- add_fom(fom: clusterking.stability.fom.FOM) None[source]#
Add a figure of merit (FOM).
- Parameters
fom – FOM object
- Returns
None
- abstract run(*args, **kwargs) clusterking.stability.stabilitytester.StabilityTesterResult[source]#
Run the stability test.
- Parameters
*args – Positional arguments
**kwargs – Keyword arguments
- Returns
StabilityTesterResult object
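The configure-then-run workflow described for AbstractStabilityTester can be illustrated with a minimal, self-contained sketch. It does not use clusterking itself: ToyStabilityTester, toy_cluster and same_cluster_fraction are hypothetical stand-ins, and the threshold "clustering" is only there to make the example runnable.

```python
import random
from typing import Callable, Dict, List

Clustering = List[int]  # one cluster label per sample point


def toy_cluster(values: List[float]) -> Clustering:
    """Toy 'clustering' (assumption): label 0 below 0.5, label 1 otherwise."""
    return [0 if v < 0.5 else 1 for v in values]


def same_cluster_fraction(a: Clustering, b: Clustering) -> float:
    """Toy FOM: fraction of points that keep their cluster label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


class ToyStabilityTester:
    """Hypothetical worker: configure via setters, then call run()."""

    def __init__(self) -> None:
        self._foms: Dict[str, Callable[[Clustering, Clustering], float]] = {}
        self._repeat = 10

    def add_fom(self, name: str, fom: Callable[[Clustering, Clustering], float]) -> None:
        self._foms[name] = fom

    def set_repeat(self, repeat: int) -> None:
        self._repeat = repeat

    def run(self, values: List[float], noise: float, seed: int = 0) -> List[Dict[str, float]]:
        rng = random.Random(seed)
        original = toy_cluster(values)
        rows = []
        for _ in range(self._repeat):  # one iteration = one experiment
            varied = [v + rng.gauss(0.0, noise) for v in values]
            outcome = toy_cluster(varied)
            # Each FOM compares the varied outcome with the original one
            rows.append({name: fom(original, outcome) for name, fom in self._foms.items()})
        return rows


tester = ToyStabilityTester()
tester.add_fom("same_cluster_fraction", same_cluster_fraction)
tester.set_repeat(5)
rows = tester.run(values=[0.1, 0.2, 0.8, 0.9], noise=0.01)
print(rows)
```

Each row corresponds to one experiment and holds one value per registered FOM, which is roughly the tabular shape a result object would wrap.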
- class clusterking.stability.noisysamplestability.NoisySampleStabilityTesterResult(df, samples=None, **kwargs)[source]#
Bases:
clusterking.stability.stabilitytester.SimpleStabilityTesterResult
Result of NoisySampleStabilityTester
- samples#
Collected samples
- class clusterking.stability.noisysamplestability.NoisySampleResult(samples: Optional[List[clusterking.data.data.Data]] = None)[source]#
Bases:
clusterking.result.AbstractResult
- __init__(samples: Optional[List[clusterking.data.data.Data]] = None)[source]#
- write(directory: Union[str, pathlib.PurePath], non_empty='add') None[source]#
Write to output directory
- Parameters
directory – Path to directory
non_empty – What to do if the directory is not empty: raise (raise FileExistsError), ignore (do nothing and potentially overwrite files), add (add files with new names).
- Returns
None
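The three non_empty policies can be sketched with the standard library. This is only an illustration of the behaviour described above, not clusterking's implementation: resolve_output_path is a hypothetical helper, and the "_1", "_2" suffix scheme used for add is an assumption.

```python
import tempfile
from pathlib import Path


def resolve_output_path(directory: Path, filename: str, non_empty: str = "add") -> Path:
    """Return the path to write to, following the chosen non_empty policy."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / filename
    if not any(directory.iterdir()):
        return target  # empty directory: nothing to decide
    if non_empty == "raise":
        raise FileExistsError(f"{directory} is not empty")
    if non_empty == "ignore":
        return target  # may overwrite an existing file
    if non_empty == "add":
        # Assumed naming scheme: append a counter until the name is free
        stem, suffix = target.stem, target.suffix
        counter = 1
        while target.exists():
            target = directory / f"{stem}_{counter}{suffix}"
            counter += 1
        return target
    raise ValueError(f"Unknown non_empty policy: {non_empty!r}")


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    first = resolve_output_path(out, "sample.csv")
    first.write_text("x")
    second = resolve_output_path(out, "sample.csv", non_empty="add")
    print(first.name, second.name)
```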
- classmethod load(directory: Union[str, pathlib.PurePath], loader: Optional[Callable] = None) clusterking.stability.noisysamplestability.NoisySampleResult[source]#
Load from output directory
- Parameters
directory – Path to directory to load from
loader – Function used to load data (optional).
Example:
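import clusterking
from clusterking.stability.noisysamplestability import NoisySampleResult

def loader(path):
    # Load each sample as data with errors and attach a 1% relative error
    d = clusterking.DataWithErrors(path)
    d.add_rel_err_uncorr(0.01)
    return d

nsr = NoisySampleResult.load("/path/to/dir/", loader=loader)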
- class clusterking.stability.noisysamplestability.NoisySample[source]#
Bases:
clusterking.worker.AbstractWorker
This worker generates data samples with slightly varied sample points (by applying clusterking.scan.Scanner.add_spoints_noise() to a pre-configured clusterking.scan.Scanner object).
Example:
import clusterking as ck
from clusterking.stability.noisysamplestability import NoisySample

# Set up data object
d = ck.Data()

# Set up scanner
s = ck.scan.Scanner()
s.set_dfunction(...)
s.set_spoints_equidist(...)

# Set up noisysample object
ns = NoisySample()
ns.set_repeat(1)
ns.set_noise("gauss", mean=0., sigma=1/30/4)

# Run and write
nsr = ns.run(scanner=s, data=d)
nsr.write("output/folder")
- set_repeat(repeat=10) None[source]#
Set number of experiments.
- Parameters
repeat – Number of experiments
- Returns
None
- set_noise(*args, **kwargs) None[source]#
Configure noise, applied to the spoints in each experiment. See clusterking.scan.Scanner.add_spoints_noise().
- Parameters
*args – Positional arguments to clusterking.scan.Scanner.add_spoints_noise().
**kwargs – Keyword arguments to clusterking.scan.Scanner.add_spoints_noise().
- Returns
None
- run(scanner: clusterking.scan.scanner.Scanner, data: Optional[clusterking.data.data.Data] = None) clusterking.stability.noisysamplestability.NoisySampleResult[source]#
Note
This method handles keyboard interrupts and still returns the data collected so far.
- Parameters
scanner – Scanner object
data – Data object. This does not have to contain any actual sample points, but is used so that you can use data with errors by passing a DataWithErrors object.
- Returns
NoisySampleResult object
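The note about keyboard interrupts describes a common pattern: wrap the experiment loop in a try/except and return whatever has been collected. A minimal sketch, where collect_samples and flaky_sample are hypothetical names and the interrupt is simulated rather than triggered by Ctrl+C:

```python
from typing import Callable, List


def collect_samples(make_sample: Callable[[int], float], repeat: int) -> List[float]:
    """Run make_sample up to `repeat` times; keep partial results on Ctrl+C."""
    samples: List[float] = []
    try:
        for i in range(repeat):
            samples.append(make_sample(i))
    except KeyboardInterrupt:
        # Stop early but do not lose the experiments that already finished
        print(f"Interrupted after {len(samples)} of {repeat} experiments")
    return samples


def flaky_sample(i: int) -> float:
    """Stand-in for one experiment; simulates an interrupt in the 4th run."""
    if i == 3:
        raise KeyboardInterrupt
    return i * 0.5


partial = collect_samples(flaky_sample, repeat=10)
print(partial)
```

With long scans this means an interrupted run can still be written out and analysed, at the cost of fewer experiments than requested.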
- class clusterking.stability.noisysamplestability.NoisySampleStabilityTester(*args, keep_samples=False, **kwargs)[source]#
Bases:
clusterking.stability.stabilitytester.AbstractStabilityTester
This stability test generates data samples with slightly varied sample points (by applying clusterking.scan.Scanner.add_spoints_noise() to a pre-configured clusterking.scan.Scanner object) and compares the resulting clusters and benchmark points.
Example:
import clusterking as ck
from clusterking.stability.fom import DeltaNClusters
from clusterking.stability.noisysamplestability import NoisySampleResult, NoisySampleStabilityTester

# load() is a classmethod, so use its return value
nsr = NoisySampleResult.load("/path/to/samples/")

c = ck.cluster.HierarchyCluster()
c.set_metric()
c.set_max_d(0.2)

nsst = NoisySampleStabilityTester()
nsst.add_fom(DeltaNClusters(name="DeltaNClusters"))
r = nsst.run(sample=nsr, cluster=c)
- __init__(*args, keep_samples=False, **kwargs)[source]#
Initialize NoisySampleStabilityTester
- Parameters
*args – Arguments passed on to AbstractStabilityTester
keep_samples – Save clustered/benchmarked samples to NoisySampleStabilityTester.samples
**kwargs – Keyword arguments passed on to AbstractStabilityTester
- run(sample: clusterking.stability.noisysamplestability.NoisySampleResult, cluster: Optional[clusterking.cluster.cluster.Cluster] = None, benchmark: Optional[clusterking.benchmark.abstract_benchmark.AbstractBenchmark] = None) clusterking.stability.noisysamplestability.NoisySampleStabilityTesterResult[source]#
Run stability test.
- Parameters
sample – NoisySampleResult object
cluster – Cluster object
benchmark – Optional: AbstractBenchmark object
- Returns
NoisySampleStabilityTesterResult object
- class clusterking.stability.subsamplestability.SubSampleStabilityTesterResult(df: pandas.core.frame.DataFrame)[source]#
Bases:
clusterking.stability.stabilitytester.SimpleStabilityTesterResult
- class clusterking.stability.subsamplestability.SubSampleStabilityTester[source]#
Bases:
clusterking.stability.stabilitytester.AbstractStabilityTester
Test the stability of clustering algorithms by repeatedly clustering subsamples of data.
Example:
import clusterking as ck
from clusterking.stability.subsamplestability import SubSampleStabilityTester

ssst = SubSampleStabilityTester()
ssst.set_sampling(frac=0.99)
ssst.set_repeat(50)

d = ck.Data(path)

c = ck.cluster.HierarchyCluster()
c.set_metric("euclidean")
c.set_max_d(0.2)
c.run(data=d).write()

# Benchmark: a concrete benchmark worker (its import is not shown in the original example)
b = Benchmark()
b.set_metric("euclidean")
b.run(data=d).write()

ssstr = ssst.run(data=d, cluster=c, benchmark=b)
- set_sampling(**kwargs) None[source]#
Configure the subsampling of the data. If benchmarking is performed, none of the benchmark points of the original dataframe are removed during subsampling (so that the benchmarking results can be compared).
- Parameters
**kwargs – Keyword arguments to clusterking.data.Data.sample_param_random(), in particular keyword arguments to pandas.DataFrame.sample().
- Returns
None
Example:
ssst.set_sampling(n=100)     # Sample 100 points
ssst.set_sampling(frac=0.9)  # Sample 90% of the points
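The idea of subsampling while keeping the benchmark points can be sketched without pandas. The helper below is hypothetical (subsample_indices is not part of clusterking), works on plain row indices, and assumes that the protected benchmark points count towards the requested sample size; how clusterking handles this internally is not stated here.

```python
import random
from typing import List, Optional, Set


def subsample_indices(
    n_points: int,
    protected: Set[int],
    n: Optional[int] = None,
    frac: Optional[float] = None,
    seed: int = 0,
) -> List[int]:
    """Pick row indices at random, never dropping the protected ones."""
    if (n is None) == (frac is None):
        raise ValueError("Specify exactly one of n and frac")
    target = n if n is not None else round(frac * n_points)
    if not len(protected) <= target <= n_points:
        raise ValueError("Sample size incompatible with the data")
    rng = random.Random(seed)
    # Only unprotected rows are candidates for removal
    free = [i for i in range(n_points) if i not in protected]
    chosen = rng.sample(free, target - len(protected))
    return sorted(protected | set(chosen))


kept = subsample_indices(100, protected={3, 42}, frac=0.9)
print(len(kept), 3 in kept, 42 in kept)
```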
- set_progress_bar(state=True) None[source]#
Set or unset progress bar.
- Parameters
state – Bool: Display progress bar?
- Returns
None
- run(data: clusterking.data.data.Data, cluster: clusterking.cluster.cluster.Cluster, benchmark: Optional[clusterking.benchmark.abstract_benchmark.AbstractBenchmark] = None) clusterking.stability.subsamplestability.SubSampleStabilityTesterResult[source]#
Run test.
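- Parameters
data – Data object
cluster – Cluster object
benchmark – Optional: AbstractBenchmark object
- Returns
SubSampleStabilityTesterResult object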
- class clusterking.stability.subsamplestability.SubSampleStabilityVsFractionResult(df: pandas.core.frame.DataFrame)[source]#
Bases:
clusterking.stability.stabilitytester.SimpleStabilityTesterResult
- class clusterking.stability.subsamplestability.SubSampleStabilityVsFraction[source]#
Bases:
object
Repeatedly run SubSampleStabilityTester for different fractions.
- run(data: clusterking.data.data.Data, cluster: clusterking.cluster.cluster.Cluster, ssst: clusterking.stability.subsamplestability.SubSampleStabilityTester, fractions: Iterable[float])[source]#
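Scanning over fractions amounts to an outer loop around the subsample test. The following sketch is a toy version under stated assumptions: toy_cluster (split at the mean of the sample), stability_for_fraction and scan_fractions are hypothetical names, and the FOM used is simply the fraction of points that keep their label.

```python
import random
from typing import Dict, Iterable, List


def toy_cluster(values: List[float]) -> List[int]:
    """Toy clustering (assumption): split the points at the sample mean."""
    center = sum(values) / len(values)
    return [0 if v < center else 1 for v in values]


def stability_for_fraction(values: List[float], frac: float, repeat: int, rng: random.Random) -> float:
    """Average fraction of points keeping their label when subsampling."""
    original = toy_cluster(values)
    size = max(1, round(frac * len(values)))
    scores = []
    for _ in range(repeat):
        idx = rng.sample(range(len(values)), size)
        sub = toy_cluster([values[i] for i in idx])
        same = sum(sub[k] == original[i] for k, i in enumerate(idx))
        scores.append(same / size)
    return sum(scores) / len(scores)


def scan_fractions(values: List[float], fractions: Iterable[float], repeat: int = 20, seed: int = 0) -> Dict[float, float]:
    """Run the subsample test once per fraction and collect the averages."""
    rng = random.Random(seed)
    return {frac: stability_for_fraction(values, frac, repeat, rng) for frac in fractions}


values = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
result = scan_fractions(values, fractions=[0.5, 1.0])
print(result)
```

Using the full sample (fraction 1.0) reproduces the original clustering, so its score is 1; smaller fractions can score lower because the toy split point moves with the subsample.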
Figures of Merit#
- class clusterking.stability.fom.FOMResult(fom, name)[source]#
Bases:
clusterking.result.AbstractResult
Object containing the result of a Figure of Merit (FOM), represented by a FOM object.
- class clusterking.stability.fom.FOM(name: Union[None, str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]#
Bases:
clusterking.worker.AbstractWorker
Figure of Merit, comparing the outcome of two experiments (e.g. the clusters of two very similar datasets).
- __init__(name: Union[None, str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]#
Initialize the FOM worker.
- Parameters
name – Name of the FOM
preprocessor – Preprocessor object
- property name#
Name of the FOM
- property preprocessor#
- set_preprocessor(preprocessor: clusterking.stability.preprocessor.Preprocessor)[source]#
- run(data1: clusterking.data.data.Data, data2: clusterking.data.data.Data) clusterking.stability.fom.FOMResult[source]#
Calculate figure of merit.
- class clusterking.stability.fom.CCFOM(name: Union[None, str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]#
Bases:
clusterking.stability.fom.FOM
Cluster Comparison figure of merit (CCFOM), comparing whether the clusters of two experiments match.
- class clusterking.stability.fom.MatchingClusters(name: Union[None, str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]#
Bases:
clusterking.stability.fom.CCFOM
Fraction of sample points (spoints) that lie in the same cluster, when comparing two clustered datasets with the same number of sample points.
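As a rough illustration of this figure of merit, the fraction can be computed from two label assignments over the same sample points. The function matching_fraction is a hypothetical stand-in, and the sketch assumes that the cluster names of both datasets have already been matched (for example by a preprocessor).

```python
from typing import Dict, Hashable


def matching_fraction(clusters1: Dict[Hashable, int], clusters2: Dict[Hashable, int]) -> float:
    """Fraction of sample points carrying the same cluster label in both sets."""
    if clusters1.keys() != clusters2.keys():
        raise ValueError("Both datasets must contain the same sample points")
    if not clusters1:
        raise ValueError("No sample points given")
    same = sum(clusters1[point] == clusters2[point] for point in clusters1)
    return same / len(clusters1)


# Keys are sample point indices, values are (already matched) cluster labels
first = {0: 0, 1: 0, 2: 1, 3: 1}
second = {0: 0, 1: 1, 2: 1, 3: 1}
print(matching_fraction(first, second))  # 3 of 4 points agree -> 0.75
```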
- class clusterking.stability.fom.DeltaNClusters(name: Union[None, str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]#
Bases:
clusterking.stability.fom.CCFOM
Difference in the number of clusters between two experiments (number of clusters in experiment 1 minus number of clusters in experiment 2).
- class clusterking.stability.fom.NClusters(which, **kwargs)[source]#
Bases:
clusterking.stability.fom.CCFOM
Number of clusters in dataset 1 or 2
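The two cluster-counting figures of merit (DeltaNClusters and NClusters) reduce to counting distinct labels. A small sketch with hypothetical helper names, operating on plain label lists rather than Data objects:

```python
from typing import Hashable, Sequence


def n_clusters(labels: Sequence[Hashable]) -> int:
    """Number of distinct cluster labels."""
    return len(set(labels))


def delta_n_clusters(labels1: Sequence[Hashable], labels2: Sequence[Hashable]) -> int:
    """Clusters in experiment 1 minus clusters in experiment 2."""
    return n_clusters(labels1) - n_clusters(labels2)


experiment1 = [0, 0, 1, 2]
experiment2 = [0, 1, 1, 1]
print(n_clusters(experiment1), n_clusters(experiment2), delta_n_clusters(experiment1, experiment2))
```

Because only the label counts enter, these quantities do not depend on how cluster names are matched between the two experiments.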
- class clusterking.stability.fom.BpointList(name: Union[None, str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]#
Bases:
clusterking.stability.fom.FOM
Adds an array of the bpoint (benchmark point) coordinates of data2.
- class clusterking.stability.fom.BMFOM(name: Union[None, str] = None, preprocessor: Optional[clusterking.stability.preprocessor.Preprocessor] = None)[source]#
Bases:
clusterking.stability.fom.FOM
Abstract class: Benchmark Figure of Merit (BMFOM), comparing whether the benchmark points of two experiments match.
- class clusterking.stability.fom.AverageBMProximityFOM(*args, **kwargs)[source]#
Bases:
clusterking.stability.fom.BMFOM
Returns the average distance of benchmark points in parameter space between two experiments.
- named_averaging_fcts = dict_keys(['max', 'arithmetic'])#
- named_metric_fcts = dict_keys(['euclidean'])#
- __init__(*args, **kwargs)[source]#
Initialize the FOM worker.
- Parameters
See FOM.__init__().
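A sketch of how such an average benchmark distance might be computed is given below. The lookup tables mirror the option names listed above ('max', 'arithmetic', 'euclidean'), but average_bm_proximity itself is hypothetical, and pairing benchmark points by cluster name is an assumption of this sketch.

```python
import math
from typing import Callable, Dict, Sequence

# Option names taken from the lists above; the implementations are assumptions
AVERAGING: Dict[str, Callable[[Sequence[float]], float]] = {
    "max": max,
    "arithmetic": lambda distances: sum(distances) / len(distances),
}
METRICS: Dict[str, Callable[[Sequence[float], Sequence[float]], float]] = {
    "euclidean": math.dist,
}


def average_bm_proximity(
    bpoints1: Dict[int, Sequence[float]],
    bpoints2: Dict[int, Sequence[float]],
    averaging: str = "arithmetic",
    metric: str = "euclidean",
) -> float:
    """Average distance between benchmark points of clusters present in both."""
    common = sorted(set(bpoints1) & set(bpoints2))
    if not common:
        raise ValueError("No common clusters to compare")
    distances = [METRICS[metric](bpoints1[c], bpoints2[c]) for c in common]
    return AVERAGING[averaging](distances)


# Cluster name -> benchmark point coordinates in parameter space
bm1 = {0: (0.0, 0.0), 1: (1.0, 1.0)}
bm2 = {0: (3.0, 4.0), 1: (1.0, 1.0)}
print(average_bm_proximity(bm1, bm2), average_bm_proximity(bm1, bm2, averaging="max"))
```

The 'max' option reports the worst-case shift of any benchmark point, whereas 'arithmetic' smooths over clusters whose benchmark points barely move.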
Preprocessors#
- class clusterking.stability.preprocessor.Preprocessor(name=None)[source]#
Bases:
clusterking.worker.AbstractWorker
- property name#
- run(data1: clusterking.data.data.Data, data2: clusterking.data.data.Data) clusterking.stability.preprocessor.PreprocessorResult[source]#
Run.
- Parameters
data1 – Data object
data2 – Data object
- Returns
PreprocessorResult object
- class clusterking.stability.preprocessor.ClusterMatcherResult(data1, data2, rename_dct)[source]#
Bases:
clusterking.stability.preprocessor.PreprocessorResult
- class clusterking.stability.preprocessor.ClusterMatcher(*args, cluster_column='cluster', **kwargs)[source]#
Bases:
clusterking.stability.preprocessor.Preprocessor
Cluster names are in general arbitrary, so when comparing two clustered datasets to calculate a figure of merit, the names have to be matched first. This is done by this worker class.
- abstract run(data1: clusterking.data.data.Data, data2: clusterking.data.data.Data) clusterking.stability.preprocessor.ClusterMatcherResult[source]#
- Parameters
data1 – Data object
data2 – Data object
- Returns
ClusterMatcherResult object
- class clusterking.stability.preprocessor.TrivialClusterMatcher(*args, cluster_column='cluster', **kwargs)[source]#
Bases:
clusterking.stability.preprocessor.ClusterMatcher
This subclass of ClusterMatcher maps each cluster name from the first clustering to the cluster name of the second that maximizes the number of sample points that lie in the same cluster. It also only returns the intersection of the indices of both Series.
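The matching rule described above can be sketched as an overlap count. match_by_overlap is a hypothetical function working on plain dictionaries (sample point index to integer cluster label); ties are broken in favour of the smaller second label, which is an arbitrary choice of this sketch.

```python
from collections import Counter
from typing import Dict


def match_by_overlap(clusters1: Dict[int, int], clusters2: Dict[int, int]) -> Dict[int, int]:
    """Map each label of clustering 1 to the label of clustering 2 it overlaps most."""
    # Only sample points present in both clusterings are compared
    common = clusters1.keys() & clusters2.keys()
    overlap = Counter((clusters1[i], clusters2[i]) for i in common)
    rename: Dict[int, int] = {}
    best: Dict[int, int] = {}
    for (label1, label2), count in sorted(overlap.items()):
        if count > best.get(label1, 0):
            best[label1] = count
            rename[label1] = label2
    return rename


# Sample point index -> cluster label
first = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}
second = {1: 5, 2: 7, 3: 7, 4: 5, 9: 5}
print(match_by_overlap(first, second))
```

Note that a mapping built this way need not be one-to-one: two clusters of the first dataset can be mapped to the same cluster of the second.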
- class clusterking.stability.preprocessor.FirstComeFirstServe1DClusterMatcher(*args, cluster_column='cluster', **kwargs)[source]#
Bases:
clusterking.stability.preprocessor.ClusterMatcher
This subclass of ClusterMatcher works only for 1D parameter spaces. It simply sorts the first points of each cluster and enumerates them in order to get a unique name for each cluster.
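One way to read this rule is sketched below: take the first point of every cluster, sort the clusters by it, and number them in that order. canonical_names_1d is a hypothetical helper, and interpreting "first point" as the point with the smallest parameter value is an assumption.

```python
from typing import Dict, Sequence


def canonical_names_1d(params: Sequence[float], labels: Sequence[int]) -> Dict[int, int]:
    """Rename clusters 0, 1, 2, ... in order of their smallest parameter value."""
    first_point: Dict[int, float] = {}
    for x, label in zip(params, labels):
        if label not in first_point or x < first_point[label]:
            first_point[label] = x
    ordered = sorted(first_point, key=lambda label: first_point[label])
    return {old: new for new, old in enumerate(ordered)}


params = [0.0, 0.1, 0.5, 0.6, 0.9]
labels_a = [7, 7, 2, 2, 4]
labels_b = [1, 1, 0, 0, 3]
rename_a = canonical_names_1d(params, labels_a)
rename_b = canonical_names_1d(params, labels_b)
# Both clusterings describe the same partition, so the renamed labels agree
print([rename_a[l] for l in labels_a], [rename_b[l] for l in labels_b])
```

Since both datasets are renamed independently by the same rule, no pairwise comparison of overlaps is needed, which is what restricts the approach to an ordered (1D) parameter space.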