Benchmark

This module contains worker classes that select representative sample points for each cluster (“benchmark points”).

AbstractBenchmark

class clusterking.benchmark.AbstractBenchmark[source]

Bases: clusterking.worker.DataWorker

Subclass this class to implement algorithms to choose benchmark points from all the points (in parameter space) that correspond to one cluster.

__init__()[source]
property cluster_column: str
set_cluster_column(column='cluster')[source]

Set the column of the dataframe of the Data object that contains the cluster information.

abstract run(data)[source]

class clusterking.benchmark.AbstractBenchmarkResult(data, bpoints, md)[source]

Bases: clusterking.result.DataResult

__init__(data, bpoints, md)[source]
write(bpoint_column='bpoint') None[source]

Write benchmark points to a column in the dataframe of the data object.

Parameters

bpoint_column – Column to write to

Returns

None

Benchmark

class clusterking.benchmark.Benchmark[source]

Bases: clusterking.benchmark.abstract_benchmark.AbstractBenchmark

Selects benchmark points based on a figure of merit that is calculated from the metric. Use set_metric() to specify the metric (as for the HierarchyCluster class). The default figure of merit (“sum”) picks as benchmark point the point that minimizes the sum of the distances to all other points in the same cluster, where distance is measured with the chosen metric.

__init__()[source]
Parameters
  • data – Data object

  • cluster_column – Column name of the clusters

set_metric(*args, **kwargs) None[source]

Select a metric in one of the following ways:

  1. If no positional arguments are given, we choose the euclidean metric.

  2. If the first positional argument is a string, we pick one of the metrics that are defined in scipy.spatial.distance.pdist by that name (all additional arguments will be passed to this function).

  3. If the first positional argument is a function, we take this function (and add all additional arguments to it).

Examples:

  • ...(): Euclidean metric

  • ...("euclidean"): Also Euclidean metric

  • ...(lambda data: scipy.spatial.distance.pdist(data.data(), 'euclidean')): Also Euclidean metric

  • ...("minkowski", p=2): Minkowski distance with p=2.

See https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html for more information.

Parameters
  • *args – see description above

  • **kwargs – see description above

Returns

Function that takes Data object as only parameter and returns a reduced distance matrix.
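The “reduced distance matrix” mentioned above presumably refers to the condensed form that scipy.spatial.distance.pdist returns: one entry per pair of points (i < j), ordered row by row. The following is only a minimal NumPy sketch of that layout for a Minkowski metric. The helper name condensed_distances is hypothetical and is not part of clusterking or SciPy.

```python
import numpy as np


def condensed_distances(points, p=2):
    """Minkowski distances between all row pairs (i < j), flattened row by row.

    Hypothetical helper for illustration only; it mirrors the ordering of the
    condensed distance matrix described in the scipy pdist documentation.
    """
    n = len(points)
    # Index pairs (0,1), (0,2), ..., (0,n-1), (1,2), ... in row-major order.
    i, j = np.triu_indices(n, k=1)
    diff = np.abs(points[i] - points[j])
    return (diff ** p).sum(axis=1) ** (1.0 / p)


points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
euclidean = condensed_distances(points)       # p=2: Euclidean distance
manhattan = condensed_distances(points, p=1)  # p=1: city-block distance
```

For three points the result has three entries, for the pairs (0, 1), (0, 2) and (1, 2). With p=2 this is the Euclidean case that set_metric() selects when called without arguments.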

set_fom(fct: Callable, *args, **kwargs) None[source]

Set a figure of merit. The default figure of merit (“sum”) picks as benchmark point the point that minimizes the sum of the distances to all other points in the same cluster, where distance is measured with the chosen metric. In general we choose the point that minimizes self.fom(<metric>), i.e. the default case corresponds to self.fom = lambda x: np.sum(x, axis=1), which you could also have set by calling self.set_fom(np.sum, axis=1).

Parameters
  • fct – Function that takes the metric as first argument

  • *args – Positional arguments that are added to the positional arguments of fct after the metric

  • **kwargs – Keyword arguments for the function

Returns

None
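To illustrate how the choice of figure of merit changes the selected point, here is a small NumPy sketch that works on a single cluster. It is not the library's implementation: pairwise_distances and pick_benchmark are hypothetical helpers, and the sketch assumes that the figure of merit receives a full square distance matrix and returns one score per point.

```python
import numpy as np


def pairwise_distances(points):
    """Full (n, n) matrix of Euclidean distances between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def pick_benchmark(points, fom):
    """Return the index of the point with the smallest figure of merit.

    Hypothetical helper: `fom` maps the distance matrix to one score per point.
    """
    scores = fom(pairwise_distances(points))
    return int(np.argmin(scores))


# One cluster of five points on a line; the last point is an outlier.
cluster = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])

# Default-like "sum" figure of merit: minimize the summed distance to all others.
best_sum = pick_benchmark(cluster, lambda d: np.sum(d, axis=1))

# Alternative figure of merit: minimize the distance to the farthest point.
best_max = pick_benchmark(cluster, lambda d: np.max(d, axis=1))
```

With the “sum” figure of merit the point at 2.0 (index 2) is chosen, while the “max” variant is pulled towards the outlier and selects the point at 3.0 (index 3).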

run(data)[source]

class clusterking.benchmark.BenchmarkResult(data, bpoints, md)[source]

Bases: clusterking.benchmark.abstract_benchmark.AbstractBenchmarkResult