Benchmark

AbstractBenchmark

class clusterking.benchmark.abstract_benchmark.AbstractBenchmark(data: clusterking.data.data.Data, cluster_column='cluster')[source]

Bases: object

Subclass this class to implement algorithms to choose benchmark points from all the points (in parameter space) that correspond to one cluster.

__init__(data: clusterking.data.data.Data, cluster_column='cluster')[source]
Parameters:
  • data – Data object
  • cluster_column – Column name of the clusters
cluster_column

The column from which we read the cluster information. Defaults to ‘cluster’.

select_bpoints() → None[source]

Select one benchmark point for each cluster.

write(bpoint_column='bpoint') → None[source]

Write benchmark points to a column in the dataframe of the data object.

Parameters:
  • bpoint_column – Column to write to
Returns: None

Benchmark

class clusterking.benchmark.benchmark.Benchmark(data, cluster_column='cluster')[source]

Bases: clusterking.benchmark.abstract_benchmark.AbstractBenchmark

Selects benchmark points based on a figure of merit that is calculated from the metric. Use set_metric() to specify the metric (as for the HierarchyCluster class). The default figure of merit (“sum”) chooses as benchmark point the point that minimizes the sum of the distances to all other points in the same cluster, where distance is measured with respect to the metric.
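The default “sum” selection can be illustrated outside of the library. The following is a minimal sketch, not the ClusterKinG implementation: it assumes a plain NumPy array stands in for the per-point data and uses SciPy's pdist with the Euclidean metric. The names points, clusters and select_benchmark are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: one row per point, plus a cluster label per row.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
clusters = np.array([0, 0, 0, 1, 1])


def select_benchmark(points, clusters, cluster_id):
    """Return the row index of the point that minimizes the summed
    distance to all other points of the same cluster."""
    indices = np.flatnonzero(clusters == cluster_id)
    # pdist returns the condensed distance matrix; squareform expands it
    # to a full symmetric matrix with zeros on the diagonal.
    distances = squareform(pdist(points[indices], "euclidean"))
    fom = np.sum(distances, axis=1)  # the "sum" figure of merit
    return int(indices[np.argmin(fom)])


benchmark_0 = select_benchmark(points, clusters, 0)
print(benchmark_0)  # 0: the origin is closest, in total, to the other two points
```

If several points share the minimal figure of merit, np.argmin returns the first of them; how the library breaks such ties is not specified here.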

__init__(data, cluster_column='cluster')[source]
Parameters:
  • data – Data object
  • cluster_column – Column name of the clusters
set_metric(*args, **kwargs) → None[source]

Select a metric in one of the following ways:

  1. If no positional arguments are given, we choose the Euclidean metric.
  2. If the first positional argument is a string, we pick the metric of that name defined in scipy.spatial.distance.pdist (all additional arguments are passed to this function).
  3. If the first positional argument is a function, we take this function (and add all additional arguments to it).

Examples:

  • ...(): Euclidean metric
  • ...("euclidean"): Also Euclidean metric
  • ...(lambda data: scipy.spatial.distance.pdist(data.data(), 'euclidean')): Also Euclidean metric
  • ...("minkowski", p=2): Minkowski distance with p=2.

See https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html for more information.

Parameters:
  • *args
  • **kwargs
Returns:

Function that takes a Data object as its only parameter and returns a reduced distance matrix.
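To see why the examples listed for set_metric describe the same metric, one can call scipy.spatial.distance.pdist directly. This sketch uses only SciPy and NumPy on a hypothetical toy array (data here is a plain array, not a ClusterKinG Data object), and it assumes that the “reduced distance matrix” corresponds to what SciPy calls the condensed form.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: three points on a line through the origin.
data = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Metric selected by name; extra keyword arguments are forwarded to pdist.
euclidean = pdist(data, "euclidean")
minkowski = pdist(data, "minkowski", p=2)

# Metric given as a function that computes the distances itself.
custom_metric = lambda d: pdist(d, "euclidean")
custom = custom_metric(data)

# The condensed form holds the n*(n-1)/2 pairwise distances in the order
# (0,1), (0,2), (1,2); squareform expands it to the full n x n matrix.
print(euclidean)                    # [ 5. 10.  5.]
print(squareform(euclidean).shape)  # (3, 3)
```

All three variants yield the same values, since the Minkowski distance with p=2 is the Euclidean distance.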

set_fom(fct: Callable, *args, **kwargs) → None[source]

Set a figure of merit. The default figure of merit (“sum”) chooses as benchmark point the point that minimizes the sum of the distances to all other points in the same cluster (where distance is measured with respect to the metric). In general, we choose the point that minimizes self.fom(<metric>); the default case corresponds to self.fom = lambda x: np.sum(x, axis=1), which you could also set by calling self.set_fom(np.sum, axis=1).

Parameters:
  • fct – Function that takes the metric as first argument
  • *args – Positional arguments that are added to the positional arguments of fct after the metric
  • **kwargs – Keyword arguments for the function
Returns:

None
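The effect of choosing a different figure of merit can be sketched with NumPy alone. The helper apply_fom below is hypothetical and only mirrors the call convention described for set_fom (the distance matrix is passed first, followed by the extra positional and keyword arguments); it is not part of ClusterKinG.

```python
import numpy as np

# Hypothetical one-dimensional cluster; entry [i, j] is the distance |x_i - x_j|.
x = np.array([0.0, 1.0, 1.1, 1.2, 5.0])
distances = np.abs(x[:, None] - x[None, :])


def apply_fom(fct, matrix, *args, **kwargs):
    """Hypothetical helper: evaluate a figure of merit on a distance matrix."""
    return fct(matrix, *args, **kwargs)


# Default-like choice: minimize the sum of distances to all other points.
best_by_sum = int(np.argmin(apply_fom(np.sum, distances, axis=1)))

# Alternative choice: minimize the largest distance to any other point.
best_by_max = int(np.argmin(apply_fom(np.max, distances, axis=1)))

print(best_by_sum, best_by_max)  # 2 3
```

In this example the two figures of merit pick different points: the sum favors the median point, while the maximum favors the point nearest the middle of the range.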