Cluster

This subpackage provides classes to perform the actual clustering.

Different clustering algorithms correspond to different subclasses of the base class clusterking.cluster.Cluster (and inherit all of its methods).

Currently implemented: HierarchyCluster and KmeansCluster.

Cluster

class clusterking.cluster.Cluster[source]

Bases: clusterking.worker.DataWorker

Abstract base class of the Cluster classes. It defines the common functions and is subclassed to implement specific clustering algorithms.

__init__()[source]
Parameters: data – Data object
md = None

Metadata

run(data, **kwargs)[source]

Implementation of the clustering. Subclasses should return an array-like object holding the cluster numbers.

class clusterking.cluster.ClusterResult(data, md, clusters)[source]

Bases: clusterking.result.DataResult

__init__(data, md, clusters)[source]
get_clusters(indexed=False)[source]
write(cluster_column='cluster')[source]

Write results back in the Data object.

HierarchyCluster

class clusterking.cluster.HierarchyCluster[source]

Bases: clusterking.cluster.cluster.Cluster

__init__()[source]
max_d

Cutoff value set in set_max_d().

metric

Metric that was set in set_metric(): a function that takes a Data object as its only parameter and returns a reduced distance matrix.

set_metric(*args, **kwargs) → None[source]

Select a metric in one of the following ways:

  1. If no positional arguments are given, the Euclidean metric is chosen.
  2. If the first positional argument is a string, the metric of that name defined in scipy.spatial.distance.pdist is picked (all additional arguments are passed on to this function).
  3. If the first positional argument is a function, this function is used (and all additional arguments are passed on to it).

Examples:

  • ...(): Euclidean metric
  • ...("euclidean"): Also Euclidean metric
  • ...(lambda data: scipy.spatial.distance.pdist(data.data(), 'euclidean')): Also Euclidean metric
  • ...("minkowski", p=2): Minkowski distance with p=2.

See https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html for more information.

Parameters:
  • *args – see description above
  • **kwargs – see description above
Returns:

Function that takes a Data object as its only parameter and returns a reduced distance matrix.
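The three selection rules above can be sketched without ClusterKing or SciPy installed. The following is only an illustration under stated assumptions: make_metric, condensed_distances and ToyData are hypothetical names, not part of the library, and condensed_distances is a small stand-in for scipy.spatial.distance.pdist that supports just two metrics. It also shows what a reduced (condensed) distance matrix looks like: the pairwise distances in the order (0,1), (0,2), ..., (1,2), ...

```python
import math
from itertools import combinations


def condensed_distances(rows, metric="euclidean", p=2):
    """Hypothetical stand-in for pdist: pairwise distances in condensed order."""
    if metric == "euclidean":
        dist = math.dist
    elif metric == "minkowski":
        def dist(a, b):
            return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)
    else:
        raise ValueError(f"unknown metric: {metric}")
    # combinations() yields the pairs (0,1), (0,2), ..., (1,2), ...
    return [dist(a, b) for a, b in combinations(rows, 2)]


def make_metric(*args, **kwargs):
    """Mimic the three selection rules (illustrative, not the library code)."""
    if not args:
        # Rule 1: no positional argument -> Euclidean metric
        return lambda data: condensed_distances(data.data(), "euclidean")
    first, rest = args[0], args[1:]
    if isinstance(first, str):
        # Rule 2: a string names the metric; extra arguments are forwarded
        return lambda data: condensed_distances(data.data(), first, *rest, **kwargs)
    if callable(first):
        # Rule 3: a callable is used directly; extra arguments are passed on
        return lambda data: first(data, *rest, **kwargs)
    raise TypeError("expected no argument, a metric name, or a callable")


class ToyData:
    """Hypothetical minimal Data object that exposes a data() method."""

    def __init__(self, rows):
        self._rows = rows

    def data(self):
        return self._rows


toy = ToyData([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
print(make_metric()(toy))                  # [5.0, 10.0, 5.0]
print(make_metric("minkowski", p=1)(toy))  # [7.0, 14.0, 7.0]
```

In all three cases the result is a function of the Data object alone, which matches the description of the metric attribute.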

set_hierarchy_options(method='complete', optimal_ordering=False)[source]

Configure how the hierarchy is built.

Parameters:
  • method – See reference on scipy.cluster.hierarchy.linkage
  • optimal_ordering – See reference on scipy.cluster.hierarchy.linkage
set_max_d(max_d) → None[source]

Set the cutoff value of the hierarchy that then gives the clusters. This corresponds to the t argument of scipy.cluster.hierarchy.fcluster.

Parameters: max_d – float
Returns: None
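To give a feeling for what the cutoff does, here is a minimal pure-Python sketch. It assumes complete linkage (the default method shown in set_hierarchy_options) and assumes that the cutoff is interpreted as a distance threshold, i.e. the 'distance' criterion of scipy.cluster.hierarchy.fcluster; note that fcluster's own default criterion is 'inconsistent', so this may not match every configuration. The function complete_linkage_cut is a hypothetical helper and does not call SciPy or ClusterKing.

```python
import math


def complete_linkage_cut(points, max_d):
    """Greedily merge clusters (complete linkage) while the merge distance <= max_d."""
    clusters = [[i] for i in range(len(points))]

    def linkage_distance(a, b):
        # Complete linkage: distance between the two farthest members
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters
        d, ia, ib = min(
            (linkage_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > max_d:
            break  # every remaining merge would exceed the cutoff
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]

    # One label per point (0-based here; SciPy's labels start at 1)
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


points = [(0.0,), (1.0,), (10.0,), (11.0,), (30.0,)]
print(complete_linkage_cut(points, max_d=2.0))   # [0, 0, 1, 1, 2]
print(complete_linkage_cut(points, max_d=15.0))  # [0, 0, 0, 0, 1]
```

A larger cutoff allows more merges and therefore yields fewer, larger clusters.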
set_fcluster_options(**kwargs) → None[source]

Set additional keyword options for our call to scipy.cluster.hierarchy.fcluster.

Parameters: kwargs – Keyword arguments
Returns: None
run(data, reuse_hierarchy_from: Optional[clusterking.cluster.hierarchy_cluster.HierarchyClusterResult] = None)[source]
Parameters:

Returns:

class clusterking.cluster.HierarchyClusterResult(data, md, clusters, hierarchy, worker_id)[source]

Bases: clusterking.cluster.cluster.ClusterResult

__init__(data, md, clusters, hierarchy, worker_id)[source]
hierarchy
worker_id

ID of the HierarchyCluster worker that generated this object.

data_id

ID of the data object that the HierarchyCluster worker was run on.

dendrogram(output: Union[None, str, pathlib.Path] = None, ax=None, show=False, **kwargs)[source]

Creates a dendrogram.

Parameters:
  • output – If supplied, we save the dendrogram there
  • ax – An axes object if you want to add the dendrogram to an existing axes rather than creating a new one
  • show – If true, the dendrogram is shown in a viewer.
  • **kwargs – Additional keyword options to scipy.cluster.hierarchy.dendrogram
Returns:

The matplotlib.pyplot.Axes object

KmeansCluster

class clusterking.cluster.KmeansCluster[source]

Bases: clusterking.cluster.cluster.Cluster

K-means clustering as implemented in sklearn.cluster.

Example:

import clusterking as ck
d = ck.Data("/path/to/data.sql")    # Load some data
c = ck.cluster.KmeansCluster()      # Init worker class
c.set_kmeans_options(n_clusters=5)  # Set options for clustering
r = c.run(d)                        # Perform clustering on data
r.write()                           # Write results back to data
__init__()[source]
set_kmeans_options(**kwargs) → None[source]

Configure the clustering algorithm.

Parameters: **kwargs – Keyword arguments to sklearn.cluster.KMeans().
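For readers who want to see what an option such as n_clusters controls, the following is a minimal sketch of plain Lloyd iterations on one-dimensional values. It is not the scikit-learn implementation (which, among other things, uses a different initialisation by default); lloyd_kmeans is a hypothetical helper, and the naive choice of the first n_clusters values as starting centers is an assumption made to keep the example deterministic.

```python
def lloyd_kmeans(values, n_clusters, max_iter=100):
    """Cluster 1D values with plain Lloyd iterations (illustrative only)."""
    centers = list(values[:n_clusters])  # naive initialisation: first k values
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every value
        new_labels = [
            min(range(n_clusters), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members
        for k in range(n_clusters):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels, centers


labels, centers = lloyd_kmeans([1.0, 2.0, 9.0, 10.0], n_clusters=2)
print(labels)   # [0, 0, 1, 1]
print(centers)  # [1.5, 9.5]
```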
run(data) → clusterking.cluster.kmeans_cluster.KmeansClusterResult[source]
class clusterking.cluster.KmeansClusterResult(data, md, clusters)[source]

Bases: clusterking.cluster.cluster.ClusterResult