Cluster

This subpackage provides classes to perform the actual clustering.

Different clustering algorithms correspond to different subclasses of the base class clusterking.cluster.Cluster (and inherit all of its methods).

Currently implemented:

  • HierarchyCluster

  • KmeansCluster

Cluster

class clusterking.cluster.Cluster[source]

Bases: clusterking.worker.DataWorker

Abstract baseclass of the Cluster classes. This class is subclassed to implement specific clustering algorithms and defines common functions.

__init__()[source]
Parameters
  • data – Data object

  • md – Metadata

abstract run(data, **kwargs)[source]

Implementation of the clustering. Should return an array-like object with the cluster number.
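The subclassing pattern described above can be sketched with stand-in classes. This is only an illustration: ClusterSketch and ThresholdCluster are hypothetical names, not part of clusterking, and the data is a plain list rather than a Data object.

```python
from abc import ABC, abstractmethod


class ClusterSketch(ABC):
    """Hypothetical stand-in for an abstract clustering worker."""

    @abstractmethod
    def run(self, data, **kwargs):
        """Return an array-like object with one cluster number per point."""


class ThresholdCluster(ClusterSketch):
    """Toy algorithm: cluster 0 below the threshold, cluster 1 otherwise."""

    def __init__(self, threshold):
        self.threshold = threshold

    def run(self, data, **kwargs):
        return [0 if x < self.threshold else 1 for x in data]


clusters = ThresholdCluster(threshold=0.5).run([0.1, 0.9, 0.4, 0.7])
print(clusters)
```

Because run() is abstract, the base class itself cannot be instantiated; only subclasses that implement run() can.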

class clusterking.cluster.ClusterResult(data, md, clusters)[source]

Bases: clusterking.result.DataResult

__init__(data, md, clusters)[source]
get_clusters(indexed=False)[source]
write(cluster_column='cluster')[source]

Write results back in the Data object.

HierarchyCluster

class clusterking.cluster.HierarchyCluster[source]

Bases: clusterking.cluster.cluster.Cluster

__init__()[source]
property max_d: Optional[float]

Cutoff value set in set_max_d().

property metric: Callable

Metric that was set in set_metric(): a function that takes a Data object as its only parameter and returns a reduced distance matrix.

set_metric(*args, **kwargs) → None[source]

Select a metric in one of the following ways:

  1. If no positional arguments are given, the Euclidean metric is used.

  2. If the first positional argument is a string, the metric of that name defined in scipy.spatial.distance.pdist is used (all additional arguments are passed to this function).

  3. If the first positional argument is a function, this function is used (and all additional arguments are passed to it).

Examples:

  • ...(): Euclidean metric

  • ...("euclidean"): Also Euclidean metric

  • ...(lambda data: scipy.spatial.distance.pdist(data.data(), 'euclidean')): Also Euclidean metric

  • ...("minkowski", p=2): Minkowski distance with p=2.

See https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html for more information.

Parameters
  • *args – see description above

  • **kwargs – see description above

Returns

Function that takes a Data object as its only parameter and returns a reduced distance matrix.
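The three selection rules can be sketched as follows. This is not the clusterking implementation: resolve_metric is a hypothetical helper, the returned function takes a plain 2D array instead of a Data object, and for the string case only keyword arguments are forwarded to pdist.

```python
import numpy as np
from scipy.spatial.distance import pdist


def resolve_metric(*args, **kwargs):
    """Hypothetical helper mirroring the three selection rules."""
    if not args:
        # Rule 1: no positional arguments, fall back to Euclidean.
        return lambda points: pdist(points, "euclidean")
    first, rest = args[0], args[1:]
    if isinstance(first, str):
        # Rule 2: the string names a pdist metric; keyword extras go to pdist.
        return lambda points: pdist(points, first, **kwargs)
    if callable(first):
        # Rule 3: a user-supplied function; extras are passed along to it.
        return lambda points: first(points, *rest, **kwargs)
    raise TypeError("expected no argument, a string, or a callable")


points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
default = resolve_metric()(points)
named = resolve_metric("minkowski", p=2)(points)
custom = resolve_metric(lambda pts: pdist(pts, "euclidean"))(points)
print(default)
```

The result is a reduced (condensed) distance matrix: a flat array with one entry per pair of points, so n points give n * (n - 1) / 2 distances.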

set_hierarchy_options(method='complete', optimal_ordering=False)[source]

Configure hierarchy building

Parameters
  • method – See reference on scipy.cluster.hierarchy.linkage

  • optimal_ordering – See reference on scipy.cluster.hierarchy.linkage
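As a rough illustration of what these two options control, the sketch below calls scipy.cluster.hierarchy.linkage directly with the default values listed above. The toy points are made up, and how clusterking passes the options on internally is an assumption here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Four toy points on a line, forming two obvious groups.
points = np.array([[0.0], [1.0], [10.0], [11.0]])
distances = pdist(points, "euclidean")  # reduced distance matrix

# Same defaults as set_hierarchy_options().
hierarchy = linkage(distances, method="complete", optimal_ordering=False)

# One row per merge: the two merged nodes, the merge distance, the cluster size.
print(hierarchy)
```

With method="complete", the distance between two clusters is the largest pairwise distance between their members, so the final merge here happens at distance 11.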

set_max_d(max_d) → None[source]

Set the cutoff value at which the hierarchy is cut to obtain the clusters. This corresponds to the t argument of scipy.cluster.hierarchy.fcluster.

Parameters

max_d – float

Returns

None
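To show the effect of the cutoff, the sketch below calls scipy.cluster.hierarchy.fcluster directly on a toy hierarchy. It assumes criterion="distance", so that t is read as a distance cutoff; whether clusterking uses this criterion by default is an assumption, and the points are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Four toy points on a line, forming two obvious groups.
points = np.array([[0.0], [1.0], [10.0], [11.0]])
hierarchy = linkage(points, method="complete")

# A cutoff between the small merge distances (1) and the final one (11)
# separates the two groups.
max_d = 5.0
clusters = fcluster(hierarchy, t=max_d, criterion="distance")

# A cutoff above the final merge distance puts every point in one cluster.
single = fcluster(hierarchy, t=20.0, criterion="distance")
print(clusters, single)
```

A smaller max_d cuts the hierarchy lower and yields more, smaller clusters; a larger one yields fewer, bigger clusters.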

set_fcluster_options(**kwargs) → None[source]

Set additional keyword options for our call to scipy.cluster.hierarchy.fcluster.

Parameters

kwargs – Keyword arguments

Returns

None

run(data, reuse_hierarchy_from: Optional[clusterking.cluster.hierarchy_cluster.HierarchyClusterResult] = None)[source]
Parameters

Returns:

class clusterking.cluster.HierarchyClusterResult(data, md, clusters, hierarchy, worker_id)[source]

Bases: clusterking.cluster.cluster.ClusterResult

__init__(data, md, clusters, hierarchy, worker_id)[source]
property hierarchy
property worker_id

ID of the HierarchyCluster worker that generated this object.

property data_id: int

ID of the data object that the HierarchyCluster worker was run on.

dendrogram(output: Union[None, str, pathlib.Path] = None, ax=None, show=False, **kwargs)[source]

Creates a dendrogram.

Parameters
  • output – If supplied, the dendrogram is saved to this path

  • ax – An axes object if you want to add the dendrogram to an existing axes rather than creating a new one

  • show – If True, the dendrogram is shown in a viewer.

  • **kwargs – Additional keyword options to scipy.cluster.hierarchy.dendrogram

Returns

The matplotlib.pyplot.Axes object
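The kind of keyword options that **kwargs can carry is illustrated below by calling scipy.cluster.hierarchy.dendrogram directly. The toy hierarchy and the labels are made up, and no_plot=True is used only so that the sketch computes the layout without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Four toy points on a line, forming two obvious groups.
points = np.array([[0.0], [1.0], [10.0], [11.0]])
hierarchy = linkage(points, method="complete")

# labels and no_plot are regular scipy dendrogram options.
layout = dendrogram(hierarchy, labels=["a", "b", "c", "d"], no_plot=True)

# "ivl" holds the leaf labels in the order they appear along the axis.
leaf_labels = layout["ivl"]
print(leaf_labels)
```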

KmeansCluster

class clusterking.cluster.KmeansCluster[source]

Bases: clusterking.cluster.cluster.Cluster

K-means clustering as implemented in sklearn.cluster.

Example:

import clusterking as ck
d = ck.Data("/path/to/data.sql")    # Load some data
c = ck.cluster.KmeansCluster()      # Init worker class
c.set_kmeans_options(n_clusters=5)  # Set options for clustering
r = c.run(d)                        # Perform clustering on data
r.write()                           # Write results back to data
__init__()[source]
set_kmeans_options(**kwargs) → None[source]

Configure the clustering algorithm.

Parameters

**kwargs – Keyword arguments to sklearn.cluster.KMeans().
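Since the keyword arguments are those of sklearn.cluster.KMeans, the sketch below uses KMeans directly on made-up points to show typical options. The kmeans_options dictionary is only an illustration of what could be given to set_kmeans_options().

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups in two dimensions.
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# n_clusters, n_init and random_state are regular KMeans parameters.
kmeans_options = {"n_clusters": 2, "n_init": 10, "random_state": 0}
labels = KMeans(**kmeans_options).fit_predict(points)
print(labels)
```

Fixing random_state makes the result reproducible; which group receives label 0 and which receives label 1 is arbitrary.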

run(data) → clusterking.cluster.kmeans_cluster.KmeansClusterResult[source]
class clusterking.cluster.KmeansClusterResult(data, md, clusters)[source]

Bases: clusterking.cluster.cluster.ClusterResult