Cluster
Cluster#
This subpackage provides classes to perform the actual clustering.
Different clustering algorithms correspond to different subclasses of the base class clusterking.cluster.Cluster (and inherit all of its methods).
Currently implemented:
HierarchyCluster: Hierarchical clustering (https://en.wikipedia.org/wiki/Hierarchical_clustering/)
KmeansCluster: Kmeans clustering (https://en.wikipedia.org/wiki/K-means_clustering/)
Cluster
#
- class clusterking.cluster.Cluster[source]#
Bases: clusterking.worker.DataWorker
Abstract base class of the Cluster classes. It defines common functions and is subclassed to implement specific clustering algorithms.
- md#
Metadata
HierarchyCluster
#
- class clusterking.cluster.HierarchyCluster[source]#
Bases: clusterking.cluster.cluster.Cluster
- property max_d: Optional[float]#
Cutoff value set in set_max_d().
- property metric: Callable#
Metric that was set in set_metric() (a function that takes a Data object as its only parameter and returns a reduced distance matrix).
- set_metric(*args, **kwargs) None [source]#
Select a metric in one of the following ways:
If no positional arguments are given, the Euclidean metric is used.
If the first positional argument is a string, the metric of that name defined in scipy.spatial.distance.pdist is used (all additional arguments are passed to this function).
If the first positional argument is a function, this function is used (and all additional arguments are passed to it).
Examples:
...(): Euclidean metric
...("euclidean"): Also Euclidean metric
...(lambda data: scipy.spatial.distance.pdist(data.data(), 'euclidean')): Also Euclidean metric
...("minkowski", p=2): Minkowski distance with p=2
See https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html for more information.
- Parameters
*args – see description above
**kwargs – see description above
- Returns
Function that takes Data object as only parameter and returns a reduced distance matrix.
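The three ways of selecting a metric can be pictured with a small standard-library sketch. It is not the clusterking implementation: ToyData, NAMED_METRICS, condensed and make_metric are hypothetical names introduced here, and math.dist stands in for scipy.spatial.distance.pdist so that the example runs without SciPy. The "reduced distance matrix" is taken to be the flat list of pairwise distances in the order (0,1), (0,2), ..., (n-2,n-1).

```python
import math
from itertools import combinations


class ToyData:
    """Hypothetical stand-in for a Data object; only data() is modelled."""

    def __init__(self, rows):
        self._rows = rows

    def data(self):
        return self._rows


def minkowski(a, b, p=2):
    """Minkowski distance of order p between two points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


# Assumption: a tiny registry standing in for the metric names pdist accepts.
NAMED_METRICS = {"euclidean": math.dist, "minkowski": minkowski}


def condensed(rows, dist):
    """Pairwise distances in reduced order: (0,1), (0,2), ..., (n-2,n-1)."""
    return [dist(a, b) for a, b in combinations(rows, 2)]


def make_metric(*args, **kwargs):
    """Return a function that maps a data object to a reduced distance matrix."""
    if not args:
        # No positional argument: fall back to the Euclidean metric.
        return lambda data: condensed(data.data(), math.dist)
    first, rest = args[0], args[1:]
    if isinstance(first, str):
        # A string selects a named metric; extra arguments are forwarded to it.
        base = NAMED_METRICS[first]
        return lambda data: condensed(
            data.data(), lambda a, b: base(a, b, *rest, **kwargs)
        )
    if callable(first):
        # A function is used directly; extra arguments are forwarded to it.
        return lambda data: first(data, *rest, **kwargs)
    raise TypeError("first argument must be a string or a callable")


toy = ToyData([[0, 0], [3, 4], [6, 8]])
print(make_metric()(toy))                  # [5.0, 10.0, 5.0]
print(make_metric("minkowski", p=1)(toy))  # [7.0, 14.0, 7.0]
```

For three points the reduced matrix has three entries. In general it has n(n-1)/2 entries, which is the input format that SciPy's hierarchy functions expect.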
- set_hierarchy_options(method='complete', optimal_ordering=False)[source]#
Configure hierarchy building
- Parameters
method – See reference on scipy.cluster.hierarchy.linkage
optimal_ordering – See reference on scipy.cluster.hierarchy.linkage
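The method option decides how the distance between two clusters is derived from the distances between their members. The sketch below illustrates three of the methods that scipy.cluster.hierarchy.linkage accepts ("single", "complete" and "average") using only the standard library. The helper cluster_distance is a hypothetical name and is not part of clusterking or SciPy.

```python
import math


def cluster_distance(a, b, method="complete"):
    """Distance between two clusters of points under a given linkage method."""
    pair_distances = [math.dist(p, q) for p in a for q in b]
    if method == "single":
        # Nearest pair of members.
        return min(pair_distances)
    if method == "complete":
        # Farthest pair of members.
        return max(pair_distances)
    if method == "average":
        # Mean over all pairs of members.
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"method not covered by this sketch: {method}")


left = [(0, 0), (1, 0)]
right = [(3, 0), (5, 0)]
for name in ("single", "complete", "average"):
    print(name, cluster_distance(left, right, name))
```

With the default method='complete', two clusters only count as close when all of their members are close to each other, which tends to produce compact clusters.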
- set_max_d(max_d) None [source]#
Set the cutoff value of the hierarchy that then gives the clusters. This corresponds to the t argument of scipy.cluster.hierarchy.fcluster.
- Parameters
max_d – float
- Returns
None
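To illustrate how a cutoff turns a hierarchy into flat clusters, here is a naive standard-library sketch. It assumes complete linkage and a distance-based cutoff, which is only one of the criteria that scipy.cluster.hierarchy.fcluster supports. The function complete_linkage_clusters is a hypothetical name, and the quadratic search is for readability rather than speed.

```python
import math
from itertools import combinations


def complete_linkage_clusters(points, max_d):
    """Merge clusters bottom-up and stop once the closest pair exceeds max_d."""
    clusters = [[i] for i in range(len(points))]

    def linkage_distance(a, b):
        # Complete linkage: distance between the farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage_distance(clusters[i], clusters[j]) > max_d:
            break  # every remaining merge would exceed the cutoff
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid

    # Convert the cluster membership lists into one label per point.
    labels = [0] * len(points)
    for label, members in enumerate(clusters, start=1):
        for member in members:
            labels[member] = label
    return labels


points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(complete_linkage_clusters(points, max_d=2))    # [1, 1, 2, 2]
print(complete_linkage_clusters(points, max_d=0.5))  # [1, 2, 3, 4]
```

A small max_d leaves every point in its own cluster, while a large one merges everything into a single cluster.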
- set_fcluster_options(**kwargs) None [source]#
Set additional keyword options for our call to scipy.cluster.hierarchy.fcluster.
- Parameters
kwargs – Keyword arguments
- Returns
None
- run(data, reuse_hierarchy_from: Optional[clusterking.cluster.hierarchy_cluster.HierarchyClusterResult] = None)[source]#
- Parameters
data –
reuse_hierarchy_from – Reuse the hierarchy from a HierarchyClusterResult object.
Returns:
- class clusterking.cluster.HierarchyClusterResult(data, md, clusters, hierarchy, worker_id)[source]#
Bases: clusterking.cluster.cluster.ClusterResult
- property hierarchy#
- property worker_id#
ID of the HierarchyCluster worker that generated this object.
- dendrogram(output: Union[None, str, pathlib.Path] = None, ax=None, show=False, **kwargs)[source]#
Creates dendrogram
- Parameters
output – If supplied, we save the dendrogram there
ax – An axes object if you want to add the dendrogram to an existing axes rather than creating a new one
show – If true, the dendrogram is shown in a viewer.
**kwargs – Additional keyword options to scipy.cluster.hierarchy.dendrogram
- Returns
The matplotlib.pyplot.Axes object
KmeansCluster
#
- class clusterking.cluster.KmeansCluster[source]#
Bases: clusterking.cluster.cluster.Cluster
Kmeans clustering (https://en.wikipedia.org/wiki/K-means_clustering/) as implemented in sklearn.cluster.
Example:
    import clusterking as ck
    d = ck.Data("/path/to/data.sql")      # Load some data
    c = ck.cluster.KmeansCluster()        # Init worker class
    c.set_kmeans_options(n_clusters=5)    # Set options for clustering
    r = c.run(d)                          # Perform clustering on data
    r.write()                             # Write results back to data
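For readers unfamiliar with the algorithm, the following standard-library sketch shows the basic k-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its members). It is not how sklearn.cluster implements it: the function kmeans_sketch is a hypothetical name, and seeding the centroids with the first n_clusters points is a simplifying assumption (scikit-learn uses k-means++ initialisation by default).

```python
import math


def kmeans_sketch(points, n_clusters, n_iter=10):
    """Plain k-means iteration with deterministic (first-k) initial centroids."""
    centroids = [tuple(p) for p in points[:n_clusters]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


points = [(0, 0), (10, 10), (0, 1), (10, 11)]
labels, centroids = kmeans_sketch(points, n_clusters=2)
print(labels)     # [0, 1, 0, 1]
print(centroids)  # [(0.0, 0.5), (10.0, 10.5)]
```

The n_clusters argument plays the same role as the n_clusters option passed to set_kmeans_options in the example above.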