Cluster
This subpackage provides the classes that perform the actual clustering.
Each clustering algorithm is implemented as a subclass of the
base class clusterking.cluster.Cluster
and inherits all of its methods.
Currently implemented:
- HierarchyCluster: Hierarchical clustering (https://en.wikipedia.org/wiki/Hierarchical_clustering/)
- KmeansCluster: Kmeans clustering (https://en.wikipedia.org/wiki/K-means_clustering/)
Cluster
- class clusterking.cluster.Cluster
Bases: clusterking.worker.DataWorker
Abstract base class of the Cluster classes. This class is subclassed to implement specific clustering algorithms and defines common functions.
md = None
Metadata
HierarchyCluster
- class clusterking.cluster.HierarchyCluster
Bases: clusterking.cluster.cluster.Cluster
max_d
Cutoff value set in set_max_d().
metric
Metric that was set in set_metric() (a function that takes a Data object as its only parameter and returns a reduced distance matrix).
set_metric(*args, **kwargs) → None
Select a metric in one of the following ways:
- If no positional arguments are given, the Euclidean metric is used.
- If the first positional argument is a string, the metric of that name defined in scipy.spatial.distance.pdist is used (all additional arguments are passed to this function).
- If the first positional argument is a function, this function is used (and all additional arguments are passed to it).
Examples:
- ...(): Euclidean metric
- ...("euclidean"): Also Euclidean metric
- ...(lambda data: scipy.spatial.distance.pdist(data.data(), 'euclidean')): Also Euclidean metric
- ...("minkowski", p=2): Minkowski distance with p=2
See https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html for more information.
Parameters:
- *args – see description above
- **kwargs – see description above
Returns: Function that takes Data object as only parameter and returns a reduced distance matrix.
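The three selection rules described for set_metric can be sketched with plain SciPy. This is only an illustration of the dispatch logic, not the library's actual implementation: resolve_metric is a hypothetical helper, and it works on a NumPy array of points instead of a clusterking Data object.

```python
import numpy as np
from scipy.spatial.distance import pdist


def resolve_metric(*args, **kwargs):
    """Hypothetical helper mirroring the three cases described for set_metric.

    Returns a function that maps an (n_points, n_dims) array to a
    condensed distance matrix.
    """
    if not args:
        # Case 1: no positional argument, so fall back to the Euclidean metric.
        return lambda points: pdist(points, metric="euclidean")
    first, rest = args[0], args[1:]
    if isinstance(first, str):
        # Case 2: a string names one of the metrics known to pdist;
        # keyword arguments (e.g. p=2) are forwarded to pdist.
        return lambda points: pdist(points, metric=first, **kwargs)
    if callable(first):
        # Case 3: a user-supplied function; extra arguments are passed on to it.
        return lambda points: first(points, *rest, **kwargs)
    raise TypeError("first argument must be a string or a callable")


# Three collinear points with pairwise distances 5, 10 and 5.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

d_default = resolve_metric()(points)
d_named = resolve_metric("euclidean")(points)
d_minkowski = resolve_metric("minkowski", p=2)(points)
d_custom = resolve_metric(lambda x: pdist(x, "euclidean"))(points)
```

All four calls describe the same metric, so the four resulting distance vectors should agree.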
set_hierarchy_options(method='complete', optimal_ordering=False)
Configure hierarchy building.
Parameters:
- method – See the reference for scipy.cluster.hierarchy.linkage
- optimal_ordering – See the reference for scipy.cluster.hierarchy.linkage
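To give a feeling for what these two options control, here is a small sketch that calls scipy.cluster.hierarchy.linkage directly. The toy points are made up for illustration, and the sketch assumes that the options are forwarded to linkage unchanged, which the parameter descriptions suggest but do not state.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Four points on a line forming two obvious pairs: {0, 1} and {5, 6}.
points = np.array([[0.0], [1.0], [5.0], [6.0]])
condensed = pdist(points, "euclidean")

# The documented defaults: complete linkage, no optimal leaf ordering.
hierarchy = linkage(condensed, method="complete", optimal_ordering=False)

# A different linkage method changes the merge distances.
hierarchy_single = linkage(condensed, method="single")

# Column 2 of the last row holds the distance of the final merge.
final_complete = hierarchy[-1, 2]       # largest distance between the pairs
final_single = hierarchy_single[-1, 2]  # smallest distance between the pairs
```

With complete linkage the two pairs are merged at their largest mutual distance, with single linkage at their smallest, so the choice of method directly affects where a later cutoff separates clusters.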
set_max_d(max_d) → None
Set the cutoff value of the hierarchy, which determines the clusters. This corresponds to the t argument of scipy.cluster.hierarchy.fcluster.
Parameters:
- max_d – float
Returns: None
set_fcluster_options(**kwargs) → None
Set additional keyword options for our call to scipy.cluster.hierarchy.fcluster.
Parameters:
- kwargs – Keyword arguments
Returns: None
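The interplay of the cutoff value and the additional fcluster options can be illustrated with SciPy alone. In this sketch the values 2.0 and 10.0 play the role of max_d, and criterion="distance" stands in for a keyword option; both the toy data and the choice of criterion are assumptions made for the example, not settings taken from clusterking.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Four points on a line forming two obvious pairs: {0, 1} and {5, 6}.
points = np.array([[0.0], [1.0], [5.0], [6.0]])
hierarchy = linkage(pdist(points), method="complete")

# Extra keyword options for fcluster (assumed here for illustration).
fcluster_options = {"criterion": "distance"}

# A small cutoff keeps the two pairs apart ...
tight = fcluster(hierarchy, t=2.0, **fcluster_options)
# ... while a large cutoff merges everything into one cluster.
loose = fcluster(hierarchy, t=10.0, **fcluster_options)

n_tight = len(set(tight.tolist()))
n_loose = len(set(loose.tolist()))
```

With the distance criterion, points end up in the same cluster when they are joined in the hierarchy at a distance of at most t, so lowering the cutoff yields more and smaller clusters.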
run(data, reuse_hierarchy_from: Optional[clusterking.cluster.hierarchy_cluster.HierarchyClusterResult] = None)
Parameters:
- data –
- reuse_hierarchy_from – Reuse the hierarchy from a HierarchyClusterResult object.
Returns:
- class clusterking.cluster.HierarchyClusterResult(data, md, clusters, hierarchy, worker_id)
Bases: clusterking.cluster.cluster.ClusterResult
hierarchy
worker_id
ID of the HierarchyCluster worker that generated this object.
data_id
ID of the data object that the HierarchyCluster worker was run on.
dendrogram(output: Union[None, str, pathlib.Path] = None, ax=None, show=False, **kwargs)
Creates a dendrogram.
Parameters:
- output – If supplied, we save the dendrogram there
- ax – An axes object if you want to add the dendrogram to an existing axes rather than creating a new one
- show – If true, the dendrogram is shown in a viewer.
- **kwargs – Additional keyword options to scipy.cluster.hierarchy.dendrogram
Returns: The matplotlib.pyplot.Axes object
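As a rough sketch of how the parameters of dendrogram might fit together, the following standalone function wraps scipy.cluster.hierarchy.dendrogram. draw_dendrogram is a hypothetical stand-in written for this example, not the library's code; the toy data, the output path and the color_threshold value are likewise assumptions.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist


def draw_dendrogram(hierarchy, output=None, ax=None, show=False, **kwargs):
    """Hypothetical stand-in that follows the documented parameters."""
    if ax is None:
        # No axes supplied: create a new figure with a single axes.
        _, ax = plt.subplots()
    # Extra keyword options go straight to SciPy's dendrogram.
    dendrogram(hierarchy, ax=ax, **kwargs)
    if output is not None:
        ax.figure.savefig(output)
    if show:
        plt.show()
    return ax


points = np.array([[0.0], [1.0], [5.0], [6.0]])
hierarchy = linkage(pdist(points), method="complete")

out_path = os.path.join(tempfile.mkdtemp(), "dendrogram.png")
axes = draw_dendrogram(hierarchy, output=out_path, color_threshold=2.0)
```

Passing an existing axes via ax would draw the dendrogram into that axes, for example one panel of a larger figure, instead of creating a new figure.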
KmeansCluster
- class clusterking.cluster.KmeansCluster
Bases: clusterking.cluster.cluster.Cluster
Kmeans clustering (https://en.wikipedia.org/wiki/K-means_clustering/) as implemented in sklearn.cluster.
Example:
    import clusterking as ck
    d = ck.Data("/path/to/data.sql")     # Load some data
    c = ck.cluster.KmeansCluster()       # Init worker class
    c.set_kmeans_options(n_clusters=5)   # Set options for clustering
    r = c.run(d)                         # Perform clustering on data
    r.write()                            # Write results back to data