Cluster
This subpackage provides classes to perform the actual clustering.
Different clustering algorithms correspond to different subclasses of the
base class clusterking.cluster.Cluster (and inherit all of its
methods).
Currently implemented:
- HierarchyCluster: Hierarchical clustering (https://en.wikipedia.org/wiki/Hierarchical_clustering/)
- KmeansCluster: Kmeans clustering (https://en.wikipedia.org/wiki/K-means_clustering/)
Cluster
- class clusterking.cluster.Cluster
Bases: clusterking.worker.DataWorker
Abstract base class of the Cluster classes. This class is subclassed to implement specific clustering algorithms and defines common functions.
md = None
Metadata
HierarchyCluster
- class clusterking.cluster.HierarchyCluster
Bases: clusterking.cluster.cluster.Cluster
max_d
Cutoff value set in set_max_d().
metric
Metric that was set in set_metric() (a function that takes a Data object as its only parameter and returns a reduced distance matrix).
set_metric(*args, **kwargs) → None
Select a metric in one of the following ways:
- If no positional arguments are given, the Euclidean metric is chosen.
- If the first positional argument is a string, the metric of that name defined in scipy.spatial.distance.pdist is used (all additional arguments are passed to this function).
- If the first positional argument is a function, this function is used (and all additional arguments are passed to it).
Examples:
- ...(): Euclidean metric
- ...("euclidean"): Also Euclidean metric
- ...(lambda data: scipy.spatial.distance.pdist(data.data(), 'euclidean')): Also Euclidean metric
- ...("minkowski", p=2): Minkowski distance with p=2.
See https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html for more information.
Parameters:
- *args – see description above
- **kwargs – see description above
Returns: Function that takes Data object as only parameter and returns a reduced distance matrix.
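The three selection rules above can be sketched with SciPy alone. The helper name resolve_metric is hypothetical and is not part of clusterking. As a simplifying assumption, the returned function takes a plain NumPy array rather than a Data object, and in the string case only keyword arguments are forwarded to pdist.

```python
import numpy as np
from scipy.spatial.distance import pdist


def resolve_metric(*args, **kwargs):
    """Hypothetical helper (not clusterking's code) mirroring the rules above.

    Returns a function that maps an (n_points, n_features) array to a
    condensed ("reduced") distance matrix with n * (n - 1) / 2 entries.
    """
    if not args:
        # Rule 1: no positional arguments, fall back to the Euclidean metric.
        return lambda data: pdist(data, "euclidean")
    first, rest = args[0], args[1:]
    if isinstance(first, str):
        # Rule 2: a metric name understood by scipy.spatial.distance.pdist.
        # Assumption: this sketch forwards keyword arguments only.
        return lambda data: pdist(data, first, **kwargs)
    if callable(first):
        # Rule 3: a user-supplied function; extra arguments are handed to it.
        return lambda data: first(data, *rest, **kwargs)
    raise TypeError("First argument must be a string or a callable.")


points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

euclidean = resolve_metric()(points)
minkowski = resolve_metric("minkowski", p=2)(points)
cityblock = resolve_metric(lambda data: pdist(data, "cityblock"))(points)

# Three points give 3 * 2 / 2 = 3 pairwise distances.
print(euclidean.shape)
```

The condensed form lists the distances for the pairs (0, 1), (0, 2), (1, 2) in that order, which is the layout that scipy.cluster.hierarchy.linkage accepts.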
set_hierarchy_options(method='complete', optimal_ordering=False)
Configure hierarchy building.
Parameters:
- method – See the reference for scipy.cluster.hierarchy.linkage
- optimal_ordering – See the reference for scipy.cluster.hierarchy.linkage
set_max_d(max_d) → None
Set the cutoff value of the hierarchy that then gives the clusters. This corresponds to the t argument of scipy.cluster.hierarchy.fcluster.
Parameters:
- max_d – float
Returns: None
set_fcluster_options(**kwargs) → None
Set additional keyword options for our call to scipy.cluster.hierarchy.fcluster.
Parameters:
- kwargs – Keyword arguments
Returns: None
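To illustrate how the hierarchy options, the cutoff max_d and the fcluster options fit together, here is a minimal sketch that calls the underlying SciPy functions directly. It is not clusterking's implementation. The sample points are made up, and criterion="distance" is an assumption about how the cutoff is interpreted (SciPy's own default for fcluster is "inconsistent").

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Two well-separated groups of made-up 2D points.
points = np.array(
    [
        [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    ]
)

# Condensed distance matrix, as a metric function would return it.
distances = pdist(points, "euclidean")

# Build the hierarchy with the defaults listed for set_hierarchy_options.
hierarchy = linkage(distances, method="complete", optimal_ordering=False)

# The cutoff plays the role of the t argument of fcluster.
max_d = 1.0
# Assumption: the cutoff is read as a distance threshold.
labels = fcluster(hierarchy, t=max_d, criterion="distance")

print(labels)
```

With this threshold, points closer than max_d end up in the same cluster, so a larger max_d yields fewer, coarser clusters.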
run(data, reuse_hierarchy_from: Optional[clusterking.cluster.hierarchy_cluster.HierarchyClusterResult] = None)
Parameters:
- data –
- reuse_hierarchy_from – Reuse the hierarchy from a HierarchyClusterResult object.
Returns:
- class clusterking.cluster.HierarchyClusterResult(data, md, clusters, hierarchy, worker_id)
Bases: clusterking.cluster.cluster.ClusterResult
hierarchy
worker_id
ID of the HierarchyCluster worker that generated this object.
data_id
ID of the data object that the HierarchyCluster worker was run on.
dendrogram(output: Union[None, str, pathlib.Path] = None, ax=None, show=False, **kwargs)
Creates a dendrogram.
Parameters:
- output – If supplied, we save the dendrogram there
- ax – An axes object if you want to add the dendrogram to an existing axes rather than creating a new one
- show – If true, the dendrogram is shown in a viewer.
- **kwargs – Additional keyword options to scipy.cluster.hierarchy.dendrogram
Returns: The matplotlib.pyplot.Axes object
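The roles of output, ax, show and **kwargs can be sketched with SciPy and matplotlib directly. The function plot_dendrogram below is a hypothetical stand-in, not the method documented above, and it takes a linkage matrix as its first argument instead of reading it from a result object. The non-interactive "Agg" backend is selected only so that the sketch runs without a display.

```python
from pathlib import Path
from typing import Optional, Union

import matplotlib

matplotlib.use("Agg")  # Headless backend; drop this line for interactive use.

import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage


def plot_dendrogram(
    hierarchy: np.ndarray,
    output: Optional[Union[str, Path]] = None,
    ax=None,
    show: bool = False,
    **kwargs,
):
    """Hypothetical sketch: draw a linkage matrix and return the axes."""
    if ax is None:
        # No axes supplied, so create a new figure with a single axes.
        _, ax = plt.subplots()
    # scipy's dendrogram accepts an existing axes through its ax keyword;
    # all remaining keyword options are forwarded unchanged.
    dendrogram(hierarchy, ax=ax, **kwargs)
    if output is not None:
        ax.figure.savefig(output)
    if show:
        plt.show()
    return ax


# Made-up one-dimensional observations, clustered with complete linkage.
points = np.array([[0.0], [0.2], [4.0], [4.3]])
hierarchy = linkage(points, method="complete")
ax = plot_dendrogram(hierarchy, labels=["a", "b", "c", "d"])
```

Passing an existing axes through ax makes it possible to place the dendrogram next to other plots in the same figure.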
KmeansCluster
- class clusterking.cluster.KmeansCluster
Bases: clusterking.cluster.cluster.Cluster
Kmeans clustering (https://en.wikipedia.org/wiki/K-means_clustering/) as implemented in sklearn.cluster.
Example:
    import clusterking as ck
    d = ck.Data("/path/to/data.sql")      # Load some data
    c = ck.cluster.KmeansCluster()        # Init worker class
    c.set_kmeans_options(n_clusters=5)    # Set options for clustering
    r = c.run(d)                          # Perform clustering on data
    r.write()                             # Write results back to data