sklearn.cluster.AgglomerativeClustering
-
class sklearn.cluster.AgglomerativeClustering(n_clusters=2, *, affinity='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', distance_threshold=None, compute_distances=False)
-
Agglomerative Clustering
Recursively merges the pair of clusters that minimally increases a given linkage distance.
Read more in the User Guide.
- Parameters
-
-
n_clusters : int or None, default=2
-
The number of clusters to find. It must be None if distance_threshold is not None.
-
affinity : str or callable, default=’euclidean’
-
Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. If linkage is “ward”, only “euclidean” is accepted. If “precomputed”, a distance matrix (instead of a similarity matrix) is needed as input for the fit method.
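
As a minimal sketch of the “precomputed” case, the snippet below builds a Manhattan distance matrix by hand and passes it to fit_predict. The toy data and the choice of “average” linkage are illustrative. The keyword lookup is an assumption added for portability: releases newer than the one documented here may expose this parameter as metric rather than affinity, so the code picks whichever name the installed version accepts.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated groups of 2-D points (illustrative data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Pairwise Manhattan (l1) distances, shape (n_samples, n_samples).
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)

# Assumption: newer releases may name this keyword `metric` instead of
# `affinity`; use whichever one the installed constructor accepts.
init_params = inspect.signature(AgglomerativeClustering).parameters
metric_kw = "metric" if "metric" in init_params else "affinity"

model = AgglomerativeClustering(
    n_clusters=2,
    linkage="average",  # "ward" only accepts euclidean input
    **{metric_kw: "precomputed"},
)
labels = model.fit_predict(D)  # D is a distance matrix, not a similarity matrix
```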
-
memory : str or object with the joblib.Memory interface, default=None
-
Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.
-
connectivity : array-like or callable, default=None
-
Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix, such as one derived from kneighbors_graph. Default is None, i.e., the hierarchical clustering algorithm is unstructured.
-
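
A connectivity matrix derived from kneighbors_graph might be used roughly as follows. This is a sketch on illustrative toy data; n_neighbors=2 is an arbitrary choice. With data this well separated the neighbor graph has two disconnected parts, in which case scikit-learn is expected to warn and complete the graph before building the tree.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Two well-separated groups of 2-D points (illustrative data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Sparse (n_samples, n_samples) graph linking each sample to its
# 2 nearest neighbors; only connected clusters are candidates for merging.
connectivity = kneighbors_graph(X, n_neighbors=2, include_self=False)

model = AgglomerativeClustering(n_clusters=2, connectivity=connectivity)
labels = model.fit_predict(X)
```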
compute_full_tree : ‘auto’ or bool, default=’auto’
-
Stop the construction of the tree early at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None. By default compute_full_tree is “auto”, which is equivalent to True when distance_threshold is not None or when n_clusters is less than the maximum of 100 and 0.02 * n_samples. Otherwise, “auto” is equivalent to False.
-
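
The “auto” rule described above can be written out as a small function. This is only a restatement of the documented rule, not the library's code; the helper name resolve_compute_full_tree is hypothetical.

```python
def resolve_compute_full_tree(compute_full_tree, n_clusters, n_samples,
                              distance_threshold=None):
    """Hypothetical helper mirroring the documented "auto" rule."""
    if distance_threshold is not None:
        # The full tree is mandatory whenever a distance threshold is used.
        if compute_full_tree is False:
            raise ValueError(
                "compute_full_tree must be True if distance_threshold is set"
            )
        return True
    if compute_full_tree == "auto":
        # True when n_clusters is below max(100, 0.02 * n_samples).
        return n_clusters < max(100, 0.02 * n_samples)
    return bool(compute_full_tree)


few_clusters = resolve_compute_full_tree("auto", n_clusters=2, n_samples=1000)
many_clusters = resolve_compute_full_tree("auto", n_clusters=150, n_samples=1000)
large_dataset = resolve_compute_full_tree("auto", n_clusters=150, n_samples=10000)
with_threshold = resolve_compute_full_tree(
    "auto", n_clusters=None, n_samples=50, distance_threshold=1.5
)
```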
linkage : {‘ward’, ‘complete’, ‘average’, ‘single’}, default=’ward’
-
Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion.
- ‘ward’ minimizes the variance of the clusters being merged.
- ‘average’ uses the average of the distances of each observation of the two sets.
- ‘complete’ or ‘maximum’ linkage uses the maximum distances between all observations of the two sets.
- ‘single’ uses the minimum of the distances between all observations of the two sets.
New in version 0.20: Added the ‘single’ option
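
To make the difference between the criteria concrete, the sketch below computes the ‘single’, ‘complete’ and ‘average’ distances between two small sets of points in plain Python. The function names and the two toy sets are illustrative and not part of scikit-learn; ‘ward’ is omitted because it is defined through cluster variance rather than pairwise distances.

```python
import math
from itertools import product


def pairwise_distances(a, b):
    """Euclidean distance for every pair (one point from a, one from b)."""
    return [math.dist(p, q) for p, q in product(a, b)]


def single_linkage(a, b):
    # Minimum of the distances between all observations of the two sets.
    return min(pairwise_distances(a, b))


def complete_linkage(a, b):
    # Maximum of the distances between all observations of the two sets.
    return max(pairwise_distances(a, b))


def average_linkage(a, b):
    # Average of the distances between all observations of the two sets.
    d = pairwise_distances(a, b)
    return sum(d) / len(d)


cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

d_single = single_linkage(cluster_a, cluster_b)
d_complete = complete_linkage(cluster_a, cluster_b)
d_average = average_linkage(cluster_a, cluster_b)
```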
-
distance_threshold : float, default=None
-
The linkage distance threshold above which clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True.
New in version 0.21.
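
A minimal sketch of clustering by threshold instead of by a fixed cluster count. The data and the threshold value 5.0 are illustrative; the threshold is chosen to sit well between the small within-group merge distances and the large between-group one.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated groups of 2-D points (illustrative data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# n_clusters must be None when distance_threshold is given.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0)
model.fit(X)

n_found = model.n_clusters_  # number of clusters implied by the threshold
labels = model.labels_
```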
-
compute_distances : bool, default=False
-
Computes distances between clusters even if distance_threshold is not used. This can be used to make dendrogram visualization, but introduces a computational and memory overhead.
New in version 0.24.
-
- Attributes
-
-
n_clusters_ : int
-
The number of clusters found by the algorithm. If distance_threshold=None, it will be equal to the given n_clusters.
-
labels_ : ndarray of shape (n_samples,)
-
Cluster labels for each point.
-
n_leaves_ : int
-
Number of leaves in the hierarchical tree.
-
n_connected_components_ : int
-
The estimated number of connected components in the graph.
New in version 0.21: n_connected_components_ was added to replace n_components_.
-
children_ : array-like of shape (n_samples-1, 2)
-
The children of each non-leaf node. Values less than n_samples correspond to leaves of the tree which are the original samples. A node i greater than or equal to n_samples is a non-leaf node and has children children_[i - n_samples]. Alternatively, at the i-th iteration, children_[i][0] and children_[i][1] are merged to form node n_samples + i.
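
The node numbering can be followed with a short recursive walk. The sketch below uses a hand-written children array for four samples rather than a fitted estimator; the helper name leaves_under is hypothetical.

```python
def leaves_under(node, children, n_samples):
    """Return the sorted original sample indices contained in `node`."""
    if node < n_samples:
        # Values below n_samples are leaves, i.e. original samples.
        return [node]
    left, right = children[node - n_samples]
    return sorted(
        leaves_under(left, children, n_samples)
        + leaves_under(right, children, n_samples)
    )


# Hypothetical children_ array for 4 samples (written by hand for illustration):
# iteration 0 merges samples 0 and 1 into node 4,
# iteration 1 merges samples 2 and 3 into node 5,
# iteration 2 merges nodes 4 and 5 into node 6 (the root).
n_samples = 4
children = [[0, 1], [2, 3], [4, 5]]

merged = {
    n_samples + i: leaves_under(n_samples + i, children, n_samples)
    for i in range(len(children))
}
```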
-
distances_ : array-like of shape (n_nodes-1,)
-
Distances between nodes in the corresponding place in children_. Only computed if distance_threshold is used or compute_distances is set to True.
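
For dendrogram plotting, children_ and distances_ are commonly combined with per-node sample counts into a SciPy-style linkage matrix. The sketch below shows one way to do this, assuming the resulting array would then be handed to a plotting routine such as scipy.cluster.hierarchy.dendrogram (not called here). The data is illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated groups of 2-D points (illustrative data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# compute_distances=True populates distances_ without a distance_threshold.
model = AgglomerativeClustering(n_clusters=2, compute_distances=True).fit(X)

# Count how many original samples sit under each non-leaf node.
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, (left, right) in enumerate(model.children_):
    for child in (left, right):
        if child < n_samples:
            counts[i] += 1  # leaf node: one original sample
        else:
            counts[i] += counts[child - n_samples]  # earlier merge

# One row per merge: [left child, right child, distance, sample count].
linkage_matrix = np.column_stack([model.children_, model.distances_, counts])
```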
-
Examples
>>> from sklearn.cluster import AgglomerativeClustering
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clustering = AgglomerativeClustering().fit(X)
>>> clustering
AgglomerativeClustering()
>>> clustering.labels_
array([1, 1, 1, 0, 0, 0])
Methods
fit(X[, y])            Fit the hierarchical clustering from features, or distance matrix.
fit_predict(X[, y])    Fit the hierarchical clustering from features or distance matrix, and return cluster labels.
get_params([deep])     Get parameters for this estimator.
set_params(**params)   Set the parameters of this estimator.
-
fit(X, y=None)
-
Fit the hierarchical clustering from features, or distance matrix.
- Parameters
-
-
X : array-like of shape (n_samples, n_features) or (n_samples, n_samples)
-
Training instances to cluster, or distances between instances if affinity='precomputed'.
-
y : Ignored
-
Not used, present here for API consistency by convention.
-
- Returns
-
- self
-
fit_predict(X, y=None)
-
Fit the hierarchical clustering from features or distance matrix, and return cluster labels.
- Parameters
-
-
X : array-like of shape (n_samples, n_features) or (n_samples, n_samples)
-
Training instances to cluster, or distances between instances if affinity='precomputed'.
-
y : Ignored
-
Not used, present here for API consistency by convention.
-
- Returns
-
-
labels : ndarray of shape (n_samples,)
-
Cluster labels.
-
-
get_params(deep=True)
-
Get parameters for this estimator.
- Parameters
-
-
deep : bool, default=True
-
If True, will return the parameters for this estimator and contained subobjects that are estimators.
-
- Returns
-
-
params : dict
-
Parameter names mapped to their values.
-
-
set_params(**params)
-
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
-
-
**params : dict
-
Estimator parameters.
-
- Returns
-
-
self : estimator instance
-
Estimator instance.
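
As a brief sketch of both forms, the snippet below sets parameters directly on an estimator and then through a Pipeline using the <component>__<parameter> syntax. The step names "scale" and "cluster" are arbitrary labels chosen for this example.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simple estimator: parameters are set by name; set_params returns self.
model = AgglomerativeClustering()
model.set_params(n_clusters=3, linkage="complete")

# Nested object: address a step's parameter as <component>__<parameter>.
# The step names "scale" and "cluster" are arbitrary for this example.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("cluster", AgglomerativeClustering()),
])
pipe.set_params(cluster__n_clusters=4)

direct_params = model.get_params()
nested_params = pipe.get_params()  # deep=True also lists nested parameters
```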
-
© 2007–2020 The scikit-learn developers
Licensed under the 3-clause BSD License.
https://scikit-learn.org/0.24/modules/generated/sklearn.cluster.AgglomerativeClustering.html