Clustering overview

Clustering is an unsupervised machine learning technique that groups similar records together. It is useful when you want to understand what groups, or clusters, exist in your data but you don't have labeled data to train a model on. For example, if you had unlabeled data about subway ticket purchases, you could cluster that data by ticket purchase time to better understand which time periods have the heaviest subway usage. For more information, see What is clustering?

K-means models are widely used to perform clustering. You can use k-means models with the ML.PREDICT function to cluster data, or with the ML.DETECT_ANOMALIES function to perform anomaly detection.

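K-means models use centroid-based clustering to organize data into clusters: each cluster is represented by a central point (its centroid), and each record belongs to the cluster whose centroid is nearest. To get information about a k-means model's centroids, you can use the ML.CENTROIDS function.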
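To make the idea of centroid-based clustering concrete, the following is a minimal sketch in plain Python of the classic k-means loop (Lloyd's algorithm) applied to the subway example. It is an illustration only, not how the service implements k-means: the function name `kmeans_1d`, the deterministic initialization, and the `purchase_hours` values are all assumptions made up for this example.

```python
def kmeans_1d(values, k, max_iters=100):
    """Cluster 1-D values with Lloyd's algorithm; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic initialization (an assumption for this sketch):
    # pick k evenly spaced points from the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(max_iters):
        # Assignment step: each value joins the cluster with the nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its assigned values.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # Converged: centroids no longer move.
        centroids = new_centroids
    return centroids, labels


# Hypothetical ticket purchase times, expressed as hours of the day.
purchase_hours = [7.5, 8.0, 8.25, 8.5, 9.0, 12.0, 12.5, 13.0, 17.0, 17.5, 18.0, 18.5]
centroids, labels = kmeans_1d(purchase_hours, k=3)
print(centroids)  # [8.25, 12.5, 17.75]
print(labels)     # [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
```

In this made-up data, the three centroids land near the morning rush, midday, and the evening rush, which is the kind of summary a centroid listing is meant to give you.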

Recommended knowledge

By using the default settings in the CREATE MODEL statement and the inference functions, you can create and use a clustering model even without much ML knowledge. However, basic knowledge of ML development, and of clustering models in particular, helps you optimize both your data and your model to deliver better results. We recommend using the following resources to develop familiarity with ML techniques and processes: