The reconciliation (or clustering) confidence score measures how confident the clustering model is in its assignment of an entity to a cluster. You can use it to filter out predictions that the model is uncertain about and make decisions based on the remaining, confident outcomes.
How a confidence score is produced
Clustering produces hard assignments: each entity is assigned to exactly one cluster. The confidence score describes how confident the model is that an entity belongs to its assigned cluster, as a value in the range [0, 1]:
1.0 = very certain the entity belongs to its assigned cluster
0.0 = very uncertain the entity belongs to its assigned cluster
There is a notion of similarity/distance between any pair of entities. Entity pairs within a cluster are more likely to have lower distances than pairs that span different clusters. The further away an entity is from other members of its cluster, the lower the confidence value.
Other clusters also influence the confidence score. If there are other clusters close to an entity, its confidence is diminished according to the distances from those clusters.
The cluster density is related to the distances between all entity pairs of the cluster, and it also affects the confidence value. For an entity at a fixed distance from the cluster, the confidence value is higher when the cluster density is low (the cluster is widely spread) and lower when the cluster density is high (the cluster is tightly packed).
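The text above does not say how cluster density is measured. As a rough sketch only, the following assumes that the spread of a cluster can be approximated by the mean Euclidean distance over all of its entity pairs, so that a larger value means a lower density. The function name `mean_pairwise_distance` and the sample points are hypothetical and are not part of the product.

```python
import math
from itertools import combinations


def mean_pairwise_distance(points):
    """Average Euclidean distance over all pairs of points in one cluster.

    Assumption: used here only as a stand-in for cluster spread
    (a larger value means a lower density).
    """
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0  # a single-entity cluster has no pairs to measure
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


# Hypothetical 2-D embeddings for two clusters of four entities each.
tight_cluster = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
loose_cluster = [(3.0 * x, 3.0 * y) for x, y in tight_cluster]

tight_spread = mean_pairwise_distance(tight_cluster)
loose_spread = mean_pairwise_distance(loose_cluster)
print(f"tight: {tight_spread:.3f}, loose: {loose_spread:.3f}")
```

Under this assumed measure, the loose cluster has three times the spread of the tight one, because every pairwise distance is scaled by the same factor.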
For the reconciliation pipeline to scale to millions or billions of entities, the confidence score calculation uses randomized sampling to limit computational complexity. Because of this, confidence scores are bucketed into 0.1-sized intervals, and we recommend that you do not depend on exact confidence values when making review or human-in-the-loop decisions.
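One way to follow this recommendation is to compare scores by bucket rather than by exact value. The sketch below is illustrative only: the record fields (`entity_id`, `cluster_id`, `confidence`), the helper names, and the default threshold of bucket 7 are assumptions, not part of any documented output format.

```python
import math


def confidence_bucket(score):
    """Index of the 0.1-wide interval that contains score.

    Ranges from 0 to 10; index 10 is reached only by a score of 1.0.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {score}")
    # Round before flooring so float noise (e.g. 0.3 * 10 evaluating to
    # 3.0000000000000004) cannot push a score into a neighbouring bucket.
    return math.floor(round(score * 10, 6))


def split_by_confidence(predictions, min_bucket=7):
    """Split predictions into confident ones and ones that need review."""
    confident, needs_review = [], []
    for prediction in predictions:
        if confidence_bucket(prediction["confidence"]) >= min_bucket:
            confident.append(prediction)
        else:
            needs_review.append(prediction)
    return confident, needs_review


# Hypothetical clustering output.
predictions = [
    {"entity_id": "e1", "cluster_id": "A", "confidence": 0.9},
    {"entity_id": "e2", "cluster_id": "A", "confidence": 0.7},
    {"entity_id": "e3", "cluster_id": "B", "confidence": 0.6},
    {"entity_id": "e4", "cluster_id": "B", "confidence": 0.2},
]

confident, needs_review = split_by_confidence(predictions, min_bucket=7)
print([p["entity_id"] for p in confident])
print([p["entity_id"] for p in needs_review])
```

Choosing the threshold as a whole bucket keeps the decision stable when a score moves slightly within the same 0.1 interval.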
Diagram key
Use the following descriptions to understand the diagrams.
Description | Diagram |
---|---|
An entity. | |
A cluster of entities. An entity cluster is depicted by a circle, and the cluster spread is represented by the size of the circle. | |
Multiple entity clusters. Color coded: an entity and its assigned cluster share the same color. | |
In some cases we focus on a single entity and its relation to other clusters; all other entities are hidden from view. d_a: distance from the entity to cluster A's centroid; d_b: distance from the entity to cluster B's centroid; c: cluster confidence score of the entity. | |
Illustrated examples
The following diagrams are examples that help you visualize, at a high level, how confidence scores are determined.
Situation | Diagram |
---|---|
The entity is assigned to cluster A. If A is the only cluster in the entire embedding space, then the confidence score is always 1, regardless of the distance between the entity and A. | |
A and B are clusters that have the same spread, and their centroids are equally distant from the entity. Both clusters have the same influence on the entity, so the confidence score is 0.5. | |
Other clusters nearby exert their influence on the entity and dilute the confidence score. If there are three clusters of identical spread, and the entity is equally distant from all three, then the confidence score is approximately 0.33. | |
A and B are clusters that have the same spread, but the entity is closer to A than it is to B. A has a higher influence on the entity. Because the entity is also assigned to A, the confidence score is larger than 0.5. | |
A and B are clusters that have the same spread, and the entity is assigned to A, but it is closer to B than it is to A. A's influence on the entity is thus lowered, and the confidence score is lower than 0.5. | |
A has a larger spread than B, but their centroids are equally distant from the entity. A has a higher influence on the entity. Because the entity is also assigned to A, the confidence score is larger than 0.5. | |
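The situations in the table can be reproduced with a small toy model. This is not the calculation that the reconciliation pipeline performs. It assumes, purely for illustration, that a cluster's influence on an entity is spread / (spread + distance) and that the confidence score is the assigned cluster's share of the total influence. The names `Cluster`, `influence`, and `toy_confidence`, and all distances and spreads, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    distance: float  # distance from the entity to the cluster centroid
    spread: float    # size of the cluster; must be greater than 0


def influence(cluster):
    """Assumed influence: grows with spread, shrinks with distance."""
    return cluster.spread / (cluster.spread + cluster.distance)


def toy_confidence(assigned, others=()):
    """Assigned cluster's share of the total influence on the entity."""
    total = influence(assigned) + sum(influence(c) for c in others)
    return influence(assigned) / total


# One entry per table row: (assigned cluster, other clusters).
scenarios = {
    "A is the only cluster": (Cluster("A", 5.0, 1.0), ()),
    "A and B equal": (Cluster("A", 2.0, 1.0), (Cluster("B", 2.0, 1.0),)),
    "three equal clusters": (
        Cluster("A", 2.0, 1.0),
        (Cluster("B", 2.0, 1.0), Cluster("C", 2.0, 1.0)),
    ),
    "closer to A": (Cluster("A", 1.0, 1.0), (Cluster("B", 3.0, 1.0),)),
    "closer to B": (Cluster("A", 3.0, 1.0), (Cluster("B", 1.0, 1.0),)),
    "A has larger spread": (Cluster("A", 2.0, 2.0), (Cluster("B", 2.0, 1.0),)),
}

results = {name: toy_confidence(a, o) for name, (a, o) in scenarios.items()}
for name, score in results.items():
    print(f"{name}: {score:.2f}")
```

Any influence function that decreases with distance and increases with spread would show the same qualitative ordering; the particular formula here was chosen only because it is simple and never divides by zero for a positive spread.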