I have a reasonably large list of centroids that I want to cluster into groups of two by proximity (minimizing proximity).
I've explored k-means, which does cluster them by proximity, but the count of members in each group varies. With k-means you set a number of clusters, not a number of members in each cluster.
The nearest-neighbor problem solves this issue for two items from the set, but not against the entire data set.
K-nearest neighbors seems to break them into groups of N, but it appears to allow for points to be reused. In my scenario there can be no overlap.
Is there a particular algorithm, or suite of algorithms designed to address this? I'm pretty handy when I know what I'm working against, but I don't have a good sense of how to approach the problem.
To add more about the context and what we're trying to solve:
The points represent a number of sites throughout the USA. Each of these sites is a competitor (supply). Independently, we've aggregated demand (from census data, etc). We want to average the nearest pairs so that we can use the aggregated supply when calculating our supply/demand indexes for a given spatial extent (defined by the demand polygons).
We need to use at least two points so that individual data from a given site is obscured. This is a licensing/privacy requirement. We would otherwise analyze every point individually. We don't want to use more than two, because that further obscures the data. By using two, we adhere to licensing requirements, while minimizing the effect of averaging across a cluster.
أكثر...
I've explored k-means, which does cluster them by proximity, but the count of members in each group varies. With k-means you set a number of clusters, not a number of members in each cluster.
The nearest-neighbor problem solves this issue for two items from the set, but not against the entire data set.
K-nearest neighbors seems to break them into groups of N, but it appears to allow for points to be reused. In my scenario there can be no overlap.
Is there a particular algorithm, or suite of algorithms designed to address this? I'm pretty handy when I know what I'm working against, but I don't have a good sense of how to approach the problem.
To add more about the context and what we're trying to solve:
The points represent a number of sites throughout the USA. Each of these sites is a competitor (supply). Independently, we've aggregated demand (from census data, etc). We want to average the nearest pairs so that we can use the aggregated supply when calculating our supply/demand indexes for a given spatial extent (defined by the demand polygons).
We need to use at least two points so that individual data from a given site is obscured. This is a licensing/privacy requirement. We would otherwise analyze every point individually. We don't want to use more than two, because that further obscures the data. By using two, we adhere to licensing requirements, while minimizing the effect of averaging across a cluster.
أكثر...