matlab - Find median value of the largest clump of similar values in an array in the most computationally efficient manner
Sorry for the long title, but it sums it up.
I am looking to find the median value of the largest clump of similar values in an array in the most computationally efficient manner.
For example:
h = [99,100,101,102,103,180,181,182,5,250,17]
I am looking for 101.
The array is not sorted; I just typed it in the above order for easier understanding. The array is of constant length, and you can assume there is always at least one clump of similar values.
What I have been doing so far is computing the standard deviation with one of the values removed, finding the value that corresponds to the largest reduction in std, and repeating that for a number of elements in the array, which is terribly inefficient.
for j = 1:7
    g = double(h);
    for i = 1:7
        g(i) = nan;
        t(i) = nanstd(g);
    end
    best = find(t==min(t));
    h(best) = nan;
end
x = find(h==max(h));
Any thoughts?
One possibility is to bin the data and look for the bin with the most elements. If your distribution consists of well-separated clusters, this should work reasonably well.
h = [99,100,101,102,103,180,181,182,5,250,17];
nbins = length(h);           % <-- set # of bins here
[v bins] = hist(h,nbins);    % counts and bin centers
[vm im] = max(v);            % find max in histogram
bl = bins(2)-bins(1);        % bin size
bm = bins(im);               % position of the bin with the max count
ifb = find(abs(h-bm)<bl/2)   % elements within that bin
median(h(ifb))               % median over the elements in that bin
Output:
ifb = 1 2 3 4 5
h(ifb) = 99 100 101 102 103
median = 101
The more challenging parameters to set are the number of bins and the size of the region to look at around the most populated bin. In the example you provided neither of these is critical: you could set the number of bins to 3 (instead of length(h)) and it would still work. Using length(h) as the number of bins is in fact a little extreme and probably not a good general choice. A better choice is somewhere between that number and the expected number of clusters.
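To illustrate how insensitive this example is to the bin count, here is a minimal sketch that repeats the histogram approach for 3 bins and for length(h) bins. The loop and the `meds` array are only there for the comparison and are not part of the original snippet. For this particular data both settings print a median of 101.

```matlab
h = [99,100,101,102,103,180,181,182,5,250,17];
binCounts = [3, length(h)];              % the two bin counts discussed above
meds = zeros(size(binCounts));           % median found for each bin count
for k = 1:numel(binCounts)
    [v, bins] = hist(h, binCounts(k));   % counts and bin centers
    [~, im] = max(v);                    % index of the most populated bin
    bl = bins(2) - bins(1);              % bin width
    ifb = find(abs(h - bins(im)) < bl/2);% elements inside that bin
    meds(k) = median(h(ifb));
    fprintf('nbins = %2d -> median = %g\n', binCounts(k), meds(k));
end
```

With other data the two settings can disagree, which is why the bin count is worth tuning.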
For certain distributions it may help to replace bl within the find expression with a value that you judge to be better in advance.
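As a rough sketch of that idea, the window around the most populated bin can be decoupled from the bin width. The name `halfwidth` and its value of 10 are assumptions picked for this example data, not values from the original code.

```matlab
h = [99,100,101,102,103,180,181,182,5,250,17];
[v, bins] = hist(h, length(h));      % counts and bin centers
[~, im] = max(v);                    % most populated bin
bm = bins(im);                       % center of that bin
halfwidth = 10;                      % assumed window half-width, chosen by hand
ifb = find(abs(h - bm) < halfwidth); % elements within the chosen window
med = median(h(ifb))                 % median over the selected elements
```

A window that is too narrow drops members of the clump, and one that is too wide pulls in neighbouring clumps, so the value should reflect how tight you expect the clusters to be.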
I should also note that there are clustering methods (kmeans) that may work better, but perhaps less efficiently. For instance, this is the output of [h' kmeans(h',4)]:
 99  2
100  2
101  2
102  2
103  2
180  3
181  3
182  3
  5  4
250  3
 17  1
In this case I decided in advance to attempt grouping into 4 clusters. Using kmeans you can get an answer as follows:
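nbin = 4;
km = kmeans(h',nbin);                 % cluster label for each element
[mv iv] = max(histc(km,[1:nbin]));    % iv = label of the most populated cluster
h(km==iv)                             % members of that cluster
median(h(km==iv))                     % median of that cluster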
Notice that kmeans does not return the same result every time you run it, so you might need to average over a few iterations.
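One way to do that is sketched below: run kmeans several times and combine the per-run medians. The repeat count `nrep` and the choice to take the median of the per-run results are assumptions, and kmeans requires the Statistics Toolbox.

```matlab
h = [99,100,101,102,103,180,181,182,5,250,17];
nbin = 4;                              % number of clusters, set in advance
nrep = 10;                             % assumed number of repeated runs
meds = zeros(nrep, 1);
for r = 1:nrep
    km = kmeans(h', nbin);             % random initialization on each call
    [~, iv] = max(histc(km, 1:nbin));  % label of the most populated cluster
    meds(r) = median(h(km == iv));     % median of that cluster for this run
end
result = median(meds)                  % combine the runs robustly
```

kmeans also accepts a 'Replicates' option that reruns the clustering internally and keeps the best solution, which may be a simpler alternative to looping by hand.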
I timed the two methods and found that kmeans takes ~10x longer. However, it is more robust, since the bin sizes adapt to your problem and do not need to be set beforehand (only the number of bins does).
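A comparison along these lines can be sketched with tic/toc. The run count `nrun` is an assumption, and the measured ratio will depend on the machine, the MATLAB version, and the data, so treat the numbers as indicative only.

```matlab
h = [99,100,101,102,103,180,181,182,5,250,17];
nrun = 200;                            % assumed number of timing repetitions

tic
for r = 1:nrun
    [v, bins] = hist(h, length(h));    % histogram approach
    [~, im] = max(v);
    ifb = abs(h - bins(im)) < (bins(2) - bins(1))/2;
    m1 = median(h(ifb));
end
tHist = toc / nrun;                    % mean time per call, in seconds

tic
for r = 1:nrun
    km = kmeans(h', 4);                % kmeans approach (Statistics Toolbox)
    [~, iv] = max(histc(km, 1:4));
    m2 = median(h(km == iv));
end
tKm = toc / nrun;                      % mean time per call, in seconds

fprintf('hist: %.3g s, kmeans: %.3g s per call\n', tHist, tKm);
```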