matlab - Find median value of the largest clump of similar values in an array in the most computationally efficient manner


Sorry for the long title, but it sums it up.

I am looking to find the median value of the largest clump of similar values in an array, in the most computationally efficient manner.

For example:

h = [99,100,101,102,103,180,181,182,5,250,17]

I would be looking for 101.

The array is not sorted; I just typed it in the above order for easier understanding. The array is of constant length, and you can assume there is always at least one clump of similar values.

What I have been doing so far is computing the standard deviation with one of the values removed, finding the value that corresponds to the largest reduction in the standard deviation, and repeating that for the number of elements in the array, which is terribly inefficient.

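    for j = 1:7
        g = double(h);
        for i = 1:7
            g(i) = nan;             % blank out one value
            t(i) = nanstd(g);       % std of the remaining values
        end
        best = find(t==min(t));     % value whose removal reduces the std most
        h(best) = nan;
    end
    x = find(h==max(h));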

Any thoughts?

One possibility is to bin the data and then look for the bin with the most elements. If your distribution consists of well-separated clusters, this should work reasonably well.

    h = [99,100,101,102,103,180,181,182,5,250,17];
    nbins = length(h);            % <-- set # of bins here
    [v, bins] = hist(h,nbins);
    [vm, im] = max(v);            % find max in histogram
    bl = bins(2)-bins(1);         % bin size
    bm = bins(im);                % position of bin with max #
    ifb = find(abs(h-bm)<bl/2)    % elements within bin
    median(h(ifb))                % median of the elements in bin

Output:

    ifb =
         1     2     3     4     5

    h(ifb) =
        99   100   101   102   103

    median =
       101

The more challenging parameters to set are the number of bins and the size of the region to look at around the most populated bin. In the example you provided neither of these is critical: you could set the number of bins to 3 (instead of length(h)) and it would still work. Using length(h) as the number of bins is in fact a little extreme and probably not a good general choice. A better choice is somewhere between that number and the expected number of clusters.

For certain distributions it may help to replace bl within the find expression with a value you judge to be better in advance.
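To get a feel for how sensitive the result is to the bin count, here is a rough sketch that repeats the histogram approach for a few values. The list of bin counts (`nbinsList`) is just an illustrative assumption, not a recommendation:

```matlab
% Sketch: repeat the histogram-based estimate for several bin counts.
h = [99,100,101,102,103,180,181,182,5,250,17];
nbinsList = [3, 4, 6, length(h)];        % illustrative bin counts (assumption)
med = zeros(size(nbinsList));
for k = 1:numel(nbinsList)
    [v, bins] = hist(h, nbinsList(k));   % counts and bin centers
    [~, im]   = max(v);                  % index of the most populated bin
    bl        = bins(2) - bins(1);       % bin width
    inBin     = abs(h - bins(im)) < bl/2;  % elements inside that bin
    med(k)    = median(h(inBin));
end
disp([nbinsList; med])                   % bin count (row 1) vs. estimate (row 2)
```

If the estimates agree over a range of bin counts, the choice is probably not critical for your data. Values that fall exactly on a bin edge can be left out by the strict `<` comparison, so it is worth checking a few bin counts rather than trusting a single one.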
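As a small sketch of that idea, the half bin width can be swapped for a fixed tolerance. The name `tol` and its value of 10 are assumptions chosen for this example only; a suitable value depends on how tight you expect the clump to be:

```matlab
% Sketch: use a hand-picked tolerance instead of bl/2 around the fullest bin.
h = [99,100,101,102,103,180,181,182,5,250,17];
[v, bins] = hist(h, length(h));   % counts and bin centers
[~, im]   = max(v);               % most populated bin
bm  = bins(im);                   % center of that bin
tol = 10;                         % hypothetical tolerance, set from prior knowledge
ifb = find(abs(h - bm) < tol);    % elements within tol of the bin center
med = median(h(ifb));
```

Note that the window is centered on the bin center, not on the clump itself, so a tolerance that is too small can cut off one side of the clump.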

I should also note that there are clustering methods (kmeans) that may work better, though perhaps less efficiently. For instance, this is the output of [h' kmeans(h',4)]:

        99     2
       100     2
       101     2
       102     2
       103     2
       180     3
       181     3
       182     3
         5     4
       250     3
        17     1

In this case I decided in advance to attempt grouping into 4 clusters. Using kmeans you can get an answer as follows:

    nbin = 4;
    km = kmeans(h',nbin);                % cluster label for each element
    [mv, iv] = max(histc(km,1:nbin));    % iv = label of the largest cluster
    h(km==iv)                            % members of the largest cluster
    median(h(km==iv))

Notice that kmeans does not return the same result every time it is run, so you might need to average over a few iterations.

I timed the two methods and found that kmeans takes ~10x longer. However, it is more robust, since the bin sizes adapt to your problem and do not need to be set beforehand (only the number of bins does).
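One way to reduce that run-to-run variation, instead of averaging by hand, is the 'Replicates' option of kmeans, which reruns the clustering from several random starts and keeps the solution with the lowest total within-cluster distance. This is a sketch of that alternative, not the averaging described above; the replicate count of 20 and the fixed seed are assumptions, and kmeans requires the Statistics and Machine Learning Toolbox:

```matlab
% Sketch: stabilise kmeans with several random restarts.
h = [99,100,101,102,103,180,181,182,5,250,17];
k = 4;                                  % number of clusters, chosen in advance
rng(0);                                 % fixed seed so the run is repeatable (assumption)
km = kmeans(h', k, 'Replicates', 20);   % keep the best of 20 random starts
counts = accumarray(km, 1, [k 1]);      % number of elements per cluster label
[~, biggest] = max(counts);             % label of the largest cluster
med = median(h(km == biggest));         % median of the largest cluster
```

More replicates cost proportionally more time, so this trades speed for repeatability.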
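If you want to repeat the comparison on your own data, a rough tic/toc sketch along these lines should do. The repetition count `nRuns` is an arbitrary assumption, and the ratio you measure will depend on the array size, the MATLAB version and the machine:

```matlab
% Sketch: average wall-clock time per call for both approaches.
h = [99,100,101,102,103,180,181,182,5,250,17];
nRuns = 200;                            % arbitrary number of repetitions

tic
for r = 1:nRuns
    [v, bins] = hist(h, length(h));     % histogram approach
    [~, im]   = max(v);
    bl        = bins(2) - bins(1);
    mHist     = median(h(abs(h - bins(im)) < bl/2));
end
tHist = toc / nRuns;

tic
for r = 1:nRuns
    km = kmeans(h', 4);                 % kmeans approach, 4 clusters
    [~, iv]  = max(accumarray(km, 1));  % label of the largest cluster
    mKmeans  = median(h(km == iv));
end
tKmeans = toc / nRuns;

fprintf('hist: %.3g s, kmeans: %.3g s, ratio: %.1f\n', ...
        tHist, tKmeans, tKmeans / tHist);
```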

