python - Longest equally-spaced subsequence -


i have million integers in sorted order , find longest subsequence difference between consecutive pairs equal. example

1, 4, 5, 7, 8, 12 

has subsequence

   4,       8, 12 

my naive method greedy , checks how far can extend subsequence each point. takes o(n²) time per point seems.

is there faster way solve problem?

update. test code given in answers possible (thank you). clear using n^2 memory not work. far there no code terminates input [random.randint(0,100000) r in xrange(200000)] .

timings. tested following input data on 32 bit system.

a= [random.randint(0,10000) r in xrange(20000)]  a.sort() 
  • the dynamic programming method of zellux uses 1.6g of ram , takes 2 minutes , 14 seconds. pypy takes 9 seconds! crashes memory error on large inputs.
  • the o(nd) time method of armin took 9 seconds pypy 20mb of ram. of course worse if range larger. low memory usage meant test a= [random.randint(0,100000) r in xrange(200000)] didn't finish in few minutes gave pypy.

in order able test method of kluev's reran

a= [random.randint(0,40000) r in xrange(28000)]  = list(set(a)) a.sort() 

to make list of length 20000. timings pypy

  • zellux, 9 seconds
  • kluev, 20 seconds
  • armin, 52 seconds

it seems if zellux method made linear space clear winner.

update: first algorithm described here obsoleted armin rigo's second answer, simpler , more efficient. both these methods have 1 disadvantage. need many hours find result 1 million integers. tried 2 more variants (see second half of answer) range of input integers assumed limited. such limitation allows faster algorithms. tried optimize armin rigo's code. see benchmarking results @ end.


here idea of algorithm using o(n) memory. time complexity o(n2 log n), may decreased o(n2).

algorithm uses following data structures:

  1. prev: array of indexes pointing previous element of (possibly incomplete) subsequence.
  2. hash: hashmap key = difference between consecutive pairs in subsequence , value = 2 other hashmaps. these other hashmaps: key = starting/ending index of subsequence, value = pair of (subsequence length, ending/starting index of subsequence).
  3. pq: priority queue possible "difference" values subsequences stored in prev , hash.

algorithm:

  1. initialize prev indexes i-1. update hash , pq register (incomplete) subsequences found on step , "differences".
  2. get (and remove) smallest "difference" pq. corresponding record hash , scan 1 of second-level hash maps. @ time subsequences given "difference" complete. if second-level hash map contains subsequence length better found far, update best result.
  3. in array prev: each element of sequence found on step #2, decrement index , update hash , possibly pq. while updating hash, perform 1 of following operations: add new subsequence of length 1, or grow existing subsequence 1, or merge 2 existing subsequences.
  4. remove hash map record found on step #2.
  5. continue step #2 while pq not empty.

this algorithm updates o(n) elements of prev o(n) times each. , each of these updates may require add new "difference" pq. means time complexity of o(n2 log n) if use simple heap implementation pq. decrease o(n2) might use more advanced priority queue implementations. of possibilities listed on page: priority queues.

see corresponding python code on ideone. code not allow duplicate elements in list. possible fix this, optimization anyway remove duplicates (and find longest subsequence beyond duplicates separately).

and the same code after little optimization. here search terminated subsequence length multiplied possible subsequence "difference" exceeds source list range.


armin rigo's code simple , pretty efficient. in cases computations may avoided. search may terminated subsequence length multiplied possible subsequence "difference" exceeds source list range:

def findless(a):   aset = set(a)   lmax = 2   d = 1   minstep = 0    while (lmax - 1) * minstep <= a[-1] - a[0]:     minstep = a[-1] - a[0] + 1     j, b in enumerate(a):       if j+d < len(a):         = a[j+d]         step = - b         minstep = min(minstep, step)         if + step in aset , b - step not in aset:           c = + step           count = 3           while c + step in aset:             c += step             count += 1           if count > lmax:             lmax = count     d += 1    return lmax  print(findless([1, 4, 5, 7, 8, 12])) 

if range of integers in source data (m) small, simple algorithm possible o(m2) time , o(m) space:

def findless(src):   r = [false in range(src[-1]+1)]   x in src:     r[x] = true    d = 1   best = 1    while best * d < len(r):     s in range(d):       l = 0        in range(s, len(r), d):         if r[i]:           l += 1           best = max(best, l)         else:           l = 0      d += 1    return best   print(findless([1, 4, 5, 7, 8, 12])) 

it similar first method armin rigo, doesn't use dynamic data structures. suppose source data has no duplicates. , (to keep code simple) suppose minimum input value non-negative , close zero.


previous algorithm may improved if instead of array of booleans use bitset data structure , bitwise operations process data in parallel. code shown below implements bitset built-in python integer. has same assumptions: no duplicates, minimum input value non-negative , close zero. time complexity o(m2 * log l) l length of optimal subsequence, space complexity o(m):

def findless(src):   r = 0   x in src:     r |= 1 << x    d = 1   best = 1    while best * d < src[-1] + 1:     c = best     rr = r      while c & (c-1):       cc = c & -c       rr &= rr >> (cc * d)       c &= c-1      while c != 1:       c = c >> 1       rr &= rr >> (c * d)      rr &= rr >> d      while rr:       rr &= rr >> d       best += 1      d += 1    return best 

benchmarks:

input data (about 100000 integers) generated way:

random.seed(42) s = sorted(list(set([random.randint(0,200000) r in xrange(140000)]))) 

and fastest algorithms used following data (about 1000000 integers):

s = sorted(list(set([random.randint(0,2000000) r in xrange(1400000)]))) 

all results show time in seconds:

size:                         100000   1000000 second answer armin rigo:     634         ? armin rigo, optimized:         64     >5000 o(m^2) algorithm:                 53      2940 o(m^2*l) algorithm:                7       711 

Comments

Popular posts from this blog

css - Which browser returns the correct result for getBoundingClientRect of an SVG element? -

gcc - Calling fftR4() in c from assembly -

Function that returns a formatted array in VBA -