
BCH performance #125

Closed · mhostetter opened this issue Jun 12, 2021 · 14 comments

Labels: discussion (Something to be discussed) · fec (Forward error correction) · performance (Affects speed/performance)

Comments

@mhostetter (Owner)

In PR #114, I added an initial implementation of BCH codes. This issue will review their performance as compared with other software packages and any desired API changes.

mhostetter added the performance (Affects speed/performance) label on Jun 12, 2021
@mhostetter (Owner, Author)

@varun19299 In response to your posts #112 (comment) and #112 (comment)...

I think in Matlab you can write bchenc(msg, n, k, 'end') to systematically encode the message with parity bits at the end. I'm using a similar call signature with Octave to generate the systematic test vectors.
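
For comparison, a minimal sketch of the same check against galois, assuming the parity-at-the-end systematic behavior described above:

import galois

# galois' BCH encode is systematic with the parity bits appended at the end,
# which should match Matlab/Octave's bchenc(msg, n, k, 'end') convention.
bch = galois.BCH(15, 7)
m = galois.GF2.Random(bch.k)
c = bch.encode(m)
assert (c[:bch.k] == m).all()  # the first k bits of the codeword are the message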

I'm glad to hear about the 20x speed-up over Matlab! Are you testing encoding or decoding or both?

Technically the code isn't vectorized (at least by my definition). None of the operations are running in parallel. Numba has a parallel compile option, but I've found it's slower than the default. The code is implemented with a big for loop, but it's inside the compiled code, so it should be as fast as a C for loop (not a slow Python for loop).

If you're saying encoding is O(1) for N < 2048 and linear for larger N, I think I may know why. But I'll wait for your response and we can discuss.

mhostetter added the discussion (Something to be discussed) label on Jun 12, 2021
@varun19299

varun19299 commented Jun 12, 2021

I think in Matlab you can write bchenc(msg, n, k, 'end') to systematically encode the message with parity bits at the end. I'm using a similar call signature with Octave to generate the systematic test vectors.

Alright, I'll check this out.

I'm glad to hear about the 20x speed-up over Matlab! Are you testing encoding or decoding or both?

With respect to decoding. Encoding is comparatively much faster (~100 µs up to a batch size of 2048 for BCH(15, 11)), then increases linearly.

Technically the code isn't vectorized (at least by my definition). None of the operations are running in parallel. Numba has a parallel compile option, but I've found it's slower than the default. The code is implemented with a big for loop, but it's inside the compiled code so it should be as fast as a C for loop (not a slow python for loop).

Is parallel slower across a wide range of array sizes (N, n)?
I think MATLAB uses a JIT compiler for its for loops, so I'm not sure what explains the performance difference.

If you're saying encoding is O(1) for N < 2048 and linear for larger N, I think I may know why. But I'll wait for your response and we can discuss.

Yep, this was my observation, but I may have to compare across more BCH [n,k,d] combinations if this isn't expected.

@mhostetter (Owner, Author)

@varun19299 One thing to keep in mind... At least in my implementation, the decoding time is error-dependent. For example, if the syndromes are all zero, then the decoder returns early. I don't know if Matlab's decoder has the same data-dependent runtime, so I would test against the same error pattern when possible.
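
A minimal sketch of that early-return pattern (hypothetical helper names, not the actual galois internals):

import numpy as np

def decode_one(codeword, compute_syndromes, correct_errors):
    # If every syndrome is zero, assume the codeword is error-free and skip
    # the expensive error-locator step entirely. Decode time is therefore a
    # function of the error pattern, not just the array sizes.
    syndromes = compute_syndromes(codeword)
    if not np.any(syndromes):
        return codeword  # early return
    return correct_errors(codeword, syndromes)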

If you're saying encoding is O(1) for N < 2048 and linear for larger N, I think I may know why. But I'll wait for your response and we can discuss.

Yep, this was my observation, but I may have to compare across more BCH [n,k,d] combinations if this isn't expected.

I believe the reason the time is constant for small N is that, for systematic codes, I'm only matrix multiplying with the parity portion of G and then calling np.hstack to append the parity to the message. I believe the call time of np.hstack dominates the matrix multiplication time for small N (hence the roughly constant time), but for large N the matrix multiplication dominates (hence the linear time).

(ignore the comment...)

galois/galois/code/bch.py

Lines 194 to 197 in 49f8f1d

elif self.systematic:
# Seems faster to just matrix multiply than use hstack
parity = message.view(GF2) @ self.G[:, self.k:]
return np.hstack((message, parity)).view(type(message))
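
A rough way to test that hstack-vs-matmul hypothesis outside the library, using plain NumPy over uint8 with mod-2 reduction (sizes mirror BCH(31, 21); this only illustrates the overhead shape, not galois' actual code path):

import timeit
import numpy as np

k, r = 21, 10  # message bits and parity bits, as in BCH(31, 21)
P = np.random.randint(0, 2, (k, r), dtype=np.uint8)  # parity portion of G

for N in [100, 1_000, 10_000, 100_000]:
    m = np.random.randint(0, 2, (N, k), dtype=np.uint8)
    parity = (m @ P) % 2
    t_mm = timeit.timeit(lambda: (m @ P) % 2, number=100) / 100
    t_hs = timeit.timeit(lambda: np.hstack((m, parity)), number=100) / 100
    print(f"N={N:>7}: matmul {t_mm * 1e6:8.1f} us, hstack {t_hs * 1e6:8.1f} us")

If the hypothesis holds, the hstack column should stay nearly flat for small N while the matmul column grows linearly.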

Regarding parallel mode in numba, unfortunately I can't get parallel=True to perform better than the default implementation. Here's a comparison of the current decoder and a "parallel" version. The parallel version replaces the for loop in decode_calculate() with numba.prange and JIT compiles it with parallel=True.
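
For reference, the swap looks roughly like this (stand-in loop body; the real per-codeword work is the BCH decode in decode_calculate()):

import numba
import numpy as np

@numba.njit
def decode_serial(codewords, out):
    for i in range(codewords.shape[0]):  # serial, but compiled: C-like speed
        out[i] = codewords[i].sum()      # stand-in for the per-codeword decode

@numba.njit(parallel=True)
def decode_parallel(codewords, out):
    for i in numba.prange(codewords.shape[0]):  # threads across codewords
        out[i] = codewords[i].sum()

Both compile to machine code in nopython mode; parallel=True adds thread scheduling overhead, which can outweigh the gains when the per-iteration work is cheap.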

Current decoder:

In [1]: import galois                                                                                                                   

In [2]: bch = galois.BCH(31, 21)                                                                                                        

In [3]: def speed_test(N): 
   ...:     m = galois.GF2.Random((N, bch.k)) 
   ...:     c = bch.encode(m) 
   ...:     c[:,0] ^= 1 
   ...:     c[:,5] ^= 1 
   ...:     %timeit bch.decode(c) 
   ...:                                                                                                                                 

In [4]: speed_test(100)                                                                                                                 
785 µs ± 3.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: speed_test(1_000)                                                                                                               
4.26 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: speed_test(10_000)                                                                                                              
37 ms ± 95.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: speed_test(100_000)                                                                                                             
380 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: speed_test(1_000_000)                                                                                                           
4.04 s ± 68.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Parallel decoder:

In [1]: import galois                                                                                                                   

In [2]: bch = galois.BCH(31, 21)                                                                                                        

In [3]: def speed_test(N): 
   ...:     m = galois.GF2.Random((N, bch.k)) 
   ...:     c = bch.encode(m) 
   ...:     c[:,0] ^= 1 
   ...:     c[:,5] ^= 1 
   ...:     %timeit bch.decode(c) 
   ...:                                                                                                                                 

In [4]: speed_test(100)                                                                                                                 
1.22 ms ± 290 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: speed_test(1_000)                                                                                                               
6.23 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: speed_test(10_000)                                                                                                              
42.3 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: speed_test(100_000)                                                                                                             
395 ms ± 35.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: speed_test(1_000_000)                                                                                                           
4.15 s ± 289 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@varun19299

@mhostetter

galois.BCH(31,11).t gives 4. Shouldn't it be 5?

Reference: BCH table

@varun19299

varun19299 commented Jun 18, 2021

The (63, 10) and (127, 15) codes don't seem to be correct either.

@mhostetter (Owner, Author)

Thanks for the report! And for your testing -- it's been helpful.

You're absolutely right, that's a bug. For some codes, (n, k, t) and (n, k, < t) return the same code size and generator polynomial. I avoided the < t cases in bch_valid_codes(), but missed them in bch_generator_poly(), which is used by galois.BCH. I just merged a bug fix and increased the unit test coverage to catch cases like these. Master should be working now.

In [1]: import galois                                                                                                                

In [2]: galois.BCH(31, 11).t                                                                                                         
Out[2]: 5

In [3]: galois.BCH(63, 10).t                                                                                                         
Out[3]: 13

In [4]: galois.BCH(127, 15).t                                                                                                        
Out[4]: 27

@varun19299

varun19299 commented Jun 18, 2021 via email

@mhostetter (Owner, Author)

mhostetter commented Jun 18, 2021

@varun19299 No worries. I appreciate the help / feedback.

I also discovered a bizarre thing when resolving this bug. Textbooks (both yours and mine) use the lexicographically smallest primitive polynomial as the default for the BCH field GF(2^m). However, for GF(2^7) they use the second-smallest, x^7 + x^3 + 1. And I can't figure out why.

Here are the first three primitive polynomials listed in lexicographically-increasing order.

In [5]: galois.primitive_polys(2, 6)[0:3]
Out[5]:
[Poly(x^6 + x + 1, GF(2)),
 Poly(x^6 + x^4 + x^3 + x + 1, GF(2)),
 Poly(x^6 + x^5 + 1, GF(2))]

In [6]: galois.primitive_polys(2, 7)[0:3]
Out[6]:
[Poly(x^7 + x + 1, GF(2)),
 Poly(x^7 + x^3 + 1, GF(2)),
 Poly(x^7 + x^3 + x^2 + x + 1, GF(2))]

In [7]: galois.primitive_polys(2, 8)[0:3]
Out[7]:
[Poly(x^8 + x^4 + x^3 + x^2 + 1, GF(2)),
 Poly(x^8 + x^5 + x^3 + x + 1, GF(2)),
 Poly(x^8 + x^5 + x^3 + x^2 + 1, GF(2))]

And here is the default primitive polynomial (run in Octave), which matches the textbooks. I imagine Matlab is the same. Note, an (n, k, t=1) code's generator polynomial equals the primitive polynomial of the GF(2^m) field.

octave:69> bchpoly(63, 57)(end:-1:1)
ans =

   1   0   0   0   0   1   1

octave:70> bchpoly(127, 120)(end:-1:1)
ans =

   1   0   0   0   1   0   0   1

octave:71> bchpoly(255, 247)(end:-1:1)
ans =

   1   0   0   0   1   1   1   0   1

Probably easier to visualize like this:

octave:100> gf(1, 6)
ans =
GF(2^6) array. Primitive Polynomial = D^6+D+1 (decimal 67)

Array elements = 

   1

octave:101> gf(1, 7)
ans =
GF(2^7) array. Primitive Polynomial = D^7+D^3+1 (decimal 137)

Array elements = 

   1

octave:102> gf(1, 8)
ans =
GF(2^8) array. Primitive Polynomial = D^8+D^4+D^3+D^2+1 (decimal 285)

Array elements = 

   1

I can't seem to make sense of the discrepancy with GF(2^7), but either way I added the correct default for GF(2^7) in the pull request as well.

@mhostetter (Owner, Author)

@varun19299 can I ask for a favor? After discovering the issue in the previous post, I added matlab_primitive_poly() (docs here) so that the default BCH and Reed-Solomon codes match those from Matlab (and most textbooks). I'm trying to add some unit tests, but don't currently have access to Matlab. Could you run the following .m file and copy-and-paste the output? I'd appreciate it. (A sketch of how the values would be checked follows the listing.)

gfprimdf(1, 2)
gfprimdf(2, 2)
gfprimdf(3, 2)
gfprimdf(4, 2)
gfprimdf(5, 2)
gfprimdf(6, 2)
gfprimdf(7, 2)
gfprimdf(8, 2)
gfprimdf(9, 2)
gfprimdf(10, 2)
gfprimdf(11, 2)
gfprimdf(12, 2)
gfprimdf(13, 2)
gfprimdf(14, 2)
gfprimdf(15, 2)
gfprimdf(16, 2)

gfprimdf(1, 3)
gfprimdf(2, 3)
gfprimdf(3, 3)
gfprimdf(4, 3)
gfprimdf(5, 3)
gfprimdf(6, 3)
gfprimdf(7, 3)
gfprimdf(8, 3)

gfprimdf(1, 5)
gfprimdf(2, 5)
gfprimdf(3, 5)
gfprimdf(4, 5)
gfprimdf(5, 5)
gfprimdf(6, 5)
gfprimdf(7, 5)
gfprimdf(8, 5)

gfprimdf(1, 7)
gfprimdf(2, 7)
gfprimdf(3, 7)
gfprimdf(4, 7)
gfprimdf(5, 7)
gfprimdf(6, 7)
gfprimdf(7, 7)
gfprimdf(8, 7)
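
For context, the unit tests would compare this output against the new helper, roughly like below. The expected values come from the Octave output in my earlier comment, not from a verified Matlab run:

import galois

# Matlab's defaults per this thread: decimal 67, 137, 285 for m = 6, 7, 8.
print(galois.matlab_primitive_poly(2, 6))  # expected: x^6 + x + 1
print(galois.matlab_primitive_poly(2, 7))  # expected: x^7 + x^3 + 1 (not x^7 + x + 1)
print(galois.matlab_primitive_poly(2, 8))  # expected: x^8 + x^4 + x^3 + x^2 + 1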

@varun19299

Here you go: github gist.

@mhostetter (Owner, Author)

Perfect! Thanks a bunch.

@varun19299

varun19299 commented Jun 23, 2021

I've cleaned up the output; it's easier to read now. GF(7^8) takes a bit of time to generate primitive polynomials. I'll add that in a little while.

Update: added.

@varun19299

varun19299 commented Jun 23, 2021

And yes, like you observed, x^7 + x^3 + 1 is used instead of x^7 + x + 1 (I'm not sure why).

x^7 + x + 1 is a Conway polynomial (C_{2,7}) and seems like a more natural choice.

Update: even MATLAB's primpoly returns x^7 + x^3 + 1 as its first output.
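
A quick way to see both side by side (assuming galois' conway_poly() and matlab_primitive_poly() behave as documented):

import galois

print(galois.conway_poly(2, 7))            # x^7 + x + 1, the Conway polynomial C_{2,7}
print(galois.matlab_primitive_poly(2, 7))  # x^7 + x^3 + 1, Matlab's default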

mhostetter added the fec (Forward error correction) label on Dec 12, 2021
@mhostetter (Owner, Author)

I'm closing this due to inactivity. I believe we covered what we wanted to here.

If you have additional thoughts, desired features, or issues with BCH encoding / decoding, please open another GitHub issue. 👍
