
BCH performance #125

Closed · mhostetter opened this issue Jun 12, 2021 · 14 comments

Labels: discussion (Something to be discussed) · fec (Forward error correction) · performance (Affects speed/performance)

Comments

@mhostetter (Owner)

In PR #114, I added an initial implementation of BCH codes. This issue will review their performance as compared with other software packages and any desired API changes.

mhostetter added the performance (Affects speed/performance) label on Jun 12, 2021
@mhostetter (Owner, Author)

@varun19299 In response to your posts #112 (comment) and #112 (comment)...

I think in Matlab you can write bchenc(msg, n, k, 'end') to systematically encode the message with parity bits at the end. I'm using a similar call signature with Octave to generate the systematic test vectors.
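
For comparison, a minimal sketch of the same check against galois, assuming the parity-at-the-end systematic behavior described above:

import galois

# galois' BCH encode is systematic with the parity bits appended at the end,
# which should match Matlab/Octave's bchenc(msg, n, k, 'end') convention.
bch = galois.BCH(15, 7)
m = galois.GF2.Random(bch.k)
c = bch.encode(m)
assert (c[:bch.k] == m).all()  # the first k bits of the codeword are the message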

I'm glad to hear about the 20x speed-up over Matlab! Are you testing encoding or decoding or both?

Technically the code isn't vectorized (at least by my definition). None of the operations are running in parallel. Numba has a parallel compile option, but I've found it's slower than the default. The code is implemented with a big for loop, but it's inside the compiled code, so it should be as fast as a C for loop (not a slow Python for loop).

If you're saying encoding is O(1) for N < 2048 and linear for larger N, I think I may know why. But I'll wait for your response and we can discuss.

mhostetter added the discussion (Something to be discussed) label on Jun 12, 2021
@varun19299

varun19299 commented Jun 12, 2021

I think in Matlab you can write bchenc(msg, n, k, 'end') to systematically encode the message with parity bits at the end. I'm using a similar call signature with Octave to generate the systematic test vectors.

Alright, I'll check this out.

I'm glad to hear about the 20x speed-up over Matlab! Are you testing encoding or decoding or both?

With respect to decoding. Encoding is comparatively much faster (~100 µs up to a batch size of 2048 for BCH(15, 11)), then increases linearly.

Technically the code isn't vectorized (at least by my definition). None of the operations are running in parallel. Numba has a parallel compile option, but I've found it's slower than the default. The code is implemented with a big for loop, but it's inside the compiled code so it should be as fast as a C for loop (not a slow python for loop).

Is parallel slower across a wide range of array sizes (N, n)?
I think MATLAB uses a JIT compiler for its for loops, so I'm not sure what explains the performance difference.

If you're saying encoding is O(1) for N < 2048 and linear for larger N, I think I may know why. But I'll wait for your response and we can discuss.

Yep, this was my observation, but I may have to compare across more BCH [n,k,d] combinations if this isn't expected.

@mhostetter (Owner, Author)

@varun19299 One thing to keep in mind... At least in my implementation, the decoding time is error-dependent. For example, if the syndromes are all zero, then the decoder returns early. I don't know if Matlab's decoder has the same data-dependent runtime, so I would test against the same error pattern when possible.
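
A minimal sketch of that early-return pattern (hypothetical helper names, not the actual galois internals):

import numpy as np

def decode_one(codeword, compute_syndromes, correct_errors):
    # If every syndrome is zero, assume the codeword is error-free and skip
    # the expensive error-locator step entirely. Decode time is therefore a
    # function of the error pattern, not just the array sizes.
    syndromes = compute_syndromes(codeword)
    if not np.any(syndromes):
        return codeword  # early return
    return correct_errors(codeword, syndromes)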

If you're saying encoding is O(1) for N < 2048 and linear for larger N, I think I may know why. But I'll wait for your response and we can discuss.

Yep, this was my observation, but I may have to compare across more BCH [n,k,d] combinations if this isn't expected.

I believe the reason the time is constant for small N is that, for systematic codes, I'm only matrix multiplying with the parity portion of G and then calling np.hstack to append the parity to the message. I believe the call time of np.hstack dominates the matrix multiplication time for small N (hence the roughly constant time), but for large N the matrix multiplication dominates (hence the linear time).

(ignore the comment...)

galois/galois/code/bch.py

Lines 194 to 197 in 49f8f1d

elif self.systematic:
# Seems faster to just matrix multiply than use hstack
parity = message.view(GF2) @ self.G[:, self.k:]
return np.hstack((message, parity)).view(type(message))
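
A rough way to test that hstack-vs-matmul hypothesis outside the library, using plain NumPy over uint8 with mod-2 reduction (sizes mirror BCH(31, 21); this only illustrates the overhead shape, not galois' actual code path):

import timeit
import numpy as np

k, r = 21, 10  # message bits and parity bits, as in BCH(31, 21)
P = np.random.randint(0, 2, (k, r), dtype=np.uint8)  # parity portion of G

for N in [100, 1_000, 10_000, 100_000]:
    m = np.random.randint(0, 2, (N, k), dtype=np.uint8)
    parity = (m @ P) % 2
    t_mm = timeit.timeit(lambda: (m @ P) % 2, number=100) / 100
    t_hs = timeit.timeit(lambda: np.hstack((m, parity)), number=100) / 100
    print(f"N={N:>7}: matmul {t_mm * 1e6:8.1f} us, hstack {t_hs * 1e6:8.1f} us")

If the hypothesis holds, the hstack column should stay nearly flat for small N while the matmul column grows linearly.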

Regarding parallel mode in numba, unfortunately I can't get parallel=True to perform better than the default implementation. Here's a comparison of the current decoder and a "parallel" version. The parallel version replaces the for loop in decode_calculate() with numba.prange and JIT compiles it with parallel=True.
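
For reference, the swap looks roughly like this (stand-in loop body; the real per-codeword work is the BCH decode in decode_calculate()):

import numba
import numpy as np

@numba.njit
def decode_serial(codewords, out):
    for i in range(codewords.shape[0]):  # serial, but compiled: C-like speed
        out[i] = codewords[i].sum()      # stand-in for the per-codeword decode

@numba.njit(parallel=True)
def decode_parallel(codewords, out):
    for i in numba.prange(codewords.shape[0]):  # threads across codewords
        out[i] = codewords[i].sum()

Both compile to machine code in nopython mode; parallel=True adds thread scheduling overhead, which can outweigh the gains when the per-iteration work is cheap.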

Current decoder:

In [1]: import galois                                                                                                                   

In [2]: bch = galois.BCH(31, 21)                                                                                                        

In [3]: def speed_test(N): 
   ...:     m = galois.GF2.Random((N, bch.k)) 
   ...:     c = bch.encode(m) 
   ...:     c[:,0] ^= 1 
   ...:     c[:,5] ^= 1 
   ...:     %timeit bch.decode(c) 
   ...:                                                                                                                                 

In [4]: speed_test(100)                                                                                                                 
785 µs ± 3.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: speed_test(1_000)                                                                                                               
4.26 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: speed_test(10_000)                                                                                                              
37 ms ± 95.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: speed_test(100_000)                                                                                                             
380 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: speed_test(1_000_000)                                                                                                           
4.04 s ± 68.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Parallel decoder:

In [1]: import galois                                                                                                                   

In [2]: bch = galois.BCH(31, 21)                                                                                                        

In [3]: def speed_test(N): 
   ...:     m = galois.GF2.Random((N, bch.k)) 
   ...:     c = bch.encode(m) 
   ...:     c[:,0] ^= 1 
   ...:     c[:,5] ^= 1 
   ...:     %timeit bch.decode(c) 
   ...:                                                                                                                                 

In [4]: speed_test(100)                                                                                                                 
1.22 ms ± 290 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: speed_test(1_000)                                                                                                               
6.23 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: speed_test(10_000)                                                                                                              
42.3 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: speed_test(100_000)                                                                                                             
395 ms ± 35.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: speed_test(1_000_000)                                                                                                           
4.15 s ± 289 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@varun19299

@mhostetter

galois.BCH(31,11).t gives 4. Shouldn't it be 5?

Reference: BCH table

@varun19299

varun19299 commented Jun 18, 2021

The (63, 10) and (127, 15) codes don't seem to be correct either.

@mhostetter (Owner, Author)

Thanks for the report! And for your testing -- it's been helpful.

You're absolutely right, that's a bug. For some codes, (n, k, t) and (n, k, < t) return the same code size and generator polynomial. I avoided the < t cases in bch_valid_codes(), but missed them in bch_generator_poly(), which is used by galois.BCH. I just merged a bug fix and increased the unit test coverage to catch cases like these. Master should be working now.

In [1]: import galois                                                                                                                

In [2]: galois.BCH(31, 11).t                                                                                                         
Out[2]: 5

In [3]: galois.BCH(63, 10).t                                                                                                         
Out[3]: 13

In [4]: galois.BCH(127, 15).t                                                                                                        
Out[4]: 27

@varun19299

varun19299 commented Jun 18, 2021 via email

@mhostetter (Owner, Author)

mhostetter commented Jun 18, 2021

@varun19299 No worries. I appreciate the help / feedback.

I also discovered a bizarre thing when resolving this bug. Textbooks (both yours and mine) use the lexicographically smallest primitive polynomial as the default for the BCH field GF(2^m). However, for GF(2^7) they use the second-smallest, x^7 + x^3 + 1. And I can't figure out why.

Here are the first three primitive polynomials listed in lexicographically-increasing order.

In [5]: galois.primitive_polys(2, 6)[0:3]
Out[5]:
[Poly(x^6 + x + 1, GF(2)),
 Poly(x^6 + x^4 + x^3 + x + 1, GF(2)),
 Poly(x^6 + x^5 + 1, GF(2))]

In [6]: galois.primitive_polys(2, 7)[0:3]
Out[6]:
[Poly(x^7 + x + 1, GF(2)),
 Poly(x^7 + x^3 + 1, GF(2)),
 Poly(x^7 + x^3 + x^2 + x + 1, GF(2))]

In [7]: galois.primitive_polys(2, 8)[0:3]
Out[7]:
[Poly(x^8 + x^4 + x^3 + x^2 + 1, GF(2)),
 Poly(x^8 + x^5 + x^3 + x + 1, GF(2)),
 Poly(x^8 + x^5 + x^3 + x^2 + 1, GF(2))]

And here is the default primitive polynomial (run in Octave), which matches the textbooks. I imagine Matlab is the same. Note, an (n, k, t=1) code's generator polynomial equals the primitive polynomial of the GF(2^m) field.

octave:69> bchpoly(63, 57)(end:-1:1)
ans =

   1   0   0   0   0   1   1

octave:70> bchpoly(127, 120)(end:-1:1)
ans =

   1   0   0   0   1   0   0   1

octave:71> bchpoly(255, 247)(end:-1:1)
ans =

   1   0   0   0   1   1   1   0   1

Probably easier to visualize like this:

octave:100> gf(1, 6)
ans =
GF(2^6) array. Primitive Polynomial = D^6+D+1 (decimal 67)

Array elements = 

   1

octave:101> gf(1, 7)
ans =
GF(2^7) array. Primitive Polynomial = D^7+D^3+1 (decimal 137)

Array elements = 

   1

octave:102> gf(1, 8)
ans =
GF(2^8) array. Primitive Polynomial = D^8+D^4+D^3+D^2+1 (decimal 285)

Array elements = 

   1

I can't seem to make sense of the discrepancy with GF(2^7), but either way I added the correct default for GF(2^7) in the pull request as well.

@mhostetter (Owner, Author)

@varun19299 can I ask for a favor? After discovering the issue in the previous post, I added matlab_primitive_poly() (docs here) so that the default BCH and Reed-Solomon codes match those from Matlab (and most textbooks). I'm trying to add some unit tests, but don't currently have access to Matlab. Could you run the following .m file and copy-and-paste the output? I'd appreciate it. (A sketch of how the values would be checked follows the listing.)

gfprimdf(1, 2)
gfprimdf(2, 2)
gfprimdf(3, 2)
gfprimdf(4, 2)
gfprimdf(5, 2)
gfprimdf(6, 2)
gfprimdf(7, 2)
gfprimdf(8, 2)
gfprimdf(9, 2)
gfprimdf(10, 2)
gfprimdf(11, 2)
gfprimdf(12, 2)
gfprimdf(13, 2)
gfprimdf(14, 2)
gfprimdf(15, 2)
gfprimdf(16, 2)

gfprimdf(1, 3)
gfprimdf(2, 3)
gfprimdf(3, 3)
gfprimdf(4, 3)
gfprimdf(5, 3)
gfprimdf(6, 3)
gfprimdf(7, 3)
gfprimdf(8, 3)

gfprimdf(1, 5)
gfprimdf(2, 5)
gfprimdf(3, 5)
gfprimdf(4, 5)
gfprimdf(5, 5)
gfprimdf(6, 5)
gfprimdf(7, 5)
gfprimdf(8, 5)

gfprimdf(1, 7)
gfprimdf(2, 7)
gfprimdf(3, 7)
gfprimdf(4, 7)
gfprimdf(5, 7)
gfprimdf(6, 7)
gfprimdf(7, 7)
gfprimdf(8, 7)
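
For context, the unit tests would compare this output against the new helper, roughly like below. The expected values come from the Octave output in my earlier comment, not from a verified Matlab run:

import galois

# Matlab's defaults per this thread: decimal 67, 137, 285 for m = 6, 7, 8.
print(galois.matlab_primitive_poly(2, 6))  # expected: x^6 + x + 1
print(galois.matlab_primitive_poly(2, 7))  # expected: x^7 + x^3 + 1 (not x^7 + x + 1)
print(galois.matlab_primitive_poly(2, 8))  # expected: x^8 + x^4 + x^3 + x^2 + 1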

@varun19299

Here you go: github gist.

@mhostetter (Owner, Author)

Perfect! Thanks a bunch.

@varun19299

varun19299 commented Jun 23, 2021

I've cleaned up the output; it's easier to read now. GF(7^8) takes a bit of time to generate primitive polynomials. I'll add that in a little while.

Update: added.

@varun19299

varun19299 commented Jun 23, 2021

And yes, like you observed, x^7 + x^3 + 1 is used instead of x^7 + x + 1 (I'm not sure why).

x^7 + x + 1 is a Conway polynomial (C_{2,7}) and seems like a more natural choice.

Update: even MATLAB's primpoly returns x^7 + x^3 + 1 as its first output.
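
A quick way to see both side by side (assuming galois' conway_poly() and matlab_primitive_poly() behave as documented):

import galois

print(galois.conway_poly(2, 7))            # x^7 + x + 1, the Conway polynomial C_{2,7}
print(galois.matlab_primitive_poly(2, 7))  # x^7 + x^3 + 1, Matlab's default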

mhostetter added the fec (Forward error correction) label on Dec 12, 2021
@mhostetter (Owner, Author)

I'm closing this due to inactivity. I believe we covered what we wanted to here.

If you have additional thoughts, desired features, or issues with BCH encoding / decoding, please open another GitHub issue. 👍
