Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring suggestions #36

Closed
8 of 13 tasks
rouson opened this issue Mar 30, 2022 · 18 comments
Closed
8 of 13 tasks

Refactoring suggestions #36

rouson opened this issue Mar 30, 2022 · 18 comments

Comments

@rouson
Copy link
Collaborator

rouson commented Mar 30, 2022

@milancurcic would you be open to a pull request with some refactoring that is minor but global? If so, I would submit one or more pull requests with the changes described below. In all honesty, one of the main reasons I do steps like these is because it walks me through the project in a way that keeps my brain actively involved in consuming and understanding the code, but there are potential inherent benefits to the project in the parenthetical descriptions below. If you like some ideas but not others, you could check the ones you like. Otherwise, I can check them off as the pull requests get reviewed and merged.

Increase modularity

  • Separate procedure definitions into submodules, keeping interface bodies in modules (which more succinctly exposes the API and reduces compilation cascades)

Adopt more functional style

  • Use associate wherever possible to ensure immutability (which reduces chances of mistaken data modifications)
  • Use constructor functions that return fully-defined, whole objects (which reduces chances of mistakenly using undefined components and eliminates the need for setters) (done as of v0.3.0)
  • Look for opportunities to employ the functional-fortran library (as of Read Conv2D, MaxPooling2D, and Flatten layers from Keras #85 it's now a dependency; using reverse in one or two places)

Leverage Vegetables in the Tests

  • Make passing tests more quiet (less output)
  • Make test output read like a specification

Note: Not using vegetables but essentially resolved as of v0.3.0.

Generate & Deploy Documentation

  • Add a FORD project file.
  • Add PlantUML diagrams
  • Add a CI script that deploys the documentation to GitHub Pages

Continuous Integration Testing

Nomenclature

This is a style that Brad (@everythingfunctional) and I adopted in recent code:

  • Module name suffix _m
  • Type name suffix _t
  • Submodule name suffix _submodule

Note: Using _submodule for submodules as of v0.2.0.

@milancurcic
Copy link
Member

Thanks a lot for suggesting these. Overall, yes, all of these are very welcome, with a few caveats that I comment on below:

Regarding modularity and functional style, OK but only if not at the expense of readability. For such a small API where most procedures are < 20 LOC, I personally don't see meaningful gains here, but let's look at specific suggestions for changes. Submodules are fine, but again, personally, for such small modules, I prefer to see the whole thing in one file. For the time being, can you open a separate issue for associate suggestions where we can discuss it?

A thing to keep in mind regarding the functional style is that I'll be soon introducing a module of activation subroutines (there already are activation functions) which will modify the activation in place (and be called aptly, activate, the verb). This comes from a suggestion from @peterukk who obtained non-negligible performance improvements when using in-place activations with subroutines (some discussion on this in #18).

Another (simple) refactor that is coming up (and has been done in FKB) and that must come before CNNs is to implement forward and backward propagation on the layer type rather than the network type. I expect that a base layer type will become an abstract type, with concrete types implementing forward and backward propagation subroutines.

For tests, I prefer test-drive--simpler to me and a Fortran-lang project. CI, UML, and FORD files will all be very welcome.

If opening PR, please keep one PR per feature (e.g. CI, UML, FORD, associate, submodules should each be separate PRs).

@rouson
Copy link
Collaborator Author

rouson commented Mar 31, 2022

@milancurcic thanks for getting back to me so quickly with the feedback. I'm not familiar with test-drive. I look forward to checking it out. FYI, I just added a "Nomenclature" section at the end of the checklist in the original comment in this issue.

@milancurcic
Copy link
Member

I just added a "Nomenclature" section at the end of the checklist in the original comment in this issue.

While I appreciate that many people like this convention, I don't, sorry. It's possible that we will need a suffix in source filenames if we start using submodules, so let's discuss that then.

Another heads-up on nomenclature that I forgot to mention earlier: Currently module names and source files are prefixed with mod_, which I like, however, it's another common convention that could create a name clash in the future since neural-fortran is and will be mostly used in existing Fortran applications. So mod_ prefix will be replaced with something project-specific, perhaps neural_ or nf_ (my current favorite). Ideas welcome.

@rouson
Copy link
Collaborator Author

rouson commented Apr 1, 2022

@milancurcic regarding the separation of procedure interfaces from procedure definitions, I think it's important to keep two categories of reader in mind: developers and users. Many users (especially a naive user) just want to know how to invoke a procedure (as described in the interface body) and what it does (as might be described in a comment). From a user perspective, there's likely little benefit from bundling the procedure definition with the procedure interface body and there are the downsides of information overload, increased scrolling, and possible confusion (e.g., if the developer wants me to see the innards, was that because the procedure does something unexpected like modifying non-local variables?).

Also, "user" in this context is anyone who needs to invoke the procedure, including a contributing developer. Former Fortran committee member and Fortran book author Richard Maine makes the point that separating interfaces from definitions has subtle conceptual benefits beyond the practical benefits often discussed (compilation cascades). I've found this separation very helpful in working with clients on modernizing code and trying to disentangle some of their mental mapping of how different parts of a project relate to each other when they've gotten used to doing things a certain way like using global data. In such cases, people make a big conceptual leap when they realize they don't need to read through every line of a procedure to competently use the procedure.

Lastly, it's always possible to repeat the interface information in the procedure definition if one wants to do so. I prefer to avoid the duplication and the potential inconsistencies, but at least it's still nice to clearly exposes the minimal information that a user needs without burdening them with unnecessary implementation details.

@milancurcic
Copy link
Member

OK, that's a fair point and I trust you, even if I have some initial resistance. I opened #40 to track it.

@rouson
Copy link
Collaborator Author

rouson commented Apr 2, 2022

@milancurcic FWIW, I just refactored a module on another project and remembered another benefit of submodules: tighter scoping. Non-public entities can be moved to the submodule. When there are constants, module variables, and procedures that are only used internally in executable statements, moving such entities to submodules means less that a user has to wade through. I haven't looked yet at whether this is the case for any entities in neural-fortran. We'll see.

@rouson
Copy link
Collaborator Author

rouson commented Apr 2, 2022

@milancurcic I keep remembering additional subtleties to the module/submodule pairing (which is why I need to write a new book to gather all this stuff in one place). The separation affords a really nice way around Fortran's prohibition against circular module dependencies. In situations where it would be useful for module A to use module B and module B to use module A, a prohibited circular relationship, submodules often come to the rescue! When the executable statements of a procedure in module A reference a procedure in module B and vice versa, what the procedures are really referencing are each other's interfaces so the dependency graph forms an X instead of a circle:

        module a_m        module b_m
                     \ /
                      \
                     / \
submodule(a_m) a_s        submodule(b) b_s

where the above depiction looks best when rendered in a browser.

This doesn't necessarily come up often, but it shows one of the more subtle benefits of exposing the distinction between two very different concepts that otherwise get conflated. When the procedure interface and the procedure definition are separate, many otherwise circular dependencies vanish. I think that's a really nice example of the conceptual benefits.

@rouson
Copy link
Collaborator Author

rouson commented Apr 2, 2022

My book uses what it calls the Surrogate pattern to solve the above problem. I recently heard MIT computer science Prof. Saman Amarasinghe say "Design patterns are all the things you forgot to put in the language." In our book, we had to invent a pattern in Fortran 2003 as a workaround for a problem that's more naturally solved by a built-in language capability in Fortran 2008.

@rouson
Copy link
Collaborator Author

rouson commented Apr 4, 2022

Of course, just 2 days after I wrote "This doesn't necessarily come up often,..." regarding circular module dependencies, I now see a recent commit message indicating that Brad (@everythingfunctional) had to do some work to break a circular dependency in software that we're developing for a clients at a private company. I haven't dug through the commit to see if submodules could have helped in this case, but seeing this so soon after my statement makes me wonder just how often the issue arises. It's possible that the only reason I was thinking it doesn't come up often is because we've all gotten used to having to design around a limitation in the language that has been alleviated by submodules, which most people don't use yet.

@rouson
Copy link
Collaborator Author

rouson commented Apr 17, 2022

@milancurcic I'd like to work on the test items from the above checklist next. The main thing I'd like to do is make passing tests more quiet. This is a philosophy that I picked up from Tom Clune and it's also reflected in how ctest behaves. Most of the output that the test suite currently produces is not useful to a user who is unfamiliar with the values being printed. I think users mostly just want to know whether everything works, and if not, then which tests failed. Any output beyond that is likely to be confusing and ignored unless a user wants to dive into the details of the test and the implementation.

I'll take a look at test-drive. On first look, it appears pretty similar to Vegetables so I'm hoping I can come up to speed on it quickly. If not, I could contribute Vegetables tests and either you could help me convert them to test-drive tests or I could figure out how to keep the Vegetables on my fork so they don't get pushed to the modern-fortran repository.

Once the testing has been updated a bit, I'll work on setting up CI testing via GitHub Actions.

@rouson
Copy link
Collaborator Author

rouson commented Apr 17, 2022

I just realized that I wrote "... keep the Vegetables on my fork..." without thinking about the pun. I'll have use that one again. In the mean time, I'll go for a test-drive! 😆

@rouson
Copy link
Collaborator Author

rouson commented Apr 17, 2022

On first pass through the test-drive repository, I recommend Garden, the parallel successor to Vegetables. The differences I see so far are

  1. Garden uses a functional style, whereas test-drive uses a procedural style,
  2. Garden explicitly supports parallelism, whereas I see no mention of parallelism in test-drive,
  3. Garden has a companion tool that generates the main driver automatically, whereas it appears I have to write, and more importantly, update and maintain a test-drive driver manually,
  4. Garden provides ~250 assertion procedures, whereas test-drive has 9 check procedures that appear to be similar to assertions, and
  5. Garden has extensive FORD documentation, whereas unless I missed something, I don't see any documentation in test-drive beyond the README.md and one long comment that appears to be formatted for automatic parsing, but I don't know what tool to use to generate the documentation and that style of one-long-comment documentation often doesn't get updated when the adjacent code gets updated.

I get the sense that the procedural style of test-drive is considered a feature and I can understand how that style is more immediately intuitive to most Fortran programmers, but it imposes an unnecessary burden on the newcomer. With a functional style, it's immediately obvious what the inputs are (the arguments) and what the output is (the result). Fortunately, test-drive is simple enough that this information can be at least partially inferred pretty quickly from context in its current form, but that still leaves some uncertainty about argument intent and it has the side effects relying on unsafe practices such as the use of allocated status to convey important information. With a functional style, one rarely works with unallocated entities, which protects against accidental use of unallocated entities.

To understand things like argument intent, I have to dive into the test-drive source code, but then I run into the common issue with most Fortran that the interface information is not separated out so I have to wade through all of the implementation details. When tried skipping straight to a given procedure, I bumped into C preprocessor macros that I had to search back up to the top of the file to understand.

But item 2 is what seems to me like a deal breaker. I imagine that it's possible that item 3 mitigates the issue if the intention is that the end user handles the parallelism in the main driver, but that leaves a lot of logic to the user to decide, e.g., will every image run every test, and if so, how does one test the parallel execution of the collective subroutines in neural-fortran?

@milancurcic
Copy link
Member

Thanks for the detailed summary. I mostly see the same picture, but my takeaways are different.

  1. Yes, and IMO test-drive tests are simpler to me. Can you point me to an example of a simple Vegetable/Garden test? I may have not searched long enough for some test examples, but what I've seen so far looks quite complicated.
  2. Test-drive runs in parallel using OpenMP, see https://github.com/fortran-lang/test-drive/blob/a1ab7854175b9d90f0cac7a504db404f2ed98160/src/testdrive.F90#L337-L345. I'm happy this was the only deal-breaker and I hope your concern is addressed. I agree this is an important point.
  3. That's fine, but IMO test-drive is simple enough to not warrant a separate tool to generate a driver.
  4. Another thing I like better about test-drive. :)
  5. From what I see, test-drive also uses FORD docstrings. There's no FORD file though. I'm sure it could be contributed via a PR easily.

No doubt garden packs a lot of bells and whistles that are very useful for very complex projects. What appeals to me about test-drive is that it's a small and simple tool with a few procedures. And not the least, again, it's a Fortran-lang project, which I'm personally invested in.

@rouson
Copy link
Collaborator Author

rouson commented Apr 18, 2022

@milancurcic my most significant points are in the text below the itemized list. The simplicity of test-drive is partly because it has a lot less capability and it burdens the user with reading procedure implementations just to understand calling semantics that would be completely obvious in a functional programming style. By contrast, I've used Garden's predecessor, Vegetables, on O(10) projects and I can't recall a single time that I felt the need to read a Vegetables procedure implementation because every procedure I use is a function that takes only intent(in) arguments and produces one result. I rarely even have to look at a procedure's interface because the functional programming style makes the interface obvious. That's a form of simplicity that scales really well with software size.

About two years ago, I experimented with going all-in on attempting to follow the functional programming style to its logical conclusion by designing very nearly every part of a simulation package around pure functions. The experience blew me away. It was the first time that I had ever written about 1,200 lines of code that just worked very nearly the first time I integrated all the pieces together. Seriously. I spent almost no time debugging. I had never experienced that before and now it has become the norm for me. It was the basis of the keynote address that I gave at the International Fortran Conference last year.

Because you use co_sum, I recommend starting with the co_sum unit test in Caffeine. Notice that the first function in that module reads like an English-language specification and that's even without comments. Vegetables is designed so that what would go in a comment naturally gets embedded in the semantics. You can't invoke the it procedures without writing a description of what you're doing because the description is the first argument. That's thoughtful design. You don't even need to read the rest of the module to understand what the module is doing. The entire specification is given in the first function and that's true of every test in every project below. I hope the power of this approach demonstrates a different form simplicity. :)

Vegetable Examples

@rouson
Copy link
Collaborator Author

rouson commented Apr 18, 2022

@milancurcic I just clicked on the link you provided to the parallel code in test-drive. When I mentioned parallelism, I wasn't referring to executing multiple unit tests concurrently. I was referring to testing SPMD parallel code. Wouldn't forking multiple threads to run independent tests cause problems if the tests are exploiting SPMD parallelism, e.g., multi-image tests?

@milancurcic
Copy link
Member

When I mentioned parallelism, I wasn't referring to executing multiple unit tests concurrently. I was referring to testing SPMD parallel code. Wouldn't forking multiple threads to run independent tests cause problems if the tests are exploiting SPMD parallelism, e.g., multi-image tests?

I see, I didn't understand that. You're right, I don't think test-drive was designed with that in mind. At the very least, all tests in a suite (assuming one suite per driver) would need to run correctly on the same number of images. And, reporting of the suite results would be duplicated. So, I think it's possible to test parallel code with test-drive, but yes, the writer of the tests would need to handle the parallel logic inside the tests. Test-drive is then not the best candidate for testing parallel code.

Thank you for the co_sum example test (and others). I find it relatively simple and makes me confident that writing simple and small tests with Garden is possible. This was my biggest concern which I'd say is now mostly addressed.

You don't have to sell me on functional style--I was sold years ago. Functional style is an advantage for me, but it alone is not sufficient to warrant choosing one tool over another. The experience that you describe trying to understand test-drive is something that I didn't experience. I never felt the need to look at the test-drive implementation. In fact, the first time I looked was when I went to see how parallelism was implemented in an earlier message in this thread. I adapted many stdlib tests to test-drive (see fortran-lang/stdlib#494) and I didn't look at the implementation a single time. I merely looked at some existing tests that Sebastian already wrote. So, my explanation for the apparently different movies playing in your and my heads is: You used Garden to write many tests--it makes sense to you; I used test-drive to write many tests--it makes sense to me. That said, I don't expect Garden tests to be any more difficult to write than test-drive tests. It's a merely a matter of learning the API.

Notice that the first function in that module reads like an English-language specification and that's even without comments. Vegetables is designed so that what would go in a comment naturally gets embedded in the semantics. You can't invoke the it procedures without writing a description of what you're doing because the description is the first argument.

Damian, this is key. This was the barrier for me when first looking at Garden/Veggies. I looked at about a dozen different Garden tests before. Seeing functions called it, describe, and then, I thought "what the heck does this do" on initial reads. The second thought was, OK, odd function names, but it's just an API and I can learn it. It wasn't until reading your comment that it clicked for me: Garden basically invents what looks like a DSL in which composing function names and string literals creates a set of English-language statements. That's pretty smart and cool, even if a bit gimmicky (not a bad thing!) and unconventional (I find it odd using verbs for function names; it doesn't convey any meaning, etc.).

Once I switch mindset and get used to what Garden is trying to do, the API is quite attractive. Once I remember I'm in an almost DSL-land, it's even more attractive than that of test-drive.

If anything, Garden is too smart. Your explanation that made it click for me should be the first thing in the tutorial. Otherwise, if prospective users come in expecting that the function names alone should convey some meaning, they may get confused.

For me, the advantages to Garden over test-drive are:

  1. You want to write tests with Garden;
  2. We can more easily test parallel code.

Let's go with Garden then. If you want to work on this soon-ish, I recommend staying within the limits of MNIST data and io modules, and the activation functions. Most of the other modules will be entirely re-written with a re-factor that I'm working on.

@rouson
Copy link
Collaborator Author

rouson commented Apr 18, 2022

Garden basically invents what looks like a DSL in which composing function names and string literals creates a set of English-language statements. That's pretty smart and cool, even if a bit gimmicky (not a bad thing!) and unconventional (I find it odd using verbs for function names; it doesn't convey any meaning, etc.).

Exactly. I experienced the same, but I couldn't think of any better suggestions to give for the names. As you probably picked up, "It" refers back to the quote just after describe(... so in the Caffeine co_sum test, the test describes the behavior of "The caf_co_sum subroutine" and each subsequent clause tells us what it (caf_co_sum) does.

Once I switch mindset and get used to what Garden is trying to do, the API is quite attractive. Once I remember I'm in an almost DSL-land, it's even more attractive than that of test-drive.

In reading the Garden documentation yesterday, I discovered that it's inspired by behavior-driven development (BDD), which in turn grew out of test-driven development (TDD). I had heard of BDD before, but didn't really get it until just now. This quote from the Wikipedia page precisely describes what Garden/Veggies tests read like:

"BDD is largely facilitated through the use of a simple DSL using natural-language constructs (e.g., English-like sentences) that can express the behaviour and the expected outcomes."

For me, the advantages to Garden over test-drive are:

  1. You want to write tests with Garden;
  2. We can more easily test parallel code.

Let's go with Garden then. If you want to work on this soon-ish, I recommend staying within the limits of MNIST data and io modules, and the activation functions. Most of the other modules will be entirely re-written with a re-factor that I'm working on.

Thanks for your patience with me. Despite my long-windedness, I often spend time editing my comments down to try to make them shorter, but there's just so much to say! It reminds me of when jazz band leader Miles Davis was complaining about saxophonist John Coltrane taking too long on his solos. Coltrane responded that once he gets going, he just can't figure out how to stop. Miles replied, "Try taking your lips off the horn." Sometimes I probably just need to try taking my fingers off the keyboard. LOL

This will be quick and painless. :)

@milancurcic
Copy link
Member

Most of the stuff done here, will close.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants