Added tensorboard to driver.py for logging #30

Open · wants to merge 1 commit into develop

Conversation

infophysics

Added a few config parameters to allow the use of tensorboard:

```yaml
base:
  run_number: '00'       # defaults to '00', but can be anything
  world_size: 1
  iterations: 100000
  seed: 0
  unwrap: false
  log_dir: /global/homes/n/ncarrara/dune/mlreco/pid_study/pid_solo/
  log_step: 1
  overwrite_log: true
  tensorboard: true      # defaults to false
  train:
    weight_prefix: /global/homes/n/ncarrara/dune/mlreco/pid_study/pid_solo/
    save_step: 1000
    optimizer:
      name: Adam
      lr: 0.001
```

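For context, here is a minimal sketch of reading these options out of a YAML block like the one above. It assumes the file is loaded with PyYAML; `config.yaml` and the variable names are placeholders, not the actual driver code.

```python
# Load the config and pull out the new options with their documented defaults.
import yaml

with open('config.yaml', 'r') as f:
    cfg = yaml.safe_load(f)

base = cfg['base']
run_number = base.get('run_number', '00')          # defaults to '00'
use_tensorboard = base.get('tensorboard', False)   # defaults to False
```
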
The changes to driver.py are as follows (a consolidated sketch of how the pieces fit together follows the list):

1. Adding the `SummaryWriter` import for writing log output in TensorBoard format:

```python
from torch.utils.tensorboard import SummaryWriter
```

2. Reading the run number and tensorboard options from the config:

```python
# check for run number
if 'run_number' not in base:
    base['run_number'] = '00'

# check for tensorboard
if 'tensorboard' not in base:
    base['tensorboard'] = False
```

3. Adding default values for the new options to `initialize_base`:

```python
def initialize_base(self, seed, dtype='float32', world_size=0,
                    log_dir='logs', prefix_log=False, overwrite_log=False,
                    parent_path=None, iterations=None, epochs=None,
                    unwrap=False, rank=None, log_step=1, distributed=False,
                    split_output=False, train=None, verbosity='info',
                    run_number='00', tensorboard=False):
```

4. Saving the run number and tensorboard option as driver members:

```python
self.run_number = run_number
self.tensorboard = tensorboard
```

5. Saving the `log_path` as a driver member and initializing tensorboard:

```python
# Initialize the log
self.log_path = os.path.join(self.log_dir, log_name)
self.logger = CSVWriter(self.log_path, overwrite=self.overwrite_log)

# set up tensorboard
if (self.tensorboard and self.main_process):
    time = datetime.now()
    now = f"{time.year}.{time.month}.{time.day}.{time.hour}:{time.minute}:{time.second}"
    self.tensorboard_dir = f'{self.log_path.replace(".csv", "/")}/{self.run_number}/{now}/'
    self.tensorboard = SummaryWriter(
        log_dir=self.tensorboard_dir
    )
```

6. Logging the `log_dict` losses and accuracies to tensorboard:

```python
# Report to tensorboard
if (self.tensorboard and self.main_process):
    for key, value in log_dict.items():
        if ("loss" in key or "accuracy" in key):
            self.tensorboard.add_scalar(key, value, iteration)
```

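Putting the pieces together, the flow looks roughly like the following. This is a minimal, self-contained sketch: the placeholder values (`log_dir`, `run_number`, the fake `log_dict`, the three-iteration loop) stand in for the driver members referenced above and are not the actual driver code.

```python
import os
from datetime import datetime

from torch.utils.tensorboard import SummaryWriter

# Placeholder values mirroring the driver members above.
log_dir = './logs'        # base['log_dir']
run_number = '00'         # base['run_number'], defaults to '00'
use_tensorboard = True    # base['tensorboard'], defaults to False
main_process = True       # only the main rank writes in a distributed run

# Each run gets its own timestamped subdirectory, as in change 5 above.
writer = None
if use_tensorboard and main_process:
    now = datetime.now().strftime('%Y.%m.%d.%H:%M:%S')
    tensorboard_dir = os.path.join(log_dir, run_number, now)
    writer = SummaryWriter(log_dir=tensorboard_dir)

# Inside the training loop: push losses/accuracies from the per-iteration log dict.
for iteration in range(3):
    log_dict = {'loss': 1.0 / (iteration + 1), 'accuracy': 0.5 + 0.1 * iteration}
    if writer is not None:
        for key, value in log_dict.items():
            if 'loss' in key or 'accuracy' in key:
                writer.add_scalar(key, value, iteration)

if writer is not None:
    writer.close()
```
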
One can easily spin up TensorBoard from a Jupyter notebook at NERSC using the following commands:


```python
import nersc_tensorboard_helper
%load_ext tensorboard
%tensorboard --logdir <location_of_log_path> --port 0
nersc_tensorboard_helper.tb_address()
```

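If a notebook is not available, TensorBoard can also be started from plain Python using its `program` module; this is a generic sketch (the log path is a placeholder), not part of this PR:

```python
# Sketch: start TensorBoard programmatically and print the URL it serves on.
from tensorboard import program

tb = program.TensorBoard()
tb.configure(argv=[None, '--logdir', '/path/to/log_path'])
url = tb.launch()   # returns the local URL of the running TensorBoard instance
print(url)
```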

You will see something like the following:

![image](https://github.com/user-attachments/assets/25d283fa-7870-47cd-825a-6c0fbe8723a3)

TensorBoard allows the comparison of logged output across several runs, which is quite convenient. This also works for submitted jobs.

![image](https://github.com/user-attachments/assets/a94ed629-2690-4533-8394-b44ca0ede333)
![image](https://github.com/user-attachments/assets/1a51b277-0e87-450d-9e6c-e6804278e4ef)
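Because each run lands in its own `run_number/timestamp` subdirectory (change 5 above), pointing TensorBoard at the parent directory picks up all runs side by side. A quick sketch for listing what is there (the path and the example output are placeholders):

```python
# List per-run TensorBoard directories under the log root.
from pathlib import Path

log_root = Path('./logs')   # placeholder for the directory derived from log_path
for run_dir in sorted(p for p in log_root.glob('*/*') if p.is_dir()):
    print(run_dir)          # e.g. logs/00/2025.5.1.12:30:15
```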
