Csv loader pandas (#1267)
* extraneous tracked file?

* matching remote devel

* new CSVLoader and Code model for handling string outputs, plus test

* null check passthrough

* cleanup

* two options for loading now, pandas by default and numpy if needed

* changed order of nodes for xsd validation
PaulTalbot-INL authored Jul 22, 2020
1 parent fb9d960 commit 684d8e5
Showing 17 changed files with 365 additions and 68 deletions.
28 changes: 23 additions & 5 deletions doc/user_manual/couplingAcode.tex
@@ -17,7 +17,7 @@ \section{Advanced Users: How to couple a new code}
path/to/raven/distribution/raven/framework/CodeInterfaces/
\end{lstlisting}
At the initialization stage, RAVEN imports all the Interfaces that are contained in this directory and performs some preliminary cross-checks.
\\It is important to notice that the name of class in the Interface module is the one the user needs to specify when the new interface
needs to be used. For example, if the Interface module contains the class ``NewCode'', the \textit{subType} in the \xmlNode{Code} block will be ``NewCode'':
\begin{lstlisting}[language=python]
class NewCode(CodeInterfaceBase):
@@ -158,6 +158,9 @@ \subsection{Pre-requisites.}
result1, result2, result3
aValue1, aValue2, aValue3
\end{lstlisting}
Note that, in general, RAVEN accepts either floats or strings as data types in the CSV.
However, if the CSV produced by running the code has a large number of columns (say, over 1000), it
is necessary to restrict the values to floats and to change the CSV loading utility; see \ref{subsubsec:setCsvLoadUtil} for details.
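A minimal sketch of the float-only constraint follows; the file name, headers, and values here are hypothetical, and serve only to show the shape of CSV the faster loading utility requires:

```python
import numpy as np

# A stand-in for a code's output CSV: a string header row followed by rows
# containing only floats, which is what the numpy loading utility requires.
rows = ["time,result1,result2",
        "0.0,1.5,2.5",
        "1.0,3.5,4.5"]
with open("example_out.csv", "w") as f:
    f.write("\n".join(rows) + "\n")

# A floats-only CSV loads directly into a 2D array once the header is skipped.
data = np.loadtxt("example_out.csv", delimiter=",", skiprows=1, ndmin=2)
print(data.shape)  # (2, 3)
```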
%%%%%%%
\subsection{Code Interface Creation}
\label{subsec:codeinterfacecreation}
@@ -179,7 +182,7 @@ \subsection{Code Interface Creation}
from CodeInterfaceBaseClass import CodeInterfaceBase
class NewCode(CodeInterfaceBase):
...
def initialize(self, runInfoDict, oriInputFiles)
def finalizeCodeOutput(self, command, output, workingDir)
def getInputExtension(self)
def checkForOutputFailure(self, output, workingDir)
@@ -268,7 +271,7 @@ \subsubsection{Method: \texttt{createNewInput}}
input files. This list of files is the one the code interface needs to use to print the new perturbed list of files.
Indeed, RAVEN already changes the file location in sub-directories and the Code Interface does not need to
change the filename or location of the files. For example, the files are going to have an absolute path such as the following:
$.\\ path\_to\_working\_directory\\stepName\\anUniqueIdentifier\\filename.extension$. In the case of sampling, the
``\textit{anUniqueIdentifier}'' is going to be an integer (e.g. 1).
\item \textbf{\texttt{oriInputFiles}} , data type = list, List of the original input files;
\item \textbf{\texttt{samplerType}} , data type = string, Sampler type (e.g. MonteCarlo,
@@ -318,7 +321,7 @@ \subsubsection{Method: \texttt{initialize}}
def initialize(self, runInfoDict, oriInputFiles)
\end{lstlisting}
The \textbf{initialize} function is an optional method. If present, it is called
by RAVEN at the beginning of each Step (once per step) involving the particular Code Interface.
This method is generally used to retrieve information from the RunInfo and/or the input files.
\\RAVEN is going to call this function passing in the following arguments:
\begin{itemize}
@@ -390,7 +393,7 @@ \subsubsection{Method: \texttt{setRunOnShell}}
some specific use cases, the following argument may need to be set by the code interface developers:
\begin{itemize}
\item{shell}, the default value is \textbf{True}. If shell is \textbf{True}, the specified command
generated by RAVEN will be executed through the shell. This allows RAVEN enhanced
control flow with convenient access to other shell features such as shell pipes, filename wildcards,
environment variable expansion, and expansion of ``~'' to a user's home directory. If shell is
\textbf{False}, all shell-based features are disabled. In other words, the user cannot use the
@@ -406,6 +409,21 @@ \subsubsection{Method: \texttt{setRunOnShell}}
\end{lstlisting}
\end{itemize}
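The distinction between the two modes can be illustrated with Python's \texttt{subprocess} module; RAVEN's actual command execution is more involved, so this is only a sketch:

```python
import subprocess

# shell=True: the command is a single string, and shell features such as
# pipes and variable expansion are interpreted by the shell.
withShell = subprocess.run("echo raven | tr a-z A-Z", shell=True,
                           capture_output=True, text=True)
print(withShell.stdout.strip())  # RAVEN

# shell=False: the command is a list of arguments and no shell features
# (pipes, wildcards, "~" expansion) are available.
withoutShell = subprocess.run(["echo", "raven"], capture_output=True, text=True)
print(withoutShell.stdout.strip())  # raven
```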

\subsubsection{Method: \texttt{setCsvLoadUtil}}
\label{subsubsec:setCsvLoadUtil}
\begin{lstlisting}[language=python]
self.setCsvLoadUtil('pandas')
\end{lstlisting}
The default CSV loader in RAVEN is \texttt{pandas}, which allows arbitrary data types in the CSV,
generally strings and floats. However, arbitrary data can be challenging to load if there are a
large number of columns in the code's output CSV that RAVEN attempts to read in. As a rule of
thumb, if there are over 1000 columns in a typical output CSV for your Code, the values should be
limited to floats and integers (not strings), and this method should be called during the
CodeInterface construction or initialization to set the loading utility to \texttt{numpy}. While
RAVEN's \texttt{numpy} CSV loading is notably faster than RAVEN's \texttt{pandas} CSV loading, it
does not allow the flexibility of string entries except in the CSV header.
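The pattern looks like the following sketch. The base class here is a minimal stand-in for RAVEN's `CodeInterfaceBase` (the real one lives in `framework/CodeInterfaceBaseClass.py`), and `WideCsvCode` is a hypothetical interface:

```python
# Minimal stand-in for RAVEN's CodeInterfaceBase, for illustration only.
class CodeInterfaceBase:
  def __init__(self):
    self._csvLoadUtil = 'pandas'  # RAVEN's default CSV loader

  def setCsvLoadUtil(self, util):
    ok = ['pandas', 'numpy']
    if util not in ok:
      raise TypeError(f'Unrecognized CSV loading utility: "{util}"! Expected one of: {ok}')
    self._csvLoadUtil = util

class WideCsvCode(CodeInterfaceBase):
  """
    Hypothetical interface for a code whose output CSVs have over 1000
    float-only columns.
  """
  def __init__(self):
    CodeInterfaceBase.__init__(self)
    self.setCsvLoadUtil('numpy')  # opt in to the faster numpy loader

print(WideCsvCode()._csvLoadUtil)  # numpy
```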

\subsection{Tools for Developing Code Interfaces}
To make generating a code interface as simple as possible, there are several tools RAVEN makes available within the Code Interface objects.

23 changes: 23 additions & 0 deletions framework/CodeInterfaceBaseClass.py
@@ -24,6 +24,7 @@

#Internal Modules------------------------------------------------------------------------------------
from utils import utils
import CsvLoader
#Internal Modules End--------------------------------------------------------------------------------

class CodeInterfaceBase(utils.metaclass_insert(abc.ABCMeta,object)):
@@ -44,6 +45,7 @@ def __init__(self):
self.inputExtensions = [] # list of input extensions
self._runOnShell = True # True if the specified command by the code interfaces will be executed through shell.
self._ravenWorkingDir = None # location of RAVEN's main working directory
self._csvLoadUtil = 'pandas' # utility to use to load CSVs

def setRunOnShell(self, shell=True):
"""
@@ -61,6 +63,27 @@ def getRunOnShell(self):
"""
return self._runOnShell

def getCsvLoadUtil(self):
"""
Returns the string representation of the CSV loading utility to use
@ In, None
@ Out, getCsvLoadUtil, str, name of utility to use
"""
# default to pandas, overwrite to 'numpy' if all of the following:
# - all entries are guaranteed to be floats
# - results CSV have a large number of headers (>1000)
return self._csvLoadUtil

def setCsvLoadUtil(self, util):
"""
Sets the CSV loading utility to use
@ In, util, str, name of utility to use
@ Out, None
"""
ok = CsvLoader.CsvLoader.acceptableUtils
if util not in ok:
raise TypeError(f'Unrecognized CSV loading utility: "{util}"! Expected one of: {ok}')
self._csvLoadUtil = util

def genCommand(self, inputFiles, executable, flags=None, fileArgs=None, preExec=None):
"""
This method is used to retrieve the command (in tuple format) needed to launch the Code.
12 changes: 11 additions & 1 deletion framework/CodeInterfaces/CobraTF/CTFinterface.py
@@ -17,15 +17,25 @@
"""

from __future__ import division, print_function, unicode_literals, absolute_import
import os
from ctfdata import ctfdata
from CodeInterfaceBaseClass import CodeInterfaceBase
from GenericCodeInterface import GenericParser

class CTF(CodeInterfaceBase):
"""
this class is used a part of a code dictionary to specialize Model.Code for CTF (Cobra-TF)
"""
def __init__(self):
"""
Constructor.
@ In, None
@ Out, None
"""
CodeInterfaceBase.__init__(self)
# CTF creates enormous CSVs that are all floats, so we use numpy to speed up the loading
self.setCsvLoadUtil('numpy')

def finalizeCodeOutput(self,command,output,workingDir):
"""
This method is called by the RAVEN code at the end of each code run to create CSV files containing the code output results.
106 changes: 77 additions & 29 deletions framework/CsvLoader.py
@@ -21,10 +21,10 @@
from __future__ import division, print_function, unicode_literals, absolute_import
#End compatibility block for Python 3----------------------------------------------------------------

#External Modules------------------------------------------------------------------------------------
import numpy as np
from scipy.interpolate import interp1d
import copy
import pandas as pd
#External Modules End--------------------------------------------------------------------------------

#Internal Modules------------------------------------------------------------------------------------
@@ -36,43 +36,91 @@ class CsvLoader(MessageHandler.MessageUser):
"""
Class aimed to load the CSV files
"""
acceptableUtils = ['pandas', 'numpy']

def __init__(self, messageHandler):
"""
Constructor
@ In, messageHandler, MessageHandler, the message handler
@ Out, None
"""
self.type = 'CsvLoader' # naming type for this class
self.printTag = self.type # message handling representation
self.allOutParam = False # all output parameters?
self.allFieldNames = [] # "header" of the CSV file
self.messageHandler = messageHandler # message handling utility

def loadCsvFile(self, myFile, nullOK=None, utility='pandas'):
"""
Function to load a csv file into realization format
It also retrieves the headers
The format of the csv must be comma-separated (pandas readable)
@ In, myFile, string, Input file name (absolute path)
@ In, nullOK, bool, indicates if null values are acceptable
@ In, utility, str, indicates which utility should be used to load the csv
@ Out, loadCsvFile, pandas.DataFrame or numpy.ndarray, the loaded data
"""
if utility == 'pandas':
return self._loadCsvPandas(myFile, nullOK=nullOK)
elif utility == 'numpy':
return self._loadCsvNumpy(myFile, nullOK=nullOK)
else:
self.raiseAnError(RuntimeError, f'Unrecognized CSV loading utility: "{utility}"')

def _loadCsvPandas(self, myFile, nullOK=None):
"""
Function to load a csv file into realization format
It also retrieves the headers
The format of the csv must be comma-separated (pandas readable)
@ In, myFile, string, Input file name (absolute path)
@ In, nullOK, bool, indicates if null values are acceptable
@ Out, df, pandas.DataFrame, the loaded data
"""
# first try reading the file
try:
df = pd.read_csv(myFile)
except pd.errors.EmptyDataError:
# no data in file
self.raiseAWarning(f'Tried to read data from "{myFile}", but the file is empty!')
return
else:
self.raiseADebug(f'Reading data from "{myFile}"')
# check for NaN contents -> this isn't allowed in RAVEN currently, although we might need to change this for ND
if (not nullOK) and (pd.isnull(df).values.sum() != 0):
bad = pd.isnull(df).any(1).nonzero()[0][0]
self.raiseAnError(IOError, f'Invalid data in input file: row "{bad+1}" in "{myFile}"')
self.allFieldNames = list(df.columns)
return df
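The pandas path above can be exercised standalone; this sketch mirrors the read-then-null-check pattern (the file name and values are hypothetical):

```python
import pandas as pd

# write a small, well-formed CSV to load
with open("sample.csv", "w") as f:
    f.write("a,b\n1.0,2.0\n3.0,4.0\n")

df = pd.read_csv("sample.csv")
# mirror the loader's NaN rejection when nullOK is False
nullOK = False
if not nullOK and pd.isnull(df).values.sum() != 0:
    raise IOError("Invalid (NaN) data in input file")
print(list(df.columns))  # ['a', 'b']
```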

def _loadCsvNumpy(self, myFile, nullOK=None):
"""
Function to load a csv file into realization format
It also retrieves the headers
The format of the csv must be comma-separated with all floats after header row
@ In, myFile, string, Input file name (absolute path)
@ In, nullOK, bool, indicates if null values are acceptable
@ Out, data, np.ndarray, the loaded data
"""
with open(myFile, 'rb') as f:
head = f.readline().decode()
self.allFieldNames = list(x.strip() for x in head.split(','))
data = np.loadtxt(myFile, dtype=float, delimiter=',', ndmin=2, skiprows=1)
return data

def toRealization(self, data):
"""
Converts data from the "loadCsvFile" format to a realization-style format (dictionary
currently)
@ In, data, pandas.DataFrame or np.ndarray, result of loadCsvFile
@ Out, rlz, dict, realization
"""
rlz = {}
if isinstance(data, pd.DataFrame):
rlz = dict((header, np.array(data[header])) for header in self.allFieldNames)
elif isinstance(data, np.ndarray):
rlz = dict((header, entry) for header, entry in zip(self.allFieldNames, data.T))
return rlz
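The two branches of the conversion above can be mirrored standalone; field names and data values here are illustrative:

```python
import numpy as np
import pandas as pd

fieldNames = ['time', 'result1']

# pandas DataFrame case: one realization entry per column header
df = pd.DataFrame({'time': [0.0, 1.0], 'result1': [2.0, 3.0]})
rlzFromDf = dict((h, np.array(df[h])) for h in fieldNames)

# numpy array case: rows are samples, so transpose to pair columns with headers
arr = np.array([[0.0, 2.0],
                [1.0, 3.0]])
rlzFromArr = dict((h, col) for h, col in zip(fieldNames, arr.T))

print(rlzFromDf['result1'])  # [2. 3.]
print(rlzFromArr['time'])    # [0. 1.]
```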

def getAllFieldNames(self):
"""
Function to get all field names found in the csv file
17 changes: 4 additions & 13 deletions framework/DataObjects/DataSet.py
@@ -35,6 +35,8 @@
from .DataObject import DataObject
except ValueError:
from DataObject import DataObject

import CsvLoader
from utils import utils, cached_ndarray, xmlUtils, mathUtils

# for profiling with kernprof
@@ -1802,19 +1804,8 @@ def _readPandasCSV(self, fname, nullOK=None):
# datasets can have them because we don't have a 2d+ CSV storage strategy yet
else:
nullOK = True
loader = CsvLoader.CsvLoader(self.messageHandler)
df = loader.loadCsvFile(fname, nullOK=nullOK)
return df

def _resetScaling(self):
6 changes: 3 additions & 3 deletions framework/DataObjects/HistorySet.py
@@ -206,7 +206,7 @@ def _selectiveRealization(self,rlz):
# TODO someday needs to be implemented for when ND data is collected! For now, use base class.
# TODO externalize it in the DataObject base class
toRemove = []
for var, val in rlz.items():
if var in self.protectedTags:
continue
# only modify it if it is not already scalar
@@ -227,9 +227,9 @@ def _selectiveRealization(self,rlz):
val = val.values
# FIXME this is largely a byproduct of old length-one-vector approaches in the deprecated data objects
if val.size == 1:
rlz[var] = val[0]
else:
rlz[var] = val[indic]
elif method in ['inputPivotValue']:
pivotParam = self.getDimensions(var)
assert(len(pivotParam) == 1) # TODO only handle History for now
8 changes: 4 additions & 4 deletions framework/DataObjects/PointSet.py
@@ -118,7 +118,7 @@ def _selectiveRealization(self,rlz):
# data was previously formatted by _formatRealization
# then select the point we want
toRemove = []
for var, val in rlz.items():
if var in self.protectedTags:
continue
# only modify it if it is not already scalar
@@ -134,7 +134,7 @@
else:
toRemove.append(var)
continue
if method in ['inputRow', 'outputRow']:
# zero-d xarrays give false behavior sometimes
# TODO formatting should not be necessary once standardized history,float realizations are established
if type(val) == list:
@@ -143,9 +143,9 @@
val = val.values
# FIXME this is largely a byproduct of old length-one-vector approaches in the deprecated data objects
if val.size == 1:
rlz[var] = val[0]
else:
rlz[var] = val[indic]
elif method in ['inputPivotValue', 'outputPivotValue']:
pivotParam = self.getDimensions(var)
assert(len(pivotParam) == 1) # TODO only handle History for now
