Csv loader pandas (#1267)
* extraneous tracked file?

* matching remote devel

* new CSVLoader and Code model for handling string outputs, plus test

* null check passthrough

* cleanup

* two options for loading now, pandas by default and numpy if needed

* changed order of nodes for xsd validation
PaulTalbot-INL authored Jul 22, 2020
1 parent fb9d960 commit 684d8e5
Showing 17 changed files with 365 additions and 68 deletions.
28 changes: 23 additions & 5 deletions doc/user_manual/couplingAcode.tex
@@ -17,7 +17,7 @@ \section{Advanced Users: How to couple a new code}
path/to/raven/distribution/raven/framework/CodeInterfaces/
\end{lstlisting}
At the initialization stage, RAVEN imports all the Interfaces that are contained in this directory and performs some preliminary cross-checks.
\\It is important to notice that the name of class in the Interface module is the one the user needs to specify when the new interface
needs to be used. For example, if the Interface module contains the class ``NewCode'', the \textit{subType} in the \xmlNode{Code} block will be ``NewCode'':
\begin{lstlisting}[language=python]
class NewCode(CodeInterfaceBase):
@@ -158,6 +158,9 @@ \subsection{Pre-requisites.}
result1, result2, result3
aValue1, aValue2, aValue3
\end{lstlisting}
Note that, in general, RAVEN accepts either floats or strings as data types in the CSV.
However, if the CSV produced by running the code has a large number of columns (say, over 1000), it
is necessary to restrict the values to floats and to change the CSV loading utility; see \ref{subsubsec:setCsvLoadUtil} for details.
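A minimal sketch of the float-only constraint follows; the file name, headers, and values here are hypothetical, and serve only to show the shape of CSV the faster loading utility requires:

```python
import numpy as np

# A stand-in for a code's output CSV: a string header row followed by rows
# containing only floats, which is what the numpy loading utility requires.
rows = ["time,result1,result2",
        "0.0,1.5,2.5",
        "1.0,3.5,4.5"]
with open("example_out.csv", "w") as f:
    f.write("\n".join(rows) + "\n")

# A floats-only CSV loads directly into a 2D array once the header is skipped.
data = np.loadtxt("example_out.csv", delimiter=",", skiprows=1, ndmin=2)
print(data.shape)  # (2, 3)
```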
%%%%%%%
\subsection{Code Interface Creation}
\label{subsec:codeinterfacecreation}
@@ -179,7 +182,7 @@ \subsection{Code Interface Creation}
from CodeInterfaceBaseClass import CodeInterfaceBase
class NewCode(CodeInterfaceBase):
...
def initialize(self, runInfoDict, oriInputFiles)
def finalizeCodeOutput(self, command, output, workingDir)
def getInputExtension(self)
def checkForOutputFailure(self, output, workingDir)
@@ -268,7 +271,7 @@ \subsubsection{Method: \texttt{createNewInput}}
input files. This list of files is the one the code interface needs to use to print the new perturbed list of files.
Indeed, RAVEN already changes the file location in sub-directories and the Code Interface does not need to
change the filename or location of the files. For example, the files are going to have an absolute path such as the following:
$.\\ path\_to\_working\_directory\\stepName\\anUniqueIdentifier\\filename.extension$. In the case of sampling, the
``\textit{anUniqueIdentifier}'' is going to be an integer (e.g. 1).
\item \textbf{\texttt{oriInputFiles}} , data type = list, List of the original input files;
\item \textbf{\texttt{samplerType}} , data type = string, Sampler type (e.g. MonteCarlo,
@@ -318,7 +321,7 @@ \subsubsection{Method: \texttt{initialize}}
def initialize(self, runInfoDict, oriInputFiles)
\end{lstlisting}
The \textbf{initialize} function is an optional method. If present, it is called
by RAVEN at the beginning of each Step (once per step) involving the particular Code Interface.
This method is generally used to retrieve information from the RunInfo and/or the input files.
\\RAVEN is going to call this function passing in the following arguments:
\begin{itemize}
@@ -390,7 +393,7 @@ \subsubsection{Method: \texttt{setRunOnShell}}
some specific use cases, the following argument may need to be set by the code interface developers:
\begin{itemize}
\item{shell}, the default value is \textbf{True}. If shell is \textbf{True}, the specified command
generated by RAVEN will be executed through the shell. This allows RAVEN enhanced
control flow with convenient access to other shell features such as shell pipes, filename wildcards,
environment variable expansion, and expansion of ``~'' to a user's home directory. If shell is
\textbf{False}, all shell-based features are disabled. In other words, the user cannot use the
@@ -406,6 +409,21 @@ \subsubsection{Method: \texttt{setRunOnShell}}
\end{lstlisting}
\end{itemize}
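The distinction between the two modes can be illustrated with Python's \texttt{subprocess} module; RAVEN's actual command execution is more involved, so this is only a sketch:

```python
import subprocess

# shell=True: the command is a single string, and shell features such as
# pipes and variable expansion are interpreted by the shell.
withShell = subprocess.run("echo raven | tr a-z A-Z", shell=True,
                           capture_output=True, text=True)
print(withShell.stdout.strip())  # RAVEN

# shell=False: the command is a list of arguments and no shell features
# (pipes, wildcards, "~" expansion) are available.
withoutShell = subprocess.run(["echo", "raven"], capture_output=True, text=True)
print(withoutShell.stdout.strip())  # raven
```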

\subsubsection{Method: \texttt{setCsvLoadUtil}}
\label{subsubsec:setCsvLoadUtil}
\begin{lstlisting}[language=python]
self.setCsvLoadUtil('pandas')
\end{lstlisting}
The default CSV loader in RAVEN is \texttt{pandas}, which allows arbitrary data types in the CSV,
generally strings and floats. However, arbitrary data can be challenging to load if there are a
large number of columns in the code's output CSV that RAVEN attempts to read in. As a rule of
thumb, if there are over 1000 columns in a typical output CSV for your Code, the values should be
limited to floats and integers (not strings), and this method should be called during the
CodeInterface construction or initialization to set the loading utility to \texttt{numpy}. While
RAVEN's \texttt{numpy} CSV loading is notably faster than RAVEN's \texttt{pandas} CSV loading, it
does not allow the flexibility of string entries except in the CSV header.
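The pattern looks like the following sketch. The base class here is a minimal stand-in for RAVEN's `CodeInterfaceBase` (the real one lives in `framework/CodeInterfaceBaseClass.py`), and `WideCsvCode` is a hypothetical interface:

```python
# Minimal stand-in for RAVEN's CodeInterfaceBase, for illustration only.
class CodeInterfaceBase:
  def __init__(self):
    self._csvLoadUtil = 'pandas'  # RAVEN's default CSV loader

  def setCsvLoadUtil(self, util):
    ok = ['pandas', 'numpy']
    if util not in ok:
      raise TypeError(f'Unrecognized CSV loading utility: "{util}"! Expected one of: {ok}')
    self._csvLoadUtil = util

class WideCsvCode(CodeInterfaceBase):
  """
    Hypothetical interface for a code whose output CSVs have over 1000
    float-only columns.
  """
  def __init__(self):
    CodeInterfaceBase.__init__(self)
    self.setCsvLoadUtil('numpy')  # opt in to the faster numpy loader

print(WideCsvCode()._csvLoadUtil)  # numpy
```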

\subsection{Tools for Developing Code Interfaces}
To make generating a code interface as simple as possible, there are several tools RAVEN makes available within the Code Interface objects.

23 changes: 23 additions & 0 deletions framework/CodeInterfaceBaseClass.py
@@ -24,6 +24,7 @@

#Internal Modules------------------------------------------------------------------------------------
from utils import utils
import CsvLoader
#Internal Modules End--------------------------------------------------------------------------------

class CodeInterfaceBase(utils.metaclass_insert(abc.ABCMeta,object)):
@@ -44,6 +45,7 @@ def __init__(self):
self.inputExtensions = [] # list of input extensions
self._runOnShell = True # True if the specified command by the code interfaces will be executed through shell.
self._ravenWorkingDir = None # location of RAVEN's main working directory
self._csvLoadUtil = 'pandas' # utility to use to load CSVs

def setRunOnShell(self, shell=True):
"""
@@ -61,6 +63,27 @@ def getRunOnShell(self):
"""
return self._runOnShell

def getCsvLoadUtil(self):
"""
Returns the string representation of the CSV loading utility to use
@ In, None
@ Out, getCsvLoadUtil, str, name of utility to use
"""
# default to pandas, overwrite to 'numpy' if all of the following:
# - all entries are guaranteed to be floats
# - results CSV have a large number of headers (>1000)
return self._csvLoadUtil

def setCsvLoadUtil(self, util):
"""
Sets the CSV loading utility to use
@ In, util, str, name of utility to use
@ Out, None
"""
ok = CsvLoader.CsvLoader.acceptableUtils
if util not in ok:
raise TypeError(f'Unrecognized CSV loading utility: "{util}"! Expected one of: {ok}')
self._csvLoadUtil = util

def genCommand(self, inputFiles, executable, flags=None, fileArgs=None, preExec=None):
"""
This method is used to retrieve the command (in tuple format) needed to launch the Code.
12 changes: 11 additions & 1 deletion framework/CodeInterfaces/CobraTF/CTFinterface.py
@@ -17,15 +17,25 @@
"""

from __future__ import division, print_function, unicode_literals, absolute_import
import os
from ctfdata import ctfdata
from CodeInterfaceBaseClass import CodeInterfaceBase
from GenericCodeInterface import GenericParser

class CTF(CodeInterfaceBase):
"""
this class is used a part of a code dictionary to specialize Model.Code for CTF (Cobra-TF)
"""
def __init__(self):
"""
Constructor.
@ In, None
@ Out, None
"""
CodeInterfaceBase.__init__(self)
# CTF creates enormous CSVs that are all floats, so we use numpy to speed up the loading
self.setCsvLoadUtil('numpy')

def finalizeCodeOutput(self,command,output,workingDir):
"""
This method is called by the RAVEN code at the end of each code run to create CSV files containing the code output results.
106 changes: 77 additions & 29 deletions framework/CsvLoader.py
@@ -21,10 +21,10 @@
from __future__ import division, print_function, unicode_literals, absolute_import
#End compatibility block for Python 3----------------------------------------------------------------

#External Modules------------------------------------------------------------------------------------
import numpy as np
from scipy.interpolate import interp1d
import copy
import pandas as pd
#External Modules End--------------------------------------------------------------------------------

#Internal Modules------------------------------------------------------------------------------------
@@ -36,43 +36,91 @@ class CsvLoader(MessageHandler.MessageUser):
"""
Class aimed to load the CSV files
"""
acceptableUtils = ['pandas', 'numpy']

def __init__(self, messageHandler):
"""
Constructor
@ In, messageHandler, MessageHandler, the message handler
@ Out, None
"""
self.type = 'CsvLoader' # naming type for this class
self.printTag = self.type # message handling representation
self.allOutParam = False # all output parameters?
self.allFieldNames = [] # "header" of the CSV file
self.messageHandler = messageHandler # message handling utility

def loadCsvFile(self, myFile, nullOK=None, utility='pandas'):
"""
Function to load a csv file into realization format
It also retrieves the headers
The format of the csv must be comma-separated (pandas readable)
@ In, myFile, string, Input file name (absolute path)
@ In, nullOK, bool, indicates if null values are acceptable
@ In, utility, str, indicates which utility should be used to load the csv
@ Out, loadCsvFile, pandas.DataFrame or numpy.ndarray, the loaded data
"""
if utility == 'pandas':
return self._loadCsvPandas(myFile, nullOK=nullOK)
elif utility == 'numpy':
return self._loadCsvNumpy(myFile, nullOK=nullOK)
else:
self.raiseAnError(RuntimeError, f'Unrecognized CSV loading utility: "{utility}"')

def _loadCsvPandas(self, myFile, nullOK=None):
"""
Function to load a csv file into realization format
It also retrieves the headers
The format of the csv must be comma-separated (pandas readable)
@ In, myFile, string, Input file name (absolute path)
@ In, nullOK, bool, indicates if null values are acceptable
@ Out, df, pandas.DataFrame, the loaded data
"""
# first try reading the file
try:
df = pd.read_csv(myFile)
except pd.errors.EmptyDataError:
# no data in file
self.raiseAWarning(f'Tried to read data from "{myFile}", but the file is empty!')
return
else:
self.raiseADebug(f'Reading data from "{myFile}"')
# check for NaN contents -> this isn't allowed in RAVEN currently, although we might need to change this for ND
if (not nullOK) and (pd.isnull(df).values.sum() != 0):
bad = pd.isnull(df).any(1).nonzero()[0][0]
self.raiseAnError(IOError, f'Invalid data in input file: row "{bad+1}" in "{myFile}"')
self.allFieldNames = list(df.columns)
return df
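The pandas path above can be exercised standalone; this sketch mirrors the read-then-null-check pattern (the file name and values are hypothetical):

```python
import pandas as pd

# write a small, well-formed CSV to load
with open("sample.csv", "w") as f:
    f.write("a,b\n1.0,2.0\n3.0,4.0\n")

df = pd.read_csv("sample.csv")
# mirror the loader's NaN rejection when nullOK is False
nullOK = False
if not nullOK and pd.isnull(df).values.sum() != 0:
    raise IOError("Invalid (NaN) data in input file")
print(list(df.columns))  # ['a', 'b']
```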

def _loadCsvNumpy(self, myFile, nullOK=None):
"""
Function to load a csv file into realization format
It also retrieves the headers
The format of the csv must be comma-separated with all floats after header row
@ In, myFile, string, Input file name (absolute path)
@ In, nullOK, bool, indicates if null values are acceptable
@ Out, data, np.ndarray, the loaded data
"""
with open(myFile, 'rb') as f:
head = f.readline().decode()
self.allFieldNames = list(x.strip() for x in head.split(','))
data = np.loadtxt(myFile, dtype=float, delimiter=',', ndmin=2, skiprows=1)
return data

def toRealization(self, data):
"""
Converts data from the "loadCsvFile" format to a realization-style format (dictionary
currently)
@ In, data, pandas.DataFrame or np.ndarray, result of loadCsvFile
@ Out, rlz, dict, realization
"""
rlz = {}
if isinstance(data, pd.DataFrame):
rlz = dict((header, np.array(data[header])) for header in self.allFieldNames)
elif isinstance(data, np.ndarray):
rlz = dict((header, entry) for header, entry in zip(self.allFieldNames, data.T))
return rlz
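The two branches of the conversion above can be mirrored standalone; field names and data values here are illustrative:

```python
import numpy as np
import pandas as pd

fieldNames = ['time', 'result1']

# pandas DataFrame case: one realization entry per column header
df = pd.DataFrame({'time': [0.0, 1.0], 'result1': [2.0, 3.0]})
rlzFromDf = dict((h, np.array(df[h])) for h in fieldNames)

# numpy array case: rows are samples, so transpose to pair columns with headers
arr = np.array([[0.0, 2.0],
                [1.0, 3.0]])
rlzFromArr = dict((h, col) for h, col in zip(fieldNames, arr.T))

print(rlzFromDf['result1'])  # [2. 3.]
print(rlzFromArr['time'])    # [0. 1.]
```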

def getAllFieldNames(self):
"""
Function to get all field names found in the csv file
17 changes: 4 additions & 13 deletions framework/DataObjects/DataSet.py
@@ -35,6 +35,8 @@
from .DataObject import DataObject
except ValueError:
from DataObject import DataObject

import CsvLoader
from utils import utils, cached_ndarray, xmlUtils, mathUtils

# for profiling with kernprof
@@ -1802,19 +1804,8 @@ def _readPandasCSV(self, fname, nullOK=None):
# datasets can have them because we don't have a 2d+ CSV storage strategy yet
else:
nullOK = True
loader = CsvLoader.CsvLoader(self.messageHandler)
df = loader.loadCsvFile(fname, nullOK=nullOK)
return df

def _resetScaling(self):
6 changes: 3 additions & 3 deletions framework/DataObjects/HistorySet.py
@@ -206,7 +206,7 @@ def _selectiveRealization(self,rlz):
# TODO someday needs to be implemented for when ND data is collected! For now, use base class.
# TODO externalize it in the DataObject base class
toRemove = []
for var, val in rlz.items():
if var in self.protectedTags:
continue
# only modify it if it is not already scalar
@@ -227,9 +227,9 @@ def _selectiveRealization(self,rlz):
val = val.values
# FIXME this is largely a byproduct of old length-one-vector approaches in the deprecated data objects
if val.size == 1:
rlz[var] = val[0]
else:
rlz[var] = val[indic]
elif method in ['inputPivotValue']:
pivotParam = self.getDimensions(var)
assert(len(pivotParam) == 1) # TODO only handle History for now
8 changes: 4 additions & 4 deletions framework/DataObjects/PointSet.py
@@ -118,7 +118,7 @@ def _selectiveRealization(self,rlz):
# data was previously formatted by _formatRealization
# then select the point we want
toRemove = []
for var, val in rlz.items():
if var in self.protectedTags:
continue
# only modify it if it is not already scalar
@@ -134,7 +134,7 @@
else:
toRemove.append(var)
continue
if method in ['inputRow', 'outputRow']:
# zero-d xarrays give false behavior sometimes
# TODO formatting should not be necessary once standardized history,float realizations are established
if type(val) == list:
@@ -143,9 +143,9 @@
val = val.values
# FIXME this is largely a byproduct of old length-one-vector approaches in the deprecated data objects
if val.size == 1:
rlz[var] = val[0]
else:
rlz[var] = val[indic]
elif method in ['inputPivotValue', 'outputPivotValue']:
pivotParam = self.getDimensions(var)
assert(len(pivotParam) == 1) # TODO only handle History for now
