Other data analysis toolkits have something like a summary function that computes sensible properties of each column: things like min, max, and mean for numeric columns, or the number of unique values for textual columns. We may be able to implement a similar utility with a type class whose instance for each column type is a Fold (a sketch of this idea follows the example tables below).
For example, suppose we have the following Frame:
user id | age | gender | occupation | zip code |
---|---|---|---|---|
1 | 24 | M | technician | 85711 |
2 | 53 | F | other | 94043 |
3 | 23 | M | writer | 32067 |
4 | 24 | M | technician | 43537 |
The summary for a single column like occupation
(a categorical variable) might look like this:
category | count |
---|---|
technician | 2 |
other | 1 |
writer | 1 |
However, the summary for a column like age
(an interval variable) could look like this:
parameter | value |
---|---|
min | 23 |
max | 53 |
mean | 31.0 |
stddev | 14.67 |
median | 24.0 |
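As a check on those values: mean = (24 + 53 + 23 + 24) / 4 = 31, the sample (n - 1) standard deviation is sqrt(((24 - 31)^2 + (53 - 31)^2 + (23 - 31)^2 + (24 - 31)^2) / 3) ≈ 14.67, and the median of 23, 24, 24, 53 is 24.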
The summary of a whole frame might simply string together the summaries of each column:
entity | attribute | value |
---|---|---|
user_id | class | Num |
user_id | class | Key |
user_id | min | 1 |
user_id | max | 4 |
user_id | mean | 2.5 |
user_id | stddev | 1.29 |
user_id | median | 2.5 |
user_id | unique | 4 |
age | class | Interval |
age | class | Ratio |
age | min | 23 |
age | max | 53 |
age | mean | 31.0 |
age | stddev | 14.67 |
age | median | 24.0 |
gender | class | Categorical |
gender | numCategories | 2 |
occupation | class | Categorical |
occupation | numCategories | 3 |
zip code | class | Categorical |
zip code | numCategories | 4 |
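One way to realize the Fold-per-column-type idea from above is a class whose instances package up a Fold from the foldl package. The following is only a minimal sketch under that assumption; the Summarizable class, its method name, and the exact statistics reported are made up for illustration, not an existing API.

{-# LANGUAGE OverloadedStrings #-}
import qualified Control.Foldl as L
import qualified Data.Map.Strict as M
import Data.Text (Text, pack)

-- Hypothetical class: each column type knows how to reduce a column of its
-- values to (parameter, value) pairs in a single pass.
class Summarizable a where
  summary :: L.Fold a [(Text, Text)]

-- Interval/ratio columns report descriptive statistics.
instance Summarizable Int where
  summary = stats <$> L.minimum
                  <*> L.maximum
                  <*> L.premap fromIntegral (L.mean :: L.Fold Double Double)
                  <*> L.premap fromIntegral (L.std  :: L.Fold Double Double)
    where
      stats mn mx mean std =
        [ ("min",    maybe "n/a" (pack . show) mn)
        , ("max",    maybe "n/a" (pack . show) mx)
        , ("mean",   pack (show mean))
        , ("stddev", pack (show std)) -- note: L.std is the population stddev
        ]

-- Categorical columns report a count per category.
instance Summarizable Text where
  summary = M.toList . fmap (pack . show)
              <$> L.Fold (\m k -> M.insertWith (+) k (1 :: Int) m) M.empty id

Running L.fold summary over a column then produces the rows of a per-column summary table like the ones above.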
We currently support overriding either all of the inferred column types or none of them, but something more granular would be helpful, particularly for categorical variables, where the inferred data type is more likely to be wrong.
The idea is that a numeric column could have several “failure modes” that we might like to enumerate.
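As a rough illustration (the type and its constructors are hypothetical, not something the library defines), such an enumeration might look like:

-- Hypothetical enumeration of ways a "mostly numeric" column can deviate
-- from a clean Int column during type inference.
data NumericFailure
  = HasMissingValues  -- empty fields or sentinels such as "NA"
  | HasFractional     -- values like "2.5" that force Double rather than Int
  | HasNonNumeric     -- occasional free-text entries
  deriving (Eq, Show)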
- Show how to perform a fold that computes something based on some grouping criterion (see the sketch below).
- Get a benchmark sorted out.
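Here is a minimal sketch of that grouped fold, written against the foldl package; groupedBy is a made-up helper, not an existing Frames or foldl function.

import qualified Control.Foldl as L
import qualified Data.Map.Strict as M

-- Run an inner Fold once per group, where groups are chosen by a key function.
groupedBy :: Ord k => (a -> k) -> L.Fold a r -> L.Fold a (M.Map k r)
groupedBy key (L.Fold step begin done) = L.Fold step' M.empty (fmap done)
  where
    step' m a = M.alter (Just . (`step` a) . maybe begin id) (key a) m

-- Example: sum the second component per key.
-- L.fold (groupedBy fst (L.premap snd L.sum)) [("a", 1), ("b", 2), ("a", 3 :: Int)]
-- == M.fromList [("a", 4), ("b", 2)]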
Some handy examples:

> foldCoRec parsedTypeRep (bestRep "r" :: CoRec ColInfo CommonColumns)
Definitely Text
> foldCoRec parsedTypeRep (bestRep "23" :: CoRec ColInfo CommonColumns)
Definitely Int
One area where GHC’s stage restrictions on Template Haskell bite us is in stating how to parse a particular file. The problem is that we parse the file once at compile time to generate type declarations, and again at runtime to read the values. As a reminder, GHC’s stage restriction means that values we pass to a splice must be literals or imported from another module. To be clear, we can’t do this:

x :: Foo
x = foo 23

-- Rejected: x is defined in this module, so the stage restriction
-- forbids passing it to the splice.
mySplice x "skidoo"

myData :: RuntimeFoo
myData = readStuff x "skidoo"
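What does work (same hypothetical names as above) is moving the value into its own module so that both the compile-time splice and the runtime code can import it:

-- MyOptions.hs (the module name is illustrative)
module MyOptions where

x :: Foo
x = foo 23

-- In the module that runs the splice: because x is now imported,
-- the stage restriction is satisfied.
import MyOptions (x)

mySplice x "skidoo"

myData :: RuntimeFoo
myData = readStuff x "skidoo"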
This means that a value representing parser options must be imported so that it can be used during both phases. At the moment, the only parser options are defining how columns are separated and whether or not there is a header row (the absence of a header is indicated by explicitly providing column names). We can capture most of the needed functionality by passing a separator character and a list of strings to the TH splice. This is a slight wart: any further parser options would extend the type of every parsing function, whereas using a record for options would let us add options without changing every type signature.
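For illustration only (the field names here are assumptions, not necessarily those of Frames.CSV.ParserOptions), such a record might look like:

import Data.Text (Text)

-- A record of parser options: adding a field later would not change the
-- type of any function that takes a ParserOptions value.
data ParserOptions = ParserOptions
  { columnSeparator :: Text          -- e.g. "," or "|"
  , headerOverride  :: Maybe [Text]  -- column names to use when there is no header row
  }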
Another drawback of passing parsing options as literals is that it exacerbates another problem: repeating the name of the file to be parsed. Specifically, we must provide the file name to the Template Haskell splice that produces all the relevant declarations, and again to the runtime code that reads the data file. A minor advantage of this duplication is that we can provide a clean model file for the type declarations and a lower-quality data file that we actually want to analyze. This offers a way to infer tighter types than the noisy data would allow, so that malformed records can more easily be discarded when they fail to parse at the more specific type.
To be concrete, if we do not use a record for parser options, we could always pass the unpacked parser options wherever they are needed.
tableTypesOpt '|' ["name", "age", "occupation"] "Users" "data/users.dat"
userData :: Producer Users IO ()
userData = readTableOpt '|' ["name", "age", "occupation"] "data/users.dat"
The duplication of the column names is atrocious. We could declare all Users-related types and values, and the definition of userData, at once to avoid repeating ourselves, but this seems like it might become an unwieldy splice. The best choice is for the splice to declare a value usersParser that readTableOpt could then use. This works out quite nicely.
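A sketch of how that could look, mirroring the example above (the generated usersParser value and the exact readTableOpt signature are assumptions based on the surrounding text, not a definitive API; extensions and imports are elided):

-- One splice declares the Users row type, the column types, and a value
-- usersParser capturing the parser options used at compile time.
tableTypes' (rowGen "data/users.dat")
              { rowTypeName = "Users"
              , columnNames = ["name", "age", "occupation"]
              , separator   = "|" }

-- Runtime code reuses usersParser, so neither the separator nor the
-- column names are repeated.
userData :: Producer Users IO ()
userData = readTableOpt usersParser "data/users.dat"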
To see exactly what a splice generates, we can dump it from GHCi:

> :set -XQuasiQuotes -XTemplateHaskell
> import Language.Haskell.TH
> putStrLn $(tableTypes "base" "base.csv" >>= stringE . show . ppr_list)
> :set -XOverloadedStrings -XQuasiQuotes -XTemplateHaskell
> import Data.Char
> import Data.List
> import Frames
> import Frames.CSV
> let stripModule = until (\w -> length w == 1 || not ("." `isInfixOf` w)) (tail . dropWhile isAlpha)
> let onWords f xs = takeWhile isSpace xs ++ unwords (map f (words xs))
> putStrLn . unlines . map (onWords stripModule) $ lines $(tableTypes' (rowGen "data/ml-100k/u.user") {rowTypeName = "User", columnNames = ["user id", "age", "gender", "occupation", "zip code"], separator = "|"} >>= stringE . show . ppr_list)
Dumping the definitions created by the TH splices results in a pretty unreadable mess. Here’s how to use the elisp functions below to clean things up:

- Evaluate the three elisp definitions below
- Hit C-c C-e to get ghc-mod to evaluate all splices
- Copy the contents of the *GHC Info* buffer to somewhere like your *scratch* buffer (because *GHC Info* is read-only)
- Run M-x pretty-splices in that buffer
;; plusp and decf below are provided by the cl library.
(require 'cl)

(defun replace-stringf (from to)
(beginning-of-buffer)
(while (search-forward from nil t)
(replace-match to nil t)))
(defun replace-regexpf (from to)
(beginning-of-buffer)
(while (re-search-forward from nil t)
(replace-match to nil nil)))
(defun pretty-splices ()
(interactive)
;; Fix newlines
(replace-stringf (rx (char ?\0)) "
")
;; Unqualify names
(replace-stringf "GHC.Types.:" "':")
(replace-stringf "Data.Text.Internal." "")
(replace-stringf "Data.Text." "T.")
(replace-stringf "GHC.Types.Int" "Int")
(replace-stringf "GHC.Base." "")
(replace-stringf "Frames.Col." "")
(replace-stringf "Data.Proxy." "")
(replace-stringf "Data.Vinyl.TypeLevel." "")
(replace-stringf "Data.Vinyl.Core." "")
(replace-stringf "Frames.Rec." "")
(replace-stringf "Data.Vinyl.Lens." "")
(replace-stringf "Frames.CSV.ParserOptions" "ParserOptions")
;; Erase inferrable type
(replace-regexpf "(Frames.TypeLevel.RIndex .*?)" "")
;; Make `:->' infix
(replace-regexpf (rx (sequence "(:->) \""
(group (0+ (not (in "\""))))
"\" "
(group (0+ (not (in " "))))))
"\"\\1\" :-> \\2")
;; Make `:' infix
(replace-regexpf (rx (sequence "((':) (" (group (0+ (not (in ")")))) ") '[])"))
"[\\1]")
(let ((x 10))
(while (plusp x)
(replace-regexpf (rx (sequence "((':) (" (group (0+ (not (in ")")))) ") ["
(group (0+ (not (in "]")))) "])"))
"[\\1, \\2]")
(decf x)))
;; Newline before top-level type signature
(replace-regexpf "^ [^ ]+ ::" "
\\&")
;; Newline before single-line type synonym definitions
(replace-regexpf "^ type [^ ]+ = [^ ]+.*$" "
\\&"))
The INLINE pragmas may be hurting compile times while not helping runtime performance. I’ll be looking at the benchdemo executable.
Code | Compile (s) | Run (s) |
---|---|---|
vinyl-0.13.3 with INLINE | 8.8 | 0.37 |
vinyl-0.13.3 no INLINE | 8.0 | 0.36 |
vinyl-0.14.0 no INLINE | 10.8 | 0.38 |