#Easy DB
Simple database management interface targeted at scientific data storage.
This project was born out of an effort to generalize software developed to manage the database for Karen Strier's research on the critically endangered Northern Muriqui Monkey. Information about Karen's work can be found (here)[https://strierlab.anthropology.wisc.edu].
Easy DB allows for the declaration of simple validators and mutators that can be linked together to define a data processing pipeline. One use case for the software is processing semi-structured data like csv files or JSON for insertion into a MySQL database.
Beyond what you would expect from a basic input validation system, EasyDB allows for handling several features found in real world datasets that have been managed outside of an RDBMS.
In datasets managed in a visual tool like excel, shortcuts to avoid tedious data entry can be taken. One case supported by Easy DB is omitting values that appear repeatedly. Consider an excel document where each record has a date column and the data is ordered by date. To avoid effort duplicating the data explicitly, as well as reduce clutter, date values can be left blank with the assumption that implicitly inherited from the rows above. Obviously, this design is not conducive for direct insertion into a traditional database. A lookup of any particular record should give you all of the information needed, there should not be a need to look at records nearby to fill in missing data. Easy DB enables processing of datasets like this by allowing the data processing pipeline to be stateful and include information from previous records when processing later ones.
Another feature of the system is the ability to change data shape. This includes splitting or combining columns, such as a date and time column you want to store together in a single timestamp. Arbitrary length lists can also be split and directed to a child table in your database with a link to a parent record.
These features can be combined together to form more complex transformations and support rich data models. Below is an example from the unit tests. The API could use a little TLC, which I hope to work on soon. Even with this in mind, the system can greatly simplify a number of data manipulation tasks and has clear extension points to allow application specific transformations and validations.
// Here a series of alternating date and time columns are parsed and validated, then combined together
// into a string format suitable for insertion into a datetime column in mysql.
function test_repeated_combiner() {
$db = 1;
$val_processors = array();
$processor_config = $this->default_processor_config;
$processor_config['modifiers'] = array(new Date_Validator_Formatter());
$val_processors[] = new Value_Processor($db, $this->user_config, $processor_config);
$processor_config = $this->default_processor_config;
$processor_config['modifiers'] = array(new Time_Validator_Formatter());
$val_processors[] = new Value_Processor($db, $this->user_config, $processor_config);
$data_output = new Column_Combiner_Output($val_processors, "date_time", FALSE);
$data_output = new Repeated_Column_Output( array( $data_output), 2, 'foreign_key_column', 'table', FALSE, FALSE);
$record_processor = new Record_Processor(array('user_config' => $this->user_config,
'data_outputs' => array($data_output), 'output_table' => 'unused','primary_key_column' => 'unused'));
$record_processor->process_row(array("22/3/2012", "1:20", "12/5/2012", "2:30"));
$this->assertEquals(array("2012-3-22 1:20", "2012-5-12 2:30"), $record_processor->output_to_array());
}
// The system also supports defining constraints that apply across several columns
// here a date range validator is used to ensure that the end date comes after the start date.
// This validation is applied after one of the columns has been generated by a combiner
// over individual date and time columns into a timestamp value
function test_combiner_before_cross_column_validation() {
$db = 1;
$processor_config = $this->default_processor_config;
$processor_config['modifiers'] = array();
$data_output = $this->date_time_combiner();
$data_output2 = new Single_Column_Output(new Value_Processor($db, $this->user_config,
array( 'column' => 'test', 'modifiers' => array(new Date_Validator_Formatter()))), 'second_date', FALSE);
$data_output_names = array(
'date_time' => 0,
'second_date' => 1,
);
$record_processor = new Record_Processor(array('user_config' => $this->user_config,
'output_table' => 'unused',
'primary_key_column' => 'unused',
'data_output_names' => $data_output_names,
'data_outputs' => array($data_output, $data_output2),
'cross_column_validators' => array(new Date_Range_Validator(array('start_date' => 'date_time', 'end_date' => 'second_date')))
)
);
try {
$record_processor->process_row(array("22/3/2012", "not a time", "22/3/2013"));
} catch (Exception $ex) {
$this->assertEquals(array("22/3/2012", "not a time*", "22/3/2013"), $record_processor->get_last_input_row());
$this->assertEquals(Time_Validator_Formatter::INVALID_FORMAT_MSG, $ex->getMessage());
}
try {
$record_processor->process_row(array("22/3/2014", "2:30", "22/3/2013"));
} catch (Exception $ex) {
$this->assertEquals("End date does not occur after start date.", $ex->getMessage());
}
}
This software is released under the GNU Affero General Public License. A copy of the complete license is available in th LICENSE.txt file.
Copyright (C) 2014 Jason Altekruse
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.