Can generate and export relational data for demo purposes. Useful for filling, for instance, a database with dummy data the looks real enough to use for demonstrations. See OOAPI-demodata for an example of a web service which is exposes demo data generated using this toolkit.
- Get leiningen
- Clone this repository
- Install locally as a jar:
lein install
- Create standalone jar for command line:
lein uberjar
[[https://clojars.org/nl.surf/demo-data][]]
The DemoData toolkit requires two descriptors to generate data:
- schema describing all the types, references between types, the
type attributes, their dependencies, constraints and generators
Basic shape:
{ "types": [ { "name": "cat", "refs": { ... }, "attributes": { ... }, }, { "name": "person", ... } ] }
For an example see also doc/cats-schema.json.
- population map providing the number of instances to generate per type
For an example see doc/cats-population.json.
{ "cat": 10, "person": 30, ... }
First some terminology to avoid confusion: a type is a collection of attributes and refs. The types will be instantiated into entities having properties derived from the attributes and refs of that type.
The data generation process is attribute driven and goes through 3 phases:
- Empty entities are created according to the distribution defined by the population descriptor
- Refs are transformed into attributes, so a ref becomes a property of an entity
- Attributes are sorted in dependency order, so attributes with no deps are first. Then, for every attribute in order, the corresponding properties are generated entities.
An attribute is described by the following properties, all optional:
- value
Specifying a value will cause the property of the generated entities to always have the given value.
- generator
A named generator as string or array. In case of an array the first element of the array is the name of the generator, the rest of the elements are arguments to the generator;
["int", 1, 10]
will pick a number between 1 and 10. - constraints
A list of constraints. Currently only
unique
is available and will cause retries when other entities in the set have already take the generated value. - deps
A list of attribute this property depends on. These are full qualified names like
person/name
and may be paths into related (see Refs) using a nested list like["person/cat", "cat/name"]
.If a generator is specified the associated depend values will be added to the argument list to the generator. When no generator is specified the values will be passed as is to a property when instantiating entities.
- hidden
Properties generated from an attribute marked as hidden will be discarded in the final output. This defaults to
false
.
Refs are just special attributes to define relationships between types. They are hidden by default, cause they have a shape which is almost always considered “internal” to this library and another attribute is most likely used which depends on the ref to give it a proper format.
This toolkit supports the following relations through configuration:
- many-to-one
From the person
"refs"
in the cats example:{ "types": [ ... { "name": "person", ... "refs": { ... "cat": { "deps": ["cat/name"] }, ...
Here a person is associated with a random cat using the cat’s name as a key. This will create a (hidden by default) foreign key property named
"cat"
for a person which can be used to make a SQL-like join. To get from a person to the cat’s favorite, add an attribute with a dep like["person/cat", "cat/favorite"]
.From the person
"attributes"
in the cats example:{ "types": [ ... { "name": "person", ... "attributes": { ... "dilemma": { "deps": ["person/name", ["person/cat", "cat/name"], ["person/cat", "cat/favorite"]], "generator": ["format", "%s loves %s but %2$s loves %s"] }, ...
- one-to-one
Works similar to many-to-one, with a flag to specify that selected values must be unique.
{ "types": [ ... { "name": "person", ... "refs": { ... "cat": { "deps": ["cat/name"], "unique": true }, ...
This will result in a one-to-one relation provided both persons and cats have the same population count.
- many-to-many (Warning: needs work)
We’ll use a linking table which has an association with both side. Similar to the the fed-by
"refs"
in the cats example:{ "types": [ ... { "name": "person", ... "refs": { ... "pair": { "deps": ["cat/name", "person/name"], "attributes": ["cat", "person"] }, ...
This ref yields two attributes as named above associated to the given types respectively with the given keys. The
"unique"
assignment ensures unique pairs are selected to prevent getting multiple equal relations.In the above case the distribution of choices is random. To steer the picking of pairs to select as many different of one side as possible, it’s possible to provide a list of booleans to the unique assignment. Given the above case, having
"unique": [true, false]
will cause as many cats to be included as possible, the selection of persons is still random. - graphs and trees
References to the same entity type describe graphs. Some options are allowed to specify the kind of graph that can be generated.
{ "types": [ ... { "name": "person", ... "refs": { ... "parent": { "deps": ["person/father"], "graph": "tree" }, ...
When
graph
is"tree"
, the generated graph is a directed acyclic graph with a single root note that has a nil reference - in this case:{"person/father: ["person/name" nil]}
When a
nilabe
option is provided, this indicates the chance (between 0 and 1) that any generated ref is a nil reference - a new root. Combined with thegraph: "tree"
option above this implies a forest of independent trees. - one-to-many
If a reference is to an entity, the values can be selected via a match on the referenced attribute:
{ "types": [ { ... "name": "person", ... "attributes": { ... "fed-by": { "deps": [[["fed-by/cat", "cat/name"], "fed-by/person", "person/name"]] }, ...
The cat/fed-by
property will get as a value the list of zero or
more person/name
values. The same technique can be used to find
matching many-to-one or many-to-many refs.
Here’s a list of the currently implemented generators:
- uuid
Generates a Universally Unique Identifier.
- string
Generates a random string. Useful of creating test cases, not so much for demo data.
- int (takes 2 arguments or none)
Generate an integer between ~MIN_VALUE~ and ~MAX_VALUE~ or between the two given values (inclusive).
- int-cubic (takes 2 arguments or none)
Generate a integer between the two arguments with a cubic biased toward the high value.
- int-log (takes 2 arguments or none)
Generate a integer between the two arguments with a logarithmic biased toward the high value.
- increasing-int (takes arguments or none)
Generate a monotonically increasing integer starting at the given argument. Generated entities are grouped by their deps.
- bigdec-cubic (takes 2 arguments)
Generate a bigdecimal between the two arguments with a cubic biased toward the high value.
- char (takes 2 arguments or none)
Generate a random printable character or between the two given values (inclusive).
- one-of (takes 1 argument)
Take a random element from the given list.
- one-of-resource-lines (takes 1 argument)
Take a random line from the given file or resource.
- one-of-keyed-resource (takes 2 arguments)
Take a random line from a keys value of the given YAML file or resource. The first argument is the file and the second the key.
- weighted (takes 1 argument)
Take a value from a weighted object. For instance: with
{"cat" 2, "ferret" 1}
there’s a 2 in 3 chance"cat"
will be picked. - text-from-resource (takes 1 or 2 arguments)
Generate 3 lines of text from given resource by using markov probability chains. The number of lines can be specified by a second argument.
- lorum-ipsum (takes 1 argument or none)
Generate 3 “lorum ipsum” lines of fake Latin text. An optional argument specifies how many lines to generate.
- date (takes 2 arguments)
Pick a date between the given arguments formatted
1970-01-31
. - timestamp (takes 2 arguments)
Pick a timestamp between the given arguments formatted
1970-01-31T23:59:59+01:00
.
And some generators which will transform their arguments in some way or other:
- join (takes any number of arguments)
Concatenate all arguments to a string separated by spaces. Empty values will be omitted.
- format (takes at least 1 argument)
Uses printf-like format as first argument to render the rest of the arguments. See syntax for details.
- object (takes an even amount of arguments)
Construct an object by splitting the list of arguments and zipping them together. For instance:
["name", "spouse", "Fred", "Wilma"]
will become{"name": "Fred", "spouse": "Wilma"}
. - inc (takes 1 or 2 arguments)
Increments the given argument with one. One extra will be added on a retry attempt when trying to comply to a constraint.
- dec (takes 1 or 2 arguments)
Same as inc but decrement instead of increment.
- first-weekday-of (takes 3 arguments)
Determines the first given weekday of month in year. For instance
"monday", "january", 2020
. - last-day-of (takes 2 arguments)
Determines the last day of the given month in year. For instance
"january", 2020
. - abbreviate (takes 1 argument)
Make an abbreviation of group of words. So
"Fred loves Wilma"
becomes"BlW"
. When retrying this the number of retries will be appended.
See Writing generators to write your own.
Use this toolkit from the command line as follows:
java -jar target/demo-data-standalone.jar generate \
doc/cats-schema.json doc/cats-population.json \
generated/cats.json
Please note: running generate standalone will allow loading resources from the current working directory and up.
This toolkit can be used as library in your (Clojure). This will allow you
to generate data in memory and, more importantly, create your own
generators from scratch. To include this toolkit in you project, add
[nl.surf/demo-data "0.1.0-SNAPSHOT"]
as a dependency to your
project.clj
.
The most important namespaces are:
- nl.surf.demo-data.config
Functions here compile your configuration to a list of “executable” attributes. See the ~load~ function.
- nl.surf.demo-data.world
Functions here instantiate and populate your demo data by executing the attributes as defined by
config/load
. See the ~gen~ function.
Minimal example:
(-> {:types [{:name "person"
:attributes {:name {:generator ["one-of" ["Fred"
"Wilma"
"Barney"
"Betty"]]
:constraints ["unique"]}}}]}
(config/load)
(world/gen {:person 2}))
Possible result:
{:person [#:person{:name "Wilma"} #:person{:name "Barney"}]}
Play with the cats example:
(-> "doc/cats-schema.json"
(slurp)
(json/parse-string keyword)
(config/load)
(world/gen (-> "doc/cats-population.json"
(slurp)
(json/parse-string keyword))))
Generators are defined by the generator
multi-method in the
nl.surf.demo-data.config
namespace. An implementation of a generator
should return a function which takes a least one argument; state
. More
arguments are allowed and can be passed as described in Attributes.
The state
arguments contains 4 keys:
entity
The current entity populated so far.
world
A map of all the entities populated so far by type.
attr
The internal “executable” representation of the attribute.
dep-vals
(internal)The internal list of deferred deps values. These are also passed on the argument list.
Here a very basic example generator which sticks a exclamation mark after
a string, exclaim!
:
(defmethod config/generator "exclaim!" [_]
(fn [world value]
(str value "!")))
To handle retries for a constraint consider the world/*retry-attempt-nr*
binding:
(defmethod config/generator "exclaim!" [_]
(fn [_ value]
(apply str value (repeat (inc world/*retry-attempt-nr*) "!"))))
Please note: all provided generator which require randomness use clojure.data.generators. If you are generating something random and need a reproducible result, consider using the primitives in this library and use the *rnd* binding as a seeding mechanism.
Constraints are defined by the constraint
multi-method in the
nl.surf.demo-data.config
namespace. An implementation of a constraint
should return a function which takes two arguments: state
and the value
being considered. State is the same argument as provided to generator
functions, see also Writing generators.
Here a constraint to require an integer is even:
(defmethod config/constraint "even" [_]
(fn [_ value]
(even? value)))
When a constraint is not met during generation it is retried up to a 1000 time (configurable with binding ~*retries*~).
Create the schema and population descriptors from a JSON swagger defining from the command line as follows:
java -jar target/demo-data-standalone.jar bootstrap \
doc/ooapo-swagger.json \
generated/ooapi-schema.json generated/ooapi-population.json
This will create two files, demodata-schema.json
and
demodata-population.json
, for you to edit and start generating
demo data for your project.
You can generate data using:
java -jar target/demo-data-standalone.jar generate \
generated/ooapi-schema.json generated/ooapi-population.json
Copyright (C) 2020 SURFnet B.V.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.