pig-annotations is a class library that makes it easy to load your custom-serialized Java objects into Pig as proper Pig tuples with a well-defined schema.
pig-annotations has a rather narrow scope. You should probably use it only if all of the following are true:
- You use Java. pig-annotations is a Java library.
- You use Pig.
- You already have a custom means of serializing your Java objects in a line-based text format (such as JSON or CSV).
Using pig-annotations is straightforward:
- Provide an implementation of RecordInflater that converts a line of text into your Java object. This implementation must have a public no-argument constructor.
- Annotate your class to specify how its fields are converted into values within a tuple.
It is probably easiest to demonstrate with an example.
Let's say you have a Person class with two fields, gender and age.
package com.intentmedia.examples;

public class Person {
    private String gender;
    private Integer age;
    // getters, setters, etc.
}
You need to tell pig-annotations how to transform each field.
package com.intentmedia.examples;

import com.intentmedia.pig.PigField;
import static org.apache.pig.data.DataType.CHARARRAY;
import static org.apache.pig.data.DataType.INTEGER;

public class Person {
    @PigField(name = "gender", type = CHARARRAY)
    private String gender;
    @PigField(name = "age", type = INTEGER)
    private Integer age;
    // getters, setters, etc.
}
For each field, you supply a name and the Pig data type to map it to.
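To picture what an annotation-driven mapping like this involves, here is a minimal sketch using plain Java reflection. It is not the library's implementation: SketchField is a hypothetical stand-in for @PigField, its type is a plain string rather than a DataType constant, and schemaOf and valuesOf are illustrative helper names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

class AnnotationMappingSketch {

    // Hypothetical stand-in for @PigField; this is NOT the library's annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface SketchField {
        String name();
        String type(); // plain strings here; the real annotation takes DataType constants
    }

    static class Person {
        @SketchField(name = "gender", type = "chararray")
        private final String gender;
        @SketchField(name = "age", type = "int")
        private final Integer age;

        Person(String gender, Integer age) {
            this.gender = gender;
            this.age = age;
        }
    }

    // Builds a "name:type" schema description from the annotated fields.
    // Note: the order returned by getDeclaredFields is not guaranteed by the Java specification.
    static String schemaOf(Class<?> type) {
        List<String> parts = new ArrayList<>();
        for (Field field : type.getDeclaredFields()) {
            SketchField annotation = field.getAnnotation(SketchField.class);
            if (annotation != null) {
                parts.add(annotation.name() + ":" + annotation.type());
            }
        }
        return String.join(",", parts);
    }

    // Collects the values of the annotated fields, as a tuple-like list.
    static List<Object> valuesOf(Object record) {
        List<Object> values = new ArrayList<>();
        for (Field field : record.getClass().getDeclaredFields()) {
            if (field.isAnnotationPresent(SketchField.class)) {
                field.setAccessible(true); // the fields are private
                try {
                    values.add(field.get(record));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("Cannot read field " + field.getName(), e);
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Person person = new Person("male", 25);
        System.out.println(schemaOf(Person.class));
        System.out.println(valuesOf(person));
    }
}
```

The annotation needs RUNTIME retention to be visible through reflection; with the default retention, getAnnotation would return null.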
Next, you need to tell pig-annotations how to load your object before it can turn it into a Pig tuple. If your objects are stored as CSV like this:
male,25
female,26
Then you need to implement RecordInflater<Person>.
package com.intentmedia.examples;

import com.intentmedia.convert.RecordInflater;
import org.apache.hadoop.io.Text;
import org.jetbrains.annotations.NotNull;

public class PersonFromCsvInflater implements RecordInflater<Person> {
    @NotNull
    @Override
    public Person convert(@NotNull Text value) throws IllegalArgumentException {
        // Each input line is expected to look like "gender,age".
        String[] genderAndAge = value.toString().split(",");
        if (genderAndAge.length != 2) {
            throw new IllegalArgumentException("Expected 'gender,age' but got: " + value);
        }
        Person person = new Person();
        person.setGender(genderAndAge[0]);
        // A non-numeric age raises NumberFormatException, a subclass of IllegalArgumentException.
        person.setAge(Integer.parseInt(genderAndAge[1]));
        return person;
    }
}
Finally, add one more annotation, @PigLoadable, to your Person class so that pig-annotations knows which inflater to use.
package com.intentmedia.examples;

import com.intentmedia.pig.PigField;
import com.intentmedia.pig.PigLoadable; // assumed to live in the same package as PigField
import static org.apache.pig.data.DataType.CHARARRAY;
import static org.apache.pig.data.DataType.INTEGER;

@PigLoadable(recordInflater = PersonFromCsvInflater.class)
public class Person {
    @PigField(name = "gender", type = CHARARRAY)
    private String gender;
    @PigField(name = "age", type = INTEGER)
    private Integer age;
    // getters, setters, etc.
}
Now, to load your objects in Pig, register the jars and use the load function like this:
REGISTER 'location/to/pig-annotations.jar';
REGISTER 'your/jar/with/other/classes.jar';
people = LOAD 'your/input/files/*.csv'
    USING com.intentmedia.pig.AnnotatedObjectLoader('com.intentmedia.examples.Person');
The people alias will then have the Pig schema tuple(gender:chararray,age:int).
pig-annotations also supports the following features:
- Custom converters for fields that can't be autoboxed into pig types.
- Mapping Booleans to Integers (because Pig doesn't have booleans yet).
- Unwrapping fields annotated with @Embedded.
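As an illustration of the Boolean-to-Integer idea from the list above, the following is a minimal sketch in plain Java. The method name toPigInteger is hypothetical, and the choice of 1 for true, 0 for false, and null for a missing value is an assumption; the values the library actually uses are not stated here.

```java
class BooleanMappingSketch {

    // Assumption: true maps to 1 and false maps to 0.
    // A null Boolean stays null so that missing values remain missing.
    static Integer toPigInteger(Boolean value) {
        if (value == null) {
            return null;
        }
        return value ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(toPigInteger(Boolean.TRUE));
        System.out.println(toPigInteger(Boolean.FALSE));
        System.out.println(toPigInteger(null));
    }
}
```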