intentmedia/pig-annotations

pig-annotations

pig-annotations is a Java class library that makes it easy to load your custom-serialized Java objects into Pig as proper Pig tuples with a well-defined schema.

Should I use pig-annotations?

pig-annotations has a rather narrow scope. You should probably only use pig-annotations if all of the following are true:

  • You use Java. pig-annotations is a Java library.
  • You use Pig.
  • You already have a custom means of serializing your Java objects in a line-based text format (such as JSON).

How do I use pig-annotations?

Using pig-annotations is straightforward.

  • You will need to provide an implementation of RecordInflater that converts a line of text into your Java object. This implementation must have a public no-arg constructor.
  • You will need to annotate your class to specify how its fields are converted into values within a tuple.
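
The no-arg constructor requirement presumably exists because the loader has to create your inflater from its class alone. The sketch below illustrates that idea with plain reflection. It is not the library's implementation: LineInflater, TrimmingInflater, and InflaterFactory are hypothetical names, and the stand-in interface takes a String rather than Hadoop's Text so the example has no dependencies.

```java
import java.lang.reflect.InvocationTargetException;

// Hypothetical stand-in for RecordInflater; takes a String rather than Hadoop's Text.
interface LineInflater<T> {
    T convert(String line);
}

// Example implementation with the required public no-arg constructor.
class TrimmingInflater implements LineInflater<String> {
    public TrimmingInflater() {
    }

    @Override
    public String convert(String line) {
        return line.trim();
    }
}

class InflaterFactory {
    // Creates an inflater from its Class token alone, which only works
    // when a public no-arg constructor exists.
    static <T> LineInflater<T> instantiate(Class<? extends LineInflater<T>> type) {
        try {
            return type.getConstructor().newInstance();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(
                    type.getName() + " needs a public no-arg constructor", e);
        } catch (InstantiationException | IllegalAccessException | InvocationTargetException e) {
            throw new IllegalStateException("Could not instantiate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        LineInflater<String> inflater = instantiate(TrimmingInflater.class);
        System.out.println(inflater.convert("  male,25  ")); // prints "male,25"
    }
}
```

If the inflater only offered a constructor with arguments, getConstructor() would fail with NoSuchMethodException, which is why a framework that instantiates classes this way asks for a no-arg constructor.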

Example

It is probably easiest to demonstrate via an example.

Let's say you have a Person class that has two fields, age and gender.

package com.intentmedia.examples;

public class Person {
    private String gender;
    private Integer age;

    // getters, setters, etc.
}

You need to tell pig-annotations how to transform each field.

package com.intentmedia.examples;

import com.intentmedia.pig.PigField;

import static org.apache.pig.data.DataType.CHARARRAY;
import static org.apache.pig.data.DataType.INTEGER;

public class Person {

    @PigField(name = "gender", type = CHARARRAY)
    private String gender;

    @PigField(name = "age", type = INTEGER)
    private Integer age;

    // getters, setters, etc.
}

For each field, you supply a name and the Pig data type to map it to.
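
To make the idea concrete, here is a minimal sketch of how field annotations like these can be read reflectively and turned into a schema description. It is an illustration under stated assumptions, not how pig-annotations is implemented: SketchField, SketchPerson, and SchemaSketch are hypothetical names, and the stand-in annotation stores the type as a plain string where the real @PigField takes a Pig DataType constant.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for @PigField. RUNTIME retention is what makes
// the annotation visible to reflection.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface SketchField {
    String name();

    String type();
}

class SketchPerson {
    @SketchField(name = "gender", type = "chararray")
    private String gender;

    @SketchField(name = "age", type = "int")
    private Integer age;
}

class SchemaSketch {
    // Collects "name:type" for every annotated field and wraps the result
    // in "tuple(...)". Fields without the annotation are skipped.
    static String describe(Class<?> type) {
        List<String> columns = new ArrayList<>();
        // Note: the JDK does not guarantee the order of getDeclaredFields(),
        // although common JVMs return fields in declaration order.
        for (Field field : type.getDeclaredFields()) {
            SketchField annotation = field.getAnnotation(SketchField.class);
            if (annotation != null) {
                columns.add(annotation.name() + ":" + annotation.type());
            }
        }
        return "tuple(" + String.join(",", columns) + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(SketchPerson.class));
    }
}
```

Because each column comes straight from an annotation's name attribute, giving two fields the same name would produce two columns with the same name in the resulting description.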

Next, you need to tell pig-annotations how to load your object from a line of text before it can turn it into a Pig tuple. If your objects are stored as CSV like this:

male,25
female,26

Then you need to implement RecordInflater<Person>.

package com.intentmedia.examples;

import com.intentmedia.convert.RecordInflater;
import org.apache.hadoop.io.Text;
import org.jetbrains.annotations.NotNull;

public class PersonFromCsvInflater implements RecordInflater<Person> {
    @NotNull
    @Override
    public Person convert(@NotNull Text value) throws IllegalArgumentException {
        // Each line looks like "male,25": gender first, then age.
        String[] genderAndAge = value.toString().split(",");
        if (genderAndAge.length != 2) {
            throw new IllegalArgumentException("Expected 'gender,age' but got: " + value);
        }

        Person person = new Person();
        person.setGender(genderAndAge[0].trim());
        // Integer.parseInt throws NumberFormatException, a subclass of IllegalArgumentException.
        person.setAge(Integer.parseInt(genderAndAge[1].trim()));

        return person;
    }
}

Finally, add the @PigLoadable annotation to your Person class so that pig-annotations knows which inflater to use.

package com.intentmedia.examples;

import com.intentmedia.pig.PigField;
import com.intentmedia.pig.PigLoadable; // assumed to live in the same package as PigField

import static org.apache.pig.data.DataType.CHARARRAY;
import static org.apache.pig.data.DataType.INTEGER;

@PigLoadable(recordInflater = PersonFromCsvInflater.class)
public class Person {

    @PigField(name = "gender", type = CHARARRAY)
    private String gender;

    @PigField(name = "age", type = INTEGER)
    private Integer age;

    // getters, setters, etc.
}

Now, to load your objects in Pig, register the jars and use the loader like this:

REGISTER 'location/to/pig-annotations.jar';
REGISTER 'your/jar/with/other/classes.jar';

people = LOAD 'your/input/files/*.csv'
  USING com.intentmedia.pig.AnnotatedObjectLoader('com.intentmedia.examples.Person');

The people alias will then have the Pig schema tuple(gender:chararray,age:int).

But wait, there's more

pig-annotations also supports the following features:

  • Custom converters for fields that can't be autoboxed into Pig types.
  • Mapping Booleans to Integers (because older Pig versions have no boolean type).
  • Unwrapping fields annotated with @Embedded.
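
As an illustration of the Boolean mapping, a conversion along these lines would do the job. This is only a sketch: the README does not say which integers the library uses, so the 1/0 encoding, the null handling, and the name BooleanMappingSketch are all assumptions.

```java
class BooleanMappingSketch {
    // Maps a boxed Boolean to an Integer. Null stays null so that a missing
    // value is not silently turned into false.
    // The 1/0 encoding is an assumption, not taken from pig-annotations.
    static Integer toPigInteger(Boolean value) {
        if (value == null) {
            return null;
        }
        return value ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(toPigInteger(Boolean.TRUE));  // prints 1
        System.out.println(toPigInteger(Boolean.FALSE)); // prints 0
        System.out.println(toPigInteger(null));          // prints null
    }
}
```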