Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, Lafferty et al., 2001
Paper, Tags: #nlp, #architectures
Conditional random fields (CRFs) are a framework for building probabilistic models to segment and label sequence data. They offer several advantages over hidden Markov models (HMMs) and stochastic grammars for such tasks, including the ability to relax the strong independence assumptions made in those models.
HMMs and stochastic grammars are generative models: they assign a joint probability to paired observation and label sequences. To define a joint probability, a generative model needs to enumerate all possible observation sequences, which is impractical when the observations exhibit multiple interacting features or long-range dependencies.
On the other hand, a conditional model specifies the probabilities of possible label sequences given an observation sequence. It does not expend modeling effort on the observations themselves. The conditional probability of the label sequence can depend on arbitrary, non-independent features of the observation sequence without forcing the model to account for the dependencies among those features, whereas generative models must make very strict independence assumptions on the observations to remain tractable.
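To make this concrete, a conditional exponential model over label sequences can take the following globally normalized form (the notation here is assumed for illustration, not quoted from the paper):

```latex
% Conditional model over label sequences y given observations x.
% The feature functions f_k may jointly inspect arbitrary,
% overlapping parts of x:
p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_k \lambda_k f_k(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_k \lambda_k f_k(x, y') \Big)
```

Because the normalizer Z(x) sums over label sequences only, the model never has to explain how the observations were generated, so the features f_k can overlap freely.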
Maximum entropy Markov models (MEMMs) are conditional probabilistic sequence models that attain all of the above advantages. MEMMs have a weakness, however, known as the label bias problem: the transitions leaving a given state compete only against each other, rather than against all other transitions in the model.
Transition scores are the conditional probabilities of possible next states given the current state and the observation sequence. All the probability mass that arrives at a state must be distributed among the possible successor states. This causes a bias towards states with fewer outgoing transitions; in the extreme case, a state with a single outgoing transition effectively ignores the observation.
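A tiny numeric sketch of this effect (the two-branch structure is adapted from the paper's rib/rob illustration; the scores are invented): with per-state normalization, as in an MEMM, a state with a single successor turns any raw preference into probability 1, so the observation cannot influence the decision at that state.

```python
# States 1 and 2 spell out "rib"; states 4 and 5 spell out "rob".
# After reading 'r', the model sits in state 1 (rib branch) or state 4
# (rob branch); each has exactly ONE outgoing transition.

def normalize(scores):
    """Per-state normalization, as in an MEMM: raw scores over the
    successors of one state become a conditional distribution."""
    total = sum(scores.values())
    return {state: s / total for state, s in scores.items()}

# Invented raw scores: (current_state, observation) -> {next_state: score}.
raw_scores = {
    (1, "i"): {2: 9.0},  # state 1 strongly prefers observing 'i' ...
    (1, "o"): {2: 1.0},  # ... and dislikes 'o', but has only one successor
    (4, "i"): {5: 1.0},
    (4, "o"): {5: 9.0},
}

for (state, obs), scores in raw_scores.items():
    print(state, obs, normalize(scores))
# Every line prints a distribution of {...: 1.0}: local normalization
# erases the 9-vs-1 preference, because each state must pass on all of
# the mass it received. A globally normalized model would compare the
# raw scores 9.0 and 1.0 across branches, so the observation would
# still matter.
```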
CRFs have all the advantages of MEMMs and also solve the label bias problem. The critical difference is that the underlying graphical model of a CRF is undirected, while that of an MEMM is directed. An MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, while a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence.
A CRF can be seen as a finite state model with unnormalized transition probabilities. CRFs perform better than HMMs and MEMMs when the true data distribution has higher-order dependencies than the model assumes.
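A minimal sketch of this finite-state view (labels, scores, and the observation encoding below are illustrative assumptions, not the paper's): transition scores stay unnormalized, and the forward algorithm computes one global normalizer Z(x) over all label paths.

```python
import math
from itertools import product

LABELS = ["A", "B"]
START = "<s>"

def trans_score(prev, cur, obs):
    """Unnormalized log-score of a single transition. In a real CRF
    this would be a weighted sum of feature functions; here it is a
    made-up table plus a bonus when the label matches the observation."""
    base = {("<s>", "A"): 0.0, ("<s>", "B"): 0.0,
            ("A", "A"): 0.5, ("A", "B"): 1.0,
            ("B", "A"): 1.5, ("B", "B"): 0.2}[(prev, cur)]
    return base + (2.0 if cur == obs else 0.0)

def log_Z(observations):
    """Forward algorithm: accumulates exp(path score) over ALL label
    paths, yielding the single global normalizer Z(x)."""
    alpha = {START: 0.0}
    for obs in observations:
        alpha = {cur: math.log(sum(math.exp(lp + trans_score(prev, cur, obs))
                                   for prev, lp in alpha.items()))
                 for cur in LABELS}
    return math.log(sum(math.exp(lp) for lp in alpha.values()))

def log_prob(labels, observations):
    """log p(y | x) = path score - log Z(x): one normalization for the
    whole sequence, so no state has to redistribute incoming mass."""
    path = sum(trans_score(prev, cur, obs)
               for prev, cur, obs in zip((START,) + tuple(labels), labels, observations))
    return path - log_Z(observations)

obs = ["A", "B", "B"]
# Sanity check: probabilities over all 2^3 label paths sum to 1.
total = sum(math.exp(log_prob(y, obs)) for y in product(LABELS, repeat=3))
print(round(total, 6))  # -> 1.0
```

Because normalization happens once per sequence, a transition can carry a large score without its source state being forced to pass on exactly the mass it received, which is what removes the label bias.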
The total score mass arriving at a state, which represents the strength of a partial labeling hypothesis, must be distributed among its outgoing transitions. The state must pass on exactly the same total mass it received. When a state has a restricted number of outgoing transitions, this can produce unintuitive and undesirable outcomes.
- Solution 1: change the state-transition structure of the model. This is a special case of determinization, but it is not always possible, and even when possible it may lead to a combinatorial explosion.
- Solution 2: start with a fully connected model and let the training procedure figure out a good structure.
Proper solutions require models that account for whole state sequences at once, by letting some transitions "vote" more strongly than others depending on the corresponding observations.
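Written side by side (notation assumed), the difference is where the normalization sits:

```latex
% MEMM: one normalizer per state and position (local); each state must
% pass on all of the mass it received:
p_{\mathrm{MEMM}}(y \mid x) = \prod_{t=1}^{n}
  \frac{\exp\big(s(y_{t-1}, y_t, x, t)\big)}
       {\sum_{y'} \exp\big(s(y_{t-1}, y', x, t)\big)}

% CRF: a single global normalizer over whole label sequences, so
% transitions can "vote" with different strengths:
p_{\mathrm{CRF}}(y \mid x) =
  \frac{\exp\big(\sum_{t=1}^{n} s(y_{t-1}, y_t, x, t)\big)}{Z(x)},
\qquad
Z(x) = \sum_{y'} \exp\Big(\sum_{t=1}^{n} s(y'_{t-1}, y'_t, x, t)\Big)
```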