chosen in any state. If you have not reached the end
of the sequence, go to step 2.
4. Sum the probabilities for each hypothesis over the
states and save the one with the highest total probability.
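As an illustration of the bookkeeping in these steps, a 1-best style decoder could be sketched as below. This is a simplified sketch and not the HMMgene implementation; the function name, the dense probability arrays, and the absence of pruning and log-space arithmetic are all simplifications for exposition.

```python
import numpy as np

def one_best_labeling(obs, init, trans, emit, labels):
    """Minimal sketch of a 1-best style labeling decoder (illustrative only).

    obs    : sequence of observation indices
    init   : initial state probabilities, shape (S,)
    trans  : trans[i, j] = P(state j | state i), shape (S, S)
    emit   : emit[j, x] = P(symbol x | state j), shape (S, A)
    labels : labels[j] is the class label of state j, e.g. '0', 'C' or 'I'
    """
    n_states = len(init)
    # One active hypothesis (partial labeling) per state; its probability is
    # summed over all state paths consistent with that labeling.
    hyp = [labels[j] for j in range(n_states)]
    prob = np.array([init[j] * emit[j, obs[0]] for j in range(n_states)])

    for x in obs[1:]:
        new_hyp, new_prob = [""] * n_states, np.zeros(n_states)
        for j in range(n_states):
            # Sum the probability of each candidate extended labeling ending in j
            candidates = {}
            for i in range(n_states):
                cand = hyp[i] + labels[j]
                candidates[cand] = candidates.get(cand, 0.0) + prob[i] * trans[i, j]
            # Keep only the most probable labeling hypothesis for this state
            best = max(candidates, key=candidates.get)
            new_hyp[j], new_prob[j] = best, candidates[best] * emit[j, x]
        hyp, prob = new_hyp, new_prob

    # Step 4: sum the probabilities of identical hypotheses over the states
    # and return the labeling with the highest total.
    totals = {}
    for h, p in zip(hyp, prob):
        totals[h] = totals.get(h, 0.0) + p
    return max(totals, key=totals.get)
```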
If a model has a one-to-one correspondence between
a path and a labeling, the result of 1-best would be
identical to the result of the Viterbi algorithm. Gener-
ally, the 1-best algorithm will always produce a label-
ing with a probability at least as high as the result of
the Viterbi algorithm, so in this sense it can never be
worse.
The main drawback of 1-best is that the computa-
tion time is significantly longer than for Viterbi. However,
the number of active hypotheses can be reduced in var-
ious ways by discarding ones with a probability less
than a cut-off. An advantage of the algorithm is that
the memory requirement is only weakly dependent on
sequence length; it depends on the number of states in
the model times the number of exons in the hypothe-
ses. This is because there is no backtracking as in the
Viterbi algorithm, and the hypotheses use very little
memory, since only the positions of the borders between
labels have to be stored. The Viterbi algorithm requires
memory proportional to the sequence length times the
number of states, although there are more memory-efficient
implementations, which then take more time.
The low memory requirement is particularly nice for
parsing very long genomic sequence, where the simple
Viterbi algorithm requires megabytes of memory.
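For illustration, storing only the borders between labels amounts to keeping the label runs rather than one label per position. The labeling string below is purely an example, not taken from the data:

```python
# A per-position labeling stored as the borders between label runs.
labeling = "000CCCCIIICCC00"
runs, start = [], 0
for pos in range(1, len(labeling) + 1):
    if pos == len(labeling) or labeling[pos] != labeling[start]:
        runs.append((labeling[start], start, pos))  # (label, begin, end)
        start = pos
print(runs)  # [('0', 0, 3), ('C', 3, 7), ('I', 7, 10), ('C', 10, 13), ('0', 13, 15)]
```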
The 1-best algorithm will fail to find the correct re-
sult if the best hypothesis does not have the highest
probability in any state of the model for a given po-
sition in the sequence. This seems to be a rare phe-
nomenon in HMMgene. In the algorithm I tried to
keep also the partial hypothesis with the largest prob-
ability summed over all states, but this did not change
the result in a small number of test experiments.
Although I call the algorithm 1-best, it is not identi-
cal to the one described in (Schwartz & Chow 1990). In
speech recognition, one is interested only in the short
sequence of phonemes represented by a speech signal.
For gene parsing it would correspond to collapsing the
labeling of the sequence in Figure 1 to ’intergenic, cod-
ing, intron, coding, intergenic’ or ’0CIC0’. Therefore,
in the original n-best algorithm the hypotheses are of
this form. The number of hypotheses for each state
can be greater than one, and this can be used when
searching for suboptimal predictions. This will be the
subject of future study.
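As a concrete illustration of that collapsed form, the following small helper shows the reduction from a per-position labeling to the condensed labeling; the input string is just an example, not Figure 1 itself:

```python
from itertools import groupby

def collapse(labeling):
    """Collapse a per-position labeling to the condensed form used in n-best,
    e.g. '000CCCCIIICCC00' -> '0CIC0' (illustrative helper)."""
    return "".join(label for label, _ in groupby(labeling))

print(collapse("000CCCCIIICCC00"))  # prints 0CIC0
```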
HMMgene
An HMM gene finder called HMMgene is currently
being developed. It consists of states modeling coding
regions including a crude length modeling, states mod-
eling splice sites and introns, and states for modeling
intergenic regions including states for the region up-
stream of the start codon and downstream of the stop
codon.
Two very important features of the model may not
be familiar to all readers, even if they know HMMs.
Instead of having the usual ’0th order’ emission prob-
ability distribution over letters of the alphabet, I allow
a state to have the emission probability conditioned
on the n previous letters in the sequence, which cor-
responds to an nth order Markov chain. These nth
order states are particularly useful for modeling cod-
ing regions. Here the basic model consists of three
4th order states, which is essentially like the inhomo-
geneous Markov chains used to model coding regions
in GeneMark (Borodovsky & McIninch 1993). Most
other states in the model are first order, i.e., captur-
ing dinucleotide statistics, except for the state model-
ing internal introns and the one for intergenic regions
which are both 3rd order. Notice that the state se-
quence is still first order, and therefore the high order
states do not change the HMM formalism significantly,
and all the standard algorithms are unchanged.
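As a sketch of how such a high order emission probability can be looked up, consider the snippet below. The table layout and the helper function are my own illustration under the assumption that the conditional probabilities are stored in a dense array; they are not HMMgene's data structures.

```python
import numpy as np

ALPHABET = {c: i for i, c in enumerate("ACGT")}

def high_order_emission(emit_table, seq, pos, order):
    """Emission probability of seq[pos] in an nth order state (illustrative).

    emit_table has shape (4,) * (order + 1): the probability of a letter
    conditioned on the `order` previous letters, as in an nth order Markov chain.
    """
    context = tuple(ALPHABET[c] for c in seq[pos - order:pos])
    return emit_table[context + (ALPHABET[seq[pos]],)]

# Example: a 2nd order state with a uniform conditional distribution
order = 2
emit = np.full((4,) * (order + 1), 0.25)
print(high_order_emission(emit, "ACGTAC", 3, order))  # P(T | CG) = 0.25
```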
The second feature that may be unfamiliar is called
tying, which is used extensively in speech recognition.
(The same technique is called ’weight-sharing’ for neu-
ral networks.) Tying of two states means that the emis-
sion probabilities and/or the transition probabilities
are always identical in the two states. During estima-
tion, a group of tied states is updated by the sum of
the changes calculated for the individual states in the
group, so it is like having the same state appear in
several places in the model. This is used for the intron
modeling. In order to incorporate splicing constraints
it is necessary to have three copies of the intron model,
and the states in these three models are tied. It is also
used to model the exon length by having several tied
copies of the basic three states for modeling a codon.
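A minimal sketch of such a tied update, assuming plain emission count vectors, is given below; the function and data layout are illustrative rather than HMMgene's actual estimation code.

```python
import numpy as np

def update_tied_emissions(expected_counts, tied_groups):
    """Sketch of re-estimating tied emission distributions (illustrative only).

    expected_counts[s] : expected emission counts for state s from one pass
                         of Baum-Welch style estimation
    tied_groups        : lists of state indices sharing the same emissions
    """
    new_emissions = {}
    for group in tied_groups:
        # Sum the expected counts over all states in the tied group ...
        total = np.sum([expected_counts[s] for s in group], axis=0)
        shared = total / total.sum()
        # ... and assign the same re-estimated distribution to every member
        for s in group:
            new_emissions[s] = shared
    return new_emissions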
In this work the models were regularized by the
simple pseudocount method (see e.g. (Krogh et al.
1994)). After a few initial experiments these pseudo-
counts were set to 25 for each letter, except in the
high order states for the coding regions where a larger
pseudocount of 250 for each letter was used to avoid
over-fitting. When reestimating an nth order state,
the expected number of each (n + 1)-mer is calculated
to obtain the conditional probabilities. For such states
the regularizer is spread evenly over all (n + 1)-mers,
i.e., the pseudocount for a letter is divided by 4^n. In
the beginning of the training a decreasing amount of
noise was added as described in (Krogh et al. 1994)
in an attempt to avoid unfavorable local minima. Be-
cause the models used in this study were of relatively
low complexity, the local minimum problem was not
severe, and models trained on the same data did not
differ much in performance.
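To make the pseudocount regularization of the high order states described above concrete, the following sketch assumes the expected (n + 1)-mer counts are kept in a dense array; the function name and layout are illustrative only.

```python
import numpy as np

def reestimate_nth_order(expected_counts, order, pseudocount_per_letter):
    """Sketch of pseudocount regularization for an nth order state.

    expected_counts : expected number of each (n+1)-mer, shape (4,) * (order + 1)
    The pseudocount for a letter is spread evenly over all contexts,
    i.e. divided by 4**order, and the result is normalized over the
    emitted letter to give conditional probabilities.
    """
    counts = expected_counts + pseudocount_per_letter / 4 ** order
    return counts / counts.sum(axis=-1, keepdims=True)

# Example: a 4th order coding state with a pseudocount of 250 per letter
order = 4
fake_counts = np.random.rand(*(4,) * (order + 1)) * 100  # stand-in expected counts
probs = reestimate_nth_order(fake_counts, order, 250.0)
```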
To speed up training, the ML estimated model was
used as the initial model for conditional ML. During
the iteration of the extended Baum-Welch algorithm
the model accuracy on the training set was monitored,
and after a maximum number of iterations, the model