A gentle tutorial on accelerated parameter and confidence interval estimation for hidden Markov models using Template Model Builder

Abstract A very common way to estimate the parameters of a hidden Markov model (HMM) is the relatively straightforward computation of maximum likelihood (ML) estimates. For this task, most users rely on user-friendly implementations of the estimation routines via an interpreted programming language such as the statistical software environment R. Such an approach can easily require time-consuming computations, in particular for longer sequences of observations. In addition, selecting a suitable approach for deriving confidence intervals for the estimated parameters is not entirely obvious, and often computationally intensive bootstrap methods have to be applied. In this tutorial, we illustrate how to speed up the computation of ML estimates significantly via the R package TMB. Moreover, this approach permits simple retrieval of standard errors at the same time. We illustrate the performance of our routines using different data sets: first, two smaller samples, namely data from a mobile application for tinnitus patients and a well-known data set of fetal lamb movements, with 87 and 240 data points, respectively; second, two larger simulated data sets of sizes 2000 and 5000 for further analysis. This tutorial is accompanied by a collection of scripts, which are all available in the Supporting Information. These scripts allow any user with moderate programming experience to benefit quickly from the computational advantages of TMB.

HMMs have found wide usage in many applied sciences, for example, in biology and bioinformatics (Durbin, 1998; Eddy, 1998; Schadt et al., 1998), finance (Hamilton, 1989), ecology (McClintock et al., 2020), stochastic weather modeling (Ailliot et al., 2015; Lystig & Hughes, 2002), and engineering (Mor et al., 2021). Furthermore, the statistical literature on this topic is rich, as illustrated, for example, by the manuscripts of Bartolucci et al. (2012), Cappé et al. (2006), and Zucchini et al. (2016). The aforementioned sources contain, among many other aspects, detailed descriptions of parameter estimation for HMMs by maximization of the (log-)likelihood function. In short, maximum likelihood (ML) estimation is commonly achieved by direct numerical maximization as introduced by Turner (2008) and later detailed by Zucchini et al. (2016), who also provided a widely used collection of R (R Core Team, 2021) scripts. Alternatively, Expectation Maximization (EM)-type algorithms, as first described by Baum et al. (1970) and Dempster et al. (1977), serve equally well for parameter estimation. These algorithms possess several advantageous properties, for example, robustness to poor initial values compared to direct numerical maximization via hill-climbing algorithms. For more details on the EM algorithm in the context of HMMs and a comparison of both approaches, see Bulla and Berzel (2008), who also describe a hybrid approach combining both algorithms.
Evaluating uncertainty and obtaining confidence intervals (CIs) constitute another essential aspect when working with HMMs, and it is less straightforward than parameter estimation. Although Cappé et al. (2006, ch. 12) showed that CIs can be obtained based on asymptotic normality of the ML estimates of the parameters under certain conditions, Frühwirth-Schnatter (2006, p. 53) points out that in independent mixture models, "the regularity conditions are often violated." McLachlan and Peel (2004, p. 68) add that "In particular for mixture models, it is well known that the sample size has to be very large before the asymptotic theory of maximum likelihood applies." Lystig and Hughes (2002) show a way to compute the exact Hessian, and Zucchini et al. (2016) present an alternative way to compute the approximate Hessian and thus CIs, but admit that "the use of the Hessian to compute standard errors (and thence confidence intervals) is unreliable if some of the parameters are on or near the boundary of their parameter space." In addition, Visser et al. (2000) report that computational problems arise when deriving CIs from the Hessian for sequences longer than 100 observations.
In this tutorial, we illustrate how to accelerate parameter estimation for HMMs with the help of Template Model Builder (TMB). As described by Kristensen et al. (2016), TMB is an R package developed for efficiently fitting complex statistical models to data. It provides exact calculations of first- and second-order derivatives of the (log-)likelihood of a model by automatic differentiation. On the one hand, this allows for efficient gradient- and/or Hessian-based optimization of the likelihood. On the other hand, TMB permits inferring CIs for the estimated parameters by means of the Hessian. We show how to carry out this part for HMMs using a couple of simple examples. Then, we compare the Hessian-based CIs with CIs resulting from likelihood profiling and bootstrapping, which are both more computationally intensive.
The tutorial is accompanied by a collection of scripts (listed at https://timothee-bacri.github.io/HMM_with_TMB/github.html#directory-structure), which guide any user working with HMMs through the implementation of computationally efficient parameter estimation. The majority of the scripts require only knowledge of R; only the computation of the (log-)likelihood function requires the involvement of C++. Moreover, we illustrate how TMB permits Hessian- or profile likelihood-based CIs for the estimated parameters at a very low computational cost. Naturally, the accelerated parameter estimation procedure may also serve for implementing computationally more efficient bootstrap CIs, an aspect we also make use of for our analyses.

PRINCIPLES OF USING TMB FOR ML ESTIMATION
In order to keep this tutorial at an acceptable length, all sections follow the same concept. That is, the reader is encouraged to consult the respective part of our Supporting Information in parallel to reading each section; it is available at https://timothee-bacri.github.io/HMM_with_TMB. In particular, we recommend opening the file Data supplements.Rproj (available in the related repository at https://github.com/timothee-bacri/HMM_with_TMB) with RStudio, which lets the reader have the correct working path set up automatically. On the one hand, this permits copying or downloading all the scripts presented in this tutorial for each section. On the other hand, it allows for consistent maintenance of the code. Moreover, the repository also contains additional explanations, comments, and scripts.

Setup
Execution of our routines requires the installation of the R package TMB and the software Rtools, where the latter serves for compiling C++ code on Windows. In order to ensure reproducibility of all results involving the generation of random numbers, R version 3.6.0 or greater is required, because the default sampling algorithm underlying set.seed changed in this version. Our scripts were tested on a workstation with 4 Intel(R) Xeon(R) Gold 6134 processors (3.7 GHz) running under the Linux distribution Ubuntu 18.04.6 LTS (Bionic Beaver) with 384 GB RAM and required about 1 week of computing time.
In particular for beginners, those parts of the scripts involving C++ code can be difficult to debug, because the code operates using a specific template. It is therefore helpful to know that TMB provides a debugging feature for RStudio, which can be useful for retrieving diagnostic error messages. Enabling this feature is optional and can be achieved by the command TMB:::setupRStudio() (which requires manual confirmation and restarting RStudio).

Linear regression example
We begin by demonstrating the principles of TMB, which we illustrate through the fitting procedure for a simple linear model. This permits, among other things, showing how to handle parameters subject to constraints, an aspect particularly relevant for HMMs. A more comprehensive tutorial on TMB presenting many technical details in more depth is available at https://kaskr.github.io/adcomp/_book/Tutorial.html. Let x and y denote the predictor and response vector, respectively, both of length n. For a simple linear regression model with intercept a and slope b, the negative log-likelihood equals −log L(a, b, σ) = −∑_{i=1}^{n} log φ(y_i; a + b x_i, σ²), where φ(⋅; μ, σ²) corresponds to the density function of the univariate normal distribution with mean μ and variance σ². The use of TMB requires the (negative) log-likelihood function to be coded in C++ under a specific template, which is then loaded into R. The minimization of this function and other postprocessing procedures are all carried out in R. Therefore, we require two files. The first file, named linreg.cpp, is written in C++ and defines the objective function, that is, the negative log-likelihood (nll) function of the linear model. Note that we define the data inputs x and y using the DATA_VECTOR() declaration in this file. Furthermore, we declare the nll as a function of the three parameters a, b, and log(σ) using the PARAMETER() declaration. In order to be able to carry out unconstrained optimization procedures in the following, the nll function is parameterized in terms of log(σ): while the parameter σ is constrained to be nonnegative, log(σ) can be estimated freely. Alternatively, constrained optimization methods could be carried out, but we do not investigate such procedures. The ADREPORT() function is optional but useful for parameter inference at the postprocessing stage.
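The objective coded in linreg.cpp is the TMB C++ template version of this nll. As a language-neutral sketch (not the authors' code), the same objective with log(σ) as unconstrained working parameter can be written in plain Python:

```python
import math

def linreg_nll(params, x, y):
    # params = (a, b, log_sigma); parameterizing the nll in terms of
    # log(sigma) keeps the optimization problem unconstrained
    a, b, log_sigma = params
    sigma = math.exp(log_sigma)
    nll = 0.0
    for xi, yi in zip(x, y):
        resid = yi - (a + b * xi)
        # minus the log of the normal density N(a + b*x_i, sigma^2)
        nll += 0.5 * math.log(2 * math.pi * sigma ** 2) + resid ** 2 / (2 * sigma ** 2)
    return nll

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]  # exactly y = 1 + 2x
```

Passing this function to any numerical optimizer over (a, b, log σ) mirrors what TMB does with the compiled C++ objective, except that TMB also supplies exact derivatives.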
The second file needed is written in R and serves for compiling the nll function defined above and carrying out the estimation procedure, that is, it compiles the C++ file and minimizes the nll function numerically. In addition to this core functionality, different types of postprocessing of the results are possible as well. For example, the function sdreport returns the ML estimates and standard errors of the parameters in terms of which the nll is parameterized. In principle, the argument par.fixed = mod_linreg$par is optional but recommended, because it ensures that the sdreport function is carried out at the minimum found by nlminb (Gay, 1990). Instead of nlminb, other optimization routines may be used as well. For example, Zucchini et al. (2016) rely on the R function nlm. We selected nlminb because it offers a higher degree of flexibility, while having a syntax close to nlm. Note that these standard errors are based on the Hessian matrix of the nll.
From a practical perspective, it is usually desirable to obtain standard errors for the constrained variables, in this case σ. To achieve this, one can run the summary function with the argument select = "report". The resulting standard errors are computed by the generalized delta method described by Kass and Steffey (1989), which is implemented within TMB. Note that full functionality of the sdreport function requires calling the function ADREPORT on the additional parameters of interest (i.e., those including transformed parameters, in our example σ) in the C++ part. The select argument restricts the output to variables passed by ADREPORT. This feature is particularly useful when the likelihood has been reparameterized as above, and is especially relevant for HMMs. Following Zucchini et al. (2016), we refer to the original parameters as natural parameters, and to their transformed versions as working parameters.
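In the one-dimensional case, the delta method behind sdreport reduces to multiplying the working-scale standard error by the derivative of the back-transformation. A minimal sketch with made-up numbers (the function name and values are illustrative, not from the paper's scripts):

```python
import math

def delta_se_exp(theta_hat, se_theta):
    # standard error of sigma = exp(theta) by the delta method:
    # se(sigma) ~ |d/dtheta exp(theta)| * se(theta) = exp(theta_hat) * se(theta)
    return math.exp(theta_hat) * se_theta

# hypothetical working-scale output: log(sigma) estimated as 0.5 with s.e. 0.1
se_sigma = delta_se_exp(0.5, 0.1)
```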
Last, we display the estimation results from the lm function for comparison.

PARAMETER ESTIMATION TECHNIQUES FOR HMMs
In this section we recall basic concepts underlying parameter estimation for HMMs via direct numerical optimization of the likelihood. In terms of notation, we stay as close as possible to Zucchini et al. (2016), where a more detailed presentation is available.

Basic notation and model setup
A large variety of modeling approaches is possible with HMMs, ranging from rather simple to highly complex setups. In a basic HMM, one assumes that the data-generating process corresponds to a time-dependent mixture of conditional distributions. More specifically, the mixing process is driven by an unobserved (hidden) homogeneous Markov chain. In this paper we focus on a Poisson HMM, but only small changes are necessary to adapt our scripts to models with other conditional distributions. For an example of the application of TMB in a more complex HMM setting, see Berentsen et al. (2022). Let {X_t : t = 1, …, T} and {C_t : t = 1, …, T} denote the observed and hidden process, respectively, where t denotes the (time) index ranging from one to T. For an m-state Poisson HMM, the conditional distributions with parameters λ_1, …, λ_m are then specified through P(X_t = x | C_t = i) = exp(−λ_i) λ_i^x / x!, where i = 1, …, m. Furthermore, we let Γ = {γ_ij} and δ denote the transition probability matrix (TPM) of the Markov chain and the corresponding stationary distribution, respectively. It is noteworthy that Markov chains in the context of HMMs are often assumed irreducible and aperiodic. For example, Grimmett et al. (2001, Lemma 6.3.5 on p. 225 and Theorem 6.4.3 on p. 227) show that irreducibility ensures the existence of the stationary distribution, and Feller (1968, p. 394) describes that aperiodicity implies that a unique limiting distribution exists and corresponds to the stationary distribution. These results are, however, of limited relevance for most estimation algorithms, because the elements of Γ are in general strictly positive. Nevertheless, one should be careful when manually fixing selected elements of Γ to zero.
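For a two-state chain, the stationary distribution solving δΓ = δ has a simple closed form, δ = (γ_21, γ_12) / (γ_12 + γ_21). A small illustrative sketch (not part of the paper's scripts):

```python
def stationary_2state(g12, g21):
    # stationary distribution of the TPM [[1-g12, g12], [g21, 1-g21]],
    # assuming irreducibility (g12 + g21 > 0), so that delta * Gamma = delta
    total = g12 + g21
    return (g21 / total, g12 / total)

delta = stationary_2state(0.1, 0.3)  # TPM [[0.9, 0.1], [0.3, 0.7]]
```

For general m, δ is obtained by solving the linear system δ(I − Γ + U) = 1 with U a matrix of ones, as described by Zucchini et al. (2016).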

The likelihood function of an HMM
The likelihood function of an HMM requires, in principle, a summation over all possible state sequences. As shown, for example, by Zucchini et al. (2016, p. 37), a computationally convenient representation as a product of matrices is possible. Let X^(t) = {X_1, …, X_t} and x^(t) = {x_1, …, x_t} denote the "history" of the observed process and the observations, respectively, from time 1 up to time t. Moreover, let θ denote the vector of model parameters, which consists of the parameters of the TPM and the parameters of the conditional probability density functions. Given these parameters, the likelihood of the observations {x_1, …, x_T} can then be expressed as L_T = δ P(x_1) Γ P(x_2) ⋯ Γ P(x_T) 1′, where P(x) corresponds to a diagonal matrix with the conditional probability density functions evaluated at x on its diagonal (we will use the term density despite the discrete support), and 1′ denotes a column vector of ones. The first element of the likelihood function, the so-called initial distribution, is given by the stationary distribution δ here. Alternatively, the initial distribution may be estimated freely, which requires minor changes to the likelihood function discussed in Section 4.1.
Note that the treatment of missing data is comparably straightforward in this setup. If x_t is a missing observation, one just has to set p_i(x_t) = 1 for all states i, so that P(x_t) reduces to the identity matrix, as detailed in Zucchini et al. (2016, p. 40). Furthermore, this representation of the likelihood is quite natural from an intuitive point of view. From left to right, it can be interpreted as a pass through the observations: One starts with the initial distribution δ multiplied by the conditional densities of x_1 collected in P(x_1). This is followed by iterative multiplications with the TPM Γ modeling the transition to the next observation, and yet another multiplication with the contributions of the following conditional densities.
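The matrix-product representation, including the missing-data convention, can be sketched in a few lines of Python (an illustration with arbitrary parameter values, not the paper's implementation):

```python
import math

def pois_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

def hmm_likelihood(obs, delta, gamma, lams):
    # L_T = delta P(x1) Gamma P(x2) ... Gamma P(xT) 1'
    # a missing observation (None) contributes P(x) = identity,
    # i.e. p_i(x) = 1 for every state i
    m = len(lams)
    phi = [delta[i] * (1.0 if obs[0] is None else pois_pmf(obs[0], lams[i]))
           for i in range(m)]
    for x in obs[1:]:
        p = [1.0 if x is None else pois_pmf(x, lams[j]) for j in range(m)]
        phi = [sum(phi[i] * gamma[i][j] for i in range(m)) * p[j]
               for j in range(m)]
    return sum(phi)

delta = (0.75, 0.25)
gamma = [[0.9, 0.1], [0.3, 0.7]]
lams = (1.0, 5.0)
L_full = hmm_likelihood([2, None, 4], delta, gamma, lams)
```

A useful consistency check: skipping the middle (missing) observation is equivalent to using the two-step TPM Γ² between the remaining observations.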

Forward algorithm and backward algorithm
The pass through the observations described above actually forms the basis for an efficient evaluation of the likelihood function. More precisely, the so-called "forward algorithm" allows for a recursive computation of the likelihood. For setting up this algorithm, we need to define the vector α_t by α_t = δ P(x_1) Γ P(x_2) ⋯ Γ P(x_t) for t = 1, 2, …, T. The name forward algorithm originates from the way of calculating α_t, that is, α_1 = δ P(x_1) and α_t = α_{t−1} Γ P(x_t) for t = 2, …, T.
After a pass through all observations, the likelihood results from L_T = α_T 1′. In a similar way, the "backward algorithm" also permits the recursive computation of the likelihood, but starting with the last observation. To formulate the backward algorithm, let us define the vector β_t for t = 1, 2, …, T so that β_t′ = Γ P(x_{t+1}) Γ P(x_{t+2}) ⋯ Γ P(x_T) 1′, with β_T = 1. The name backward algorithm results from the way of calculating β_t, that is, β_t′ = Γ P(x_{t+1}) β_{t+1}′ for t = T − 1, …, 1. Again, the likelihood can be calculated after a pass through all observations by L_T = δ P(x_1) β_1′.
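Both recursions can be sketched directly from the definitions (an illustrative Python version with arbitrary parameters, not the paper's R/C++ code); a well-known identity is that α_t β_t′ = L_T for every t:

```python
import math

def pois_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

def forward(obs, delta, gamma, lams):
    # alpha_1 = delta P(x1); alpha_t = alpha_{t-1} Gamma P(x_t)
    m = len(lams)
    alpha = [delta[i] * pois_pmf(obs[0], lams[i]) for i in range(m)]
    out = [alpha]
    for x in obs[1:]:
        alpha = [sum(alpha[i] * gamma[i][j] for i in range(m)) * pois_pmf(x, lams[j])
                 for j in range(m)]
        out.append(alpha)
    return out

def backward(obs, gamma, lams):
    # beta_T = 1; beta_t' = Gamma P(x_{t+1}) beta_{t+1}'
    m = len(lams)
    beta = [1.0] * m
    out = [beta]
    for x in reversed(obs[1:]):
        beta = [sum(gamma[i][j] * pois_pmf(x, lams[j]) * beta[j] for j in range(m))
                for i in range(m)]
        out.append(beta)
    return list(reversed(out))

obs = [2, 0, 4, 1]
delta = (0.75, 0.25)
gamma = [[0.9, 0.1], [0.3, 0.7]]
lams = (1.0, 5.0)
A = forward(obs, delta, gamma, lams)
B = backward(obs, gamma, lams)
L = sum(A[-1])  # likelihood L_T from the forward pass
```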
In general, parameter estimation is based on the forward algorithm. The backward algorithm is, however, still useful, because the quantities α_t and β_t together serve for a couple of interesting tasks. For example, they are the basis for deriving a particular type of conditional distribution and for state inference by local decoding (Zucchini et al., 2016, ch. 5, pp. 81-93). We present details on local decoding in the Supporting Information.
Last, it is well known that the execution of the forward (or backward) algorithm may quickly lead to underflow errors, because many elements of the vectors and matrices involved take values between zero and one. To avoid these difficulties, a scaling factor can be introduced. We follow the approach suggested by Zucchini et al. (2016, p. 48) and implement a scaled version of the forward algorithm, which directly provides the (negative) log-likelihood as result.
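The scaled forward pass can be sketched as follows (an illustration, not the authors' implementation): at each step the forward vector is normalized to sum to one, and the logs of the scale factors are accumulated, which sums to log L_T without ever forming the underflowing product.

```python
import math

def pois_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

def hmm_nll(obs, delta, gamma, lams):
    # scaled forward algorithm: returns the negative log-likelihood
    m = len(lams)
    phi = [delta[i] * pois_pmf(obs[0], lams[i]) for i in range(m)]
    u = sum(phi)
    llk = math.log(u)
    phi = [v / u for v in phi]
    for x in obs[1:]:
        phi = [sum(phi[i] * gamma[i][j] for i in range(m)) * pois_pmf(x, lams[j])
               for j in range(m)]
        u = sum(phi)
        llk += math.log(u)
        phi = [v / u for v in phi]
    return -llk

# 600 observations: the unscaled likelihood product would underflow to 0.0
nll_long = hmm_nll([1, 0, 2] * 200, (0.75, 0.25),
                   [[0.9, 0.1], [0.3, 0.7]], (1.0, 5.0))
```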

Reparameterization of the likelihood function
The representation of the likelihood and the algorithms shown above rely on the data and the set of parameters as input.
The parameters are subject to several constraints: (i) Typically, there are various constraints on the parameters of the conditional distributions. For the Poisson HMM, all elements of the parameter vector λ = (λ_1, …, λ_m) must be positive. (ii) In general, the entries of the TPM have to be nonnegative, and the rows of Γ must sum up to one.
The constraints of the TPM can be difficult to deal with using constrained optimization of the likelihood. A common approach is therefore to reparameterize the log-likelihood in terms of unconstrained "working" parameters {τ, η} = ψ^{−1}(Γ, λ), as follows. A possible reparameterization of Γ is given by γ_ij = exp(τ_ij) / (1 + ∑_{k≠i} exp(τ_ik)) for i ≠ j, where the τ_ij are m(m − 1) real-valued, thus unconstrained, elements of an m times m matrix τ with no diagonal elements. The diagonal elements of Γ follow implicitly from ∑_j γ_ij = 1 ∀ i (Zucchini et al., 2016, p. 51). The corresponding reverse transformation is given by τ_ij = log(γ_ij / γ_ii) for i ≠ j. For the Poisson HMM, the intensities can be reparameterized in terms of λ_i = exp(η_i), and consequently the unconstrained working parameters are given by η_i = log(λ_i), i = 1, …, m. Estimates of the "natural" parameters {Γ, λ} can then be obtained by maximizing the reparameterized likelihood with respect to {τ, η} and then transforming the estimated working parameters back to natural parameters via the above transformations, that is, {Γ̂, λ̂} = ψ(τ̂, η̂). Note that in general the function ψ needs to be one-to-one for the above procedure to work.
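The TPM transformation and its inverse can be sketched as follows (an illustrative Python version; the transform amounts to a row-wise softmax with the diagonal anchored at zero):

```python
import math

def tau_to_gamma(tau):
    # working -> natural: gamma_ij = exp(tau_ij) / (1 + sum_{k != i} exp(tau_ik))
    # for i != j; the diagonal follows from each row summing to one
    m = len(tau)
    gamma = []
    for i in range(m):
        expo = [1.0 if j == i else math.exp(tau[i][j]) for j in range(m)]
        s = sum(expo)
        gamma.append([e / s for e in expo])
    return gamma

def gamma_to_tau(gamma):
    # natural -> working: tau_ij = log(gamma_ij / gamma_ii); diagonal unused
    m = len(gamma)
    return [[0.0 if j == i else math.log(gamma[i][j] / gamma[i][i])
             for j in range(m)] for i in range(m)]

tau = gamma_to_tau([[0.9, 0.1], [0.3, 0.7]])
gamma_back = tau_to_gamma(tau)  # round trip recovers the original TPM
```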

USING TMB
In the following we show how ML estimation of the parameters of HMMs can be carried out efficiently via TMB.

Likelihood function
Similar to the linear regression example presented in Section 2.2, the first and essential step is to define our nll function, to be minimized later, in a suitable C++ file. In our case, this function calculates the negative log-likelihood presented by Zucchini et al. (2016, p. 48), and our C++ code is analogous to the R code shown by Zucchini et al. (2016, pp. 331-333). This function, defined in the file named poi_hmm.cpp, tackles our setting with conditional Poisson distributions only. An extension to, for example, Gaussian, binomial, or exponential conditional distributions is straightforward: it requires modifying the density function in the poi_hmm.cpp file and the related functions for parameter transformation presented in Section 3.4. We illustrate the implementation of the Gaussian case in the Supporting Information. However, note that the number of possible modeling setups is very large: For example, the conditional distributions may vary from state to state, nested models may be specified, the conditional mean may be linked to covariates, or the TPM could depend on covariates, to name only a few possibilities. Due to the very large number of possible extensions of the basic HMM, we refrain from implementing an R package and prefer to provide proper guidance to the reader for building custom models suited to a particular application. As a small example, we illustrate how to implement a freely estimated initial distribution in the file poi_hmm.cpp. This modification can be achieved by uncommenting a couple of lines only.

Optimization
With the nll function available in C++, we can carry out the parameter estimation and all pre-/postprocessing in R. In the following we describe the steps to be carried out.
(1) Loading of the necessary packages, compilation of the nll function with TMB and subsequent loading, and loading of the auxiliary functions for parameter transformation.
(2) Loading of the observations. The data are part of a large data set collected with the "Track Your Tinnitus" (TYT) mobile application, a detailed description of which is presented in Pryss, Reichert, Herrmann, et al. (2015). We analyze 87 successive days of the "arousal" variable recorded for a single individual. This variable is measured on a discrete scale, where higher values correspond to a higher degree of excitement and lower values to a calmer emotional state (for details, see Probst et al., 2016, 2017). Loading the "arousal" variable can be achieved with a single command. Table 1 presents the raw data, which are also available for download in the Supporting Information. In a typical analysis, models with different numbers of states are fitted to the data. These models are then evaluated, for example, by means of model selection criteria (as carried out by Leroux & Puterman, 1992) or prediction performance (Celeux & Durand, 2008). The model selection procedure shows that both the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) prefer a two-state model over a model with three or four states. Consequently, we focus on the two-state model in the following. The list object TMB_data contains the data and the number of states.
(3) Definition of initial values for the optimization procedure. Although we will apply unconstrained optimization, we initialize the natural parameters, because this is much more intuitive and practical than handling the working parameters.
(4) Transformation from natural to working parameters. The previously created initial values are transformed and stored in the list parameters for the optimization procedure.
(5) Creation of the TMB negative log-likelihood function with its derivatives. This object, stored as obj_tmb, requires the data, the initial values, and the previously created DLL as input. Setting the argument silent = TRUE disables tracing information and is only used here to avoid excessive output.
This object also contains the previously defined initial values as a vector (par) rather than a list. The negative log-likelihood (fn), its gradient (gr), and Hessian (he) are functions of the parameters (in vector form), while the data are considered fixed. (6) Execution of the optimization. For this step we rely again on the optimizer implemented in the nlminb function.
The arguments, that is, initial values for the parameters and the function to be optimized, are extracted from the previously created TMB object.
As mentioned previously, various alternatives to nlminb exist and could be used at this step instead. (7) Obtaining the ML estimates of the natural parameters together with their standard errors is possible by using the previously introduced command sdreport. Recall that this requires the parameters of interest to be treated by the ADREPORT statement in the C++ part. It should be noted that the presentation of the set of parameters gamma below results from a column-wise representation of the TPM.
Note that the table above also contains estimation results for δ and accompanying standard errors, although δ is not estimated, but derived from Γ. We provide further details on this aspect in Section 5.1. The value of the nll function at the minimum found by the optimizer can also be extracted directly from the object mod_tmb by accessing the list element objective. (8) In the optimization above we already benefited from an increased speed due to the evaluation of the nll in C++, compared to the forward algorithm being executed entirely in R. However, the use of TMB also permits introducing the gradient and/or the Hessian computed by TMB into the optimization procedure. This is in general advisable, because TMB provides an exact value of both gradient and Hessian up to machine precision, which is superior to the approximations used by the optimizing procedure. Similar to the nll, both quantities can be extracted directly from the TMB object obj_tmb. Note that passing the gradient and Hessian provided by TMB to nlminb leads to the same minimum, that is, the same value of the nll function, here.
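The AIC/BIC comparison mentioned in step (2) uses the standard definitions AIC = 2·nll + 2p and BIC = 2·nll + p·log(T), where for an m-state Poisson HMM with stationary initial distribution the number of free parameters is p = m + m(m − 1) = m². A short sketch (illustrative helper names, not from the paper's scripts):

```python
import math

def n_params_poisson_hmm(m):
    # m intensities plus m*(m-1) free transition probabilities
    # (the stationary initial distribution adds no free parameters)
    return m + m * (m - 1)

def aic(nll, p):
    return 2.0 * nll + 2.0 * p

def bic(nll, p, n_obs):
    return 2.0 * nll + p * math.log(n_obs)
```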

Basic nested model specification
In the context of HMMs (and other statistical models), nested models or models subject to certain parameter restrictions are commonly used. For example, it may be necessary to fix some parameters because of biological or physical constraints. TMB can be instructed to treat selected parameters as constants, or to impose equality constraints on a set of parameters. For the practical implementation, it is noteworthy that such parameter restrictions should be imposed on the working parameters. However, it is also easily possible to impose restrictions on a natural parameter (e.g., λ), and then identify the corresponding restriction on the working parameter (i.e., log(λ)). We illustrate a simple nested model specification by fixing λ_1 to one in our two-state Poisson HMM; the other parameter components correspond to the previous initial values.
We then transform these natural parameters into a set of working parameters.
For instructing TMB to treat selected parameters as constants, the map argument of the MakeADFun function has to be specified in addition to the usual arguments. The map argument is a list consisting of factor-valued vectors, which possess the same length as the working parameters and carry their names as well. The factor levels have to be unique for the regular parameters not subject to specific restrictions. If a parameter is fixed, the corresponding entry of the map argument is set to NA. It is noteworthy that more complex constraints are possible as well. For example, to impose equality constraints (such as γ_11 = γ_22), the corresponding factor level has to be identical for the concerned entries. We refer to our Supporting Information for details.
Last, estimation of the remaining model parameters and extraction of the results is achieved as before.
Note that the standard error of λ_1 equals zero, because it is no longer considered a parameter and does not enter the optimization procedure (see Table 5 for the estimated values).

State inference and forecasting
After estimating an HMM by the procedures illustrated in Section 4.2, it is possible to carry out a couple of analyses that provide insight into the interpretation of the estimated model. These include, for example, the so-called smoothing probabilities, which correspond to the probability of being in state i at time t, for i = 1, …, m and t = 1, …, T, given all observations. These probabilities can be obtained by P(C_t = i | X^(T) = x^(T)) = α_t(i) β_t(i) / L_T, evaluated at the set of ML estimates θ̂. The derived smoothing probabilities then serve for determining the most probable state i_t* at time t given the observations by i_t* = argmax_{i ∈ {1, …, m}} P(C_t = i | X^(T) = x^(T)).
Furthermore, the Viterbi algorithm determines the overall most probable sequence of states i_1*, …, i_T*, given the observations. This is achieved by maximizing the joint conditional probability P(C_1 = i_1, …, C_T = i_T | X^(T) = x^(T)) over all possible state sequences. Other quantities of interest include the forecast distribution or h-step-ahead probabilities P(X_{T+h} = x | X^(T) = x^(T)). All the quantities shown above and the related algorithms for deriving them are described in detail in Zucchini et al. (2016, ch. 5). In order to apply these algorithms, it is only necessary to extract the quantities required as input from a suitable MakeADFun object. Note that most algorithms rely on scaled versions of the forward and backward algorithms. This is illustrated in detail in the Supporting Information. Figure 1 shows the TYT data together with the conditional mean values linked to the most probable state inferred by the smoothing probabilities.
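The Viterbi recursion can be sketched as follows (an illustrative Python version on the log scale, with arbitrary parameters and strictly positive transition probabilities assumed; not the paper's code):

```python
import math

def pois_logpmf(x, lam):
    return -lam + x * math.log(lam) - math.lgamma(x + 1)

def viterbi(obs, delta, gamma, lams):
    # global decoding: most probable state sequence given the observations
    m = len(lams)
    # xi[t][i]: log-probability of the best path ending in state i at time t
    xi = [[math.log(delta[i]) + pois_logpmf(obs[0], lams[i]) for i in range(m)]]
    back = []
    for x in obs[1:]:
        row, ptr = [], []
        for j in range(m):
            best_i = max(range(m), key=lambda i: xi[-1][i] + math.log(gamma[i][j]))
            ptr.append(best_i)
            row.append(xi[-1][best_i] + math.log(gamma[best_i][j])
                       + pois_logpmf(x, lams[j]))
        xi.append(row)
        back.append(ptr)
    # backtrack from the best final state
    path = [max(range(m), key=lambda i: xi[-1][i])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = viterbi([0, 1, 0, 7, 6, 8, 0], (0.75, 0.25),
                 [[0.9, 0.1], [0.3, 0.7]], (1.0, 5.0))
```

With intensities (1, 5), low counts are decoded into state 0 and the burst of high counts into state 1.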

CIs
A common approach for deriving CIs for the estimated parameters of statistical models is based on finite-difference approximations of the Hessian. This technique is, however, not suited for most HMMs due to computational difficulties, as already pointed out by Visser et al. (2000). The same authors suggest likelihood profile CIs or bootstrap-based CIs as potentially better alternatives. Despite the potentially high computational load, bootstrap-based CIs have become an established method in the context of HMMs (Bulla & Berzel, 2008; Zucchini et al., 2016) and have found widespread application among practitioners.
In this section we illustrate how CIs based on the Hessian, likelihood profiling, and the bootstrap can be efficiently implemented by integrating TMB. This permits, in particular, obtaining Hessian-based and likelihood profile-based CIs at very low computational cost. For simplicity, we illustrate our procedures by means of the parameter λ_2 of our two-state Poisson HMM. We further address the resulting CIs for the remaining parameters and performance-related aspects in Section 6.

Wald-type CIs based on the Hessian
Since the negative log-likelihood function of HMMs typically depends on the working parameters, evaluation of the Hessian at the optimum found by numerical optimization only serves for inference about the working parameters. From a practical perspective, however, inference about the natural parameters usually is of interest. As the Hessian ∇²(−log L(τ̂, η̂)) refers to the working parameters {τ, η}, the delta method is suitable to obtain an estimate of the covariance matrix of {Γ̂, λ̂} by Cov({Γ̂, λ̂}) ≈ ∇ψ(τ̂, η̂) [∇²(−log L(τ̂, η̂))]^{−1} ∇ψ(τ̂, η̂)′, (2) with {Γ̂, λ̂} = ψ(τ̂, η̂) as defined in Section 3.4. From a user's perspective, it is highly convenient that the entire right-hand side of Equation (2) can be directly computed via automatic differentiation in TMB. Moreover, it is particularly noteworthy that the standard errors of derived parameters can be calculated by the delta method similarly. For example, the stationary distribution δ is a function of Γ in our case, and TMB provides a straightforward way to obtain standard errors of δ. This is achieved by first defining δ inside the C++ file poi_hmm.cpp (or, in our implementation, the related utils.cpp, which gathers auxiliary functions). Second, it is necessary to call ADREPORT on δ within the poi_hmm.cpp file. To display the resulting estimates and corresponding standard errors in R, one can rely on the command shown previously in Section 4.2. Subsequently, Wald-type CIs (Wald, 1943) follow in the usual manner. For example, the (1 − α) CI for λ_1 is given by λ̂_1 ± z_{1−α/2} σ̂_{λ_1}, where z_q is the q-quantile of the standard normal distribution, and σ̂_{λ_1} is the standard error of λ̂_1 obtained via the delta method. This part is easily implemented in R. We illustrate the calculation of these CIs for our two-state Poisson HMM in the Supporting Information.
Finally, note that the reliability of Wald-type CIs may suffer from a singular Fisher information matrix, which can occur for many different types of statistical models, including HMMs. This also jeopardizes the validity of AIC and BIC criteria. For further details on this topic, see, for example, Drton and Plummer (2017).
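For a single intensity λ = exp(η), the two natural ways to form a Wald-type interval can be sketched as follows (illustrative numbers; note that the delta-method interval is symmetric around λ̂, whereas back-transforming a working-scale interval yields an asymmetric one):

```python
import math

Z975 = 1.959963984540054  # 97.5% quantile of the standard normal

def wald_ci_natural(eta_hat, se_eta):
    # symmetric Wald CI for lambda = exp(eta) using the delta-method
    # standard error se(lambda) ~ exp(eta_hat) * se(eta)
    lam = math.exp(eta_hat)
    se_lam = lam * se_eta
    return lam - Z975 * se_lam, lam + Z975 * se_lam

def wald_ci_working(eta_hat, se_eta):
    # alternative: build the CI on the working scale and back-transform,
    # which gives an asymmetric interval on the natural scale
    return math.exp(eta_hat - Z975 * se_eta), math.exp(eta_hat + Z975 * se_eta)

lo_d, hi_d = wald_ci_natural(math.log(5.0), 0.1)
lo_w, hi_w = wald_ci_working(math.log(5.0), 0.1)
```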

Likelihood profile-based CIs
The Hessian-based CIs presented above rely on asymptotic normality of the ML estimator. Properties of the ML estimator may, however, change in small samples. Moreover, symmetric CIs may not be suitable if the ML estimator lies close to a boundary of the parameter space. This occurs, for example, when states are highly persistent, which leads to entries close to one in the TPM. An alternative approach to constructing CIs is based on the so-called profile likelihood (see, e.g., Meeker & Escobar, 1995; Venzon & Moolgavkar, 1988), which has also shown a satisfactory performance in the context of HMMs (Visser et al., 2000). In the following, we illustrate the principle of likelihood profile-based CIs by the example of the parameter λ_2 in our two-state Poisson HMM. The underlying basic idea is to identify those values of our parameter of interest λ_2 in the neighborhood of λ̂_2 that lead to a significant change in the log-likelihood, whereby the other parameters (i.e., Γ, λ_1) are considered nuisance parameters (Meeker & Escobar, 1995). The term "nuisance parameters" means that these parameters need to be re-estimated (by maximizing the likelihood) for any fixed value of λ_2 different from λ̂_2. That is, the profile likelihood of λ_2 is defined as L_p(λ_2) = max_{Γ, λ_1} L(Γ, λ_1, λ_2). In order to construct profile likelihood-based CIs, let {Γ̂, λ̂} denote the ML estimate for our HMM computed as described in Section 4.2. Evaluation of the log-likelihood function at this point results in the value log L(Γ̂, λ̂). The deviation between the likelihood of the ML estimate and the profile likelihood at the point λ_2 is then captured by the following likelihood ratio: Λ(λ_2) = 2 [log L(Γ̂, λ̂) − log L_p(λ_2)]. As described above, the log-likelihood log L_p(λ_2) results from re-estimating the two-state HMM with fixed parameter λ_2. Therefore, this model effectively corresponds to a nested model of the full model with ML estimate Γ̂, λ̂. Consequently, Λ(λ_2) asymptotically follows a χ² distribution with one degree of freedom, the difference in degrees of freedom between the two models.
Based on this, a CI for λ₂ can be derived by evaluating Λ(λ₂) at many different values of λ₂ and determining when the resulting value of Λ becomes "too extreme." That is, for a given α, one needs to calculate the 1 − α quantile of the χ²₁ distribution (e.g., 3.841 for α = 5%). The CI at level 1 − α for the parameter λ₂ is then given by

{ λ₂ : Λ(λ₂) ≤ χ²₁,₁₋α }.

For simplicity, the principles of likelihood profiling shown above rely on the natural parameters. Our nll function is, however, parameterized in terms of, and optimized with respect to, the working parameters. In practice, this aspect is easy to deal with. Once a profile CI for the working parameter (here η₂ = log(λ₂)) has been obtained following the procedure above, the corresponding CI for the natural parameter λ₂ results directly from transforming the upper and lower boundaries of the CI for η₂ by the one-to-one transformation λ₂ = exp(η₂). For further details on the invariance of likelihood-based CIs to parameter transformations, see, for example, Meeker and Escobar (1995).
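To make the inversion of the likelihood-ratio statistic concrete, the following minimal R sketch applies the same mechanics to an i.i.d. Poisson sample; there, no nuisance parameters exist, so the profile log-likelihood coincides with the ordinary log-likelihood. All object names are ours, chosen for illustration.

```r
# Invert the likelihood-ratio statistic on a grid to obtain an approximate
# 95% CI for the mean of an i.i.d. Poisson sample (illustrative sketch).
set.seed(1)
x <- rpois(100, lambda = 4)
loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))
lambda_hat <- mean(x)                 # ML estimate for the i.i.d. Poisson model

crit <- qchisq(0.95, df = 1)          # critical value 3.841 for alpha = 5%
Lambda <- function(lambda) 2 * (loglik(lambda_hat) - loglik(lambda))

# Keep all grid values whose statistic does not exceed the critical value
grid <- seq(2, 7, by = 0.001)
ci <- range(grid[sapply(grid, Lambda) <= crit])
ci                                    # approximate 95% profile CI for lambda
```

In the HMM setting, each evaluation of the statistic would additionally re-optimize the nuisance parameters, as described above.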
TMB provides an easy way to carry out profiling through the function tmbprofile, which requires several inputs: first, the MakeADFun object called obj_tmb from our two-state Poisson HMM; second, the position of the (working) parameter to be profiled via the name argument. This position refers to the position in the parameter vector obj_tmb$par. Moreover, the optional trace argument indicates how much information on the optimization is displayed. With these inputs, tmbprofile permits profiling the second working parameter η₂ = log(λ₂). Furthermore, Figure 2, obtained via the plot function, shows the resulting profile nll as a function of the working parameter η₂. The vertical and horizontal lines correspond to the boundaries of the 95% CI and the critical value of the nll derived from Equation (4), respectively. The CI for η₂ can be extracted directly via the function confint, and the corresponding CI for λ₂ follows by transforming its boundaries with exp. While simple linear combinations of parameters can be profiled through the argument lincomb of the tmbprofile function, this is not possible for more complex functions of the parameters. This includes the stationary distribution δ, for which CIs cannot be obtained by this method.
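A minimal sketch of these calls, assuming the MakeADFun object obj_tmb from Section 4.2 is available in the workspace (the object name prof is ours):

```r
library(TMB)

# Profile the second working parameter eta_2 = log(lambda_2); name = 2 refers
# to the position in the parameter vector obj_tmb$par.
prof <- tmbprofile(obj_tmb, name = 2, trace = FALSE)
plot(prof)         # profile nll as a function of eta_2 (cf. Figure 2)

confint(prof)      # 95% CI for the working parameter eta_2
exp(confint(prof)) # corresponding CI for the natural parameter lambda_2
```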
Note that the function tmbprofile carries out several optimization procedures internally for calculating profile CIs. If this approach fails, or one prefers a specific optimization routine, the necessary steps for profiling can also be implemented by the user. To do so, it would, roughly speaking, be necessary to compute Λ(λ₂) for a sufficient number of λ₂ values to achieve the desired precision.
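A self-contained sketch of such user-implemented profiling for a two-state Poisson HMM, with the stationary distribution as initial distribution, could look as follows. All function and object names are ours; a fixed grid and nlminb serve for the re-estimation step.

```r
# Negative log-likelihood of a two-state Poisson HMM via the forward algorithm
# (working parameters: two off-diagonal TPM logits, two log-means).
nll_hmm <- function(par, x) {
  G <- diag(2)
  G[1, 2] <- plogis(par[1]); G[1, 1] <- 1 - G[1, 2]
  G[2, 1] <- plogis(par[2]); G[2, 2] <- 1 - G[2, 1]
  lambda <- exp(par[3:4])
  delta <- solve(t(diag(2) - G + 1), rep(1, 2))   # stationary distribution
  phi <- delta * dpois(x[1], lambda)
  ll <- log(sum(phi)); phi <- phi / sum(phi)
  for (t in 2:length(x)) {
    phi <- (phi %*% G) * dpois(x[t], lambda)
    ll <- ll + log(sum(phi)); phi <- phi / sum(phi)
  }
  -ll
}

set.seed(2)
x <- c(rpois(100, 2), rpois(100, 9))              # toy data with two regimes
fit <- nlminb(c(0, 0, log(2), log(9)), nll_hmm, x = x)

crit <- qchisq(0.95, df = 1)
eta2_grid <- seq(fit$par[4] - 0.4, fit$par[4] + 0.4, length.out = 81)
Lambda <- sapply(eta2_grid, function(eta2) {
  # re-estimate the nuisance parameters with eta2 = log(lambda_2) held fixed
  pfit <- nlminb(fit$par[-4], function(p) nll_hmm(c(p, eta2), x))
  2 * (pfit$objective - fit$objective)
})
ci_eta2 <- range(eta2_grid[Lambda <= crit])
exp(ci_eta2)                                      # profile CI for lambda_2
```

In practice, the grid width and resolution would be chosen to match the desired precision, and a customized optimizer could replace nlminb in the re-estimation step.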
Last, it is noteworthy that calculating profile CIs for the elements of the TPM is not trivial. Since all elements in each row of the TPM depend on each other, already the definition of the profile log-likelihood is problematic. We apply a rather simplistic approach and compute CIs for each of the off-diagonal working parameters of the TPM. The resulting upper and lower bounds of the profile CI are then transformed to their natural counterparts. However, other approaches may lead to improved results; see, for example, Fischer and Lewis (2021). Note that our approach seems to produce biased results for models with more than two states due to the dependence between the row elements of the TPM. Hence, we advise treating profile CIs for the TPM with care, in particular for HMMs with more than two states.

Bootstrap-based CIs
The last approach for deriving CIs is the bootstrap, which is frequently applied by many practitioners. Efron and Tibshirani (1993) provide a detailed introduction to this technique; a thorough overview of the subject would go beyond the scope of this paper. As pointed out by Härdle et al. (2003), the so-called parametric bootstrap is suitable in the context of time-series models. For further details on the bootstrap for HMMs, including the implementation of a parametric bootstrap, we refer to Zucchini et al. (2016, ch. 3, pp. 56-60).
Basically, all versions of the bootstrap have in common that some kind of resampling procedure needs to be carried out first. Second, the model of interest is re-estimated for each of the resampled data sets. A natural way to accelerate the second part consists in using TMB for the model estimation by means of the procedures presented in Section 4.2. Our Supporting Information contains a detailed example illustrating the implementation of a parametric percentile bootstrap for our two-state Poisson HMM.
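The full example is given in the Supporting Information; the following is a condensed, self-contained sketch of the two steps (simulate from the fitted model, then re-estimate) for a two-state Poisson HMM. All names are ours, and the number of replicates is kept small for illustration.

```r
# Parametric percentile bootstrap sketch for a two-state Poisson HMM.
nll_hmm <- function(par, x) {          # forward-algorithm nll, working scale
  G <- diag(2)
  G[1, 2] <- plogis(par[1]); G[1, 1] <- 1 - G[1, 2]
  G[2, 1] <- plogis(par[2]); G[2, 2] <- 1 - G[2, 1]
  lambda <- exp(par[3:4])
  delta <- solve(t(diag(2) - G + 1), rep(1, 2))   # stationary distribution
  phi <- delta * dpois(x[1], lambda)
  ll <- log(sum(phi)); phi <- phi / sum(phi)
  for (t in 2:length(x)) {
    phi <- (phi %*% G) * dpois(x[t], lambda)
    ll <- ll + log(sum(phi)); phi <- phi / sum(phi)
  }
  -ll
}

sim_hmm <- function(n, G, lambda) {    # simulate one sample from the model
  s <- numeric(n)
  s[1] <- sample(1:2, 1)               # uniform initial state, for simplicity
  for (t in 2:n) s[t] <- sample(1:2, 1, prob = G[s[t - 1], ])
  rpois(n, lambda[s])
}

set.seed(3)
x <- sim_hmm(300, matrix(c(0.9, 0.1, 0.2, 0.8), 2, byrow = TRUE), c(2, 9))
fit <- nlminb(c(qlogis(0.1), qlogis(0.2), log(2), log(9)), nll_hmm, x = x)
G_hat <- diag(2)
G_hat[1, 2] <- plogis(fit$par[1]); G_hat[1, 1] <- 1 - G_hat[1, 2]
G_hat[2, 1] <- plogis(fit$par[2]); G_hat[2, 2] <- 1 - G_hat[2, 1]
lambda_hat <- exp(fit$par[3:4])

B <- 50                                # small replicate number, for speed only
boot <- replicate(B, {
  xb <- sim_hmm(length(x), G_hat, lambda_hat)
  exp(nlminb(fit$par, nll_hmm, x = xb)$par[3:4])   # re-estimated lambda_1, lambda_2
})
apply(boot, 1, quantile, probs = c(0.025, 0.975))  # percentile CIs for the means
```

Replacing the plain-R nll_hmm by the TMB objective (and its gradient) in the re-estimation call is exactly the acceleration discussed above.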

APPLICATION TO DIFFERENT DATA SETS
This section aims to demonstrate the performance of TMB by means of several practical examples that differ in terms of the number of observations and model complexity. These examples include the TYT data shown above, a data set of fetal lamb movements, and simulated data sets. For the performance comparisons, the focus lies on computational speed and the reliability of CIs. The R scripts underlying this section are all available in the Supporting Information and may serve interested users as a starting point for investigating their own HMM settings.

TYT data
We begin by investigating the speed of five approaches for parameter estimation: one without TMB and four with TMB. In the following, the baseline approach denotes direct maximization of the log-likelihood through the optimization function nlminb without TMB. The four TMB-based approaches denote direct maximization without using the gradient and Hessian provided by TMB, with the Hessian only, with the gradient only, and with both gradient and Hessian, respectively.
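In code, the five approaches differ only in which functions are passed to nlminb. The following sketch assumes the MakeADFun object obj_tmb from Section 4.2, initial working parameters par0, data x, and a hypothetical plain-R implementation nll_r of the negative log-likelihood; note that obj_tmb$he requires MakeADFun to have been called accordingly (hessian = TRUE).

```r
# Five estimation calls: plain R versus TMB with different derivative inputs.
fit_plain  <- nlminb(par0, nll_r, x = x)                        # no TMB
fit_tmb0   <- nlminb(par0, obj_tmb$fn)                          # TMB, no derivatives
fit_tmb_h  <- nlminb(par0, obj_tmb$fn, hessian  = obj_tmb$he)   # TMB, Hessian only
fit_tmb_g  <- nlminb(par0, obj_tmb$fn, gradient = obj_tmb$gr)   # TMB, gradient only
fit_tmb_gh <- nlminb(par0, obj_tmb$fn, gradient = obj_tmb$gr,
                     hessian = obj_tmb$he)                      # TMB, both
```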
As a preliminary reliability check of our IT infrastructure and setup, we timed the fitting of our two-state HMM to the TYT data with the help of the microbenchmark package (Mersmann, 2021). For this data set, all five approaches converged to the same optimum and parameter estimates, apart from minor variations typical for numerical optimization (see Table 2). Table 3 shows the resulting average time required for parameter estimation and the number of iterations needed by each approach, measured over 200 replications. The results show that the use of TMB significantly accelerates parameter estimation in comparison with the baseline approach. The most substantial acceleration is achieved by the gradient-only approach, underlining the benefit of using the gradient provided by TMB. Moreover, the approach supplied with both gradient and Hessian requires fewer iterations than the other approaches. However, the evaluation of the Hessian seems to increase the computational burden.
Next, we verified the reproducibility of the acceleration by TMB in a parametric bootstrap setting. More specifically, we simulated 200 bootstrap samples from the model estimated on the TYT data. Then, we reestimated the same model by our five approaches and derived acceleration ratios (taking the baseline approach as reference) and their corresponding percentile CIs. As shown in Table 4, all acceleration ratios take values significantly larger than one, whether the gradient and Hessian are passed from TMB or not. In addition, the findings from the single TYT data set are confirmed, with the gradient-only approach providing the most substantial acceleration and the approach supplied with both gradient and Hessian reducing the number of iterations. This underlines that two factors are sources of the acceleration in Table 4: the use of C++ code on the one hand and the computation of the gradient and/or Hessian by TMB on the other hand.
In order to obtain reliable results, we excluded all bootstrap samples with a simulated state sequence not sojourning in each state of the underlying model at least once. This is necessary because such a constellation almost certainly leads to convergence problems on the one hand. On the other hand, even if the estimation algorithms converge, the estimated models are usually degenerate because of the lack of identifiability. Furthermore, for some very rare bootstrap samples, one or several of the estimation algorithms did not converge properly. In such cases, we discarded the results, generated an additional bootstrap sample, and reran the parameter estimation. Convergence problems mainly occurred for the approaches not supplied with the gradient. Therefore, we recommend passing at least the gradient when optimizing with TMB for increased stability.
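The screening step for the simulated state sequences can be sketched in a few lines of R (the function name is ours, chosen for illustration):

```r
# Keep a bootstrap sample only if its simulated state sequence sojourns in
# every state of the underlying model at least once.
visits_all_states <- function(states, n_states) {
  length(unique(states)) == n_states
}

visits_all_states(c(1, 1, 2, 1), n_states = 2)  # TRUE
visits_all_states(c(1, 1, 1, 1), n_states = 2)  # FALSE
```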
Last, we investigated the CIs obtained for the TYT data by the three different methods described in Section 5, with a single TMB-based approach serving for estimation. The columns to the left in Table 5 show the parameter estimates and the three types of 95% CIs obtained using the Hessian, likelihood profiling, and bootstrapping. For this data set, no major differences between the CIs are visible. Furthermore, we assessed the accuracy of the different CIs by computing coverage probabilities, which are shown in the last three columns of Table 5. For calculating these coverage probabilities, we used a Monte Carlo setting similar to the one described above. Samples that possessed state sequences not visiting all states, or samples for which the estimation algorithm did not converge, were replaced. Moreover, we also simulated a replacement sequence when the profile likelihood method failed to converge on any bound, to ensure comparability of the results. The results, shown on the right of Table 5, indicate that all methods provide comparably reliable CI estimates, and none outperforms the others for all parameters. For the Wald-type CIs, the coverage probabilities almost reach 100% for γ₂₂ and γ₂₁, but lie comparably low for both the other elements of the TPM and λ₁, λ₂. However, profile likelihood-based CIs also take values above 95% for all elements of the TPM, and the coverage probabilities for bootstrap CIs are all above 95%.
Altogether, it appears that the reliability of CIs should be investigated carefully when fitting HMMs to very short sequences of observations.

Lamb data
The well-known data set presented in Leroux and Puterman (1992) serves as our second example. It consists of the number of movements of a fetal lamb observed through ultrasound during 240 consecutive 5-s intervals, as shown in Figure 3. Since the results reported by Leroux and Puterman (1992) show that a two-state model is preferred by the BIC, we focus on this model only, although other choices would be possible; for example, the AIC selects a three-state model. We selected this data set for several reasons. First, the number of observations is larger than for the TYT data (but still comparably low). Second, according to the results of Leroux and Puterman (1992), the first state largely dominates the data-generating process, whereas the second state is not very persistent and linked to only a few observations. Third, the conditional means of the two states are not very different. The latter two aspects qualify these data as a "non-textbook example." Similar to the TYT data, all estimation algorithms converged to the same minimum of the nll and provided almost identical parameter estimates on the original data set. A bootstrap experiment similar to the one described above for the TYT data led to comparable results, as shown in Table 6. The highest acceleration is again achieved by the gradient-only approach, whereas the approach supplied with both gradient and Hessian achieves the lowest acceleration despite requiring a lower number of iterations than the other approaches. However, most ratios lie above those obtained for the TYT data, indicating an increased benefit of using TMB with an increasing number of observations. Next, Table 7 shows parameter estimates and corresponding CIs in the columns to the left. Our estimates confirm the results of Leroux and Puterman (1992): The second state is not very persistent, and the conditional means λ₁ and λ₂ do not lie very far from each other. Concerning the CIs resulting from our three approaches, it is noticeable that the bootstrap CIs for the elements of the TPM are larger than those obtained by the other two approaches. The coverage probabilities presented in the columns to the right of Table 7 show similar patterns as observed for the TYT data. Wald-type CIs show certain deviations from 95% for the parameters related to the hidden state sequence, and the same is true for the profile CIs. Bootstrap CIs seem to be slightly too large for most parameters, leading to values greater than 95%. As for the TYT data, we recommend being aware of the limited reliability of CIs for short sequences of observations. On a minor note, some nonnegligible differences can be observed when comparing our estimation results to those reported by Leroux and Puterman (1992). The reasons for this are difficult to determine, but some likely explanations are given in the following. First, differences in the parameter estimates may result, for example, from the optimization algorithms used and their related settings (e.g., convergence criterion, number of steps, optimization routines available in 1992). Moreover, Leroux and Puterman (1992) seem to base their calculations on an altered likelihood, which is reduced by removing the constant term ∑ᵢ log(xᵢ!) from the log-likelihood. This modification may also have an impact on the behavior of the optimization algorithm, as, for example, relative convergence criteria and step size could be affected.

Simulation study
The two previously analyzed data sets are both of comparably small size. In order to systematically investigate the performance of TMB in the context of larger samples, we carried out a small simulation study. For this study, we simulated sequences of observations of length 2000 and 5000 from Poisson HMMs with two, three, and four states; among the underlying parameters, the conditional means of the four-state model were λ = (1, 5, 9, 13).

FIGURE 4: Plot of the first simulated data set. Note: The solid horizontal lines correspond to the conditional mean of the inferred state at each time. For readability, the graph is truncated to 500 data points. See Table 12 for the estimated parameter values.

FIGURE 5: Plot of the second simulated data (of size 5000) generated by a three-state Poisson HMM. Note: The solid horizontal lines correspond to the conditional mean of the inferred state at each time. For readability, the graph is truncated to 500 data points. See Table 13 for the estimated parameter values.
The bootstrap setting described in the previous sections served for computing acceleration ratios for the simulated data sets. The results for the two-state model suggest employing the Hessian computed by TMB for longer observation sequences, since this method requires a much lower number of iterations than all other approaches. The results for the three-state model, as shown in Table 9, basically underline all findings from the two-state model. Most importantly, the Hessian-based method again requires a much lower number of iterations than the other approaches (Tables 10 and 11).
Similar to the two previous sections, Tables 12 to 15 show parameter estimates, CIs, and coverage probabilities. The four exemplary sequences of observations shown in Figures 4 to 7 served for deriving parameter estimates and CIs. The coverage probabilities result from a Monte Carlo study like the ones previously described. For the two-state model, the CIs obtained by our three methods are very similar for all parameters. The coverage probabilities lie comparably close to the theoretical level of 95% for all parameters of the two-state model. Moreover, no systematically too small or too large CIs seem to result from any of the three methods. The same holds true for the three-state model, with minor exceptions for the profile CIs. For this method, the coverage probabilities of the CIs are close to 100% for diagonal elements of the TPM, while the corresponding probabilities for off-diagonal elements are less than 95%. This pattern amplifies for the four-state models, as shown in Table 15. Moreover, profile likelihood CIs of the elements of the TPM are not consistently provided by the tmbprofile function for the sequences of length 2000 from the four-state model. Therefore, the corresponding entries in Table 14 remain empty. Last, we would like to point out that the runtime of our entire simulation study was approximately 1 week. Without the acceleration by TMB, several months would have been necessary. This illustrates not only the saving of time but also of resources (e.g., energy consumption, wear of IT infrastructure).

FIGURE 7: Plot of the fourth simulated data (of size 5000) generated by a four-state Poisson HMM. Note: The solid horizontal lines correspond to the conditional mean of the inferred state at each time. For readability, the graph is truncated to 500 data points. See Table 15 for the estimated parameter values.

DISCUSSION
In this tutorial, we provide researchers from all applied fields with an introduction to parameter estimation for HMMs via TMB using R. Some procedures need to be coded in C++, which places certain demands on the user and may not pay off if, for example, an HMM needs to be estimated only once. However, for users working more or less regularly with HMMs, the use of TMB accelerates existing parameter estimation procedures without requiring major changes to R code that is already in use. Moreover, after finishing the estimation procedure, TMB provides standard errors for the estimated parameters at very low computational cost. We examined the performance of TMB in the context of Poisson HMMs through two small real data sets and in a simulation setting with longer sequences of observations. Overall, it is notable that, on the one hand, the parameter estimation process is strongly accelerated. This applies even to small data sets, and the highest acceleration is obtained when only the gradient is supplied by TMB (instead of both gradient and Hessian). On the other hand, the standard errors obtained through TMB are very similar to those obtained by profiling the likelihood and bootstrapping, while being (much) less computationally intensive. This is noteworthy because Hessian-based CIs did not seem reliable in the past, as illustrated, for example, by Visser et al. (2000).
Along with the tutorial character of this paper comes the shortcoming that we restricted ourselves to only one comparably simple HMM with Poisson conditional distributions. The extension to other distributions is, however, not overly complicated. We briefly illustrate the case of Gaussian conditional distributions in the Supporting Information. Moreover, to keep the paper at acceptable length, we were not able to address a couple of potential extensions and research questions. For example, it would be interesting to investigate whether supplying derivatives provided by TMB to different optimizers has a positive impact on convergence properties. This may be of particular interest when considering the impact of different potentially poor initial values on the optimization results. Following Bulla and Berzel (2008), one could also consider a hybrid algorithm in this context to benefit from the advantageous properties of the EM algorithm, such as the resilience to poor initial values. When setting up this approach, the EM algorithm would most likely also benefit from acceleration by C++. The reliability of Wald-type CIs provided by TMB for other models and very long sequences with, for example, hundreds of thousands or millions of observations could also be of interest. In addition, improving the reliability of CIs for small samples (such as our TYT and lamb data) may be worth investigating. In a similar direction, it seems rather obvious to benefit from TMB for more complex settings such as panel data with random effects, where computationally efficient routines play an important role. Furthermore, in the bootstrap analysis of the TYT and in particular of the lamb data set, we noted that the TMB-internal function tmbprofile sometimes fails to provide CIs for parameters very close to a boundary. 
For these two data sets, this was the case for the elements of the TPM taking values close to one: Approximately 7% (TYT) and 29% (lamb) of the generated bootstrap samples were affected. The same problem showed for the four-state model with 2000 observations (here, more than 50% of the generated data sets). It remains to clarify whether this problem can be solved by a suitable modification of tmbprofile, or whether the underlying difficulties require an entirely different approach. Last, profile likelihood CIs for the elements of the TPM seem to be subject to bias for models with more than two states. This may be due to the strong dependence between all elements of the TPM in the same row, which makes a proper definition of the profile likelihood function difficult (see, e.g., Fischer & Lewis, 2021). Therefore, we recommend treating profile CIs for the elements of the TPM with care, in particular for models with more than two states; further research in this direction is needed.
From an application perspective, the use of TMB allows executing estimation procedures at a significantly reduced cost compared to plain R. Such a performance gain could be of interest when repeatedly executing statistical procedures on mobile devices. It seems plausible to enrich the TYT app (or similar apps collecting a sufficient amount of data) with a warning system that detects in real time when the user enters a new state, inferred through a suitable HMM or another procedure accelerated via TMB. This new state could, for example, represent an improvement or worsening of a predefined medical condition, and the app could recommend contacting the consulting physician if the change persists for a certain period. Provided the user agrees, the collected information could also be preprocessed and transferred automatically to the treating physician, allowing the identification of personalized treatment options.

A C K N O W L E D G M E N T S
The work of J. Bulla was supported by the GENDER-Net Co-Plus Fund (GNP-182). Furthermore, this work was supported by the Financial Market Fund (Norwegian Research Council project no. 309218). We thank the University of Regensburg and European School for Interdisciplinary Tinnitus Research (ESIT) for providing access to the TYT data. Special thanks go to J. Simões for data preparation, and W. Schlee and B. Langguth for helpful comments and suggestions. J. Bulla's