The basic idea is dirt simple. Beyond that, apparently the OP has an overview. Past that, the math gets quite pure.
E.g., the Markov assumption is that in the underlying process, the past and future are conditionally independent given the present. Fairly quickly, treatments start discussing this assumption in terms of sigma algebras from measure theory instead of just random variables. The sigma algebras are pretty: (a) If we have the sigma algebra generated by a random variable, then the random variable is determined (from memory from 20 years ago; my source was
I. I. Gihman and A. V. Skorohod, The Theory of Stochastic Processes I, ISBN 0-387-06573-3, Springer-Verlag, New York, 1974.
and I lost my copy in a move and have yet to order another from Springer). (b) If we have infinitely many random variables, discussing conditional independence directly is at least clumsy, but the sigma algebra they generate makes talking about conditional independence easy. Where do we get infinitely many random variables? In the continuous time case, the past one second will do. In the discrete time case, the full past of the process, if we assume the process goes back all the way in time.
But, sure, in practice we can't do anything very direct with sigma algebras; that is, they are pretty theory.
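In symbols, for a discrete time process X_1, X_2, ..., the Markov assumption is commonly written

    P(X_{n+1} in A | X_1, ..., X_n) = P(X_{n+1} in A | X_n)

for events A, and the sigma algebra version just replaces the conditioning list X_1, ..., X_n with the sigma algebra those variables generate.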
Here is the whole idea, in a reasonably general case, in brief:
Do some financial planning for the next five years. Make time discrete, say, one point in time for each month. Have a point at the beginning and also at the end, for a total of 61 points.
Build a spreadsheet with one column for each of the 61 points in time and one row for each of the relevant variables. To make this stochastic optimal control, Markov decision theory, or other equivalent names, have some cells in the spreadsheet get their values from the built-in random number generator.
Also, at each of the first 60 points in time, have some variables that are the decisions we get to make.
In the last column, have the value of the business then.
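To make the recipe concrete, here is a minimal sketch in Python of one such toy setup. The state (cash on hand), the decision (fraction of the cash put at risk each month), and all the numbers are made up for illustration, not from any real planning problem:

    import random

    T = 60                                   # decision columns 1..60; column 61 holds the value
    STATES = [0.5 * i for i in range(21)]    # toy state grid: cash from 0.0 to 10.0
    DECISIONS = [0.0, 0.25, 0.5, 0.75, 1.0]  # fraction of cash put at risk each month
    N_SAMPLES = 200                          # Monte Carlo samples per (state, decision)

    def step(cash, u, rng):
        # One month: the fraction u of the cash earns a random return;
        # the rest earns a small safe return.  This is the "random cell".
        shock = rng.gauss(0.01, 0.10)        # hypothetical monthly risky return
        return cash * (1 - u) * 1.002 + cash * u * (1 + shock)

    def terminal_value(cash):
        return cash                          # value of the business in column 61

    def snap(cash):
        # Round a simulated state to the nearest grid point so it can
        # index the tables built below.
        return min(STATES, key=lambda s: abs(s - cash))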
Our mission, and we have to accept it, is to determine the decisions (values in the decision cells) that maximize the value of the business at the end. Ah, since we have the random cells, typically we want to maximize the expected (average) value at the end.
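In symbols, writing V for the value in the last column:

    choose decisions u_1, ..., u_60, each a function of the state at its column, to maximize E[V]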
So, we start at column 60 and consider all the possible data we could have in column 60. Sure, that data fills maybe a few exabytes. Then we pick values for the decisions in column 60 and reevaluate the spreadsheet many times, letting the random cells generate random numbers, average, and see what average value we get in column 61. We do this for all possible decisions. Sure, now we are keeping busy a few million square feet of computing. But the work is highly parallel. That's about what we would have to do using a general purpose program, but in practice we would look at the specific problem, note special structure (properties, assumptions, simplifications), and get some speedups.
To be more clear, at column 60 set aside the random cells and the decision cells -- the rest hold the state at column 60. So, for each possible state, we try all the possible decisions, and for each decision we reevaluate many times, get the expected value of the business in column 61, and pick the best decision. So, there in column 60, for each of the possible states, we get the best decision and the corresponding expected business value in column 61. We store this -- that is, we store the best decision and the expected value for each state. Here we fill some exabytes of storage.
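Continuing the toy Python sketch from above, the column 60 step might look like this (V61, V60, and best_u60 are names made up here for the stored tables):

    # Column 60: for each possible state, try every decision, estimate the
    # expected column-61 value by Monte Carlo, and keep the best decision.
    rng = random.Random(0)
    V61 = {s: terminal_value(s) for s in STATES}
    V60, best_u60 = {}, {}
    for s in STATES:
        best = (float("-inf"), None)
        for u in DECISIONS:
            ev = sum(V61[snap(step(s, u, rng))]
                     for _ in range(N_SAMPLES)) / N_SAMPLES
            best = max(best, (ev, u))
        V60[s], best_u60[s] = best           # tiny here, exabytes in general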
Now we back up to column 59. For each state we try all the possible decisions. For each of those decisions we recalculate, that is, exercise the random number generators, see what state we get in column 60, and look up the expected column-61 business value we stored for that state. So, for each state in column 59 we find the best decision, and again we store the best decision and its expected value for each state.
Now we continue in this way back to columns 58, 57, ..., 1. In column 1 we know the state we are in because that is our present. So our calculation for column 1 tells us what decision to make there in column 1.
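In the toy sketch the whole backward pass is just that step in a loop, reusing rng from the column 60 snippet; policy[t][s] is the tabulated best decision at column t when the state is s:

    # Backward induction over columns 60, 59, ..., 1, starting from the
    # known values in column 61.
    V = {s: terminal_value(s) for s in STATES}
    policy = {}
    for t in range(T, 0, -1):
        Vt, Pt = {}, {}
        for s in STATES:
            evs = [(sum(V[snap(step(s, u, rng))]
                        for _ in range(N_SAMPLES)) / N_SAMPLES, u)
                   for u in DECISIONS]
            Vt[s], Pt[s] = max(evs)
        V, policy[t] = Vt, Pt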
Then, in reality, soon we get to column 2, see what state we are in, and use the decision we tabulated for that state. We continue with columns 3, 4, ..., 60, make our decisions in column 60, and see what business value we get in column 61.
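And the forward pass through reality, again in the toy sketch (the starting cash of 5.0 is arbitrary):

    # Forward: at each column, observe the state, look up the tabulated
    # decision, and let reality's random numbers take us to the next column.
    cash = 5.0
    live = random.Random(12345)              # "reality's" randomness
    for t in range(1, T + 1):
        cash = step(cash, policy[t][snap(cash)], live)
    print("business value in column 61:", terminal_value(cash))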
The whole thing is the same, except much faster, if we have no random cells.
There are also continuous time treatments, essentially the same intuitively.
For some books:
Dimitri P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models.
George L. Nemhauser, Dynamic Programming.
Stuart E. Dreyfus and Averill M. Law, The Art and Theory of Dynamic Programming.
E. B. Dynkin and A. A. Yushkevich, Controlled Markov Processes.
Wendell H. Fleming and Raymond W. Rishel, Deterministic and Stochastic Optimal Control.
There is more from R. T. Rockafellar and R. J.-B. Wets.
And of course we should mention R. Bellman, the father of this subject.
Thanks for the explanation! I'm actually working on a tool right now that uses a similar process, but haven't had much of a formal background. Looks like I'll be adding quite a bit to my reading list!