FWIW, the "Speech Synthesis Markup Language" (SSML) standard has sections on "3.2 Prosody and Style" (https://www.w3.org/TR/speech-synthesis11/#S3.2) and the "3.1.10 phoneme Element" (https://www.w3.org/TR/speech-synthesis11/#edef_phoneme), which can enable quite granular control.

Of course, whether a particular speech synthesis system actually supports such features is another matter.

The `prosody` element also has a `contour` attribute for specifying the pitch contour: https://www.w3.org/TR/speech-synthesis11/#pitch_contour
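For a rough idea of what that looks like, here's a minimal Python sketch that turns a list of points into a `<prosody contour="...">` element. The function name `contour_ssml` and the choice of Hz offsets are my own assumptions for illustration; the spec also allows other pitch targets (e.g. percentages or semitones), and an engine may simply ignore the attribute.

```python
from xml.sax.saxutils import escape, quoteattr


def contour_ssml(text, points):
    """Wrap `text` in an SSML prosody element with a pitch contour.

    `points` is a list of (position_percent, pitch_change_hz) pairs,
    e.g. (50, -10) means "at 50% of the duration, 10 Hz below baseline".
    (Hypothetical helper, not part of any SSML library.)
    """
    # SSML contour syntax: space-separated "(position%,target)" pairs.
    contour = " ".join(f"({pos:g}%,{hz:+g}Hz)" for pos, hz in points)
    return (
        '<speak version="1.1" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f"<prosody contour={quoteattr(contour)}>{escape(text)}</prosody>"
        "</speak>"
    )


print(contour_ssml("Really?", [(0, 20), (50, -10), (100, 40)]))
```

The points here are made up; in practice they'd come from wherever the contour is defined (a drawing tool, pitch tracking output, etc.).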

Coincidentally, I pretty much only know any of this because a few years ago I built a GUI for a client that let an assistive technology researcher "draw" the pitch contour required for a word/phrase (not unlike the project demonstrated in your video :) ), from which the SSML was then generated.