OpenDroneMap uses a text file in which you list, for each image name, pixel positions together with the GPS coordinates that correspond to those pixels.
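To make the file idea concrete, here is a minimal sketch of writing such a file from Python. The layout is an assumption based on my reading of the OpenDroneMap ground control point documentation: a first line naming the coordinate system, then one line per point with `geo_x geo_y geo_z im_x im_y image_name`. Please check it against the docs for your ODM version. The names `GroundControlPoint` and `format_gcp_file` and the image name `DJI_0001.JPG` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class GroundControlPoint:
    """One known point as seen in one image (hypothetical helper type)."""
    lon: float   # geo_x: longitude when the header is EPSG:4326
    lat: float   # geo_y: latitude when the header is EPSG:4326
    alt: float   # geo_z: elevation in metres
    px: float    # im_x: pixel column in the image
    py: float    # im_y: pixel row in the image
    image: str   # image file name


def format_gcp_file(points, srs="EPSG:4326"):
    """Return the text of a control point file.

    Assumed layout: a coordinate system header line, then one
    space-separated line per point.
    """
    lines = [srs]
    for p in points:
        lines.append(
            f"{p.lon:.8f} {p.lat:.8f} {p.alt:.3f} "
            f"{p.px:.1f} {p.py:.1f} {p.image}"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [GroundControlPoint(10.5, 59.9, 12.0, 100.0, 200.0, "DJI_0001.JPG")]
    print(format_gcp_file(demo), end="")
```

The same physical point normally appears in several images, so you would add one `GroundControlPoint` per image in which it is visible.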
Pitch, yaw, and gimbal direction can probably help you optimize those algorithms. I am thinking of using OpenCV to detect a large “chess board” in the images; from there it should be easy to get both the GPS coordinates and the pixel positions in every image that shows this calibration board.