
IMO the wrapper products all suffer from the same problem. The LLM is trained for a specific set of tasks, such as chat, coding, image understanding, image/video generation, and tool use in support of those. If you suddenly ask the LLM to do something it was not trained for, such as producing PowerPoint decks, you get a few surprisingly successful results followed by a large amount of crap. There is no reason for customers or your own team to expect the underlying model to improve unless token usage is so massive that it motivates training investment in that area.

LLMs are a facsimile of general intelligence on tasks that are similar to their training set and that can be solved within a finite context length. If you are outside the training set, you will get poor results. Conversely, if you are inside the training set, the foundation model vendor will already have a great product to sell you (Claude Code, ChatGPT, etc.).
