This is the one thing I didn't think about. But text-to-speech barely takes any hardware.
And I'm sure modern GPUs can handle image generation.
Then just don't use it!
Yes, they could. I dabbled with Flux and Stable Diffusion for a while, more on an engine basis than anything. For a roughly 1200 x 1200 picture it took me around 60-180 seconds, while my PC was at its limit; I couldn't even watch YouTube. You could argue "but you only have an RTX 3060 + 32 GB RAM", and that's fair, so maybe a 5090 can do a bit more. But I'm guessing you'd still end up around 120 s with your PC at its limit.
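For anyone curious what that kind of local run actually looks like, here's a minimal sketch using the diffusers library. The model name, resolution and step count are just assumptions for illustration, not my exact setup or a benchmark:

```python
# Rough sketch of local image generation with Hugging Face diffusers.
# Model ID and settings are assumptions; swap in whatever checkpoint you use.
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint in half precision to fit consumer VRAM.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",  # assumed model, ~5 GB download
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Attention slicing trades a bit of speed for lower VRAM use on smaller cards.
pipe.enable_attention_slicing()

# One image; pushing the resolution up drives both time and VRAM up fast.
image = pipe(
    "a lighthouse at dusk",
    height=768,
    width=768,
    num_inference_steps=30,
).images[0]
image.save("out.png")
```

Even at 768 x 768 this pegs the GPU for the whole run, which is why everything else on the machine stutters while it generates.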
So take a guess whether modern PCs can do this.
AIs, or rather PCs, aren't built to run any of this locally. First you need a lot of cloud or local storage, think 50-100 GB just for the models and so on, if you want decent results. Then you probably need a similar amount of VRAM. Yeah, good luck finding that card.
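If you want to see where your own card lands against those numbers, a quick check like this (plain PyTorch, nothing fancy) prints what you actually have to work with:

```python
import torch

# Print the name and total VRAM of the first CUDA GPU, if any is present.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB VRAM")
else:
    print("No CUDA GPU found")
```

A 3060 reports 12 GiB, which is a long way from the tens of gigabytes the bigger models want to hold in memory.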