Not a video encoding expert, but for live streams you can't merge the output until you process all the N parts, so you introduce delays. And if any part of the input pipeline, like an overlay containing a logo or text, is generated dynamically i.e. not a static mp4, it basically counts as a live stream.
Why not cut the image in rectangles and process those simultaneously? Wouldn't that work for live streams? (There may be artefacts at the seams though?)