I'm a little surprised the Euclidean Cluster Extraction technique works at 5fps! My first recommendation is to decimate the point cloud so you keep only 10% of the points, then feed that into PCL and see if you get enough of a speedup to make it usable.
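A minimal sketch of the decimation step, assuming the cloud is just a flat array of points. `Point3` and `decimate` are names I made up for illustration, not part of PCL. It keeps every Nth point, so a stride of 10 gives you roughly 10%:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical point type for this sketch; swap in whatever your cloud uses.
struct Point3 { float x, y, z; };

// Keep every `stride`-th point. stride = 10 keeps roughly 10% of the cloud.
std::vector<Point3> decimate(const std::vector<Point3>& cloud, std::size_t stride) {
    std::vector<Point3> out;
    if (stride == 0) return out;  // avoid an infinite loop on a bad stride
    out.reserve(cloud.size() / stride + 1);
    for (std::size_t i = 0; i < cloud.size(); i += stride) {
        out.push_back(cloud[i]);
    }
    return out;
}
```

If you're already inside PCL, its `pcl::VoxelGrid` filter is probably a better way to downsample, since it thins the cloud evenly in space instead of by array order.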
In theory, if you implemented ECE yourself you could optimize the way the kd-tree is built to avoid rebuilding it from scratch every frame, but I have a feeling that's outside of the scope here.
If the 10% approach doesn't work, I would start looking into more heuristic solutions. If you know everyone is going to be upright (not lying on the floor), I suggest taking a projection of the point cloud from above and finding the places where the points are concentrated: a standing person collapses into a dense blob when seen from overhead. This is what I do when I use ofxVirtualKinect: instead of using the frontal projection, I mostly use the overhead perspective.
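To make the overhead idea concrete, here is a rough sketch in plain C++ (not the ofxVirtualKinect code). It assumes y is the up axis, so the floor plane is x/z, and it bins the points into a grid and counts how many land in each cell. `Point3`, `projectFromAbove`, and `denseCells` are hypothetical names, and the bounds and threshold are things you would tune for your room:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical point type for this sketch; y is assumed to be "up".
struct Point3 { float x, y, z; };

// Count how many points fall into each cell of a cols x rows grid laid over
// the floor plane (x, z). The height (y) is simply ignored.
// Returns counts in row-major order: index = row * cols + col.
std::vector<int> projectFromAbove(const std::vector<Point3>& cloud,
                                  float minX, float maxX,
                                  float minZ, float maxZ,
                                  int cols, int rows) {
    if (cols <= 0 || rows <= 0 || maxX <= minX || maxZ <= minZ) return {};
    std::vector<int> counts(static_cast<std::size_t>(cols) * rows, 0);
    for (const Point3& p : cloud) {
        // Skip points outside the area we care about.
        if (p.x < minX || p.x >= maxX || p.z < minZ || p.z >= maxZ) continue;
        int col = static_cast<int>((p.x - minX) / (maxX - minX) * cols);
        int row = static_cast<int>((p.z - minZ) / (maxZ - minZ) * rows);
        // Guard against float rounding pushing us onto the upper edge.
        col = std::min(col, cols - 1);
        row = std::min(row, rows - 1);
        ++counts[static_cast<std::size_t>(row) * cols + col];
    }
    return counts;
}

// Indices of cells holding at least `threshold` points: candidate people.
std::vector<std::size_t> denseCells(const std::vector<int>& counts, int threshold) {
    std::vector<std::size_t> cells;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        if (counts[i] >= threshold) cells.push_back(i);
    }
    return cells;
}
```

In practice you would treat the counts as a grayscale image and run blob detection on it rather than thresholding single cells, but the binning step is the same.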

That addon was written before ofxCv existed, but you could do it in less code now: