If detecting what is said isn't important, what's the most accurate way of knowing when someone is talking?
Well, human voices typically have a fundamental frequency somewhere between about 70 and 400 Hz, so you can look for spectral peaks in that band. If you want more accuracy than that, you can analyze the spectrum for formants, the resonance peaks of the vocal tract.
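To make the "look for spectral peaks there" idea concrete, here is a minimal sketch that measures how much of a short frame's energy sits in the 70-400 Hz band. The function names, the Hann window, and the 0.5 threshold are my own assumptions for illustration, not a tuned detector, and it uses a plain DFT over just the bins of interest so that it needs nothing beyond the standard library.

```python
import cmath
import math


def band_energy_ratio(frame, sample_rate, f_lo=70.0, f_hi=400.0):
    """Return the fraction of the frame's energy between f_lo and f_hi (Hz)."""
    n = len(frame)
    # Hann window to limit spectral leakage from the frame edges.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]
    total = sum(s * s for s in windowed)
    if total == 0.0:
        return 0.0  # pure silence: nothing to measure

    # DFT bin indices that fall inside the band.
    k_lo = math.ceil(f_lo * n / sample_rate)
    k_hi = math.floor(f_hi * n / sample_rate)

    band = 0.0
    for k in range(k_lo, k_hi + 1):
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * i / n)
            for i, s in enumerate(windowed)
        )
        band += abs(coeff) ** 2

    # Parseval: the squared magnitudes over all n bins sum to n * total, and
    # for real input each positive-frequency bin has a mirror image.
    return 2.0 * band / (n * total)


def looks_voiced(frame, sample_rate, threshold=0.5):
    """Crude check: is most of the energy in the typical pitch band?

    The threshold is an assumed value and would need tuning on real audio.
    """
    return band_energy_ratio(frame, sample_rate) >= threshold
```

You would call `looks_voiced` on successive short frames (for example 512 samples at 8 kHz) and treat runs of positive frames as candidate speech. Keep in mind that anything else with energy in that band, such as mains hum or music, will pass this check too, which is where the formant analysis comes in.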
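If the person isn't articulating and is just holding one sound, like "ooooooo", "eeeeeeeee", or "ssssssss", your job gets much harder. A steady "ssssssss" is unvoiced, so it has no fundamental in that band to find at all.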