A job is split up into several subjobs/instances. Each subjob/instance processes 0 or more agenda items (or frames).
A frame is considered failed if any of its fail conditions are met. The most common is a non-zero exit status, but other conditions can also cause a frame to be considered failed.
A subjob is considered failed if the backend for the job fails to start or if any of its other fail conditions are met. If autowrangling is turned on, however, failed subjobs get automatically migrated to another machine.
"Retry Frame/Work" will retry a failed frame up to n times.
"Retry Subjob" will retry a failed subjob/instance up to n times.
If you want to retry a frame when it fails, use "Retry Frame/Work". Also check whether your interface gives you a "Retry Work Delay" field. If it does, setting it to 3, for example, will give you a 3-second delay between retries. If it doesn't, you'll need to upgrade to a newer version of Qube to get it.
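To make the retry-with-delay behaviour concrete, here is a minimal Python sketch of the idea. It is not Qube's implementation and does not use the Qube API. The function name and its parameters are hypothetical, and it assumes that "up to n times" means the initial attempt plus up to n retries, with a non-zero exit status as the only fail condition.

```python
import subprocess
import time


def run_frame_with_retries(command, retries, delay_seconds=0.0,
                           run=subprocess.call, sleep=time.sleep):
    """Run a frame's command, retrying on failure.

    Hypothetical illustration only, not Qube's own logic.
    `retries` plays the role of "Retry Frame/Work" and `delay_seconds`
    the role of "Retry Work Delay". `run` and `sleep` can be swapped
    out, which makes the function easy to try without a real renderer.

    Returns (succeeded, attempts_made).
    """
    attempts = 0
    while True:
        attempts += 1
        exit_status = run(command)
        if exit_status == 0:
            # A zero exit status is treated as success.
            return True, attempts
        if attempts > retries:
            # The initial attempt and every allowed retry have failed.
            return False, attempts
        # Wait before the next retry.
        sleep(delay_seconds)
```

With retries=3 and delay_seconds=3, a frame that fails twice and then succeeds is run three times in total, with a 3-second pause before each of the two retries.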