In model inference or data transmission, latency refers to the total time from sending a request to receiving the response. It is a core metric of an inference service's real-time performance, influenced by factors such as model complexity, hardware capability, and network conditions.
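
As a minimal sketch, end-to-end latency can be measured by timestamping just before the request is sent and just after the response arrives. The `fake_inference` function below is a hypothetical stand-in for a real model call, used only to illustrate the measurement pattern:

```python
import time

def measure_latency(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single request/response cycle."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Hypothetical stand-in for a model inference or network call.
def fake_inference(prompt):
    time.sleep(0.01)  # simulate compute/transmission delay
    return "response"

result, latency = measure_latency(fake_inference, "hello")
print(f"latency: {latency * 1000:.1f} ms")
```

In practice, a single measurement is noisy; production systems typically report percentile latencies (e.g. p50, p99) over many requests rather than one sample.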



