This study explores the dynamic allocation and measurement of visual-perceptual and cognitive workload in human-robot interaction, combining a novel workload allocation algorithm with an affective prediction algorithm and evaluating both in a user study. The research leverages the "Husformer," a multi-modal framework built on cross-modal transformers, to fuse data from biosensors and behavioral sensors and improve estimation of human states during interaction tasks. Building on prior studies that correlate cognitive workload with changes in GUI complexity and object motion, the approach is validated through user experiments and used to adapt task allocation to each individual's cognitive load.
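The Husformer itself is described in the cited work; purely as an illustration of the cross-modal attention pattern such a framework relies on, the sketch below shows two sensor streams (physiological and behavioral) projecting into a shared space, attending to one another, and feeding a pooled workload prediction head. All module and parameter names here (e.g., `WorkloadFusion`, `physio_dim`) are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): cross-modal attention fusion of two
# sensor streams, in the spirit of cross-modal transformer architectures.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """One modality (query) attends to another modality (key/value)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Residual cross-attention: the query stream is enriched with the context stream.
        attended, _ = self.attn(query, context, context)
        x = self.norm(query + attended)
        return self.norm(x + self.ff(x))


class WorkloadFusion(nn.Module):
    """Hypothetical two-modality fusion head predicting a scalar workload score."""

    def __init__(self, physio_dim: int, behavior_dim: int, dim: int = 64):
        super().__init__()
        self.proj_physio = nn.Linear(physio_dim, dim)
        self.proj_behavior = nn.Linear(behavior_dim, dim)
        self.physio_to_behavior = CrossModalBlock(dim)
        self.behavior_to_physio = CrossModalBlock(dim)
        self.head = nn.Linear(2 * dim, 1)

    def forward(self, physio: torch.Tensor, behavior: torch.Tensor) -> torch.Tensor:
        # physio: (batch, T1, physio_dim); behavior: (batch, T2, behavior_dim)
        p = self.proj_physio(physio)
        b = self.proj_behavior(behavior)
        p_fused = self.physio_to_behavior(p, b)   # biosensor stream attends to behavior
        b_fused = self.behavior_to_physio(b, p)   # behavior stream attends to biosensors
        pooled = torch.cat([p_fused.mean(dim=1), b_fused.mean(dim=1)], dim=-1)
        return self.head(pooled)                  # predicted workload score


# Example: 8-channel physiological features, 4-channel behavioral features.
model = WorkloadFusion(physio_dim=8, behavior_dim=4)
score = model(torch.randn(2, 100, 8), torch.randn(2, 50, 4))
print(score.shape)  # torch.Size([2, 1])
```

In a closed-loop setting, a score of this kind could feed the workload allocation algorithm, shifting tasks away from the operator when predicted load rises; the actual coupling used in the study is defined by the authors' algorithms, not by this sketch.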