This research focuses on leveraging Spot's advanced hardware and basic behaviors to develop a robust data collection method for robot learning from task demonstrations, such as "putting a chair next to a table." From each demonstration, a voxel map encoding the robot's state and the task goal is constructed; these 3D colored voxel observations are used to train a modified version of PerAct, a language-conditioned transformer that operates on voxelized input. The dataset comprises around 25 varied demonstrations collected with Spot, each with comprehensive sensor recordings used to train the model, emphasizing the practical application of the approach to robotic manipulation and navigation.
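
The voxelization step is central to this data pipeline. Below is a minimal sketch of how a colored point cloud from one demonstration frame could be discretized into the kind of 3D colored voxel grid described above; the function name, workspace bounds, and grid resolution are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

def voxelize_colored_cloud(points, colors, bounds, grid_size=100):
    """Discretize a colored point cloud into a dense colored voxel grid.

    points : (N, 3) XYZ coordinates in the robot/world frame (meters).
    colors : (N, 3) RGB values in [0, 1] aligned with `points`.
    bounds : ((x_min, y_min, z_min), (x_max, y_max, z_max)) workspace box.
    grid_size : number of voxels per axis (illustrative default).
    Returns a (grid_size, grid_size, grid_size, 4) array holding RGB plus
    an occupancy flag per voxel.
    """
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)

    # Keep only points inside the workspace bounding box.
    mask = np.all((points >= lo) & (points < hi), axis=1)
    pts, rgb = points[mask], colors[mask]

    # Map metric coordinates to integer voxel indices.
    voxel_size = (hi - lo) / grid_size
    idx = np.floor((pts - lo) / voxel_size).astype(int)
    idx = np.clip(idx, 0, grid_size - 1)

    grid = np.zeros((grid_size, grid_size, grid_size, 4), dtype=np.float32)
    # Last write wins for color; occupancy is set wherever any point falls.
    grid[idx[:, 0], idx[:, 1], idx[:, 2], :3] = rgb
    grid[idx[:, 0], idx[:, 1], idx[:, 2], 3] = 1.0
    return grid

# Example: voxelize one hypothetical demonstration frame within a 1 m cube workspace.
if __name__ == "__main__":
    points = np.random.uniform(0.0, 1.0, size=(5000, 3))
    colors = np.random.uniform(0.0, 1.0, size=(5000, 3))
    grid = voxelize_colored_cloud(points, colors, ((0, 0, 0), (1, 1, 1)))
    print(grid.shape, int(grid[..., 3].sum()), "occupied voxels")
```

In a full pipeline, the point cloud would come from the robot's depth cameras and the resulting grid would be augmented with robot-state and goal channels before being fed to the model; those details are omitted here.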