Human-robot collaboration (HRC) is an emerging solution to the construction industry’s productivity and safety challenges. Seamless HRC requires robots to understand the structure of construction tasks. However, the implicit and dynamic flow of construction tasks makes this difficult. To address this challenge, this study proposes a vision-based, multi-granularity method for learning task primitives. The study seeks to enhance mutual understanding between workers and robots by determining which granularity level is best suited to task understanding. Results show that the intermediate level offers the best compromise between classification performance and embedded task knowledge. These outcomes are expected to improve the smoothness of collaboration in an HRC team.