While testing your repo, I found a problem at line 72 of nn_matching.h.
When nn_cosine_distance is called in the for loop (line 55), the .cpu() call takes more than 22000 microseconds when i == 0, but only 10-20 microseconds for every other index. If there is a way to reduce the cost of that first call, performance would improve dramatically.
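
For reference, here is a minimal sketch of how the per-iteration timing can be taken, and how to check whether the slowdown is a one-time first-call cost. It does not use the repo's code: `time_each_call_us` and `FakeDeviceCopy` are hypothetical names, and `FakeDeviceCopy` is only a stand-in that sleeps on its first call to imitate what I see with .cpu(). I have not confirmed the real cause.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Time each of the n calls fn(0), fn(1), ... separately, in microseconds.
// (Hypothetical helper, not part of the repo.)
template <typename Fn>
std::vector<std::int64_t> time_each_call_us(Fn&& fn, int n) {
    std::vector<std::int64_t> elapsed;
    elapsed.reserve(n);
    for (int i = 0; i < n; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn(i);
        const auto stop = std::chrono::steady_clock::now();
        elapsed.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
                .count());
    }
    return elapsed;
}

// Hypothetical stand-in for the .cpu() call: the first call pays a one-time
// 20 ms cost, later calls return immediately. This is an assumption used to
// imitate the observed behaviour, not the actual cause.
struct FakeDeviceCopy {
    bool warmed_up = false;

    void operator()(int /*index*/) {
        if (!warmed_up) {
            std::this_thread::sleep_for(std::chrono::milliseconds(20));
            warmed_up = true;
        }
    }
};
```

Timing a fresh `FakeDeviceCopy` with `time_each_call_us(copy, 3)` shows the same pattern as in the report: a large first entry followed by small ones. If one untimed warm-up call before the loop makes the i == 0 entry drop to the level of the others, the cost is paid once per process and could be moved out of the hot loop, for example to start-up.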