I would like to optimize this loop because its performance is very poor. On each iteration I call a kernel that simply splits the nodes into two lists: one containing the nodes that have at least one edge pointing to a node in the current list of leaves, and one containing all the other nodes. I repeat this until I reach the root node. As a result there is a lot of allocation and deallocation on every iteration, and I doubt this is the best way to do it (surely not):
[code]
bool flag = true;
while (flag) {
    // I take the current reference of "leaves" and "nonLeaves"
    copyArrayHostToDevice(maxBis[index], d_oldLeaves, allLen[index]);
    copyArrayHostToDevice(nonLeaves, d_oldNonLeaves, (allLen[index]-1));

    // Copy back to the host
    cudaMemcpy(&allLen[index], lastLen, sizeof(int), cudaMemcpyDeviceToHost);
    copyArrayDeviceToHost(d_localLeaves, maxBis[index], allLen[index]);
    copyArrayDeviceToHost(d_localNonLeaves, nonLeaves, (counterNonLeaves/2));

    // Check to see if I arrived at the end of the cycle
    if (allLen[index] == 1) {
        index++;
        flag = false;
    }

    cudaFree(d_localNonLeaves);
    cudaFree(d_localLeaves);
    cudaFree(d_oldLeaves);
    cudaFree(d_oldNonLeaves);
    cudaFree(lastLen);
}
[/code]
Assume that a preprocessing step has already run before this loop, storing the starting leaves in [code]maxBis[0][/code] and the other nodes in [code]nonLeaves[/code].
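
For illustration, here is a minimal sketch of one direction that might reduce the overhead: allocate every device buffer once before the loop, keep the node lists on the device between iterations, and copy back only the two list lengths each time. Everything in it is an assumption rather than my real code: the tree is stored as a hypothetical `parent[]` array (`-1` for the root), and `markParents`, `splitNodes` and `climbToRoot` are made-up names. Error checking is omitted to keep it short.

```cuda
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical layout: parent[i] is the parent of node i, -1 for the root.
// Flags the parent of every node in the current leaf list.
__global__ void markParents(const int* leaves, int numLeaves,
                            const int* parent, int* mark) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numLeaves) {
        int p = parent[leaves[i]];
        if (p >= 0) mark[p] = 1;  // benign race: every writer stores 1
    }
}

// Splits the remaining nodes: flagged ones become the new leaves,
// the rest stay non-leaves. Output order is not deterministic.
__global__ void splitNodes(const int* nodes, int numNodes, const int* mark,
                           int* newLeaves, int* newNonLeaves, int* counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numNodes) {
        int n = nodes[i];
        if (mark[n]) newLeaves[atomicAdd(&counts[0], 1)] = n;
        else         newNonLeaves[atomicAdd(&counts[1], 1)] = n;
    }
}

// Returns the number of iterations; lastLevel receives the final leaf list.
int climbToRoot(const std::vector<int>& parent, const std::vector<int>& leaves,
                const std::vector<int>& nonLeaves, std::vector<int>& lastLevel) {
    const int n = static_cast<int>(parent.size());
    int numLeaves = static_cast<int>(leaves.size());
    int numNonLeaves = static_cast<int>(nonLeaves.size());
    const size_t bytes = n * sizeof(int);

    // Allocate once, sized for the worst case (n nodes), and reuse.
    int *d_parent, *d_mark, *d_leaves, *d_nonA, *d_nonB, *d_counts;
    cudaMalloc(&d_parent, bytes);
    cudaMalloc(&d_mark, bytes);
    cudaMalloc(&d_leaves, bytes);
    cudaMalloc(&d_nonA, bytes);
    cudaMalloc(&d_nonB, bytes);
    cudaMalloc(&d_counts, 2 * sizeof(int));

    cudaMemcpy(d_parent, parent.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_mark, 0, bytes);
    cudaMemcpy(d_leaves, leaves.data(), numLeaves * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_nonA, nonLeaves.data(), numNonLeaves * sizeof(int),
               cudaMemcpyHostToDevice);

    const int threads = 256;
    int levels = 0;
    while (numLeaves > 0 && numNonLeaves > 0) {
        cudaMemset(d_counts, 0, 2 * sizeof(int));
        markParents<<<(numLeaves + threads - 1) / threads, threads>>>(
            d_leaves, numLeaves, d_parent, d_mark);
        // d_leaves is only read by markParents, which has finished before
        // splitNodes starts (same stream), so it can hold the new leaves.
        splitNodes<<<(numNonLeaves + threads - 1) / threads, threads>>>(
            d_nonA, numNonLeaves, d_mark, d_leaves, d_nonB, d_counts);

        // Only the two lengths cross the bus on each iteration.
        int counts[2];
        cudaMemcpy(counts, d_counts, sizeof(counts), cudaMemcpyDeviceToHost);
        numLeaves = counts[0];
        numNonLeaves = counts[1];
        std::swap(d_nonA, d_nonB);  // ping-pong instead of free + malloc
        ++levels;
    }

    lastLevel.resize(numLeaves);
    cudaMemcpy(lastLevel.data(), d_leaves, numLeaves * sizeof(int),
               cudaMemcpyDeviceToHost);

    cudaFree(d_parent); cudaFree(d_mark); cudaFree(d_leaves);
    cudaFree(d_nonA); cudaFree(d_nonB); cudaFree(d_counts);
    return levels;
}
```

In this sketch only the last level is copied back to the host. If every level is needed on the host, as with [code]maxBis[index][/code] in my loop, each level could still be copied out inside the loop without reallocating anything.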