Depth-First Search (DFS) is a pervasive algorithm, often used as a building block for topological sort, connectivity and planarity testing, among many other applications. We propose a novel work-efficient parallel algorithm for the DFS traversal of directed acyclic graph (DAG). The algorithm traverses the entire DAG in a BFS-like fashion no more than three times. As a result it finds the DFS pre-order (discovery) and post-order (finish time) as well as the parent relationship associated with every node in a DAG. We analyse the runtime and work complexity of this novel parallel algorithm. Also, we show that our algorithm is easy to implement and optimize for performance. In particular, we show that its CUDA implementation on the GPU outperforms sequential DFS on the CPU by up to 6x in our experiments.