Heart sound segmentation, which aims at detecting the first and second heart sounds in the phonocardiogram, is an essential step in the automatic analysis of heart valve diseases. Recently, neural network-based methods have demonstrated promising performance in segmenting heart sound data. However, these methods suffer from serious limitations owing to the envelope features they rely on: such features cannot effectively model the intrinsic sequential characteristics of heart sounds, resulting in poor utilization of the duration information of heart cycles. In this paper, we propose a Duration Long Short-Term Memory network (Duration LSTM) that addresses this problem by incorporating duration features. The proposed method is evaluated on a real-world phonocardiogram dataset (the Massachusetts Institute of Technology heart sounds database) and compared with two representative state-of-the-art methods. The experimental results demonstrate that the proposed method achieves promising performance across different tolerance windows. In addition, the proposed model shows advantages with respect to the impact of recording length and the end effect.
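
To make the core idea concrete, the sketch below shows one plausible way to incorporate duration features into an LSTM-based frame classifier: per-frame envelope features are concatenated with per-frame duration features and fed to a bidirectional LSTM that predicts the segmentation state (S1, systole, S2, diastole) for each frame. The abstract does not specify the exact architecture, so the feature choices, dimensions, and the simple concatenation scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DurationLSTMSketch(nn.Module):
    """Minimal sketch: envelope + duration features -> BiLSTM -> frame-wise state logits."""
    def __init__(self, env_dim=4, dur_dim=2, hidden=64, n_states=4):
        super().__init__()
        # n_states: S1, systole, S2, diastole frame labels
        self.lstm = nn.LSTM(env_dim + dur_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_states)

    def forward(self, envelope_feats, duration_feats):
        # envelope_feats: (batch, frames, env_dim), e.g. Hilbert/homomorphic envelopes
        # duration_feats: (batch, frames, dur_dim), e.g. elapsed time in the current
        # state and an estimated heart-cycle length (hypothetical choices)
        x = torch.cat([envelope_feats, duration_feats], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out)  # per-frame state logits

# Usage: random tensors standing in for one 10-second recording at 50 frames/s.
model = DurationLSTMSketch()
logits = model(torch.randn(1, 500, 4), torch.randn(1, 500, 2))
print(logits.shape)  # torch.Size([1, 500, 4])
```

Concatenating duration information at the input is only one possible design; the key point it illustrates is that the recurrent model can then condition its per-frame predictions on how long the current state and heart cycle have lasted, which pure envelope features do not convey.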