Improving Automated Requirements Trace Retrieval Through Term-Based Enhancement Strategies
Requirements traceability is concerned with managing and documenting the life of requirements. Its primary goal is to support critical software development activities such as evaluating whether a generated software system satisfies the specified set of requirements, checking that all requirements have been implemented by the end of the lifecycle, and analyzing the impact of proposed changes on the system. Various approaches for improving requirements traceability practices have been proposed in recent years. Automated traceability methods that utilize information retrieval (IR) techniques have been recognized to effectively support the trace generation and retrieval process. IR-based approaches not only significantly reduce the human effort involved in manual trace generation and maintenance, but also allow the analyst to perform tracing on an “as-needed” basis. IR-based automated traceability tools typically retrieve a large number of potentially relevant traceability links between requirements and other software artifacts in order to return to the analyst as many true links as possible. As a result, the precision of the retrieval results is generally low, and the analyst often needs to manually filter out a large number of unwanted links. This low precision reduces the usefulness of the IR-based tools. The analyst’s confidence in the effectiveness of the approach can be negatively affected both by the presence of a large number of incorrectly retrieved traces and by the number of true traces that are missed.
In this thesis we present three enhancement strategies that aim to improve precision in trace retrieval results while still striving to retrieve a large number of traceability links. The three strategies are:
1) Query term coverage (TC). This strategy assumes that a software artifact sharing a larger proportion of distinct words with a requirement is more likely to be relevant to that requirement. This concept is defined as query term coverage (TC). A new approach incorporates the TC factor into the basic IR model so that the relevance ranking of query-document pairs sharing two or more distinct terms is increased, thereby improving retrieval precision.
2) Phrasing. The standard IR models generate similarity scores for links between a query and a document based on the distribution of single terms in the document collection. Several studies in the general IR area have shown that phrases can provide a more accurate description of document content and therefore lead to improvement in retrieval [21, 23, 52]. This thesis therefore presents an approach that uses phrase detection to enhance the basic IR model and improve its retrieval accuracy.
3) Utilizing a project glossary. Terms and phrases defined in the project glossary tend to capture the critical meaning of a project and can therefore be regarded as more meaningful for detecting relations between documents than other, more general terms. This thesis therefore introduces a new enhancement technique that uses the information in the project glossary to increase the weights of the terms and phrases it contains. This strategy aims to increase the relevance ranking of documents containing glossary items and consequently to improve retrieval precision.
The incorporation of these three enhancement strategies into the basic IR model, both individually and synergistically, is presented. Extensive empirical studies have been conducted to analyze and compare the retrieval performance of the three strategies. In addition to the standard performance metrics used in IR, a new metric, average precision change, is also introduced in this thesis to measure the accuracy of the retrieval techniques. Empirical results on datasets with various characteristics show that the three enhancement methods are generally effective in improving the retrieval results.
The improvement is especially significant at the top of the retrieval results, which contains the links that the analyst will see and inspect first. This is particularly meaningful because it implies that the analyst may be able to evaluate those important links earlier in the process.
As the performance of these enhancement strategies varies from project to project, the thesis identifies a set of metrics as possible predictors of the effectiveness of these enhancement approaches. Two such predictors, namely average query term coverage (QTC) and average phrasal term coverage (PTC), are introduced for the TC and the phrasing approaches respectively. These predictors can be employed to identify which enhancement algorithm should be used in the tracing tool to improve the retrieval performance for specific document collections. Results of a small-scale study indicate that the predictor values can provide useful guidelines for selecting a specific tracing approach when there is no prior knowledge of a given project. The thesis also presents criteria for evaluating whether an existing project glossary can be used to enhance results in a given project. The project glossary approach will not be effective if the existing glossary is not consistently followed during software development. The thesis therefore presents a new procedure to automatically extract critical keywords and phrases from the requirements collection of a given project. The experimental results suggest that these extracted terms and phrases can be used effectively in lieu of a missing or ineffective project glossary to help improve the precision of the retrieval results.
To summarize, the work presented in this thesis supports the development and application of automated tracing tools. The three strategies share the same goal of improving precision in the retrieval results, addressing the low precision problem that is a major concern with IR-based tracing methods.
Furthermore, the predictors for individual enhancement strategies presented in this thesis can be utilized to identify which strategy will be effective in specific tracing tasks. These predictors can be adopted to build intelligent tracing tools that automatically determine, on the basis of the metric values, which enhancement strategy should be applied to achieve the best retrieval results. A tracing tool incorporating one or more of these methods is expected to achieve higher precision in the trace retrieval results than the basic IR model. Such an improvement will not only reduce the analyst’s effort in inspecting the retrieval results, but also increase his or her confidence in the accuracy of the tracing tool.
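To make the query term coverage idea concrete, the following is a minimal Python sketch. It is not the formula used in the thesis, which this abstract does not state. It assumes a plain cosine similarity over raw term frequencies as the basic IR score, and a hypothetical boost of (1 + coverage) that is applied only when the requirement and the artifact share at least two distinct terms. All function names, the `min_shared` parameter, and the boost formula are illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(query_tokens, doc_tokens):
    """Assumed baseline score: cosine similarity over raw term frequencies."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def term_coverage(query_tokens, doc_tokens):
    """Return (shared distinct terms, fraction of distinct query terms covered)."""
    q_terms, d_terms = set(query_tokens), set(doc_tokens)
    shared = len(q_terms & d_terms)
    return shared, (shared / len(q_terms) if q_terms else 0.0)


def tc_adjusted_score(query, doc, min_shared=2):
    """Boost the baseline score by the coverage fraction (hypothetical formula)."""
    q_tokens, d_tokens = tokenize(query), tokenize(doc)
    base = cosine_similarity(q_tokens, d_tokens)
    shared, coverage = term_coverage(q_tokens, d_tokens)
    if shared >= min_shared:
        # Assumption: a multiplicative boost; the thesis may combine TC differently.
        return base * (1.0 + coverage)
    return base


if __name__ == "__main__":
    requirement = "The system shall encrypt user passwords"
    artifacts = [
        "encrypt passwords before storing user records",
        "the login page layout",
    ]
    for artifact in artifacts:
        print(artifact, "->", round(tc_adjusted_score(requirement, artifact), 3))
```

In this sketch the first artifact shares three of the six distinct requirement terms, so its baseline score is boosted, while the second shares only one term and keeps its baseline score. A real tool would normally also remove stop words and apply stemming before computing coverage, which this sketch omits for brevity.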