Contributions of the Thesis

In this work, we have provided a set of tools and methodologies that can be used both in document summarization and in other NLP tasks. We have illustrated how the proposed methodologies can be adapted to tackle problems such as sentence salience detection, redundancy removal and novelty detection in summaries, summary system evaluation, and the quantification of textual qualities. We believe that this research offers a basis for the evolution of current NLP algorithms through the use of n-gram graphs. We further hope to have provided a new view on the modeling of textual representation, one that retains language neutrality and usefulness and aims at algorithms, operators and functions of generic usability in NLP.
To recapitulate, this work has made the following basic contributions:

  • An overview of the summarization process from a perspective that tries to combine existing approaches and views and to sketch an outline of summarization research. (Part I)
  • A statistically extracted, language-neutral and generically usable representation, namely the n-gram graph, that offers richer information than the feature vector. The representation is accompanied by a set of theoretical and practical tools for applying n-gram graphs and the related algorithms to NLP tasks. (Part I)
  • An automatic evaluation system, which we call AutoSummENG, that aims to capture the textual quality of given summaries in a language-neutral way by using the n-gram graph representation. The system has achieved state-of-the-art performance while maintaining language neutrality and simplicity. (Part II)
  • The Symbol Sequence Statistical Normality measure, a quality-indicative feature of text based on the statistics of character sequences within a given text.
  • An automatic summarization system based on n-gram graphs, which addresses content selection and redundancy removal in a language-neutral manner. (Part III)
    The proposed variations of our summarization system offered competitive results on the TAC 2008 corpus, without using complex features or machine learning techniques to optimize performance.
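
As a brief reminder of what the n-gram graph representation looks like, the following is a minimal sketch and not the JINSECT implementation. It assumes character n-grams as vertices and a simplified neighbourhood window: two n-grams are linked when their start positions are at most `window` apart, and the edge weight counts such co-occurrences. The function names `char_ngrams` and `ngram_graph`, and the default parameter values, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return the character n-grams of `text` in order of occurrence."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_graph(text, n=3, window=2):
    """Build a simplified, undirected n-gram graph (illustrative only).

    Vertices are character n-grams. An edge joins two n-grams whose start
    positions differ by at most `window`; its weight is the number of
    times this happens in `text`. The graph is returned as a Counter
    mapping a sorted vertex pair to its weight.
    """
    grams = char_ngrams(text, n)
    edges = Counter()
    for i, gram in enumerate(grams):
        # Look only forward; sorting the pair makes the edge undirected.
        for other in grams[i + 1:i + 1 + window]:
            edges[tuple(sorted((gram, other)))] += 1
    return edges
```

For instance, `ngram_graph("abab", n=2, window=1)` yields a single edge between the bigrams `ab` and `ba` with weight 2, since the two are adjacent twice. No tokenization, stemming or other language-dependent preprocessing is involved, which is the property that keeps the representation language-neutral.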

 
As part of the presented research, we also dedicated time to promoting collaboration among researchers in the summarization community. This effort resulted in:

  • The FABLE framework, which aims to support the AESOP (Automatically Evaluating Summaries Of Peers) task of the upcoming 2009 Text Analysis Conference by providing a common framework for the integration and evaluation of summary evaluation techniques.
  • The JINSECT toolkit, a Java-based toolkit and library that supports and demonstrates the use of n-gram graphs in a wide range of Natural Language Processing applications, from summarization and summary evaluation to text classification and indexing. The toolkit is offered to the NLP community under the LGPL licence, which allows free use in both commercial and non-commercial environments.