Is automatic summarization fundamentally a machine translation problem?

There are two types of automatic summarization:

  • extraction, which picks a subset of sentences to represent the whole.
  • abstraction, which may rewrite some sentences to create a better summary.


Extraction is normally done with statistics. The technique uses stemming and word-counting to guess which parts of the text are the most typical.
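As a rough illustration of the stemming-and-counting idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: the tiny stop-word list, the crude suffix-stripping `crude_stem` (a stand-in for a real stemmer), the naive sentence splitter, and the scoring rule (average stem frequency per sentence). Real extractive summarizers are considerably more careful.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "that", "on", "was"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def stems(sentence):
    """Lowercase the sentence, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

def extract_summary(text, n=1):
    """Return the n sentences whose stems are most frequent in the whole text."""
    sentences = split_sentences(text)
    counts = Counter(stem for s in sentences for stem in stems(s))

    def score(sentence):
        st = stems(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(counts[x] for x in st) / len(st) if st else 0.0

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:n]
    # Re-sort the chosen indices so the summary keeps the original order.
    return [sentences[i] for i in sorted(top)]

text = "Cats chase mice. Cats sleep all day. The weather was nice."
print(extract_summary(text, n=1))
```

Note that the summary is always a subset of the original sentences; nothing is rewritten.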

In contrast, machine translation is normally done in one of two ways:

  • Hand-coded language rules that tell the computer how to transform input strings into output strings.
  • Comparison of texts that have already been translated by humans. Google Translate uses this approach, which is considered more modern and accurate.

So extraction, as currently practiced, is an easier problem than machine translation, and it uses different techniques.

However, automatic extraction could borrow methods from machine translation. For example, a database of source texts could be created in which humans have already done the extraction. The computer could then learn to summarize texts by comparing each original with its summary, as it does for machine translation.
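One hypothetical way to sketch that idea in Python: given pairs of (source sentences, sentences a human chose), learn a weight for each word from how often it appears in chosen versus rejected sentences, then score new sentences with those weights. The function names, the smoothed log-odds weighting, and the toy data are all assumptions for illustration. This is not how any particular translation system works; it only shows learning from human-made examples.

```python
import math
import re
from collections import Counter

def tokens(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def train(examples):
    """examples: list of (source_sentences, chosen_sentences) pairs made by humans.
    Returns a weight per word: smoothed log-odds of appearing in a chosen sentence."""
    kept, dropped = Counter(), Counter()
    for source, chosen in examples:
        chosen_set = set(chosen)
        for sentence in source:
            target = kept if sentence in chosen_set else dropped
            target.update(tokens(sentence))
    vocab = set(kept) | set(dropped)
    # Add-one smoothing so unseen words do not produce log(0).
    kept_total = sum(kept.values()) + len(vocab)
    dropped_total = sum(dropped.values()) + len(vocab)
    return {
        w: math.log((kept[w] + 1) / kept_total) - math.log((dropped[w] + 1) / dropped_total)
        for w in vocab
    }

def summarize(sentences, weights, n=1):
    """Pick the n sentences with the highest average learned word weight."""
    def score(s):
        toks = tokens(s)
        # Words never seen in training contribute a neutral weight of 0.
        return sum(weights.get(t, 0.0) for t in toks) / len(toks) if toks else float("-inf")

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]

# Toy, made-up training data: humans kept the "council" sentences.
examples = [
    (["The council approved the budget.", "Lunch was served afterwards."],
     ["The council approved the budget."]),
    (["Lunch was late.", "The council rejected the plan."],
     ["The council rejected the plan."]),
]
weights = train(examples)
print(summarize(["Lunch was served at noon.", "The council approved the merger."], weights))
```

With so little data the learned weights are not meaningful; the point is only that the selection rule comes from human examples rather than from hand-picked statistics.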

The task of extraction would still be easier than translation, because you’re not actually transforming anything when you summarize. You’re just picking a subset of sentences from the source document that represent the larger whole.


Abstraction is the more difficult form of summarization. Since an abstract can rewrite sentences from the source material, producing one requires natural-language skills closer to a person's.

These skills are not very different from those needed to rewrite a document from one language into another. So you could argue that abstraction is more like machine translation than extraction is.

