In the previous article, we looked at why you might use AI to summarize blogs for social media and introduced the basics of machine-automated summarization. Now let’s dig deeper to understand how this summarization is actually done. There are two main approaches to automatic summarization: extractive and abstractive. Extractive summarization, at a high level, is a technique in which the machine identifies the key phrases or sentences in the article and combines them, unchanged, into a summary that retains the original message.
Suppose you have a URL that links to a thousand-word blog article. The first step for the algorithm is to retrieve the full text of the blog, which is done through web scraping. The next step is to break the article down into individual sentences, which can be done with Natural Language Processing libraries such as spaCy and NLTK. These sentences are then fed into a language model. One such model is BERT, an advanced NLP model that is pretrained on a large corpus of text, which is what allows it to represent language accurately. Internally, BERT creates word embeddings. These embeddings are a numerical form of each word: every word is converted into a vector, and words that are used in similar contexts end up with similar vectors. For example, words like Russia and Putin would be numerically close. This transformation from word to number is needed because the model operates on numbers rather than raw text, and it lets the machine compare words and sentences by meaning.
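The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline, and it makes several assumptions: the regex splitter stands in for spaCy or NLTK, the word-count vector in `embed` stands in for real BERT embeddings, and the final step (scoring each sentence by its cosine similarity to the whole document and keeping the top few) is one common way to pick sentences, not the only one. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (stand-in for spaCy/NLTK): break after ., ! or ?
    # when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def embed(sentence):
    # Stand-in for a BERT embedding: a sparse word-count vector.
    # A real pipeline would return a dense vector from the model instead.
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as Counters.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def summarize(text, num_sentences=2):
    sentences = split_sentences(text)
    vectors = [embed(s) for s in sentences]

    # Represent the whole document as the sum of its sentence vectors.
    doc = Counter()
    for v in vectors:
        doc.update(v)

    # Score each sentence by how close it is to the document as a whole.
    scores = [cosine(v, doc) for v in vectors]

    # Keep the highest-scoring sentences, then restore original order
    # so the summary reads naturally.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(top[:num_sentences])
    return " ".join(sentences[i] for i in keep)
```

Calling `summarize(article_text, num_sentences=3)` on scraped text would return three sentences copied verbatim from the article. Swapping `embed` for a function that returns BERT sentence vectors (and adapting `cosine` to dense vectors) would leave the rest of the logic unchanged.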
Extractive summarization is a fairly common method of text summarization, but it is not the only one. The other main technique, as mentioned briefly before, is abstractive summarization.
In abstractive summarization, the machine paraphrases the source document and creates new phrases and sentences that convey the most critical information from the text. This is very similar to how a human reads a document and explains its key messages in his or her own words. Abstractive summarization is usually implemented with deep learning models, and because it generates its own sentences, it can avoid the grammatical awkwardness that extractive summarization sometimes produces when it stitches together sentences taken from different parts of a text. Although the abstractive approach has its benefits, it is often more difficult to develop than the extractive one, which is a key reason extractive summarization remains such a common choice for text summarization.