How natural language processing will revolutionise our daily lives
Emerging Digital Technology
Natural Language Processing:
Future Opportunities for Expansion
Natural Language Processing is a rapidly changing field that has revolutionised the way people access information, communicate with each other, and interact with their electronic devices. NLP technology helps to protect our inboxes from spam messages, analyse documents to detect plagiarism, and evaluate customer opinions to build market forecasts, among many other commercial applications. NLP-based applications are now an inseparable component of our everyday lives. Consistent decreases in computational costs and increases in the amount of information available for processing have led to further commercial growth in this area, and have produced a diverse range of products that rely on understanding human language. Rule-based systems are being replaced by statistically-based products, which provide improved performance when supplied with increasing amounts of data, and are less dependent on the type of information being analysed. By 2020, the market for NLP applications is expected to grow to $13.4 billion at a compound annual growth rate of 18.4%.
High-level Overview of NLP Area
Natural language processing (NLP) is an interdisciplinary branch of artificial intelligence and computer science concerned with developing automatic techniques for understanding, analysing, and generating samples of human language. NLP applications are now a common component of many types of human-computer communication. They transform search queries and voice commands expressed in a natural language into machine-understandable instructions, and allow for the automatic analysis of large amounts of information. NLP products such as spelling correctors and translation systems also improve the quality of human-human communication. All of these tools rely on effective methods of processing ambiguous samples of human speech or written language. The area of NLP is highly segmented in various dimensions, for example by area of application, by industry, or by type of technology (see the Infobox for further details).
NLP is currently undergoing significant commercial and technological growth in response to new opportunities for innovation brought about by mobile computing, the Internet of Things, and big data. According to a 2015-2020 forecast by Research and Markets, the NLP market is expected to grow to $13.4 billion by the end of this period, at a compound annual growth rate (CAGR) of 18.4%. While North America and Europe maintain the highest NLP market shares, the Asia Pacific region is expected to become the fastest growing region for NLP, due to an increased interest in voice recognition technologies. With market growth increasing around the world, NLP is transforming into a globally competitive sector.
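To make the growth figures concrete, the sketch below shows how a compound annual growth rate relates a starting market size to a final one. The function names are illustrative, and the choice of five annual compounding periods (2015 to 2020) is an assumption rather than something stated in the forecast, so the implied starting value should be read as rough arithmetic only.

```python
def project_value(base: float, cagr: float, years: int) -> float:
    """Value after compounding `base` at annual rate `cagr` for `years` years."""
    return base * (1.0 + cagr) ** years


def implied_base(final: float, cagr: float, years: int) -> float:
    """Starting value implied by a final value and a compound annual growth rate."""
    return final / (1.0 + cagr) ** years


# Figures quoted in the text: $13.4 billion by 2020 at an 18.4% CAGR.
# Assumption (not from the forecast): five annual compounding periods.
base_2015 = implied_base(13.4, 0.184, 5)
print(f"Implied 2015 market size: ${base_2015:.2f} billion")
```

Under these assumptions, compounding the implied starting value forward with `project_value` returns the quoted $13.4 billion.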
The most prominent applications for NLP have been found in industries such as e-commerce, information technologies, telecommunications, and aerospace & defence. Information extraction and machine translation are the two leading application areas, and new domains are predicted to emerge by the end of 2020. In this μInsight, we provide an extensive outline of the major trends impacting the largest application areas in this field.
Infobox: Dimensions of NLP segmentation
Type of Application
Information extraction (IE), machine translation (MT), information retrieval (IR), natural language generation (NLG), question answering, sentiment analysis, report generation, summarisation, etc.
Type of Technology
Generation, analysis, categorisation, etc.
Industry
Healthcare, information services, financial and banking services, research, consumer and retail markets, job markets, entertainment, education, etc.
Model of Deployment
On-cloud or on-premises
Method of Solution
Rule-based, statistical, and hybrid
Information Extraction
Information extraction (IE) is the area of NLP that deals with the identification of factual information in unstructured texts. Extracted facts may relate to particular real-world events and their attributes (e.g., when and where the event took place), to actions with their actors (e.g., who did what to whom, where, and when), or to entities. Once extracted, this factual information is usually stored in a structured form, such as in database records or templates.
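As a minimal sketch of this idea, the example below pulls a "who did what to whom, when, and where" fact out of a sentence and stores it as a structured record. The `AcquisitionEvent` record, the regular expression, and the sample sentence are all hypothetical and cover only one sentence shape; production IE systems rely on much richer linguistic and statistical models.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AcquisitionEvent:
    """Structured record for one extracted event."""
    buyer: str
    target: str
    date: str
    place: str


# Hypothetical pattern for a single sentence shape: "<Buyer> acquired <Target> on <date> in <Place>".
PATTERN = re.compile(
    r"(?P<buyer>[A-Z][\w&]*(?: [A-Z][\w&]*)*) acquired "
    r"(?P<target>[A-Z][\w&]*(?: [A-Z][\w&]*)*) "
    r"on (?P<date>\d{1,2} [A-Z][a-z]+ \d{4}) "
    r"in (?P<place>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)


def extract_event(sentence: str) -> Optional[AcquisitionEvent]:
    """Return a structured record if the sentence matches the pattern, else None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return AcquisitionEvent(**match.groupdict())


text = "Analysts noted that Acme Corp acquired Widget Works on 12 March 2015 in New York."
print(extract_event(text))
```

The resulting record could then be written to a database row or template, which is the structured form the paragraph above describes.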
As one of the largest NLP application areas, IE is a crucial data processing tool for industries that rely on processing large amounts of information, including health care, news portals, hedge funds, financial services, consumer websites, and employment websites. With an increasing diversity of data entry formats, sources, and methods, there is clear demand for tools that can extract structured information from unstructured documents at minimal cost.
Companies in the financial services industry are benefiting from the use of IE products to generate comprehensive reports. For example, Two Sigma, a hedge fund that manages over $35 billion in total assets, recently released an automatically generated topical analysis of their meeting minutes. By extracting just the named entities from their meeting records, the company can now monitor major shifts in discussion topics over time. Similarly, Ross Intelligence is using IBM’s Watson to perform information extraction on unstructured loan facility agreements and contracts to monitor various trends in the market.
The demand for IE products is expected to increase due to the growing importance of web data for effective marketing and decision-making. The unpredictable nature of information available online dictates the two modern requirements for IE systems: independence from the contextual type of information, and adaptability to various online platforms.
Web scraping solutions such as Dexi Pipes, Mozenda, Content Grabber, and Webhose.io are extremely popular tools for acquiring and processing web-based information. These tools range from low-level products which require familiarity with regular expressions and markup languages used for document formatting and structural annotation, to high-level solutions that can be used by laypersons. Most high-level IE tools can only be used for a narrow range of tasks. In contrast, many low-level products are more easily adapted to provide solutions for a wider range of tasks. It is therefore essential for competitive IE start-ups to maintain an optimal balance between user-friendliness and applicability. A good example of a scalable solution in the market is DiffBot, one of the leading web data extractors. This system incorporates computer vision techniques to render and analyse the page structure, and applies IE methods afterwards in order to extract essential semantic content—for example, when analysing a shopping site, extracting information such as the product’s name, price, user reviews, videos, images, related brands, etc. Clients of DiffBot include Cisco, Adobe, Microsoft Bing, and Amazon.
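The low-level end of this range can be illustrated with a short sketch that reads page markup directly, using only Python's standard `html.parser` module. The class names `product-name` and `product-price` and the sample page are invented for illustration; this is not how DiffBot or any of the named tools work internally, and a real page would need its own field mapping.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collects the text of elements whose class attribute names a field of interest."""

    # Hypothetical mapping from CSS class names to record fields.
    FIELDS = {"product-name": "name", "product-price": "price"}

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # field expected from the next text node, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self._current = self.FIELDS.get(css_class)

    def handle_data(self, data):
        if self._current and data.strip():
            self.record[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


page = (
    '<div class="item"><h2 class="product-name">Desk Lamp</h2>'
    '<span class="product-price">$24.99</span></div>'
)
parser = ProductParser()
parser.feed(page)
print(parser.record)
```

The sketch also shows the trade-off described above: the parser is easy to adapt to new fields, but it requires familiarity with the markup of each target site.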
Job matching platforms, such as UpScored, Actonomy, and SkillFinder, represent another emerging application area for IE-based solutions. The goal of these platforms is to extract terms associated with job titles, descriptions, and skills from both job-seekers’ CVs and employers’ postings about vacancies in order to perform effective bi-directional job matching.
Machine Translation
The task of machine translation (MT) is to automatically translate information from one human language to another. Google Translate and automatic translation tools used on social networks, such as Facebook Translate, are examples of popular MT applications. Achieving high precision and high quality in automated translation is a challenging task, because it often requires interpretation of not only separate words, but also large phrases and sentences, idioms, and abbreviations in the input language. After the input sentences are interpreted, they are matched with the closest analogues in the target language. Comprehensive MT solutions must also guarantee the linguistic coherence of the output text. All these steps require in-depth analysis of the input and output information.
MT is the second largest area for NLP applications in terms of market size. By 2022, the market is expected to reach $983.3 million according to Radiant Insights, or perhaps as much as $1.3 billion (growing at a compound annual rate of 18.6%) according to Global Market Insights. The area of MT has recently experienced a major shift from rule-based systems, i.e., those relying on large context-specific dictionaries and pre-defined syntactic and morphological rules for source and target languages, to statistically-based ones, i.e., those that can learn translation rules by observing the patterns in texts previously translated by humans. Since statistical MT systems do not rely on exhaustive sets of predefined rules and dictionaries, they are more robust when encountering new translation tasks. However, rule-based MT tools remain prevalent in small and medium-scale enterprises.
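One way to picture how a statistical system can learn translation rules from previously translated text is the simplified sketch below, written in the style of IBM Model 1 (without the usual NULL word). It estimates word translation probabilities from sentence pairs using expectation-maximisation. The three-pair English-German corpus is invented for illustration, and commercial statistical MT systems are far more elaborate than this.

```python
from collections import defaultdict

# Toy parallel corpus (English, German); invented for illustration.
CORPUS = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a book", "ein buch"),
]


def train_ibm_model1(corpus, iterations=10):
    """Estimate word translation probabilities t[(f, e)] = P(f | e) with EM."""
    pairs = [(e.split(), f.split()) for e, f in corpus]
    target_vocab = {f for _, fs in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(target_vocab))  # uniform starting point

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f being aligned to e
        total = defaultdict(float)  # expected count of e being aligned to any word
        # E-step: share each target word among the source words of its sentence
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # M-step: renormalise the expected counts into probabilities
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def best_translation(t, source_word):
    """Most probable target word for a source word seen in training."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get)


table = train_ibm_model1(CORPUS)
for word in ("the", "house", "book", "a"):
    print(word, "->", best_translation(table, word))
```

No dictionary or grammar rule is supplied here: the word pairings emerge only from co-occurrence patterns across the translated sentences, which is why such systems depend so heavily on the amount of training data.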
One recent trend that promises to create substantial improvements in the effectiveness of MT systems is the use of deep neural networks (DNNs). DNNs consist of multiple layers of information representation and processing. Their architecture allows the design of fully trainable MT models, i.e., models in which every component is automatically adjusted based on the training data in order to maximise translation accuracy. Currently, only the largest global players in the market are transitioning to neural network-based MT tools. Facebook is one example: it has recently introduced DNN-based English-to-German translation, and is aiming to expand this technique to 45 other languages. Other examples include the complete transition to DNN-based translation tools by Microsoft’s Skype and China’s search giant Baidu, and the partial adoption of this technique by Google’s Translate tool.
Further expansion of the MT market is expected, as many industries require regular and accurate translation of large amounts of data, such as the healthcare industry and the research industry. Another factor motivating the growth of MT is the increasing demand for cloud-based MT tools, which are expected to dominate the market. Mobile applications providing instant translation of texts on smartphones represent another area dependent on MT.
One of the key conditions for future growth of this market is the availability of substantial training data. Training data is an important component of statistical MT, since these systems’ rules are established based on observed examples of previously translated texts. Therefore, tools for translation between uncommon language pairs, such as Turkish and Catalan, cannot be implemented at the moment due to the scarcity of data.
Natural Language Generation
Natural language generation (NLG) aims to achieve the opposite of natural language understanding: NLG systems translate machine representations of stored information into human language. NLG techniques are often integral components of complex NLP products, for example the text generation mechanisms of machine translation systems, the language outputs of intelligent personal assistants such as Apple’s Siri and Microsoft’s Cortana, and text summarisation tools.
The area of NLG has expanded massively since 2016, driven by millions of competing chatbot start-ups appearing on the market within the last nine months. The concept of a chatbot, a computer agent that can simulate intelligent conversations with humans using textual or audio input and output, has existed since the NLP field’s earliest days. However, integration of such bots into mobile apps has revolutionised this market. Mobile messaging bots are now possible due to recent improvements in natural language understanding tasks. These bots can serve as intelligent agents that are able to initiate conversations with users, answer their questions, present requested information, and provide customer support through mobile messenger applications. Customers do not need to install standalone mobile applications for each product type, and can use these messengers as a centralised way of communicating with service providers.
For the marketing, commerce, and advertising sectors, messaging bots provide more advanced opportunities to engage with potential audiences by means of exchanging links, videos, or audio files, and by communicating with clients bi-directionally. In addition to marketing, chatbots create more opportunities for providing prompt product support by means of sending agent-generated answers to customers’ questions.
According to recent statistics, Facebook’s Messenger app, one of the leaders in the area, has over 11,000 integrated chatbots. In addition to textual communication, Skype, Kik, Slack, and Facebook’s bot platforms also allow exchange of links, images, and videos.
Facebook recently introduced Bot Engine, a natural language processing platform that can be used by companies to create customised bots to reach Facebook’s more than 900 million mobile users. Bot Engine also allows non-expert users to instruct their mobile bots in how to respond in particular situations by providing sample answers to potential user questions. Over 23,000 user accounts have already been created on the Facebook platform. Most of these users have created or plan to create various types of messaging bots for their business needs.
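The idea of instructing a bot through sample answers to potential questions can be sketched as follows. This is not Bot Engine's API or method: the sample question-answer pairs, the word-overlap (Jaccard) matching, and the `threshold` parameter are all assumptions chosen to keep the example small.

```python
import re
from typing import Dict, Set

# Hypothetical sample question/answer pairs supplied by a non-expert bot owner.
SAMPLES: Dict[str, str] = {
    "what are your opening hours": "We are open 9am to 5pm, Monday to Friday.",
    "how do i track my order": "Use the tracking link in your confirmation email.",
    "can i return an item": "Yes, returns are accepted within 30 days.",
}
FALLBACK = "Sorry, I did not understand. A human agent will follow up."


def tokens(text: str) -> Set[str]:
    """Lower-cased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def reply(message: str, threshold: float = 0.3) -> str:
    """Answer with the sample whose question shares the most words with the message."""
    words = tokens(message)
    best_answer, best_score = FALLBACK, 0.0
    for question, answer in SAMPLES.items():
        q_words = tokens(question)
        union = words | q_words
        score = len(words & q_words) / len(union) if union else 0.0
        if score > best_score:
            best_answer, best_score = answer, score
    # Fall back when no sample question is similar enough
    return best_answer if best_score >= threshold else FALLBACK


print(reply("How can I track my order?"))
print(reply("Tell me a joke"))
```

Adding a new capability in this sketch means adding one more sample pair, which mirrors the appeal of such platforms to non-expert users.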
The clear commercial interest of many industries in mobile bots suggests there will be further global expansion of this market.
Information Retrieval
Information retrieval (IR) is a field that is separate from but related to NLP, and focuses on searching large collections of information (web pages, documents, databases) for the responses that are most relevant to a user’s query. Web search engines, such as Google Search, are the most widely used IR tools. The search criteria in such systems are usually expressed as a natural language query. Therefore, the majority of the effective search engines today also rely heavily on NLP methods.
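A classic way to rank documents by relevance to a natural language query is TF-IDF weighting, sketched below on a toy collection. The documents, the query, and the smoothed IDF formula are illustrative assumptions; web search engines combine many more signals than this.

```python
import math
import re
from collections import Counter

# Toy document collection; invented for illustration.
DOCS = [
    "Machine translation converts text from one language to another.",
    "Spelling correction fixes typos in typed text.",
    "Search engines retrieve documents relevant to a query.",
]


def tokenize(text):
    """Lower-cased alphabetic words in the text."""
    return re.findall(r"[a-z]+", text.lower())


def rank(query, docs):
    """Return document indices sorted by TF-IDF overlap with the query, best first."""
    doc_tokens = [tokenize(d) for d in docs]
    # Number of documents each term appears in
    doc_freq = Counter(term for toks in doc_tokens for term in set(toks))
    n_docs = len(docs)
    query_terms = set(tokenize(query))

    def score(toks):
        tf = Counter(toks)
        # Smoothed IDF: terms found in every document contribute nothing
        return sum(
            (tf[term] / len(toks)) * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term in query_terms
        )

    scores = [score(toks) for toks in doc_tokens]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


order = rank("documents relevant to my query", DOCS)
print([DOCS[i] for i in order])
```

Rare query words such as "documents" weigh more than common ones such as "to", so the document about search engines is ranked first for this query.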
The demand for IR products increased exponentially in response to the growth of the Internet and information services in the 2000s. This market is now one of the most successful and competitive of all software markets in terms of sales. According to the U.S. Census Bureau, the industry achieved $27.7 billion in sales in 2000. Today, the revenue increase over 2-3 year periods for global market leaders such as Google and Yahoo! varies from $5 billion to $20 billion.
Along with universal search engines, IR is also being applied to specific research and consumer question answering websites. One of the long-term objectives of IR tools is to integrate NLP techniques in order to gain a better understanding of users’ search queries, and to provide more intelligent output. A good example of this trend is the market leader, Google Search, which is now able to handle the meanings of superlatives and temporal relations expressed in queries. The importance of natural language understanding techniques for the IR area is only expected to grow.
Spelling Correction
Spelling correction systems are widely used to check for typos and grammatical errors in applications where large amounts of human-typed information are processed, such as email clients, messengers, word processors, and search engines. Advanced spell checking products examine not only the correctness of separate words, but also the syntactic structure of the sentence and the suitability of the word given the context of the sentence. The main challenge of modern autocorrect products is to be able to adapt to ever-changing language patterns, and to learn new linguistic expressions in a timely manner. Such expressions include proper nouns, created words, and phrases. Many tools, such as Google Docs, allow for a high degree of customisation in order to satisfy user preferences.
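The simplest layer of such a system, checking separate words, can be sketched with a frequency-based corrector that considers every string one edit away from the typed word. The word counts below are invented for illustration; the sketch ignores sentence context entirely, which is precisely what the more advanced products described above add.

```python
from collections import Counter

# Toy word frequencies; a real checker would count words in a large corpus.
WORD_COUNTS = Counter({"spelling": 50, "spilling": 5, "language": 80, "the": 500})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one deletion, transposition, replacement, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, or the word unchanged."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word


for typo in ("speling", "langauge", "spolling"):
    print(typo, "->", correct(typo))
```

Adapting to new expressions, as the paragraph above requires, amounts in this sketch to adding words to `WORD_COUNTS` or updating their counts.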
Smart keyboards represent an important emerging market in this application area. Commonly found on mobile phones and tablets, smart keyboards maintain a constantly expanding list of suggestions for data input. Smart keyboards are not a new development: Apple’s Quick Type, Harmony Smart Keyboard, KHMER, and others have been on the market for over a decade. However, the increasing amount of storage space on new smartphones and tablets has helped remove one of the major limitations on the quality of these keyboards: their large word dictionaries and the enormous number of possible letter combinations used to require “big data” processing capacities for the keyboards to be useful. With more storage space available on mobile devices, the quality of smart keyboards has significantly improved. The next generation of smart keyboards is expected to be able to learn which neighbouring combinations of keys cause the majority of spelling errors for each user, and to adapt the keyboard layout accordingly to minimise such errors.
Another challenge for modern spelling correction systems is to provide robust solutions for work with ambiguous and old language, such as that of the Text Creation Partnership, an organisation working to transcribe old handwritten texts into digital form.
Text Summarisation
The task of text summarisation systems is to automatically reduce large texts to small outlines containing the key ideas from the full text. Automatic text summarisers are widely used to generate comprehensive summaries of news, scientific articles, and collections of documents. Most of the products in the area operate based on one of two common approaches: extraction or generation. Extraction methods are based on the selection of key phrases or words from the existing texts. These keywords can later be used for document tagging. Some extraction techniques operate on the level of entire sentences and paragraphs, and compose summaries by combining these textual chunks. Generation techniques, in contrast, perform deeper analysis of text semantics, and use natural language generation and paraphrasing in order to produce more comprehensive summaries. Due to the complexity of the generative approach, the majority of tools currently on the market, such as SMMRY, Textuality from Saaskit, and Open Text Summarizer, operate based on extraction techniques.
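A sentence-level extraction approach can be sketched as follows: each sentence is scored by how frequent its content words are in the whole text, and the top-scoring sentences are returned in their original order. The stop-word list, the scoring rule, and the sample text are assumptions made for illustration and are not taken from any of the tools named above.

```python
import re
from collections import Counter

# Small hypothetical stop-word list; real tools use much longer ones.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "for", "it", "on"}


def content_words(text):
    """Lower-cased words in the text with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarise(text, max_sentences=2):
    """Keep the sentences whose content words are most frequent in the whole text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        # Average word frequency, so long sentences are not favoured automatically
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore the original sentence order
    return " ".join(sentences[i] for i in keep)


article = (
    "Translation tools convert text between languages. "
    "Translation quality depends on training data. "
    "The weather was pleasant yesterday."
)
print(summarise(article))
```

Because the summary is assembled from sentences copied out of the input, no language generation is involved, which is what separates extraction from the generation techniques described above.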
Many summarisation products, such as TextTeaser, are machine-learning based, which allows for constant improvement of the results when provided with a greater number of summarised texts. Machine learning also makes topic-specific training of summarisers possible.
Future Expansion of the NLP Area
The area of information extraction is expected to expand further during the next few years in response to increased investments in the healthcare industry. New innovations and applications in this sector will eventually increase the impact of IE on the natural sciences, defence, telecommunications, human resources, finance, and employment sectors as well.
Global changes occurring in the area of machine translation—particularly the transition to statistical methods and neural network-based techniques—will revolutionise this area over the next five years, likely leading to instant MT tools becoming integral components of almost all mobile, desktop, and web applications.
Within the next few years, mobile chatbots are expected to revolutionise the commerce and marketing sectors. It is probable that they will also be adopted by other industries that require prompt customer assistance, such as mobile network operators, the hospitality industry, and the healthcare industry. Virtual personal assistants will become more intelligent and capable of handling a wider range of tasks.
The area of information retrieval will also become more interconnected with NLP in the near future. As a result, search engines will provide more relevant outputs for users’ queries.
More in-depth syntactic and semantic analysis of input data and the increasing storage capacities of mobile devices will make spelling correction tools more effective within the next five years.
The area of text summarisation will continue to impact industries that operate on large quantities of text, such as education, research, and media. During the next five years, a shift is expected from supervised machine learning-based text summarisers, which rely on large amounts of training data, to unsupervised algorithms, which are based on sentence ranking and do not require extra information, making such tools more widely applicable.
Further down the line, generation-based text summarisers are expected to become dominant over the extraction-based summarisation tools that currently make up the majority of the market.
Overall, a global expansion of NLP in all its dimensions (industry, application types, technology, etc.) can be expected in the near future. This will significantly impact our day-to-day lives, as it will facilitate communication between people speaking different languages, allow for more intelligent information processing, and enable us to effectively engage with virtual assistants during daily tasks. High levels of competition in this market are predicted to stimulate rapid improvements in NLP tools, and NLP applications will consequently become a ubiquitous part of electronic devices, software applications, and services.