KYC: Solving a Major Data Challenge Through Natural Language Processing


This article investigates the benefits of automation, focusing on the use of natural language processing (NLP), to enhance how specific KYC checks are conducted.

Know your customer (KYC) and anti-money laundering (AML) are common processes across most financial services (FS) institutions. The goal of these processes is to identify and flag data anomalies. Whether the data originates from an individual or an institution, the ability to ingest and assess large amounts of it is an essential aspect of KYC and AML processes.

In addition to labelling it a data challenge, several FS institutions also see KYC as an operational challenge. Although vital, KYC is a lengthy, often largely manual, and non-revenue-generating procedure.

Nevertheless, KYC is not only an integral process that gives firms a better understanding of their clients' needs; it is also a legal requirement for most FS firms. Failure to comply could result in hefty fines from regulators.

Automation in KYC

The UK regulator, the Financial Conduct Authority (FCA), reported over GBP 390M in AML fines between January 2019 and January 2020.¹ KYC is one part of the overall AML process, and the FCA has highlighted the importance of KYC to banks in protecting the UK financial system.

To adhere to the recommendations and regulations imposed by the FCA, firms operating in the UK must take the necessary precautions and put effective risk-based AML control frameworks in place. Integrating machine learning (ML) and natural language processing (NLP) into their operations can support such measures and add value.

Know your customer

Know your customer, or KYC, is a process whereby a business verifies the identity of its clients and assesses their suitability, along with any potential risks that their custom may bring, such as the risk of illegal intent behind the business relationship. The process lends itself particularly well to a technological solution, as it usually involves the manipulation of large quantities of data stored in various locations.

Furthermore, KYC is governed by compliance guidelines, which limits how far the process can be streamlined. Because it is so crucial to the safety of a business, it must be carried out as effectively as possible, with the error rate kept as low as possible.

In general, the process is very repetitive and involves large quantities of data. More importantly, any human error in the process can have a potentially high impact.

KYC has a medium complexity level, but this is increased due to the large volumes and the compliance guidelines that must be constantly adhered to. Based on previous projects, market research, and Synpulse’s expertise, we believe that process efficiency can be improved by up to 80% should KYC be handled by a robot. Most importantly, trends in cases can be continuously monitored and studied to automatically and instantly draw attention to any fraudulent activity.

Most, if not all, KYC processes will involve ingesting and processing different types and formats of data, from free-text and unstructured content to multimedia images and sometimes even videos.

A common, yet very time-consuming part of a KYC process is what is referred to by the business as a standard “Google Check”. A Google Check aims to establish that a potential new client holds no malicious intent towards the business.

The check involves searching for a potential client's first and last name in a search engine. The results are then cross-checked against negative keywords that could indicate an issue, such as “prison”, “theft”, or “fraud”. If a result is flagged, the collected information is passed on for further analysis of the nature of the malicious intent.
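The keyword cross-check described above can be sketched in a few lines of Python. This is only a minimal illustration, not the process used at any particular firm: the function name `screen_results`, the keyword list, and the sample snippets are all assumptions made for the example, and a real check would fetch its results from a search provider rather than a hard-coded list.

```python
import re

# Illustrative list only; the article names "prison", "theft" and "fraud" as examples.
NEGATIVE_KEYWORDS = {"prison", "theft", "fraud"}


def screen_results(full_name, snippets, keywords=NEGATIVE_KEYWORDS):
    """Return (snippet, matched keywords) pairs for snippets that mention
    every part of the client's name and at least one negative keyword."""
    name_tokens = full_name.lower().split()
    flagged = []
    for snippet in snippets:
        words = set(re.findall(r"[a-z']+", snippet.lower()))
        # Skip snippets that do not mention both first and last name.
        if not all(token in words for token in name_tokens):
            continue
        hits = sorted(words & keywords)
        if hits:
            flagged.append((snippet, hits))
    return flagged


# Hypothetical search-result snippets, hard-coded for the example.
snippets = [
    "Jane Doe opens a new bakery in Leeds",
    "Local trader Jane Doe convicted of fraud and theft",
    "John Smith sentenced to prison",
]
flagged = screen_results("Jane Doe", snippets)
for snippet, hits in flagged:
    print(hits, "->", snippet)
```

Only the second snippet is flagged, because it is the only one that contains both the name and a negative keyword. Exact word matching like this misses variants such as “fraudulent” or “imprisoned”, which is one reason the text processing steps discussed later in this article matter.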

When conducting such checks, firms that do not have a robust and well-defined technology framework in place are bound to be error-prone and to have much longer processing times. Technologies such as NLP, when integrated with others such as robotic process automation (RPA), can have a tremendous impact on how “KYC Google Checks” are performed and handled.

For example, Synpulse created a number of processes at a leading Dutch private bank with the aim of automating its KYC, AML, and Google Checks of new clients. The bank's client due diligence checks are now fully automated and run without assistance. Most importantly, we saw an efficiency gain of 80%: four out of five full-time equivalents (FTEs) are no longer needed to complete the process and can be engaged in less repetitive and more productive tasks elsewhere.

Natural language processing

Natural language processing is an area within artificial intelligence that has been around for many decades. However, it has only been commercially accessible in more recent years. From a technological standpoint, it focuses on the way a computer or device is able to process and manipulate human language in all its different forms and idiosyncrasies.

One of the most well-known examples of NLP in the real world would be the popular virtual assistants, like the Google Assistant, Alexa, or Siri, which listen out for a prompt (e.g., “Hey Google”), input it as a query in natural language, process it, determine the answer, and then provide an output to the query in natural language.

Although the time between a query and an output seems short and the exchange straightforward, the amount of computation that goes on behind the scenes is vast.

When a query is received, the software must first establish what is being asked of it. By comparison, if a human were asked the question “Should I wear this coat today?”, there are several possibilities as to what is actually being queried:

“Is it cold enough outside to need a coat?”

“Is it raining right now?”

“Is it going to rain later?”

“Is this coat the correct one to wear or would you recommend a more stylish one?”

When the software is asked the same question, it must also establish the semantics behind the query to determine what the most suitable response is.

To compute the above query, the software must complete several steps, from sentence and word segmentation, where it splits the text into sentences and the sentences into words, to text lemmatisation, stemming, and so on.
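As a rough illustration of the first of these steps, the sketch below performs naive sentence segmentation, word segmentation, and stemming using only the Python standard library. The function names and the suffix list are assumptions made for this example; the stemmer is a toy suffix stripper, not the Porter algorithm or any production method. Lemmatisation is left out because it needs a dictionary of word forms.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive word segmentation: lower-cased runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())


def stem(word):
    """Toy stemmer: strip one common suffix if at least three letters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


text = "Should I wear this coat today? It rained yesterday."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
stems = [[stem(t) for t in sentence] for sentence in tokens]

for sentence, sentence_stems in zip(sentences, stems):
    print(sentence, "->", sentence_stems)
```

The stemmer reduces “rained” to “rain”, but it also turns “this” into “thi”. Crude rules like these over-strip, which is why stemming is usually complemented by dictionary-based lemmatisation.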

To better demonstrate how NLP works, below is an extract from a BBC News article² from 1966, when England won the football World Cup. The article was analysed, and several queries were put to different NLP algorithms³ to show how much the outcomes can differ if questions are not worded carefully. The results are shown below:

The above table of results is very helpful in highlighting the advantages and frailties of NLP. It is possible to see that the different algorithms would not always provide the same output, and this is due to the way in which a query is processed.

For example, in question number two, the third algorithm provides an incorrect number because it does not stop searching once it reaches the correct number. Instead, it continues through the passage until it has found all the numbers and, because the question included the phrase “how many”, incorrectly assumes that the result is their sum.

Most interesting are the results of questions three and four. From a human perspective, the two queries are the same, merely worded slightly differently. However, in question three, the algorithms are unable to determine the correct answer.
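The failure mode described here can be imitated with two deliberately simple strategies. The sketch below is an assumption-laden illustration: the article does not disclose how the algorithms work internally, the function names are invented, and the passage is a made-up sentence rather than the BBC extract.

```python
import re


def first_number(passage):
    """Stop at the first number found in the passage."""
    match = re.search(r"\d+", passage)
    return int(match.group()) if match else None


def sum_of_numbers(passage):
    """Keep scanning and add up every number found, as the faulty strategy does."""
    return sum(int(n) for n in re.findall(r"\d+", passage))


# Hypothetical passage, not taken from the BBC article.
passage = "The home side scored 3 goals in front of 500 fans after 90 minutes."

print("first number:", first_number(passage))
print("sum of all numbers:", sum_of_numbers(passage))
```

For the question “how many goals were scored?”, the first strategy happens to return the right figure here, while the second adds the crowd size and the minutes played to it. Neither strategy understands the question, which is the point: the answer depends on how the query is interpreted.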

This is particularly significant when looking at it from a business perspective. It shows that we cannot throw questions at the algorithms without first manipulating the data in some form so the solution can correctly determine the output. Therefore, it is important to mention that poor input will generate poor output.

NLP applied in KYC

Using the “Google Check” example from the section above, it is possible to showcase how a robust and holistic risk-based technology framework can address both the data problems and the operational problems of AML and KYC.

When focusing on the data challenge, KYC requires firms to ingest and assess large quantities of data, often in multiple formats. By using a combination of NLP and ML, a software application can learn to identify and flag data anomalies. Whilst NLP enables the solution to understand what is being ingested, ML enables the solution to continue to learn and adapt from every query. This ensures the solutions continue to evolve as the checks are processed.

The operational efficiency challenges in KYC can be mitigated by applying RPA to the process. Whilst NLP and ML focus on analysing and interpreting the data, RPA carries out any preprogrammed manual steps in the process, reducing cost and significantly increasing the speed of a KYC process.

This combination is yet to be widely applied within KYC practices. However, a Tractica report⁴ on the NLP market has estimated that the total NLP software, hardware, and services market could potentially exceed GBP 16.7 billion by 2025. Furthermore, it also forecasts that NLP software solutions that utilise AI will see a market growth from GBP 101 million in 2016 to GBP 4 billion by 2025.

Figure 1. Future NLP revenue prediction
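
To put the forecast for AI-based NLP software into perspective, the implied compound annual growth rate can be worked out from the two figures quoted above. The calculation below is our own back-of-the-envelope arithmetic, not a number taken from the Tractica report, and the function name `cagr` is simply a label chosen for this example.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the text: GBP 101 million in 2016, GBP 4 billion by 2025.
growth = cagr(101e6, 4e9, 2025 - 2016)
print(f"Implied annual growth: {growth:.1%}")
```

Under these assumptions, the forecast corresponds to the market growing by roughly half its size every year for nine years.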

How can Synpulse help?

Synpulse has a wealth of experience in automating a multitude of processes across different sectors and understands in depth the importance of ensuring that the data involved is of the highest possible quality.

NLP, whilst an important and crucial aspect of intelligent automation, has its limitations and is not simply a quick fix to be thrown at a process. It should ideally be combined with additional technologies, such as ML or RPA, to ensure the greatest accuracy of results is obtained.

Synpulse has a substantial pool of professionals with extensive implementation experience and robust judgment on how to best utilise the different technologies within the intelligent automation domain. They guarantee that a process is not only automated correctly, but also efficiently, and requires little human interaction.

When working on a process with our clients, we take every measure to ensure that the resulting solution takes a holistic approach and, therefore, meets the needs of the client as well as possible.

