Legal documents play a vital role in protecting the interests of a business and its owners over the course of a company’s lifetime. Documents such as master agreements, shareholder or partnership agreements, and memoranda of understanding specify how a company’s business affairs should be organised. In wholesale banking, for example, the contractual relationships that a bank has with its clients dictate how the bank should calculate daily margin calls, how it should estimate its counterparty credit exposures, and whether a client is entitled to protection under the client money rules. It is therefore vital that audit professionals pay close attention to legal documents and provide sufficient assurance that a company’s day-to-day business affairs are being conducted in accordance with its legal and contractual obligations.
But this is easier said than done. Even a small to medium-sized organisation can easily have hundreds of legal documents of every size and level of complexity. Many of these documents may also be non-searchable scanned images of signed agreements, which makes them unsuitable for digital analysis – i.e. they can only be reviewed through an immensely time-consuming manual process. Historically, due to time and resource constraints, audit professionals have therefore been able to review only a sample of legal documents during the course of their audit work. But the reality is that even a small rate of error in applying the contractual terms laid down in the legal documents could expose an organisation to significant legal, financial and regulatory risks. So, the million-dollar question is: what can audit professionals do to provide a more holistic assessment of an organisation’s exposure to such risks?
This is where modern digital tools and data analysis techniques come to the fore. Recent advancements in the areas of machine learning, natural language processing, image processing and optical character recognition (OCR) mean that it is now possible to use sophisticated, freely available and open source digital tools to simplify the process and significantly reduce the time it takes to review legal documents. It is therefore vital that audit professionals familiarise themselves with these emerging technologies and data analysis techniques, and have an appreciation for what these tools can and can’t do. With that objective in mind, this article is an attempt to provide a brief introduction to a collection of cutting-edge open source digital tools that one can use to build a fully automated, end-to-end legal document analysis process.
So, with that background, here is the list of my favourite open source digital tools for the task of auditing legal documents.
Selenium for automating administrative tasks
Obtaining a large collection of legal documents might sound like a trivial task (e.g. just ask IT or the Legal team to send them to you). But to implement a complete end-to-end data analysis process, you will still need additional information on each file (e.g. which client it belongs to, the type of agreement, when it was created etc). To avoid confusion, you will also have to make sure that the file names are unique. This is where you may have to use Robotic Process Automation (RPA) techniques. Selenium is an open source browser automation tool that you can use to fully automate the task of logging into the legal system and obtaining the legal documents and related metadata.
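To make this concrete, here is a minimal sketch of how such a download step might look with Selenium's Python bindings. The portal URL paths, the element IDs, the CSS selectors and the data attributes are all hypothetical, since every legal system will have its own page layout. The DocumentRecord fields and the unique_file_name helper are likewise my own illustrative names.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentRecord:
    """Metadata kept alongside each downloaded file (field names are assumptions)."""
    doc_id: str
    client: str
    agreement_type: str
    created: str  # ISO date string, e.g. "2019-03-31"
    url: str


def unique_file_name(record: DocumentRecord) -> str:
    """Build a file name that is unique as long as doc_id is unique."""
    def slug(text: str) -> str:
        # Lower-case and replace every run of non-alphanumerics with a hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    return (
        f"{slug(record.client)}_{slug(record.agreement_type)}"
        f"_{record.created}_{slug(record.doc_id)}.pdf"
    )


def collect_records(base_url: str, username: str, password: str) -> list:
    """Log in to a hypothetical legal-document portal and list its documents."""
    # Imported lazily so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        # Hypothetical login page and element IDs.
        driver.get(f"{base_url}/login")
        driver.find_element(By.ID, "username").send_keys(username)
        driver.find_element(By.ID, "password").send_keys(password)
        driver.find_element(By.ID, "login-button").click()

        # Hypothetical listing page with one table row per document.
        driver.get(f"{base_url}/documents")
        records = []
        for row in driver.find_elements(By.CSS_SELECTOR, "tr.document-row"):
            link = row.find_element(By.CSS_SELECTOR, "a.download")
            records.append(DocumentRecord(
                doc_id=row.get_attribute("data-doc-id"),
                client=row.get_attribute("data-client"),
                agreement_type=row.get_attribute("data-type"),
                created=row.get_attribute("data-created"),
                url=link.get_attribute("href"),
            ))
        return records
    finally:
        driver.quit()
```

Putting the document ID into the file name is what guarantees uniqueness here; the client and agreement type are included only to keep the names readable.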
Tesseract for extracting text from scanned legal documents
Optical Character Recognition (OCR) technology allows you to convert non-searchable scanned documents into machine-readable text. Tesseract, a Google sponsored project, is the leading open source OCR engine and supports more than 100 languages out of the box (Tesseract is used by Google for text detection on mobile devices, in video, and in Gmail image spam detection). In late 2018, the Tesseract project released version 4 of the software, which introduced a neural network (LSTM) based recognition engine that significantly improves the accuracy of image to text conversion.
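As a rough illustration, the sketch below runs Tesseract over a folder of page images through the pytesseract wrapper and the Pillow imaging library. It assumes that the Tesseract binary and both Python packages are installed and that each page has already been saved as a PNG file; the function names are my own.

```python
from pathlib import Path


def tidy_ocr_text(raw: str) -> str:
    """Drop form feeds and trailing whitespace that OCR output often carries."""
    lines = [line.rstrip() for line in raw.replace("\x0c", "").splitlines()]
    return "\n".join(lines).strip()


def ocr_page(image_path, lang: str = "eng") -> str:
    """Run Tesseract on a single page image and return the cleaned text."""
    # Imported lazily so tidy_ocr_text can be used without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return tidy_ocr_text(pytesseract.image_to_string(image, lang=lang))


def ocr_folder(folder, pattern: str = "*.png") -> dict:
    """OCR every matching image in a folder, keyed by file name."""
    return {
        path.name: ocr_page(path)
        for path in sorted(Path(folder).glob(pattern))
    }
```

Scanned PDFs would first need to be split into page images, which is a separate step not shown here.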
spaCy for validating the accuracy of image to text conversions
The accuracy of the OCR process (i.e. image to text conversions) very much depends on the quality of the scanned images (no coffee stains please!). You will therefore have to rely on some automated checks and balances to validate the accuracy of the image to text conversions. This is where you can use spaCy, an industrial-strength Natural Language Processing library in Python. With the help of spaCy, you will be able to assess whether the vast majority of the text extracted from images consists of valid dictionary words. If this test fails, you can apply a range of fully automated image correction techniques (see below) to improve the accuracy of the OCR process. Bonus tip: the cover and signature pages of a legal document will most likely fail this test as they typically have sparse text (e.g. the names of the parties to the contract, which are likely to be non-dictionary words).
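One way to implement this check is to measure the share of alphabetic tokens that the language model recognises. The sketch below is only one possible approach: it treats a word as "known" when spaCy has a word vector for it, which assumes a model that ships with vectors (en_core_web_md here, as the small models do not). The 0.8 threshold is an arbitrary starting point of mine, not a recommended value.

```python
MIN_KNOWN_RATIO = 0.8  # assumed threshold; tune it on your own documents


def known_word_ratio(words, is_known) -> float:
    """Share of alphabetic words for which is_known(word) is true."""
    alphabetic = [word for word in words if word.isalpha()]
    if not alphabetic:
        return 0.0
    return sum(1 for word in alphabetic if is_known(word)) / len(alphabetic)


def spacy_known_word_ratio(text: str, model_name: str = "en_core_web_md") -> float:
    """Compute the ratio using spaCy's tokenizer and vocabulary."""
    # Imported lazily; in real use, load the model once and reuse it.
    import spacy

    nlp = spacy.load(model_name)
    words = [token.text for token in nlp.make_doc(text) if token.is_alpha]
    # A lexeme is out-of-vocabulary when the model has no vector for it.
    return known_word_ratio(words, lambda word: not nlp.vocab[word].is_oov)


def page_needs_correction(text: str) -> bool:
    """Flag a page whose OCR text contains too few recognised words."""
    return spacy_known_word_ratio(text) < MIN_KNOWN_RATIO
```

Pages with no alphabetic text at all score 0.0 and are therefore flagged, which is consistent with cover and signature pages tending to fail the test.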
ImageMagick and OpenCV for improving the quality of scanned documents
Tesseract performs various image processing operations internally before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn’t good enough, which can result in a significant reduction in accuracy. This is where you can use open source image processing tools such as ImageMagick and OpenCV to improve the quality of the input images so that the Tesseract OCR engine gives you more accurate output. The image correction techniques that you can apply using these tools include de-skewing (i.e. straightening an image that is slanting too far in one direction), rescaling (OCR works best on images which have a DPI of at least 300, so it may be beneficial to resize the images), removing background noise and borders (scanned pages often have dark borders around them, which can be erroneously picked up as extra characters, especially if they vary in shape and gradation), de-blurring, cropping and more.
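The sketch below shows what a simple clean-up pass might look like with OpenCV's Python bindings. It covers only rescaling, light noise removal and binarisation; de-skewing and border removal are left out to keep it short. It assumes you know (or can estimate) the DPI at which the page was scanned, and the function names and the median filter size are my own choices.

```python
def scale_factor_for_dpi(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Factor by which to enlarge an image so it reaches the target DPI.

    Images already at or above the target are left at their current size.
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def enhance_for_ocr(input_path: str, output_path: str, current_dpi: float) -> None:
    """Rescale, denoise and binarise a scanned page before OCR."""
    # Imported lazily so the helper above works without OpenCV installed.
    import cv2

    image = cv2.imread(input_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {input_path}")

    # Rescale: enlarge low-resolution scans towards 300 DPI.
    factor = scale_factor_for_dpi(current_dpi)
    if factor > 1.0:
        image = cv2.resize(
            image, None, fx=factor, fy=factor, interpolation=cv2.INTER_CUBIC
        )

    # Denoise: a small median filter removes isolated speckles.
    image = cv2.medianBlur(image, 3)

    # Binarise: Otsu's method picks the black/white threshold automatically.
    _, image = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    cv2.imwrite(output_path, image)
```

The enhanced image can then be passed back to the OCR step to see whether the extracted text improves.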
LexNLP for extracting and analysing legal clauses
Having thousands of pages of digitised legal text is no good if you have to review them manually. This is where you will need LexNLP, a Python based open source Natural Language Processing library for working with real and unstructured legal text, including contracts, plans, policies, procedures, and other material. LexNLP provides functionality to automatically extract a broad range of facts from legal texts – e.g. monetary amounts, non-monetary amounts, percentages, ratios, conditional statements and constraints, geographical locations, dates and durations, courts, regulations, and citations.
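As an illustration, the sketch below wraps a handful of LexNLP extractors behind a single function. It assumes the lexnlp package is installed and that its English extraction modules expose get_dates, get_money, get_durations and get_percents as described in the LexNLP documentation; please check these against the version you install. The extract_facts helper and the dictionary keys are my own naming.

```python
def extract_facts(text: str, extractors: dict) -> dict:
    """Run every extractor over the text and collect the results by name."""
    return {name: list(extract(text)) for name, extract in extractors.items()}


def lexnlp_extractors() -> dict:
    """Return a few LexNLP extractors keyed by the kind of fact they find."""
    # Imported lazily so extract_facts can be used with any extractor set.
    import lexnlp.extract.en.dates
    import lexnlp.extract.en.durations
    import lexnlp.extract.en.money
    import lexnlp.extract.en.percents

    return {
        "dates": lexnlp.extract.en.dates.get_dates,
        "durations": lexnlp.extract.en.durations.get_durations,
        "money": lexnlp.extract.en.money.get_money,
        "percents": lexnlp.extract.en.percents.get_percents,
    }


def extract_legal_facts(text: str) -> dict:
    """Extract dates, durations, monetary amounts and percentages."""
    return extract_facts(text, lexnlp_extractors())
```

Keeping the extractors in a dictionary makes it easy to add further fact types, or your own pattern-based extractors, without changing the calling code.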
Elasticsearch for making the legal documents searchable
Elasticsearch is an open source full-text search and analytics engine. It allows you to store, search, and analyse large volumes of text quickly and in near real time. It is generally used as the underlying technology for applications with complex search features and requirements. You can therefore use Elasticsearch to implement Google-like text search for your digitised legal documents. You can then build on this to support both manual searches and fully automated text-pattern searches.
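A minimal sketch of the indexing and search steps, using the official Elasticsearch Python client, might look as follows. It assumes a node running at http://localhost:9200 and a recent client version that accepts the document and query keyword arguments; the index name, the "text" field and the helper names are hypothetical.

```python
def phrase_query(phrase: str, field: str = "text", slop: int = 0) -> dict:
    """Build a match_phrase query; slop allows that many words in between."""
    return {"match_phrase": {field: {"query": phrase, "slop": slop}}}


def connect(url: str = "http://localhost:9200"):
    """Create a client for an assumed local Elasticsearch node."""
    # Imported lazily so the query builder works without the client installed.
    from elasticsearch import Elasticsearch

    return Elasticsearch(url)


def index_document(es, index: str, doc_id: str, text: str, metadata: dict) -> None:
    """Store the OCR text of one document together with its metadata."""
    es.index(index=index, id=doc_id, document={"text": text, **metadata})


def search_phrase(es, index: str, phrase: str, slop: int = 0, size: int = 10) -> list:
    """Return (document id, relevance score) pairs for a phrase search."""
    response = es.search(index=index, query=phrase_query(phrase, slop=slop), size=size)
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]
```

An automated text-pattern check could then be as simple as running a fixed list of phrases through search_phrase and reporting which documents match.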
Python as the glue and Linux as the platform
Finally, you can integrate all of the above tools and techniques into a single cohesive software solution by using Python (a high-level programming language) as the glue. Although you can use a Windows machine to implement this end-to-end legal document analysis solution, my personal preference would be to use a Linux environment (e.g. Red Hat or Ubuntu) for a variety of technical reasons that I will not dwell upon in this article.
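To give a flavour of what this glue code might look like, the sketch below wires the OCR, validation and image correction steps into a simple retry loop. The three steps are passed in as plain functions, so the same loop works whichever tools sit behind them. The function name, the returned dictionary keys and the default of two attempts are my own assumptions.

```python
def process_page(image, ocr, is_readable, enhance, max_attempts: int = 2) -> dict:
    """OCR a page, enhancing the image and retrying if the text looks wrong.

    ocr(image) returns text, is_readable(text) returns a bool, and
    enhance(image) returns an improved image.
    """
    text = ocr(image)
    attempts = 1
    while not is_readable(text) and attempts < max_attempts:
        image = enhance(image)
        text = ocr(image)
        attempts += 1
    return {"text": text, "attempts": attempts, "readable": is_readable(text)}


def process_document(pages, ocr, is_readable, enhance) -> list:
    """Apply process_page to every page of a document, in order."""
    return [process_page(page, ocr, is_readable, enhance) for page in pages]
```

Pages that are still unreadable after the final attempt are returned with readable set to False, so they can be routed to a manual review queue rather than silently dropped.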
To summarise, the technology for automating the resource-intensive parts of the audit process for reviewing legal documents exists today and is freely available as open source software (it will undoubtedly get even better over time). Such technology offers audit professionals an exciting opportunity to significantly increase both the speed and quality of their audits. It is therefore time that audit professionals started exploring and embracing these new and emerging technologies.