April 28, 2021
Every decent AI system needs plenty of high-quality data with which to train. This is especially important when adapting to local needs, because getting it wrong could be disastrous, as Volvo discovered when its Scandinavian-trained autonomous vehicle prototypes didn’t know what to do on encountering kangaroos in Australia.

Building efforts

It may be expensive to acquire and curate the right local data to build a good AI system. But it is important to invest in finding suitable training data, collecting it, correcting any errors and ensuring the data is not corrupted (for example, by a cyberattack).

AI developers have many ways to find good data:

Use data in the public domain (some risk of bias and data unsuitability);
Purchase data;
Generate new data, for example by using a text and data mining (TDM) system, which automates the discovery of new information from different written resources.

After making all that investment, it is critical not to allow a third party to disrupt your operations and business model through legal action. This is why anyone using a TDM system should be aware of two key copyright concerns.

What qualifies as copyright?

Most would assume data collected for training by a TDM system could be protected as a database. But if the system merely captures and dumps data into a file without any organisation, it may not qualify for copyright protection as a “database” since the data isn’t arranged in an “original” way.

Regarding database protections: UK law has provisions for databases defined as collections of independent works, data or other materials arranged in a systematic or methodical...