Ever heard that the best way to predict how long something will take is to estimate a number, then double it?
It’s one of the most notoriously difficult problems for project managers, and one that is ripe for data analysts to explore.
We recently looked at this problem during GovHack Melbourne 2016, a 48-hour event sponsored by the Australian government to encourage creative solutions to government service delivery.
Using data from IP Australia, the Commonwealth intellectual property authority, we created a model to estimate how much time would be needed for any given patent application to be processed.
This type of problem is ideal for big data and machine learning analysis. By combing through large, loosely structured datasets, machine learning can find trends and build prediction models. Businesses can then use these models to increase growth – whether optimizing marketing campaigns to reduce customer acquisition cost, refining business processes to reduce overheads, or testing category selection to increase revenue.
Let’s go through the basic steps to build this kind of prediction model.
Selecting which data to use
To start with, we needed to understand exactly what we were measuring.
There is no universal standard for when a patent application is actually “decided” – is it when a report is issued? When the applicant is notified? When published by the World Intellectual Property Organization?
This was a fundamental question that we could only answer through discussions with representatives from IP Australia.
We decided to use the date on which a patent is published in the Australian Official Journal of Patents, known as the “Acceptance Published Date”.
This let us calculate the number of days that had elapsed between Application Date and Acceptance Published Date for all patents in the database.
We next needed to understand what factors influenced this number.
There were a number of other tables in the IP Australia database – applicant and agent details, claims and oppositions, changes of ownership, and so on.
However, we only wanted to predict when a patent application would be finalized. We didn’t care if it was subsequently sold or lapsed, so we safely discarded all that information.
Intuitively, you would expect an application to need more or less time depending on what type of patent is applied for, the classification, the country of origin, and so on.
Most of this information was captured in the following database table:
|Later (ie Secondary)|
|Later (ie Secondary)|
|Later (ie Secondary)|
|Later (ie Secondary)||Advanced||Inventive||12/8/2015 11:38:03 AM|
|First (ie Primary)||Advanced||Inventive||12/8/2015 11:50:55 AM|
|First (ie Primary)||Advanced||Inventive||12/8/2015 2:58:38 PM|
|First (ie Primary)||Advanced||Inventive||12/8/2015 3:28:03 PM|
This is only a moderate sized table – 831mb uncompressed and 5,845,156 rows in total, covering patent applications as far back as 1920.
Although it’s nowhere near the level of “big” data, it’s starting to hit the limits of what can be manipulated directly in Excel.
Trimming the dataset
Our working dataset still contained information that might seem relevant to our question, but actually has no bearing in reality.
How long patent applications took in the 1970s, for example, isn’t really relevant any more. IP Australia’s technology, workload and business processes are probably completely different. If we use this type of irrelevant data, we’re going to undermine the accuracy of our prediction model.
We were comfortable excluding all patent applications lodged prior to the year 2000.
This was a completely arbitrary cut-off. If a new system had been implemented in 2008, that would have been a better choice. If there had been very little change since 1995, this would have given us more source data to increase the accuracy of the model.
Given the 48-hour time constraint, it seemed a reasonable assumption to make.
By this stage, we had already trimmed the data twice – first, by removing unnecessary tables, and second, by removing unnecessary rows.
We still needed to decide which columns should be fed into our model (“features”, in machine learning lexicon). We want to get rid of features that are unlikely to influence what we are trying to predict.
This is normally a difficult task. We want the model itself to discover the relationship – we don’t want to prejudice everything with our own assumptions.
For this specific dataset, however, it was less complicated.
Application IDs clearly don’t matter – they could be completely random selections of letters for all we care – and were easily discarded.
In addition, two columns (ipc_mark_value and edited_ipc_mark_value) simply aggregated values from other columns, so these could safely be set aside.
Converting data to numbers
This gave us a dataset highly relevant to our objective, but not so easy to manipulate.
Many of the values were text, descriptions or labels – we needed to transform these into something more easily manipulated with mathematics.
So how are we going to transform words like “Advanced” or “First (ie Primary)” to numbers?
Conventionally, there are two ways to do this.
The first is called Label Encoding – listing all possible unique values for a column, and converting each value to its equivalent position in the list.
For example, the column ipc_mark_type_code can be set to one of two values – “First (ie Primary)” or “Later (ie Secondary)”. We can replace all instances of the former with 0, and all instances of the latter with 1.
The second is called One-of-K encoding. This involves converting each unique value to its own column, assigning a value of 1 where the original column was set to that value, or a 0 otherwise. The column ipc_mark_type_code, for example, could be represented as:
Label Encoding is the simplest, but it can also imply associations between numbers that don’t actually exist.
For example, using Label Encoding on a list of 100 products will treat products 50 and 51 as “closer” than products 1 and 99, even though this is meaningless in reality.
One-of-K Encoding, on the other hand, drastically increases the size of your dataset. On small datasets, this is tolerable. On large datasets with multiple columns, each with thousands of distinct values, this will significantly impact the time required to process the data.
Cleaning and normalizing the data
Our working dataset was now a single table that was full of numbers – mostly.
If you have a keen eye, you might see a number of gaps in the ipc_class_version_date, classification_code or classification_level_code columns in the original table.
We need to transform these “empty” values into numbers.
Zero might be appropriate in some circumstances (though this can skew the data if we’re not careful – in fact, too many empty values suggests we should eliminate the column altogether).
Replacing “empty” cells with the average of each column is also a valid choice – and the method we chose for this specific project.
Ideally, we also wanted the data in each column to be structured as similarly as possible.
For example, years (2000 through 2016) use much larger numbers than encoded classifications (0 or 1).
Variations in larger numbers can obscure variations in the smaller numbers.
The simplest way to address this problem is to normalize each column – calculating the mean and the standard deviation, and then for every value, subtracting the former and dividing by the latter.
Once the table was normalized, we could look at the impact of each column on a like-for-like basis.
Linear regression – final step
The final step was definitely the juiciest – applying an algorithm to model the relationship.
But it was also the most straightforward. Most of our time was spent selecting, transforming, and cleaning the data.
Without this, we wouldn’t have our list of numerical inputs – patent classifications, country of origin, and so on. We also wouldn’t have our outputs – the number of days from start to finish for every patent applications.
With these lists, we can model the relationship between the two. If we change the country of origin from Australia to China, does that increase the length of the application process by 20%? Do mechanical patents generally need 6 months longer than biomedical patents?
To do this type of analysis, we chose an ordinary least squares linear regression. Although this is a very basic technique, it handled this particular problem extremely well.
We also used cross-validation – creating a model from a subset of the data, and testing how well that model performs on the rest of the data.
Our mean absolute percentage error (MAPE) – basically a measure of how accurate our predictions were – was around 16.5%.
Given the time constraints, we felt this was an excellent outcome.
This figure did conceal quite a large variance, but eyeballing the large predictive errors let us identify some clear areas for improvement.
In any event, we felt this was a perfect demonstration of using statistical analysis to address real-life business problems.
Big data analytics
While the IP Australia data was only moderately sized, the same techniques are equally applicable to datasets running to the order of hundreds of gigabytes or terabytes – true “Big Data” territory.
In fact, the greater the size of the dataset, the more accurate the model can be in its predictions.
That’s not to say these models will ever be perfect – the real world is a bit too random and chaotic for that.
They are, however, incredibly useful in terms of reducing risk and improving delivery, particularly when dealing with problems at huge scale.
Retail is a perfect example of this – improving predictions by even 1% can translate to enormous revenue impact across hundreds of stores and millions of transactions.