In August 2016, AWS launched Amazon Kinesis Data Analytics, a tool that lets users process streaming data in real time using only standard SQL. The tool is fully managed, so users don't have to learn any new programming languages or processing frameworks.
Since its initial release, Amazon have continued to add new features to the product, and in the last few days they have released a new machine learning feature for detecting hotspots.
What does this feature add?
It enables users to detect so-called "hotspots" in streaming data; that is to say, regions of a data stream where activity is significantly higher than expected. Previously, users would have had to build and train a machine learning model first.
The new upgrade removes the need for that extra work. This makes it both easier and quicker to identify and deal with sections of your data that require immediate attention and, if necessary, to output the results to Kinesis Firehose, a Data Stream or even an AWS Lambda function.
The function parameters are inputStream, windowSize and scanRadius. Although the right values for these parameters vary by application, you can experiment with them in the console until you get the results you're looking for.
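To get a feel for how two such parameters might interact, here is a toy sketch in Python. It is not the algorithm the service uses. It assumes, purely for illustration, that windowSize is the number of recent records kept in view and that scanRadius is the distance within which two points count as neighbours; the function name and the sample readings are hypothetical.

```python
from collections import deque
from math import dist


def neighbour_counts(points, window_size, scan_radius):
    """Yield (point, count), where count is how many of the previous
    `window_size` points lie within `scan_radius` of the new point."""
    window = deque(maxlen=window_size)  # oldest points fall out automatically
    for point in points:
        count = sum(1 for other in window if dist(point, other) <= scan_radius)
        yield point, count
        window.append(point)


# Hypothetical 2-D readings: a tight cluster near (5, 5) plus scattered points.
readings = [(0.0, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 1.0), (5.0, 5.1), (4.9, 5.0)]
for point, count in neighbour_counts(readings, window_size=5, scan_radius=0.5):
    print(point, count)
```

In this toy version, a larger window or radius makes more points look crowded, while smaller values flag only very tight clusters, which is the kind of trade-off you would be tuning in the console.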
The function accepts the DOUBLE, FLOAT, INTEGER, TINYINT, SMALLINT, BIGINT and REAL data types. DECIMAL is not a supported type, however, so use DOUBLE instead.
The output of the function is a table object with the same schema as the input, plus an additional column containing a JSON string that describes all the data abnormalities found in the record.
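A consumer of that output, such as a Lambda function, would need to decode the JSON column before acting on it. The sketch below shows one way that might look in Python. The field names used here ("hotspots", "density", "minValues", "maxValues") and the sample value are assumptions made for illustration, so check them against the records your own application emits.

```python
import json

# Hypothetical value of the extra JSON column for one output record.
# The field names below are assumed, not taken from the service's output.
sample = '{"hotspots": [{"density": 12.5, "minValues": [1.0, 2.0], "maxValues": [1.5, 2.5]}]}'


def summarise_hotspots(json_text):
    """Return a list of (density, min_corner, max_corner) tuples."""
    payload = json.loads(json_text)
    return [
        (float(h["density"]), tuple(h["minValues"]), tuple(h["maxValues"]))
        for h in payload.get("hotspots", [])
    ]


for density, low, high in summarise_hotspots(sample):
    print(f"density={density} box={low}..{high}")
```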
Practical real-world uses
A compelling practical use of this function would be for a taxi or ride-share company that tracks its vehicles' pick-up and drop-off locations. The data might indicate a serious traffic hold-up or bottleneck, which could be due to an accident or roadworks.
The data might already be publicly accessible, but with a short Python script you can load the records into a Data Stream which, in turn, feeds your Data Analytics application. Then you simply output all of that to a Firehose connected to Amazon Elasticsearch Service and use an application like Kibana to visualise it all. Other uses for the function could include monitoring a data centre for overheating servers.
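A loading script along those lines could be sketched as follows. The stream name, the trip fields and the helper names are all hypothetical. The sketch uses a stand-in client so that it runs without AWS credentials; the stand-in mirrors the put_record call (StreamName, Data, PartitionKey) that boto3's Kinesis client provides.

```python
import json


def to_kinesis_record(trip):
    """Encode one trip as the keyword arguments for a put_record call."""
    return {
        "Data": json.dumps(trip).encode("utf-8"),
        # The partition key decides which shard receives the record.
        "PartitionKey": str(trip["vehicle_id"]),
    }


def load_stream(client, stream_name, trips):
    """Send each trip to the stream and return how many records were sent."""
    sent = 0
    for trip in trips:
        client.put_record(StreamName=stream_name, **to_kinesis_record(trip))
        sent += 1
    return sent


class PrintingClient:
    """Stand-in for a real Kinesis client so the sketch runs offline."""

    def put_record(self, StreamName, Data, PartitionKey):
        print(StreamName, PartitionKey, Data.decode("utf-8"))


# Hypothetical trip records; real data would come from your own source.
trips = [
    {"vehicle_id": "cab-1", "pickup_lat": 40.75, "pickup_lon": -73.99},
    {"vehicle_id": "cab-2", "pickup_lat": 40.76, "pickup_lon": -73.98},
]
load_stream(PrintingClient(), "taxi-trips", trips)
```

With boto3 installed and credentials configured, passing boto3.client("kinesis") in place of PrintingClient() should send the same records to a real stream. For larger volumes, batching the records would likely be more efficient than one call per record.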
So many services generate huge amounts of data: web applications, wearables, mobile devices, industrial sensors and software applications are capable of generating terabytes of information per hour.
All of that information needs to be continuously gathered, stored and processed. Kinesis makes those tasks simpler and cheaper than ever, and now, with the Hotspot function, any data abnormalities can be spotted early and addressed.