As an organization, much of what you do with your systems involves data. You strive to gain insight from data to better understand your systems and their behavior, which in turn allows you to make informed decisions. The capability to transform data into actionable insights is key to gaining a competitive advantage in the industry. Fundamentally, this capability transforms organizations from reactive environments managed with static, aged data into automated environments that learn continuously in real time.
To understand how you can build Real-time Data Analysis and implement Machine Learning, let us walk through a typical process of building a real-time social media analytics system. This system analyzes Twitter hashtag trends in real time, calculating the total number of activities, engagement rate, total reach, and the number of impressions your hashtags received. Using Machine Learning, further analysis of this data helps us understand the overall sentiment and identify the influencers associated with your deployed hashtags.
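To make the four metrics concrete, here is a minimal sketch in plain Python. The `Tweet` class and its fields are hypothetical simplifications of a real Tweet object, and the metric definitions (reach as the followers of unique authors, impressions as followers summed over every tweet, engagement rate as interactions divided by impressions) are assumptions for illustration, since definitions vary between analytics tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    # Hypothetical, simplified stand-in for a real Tweet object.
    author_id: str
    followers: int
    likes: int
    retweets: int
    replies: int


def hashtag_metrics(tweets):
    """Summarize a list of tweets that all mention one hashtag.

    The metric definitions below are assumptions, not a standard.
    """
    followers_by_author = {}
    impressions = 0
    engagements = 0
    for tweet in tweets:
        # Keep one follower count per author so reach counts each author once.
        followers_by_author[tweet.author_id] = tweet.followers
        # Every tweet is assumed to be shown to all of its author's followers.
        impressions += tweet.followers
        engagements += tweet.likes + tweet.retweets + tweet.replies
    return {
        "activities": len(tweets),
        "reach": sum(followers_by_author.values()),
        "impressions": impressions,
        "engagement_rate": engagements / impressions if impressions else 0.0,
    }
```

In a streaming setup, the same arithmetic would typically run per time window rather than over a complete list.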
The following diagram explains the major components of the Social Media Analytics System:
- Data acquisition – At this stage, data is collected, prepared, and forwarded for processing. We use the Twitter Streaming API to feed Tweet objects to Apache Kafka in real time.
- Data processing and pipeline – Steps such as preprocessing, sample selection, and the preparation of training datasets take place here, as a precursor to the execution of ML routines. Using Apache Spark, we stream Tweet objects from Apache Kafka and create several data pipelines. The first normalizes and cleans the data, then transforms it into data frames that are stored as Apache Parquet files in AWS S3, to be used as training datasets for Machine Learning. There are two more data pipelines: one stores the time-series data in InfluxDB, and the other is the sentiment analysis pipeline, which uses inferences from the Machine Learning model to produce a real-time sentiment polarity score.
- Feature extraction or feature engineering – A subset of the data processing component, where features that describe the structures inherent in your data are analyzed and selected. This data is stored, preferably in a columnar structure, using Apache Parquet, and is then used for supervised learning with TensorFlow.
- ML Engineering/Data modeling – This includes the data model designs and Machine Learning algorithms used, in the stages and models listed below:
- Model fitting, where a model is fit to a set of training data in order to make reliable predictions on new or unseen data
- Execution, the environment where the processed and trained data is forwarded for use in the execution of ML routines (such as experimentation, testing, and tuning)
- Deployment, where business-usable results of the ML process — such as models or insights — are deployed to enterprise applications, systems or data stores
- For Sentiment Analysis, we used LSTM networks implemented in TensorFlow.
- For predicting influencers, we used a sine wave function with Recurrent Neural Networks (RNNs) in TensorFlow.
- Real-Time Data Analytics refers to analytics that can be accessed immediately. An Apache Spark job writes the processed real-time analytics to a Kafka topic. A Node.js application consumes that topic and streams the data, using Socket.IO and REST APIs, to data visualization interfaces built with React.
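The cleaning and time-series steps in the components above can be sketched without any infrastructure. The snippet below is a plain-Python stand-in, not the Spark job itself: it normalizes a raw tweet and counts hashtag mentions per time window, which is roughly the shape of data you would write to a time-series store. The input keys `timestamp` (epoch seconds) and `text`, and both function names, are hypothetical.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://\S+")


def normalize_tweet(raw):
    """Clean one raw tweet dict into a flat record.

    Assumes hypothetical keys: "timestamp" (epoch seconds) and "text".
    """
    text = URL_RE.sub("", raw.get("text", ""))  # drop links
    text = " ".join(text.split())               # collapse whitespace
    return {
        "timestamp": int(raw["timestamp"]),
        "text": text,
        # Lowercase and de-duplicate so "#Kafka" and "#kafka" count as one tag.
        "hashtags": sorted({tag.lower() for tag in HASHTAG_RE.findall(text)}),
    }


def count_by_window(records, window_seconds=60):
    """Count hashtag mentions per fixed (tumbling) time window."""
    counts = Counter()
    for record in records:
        # Round the timestamp down to the start of its window.
        window_start = record["timestamp"] - record["timestamp"] % window_seconds
        for tag in record["hashtags"]:
            counts[(window_start, tag)] += 1
    return counts
```

In the architecture described here, the equivalent logic would run inside the Spark streaming job, with each `(window_start, tag)` count becoming one time-series point.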
Hopefully, this was helpful in understanding the approach taken towards building a real-time data analysis platform. Most organizations already have the data to analyze sitting in their existing CRM, ERP, CPM or other operational systems, but this information is often locked away or ignored. With approaches like the one described above, providing real-time analysis is no longer a costly endeavor.
Furthermore, developments in machine learning and artificial intelligence (AI) are revealing interesting new insights that would otherwise never have been observed through existing human-driven analytical models. These newfound capabilities are bringing greater understanding of customer engagement, customer patterns and behavior, and operational performance.
I would love to hear your thoughts about where we are headed, and the emergence of exciting new technology, in the comments.