DataViz for Easy Plotting

With machine learning and big data becoming ever more popular, it makes sense that we need better ways to look into what is going on behind the data and the models. There are a number of frameworks you can use to produce figures on demand; the most popular ones include matplotlib, seaborn, plot.ly, and bokeh, which is the one used in this project. Of course, there are many other choices.

I have been using matplotlib for quite a long time; it is regarded as the default plotting package to install if you use Python for machine learning. However, it now feels somewhat outdated, mainly because it does not support modern web browsers very well. Frameworks like bokeh and plot.ly, by contrast, were designed from the start for deploying and sharing plots over the web, and are built on modern web technologies such as JavaScript.

Frankly, I switched to bokeh a while ago because it produces great-looking figures and makes it easy to deploy them as a set of HTML files. I have not yet tried the other frameworks, but to my knowledge bokeh is one of the most promising ones. plot.ly is also great, although I dislike its concept of plotting through API calls. I will not discuss the pros and cons of the different plotting frameworks here; I chose bokeh because it supports both low-level and high-level plotting functions. The one thing I really miss is contour plots, which bokeh does not support, but alternative solutions exist for that too.

Motivation

So first, here is a simple demonstration of what the end result should look like (see the DataViz demonstration figure). The three examples in it are plotted for three different purposes in data analytics. The main purpose of building this DataViz wrapper on top of bokeh is to better facilitate my daily research. In machine learning, we are usually interested in a certain set of plots. For instance, when you build a model for a task, you probably want to evaluate it under varying parameters, so you plot a validation curve with mean and standard deviation with respect to the different parameter settings. You will also be interested in a learning curve showing how the model performs with respect to the training size.

scikit-learn even provides two very convenient functions for computing these common curves. See the example below:

import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve
...
# estimator, Xtr, ytr, eval_param, c_range, scoring_str are defined elsewhere
cvfolds = ShuffleSplit(n_splits=5, test_size=0.2)

# learning curve: scores as a function of the training-set size
tr_sizes, tr_scores, tt_scores = \
    learning_curve(estimator, n_jobs=-1, X=Xtr, y=ytr, cv=cvfolds,
                   train_sizes=np.linspace(0.1, 1, 5), scoring='accuracy')

# validation curve: scores as a function of a single hyper-parameter
tr_scores, tt_scores = \
    validation_curve(estimator=estimator, param_name=eval_param, param_range=c_range,
                     X=Xtr, y=ytr, scoring=scoring_str, n_jobs=-1,
                     cv=cvfolds, verbose=True)

These two functions are quite straightforward to use, and I use them often in experiments. The estimator corresponds to the classifier you build, which must implement fit and predict methods; if you do not pass the scoring argument, the estimator also needs to implement a score() method. Anyway, how to use them is out of scope for this post. Looking at the output, tr_scores and tt_scores are arrays of shape x_sizes by n_folds, where x_sizes corresponds to the x-axis (training sizes or parameter values) and n_folds corresponds to the number of repetitions in your cross-validation folds. At this point, it would be nice to have helper functions that turn these arrays directly into plots.
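For instance, here is a minimal sketch of the aggregation such a helper would start from (the variable names follow the snippet above; nothing here is DataViz-specific):

import numpy as np

# tr_scores / tt_scores have shape (x_sizes, n_folds);
# average over the folds to get one curve plus an error band
tr_mean, tr_std = tr_scores.mean(axis=1), tr_scores.std(axis=1)
tt_mean, tt_std = tt_scores.mean(axis=1), tt_scores.std(axis=1)
# tr_sizes (or the parameter range) then provides the x-axis values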

Bokeh

Frankly, bokeh still has a lot of room for improvement; in particular it does not provide contour/contourf plots, which I use a lot when plotting classification boundaries or density maps. But it does a really nice job of supporting modern browsers. It can directly generate HTML files that can be deployed on your own website or shared with friends, and it integrates seamlessly with Jupyter notebooks, which lets you play around with plots interactively. Another very nice feature is widgets and controls: you can create a Button, checklist, or input box very easily in bokeh and wire them up to your plots. Achieving this with matplotlib is painful; it was, after all, not designed for that.
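As a tiny illustration of the deployment story (plain bokeh, not DataViz), the sketch below writes a standalone HTML file containing a line plot and a button with a JavaScript callback; the file name and labels are arbitrary:

from bokeh.plotting import figure, output_file, show
from bokeh.models import Button, CustomJS
from bokeh.layouts import column

output_file("demo.html")                 # standalone HTML, no server needed

p = figure(title="demo")
p.line([1, 2, 3, 4], [3, 1, 4, 2], line_width=2)

btn = Button(label="Say hi")
btn.js_on_click(CustomJS(code="alert('hi from bokeh');"))

show(column(p, btn))                     # open the generated HTML in a browser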

Concept of DataViz

Simply using bokeh's native API would be perfectly fine for plotting your data, but I want to make it even simpler for my research purposes: every time I want to inspect a model, a dataset, or some experimental results, I can present all the plots I am interested in on one simple web page. Using bokeh, this is easy to achieve.

DataViz is simply a main class containing several plotting methods:

class DataViz(object):
    '''
    A class collecting common plotting functions for machine learning research
    '''
    def __init__(self, config):
        '''
        Init a DataViz obj for plotting a dataset
        :param config: configuration for plotting
        '''
    def _get_figure_instance(self, ...):
        '''
        Return a bokeh figure with default settings
        '''
        fig = figure(...)
        return fig
    def feature_scatter1d(self, ...):
        '''
        Plot feature values along the x-axis
        '''
    def project2d(self, ...):
        '''
        Project high-dimensional data to 2D for visualization
        '''
    def plot_corr(self, ...):
        '''
        Correlation matrix plot
        '''
    def fill_between(self, ...):
        '''
        Plot curves with a filled area above and below
        '''
    def simple_curves(self, ...):
        '''
        Simple curves
        '''
    def send_to_server(self, server=..., port=..., user="xxx", pw="yyy"):
        '''
        Send the plots to a remote hosting server
        '''
    def email_to(self, recipients=[], ...):
        '''
        Email the plots to recipients
        '''
    def save_as(self, format='html'):
        '''
        Save the plots in a given format
        '''

The backbone of the DataViz class is shown above. The basic idea is that one DataViz instance represents one figure, and each method attaches a particular plot to that figure. Each plotting method first obtains a plot configured with default settings. DataViz also provides other utilities, e.g., sending the figure to a server for hosting or emailing the plots to someone.
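A hypothetical usage sketch, just to convey the idea; the config keys and argument names here are illustrative assumptions, not the actual API:

# illustrative only: config keys and argument names are made up
viz = DataViz(config={'title': 'svm-experiments', 'output': 'report.html'})
viz.fill_between(x=tr_sizes, mean=tt_scores.mean(axis=1),
                 std=tt_scores.std(axis=1), label='validation accuracy')
viz.plot_corr(X=Xtr, y=ytr)
viz.save_as(format='html')               # one page collecting all attached plots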

So why bother with this, given that you could basically use bokeh directly to achieve your goals? Well, it is more or less a set of helpers for those who only care about the plots that matter in the machine learning community. There is, of course, a tradeoff between usability and flexibility in any framework: the higher the level of encapsulation, the easier it is to use, but the more flexibility you lose.

Plot Types

We currently support only a few plot types, aimed specifically at machine learning use cases; the list will grow along with my research needs. For the moment, I do not intend to publish the code; it is just a set of helpers anyway. But if you really want to give it a shot, just mail me and I will check whether I can send you the code.

feature_scatter1d

This plot shows how the feature values are distributed: along the x-axis are the indices of the feature columns, and along the y-axis are the values of the N data samples. It gives you an overview of how your data is distributed, so you can decide afterwards how to preprocess it. An example is shown below in Figure 1 (Left).

Figure 1: (Left) Feature scatter in 1D; the x-axis indicates the index of the features, the y-axis shows the actual feature values. (Right) Data points projected to a lower dimension (2D) by PCA or another dimension reduction technique.
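A minimal sketch of this kind of plot with plain bokeh (DataViz's actual implementation may differ; X is assumed to be an (N, D) data matrix):

import numpy as np
from bokeh.plotting import figure, show

N, D = X.shape
idx = np.tile(np.arange(D), N)           # x: feature indices, repeated per sample
vals = X.ravel()                         # y: the corresponding feature values

p = figure(x_axis_label='feature index', y_axis_label='feature value')
p.scatter(idx, vals, alpha=0.3)
show(p)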

project2d

The name project2d is self-explanatory: we want to see how the data is distributed in a lower-dimensional space. For example, if we have two categories of data, we can use different colors to annotate them and thus get an idea of what is going on behind the observations. One example is shown in Figure 1 (Right). This is a very common task when you first get your data: typically the data is high-dimensional, and it is transformed into two dimensions using PCA or some other dimension reduction method. Examples are Multidimensional Scaling (MDS), t-Distributed Stochastic Neighbor Embedding (t-SNE), and of course PCA/Kernel PCA; those are all favorites of mine. Before you start to learn any model from the data, it is common practice to run them to see whether there is any insight behind it.
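For example, a sketch of the projection step with scikit-learn plus a colored bokeh scatter (X and y are assumed to be the data matrix and binary class labels):

from sklearn.decomposition import PCA
from bokeh.plotting import figure, show

Z = PCA(n_components=2).fit_transform(X)              # project to 2D
colors = ['navy' if label == 0 else 'firebrick' for label in y]

p = figure(x_axis_label='PC 1', y_axis_label='PC 2')
p.scatter(Z[:, 0], Z[:, 1], color=colors, alpha=0.6)
show(p)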

Correlation map

Another common plot you want to check is the correlation map among features and targets. After computing pairwise correlations between features and targets, you can see which features are strongly correlated with each other, and which features are clearly more decisive for the targets. This is quite important as a first step, because you might want to do feature selection before feeding a huge number of irrelevant features to your model; it is often the case that your observations are sparse and only a small subset of features is relevant. An example is shown in Figure 2.

Figure 2: Correlation map among features and targets, blue means positive correlation, red means negative correlation.

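A rough sketch of how such a map can be computed and rendered with plain bokeh (again, not DataViz's actual code; X and y are assumed as before):

import numpy as np
from bokeh.plotting import figure, show
from bokeh.palettes import RdBu11

# stack features and target, then compute the pairwise correlation matrix
data = np.column_stack([X, y])
corr = np.corrcoef(data, rowvar=False)                # shape (D+1, D+1)

p = figure(x_axis_label='feature/target index', y_axis_label='feature/target index')
p.image(image=[corr], x=0, y=0, dw=corr.shape[1], dh=corr.shape[0], palette=RdBu11)
show(p)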

Simple curves and fill_between curves

Simple curves and fill_between curves target the same type of plot, for instance an error curve or a learning curve as described before. The only difference is that fill_between adds an upper and lower bound around the actual curve, to present information such as error bars or the standard deviation over multiple repeated runs. See Figure 3 for an example.

Figure 3: (Left) fill_between curves; the shaded area presents the standard deviation and the solid line draws the mean. (Right) Simple curves only draw basic lines over the XY-axes without providing more information.
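A minimal sketch of the fill_between idea with plain bokeh, reusing the mean/std arrays from the earlier learning-curve sketch (tr_sizes, tt_mean, tt_std are assumed to be available):

from bokeh.plotting import figure, show

p = figure(x_axis_label='training size', y_axis_label='accuracy')
# shaded band: mean plus/minus one standard deviation
p.varea(x=tr_sizes, y1=tt_mean - tt_std, y2=tt_mean + tt_std, fill_alpha=0.2)
p.line(tr_sizes, tt_mean, line_width=2)               # solid line: the mean curve
show(p)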

Future Works

Visualizing your data and models correctly is crucial while you develop learning systems. I always put a lot of emphasis on demonstrating the model, not only for your own insight but also so that your customers or supervisors can understand what you have achieved.

In the future, I plan to work out more types of plots that are relevant for the machine learning community, for example contour/contourf plots, classification boundaries, visualization of neural networks, and so on. Besides that, I also want to add facilities such as better deployment of figures and widgets for richer interaction on websites.