Batch vs Online Learning

Vivek Pandit
4 min read · Dec 16, 2022


When designing an ML-based application, one important decision is whether models in production should learn incrementally from the stream of incoming data or not. There are two main approaches to how ML applications learn in production: online and batch. Machine learning architects need a good understanding of when to choose which approach.


Batch Learning

Although typical machine learning is done offline using batch learning, online learning has its own set of applications.

In batch learning, data is gathered over time and the model is periodically retrained on the accumulated data in batches. Because the model cannot learn progressively from a stream of real-time data, it is the opposite of online learning: the algorithm does not update its parameters until a fresh batch of data has been consumed.

Training on large batches of accumulated data takes more time and resources such as CPU, memory, and disk I/O. Deploying updated models into production is also slower, because it can only happen periodically, after the model has been retrained and validated on the fresh data.

A model trained with batch learning must be retrained on the full, updated dataset whenever it needs to learn from new data.
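
To make the workflow concrete, here is a minimal sketch of a periodic batch-retraining job, assuming scikit-learn; load_accumulated_data() and the deployment threshold are hypothetical placeholders, not part of any particular system.

# Periodic batch retraining: the model is rebuilt from scratch on all accumulated data.
# load_accumulated_data() is a hypothetical helper returning the full dataset to date.
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def retrain_batch_model(load_accumulated_data, model_path="model.joblib"):
    X, y = load_accumulated_data()                      # everything collected so far
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr, y_tr)                               # train on the whole batch at once

    score = accuracy_score(y_te, model.predict(X_te))   # validate on held-out data
    if score >= 0.9:                                    # hypothetical deployment threshold
        joblib.dump(model, model_path)                  # promote the new model to production
    return score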

Online Learning

Online machine learning is a form of machine learning in which the best predictor for future data is updated at each step, using data that arrives sequentially.

In online machine learning, the model is updated continuously and sequentially as new data keeps arriving: every time a new sample comes in, the model parameters are updated based on it. Each training step is fast and cheap, and the model is always up to date because its parameters keep adjusting to the new data.
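
As a rough illustration, the loop below sketches per-sample updates with scikit-learn's partial_fit on an SGDClassifier; the event_stream() generator and the set of class labels are assumptions made for the example.

# Online updates: the model learns from each sample as it arrives, and the sample
# is then discarded. event_stream() is a hypothetical generator of (x, y) pairs.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()                      # linear model trained by stochastic gradient descent
classes = np.array([0, 1])                   # all possible labels must be declared up front

for x, y in event_stream():
    x = np.asarray(x).reshape(1, -1)         # partial_fit expects a 2-D feature array
    if hasattr(model, "coef_"):              # predict only after at least one update
        y_pred = model.predict(x)
    model.partial_fit(x, [y], classes=classes)   # update parameters from this single sample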

Typical applications of online learning include stock market prediction and weather forecasting. It is also a good choice when computational resources are limited, or when a model has to learn from feedback. Online learning also saves storage space, because data can be discarded once the model has learned from it.

[Comparison image: Batch vs Online Learning. Source: https://www.analyticsvidhya.com/blog/2015/01/introduction-online-machine-learning-simplified-2/]

Python Libraries for Online Learning

scikit-multiflow

scikit-multiflow is an open-source machine learning package for streaming data. It extends the scientific tools available in the Python ecosystem. scikit-multiflow is intended for streaming data applications where data is continuously generated and must be processed and analyzed on the go. Data samples are not stored, so learning methods are exposed to new data only once.

The (theoretically) infinite nature of the data stream poses additional challenges. While the data is unbounded, resources such as memory and time are limited, so stream learning methods must be efficient. Additionally, dynamic environments imply that data can change over time. A change in the distribution of the data is known as concept drift and can lead to model performance degradation if not handled properly. Drift-aware stream learning methods are specifically designed to be robust against this phenomenon.
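
A minimal test-then-train (prequential) loop might look like the sketch below; it assumes scikit-multiflow 0.5.x, where the Hoeffding tree is exposed as HoeffdingTreeClassifier (older releases call it HoeffdingTree), and uses the built-in SEA stream generator.

# Prequential evaluation: each sample is used for testing first, then for training,
# and is never stored afterwards.
from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier

stream = SEAGenerator(random_state=1)        # synthetic, effectively unbounded data stream
model = HoeffdingTreeClassifier()

n_samples, correct = 0, 0
while n_samples < 5000 and stream.has_more_samples():
    X, y = stream.next_sample()              # one sample at a time
    if n_samples > 0:                        # skip the prediction before any training has happened
        correct += int(model.predict(X)[0] == y[0])
    model.partial_fit(X, y)                  # incremental update with the same sample
    n_samples += 1

print(f"Prequential accuracy: {correct / (n_samples - 1):.3f}")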

Jubatus

Jubatus is a distributed processing framework and streaming machine learning library. Jubatus includes these functionalities:

  • Online Machine Learning Library: Classification, Regression, Recommendation (Nearest Neighbor Search), Graph Mining, Anomaly Detection, Clustering
  • Feature Vector Converter (fv_converter): Data Preprocess and Feature Extraction
  • Framework for Distributed Online Machine Learning with Fault Tolerance

River

River is a machine learning library for continual learning on dynamic data streams. For a variety of stream learning tasks, it provides many state-of-the-art learning methods, data generators/transformers, performance metrics, and evaluators. It is the result of merging two of Python's most popular stream learning packages: Creme and scikit-multiflow.

In River, machine learning models are classes that extend specialized mixins, which vary with the learning task (classification, regression, clustering, and so on). This keeps the library consistent and makes it easier to extend or modify existing models, as well as to create new models that are compatible with River.

Learning and predicting are the two core operations of every predictive model. The learn_one method is used for learning (it updates the internal state of the model). Predictions are produced by predict_one (classification, regression, and clustering), predict_proba_one (classification), or score_one (anomaly detection), depending on the learning goal. It is worth noting that River also includes transformers, stateful objects that convert an input via the transform_one method.
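
Putting these pieces together, a minimal test-then-train loop with River might look like the following sketch; it assumes a recent River release and uses the bundled Phishing dataset.

# Test-then-train loop in River: predict on each sample before learning from it.
from river import compose, datasets, linear_model, metrics, preprocessing

model = compose.Pipeline(
    preprocessing.StandardScaler(),          # stateful transformer (transform_one under the hood)
    linear_model.LogisticRegression(),       # online logistic regression
)
metric = metrics.Accuracy()

for x, y in datasets.Phishing():             # x is a dict of features, y is a boolean label
    y_pred = model.predict_one(x)            # predict before the model has seen this sample
    metric.update(y, y_pred)                 # update the running accuracy
    model.learn_one(x, y)                    # update the model with this single sample

print(metric)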

Reference: https://analyticsindiamag.com/a-guide-to-river-a-python-tool-for-online-learning/
