Waikato Environment for Knowledge Analysis (WEKA)
WEKA (Waikato Environment for Knowledge Analysis) is open source machine learning software written in Java. It contains tools for data preparation, classification, regression, clustering, association rule mining, and visualization.
WEKA has been downloaded more than 10,542,000 times. It is the most popular open source software for machine learning in Java and the most popular tool for learning machine learning, thanks to the best-selling book “Data Mining” and MOOCs.
The software has also been cited in more than 18,000 research and applied data science publications.
WEKA is one of the oldest machine learning systems available: development started in 1993, and the project is still very active in the machine learning, data mining, and AI space.
Massive Online Analysis (MOA)
MOA is the most popular open source framework for data stream mining, with a very active and growing community. It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection, and recommender systems) and evaluation tools suited to data streams, i.e. cases where one doesn’t have the opportunity to process the data more than once.
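The single-pass constraint also shapes evaluation. One common scheme for streams is prequential (test-then-train) evaluation, where each example is first used to test the model and only then to train it. The sketch below is plain Python rather than MOA's Java API; the `MajorityClass` learner and the `prequential_accuracy` helper are hypothetical names chosen for illustration.

```python
from collections import Counter


class MajorityClass:
    """Toy incremental classifier: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training example arrives there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, model):
    """Test on each example first, then train on it: one pass, no storage."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn(x, y)
    return correct / total if total else 0.0


# Hypothetical stream of (features, label) pairs.
stream = [({"v": 1}, "a"), ({"v": 2}, "a"), ({"v": 3}, "b"), ({"v": 4}, "a")]
accuracy = prequential_accuracy(iter(stream), MajorityClass())
```

Because every example is scored before the model has seen it, the accuracy estimate needs no held-out set and no second pass over the data.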
River - Machine Learning for Data Streams in Python
River is a Python package for online/streaming machine learning. It caters for different machine learning problems, including regression, classification, and unsupervised learning, and can also be used for ad hoc tasks such as computing online metrics and detecting concept drift.
All the tools in the library can be updated with a single observation at a time, and can therefore be used to process streaming data. In the streaming setting, data can evolve. Adaptive methods are specifically designed to be robust against concept drift in dynamic environments.
Streaming techniques efficiently handle resources such as memory and processing time given the unbounded nature of data streams. River is designed for users with any experience level.
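As a rough illustration of single-observation updates, and of why adaptive methods matter when data evolves, the sketch below compares a plain running mean with an exponentially weighted one. It is plain Python and does not use River's API; the class names and the `alpha` parameter are assumptions made for this example.

```python
class RunningMean:
    """Mean over the whole stream, updated one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self


class FadingMean:
    """Exponentially weighted mean: recent observations count more."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.mean = None

    def update(self, x):
        if self.mean is None:
            self.mean = float(x)
        else:
            self.mean += self.alpha * (x - self.mean)
        return self


# Hypothetical stream whose level shifts from 0 to 10 halfway through.
stream = [0.0] * 100 + [10.0] * 100

running, fading = RunningMean(), FadingMean(alpha=0.1)
for x in stream:
    running.update(x)
    fading.update(x)
```

Both estimators use constant memory. After the shift, the running mean settles halfway between the old and new levels, while the fading mean tracks the new level closely.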
streamDM: Data Mining for Spark Streaming
streamDM is new open source software for mining big data streams using Spark Streaming, developed at Huawei Noah's Ark Lab. It is licensed under the Apache Software License v2.0.
Big data stream learning is more challenging than batch or offline learning, since the data may not keep the same distribution over the lifetime of the stream. Moreover, examples arriving in a stream can only be processed once, or must be summarized with a small memory footprint, and the learning algorithms must be very efficient.
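One way to summarize a stream with a small memory footprint is reservoir sampling, which keeps a fixed-size uniform sample while reading each example exactly once. The sketch below is plain Python for illustration, not code from streamDM; the `reservoir_sample` function and its `seed` parameter are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Memory use stays at k items however long the stream is.
sample = reservoir_sample(iter(range(100_000)), k=5, seed=42)
```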
Spark Streaming is an extension of the core Spark API that enables stream processing from a variety of sources. Spark is an extensible and programmable framework for massive distributed processing of datasets, which it represents as Resilient Distributed Datasets (RDDs). Spark Streaming receives input data streams and divides the data into batches, which are then processed by the Spark engine to generate the results.
Spark Streaming organizes data into DStreams, each represented internally as a sequence of RDDs.
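The batching step can be pictured with a small sketch that groups timestamped records into fixed-length intervals. This is plain Python, not the Spark Streaming API; the `micro_batches` function, the `(timestamp, value)` record shape, and the `interval` parameter are assumptions made for illustration.

```python
def micro_batches(records, interval):
    """Group (timestamp, value) records, assumed sorted by time, into batches.

    Each batch holds the values whose timestamps fall in one interval of
    length `interval`, loosely mirroring how a stream is cut into batches.
    """
    batch = []
    current = None
    for timestamp, value in records:
        index = int(timestamp // interval)
        if current is None:
            current = index
        # Emit finished batches, including empty ones for silent intervals.
        while index > current:
            yield batch
            batch = []
            current += 1
        batch.append(value)
    if current is not None:
        yield batch


records = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = list(micro_batches(records, interval=1.0))
```

In this sketch each yielded list plays the role of one batch handed to the processing engine; an interval with no records still produces an empty batch.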
Advanced Data mining And Machine learning System (ADAMS)
ADAMS is a flexible workflow engine aimed at quickly building and maintaining data-driven, reactive workflows that are easily integrated into business processes. It is released under GPLv3.
The core of ADAMS is the workflow engine, which follows the philosophy of “less is more”. Instead of letting the user place operators (or actors, in ADAMS terms) on a canvas and then manually connect inputs and outputs, ADAMS uses a tree-like structure. This structure and the control actors define how data flows through the workflow; no explicit connections are necessary. The tree-like structure stems from the internal object representation and the nesting of sub-actors within actor-handlers.
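To make the idea concrete, here is a minimal sketch of a tree-shaped workflow in which nesting alone determines the data flow. It is plain Python and not the ADAMS API; the `Actor`, `Transform`, and `Sequence` classes are hypothetical stand-ins for actors and a control actor.

```python
class Actor:
    """Base node of the workflow tree."""

    def run(self, data):
        raise NotImplementedError


class Transform(Actor):
    """Leaf actor that applies a function to the data it receives."""

    def __init__(self, func):
        self.func = func

    def run(self, data):
        return self.func(data)


class Sequence(Actor):
    """Control actor: feeds each child's output to the next child."""

    def __init__(self, *children):
        self.children = children

    def run(self, data):
        for child in self.children:
            data = child.run(data)
        return data


# The nesting defines the flow; no connections are declared anywhere.
workflow = Sequence(
    Transform(str.strip),
    Sequence(
        Transform(str.lower),
        Transform(lambda s: s.split(",")),
    ),
    Transform(len),
)

result = workflow.run("  A,B,C  ")
```

Here the order of children inside each `Sequence` replaces the wires a canvas-based editor would require.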