Data Modeling using Machine Learning and JubaQL¶
In general, there are five steps when use machine learning for business.
- Data collection
- Hypothesis Generation
- Hypothesis Validation
- Data Modeling based on the hypothesis
- Building online system using the model
Jubatus supports various machine learning algorithms, and is a powerful tool on the latter two steps above, i.e., data modeling based on the hypothesis and building online system using the model. However, Jubatus users need an effort that is inconvenient for the former three phases.
Therefore, we developed JubaQL as a DSL based on SQL/CQL, so as to reduce the entry barriers to use Jubatus for machine learning tasks in a stream processing environment.
Goal of JubaQL¶
JubaQL is a SQL/CQL based DSL(Domain Specific Language) and its implementation. The overall goal of JubaQL are as follows:
- Making it easy to use for domain experts to use Jubatus for machine learning tasks with real-world data.
- Integration of batch-processing and stream-processing
The major functions of JubaQL are as follows:
・ SQL-Like syntaxJubaQL should be a language that is easy to learn, easy to use, and should look familiar to the typical data scientist. Therefore an SQL-like syntax was chosen and enriched by terms from machine learning where it seemed appropriate.
・ Stream Re-ProcessingJubaQL allow to do easy stream re-processing, i.e., start processing with previously archived stream data (e.g., stored in HDFS) and then seamlessly switch to live stream data (e.g., coming from Apache Kafka.) This allows to obtain reproducible results and can be used for parameter tuning.
・ Cascaded ProcessingIt is possible to modify and enhance streams entirely on the server side, without data being exchanged with the client. The most important use case is to use analysis results from one model as input for another model without user intervention or client-server traffic.
・ Trigger-based ActionsJubaQL enables you to define functions (UDF = user-defined function) and install them as triggers on a stream that are executed when some condition is fulfilled. One important use case would be to store data in a database or send a notification when a certain event is discovered, without the need to poll the server repeatedly.
・ Time-Series AnalysisYou can define time windows on a stream, compute features for these time windows and train/query Jubatus using these features. This can be used whenever analysis of multiple subsequent events (as opposed to a single event in time) shall be done, e.g., for location traces.