diff --git a/demo/standalone/data/data.csv b/demo/standalone/data/data.csv new file mode 100644 index 00000000000..0817d176204 --- /dev/null +++ b/demo/standalone/data/data.csv @@ -0,0 +1,11 @@ +c1,c3,c4,c5,c6,c7,c8 +aaa,11,22,1.2,11.3,1636097290000,2021-07-20 +aaa,12,22,2.2,12.3,1636097390000,2021-08-19 +aa,13,22,3.2,13.3,1636097490000,2021-05-20 +bb,14,22,4.2,14.3,1636097590000,2021-09-23 +bb,15,22,5.2,15.3,1636097690000,2021-03-21 +bb,16,22,6.2,16.3,1636097790000,2021-05-20 +cc,17,22,7.2,17.3,1636097890000,2021-05-26 +dd,18,22,8.2,18.3,1636097990000,2021-06-20 +ee,19,22,9.2,19.3,1636097000000,2021-01-10 +ff,20,22,9.2,19.3,1636098000000,2021-01-10 diff --git a/docs/cn/cluster.md b/docs/cn/cluster.md new file mode 100644 index 00000000000..6b7292591fa --- /dev/null +++ b/docs/cn/cluster.md @@ -0,0 +1,166 @@ + +# 快速上手(集群版) +本教程针对OpenMLDB集群版,会涵盖使用OpemMLDB构建机器学习应用的整个流程,包括离线特征抽取、模型训练,以及在线的数据导入、在线特征抽取、模型预测等步骤。希望通过该教程,读者可以了解通过OpenMLDB,如何快速完成从原始数据处理到模型上线的整个生命周期。 + + +OpenMLDB提供了Java和Python SDKs,该示例中,我们使用Python SDK。 + +为了方便理解,我们会基于Kaggle比赛[Predict Taxi Tour Duration数据集](https://github.com/4paradigm/OpenMLDB/tree/main/demo/predict-taxi-trip-duration-nb/demo/data),来介绍整个流程。数据集和相关代码可以在[这里](https://github.com/4paradigm/OpenMLDB/tree/main/demo/predict-taxi-trip-duration-nb/demo)查看,并且可以根据步骤尝试运行。 + +## 离线 +### 特征抽取 +对于特征抽取,用户首先需要通过对数据集的了解,构建特征抽取的SQL脚本,例如,在Taxi Tour Duration数据集中,我们可以构建如下特征抽取的[SQL](https://github.com/4paradigm/OpenMLDB/blob/main/demo/predict-taxi-trip-duration-nb/demo/fe.sql): +```sql +select trip_duration, passenger_count, +sum(pickup_latitude) over w as vendor_sum_pl, +max(pickup_latitude) over w as vendor_max_pl, +min(pickup_latitude) over w as vendor_min_pl, +avg(pickup_latitude) over w as vendor_avg_pl, +sum(pickup_latitude) over w2 as pc_sum_pl, +max(pickup_latitude) over w2 as pc_max_pl, +min(pickup_latitude) over w2 as pc_min_pl, +avg(pickup_latitude) over w2 as pc_avg_pl , +count(vendor_id) over w2 as pc_cnt, +count(vendor_id) over w as vendor_cnt +from t1 +window w as (partition by vendor_id order by pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW), +w2 as (partition by passenger_count order by pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW); +``` + + +通过执行特征抽取的SQL,可以从原数据里提取出我们需要的特征数据集,用来进行训练。 + +对于离线特征抽取的输入数据,可以直接存放在本地文件或者HDFS上面,然后通过Spark执行特征抽取SQL。 + + +Python示例代码如下(完整代码参考[这里](https://github.com/4paradigm/OpenMLDB/blob/main/demo/predict-taxi-trip-duration-nb/demo/openmldb_batch.py)): + +```python +from pyspark.sql import SparkSession +import numpy as np +import pandas as pd + + +def run_batch_sql(sql): + spark = SparkSession.builder.appName("OpenMLDB Demo").getOrCreate() + parquet_predict = "./data/taxi_tour_table_predict_simple.snappy.parquet" + parquet_train = "./data/taxi_tour_table_train_simple.snappy.parquet" + train = spark.read.parquet(parquet_train) + train.createOrReplaceTempView("t1") + train_df = spark.sql(sql) + train_set = train_df.toPandas() + predict = spark.read.parquet(parquet_predict) + predict.createOrReplaceTempView("t1") + predict_df = spark.sql(sql) + predict_set = predict_df.toPandas() + return train_set, predict_set +``` + + +***注意***:需要使用[OpenMLDB Spark Distribution](https://github.com/4paradigm/OpenMLDB/blob/main/docs/en/compile.md#optimized-spark-distribution-for-openmldb-optional) + + +### 模型训练 +通过特征抽取,获得训练和预测数据集,就可以使用标准的模型训练方式,来进行模型训练。 + + +下面示例使用Gradient Boosting Machine (GBM) 进行模型训练,并将训练好的模型保存在`model_path`路径下(完整代码参考[这里](https://github.com/4paradigm/OpenMLDB/blob/main/demo/predict-taxi-trip-duration-nb/demo/train.py)): +```python +import lightgbm as lgb +import openmldb_batch + + +train_set, predict_set = openmldb_batch.run_batch_sql(sql) +y_train = train_set['trip_duration'] +x_train = train_set.drop(columns=['trip_duration']) +y_predict = predict_set['trip_duration'] +x_predict = predict_set.drop(columns=['trip_duration']) +lgb_train = lgb.Dataset(x_train, y_train) +lgb_eval = lgb.Dataset(x_predict, y_predict, reference=lgb_train) + +gbm = lgb.train(params, + lgb_train, + num_boost_round=20, + valid_sets=lgb_eval, + early_stopping_rounds=5) +gbm.save_model(model_path) +``` + +## 在线 +在线服务主要需要两个输入: +- 离线过程中训练好的模型 +- 在线数据集 + +对于在线模型预测,一般我们需要历史数据,来完成特征抽取,主要是因为特征抽取SQL一般是基于一定的时间窗口,需要一段时间的数据做特征统计。 +我们称在线预测需要依赖的这部分历史数据为在线数据集,这部分数据集相对离线训练的数据集会较小,主要为近期的一些历史数据。 + +### 在线数据导入 +在线数据集可以通过类似数据库的导入方式,导入OpenMLDB。 + +下面示例展示如何从csv文件导入数据到OpenMLDB,和标准数据库导入数据类似(完整代码参考[这里](https://github.com/4paradigm/OpenMLDB/blob/main/demo/predict-taxi-trip-duration-nb/demo/import.py)) + +```python +import sqlalchemy as db + + +ddl=""" +create table t1( +id string, +vendor_id int, +pickup_datetime timestamp, +dropoff_datetime timestamp, +passenger_count int, +pickup_longitude double, +pickup_latitude double, +dropoff_longitude double, +dropoff_latitude double, +store_and_fwd_flag string, +trip_duration int, +index(key=vendor_id, ts=pickup_datetime), +index(key=passenger_count, ts=pickup_datetime) +); +""" + +engine = db.create_engine('openmldb:///db_test?zk=127.0.0.1:2181&zkPath=/openmldb') +connection = engine.connect() +# 创建数据库 +connection.execute("create database db_test;") +# 建表 +connection.execute(ddl) + +# 从csv读出数据并插入到表里 +with open('data/taxi_tour_table_train_simple.csv', 'r') as fd: + for line in fd: + row = line.split(',') + insert = "insert into t1 values('%s', %s, %s, %s, %s, %s, %s, %s, %s, '%s', %s);"% tuple(row) + connection.execute(insert) +``` + +### 在线特征抽取 +对于在线特征抽取,需要根据输入的数据,以及在线数据集,通过和离线特征相同的特征抽取[SQL](https://github.com/4paradigm/OpenMLDB/blob/main/demo/predict-taxi-trip-duration-nb/demo/fe.sql),获得特征。 + + +示例代码如下: +```python +import sqlalchemy as db + + +engine = db.create_engine('openmldb:///db_test?zk=127.0.0.1:2181&zkPath=/openmldb') +connection = engine.connect() +features = connection.execute(sql, request_data) +``` + +### 模型预测 +根据在线特征抽取结果,以及离线训练好的模型,可以获得预测值,就完成了在线预估的过程。 + + +示例代码如下: +```python +import lightgbm as lgb + + +bst = lgb.Booster(model_path) +duration = bst.predict(feature) +``` + +***注***:实际中,会启动预估服务,通过线上服务的方式,接收在线输入的数据,进行模型预测,然后再把预测结果返回给用户。这里省略了服务上线的过程,不过在[demo](https://github.com/4paradigm/OpenMLDB/blob/main/demo/predict-taxi-trip-duration-nb/demo)的上手示例中,包含了模型上线的过程。 diff --git a/docs/cn/standalone.md b/docs/cn/standalone.md index c21ecb10000..ab6b7cfb2bf 100644 --- a/docs/cn/standalone.md +++ b/docs/cn/standalone.md @@ -1,4 +1,4 @@ -# 单机版使用文档 +# 快速上手(单机版) ## 部署 ### 修改配置文件 @@ -49,7 +49,7 @@ sh bin/stop-all.sh ``` **注**: 需要至少指定一个index并设置`ts`列。`ts`列是用来做ORDERBY的那一列 ### 导入数据 -只支持导入本地csv文件 +只支持导入本地csv文件(示例csv文件可以在[这里](../../demo/standalone/data/data.csv)下载) ```sql > USE demo_db; > LOAD DATA INFILE '/tmp/data.csv' INTO TABLE demo_table1; @@ -73,7 +73,7 @@ null_value | null值 | String | null | Any String ---------- sum ---------- - 3.600000 + 56.000004 ---------- 1 rows in set @@ -131,4 +131,4 @@ curl http://127.0.0.1:8080/dbs/demo_db/deployments/demo_data_service -X POST -d' "input": [["aaa", 11, 22, 1.2, 1.3, 1635247427000, "2021-05-20"]] }' ``` -URL中ip和port是`conf/apiserver.flags`中配置的`endpoint` \ No newline at end of file +URL中ip和port是`conf/apiserver.flags`中配置的`endpoint` diff --git a/docs/en/cluster.md b/docs/en/cluster.md new file mode 100644 index 00000000000..18a43231f76 --- /dev/null +++ b/docs/en/cluster.md @@ -0,0 +1,173 @@ + +# QuickStart (Cluster Version) + +This tutorial is targeted at the cluster version of OpenMLDB and it will cover the whole lifecycle of how to build a machine learning application with the help of OpenMLDB, +including feature extraction, model training; online data import, online feature extraction, model prediction, etc. +From this tutorial, readers can understand how to use OpenMLDB to complete the machine learning lifecycle from raw data to model deployment. + + +OpenMLDB provides both Java and Python SDKs. In this tutorial, we will Python SDK. + +In order to better understand the workflow, we use Kaggle Competition [Predict Taxi Tour Duration Dataset](https://github.com/4paradigm/OpenMLDB/tree/main/demo/predict-taxi-trip-duration-nb/demo/data) +to demostrate the whole process. Dataset and source code can be found +[here](https://github.com/4paradigm/OpenMLDB/tree/main/demo/predict-taxi-trip-duration-nb/demo). + +## Offline +### Feature Extraction +In order to do feature extraction, users have to know the data and construct a SQL script. + +For example, for the Taxi Tour Duration Dataset, we can construct the following [SQL](https://github.com/4paradigm/OpenMLDB/blob/main/demo/predict-taxi-trip-duration-nb/demo/fe.sql) for feature extraction: +```sql +select trip_duration, passenger_count, +sum(pickup_latitude) over w as vendor_sum_pl, +max(pickup_latitude) over w as vendor_max_pl, +min(pickup_latitude) over w as vendor_min_pl, +avg(pickup_latitude) over w as vendor_avg_pl, +sum(pickup_latitude) over w2 as pc_sum_pl, +max(pickup_latitude) over w2 as pc_max_pl, +min(pickup_latitude) over w2 as pc_min_pl, +avg(pickup_latitude) over w2 as pc_avg_pl , +count(vendor_id) over w2 as pc_cnt, +count(vendor_id) over w as vendor_cnt +from t1 +window w as (partition by vendor_id order by pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW), +w2 as (partition by passenger_count order by pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW); +``` + +After executing the feature extraction SQL, we can extract the features from the raw data, which will be used for model training. + +The SQL can executed in Spark to extract features from the raw dataset, which are stored directly in local filesystem or HDFS. + +Sample Python code is shown as follows(complete code can be found [here](https://github.com/4paradigm/OpenMLDB/blob/main/demo/predict-taxi-trip-duration-nb/demo/openmldb_batch.py)): + +```python +from pyspark.sql import SparkSession +import numpy as np +import pandas as pd + + +def run_batch_sql(sql): + spark = SparkSession.builder.appName("OpenMLDB Demo").getOrCreate() + parquet_predict = "./data/taxi_tour_table_predict_simple.snappy.parquet" + parquet_train = "./data/taxi_tour_table_train_simple.snappy.parquet" + train = spark.read.parquet(parquet_train) + train.createOrReplaceTempView("t1") + train_df = spark.sql(sql) + train_set = train_df.toPandas() + predict = spark.read.parquet(parquet_predict) + predict.createOrReplaceTempView("t1") + predict_df = spark.sql(sql) + predict_set = predict_df.toPandas() + return train_set, predict_set +``` + + +***NOTE***: [OpenMLDB Spark Distribution](https://github.com/4paradigm/OpenMLDB/blob/main/docs/en/compile.md#optimized-spark-distribution-for-openmldb-optional) is used + + +### Model Training +After we get the train and predict datasets from feature extraction, we can use the standard methods to train models. + + +For example, we can use Gradient Boosting Machine (GBM) to train the model and produce a model file(complete source code can be found [here](https://github.com/4paradigm/OpenMLDB/blob/main/demo/predict-taxi-trip-duration-nb/demo/train.py)): +```python +import lightgbm as lgb +import openmldb_batch + + +train_set, predict_set = openmldb_batch.run_batch_sql(sql) +y_train = train_set['trip_duration'] +x_train = train_set.drop(columns=['trip_duration']) +y_predict = predict_set['trip_duration'] +x_predict = predict_set.drop(columns=['trip_duration']) +lgb_train = lgb.Dataset(x_train, y_train) +lgb_eval = lgb.Dataset(x_predict, y_predict, reference=lgb_train) + +gbm = lgb.train(params, + lgb_train, + num_boost_round=20, + valid_sets=lgb_eval, + early_stopping_rounds=5) +gbm.save_model(model_path) +``` + +## Online +Online service requires two inputs: +- the model trained from offline process +- online dataset + +The feature extraction SQL is generally based on time windows. Thus in the online model prediction, we usually need history data to extract features from time ranges of data. +The history data is called online dataset. Online dataset is generally restricted to the recent time range, which is small compared to the offline dataset. + +### Online dataset import +The online dataset can be imported to OpenMLDB in a similar way to the traditional database. + +The following code shows how to import a csv file to OpenMLDB (Comlete code can be found [here](https://github.com/4paradigm/OpenMLDB/blob/main/demo/predict-taxi-trip-duration-nb/demo/import.py)) + +```python +import sqlalchemy as db + + +ddl=""" +create table t1( +id string, +vendor_id int, +pickup_datetime timestamp, +dropoff_datetime timestamp, +passenger_count int, +pickup_longitude double, +pickup_latitude double, +dropoff_longitude double, +dropoff_latitude double, +store_and_fwd_flag string, +trip_duration int, +index(key=vendor_id, ts=pickup_datetime), +index(key=passenger_count, ts=pickup_datetime) +); +""" + +engine = db.create_engine('openmldb:///db_test?zk=127.0.0.1:2181&zkPath=/openmldb') +connection = engine.connect() +# create database +connection.execute("create database db_test;") +# create table +connection.execute(ddl) + +# read data from csv file and insert into table +with open('data/taxi_tour_table_train_simple.csv', 'r') as fd: + for line in fd: + row = line.split(',') + insert = "insert into t1 values('%s', %s, %s, %s, %s, %s, %s, %s, %s, '%s', %s);"% tuple(row) + connection.execute(insert) +``` + +### Online feature extraction +Online feature extraction requires both input data and online dataset. The feature extraction SQL is the same as the offline +[SQL](https://github.com/4paradigm/OpenMLDB/blob/main/demo/predict-taxi-trip-duration-nb/demo/fe.sql). + + +Sample code is as follows: +```python +import sqlalchemy as db + + +engine = db.create_engine('openmldb:///db_test?zk=127.0.0.1:2181&zkPath=/openmldb') +connection = engine.connect() +features = connection.execute(sql, request_data) +``` + +### Model Prediction +Base on the online feature extracted and the model trained from the offline process, we can get the prediction result. + +Sample code is as follows: +```python +import lightgbm as lgb + + +bst = lgb.Booster(model_path) +duration = bst.predict(feature) +``` + +***NOTE***: Generally, in the real deployment,we launch a prediction service, +who will accept request and do the prediction, and then return the results back to the users. +We skip the online deployment step here, but it is included in the [demo tour](https://github.com/4paradigm/OpenMLDB/blob/main/demo/predict-taxi-trip-duration-nb/demo). diff --git a/docs/en/standalone.md b/docs/en/standalone.md index 1e7ca296027..a0c79de3679 100644 --- a/docs/en/standalone.md +++ b/docs/en/standalone.md @@ -1,4 +1,4 @@ -# Standalone Mode Guideline +# QuickStart (Standalone Version) ## Deployment ### Modify Config @@ -49,7 +49,7 @@ sh bin/stop-all.sh ``` **Note**: Specify at least one index and set the `ts` column which is used for ORDERBY. The `ts` column is the key in `index` option and can be setted with `timestamp` or `bigint` column only. ### Import Data -Only support csv file format +Only support csv file format (a sample csv file can be downloaded [here](../../demo/standalone/data/data.csv)) ```sql > USE demo_db; > LOAD DATA INFILE '/tmp/data.csv' INTO TABLE demo_table1; @@ -73,7 +73,7 @@ Below demonstrates a data exploration task. ---------- sum ---------- - 3.600000 + 56.000004 ---------- 1 rows in set @@ -131,4 +131,4 @@ curl http://127.0.0.1:8080/dbs/demo_db/deployments/demo_data_service -X POST -d' "input": [["aaa", 11, 22, 1.2, 1.3, 1635247427000, "2021-05-20"]] }' ``` -The ip and port in url is `endpoint` in `conf/apiserver.flags` \ No newline at end of file +The ip and port in url is `endpoint` in `conf/apiserver.flags`