RumbleDB ML
RumbleDB ML is a Machine Learning library built on top of the RumbleDB engine. It makes ML tasks easier and more productive thanks to the abstraction layer provided by JSONiq.
The machine learning capabilities are exposed through JSONiq function items. The concepts of "estimator" and "transformer", which are core to Machine Learning, are naturally function items and fit seamlessly in the JSONiq data model.
Training sets, test sets, and validation sets, which contain features and labels, are exposed through JSONiq sequences of object items: the keys of these objects are the features and labels.
The names of the estimators and transformers, as well as the functionality they encapsulate, are directly inherited from the SparkML library, on which RumbleDB ML is based: we chose not to reinvent the wheel.
Transformers
A transformer is a function item that maps a sequence of objects to a sequence of objects.
It is an abstraction that either performs a feature transformation or generates predictions based on trained models. For example:
- Tokenizer is a feature transformer that receives textual input data and splits it into individual terms (usually words), which are called tokens.
- KMeansModel is a trained model and a transformer that can read a dataset containing features and generate predictions as its output.
Estimators
An estimator is a function item that maps a sequence of objects to a transformer (yes, you got it right: that's a function item returned by a function item, which is why estimators are also called higher-order functions!).
Estimators abstract the concept of a Machine Learning algorithm or any algorithm that fits or trains on data. For example, a learning algorithm such as KMeans is implemented as an Estimator. Calling this estimator on data essentially trains a KMeansModel, which is a Model and hence a Transformer.
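The estimator-to-transformer flow described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the type name local:id-and-category, the sample data, and the column names "category" and "categoryIndex" are made up for this sketch. StringIndexer and its inputCol and outputCol parameters are taken from the catalogue further down.

```jsoniq
(: Hypothetical schema and data, used only for illustration. :)
declare type local:id-and-category as {
  "id": "integer",
  "category": "string"
};

let $local-data := (
  {"id": 0, "category": "a"},
  {"id": 1, "category": "b"},
  {"id": 2, "category": "a"}
)
let $df-data := validate type local:id-and-category* { $local-data }
let $params := {"inputCol": "category", "outputCol": "categoryIndex"}

(: An estimator: a function item that maps a sequence of objects to a transformer. :)
let $estimator := get-estimator("StringIndexer")

(: Calling the estimator fits it and returns a transformer (here a StringIndexerModel). :)
let $model := $estimator($df-data, $params)

(: The returned transformer maps a sequence of objects to a sequence of objects. :)
for $i in $model($df-data, $params)
return $i
```

The same parameter object is passed to both calls, mirroring the pattern used in the KMeans example in this document.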
Parameters
Transformers and estimators are function items in the RumbleDB Data Model. Their first argument is the sequence of objects that represents, for example, the training set or test set. Parameters can be provided as their second argument. This second argument is expected to be an object item. The machine learning parameters form the fields of the said object item as key-value pairs.
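As a small sketch of how parameters are passed, the query below binds the parameter object once with the ? placeholder (partial application) and then applies the resulting function item to the data. The type name local:id-and-text and the sample data are assumptions made for this illustration. RegexTokenizer and its inputCol, outputCol, pattern and toLowercase parameters come from the catalogue of transformers.

```jsoniq
(: Hypothetical schema and data, used only for illustration. :)
declare type local:id-and-text as {
  "id": "integer",
  "text": "string"
};

let $local-data := (
  {"id": 1, "text": "RumbleDB,JSONiq,SparkML"},
  {"id": 2, "text": "estimator,transformer"}
)
let $df-data := validate type local:id-and-text* { $local-data }

(: The second argument is an object item; each field is one ML parameter. :)
(: The ? placeholder leaves the first argument (the data) open. :)
let $split-on-commas := get-transformer("RegexTokenizer")(
  ?,
  {
    "inputCol": "text",
    "outputCol": "tokens",
    "pattern": ",",
    "toLowercase": false
  }
)

(: The partially applied transformer now only expects the sequence of objects. :)
for $i in $split-on-commas($df-data)
return $i
```

Binding the parameters up front keeps the configuration in one place when the same transformer is applied to several datasets.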
Type Annotations
RumbleDB ML works on highly structured data, because it requires full type information for all the fields in the training set or test set. It is on our development plan to automate the detection of these types when the sequence of objects gets created on the fly.
RumbleDB supports a user-defined type system with which you can validate and annotate datasets against a JSound schema.
This annotation must be applied to any dataset that is used as input to RumbleDB ML, but it is superfluous if the data was directly read from a structured input format such as Parquet, CSV, Avro, SVM or ROOT.
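To illustrate the structured-input case, here is a hedged sketch that reads a Parquet file and feeds it to a transformer without a validate step. The file path is hypothetical, and the file is assumed to contain a string column named "sentence". The sketch relies on RumbleDB's parquet-file function for reading Parquet input.

```jsoniq
(: Hypothetical path; the file is assumed to have a string column "sentence". :)
let $df-data := parquet-file("hdfs:///path/to/sentences.parquet")

(: No "validate type" step here: the Parquet schema already carries the types. :)
let $transformer := get-transformer("Tokenizer")
for $i in $transformer(
  $df-data,
  {"inputCol": "sentence", "outputCol": "output"}
)
return $i
```

For data built in the query itself, the declare type and validate type steps shown in the examples below remain necessary.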
Examples
- Tokenizer Example:
declare type local:id-and-sentence as {
  "id": "integer",
  "sentence": "string"
};
let $local-data := (
  {"id": 1, "sentence": "Hi I heard about Spark"},
  {"id": 2, "sentence": "I wish Java could use case classes"},
  {"id": 3, "sentence": "Logistic regression models are neat"}
)
let $df-data := validate type local:id-and-sentence* { $local-data }
let $transformer := get-transformer("Tokenizer")
for $i in $transformer(
  $df-data,
  {"inputCol": "sentence", "outputCol": "output"}
)
return $i
// returns
// { "id" : 1, "sentence" : "Hi I heard about Spark", "output" : [ "hi", "i", "heard", "about", "spark" ] }
// { "id" : 2, "sentence" : "I wish Java could use case classes", "output" : [ "i", "wish", "java", "could", "use", "case", "classes" ] }
// { "id" : 3, "sentence" : "Logistic regression models are neat", "output" : [ "logistic", "regression", "models", "are", "neat" ] }
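Because a transformer maps a sequence of objects to a sequence of objects, transformers can be chained. The sketch below feeds the Tokenizer output into StopWordsRemover, under the assumption that the array of tokens produced by Tokenizer is accepted as the input column of StopWordsRemover, as it is in SparkML. The output column names "words" and "filtered" are made up for this illustration.

```jsoniq
declare type local:id-and-sentence as {
  "id": "integer",
  "sentence": "string"
};

let $local-data := (
  {"id": 1, "sentence": "Hi I heard about Spark"},
  {"id": 2, "sentence": "I wish Java could use case classes"}
)
let $df-data := validate type local:id-and-sentence* { $local-data }

(: Step 1: split each sentence into tokens. :)
let $tokenize := get-transformer("Tokenizer")(
  ?,
  {"inputCol": "sentence", "outputCol": "words"}
)

(: Step 2: drop stop words from the token array produced by step 1. :)
let $remove-stop-words := get-transformer("StopWordsRemover")(
  ?,
  {"inputCol": "words", "outputCol": "filtered"}
)

(: Chain the two transformers: the output of one is the input of the next. :)
for $i in $remove-stop-words($tokenize($df-data))
return $i
```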
- KMeans Example:
declare type local:col-1-2-3 as {
  "id": "integer",
  "col1": "decimal",
  "col2": "decimal",
  "col3": "decimal"
};
let $vector-assembler := get-transformer("VectorAssembler")(
  ?,
  { "inputCols" : [ "col1", "col2", "col3" ], "outputCol" : "features" }
)
let $local-data := (
  {"id": 0, "col1": 0.0, "col2": 0.0, "col3": 0.0},
  {"id": 1, "col1": 0.1, "col2": 0.1, "col3": 0.1},
  {"id": 2, "col1": 0.2, "col2": 0.2, "col3": 0.2},
  {"id": 3, "col1": 9.0, "col2": 9.0, "col3": 9.0},
  {"id": 4, "col1": 9.1, "col2": 9.1, "col3": 9.1},
  {"id": 5, "col1": 9.2, "col2": 9.2, "col3": 9.2}
)
let $df-data := validate type local:col-1-2-3* { $local-data }
let $df-data := $vector-assembler($df-data)
let $est := get-estimator("KMeans")
let $tra := $est(
  $df-data,
  {"featuresCol": "features"}
)
for $i in $tra(
  $df-data,
  {"featuresCol": "features"}
)
return $i
// returns
// { "id" : 0, "col1" : 0, "col2" : 0, "col3" : 0, "prediction" : 0 }
// { "id" : 1, "col1" : 0.1, "col2" : 0.1, "col3" : 0.1, "prediction" : 0 }
// { "id" : 2, "col1" : 0.2, "col2" : 0.2, "col3" : 0.2, "prediction" : 0 }
// { "id" : 3, "col1" : 9, "col2" : 9, "col3" : 9, "prediction" : 1 }
// { "id" : 4, "col1" : 9.1, "col2" : 9.1, "col3" : 9.1, "prediction" : 1 }
// { "id" : 5, "col1" : 9.2, "col2" : 9.2, "col3" : 9.2, "prediction" : 1 }
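The example above trains and predicts on the same data. As a variation, the following sketch fits KMeans on one sequence and applies the resulting model to a separate sequence of unseen points. The variable names, the new data points, and the explicit "k": 2 setting are assumptions made for this illustration; the k parameter itself is listed for KMeans in the catalogue of estimators.

```jsoniq
declare type local:col-1-2-3 as {
  "id": "integer",
  "col1": "decimal",
  "col2": "decimal",
  "col3": "decimal"
};

(: Bind the VectorAssembler parameters once so that it can be reused on both sets. :)
let $vector-assembler := get-transformer("VectorAssembler")(
  ?,
  {"inputCols": ["col1", "col2", "col3"], "outputCol": "features"}
)

let $training-data := (
  {"id": 0, "col1": 0.0, "col2": 0.0, "col3": 0.0},
  {"id": 1, "col1": 0.1, "col2": 0.1, "col3": 0.1},
  {"id": 2, "col1": 9.0, "col2": 9.0, "col3": 9.0},
  {"id": 3, "col1": 9.1, "col2": 9.1, "col3": 9.1}
)
(: Hypothetical unseen points, not part of the training set. :)
let $new-data := (
  {"id": 10, "col1": 0.3, "col2": 0.3, "col3": 0.3},
  {"id": 11, "col1": 8.8, "col2": 8.8, "col3": 8.8}
)

(: Both sequences need type annotations and a "features" column. :)
let $training := $vector-assembler(
  validate type local:col-1-2-3* { $training-data }
)
let $unseen := $vector-assembler(
  validate type local:col-1-2-3* { $new-data }
)

(: Fit on the training set only; "k" sets the number of clusters. :)
let $model := get-estimator("KMeans")(
  $training,
  {"featuresCol": "features", "k": 2}
)

(: Apply the trained model to the unseen points. :)
for $i in $model($unseen, {"featuresCol": "features"})
return $i
```

Reusing the same partially applied VectorAssembler on both sequences ensures that the training and prediction inputs have an identical "features" layout.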
RumbleDB ML Functionality Overview:
RumbleDB ML - Catalogue of Estimators:
AFTSurvivalRegression
Parameters:
- aggregationDepth: integer
- censorCol: string
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- maxIter: integer
- predictionCol: string
- quantileProbabilities: array (of double)
- quantilesCol: string
- tol: double
ALS
Parameters:
- alpha: double
- checkpointInterval: integer
- coldStartStrategy: string
- finalStorageLevel: string
- implicitPrefs: boolean
- intermediateStorageLevel: string
- itemCol: string
- maxIter: integer
- nonnegative: boolean
- numBlocks: integer
- numItemBlocks: integer
- numUserBlocks: integer
- predictionCol: string
- rank: integer
- ratingCol: string
- regParam: double
- seed: double
- userCol: string
BisectingKMeans
Parameters:
- distanceMeasure: string
- featuresCol: string
- k: integer
- maxIter: integer
- minDivisibleClusterSize: double
- predictionCol: string
- seed: double
BucketedRandomProjectionLSH
Parameters:
- bucketLength: double
- inputCol: string
- numHashTables: integer
- outputCol: string
- seed: double
ChiSqSelector
Parameters:
- fdr: double
- featuresCol: string
- fpr: double
- fwe: double
- labelCol: string
- numTopFeatures: integer
- outputCol: string
- percentile: double
- selectorType: string
CountVectorizer
Parameters:
- binary: boolean
- inputCol: string
- maxDF: double
- minDF: double
- minTF: double
- outputCol: string
- vocabSize: integer
CrossValidator
Parameters:
- collectSubModels: boolean
- estimator: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- numFolds: integer
- parallelism: integer
- seed: double
DecisionTreeClassifier
Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- impurity: string
- labelCol: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- thresholds: array (of double)
DecisionTreeRegressor
Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- impurity: string
- labelCol: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- predictionCol: string
- seed: double
- varianceCol: string
FPGrowth
Parameters:
- itemsCol: string
- minConfidence: double
- minSupport: double
- numPartitions: integer
- predictionCol: string
GBTClassifier
Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- labelCol: string
- lossType: string
- maxBins: integer
- maxDepth: integer
- maxIter: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- stepSize: double
- subsamplingRate: double
- thresholds: array (of double)
- validationIndicatorCol: string
GBTRegressor
Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- labelCol: string
- lossType: string
- maxBins: integer
- maxDepth: integer
- maxIter: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- predictionCol: string
- seed: double
- stepSize: double
- subsamplingRate: double
- validationIndicatorCol: string
GaussianMixture
Parameters:
- featuresCol: string
- k: integer
- maxIter: integer
- predictionCol: string
- probabilityCol: string
- seed: double
- tol: double
GeneralizedLinearRegression
Parameters:
- family: string
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- link: string
- linkPower: double
- linkPredictionCol: string
- maxIter: integer
- offsetCol: string
- predictionCol: string
- regParam: double
- solver: string
- tol: double
- variancePower: double
- weightCol: string
IDF
Parameters:
- inputCol: string
- minDocFreq: integer
- outputCol: string
Imputer
Parameters:
- inputCols: array (of string)
- missingValue: double
- outputCols: array (of string)
- strategy: string
IsotonicRegression
Parameters:
- featureIndex: integer
- featuresCol: string
- isotonic: boolean
- labelCol: string
- predictionCol: string
- weightCol: string
KMeans
Parameters:
- distanceMeasure: string
- featuresCol: string
- initMode: string
- initSteps: integer
- k: integer
- maxIter: integer
- predictionCol: string
- seed: double
- tol: double
LDA
Parameters:
- checkpointInterval: integer
- docConcentration: double
- docConcentration: array (of double)
- featuresCol: string
- k: integer
- keepLastCheckpoint: boolean
- learningDecay: double
- learningOffset: double
- maxIter: integer
- optimizeDocConcentration: boolean
- optimizer: string
- seed: double
- subsamplingRate: double
- topicConcentration: double
- topicDistributionCol: string
LinearRegression
Parameters:
- aggregationDepth: integer
- elasticNetParam: double
- epsilon: double
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- loss: string
- maxIter: integer
- predictionCol: string
- regParam: double
- solver: string
- standardization: boolean
- tol: double
- weightCol: string
LinearSVC
Parameters:
- aggregationDepth: integer
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- maxIter: integer
- predictionCol: string
- rawPredictionCol: string
- regParam: double
- standardization: boolean
- threshold: double
- tol: double
- weightCol: string
LogisticRegression
Parameters:
- aggregationDepth: integer
- elasticNetParam: double
- family: string
- featuresCol: string
- fitIntercept: boolean
- labelCol: string
- lowerBoundsOnCoefficients: object (of object of double)
- lowerBoundsOnIntercepts: object (of double)
- maxIter: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- regParam: double
- standardization: boolean
- threshold: double
- thresholds: array (of double)
- tol: double
- upperBoundsOnCoefficients: object (of object of double)
- upperBoundsOnIntercepts: object (of double)
- weightCol: string
MaxAbsScaler
Parameters:
- inputCol: string
- outputCol: string
MinHashLSH
Parameters:
- inputCol: string
- numHashTables: integer
- outputCol: string
- seed: double
MinMaxScaler
Parameters:
- inputCol: string
- max: double
- min: double
- outputCol: string
MultilayerPerceptronClassifier
Parameters:
- blockSize: integer
- featuresCol: string
- initialWeights: object (of double)
- labelCol: string
- layers: array (of integer)
- maxIter: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- solver: string
- stepSize: double
- thresholds: array (of double)
- tol: double
NaiveBayes
Parameters:
- featuresCol: string
- labelCol: string
- modelType: string
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- smoothing: double
- thresholds: array (of double)
- weightCol: string
OneHotEncoder
Parameters:
- dropLast: boolean
- handleInvalid: string
- inputCols: array (of string)
- outputCols: array (of string)
OneVsRest
Parameters:
- featuresCol: string
- labelCol: string
- parallelism: integer
- predictionCol: string
- rawPredictionCol: string
- weightCol: string
PCA
Parameters:
- inputCol: string
- k: integer
- outputCol: string
Pipeline
Parameters:
QuantileDiscretizer
Parameters:
- handleInvalid: string
- inputCol: string
- inputCols: array (of string)
- numBuckets: integer
- numBucketsArray: array (of integer)
- outputCol: string
- outputCols: array (of string)
- relativeError: double
RFormula
Parameters:
- featuresCol: string
- forceIndexLabel: boolean
- formula: string
- handleInvalid: string
- labelCol: string
- stringIndexerOrderType: string
RandomForestClassifier
Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- labelCol: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- numTrees: integer
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- subsamplingRate: double
- thresholds: array (of double)
RandomForestRegressor
Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- labelCol: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- numTrees: integer
- predictionCol: string
- seed: double
- subsamplingRate: double
StandardScaler
Parameters:
- inputCol: string
- outputCol: string
- withMean: boolean
- withStd: boolean
StringIndexer
Parameters:
- handleInvalid: string
- inputCol: string
- outputCol: string
- stringOrderType: string
TrainValidationSplit
Parameters:
- collectSubModels: boolean
- estimator: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- parallelism: integer
- seed: double
- trainRatio: double
VectorIndexer
Parameters:
- handleInvalid: string
- inputCol: string
- maxCategories: integer
- outputCol: string
Word2Vec
Parameters:
- inputCol: string
- maxIter: integer
- maxSentenceLength: integer
- minCount: integer
- numPartitions: integer
- outputCol: string
- seed: double
- stepSize: double
- vectorSize: integer
- windowSize: integer
RumbleDB ML - Catalogue of Transformers:
AFTSurvivalRegressionModel
Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- quantileProbabilities: array (of double)
- quantilesCol: string
ALSModel
Parameters:
- coldStartStrategy: string
- itemCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- userCol: string
Binarizer
Parameters:
- inputCol: string
- outputCol: string
- threshold: double
BisectingKMeansModel
Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
BucketedRandomProjectionLSHModel
Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
Bucketizer
Parameters:
- handleInvalid: string
- inputCol: string
- inputCols: array (of string)
- outputCol: string
- outputCols: array (of string)
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- splits: array (of double)
- splitsArray: array (of array of double)
ChiSqSelectorModel
Parameters:
- featuresCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
ColumnPruner
Parameters:
CountVectorizerModel
Parameters:
- binary: boolean
- inputCol: string
- minTF: double
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
CrossValidatorModel
Parameters:
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
DCT
Parameters:
- inputCol: string
- inverse: boolean
- outputCol: string
DecisionTreeClassificationModel
Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- thresholds: array (of double)
DecisionTreeRegressionModel
Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- seed: double
- varianceCol: string
DistributedLDAModel
Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- seed: double
- topicDistributionCol: string
ElementwiseProduct
Parameters:
- inputCol: string
- outputCol: string
- scalingVec: object (of double)
FPGrowthModel
Parameters:
- itemsCol: string
- minConfidence: double
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
FeatureHasher
Parameters:
- categoricalCols: array (of string)
- inputCols: array (of string)
- numFeatures: integer
- outputCol: string
GBTClassificationModel
Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxIter: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- stepSize: double
- subsamplingRate: double
- thresholds: array (of double)
GBTRegressionModel
Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxIter: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- seed: double
- stepSize: double
- subsamplingRate: double
GaussianMixtureModel
Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
GeneralizedLinearRegressionModel
Parameters:
- featuresCol: string
- linkPredictionCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
HashingTF
Parameters:
- binary: boolean
- inputCol: string
- numFeatures: integer
- outputCol: string
IDFModel
Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
ImputerModel
Parameters:
- inputCols: array (of string)
- outputCols: array (of string)
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
IndexToString
Parameters:
- inputCol: string
- labels: array (of string)
- outputCol: string
Interaction
Parameters:
- inputCols: array (of string)
- outputCol: string
IsotonicRegressionModel
Parameters:
- featureIndex: integer
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
KMeansModel
Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
LinearRegressionModel
Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
LinearSVCModel
Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- rawPredictionCol: string
- threshold: double
- weightCol: double
LocalLDAModel
Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- seed: double
- topicDistributionCol: string
LogisticRegressionModel
Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- threshold: double
- thresholds: array (of double)
MaxAbsScalerModel
Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
MinHashLSHModel
Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
MinMaxScalerModel
Parameters:
- inputCol: string
- max: double
- min: double
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
MultilayerPerceptronClassificationModel
Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- thresholds: array (of double)
NGram
Parameters:
- inputCol: string
- n: integer
- outputCol: string
NaiveBayesModel
Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- thresholds: array (of double)
Normalizer
Parameters:
- inputCol: string
- outputCol: string
- p: double
OneHotEncoder
Parameters:
- dropLast: boolean
- inputCol: string
- outputCol: string
OneHotEncoderModel
Parameters:
- dropLast: boolean
- handleInvalid: string
- inputCols: array (of string)
- outputCols: array (of string)
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
OneVsRestModel
Parameters:
- featuresCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- rawPredictionCol: string
PCAModel
Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
PipelineModel
Parameters:
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
PolynomialExpansion
Parameters:
- degree: integer
- inputCol: string
- outputCol: string
RFormulaModel
Parameters:
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
RandomForestClassificationModel
Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- numTrees: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- probabilityCol: string
- rawPredictionCol: string
- seed: double
- subsamplingRate: double
- thresholds: array (of double)
RandomForestRegressionModel
Parameters:
- cacheNodeIds: boolean
- checkpointInterval: integer
- featuresCol: string
- featureSubsetStrategy: string
- impurity: string
- maxBins: integer
- maxDepth: integer
- maxMemoryInMB: integer
- minInfoGain: double
- minInstancesPerNode: integer
- numTrees: integer
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
- predictionCol: string
- seed: double
- subsamplingRate: double
RegexTokenizer
Parameters:
- gaps: boolean
- inputCol: string
- minTokenLength: integer
- outputCol: string
- pattern: string
- toLowercase: boolean
SQLTransformer
Parameters:
- statement: string
StandardScalerModel
Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
StopWordsRemover
Parameters:
- caseSensitive: boolean
- inputCol: string
- locale: string
- outputCol: string
- stopWords: array (of string)
StringIndexerModel
Parameters:
- handleInvalid: string
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
Tokenizer
Parameters:
- inputCol: string
- outputCol: string
TrainValidationSplitModel
Parameters:
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
VectorAssembler
Parameters:
- handleInvalid: string
- inputCols: array (of string)
- outputCol: string
VectorAttributeRewriter
Parameters:
VectorIndexerModel
Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)
VectorSizeHint
Parameters:
- handleInvalid: string
- inputCol: string
- size: integer
VectorSlicer
Parameters:
- indices: array (of integer)
- inputCol: string
- names: array (of string)
- outputCol: string
Word2VecModel
Parameters:
- inputCol: string
- outputCol: string
- parent: estimator (i.e., function(object*, object) as function(object*, object) as object*)