Flatline user manual
====================

S-expressions vs. JSON
----------------------

Flatline expressions in this manual use its Lisp-like syntax, based on
symbolic expressions, or *sexps*. When sending them to BigML via our API,
you can also use their JSON representation, which is obtained by writing
each parenthesised sexp as a JSON list. For instance:

::

    (if (< (f "a") 3) 0 4) => ["if", ["<", ["f", "a"], 3], 0, 4]

Literal values
--------------

Constant numbers, symbols, booleans and strings, using Java/Clojure
syntax, are valid expressions. Examples:

.. code:: lisp

    1258
    2.349
    this-is-a-symbol
    "a string"
    true
    false

Counters
--------

While running over an input dataset, Flatline keeps track of the
(zero-based) number of the input row being processed. It can be accessed
with the function ``row-number``, which takes no arguments:

::

    (row-number) => current input row (0-based)

A typical use of this function is to generate a unique identifier for
each row. The row number starts at 0 unless you skip some rows of the
input dataset, and increases by one on each new row (unless you specify
an input row step when generating a dataset).

Field accessors
---------------

Field values
~~~~~~~~~~~~

Input field values are accessed using the ``field`` operator:

::

    (field <field-designator> [<shift>] [<default>])
    <field-designator> := field id | field name | column_number
    <shift> := integer expression
    <default> := output value if the requested row is out-of-range

where ``<field-designator>`` can be either the identifier, name or column
number of the desired field, and the optional ``<shift>`` (an integer,
defaulting to 0) denotes the offset with respect to the current input
row. The optional ``<default>`` is the output value used when the
requested row (taking into account the shift, if any) falls outside the
limits of the dataset. It can be a constant value or an expression. If
``<default>`` is not set, the accessor returns a missing value in those
cases.

So, for instance, these sexps denote field values extracted from the
current row or, where the shift is -1, from the previous row (falling
back to the given default when that row is out of range):

.. code:: lisp

    (field 0)
    (field 0 0)
    (field 0 -1 "default-string")
    (field 0 -1 (mean (field 0)))
    (field 0 -1 3)
    (field "000004")
    (field "a field name" 0)

while

.. code:: lisp

    (field "000001" -2)

denotes the value of the cell corresponding to a field with identifier
"000001" two rows *before* the current one. Positive shift values denote
rows after the current one.

.. code:: lisp

    (field "a field" 3)
    (field "another field" 2)

For convenience, and since ``field`` is probably going to be your most
frequently used operator, it can be abbreviated to ``f``:

.. code:: lisp

    (f "000001" -2)
    (f 3 1)
    (f 1 -1 3)
    (f "a field" 23)

We also provide a predicate, ``missing?``, that tells you whether the
value of the field for the given row (taking into account the shift, if
any) is a missing token:

::

    (missing? <field-designator> [<shift>])

E.g.:

.. code:: lisp

    (missing? "species")
    (missing? "000001" -2)
    (missing? 3 1)
    (missing? "a field" 23)

will all yield boolean values. For backwards compatibility, ``missing``
is an alias for ``missing?``.

Randomized field values
~~~~~~~~~~~~~~~~~~~~~~~

There are two Flatline functions that let you generate a random value in
the domain of a given field, given its designator:

::

    (random-value <field-designator>)
    (weighted-random-value <field-designator>)

e.g.

.. code:: lisp

    (random-value "age")
    (weighted-random-value "000001")
    (weighted-random-value 3)

Both functions generate a value with the constraint that it belongs to
the domain of the given field. The difference is that ``random-value``
draws uniformly over the field's range of values, while
``weighted-random-value`` uses the distribution of the field values (as
computed in its histogram) as the probability measure for the random
generator.
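The difference between the two samplers can be sketched in Python as
follows. This is only an illustrative model, not Flatline's
implementation: the helper names, the ``(bin center, count)`` histogram
layout and the choice of returning bin centers are all assumptions made
for the example.

```python
import random


def random_value(minimum, maximum, rng):
    """Uniform draw over the field's range (a model of random-value)."""
    return rng.uniform(minimum, maximum)


def weighted_random_value(histogram, rng):
    """Draw a bin center with probability proportional to its count
    (a simplified model of weighted-random-value)."""
    centers = [center for center, _ in histogram]
    counts = [count for _, count in histogram]
    return rng.choices(centers, weights=counts, k=1)[0]


rng = random.Random(42)

# Hypothetical histogram for an "age" field: (bin center, count) pairs.
age_histogram = [(5.0, 10), (30.0, 80), (70.0, 10)]

uniform_draw = random_value(5.0, 70.0, rng)
weighted_draws = [weighted_random_value(age_histogram, rng)
                  for _ in range(1000)]
```

In this model the uniform draw can land anywhere between the minimum and
the maximum, whereas the weighted draws concentrate on the most populated
bin.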
These two functions work for numeric, categorical and text fields, with
generated values satisfying:

- For numeric fields, generated values are in the interval
  ``[(minimum <field-designator>), (maximum <field-designator>)]``
- For categorical fields, generated values belong to the set
  ``(categories <field-designator>)``
- For text fields, we generate terms in the field's tag cloud (generated
  values correspond to single terms in the cloud).
- Datetime **parent** fields are not supported, since they don't have a
  defined distribution: you can use any of their numeric children for
  generating values following their distributions.

A common use of these functions is replacing missing values with random
data, which in Flatline you could write as, say:

.. code:: lisp

    (if (missing? "000000") (random-value "000000") (f "000000"))

We provide a shortcut for those common operations with the functions
``ensure-value`` and ``ensure-weighted-value``:

::

    (ensure-value <field-designator>)
      := (if (missing? <field-designator>)
             (random-value <field-designator>)
             (field <field-designator>))

    (ensure-weighted-value <field-designator>)
      := (if (missing? <field-designator>)
             (weighted-random-value <field-designator>)
             (field <field-designator>))

With them, our example above can be simply written as:

.. code:: lisp

    (ensure-value "000000")

or, if you want the generated random values to follow the same
distribution as the field "000000":

.. code:: lisp

    (ensure-weighted-value "000000")

Normalized field values
~~~~~~~~~~~~~~~~~~~~~~~

For numeric fields, it's often useful to normalize their values to a
standard interval (usually [0, 1]). To that end, you can use the Flatline
primitive ``normalize``, which takes as arguments the designator for the
field you want to normalize and, optionally, the two bounds of the
resulting interval:

::

    (normalize <id> [<from> <to>])
      => (+ from
            (* (- to from)
               (/ (- (f id) (minimum id))
                  (- (maximum id) (minimum id)))))

For instance:

.. code:: lisp

    (normalize "000001") ;; = (normalize "000001" 0 1)
    (normalize "width" -1 1)
    (normalize "length" 8 23)

As shown in the formula above, ``normalize`` linearly maps the minimum
value of the field to ``from`` (0 by default) and the maximum value to
``to`` (1 by default).

Besides this linear normalization, it's also common to standardize
numeric values so that they have zero mean and unit standard deviation,
according to the equation:

::

    x[i] -> (x[i] - mean(x)) / standard_deviation(x)

or, in Flatline terms:

::

    (/ (- (f <id>) (mean <id>)) (standard-deviation <id>))

This normalization function is called the Z score, and we provide it as
the function ``z-score``:

::

    (z-score <field-designator>)

E.g.:

.. code:: lisp

    (z-score "000034")
    (z-score "a numeric field")
    (z-score 23)

As with ``normalize``, the field used must have a numeric optype.

You can use the function ``log-normal`` to apply ``z-score`` to the
logarithm of your field. This is useful when your field follows a
log-normal distribution and you want to map it to a gaussian.

.. code:: lisp

    (log-normal "000003")
    (log-normal "a numeric field")
    (log-normal 1)

This function requires numeric fields with at least 80% of their values
greater than 0 and a non-zero mean value.

Vectorized categorical or text fields
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It may be useful to convert categorical or text fields to numeric values
for models which accept only numeric data as input. This can be
accomplished with the Flatline primitive ``vectorize``:

::

    (vectorize <field-designator> [<n>])

For categorical fields, the output is a binary indicator vector. In other
words, it is a list of numeric fields, one per possible categorical
value. For each instance, the numeric field corresponding to the category
of that instance has a value of ``1``, whereas the remaining numeric
fields have a value of ``0``.

For text fields, the output is a list of numeric fields, each
corresponding to a term in the field's tag cloud.
The value of each field is the number of times that term appears in that
instance.

A numeric expression or literal can be passed as an optional second
argument to limit the number of generated fields to the *n* most frequent
categories or text terms.

Field properties
~~~~~~~~~~~~~~~~

Summary properties
^^^^^^^^^^^^^^^^^^

Field descriptors contain many properties with metadata about the field,
including its summary. These properties (when they're atomic) can be
accessed via ``field-prop``:

::

    (field-prop <type> <field-designator> <property-name> ...)
    <type> := string | numeric | boolean

For instance, you can access the name of field "00023" via:

.. code:: lisp

    (field-prop string "00023" name)

or the value of the nested property ``missing_count`` inside the summary
with:

.. code:: lisp

    (field-prop numeric "00023" summary missing_count)

We provide several shortcuts for concrete summary properties, to save you
typing:

::

    (maximum <field-designator>)
    (mean <field-designator>)
    (median <field-designator>)
    (minimum <field-designator>)
    (missing-count <field-designator>)
    (population <field-designator>)
    (sum <field-designator>)
    (sum-squares <field-designator>)
    (standard-deviation <field-designator>)
    (variance <field-designator>)
    (preferred? <field-designator>)
    (category-count <field-designator> <category>)
    (bin-center <field-designator> <bin-number>)
    (bin-count <field-designator> <bin-number>)

As you can see, the category and bin accessors take an additional
parameter designating, respectively, the category (a string or order
number) or the bin (a 0-based integer index) you refer to:

.. code:: lisp

    (category-count "species" "Iris-versicolor")
    (category-count "species" (f "000004"))
    (bin-count "age" (f "bin-selector"))
    (bin-center "000003" 3)
    (bin-center (field "field-selector") 4)

Discretization of numeric fields
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A simple way to discretize a numeric field is to assign a label to each
of a finite set of segments, defined by a sequence of upper bounds. For
instance:

.. code:: lisp

    (let (v (f "age"))
      (cond (< v 2) "baby"
            (< v 10) "child"
            (< v 20) "teenager"
            "adult"))

Flatline provides a shortcut for the above expression via its
``segment-label`` primitive:

.. code:: lisp

    (segment-label "000000" "baby" 2 "child" 10 "teenager" 20 "adult")

As you can see, the first argument is the field designator (as usual, a
name, column number or identifier), followed by alternating labels and
upper bounds. More generally:

::

    (segment-label <fdes> <l1> <u1> ... <l(n-1)> <u(n-1)> <ln>)
    with <l1> ... <ln> strings, <u1> ... <u(n-1)> numbers
      => (cond (< (f <fdes>) <u1>) <l1>
               (< (f <fdes>) <u2>) <l2>
               ...
               (< (f <fdes>) <u(n-1)>) <l(n-1)>
               <ln>)

The alternating labels and bounds must be constant strings and numbers.

If you want to use segments of equal length between the minimum and
maximum value of the field, you can omit the upper bounds and give simply
the list of labels, e.g.

.. code:: lisp

    (segment-label 0 "1st fourth" "2nd fourth" "3rd fourth" "4th fourth")

which would be equivalent to:

.. code:: lisp

    (let (max (maximum 0)
          min (minimum 0)
          step (/ (- max min) 4))
      (segment-label 0
                     "1st fourth" (+ min step)
                     "2nd fourth" (+ min step step)
                     "3rd fourth" (+ min step step step)
                     "4th fourth"))

or, in general:

::

    (segment-label <fdes> <l1> ... <ln>)
    with <l1> ... <ln> strings
      => (let (min (minimum <fdes>)
               step (/ (- (maximum <fdes>) min) n)
               shift (- (f <fdes>) min))
           (cond (< shift step) <l1>
                 (< shift (* 2 step)) <l2>
                 ...
                 (< shift (* (- n 1) step)) <l(n-1)>
                 <ln>))

Items and itemsets
^^^^^^^^^^^^^^^^^^

A common operation on fields of optype *items* is to check whether they
contain a list of items. That can be used, for instance, to filter the
rows of a dataset that satisfy a given association rule, by calling
``contains-items?`` with the list of items in the antecedent and
consequent of the desired rule.

::

    (contains-items? <field-designator> <item_0> ... <item_n>)
    ;; with <item_i> of type string for i in [0, n]

The ``contains-items?`` primitive takes as first argument the designator
of the field we want to check (which must have optype items), followed by
one or more items we want to check for, which must all have type string.
For instance, the predicate:

.. code:: lisp

    (contains-items? "000000" "blue" "green" "darkblue")

selects the rows whose first column satisfies the association rule
``blue, green -> darkblue``.
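The containment check described above can be modelled as a subset test.
The following is a minimal Python sketch rather than Flatline's actual
implementation: it assumes each row is a dict whose items field holds a
list of strings, and ``contains_items`` and the sample rows are
hypothetical names introduced for illustration.

```python
def contains_items(row_items, *items):
    """Return True when every requested item appears among the row's
    items (a model of the contains-items? predicate)."""
    return set(items).issubset(row_items)


# Hypothetical rows; "000000" is the identifier of the items field.
rows = [
    {"000000": ["blue", "green", "darkblue", "red"]},
    {"000000": ["blue", "green"]},
    {"000000": ["darkblue"]},
]

# Keep the rows containing both the antecedent and the consequent of
# the rule blue, green -> darkblue.
matching = [row for row in rows
            if contains_items(row["000000"], "blue", "green", "darkblue")]
```

Only the first row survives the filter: extra items such as ``red`` do
not prevent a match, because the test is inclusive.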
It is also possible to check whether an items field contains *only* the
given list of items (in any order), using ``equal-to-items?``, which
works exactly as ``contains-items?`` except for the fact that it's
exclusive:

::

    (equal-to-items? <field-designator> <item_0> ... <item_n>)
    ;; with <item_i> of type string for i in [0, n]

Regions
^^^^^^^

It is possible to manipulate and modify values of type *regions*. In
Flatline, a regions value is a list of lists. Each of the inner lists has
5 elements. The first one is the label of the region (a string), and it's
followed by four integers, which are the coordinates of the top-left
corner and bottom-right corner of the region at hand.

The ``region?`` primitive checks whether a list represents a valid region
(checking also whether the vertex coordinates are consistent):

.. code:: lisp

    (region? (list "label" 10 10 20 30)) ;; => "true"
    (region? (list 10 10 20 30)) ;; => "false"
    (region? (list -10 10 -20 30)) ;; => "false"

When we access a field of type regions, the returned value is a list with
all its values satisfying the ``region?`` predicate. We can add a new
region to it with ``add-region``:

::

    (add-region ) (add-region