KDB Documentation - Transforming Data

Richard2012 · Juejin · 2021-03-28 15:41

Types

Casting is a way to convert a value of one type to a compatible type. Sometimes the conversion is exact; other times information is lost. Enumeration and parsing in q also fit into the cast pattern.

The type Operator

The non-atomic unary function (type) can be applied to any entity in q to return its data type expressed as a short. It is a "feature" of q that the data type of an atom is negative whereas the type of a simple list is positive.

q)type 42
-7h
q)type 10 20 30
7h
q)type 98.6
-9h
q)type 1.1 2.2 3.3
9h
q)type `a
-11h
q)type `a`b`c
11h
q)type "z"
-10h
q)type "abc"
10h
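As a small illustration of the sign convention, the sketch below tests whether a value is an atom by checking for a negative type. The name isAtom is a hypothetical helper introduced here, not a q built-in. The last line shows that a general (mixed) list has type 0h.

```q
/ isAtom is a hypothetical helper name, not a q built-in
isAtom:{0>type x}

isAtom 42          / 1b: the type of a long atom is -7h
isAtom 10 20 30    / 0b: the type of a simple long list is 7h
type (42;98.6;`a)  / 0h: a general (mixed) list
```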

Type of a Variable

The type of a variable is the type of the value associated with the variable's name.

q)a:42
q)type a
-7h
q)a:"abc"
q)type a
10h
q)value `. / variables in the root context
a| "abc"

Cast

Since q is dynamically typed, casting occurs at run-time using the binary operator $, which is atomic in both operands. The right operand is the source value and the left operand specifies the target type. There are three ways to specify the target type:

  • A (positive) numeric short type value
  • A char type value
  • A type name symbol

Casts that Widen

In these examples no information is lost in the cast, either because the target type is wider than the source type or because the value fits in the target type. Here are examples using the short type specification in the target.

q)7h$42i / int to long
42
q)6h$42 / long to int
42i
q)9h$42 / long to float
42f

It is arguably most readable to use the symbolic type name.

q)`int$42
42i
q)`long$42i
42
q)`float$42
42f
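To make the three target specifications concrete, here is a minimal sketch that performs the same int-to-long cast with each of them and checks that the results match. The variable names r1, r2 and r3 are chosen only for this illustration.

```q
/ the same widening cast, written three ways
r1:7h$42i       / numeric short type value
r2:"j"$42i      / char type value
r3:`long$42i    / type name symbol

(r1~r2) and r2~r3  / 1b: all three produce the long 42
type r1            / -7h
```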

Casts across Disparate Types

The underlying value of a char is its position in the ASCII collation sequence, so we can cast char to and from integers, provided the integer is less than 256.

q)`char$42
"*"
q)`long$"\n"
10

The underlying value of a date is its count of days from the millennium, so we can cast to and from an int.

q)`date$0
2000.01.01
q)`int$2001.01.01 / millennium occurred on leap year
366

Casts that Narrow

Some casts lose information. This includes the usual suspects of float to integer and wider integers to narrower ones.

q)`long$12.345
12
q)`short$123456789
32767h

Cast any numeric to a boolean using the C philosophy that zero is 0b and anything else is 1b.
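A short sketch of the boolean cast described above; the input values are arbitrary and chosen only for illustration.

```q
`boolean$0         / 0b
`boolean$0.0       / 0b
`boolean$123       / 1b
`boolean$-12.345   / 1b
`boolean$0 1 2 0   / 0110b: the cast applies item-wise to a list
```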

We can also extract constituents from complex types.

q)`date$2015.01.02D10:20:30.123456789
2015.01.02
q)`year$2015.01.02
2015i
q)`month$2015.01.02
2015.01m
q)`mm$2015.01.02
1i
q)`dd$2015.01.02
2i
q)`hh$10:20:30.123456789
10i
q)`minute$10:20:30.123456789
10:20
q)`uu$10:20:30.123456789
20i
q)`second$10:20:30.123456789
10:20:30
q)`ss$10:20:30.123456789
30i

Casting Integral Infinities

When integral infinities are cast to integers of wider type, their underlying bit patterns are reinterpreted. Since these bit patterns are legitimate values for the wider type, the cast results in a finite value.

q)`int$0Wh
32767i
q)`int$-0Wh
-32767i
q)`long$0Wi
2147483647
q)`long$-0Wi
-2147483647

Coercing Types

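Assignment into a simple list must match the type of the list exactly. Here, assigning a short into a list of longs fails.

q)L:10 20 30 40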
q)L[1]:42h
'type

This situation can arise when the list and the assignment value are created dynamically. Coerce the type by casting it to that of the target, provided of course that the cast is legitimate.

q)L[1]:(type L)$42h
q)L,:(type L)$43h
q)L
10 42 30 40 43
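The coercion pattern can be wrapped in a function when it is needed repeatedly. The sketch below is one possible way to do this, assuming the target is a simple list; setAt is a hypothetical name introduced here, and it returns a modified copy rather than changing the list in place.

```q
/ setAt is a hypothetical helper: return a copy of simple list l with
/ x assigned at index i, after casting x to the type of l
setAt:{[l;i;x] @[l;i;:;(type l)$x]}

L:10 20 30 40
setAt[L;1;42h]   / 10 42 30 40
L                / 10 20 30 40: the original list is unchanged
```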

Cast is Atomic

Cast is atomic in the right operand.

q)"i"$10 20 30
10 20 30i
q)`float$(42j; 42i; 42j)
42 42 42f

Cast is atomic in the left operand.

q)`short`int`long$42
42h
42i
42
q)"ijf"$98.6
99i
99
98.6

Cast is atomic in both operands simultaneously.

q)"ijf"$10 20 30
10i
20
30f

Data to and from Text

A q string is a simple list of char.

Data to Strings

The function string can be applied to any q entity to produce a text representation suitable for console display or storage in a file. Here are the key features of string.

  • The result is always a list of char, never a single char. Thus you will see singleton char lists from single digits.

  • The result contains no q type indicators or other decorations. In general, the result is the most compact representation of the input, which may not actually be convertible (i.e., parsed) back to the original value.

  • Applying string to an actual string (i.e., list of char) probably will not give you what you want.

q)string 42
"42"
q)string 4
,"4"
q)string 42i
"42"
q)a:2.0
q)string a
,"2"
q)f:{x*x}
q)string f
"{x*x}"

The string function is not atomic, since it returns a list of char even for an atom, but it is applied item-wise to the items of a list.

q)string 1 2 3
,"1"
,"2"
,"3"
q)string "string"
,"s"
,"t"
,"r"
,"i"
,"n"
,"g"
q)string (1 2 3; 10 20 30)
,"1" ,"2" ,"3"
"10" "20" "30"
q)string `Life`the`Universe`and`Everything
"Life"
"the"
"Universe"
"and"
"Everything"

Creating Symbols from Strings

To cast a char or a string to a symbol, use `$.

q)`$"abc"
`abc
q)`$"Hello World"
`Hello World

You can include any characters in a symbol this way, but you may need to escape them in the string.

q)`$"Zaphod \"Z\""
`Zaphod "Z"
q)`$"Zaphod \n"
`Zaphod

The unary `$ is atomic and will thus convert an entire list (or column) of strings to symbols.

q)`$("Life";"the";"Universe";"and";"Everything")
`Life`the`Universe`and`Everything

Parsing Data from Strings

The $ operator is overloaded to parse strings into data of any type exactly as the q interpreter does. This overload is invoked by using an uppercase type char as the target left operand and a string in the right operand. If the specified parse cannot be performed, a null of the target type is returned.

q)"J"$"42"
42
q)"F"$"42"
42f
q)"F"$"42.0"
42f
q)"I"$"42.0"
0Ni
q)"I"$" "
0Ni
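Parsing also works on a list of strings, which is the common case when loading text fields. The following sketch uses made-up input strings to show that each string is parsed separately and that a field that cannot be parsed yields a null without failing the whole operation.

```q
"J"$("10";"20";"30")   / 10 20 30: one parse per string
"J"$("10";"x";"30")    / 10 0N 30: the bad field becomes a null
42~"J"$string 42       / 1b: string followed by parse round-trips here
```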

Date parsing is flexible with respect to the format of the date.

q)"D"$"12.31.2014"
2014.12.31
q)"D"$"12-31-2014"
2014.12.31
q)"D"$"12/31/2014"
2014.12.31
q)"D"$"12/1/2014"
2014.12.01
q)"D"$"2014/12/31"
2014.12.31

Creating Typed Empty Lists

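Cast the general empty list to the desired type to create a typed empty list. Only values of that type can then be appended.

q)c1:`float$()
q)c1,:98.6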

Notice that an operation that yields a simple list retains the type on an empty result.

q)0#10 20 30
`long$()
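The sketch below contrasts the general empty list with a typed one. The variable name c is used only for this illustration.

```q
type ()          / 0h: the plain empty list is general
type `float$()   / 9h: a typed empty float list
c:()
c,:`a            / the first append to a general empty list decides its type
type c           / 11h: c is now a simple symbol list
```

With the untyped empty list, whatever value arrives first determines the type of the list, which is why a typed empty list is the safer starting point.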

Enumerations

Traditional Enumerations

To begin, recall that in traditional languages, an enumerated type is a way of associating a series of names with a corresponding set of integral values. Often the sequence of numbers is consecutive and begins with 0. The association is usually given a name and represents a new type.

A traditional enumerated type serves multiple purposes.

  • It allows a descriptive name to be used instead of an arbitrary number – e.g., 'red', 'green', 'blue' instead of 0, 1 and 2.
  • It enables type checking to ensure that only permissible values are supplied – e.g., choosing a color name from a list instead of remembering its number is less prone to error.
  • It provides namespacing, meaning the same name can be reused in different domains without fear of confusion – e.g., color.blue and note.blue (the flatted fifth).

There is also a subtler, more powerful use of enumerations: normalizing data.

Data Normalization

Broadly speaking, data normalization seeks to eliminate duplication, retaining only the minimum required data. In the archetypal example, suppose you know that you will have a list of text entries taken from a fixed and reasonably short set of values. Rather than store every entry verbatim, store each distinct value once in a list u and replace the original list v with the list k of indices into u.

q)v:`ccccccc`bbbbbbb`aaaaaaa`ccccccc`ccccccc`bbbbbbb
q)u:distinct v
q)u
`ccccccc`bbbbbbb`aaaaaaa
q)k:u?v
q)k
0 1 2 0 0 1
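Normalization loses nothing: the original list can be rebuilt by indexing the unique values with the indices. A minimal check, reusing the names v, u and k from the example above:

```q
v:`ccccccc`bbbbbbb`aaaaaaa`ccccccc`ccccccc`bbbbbbb
u:distinct v   / each value stored once
k:u?v          / index of each item of v within u
v~u k          / 1b: indexing u by k reconstructs v
```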

Enumerating Symbols

The process of converting a list of symbols to the equivalent list of indices described in the previous section is called enumeration in q.

It uses (yet another overload of) $ with the name of the variable holding the unique symbols as the left operand and a list of symbols drawn from that domain on the right.

q)`u$v
`u$`ccccccc`bbbbbbb`aaaaaaa`ccccccc`ccccccc`bbbbbbb
q)`int$(`u$v)
0 1 2 0 0 1i

Using Enumerated Symbols

We continue with the example of the previous section, renamed to use the standard sym domain.

q)sym:`g`aapl`msft`ibm
q)v:1000000?sym
q)ev:`sym$v
q)ev
`sym$`g`g`msft`aapl`msft`aapl`msft`ibm`msft`aapl`g`ibm`aapl`msft`msft`aapl`g`..
q)`int$ev
0 0 2 1 2 1 2 3 2 1 0 3 1 2 2 1 0 1 2 0 1 1 2 3 0 1 2 1 2 1 0 0 1 2 1 2 3 3 0..

The enumerated ev can be substituted for the original v in nearly all situations.

q)v[3]
`aapl
q)ev[3]
`sym$`aapl
q)v[3]:`ibm
q)ev[3]:`ibm
q)v=`ibm
000100010010011101000010010100000000100100000001000000001100001001011..
q)ev=`ibm
000100010010011101000010010100000000100100000001000000001100001001011..
q)where v=`aapl
4 5 19 20 21 31 33 34 41 42 43 49 58 59 61 74 81 83 90 94 95 98 114..
q)where ev=`aapl
4 5 19 20 21 31 33 34 41 42 43 49 58 59 61 74 81 83 90 94 95 98 114..
q)v?`aapl
4
q)ev?`aapl
4
q)v in `ibm`aapl
000111010010011101011110010100010110100101110001010000001111011001011..
q)ev in `ibm`aapl
000111010010011101011110010100010110100101110001010000001111011001011..

While the enumerated version is item-wise equal to the original, the entities are not identical.

q)all v=ev
1b
q)v~ev
0b

This is because the types matter with ~.

Type of Enumerations

Each enumeration is assigned a new numeric data type, beginning with 20h. Starting with q version 3.2, the type 20h is reserved for the conventional enumeration domain sym, whether you use it or not (you should). The types of other enumerations you create will begin with 21h and proceed sequentially. The convention of negative type for atoms and positive type for simple lists still holds. In a fresh q session we see the following.

q)sym1:`g`aapl`msft`ibm
q)type `sym1$1000000?sym1
21h
q)sym2:`a`b`c
q)type `sym2$`c
-22h

Updating an Enumerated List

The normalization provided by an enumeration reduces updating all occurrences of a given value to a single operation. This can have significant performance implications for large lists with many repetitions. Continuing with our example above, suppose the list sym contains the items in a stock index and we wish to change one of the constituents. A single update to sym suffices.

q)sym:`g`aapl`msft`ibm
q)ev:`sym$`g`g`msft`ibm`aapl`aapl`msft`ibm`msft`g`ibm`g..
q)sym[0]:`twit
q)sym
`twit`aapl`msft`ibm
q)ev
`sym$`twit`twit`msft`ibm`aapl`aapl`msft`ibm`msft`twit`ibm`twit..

In contrast, making the equivalent update to v requires changing every occurrence.

q)v
`g`g`msft`ibm`aapl`aapl`msft`ibm`msft`g`ibm`g…
q)@[v; where v=`g; :; `twit]

Dynamically Appending to an Enumeration Domain

One situation in which an enumeration is more complicated than working with the denormalized data is when you want to add a new value. Continuing with the example above, appending a new item to an ordinary list of symbols is a single operation, but appending it to the enumerated list fails with a cast error if the value is not yet in the domain.

q)sym:`g`aapl`msft`ibm
q)v:1000000?sym
q)ev:`sym$v
q)v,:`twtr
q)ev,:`twtr
'cast

The new value must first be added to the unique list.

q)sym,:`twtr
q)ev,:`twtr
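The two steps can also be combined. The sketch below uses the Enum Extend form of ?, with the domain name as a symbol on the left, which adds the value to the domain if it is missing and returns the enumerated value. The short ev list here is made up for the illustration.

```q
sym:`g`aapl`msft`ibm
ev:`sym$`g`msft`ibm
ev,:`sym?`twtr   / appends twtr to sym if absent, then enumerates it
sym              / `g`aapl`msft`ibm`twtr
value ev         / `g`msft`ibm`twtr
```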

Resolving an Enumeration

An enumerated symbol can be substituted for its equivalent symbol value in most expressions. However, there are some situations in which you need non-enumerated values. One case is converting from one enumeration domain to another, which happens when copying from one kdb+ database to another or in merging two databases.

Given an enumerated symbol, or a list of such, you can recover the un-enumerated value(s) by applying the built-in value. In our ongoing example,

q)sym:`g`aapl`msft`ibm
q)v:1000000?sym
q)ev:`sym$v
q)value ev
`aapl`g`msft`msft`ibm`msft`msft`msft`msft`msft`g`ibm`ibm`ibm..
q)v~value ev
1b
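For the domain-conversion case mentioned above, one possible approach is to resolve the values and then enumerate them again over the other domain. In this sketch sym2 is a hypothetical second domain with a different ordering, and the short ev list is made up for the illustration.

```q
sym:`g`aapl`msft`ibm
ev:`sym$`aapl`g`msft`msft`ibm
/ sym2 is a hypothetical second domain, ordered differently from sym
sym2:`ibm`msft
ev2:`sym2?value ev           / resolve, then re-enumerate, extending sym2 as needed
(value ev2)~value ev         / 1b: both resolve to the same symbols
not (`int$ev2)~`int$ev       / 1b: the underlying indices differ between domains
```

The underlying indices are only meaningful relative to their own domain, which is why the values are resolved before they are enumerated over sym2.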

Source: code.kx.com/q4m3/7_Tran…

Author: Jeffry A. Borror