A light-weight Julia interface for implementing conventions about the scientific interpretation of data, and for performing type coercions enforcing those conventions.
7
+
A light-weight Julia interface for implementing conventions about the
8
+
scientific interpretation of data, and for performing type coercions
9
+
enforcing those conventions.
8
10
9
11
The package makes the distinction between **machine type** and **scientific type**:
10
12
11
13
* the _machine type_ is a Julia type the data is currently encoded as (for instance: `Float64`)
12
-
* the _scientific type_ is a type defined by this package which encapsulates how the data should be _interpreted_ in the rest of the code (for instance: `Continuous` or `Multiclass`)
14
+
* the _scientific type_ is a type defined by this package which
15
+
encapsulates how the data should be _interpreted_ (for instance:
16
+
`Continuous` or `Multiclass`)
13
17
14
-
As a motivating example, the data might contain a column corresponding to a _number of transactions_; the machine type in that case could be an `Int`, whereas the scientific type would be a `Count`.
18
+
The distinction is useful because the same machine type is often used
19
+
to represent data with *differing* scientific interpretations - `Int`
20
+
is used for product numbers (a factor) but also for a person's weight
21
+
(a continuous variable) - while the same scientific
22
+
type is frequently represented by *different* machine types - both
23
+
`Int` and `Float64` are used to represent weights, for example.
15
24
16
-
The usefulness of this machinery becomes evident when the machine type does not directly connect with a scientific type; taking the previous example, the data could have been encoded as a `Float64` whereas the meaning should still be a `Count`.
17
25
18
26
## Very quick start
19
27
20
-
(For more information and examples please refer to [the doc](https://alan-turing-institute.github.io/ScientificTypes.jl/dev))
28
+
For more information and examples please refer to [the
This is a very quick start presenting two key functions exported by ScientificTypes:
31
+
ScientificTypes.jl has three components:
23
32
24
-
* `schema(X)`, which gives an extended schema of the table `X` with the column scientific types implied by the current scitype convention,
25
-
* `coerce(X, ...)`, which lets you overwrite the scientific types of specific columns to indicate their appropriate scientific interpretation.
33
+
- An *interface*, for articulating a convention about the scientific
34
+
interpretation of data. This consists of a definition of a scientific
35
+
type hierarchy, and a single function `scitype` with scientific
36
+
types as values. Someone implementing a convention must add methods
37
+
to this function, while the general user just applies it to data, as
38
+
in `scitype(4.5)` (returning `Continuous` in the *mlj* convention).
39
+
40
+
- A built-in convention, called *mlj*, active by default.
41
+
42
+
- Convenience methods for working with scientific types, the most commonly used being:
43
+
44
+
- `schema(X)`, which gives an extended schema of any table `X`,
45
+
including the column scientific types implied by the active
46
+
convention.
47
+
.
48
+
- `coerce(X, ...)`, which coerces the machine types of `X`
49
+
to reflect a desired scientific type.
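
As a minimal sketch of the interface and convenience methods listed above, the snippet below applies `scitype` to scalars and `coerce` to a plain vector. It assumes a recent release of ScientificTypes with the default *mlj* convention active; the variable names `v` and `w` are illustrative only, and the exact call forms may differ at the revision shown in this diff.

```julia
using ScientificTypes

# Scalars: the scientific type is read off the machine type by the convention.
scitype(4.5)   # Continuous, as stated in the list above
scitype(4)     # Count under the mlj convention

# Coercion changes the machine type so that it reflects the desired scitype.
v = [1, 2, 3]               # Vector{Int}, interpreted as Count
w = coerce(v, Continuous)   # re-encoded as floating point
eltype(w)                   # Float64
```

Here the integers are re-encoded as floats because, under this convention, `Continuous` data is represented by `AbstractFloat` machine types.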
26
50
27
51
```julia
28
52
using ScientificTypes, DataFrames
@@ -49,7 +73,8 @@ will print
49
73
:e -- Union{Missing, Unknown}
50
74
```
51
75
52
-
this uses the default "MLJ convention" to attribute a scitype (cf. [docs](https://alan-turing-institute.github.io/ScientificTypes.jl/dev/#The-MLJ-convention-1)).
76
+
this uses the default *mlj* convention to attribute a scitype
docs/src/index.md: 33 additions & 10 deletions
@@ -17,22 +17,35 @@ The package `ScientificTypes` provides:
17
17
18
18
- A hierarchy of new Julia types representing scientific data types for use in method dispatch (e.g., for trait values). Instances of the types play no role:
19
19
20
-
```@example 0
21
-
using ScientificTypes, AbstractTrees
22
-
ScientificTypes.tree()
20
+
```
21
+
Found
22
+
├─ Known
23
+
│ ├─ Finite
24
+
│ │ ├─ Multiclass
25
+
│ │ └─ OrderedFactor
26
+
│ ├─ Infinite
27
+
│ │ ├─ Continuous
28
+
│ │ └─ Count
29
+
│ ├─ Image
30
+
│ │ ├─ ColorImage
31
+
│ │ └─ GrayImage
32
+
│ └─ Table
33
+
└─ Unknown
23
34
```
24
35
25
36
- A single method `scitype` for articulating a convention about what scientific type each Julia object can represent. For example, one might declare `scitype(::AbstractFloat) = Continuous`.
26
37
27
-
- A default convention called *mlj*, based on optional dependencies `CategoricalArrays`, `ColorTypes`, and `Tables`, which includes a convenience method `coerce` for performing scientific type coercion on `AbstractVectors` and columns of tabular data (any table implementing the [Tables.jl](https://github.com/JuliaData/Tables.jl) interface).
38
+
- A default convention called *mlj*, based on dependencies
39
+
`CategoricalArrays`, `ColorTypes`, and `Tables`, which includes a
40
+
convenience method `coerce` for performing scientific type coercion
41
+
on `AbstractVectors` and columns of tabular data (any table
42
+
implementing the [Tables.jl](https://github.com/JuliaData/Tables.jl)
43
+
interface).
28
44
29
45
- A `schema` method for tabular data, based on the optional Tables dependency, for inspecting the machine and scientific types of tabular data, in addition to column names and number of rows.
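
To illustrate how a convention can be articulated through dispatch on a single function, as in the `scitype(::AbstractFloat) = Continuous` declaration mentioned in the list above, here is a toy, stand-alone sketch. Every name in it (`ToyFound`, `ToyContinuous`, `toy_scitype`, and so on) is hypothetical and defined locally; it mirrors part of the tree shown above but is not the package's own code.

```julia
# Toy type hierarchy (hypothetical names; NOT the package's own types).
abstract type ToyFound end
abstract type ToyKnown <: ToyFound end
struct ToyUnknown <: ToyFound end
abstract type ToyInfinite <: ToyKnown end
struct ToyContinuous <: ToyInfinite end
struct ToyCount <: ToyInfinite end

# A "convention" is just a set of methods mapping objects to scientific types.
toy_scitype(::Any) = ToyUnknown            # fallback for unrecognised objects
toy_scitype(::AbstractFloat) = ToyContinuous
toy_scitype(::Integer) = ToyCount

toy_scitype(4.5)     # ToyContinuous
toy_scitype(4)       # ToyCount
toy_scitype("abc")   # ToyUnknown
```

Note that the methods return the types themselves, never instances, which matches the remark that instances of the scientific types play no role.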
30
46
31
-
### Dependencies
32
-
33
-
The only dependencies are [`Requires.jl`](https://github.com/MikeInnes/Requires.jl) and `InteractiveUtils` (from stdlib).
34
47
35
-
## Quick start
48
+
## Getting started
36
49
37
50
The package is registered and can be installed via the package manager with `add ScientificTypes`.
38
51
@@ -182,6 +195,16 @@ Similarly, the scitype of an `AbstractArray` is `AbstractArray{U}` where `U` is
182
195
scitype([1.3, 4.5, missing])
183
196
```
184
197
198
+
*Performance note:* Computing type unions over large arrays is
199
+
expensive and, depending on the convention's implementation and the
200
+
array eltype, computing the scitype can be slow. (In the *mlj*
201
+
convention this is mitigated with the help of the
202
+
`ScientificTypes.Scitype` method, of which other conventions could
203
+
make use. Do `?ScientificTypes.Scitype` for details.) An eltype `Any`
204
+
will always be slow and you may want to consider replacing an array
205
+
`A` with `broadcast(identity, A)` to collapse the eltype and speed up
206
+
the computation.
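
The eltype-collapsing trick in the performance note can be sketched with Base Julia alone. The arrays `A`, `B`, `M` and `C` below are illustrative; no ScientificTypes call is made, so the effect on `scitype` timing itself is not shown.

```julia
# An array whose elements are all floats but whose declared eltype is Any.
A = Any[1.3, 4.5, 2.0]
eltype(A)                    # Any

# Broadcasting identity rebuilds the array with the narrowest eltype found.
B = broadcast(identity, A)
eltype(B)                    # Float64

# Missing values are retained as a small Union rather than widening to Any.
M = Any[1.3, 4.5, missing]
C = broadcast(identity, M)
eltype(C)                    # Union{Missing, Float64}
```

The elements are unchanged; only the container's declared element type is tightened, which is what allows type-based shortcuts to apply.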
207
+
185
208
Provided the [Tables.jl](https://github.com/JuliaData/Tables.jl) package is loaded, any table implementing the Tables interface has a scitype encoding the scitypes of its columns:
186
209
187
210
```@example 5
@@ -288,7 +311,7 @@ X = (a = rand("abc", n), # 3 values, not number --> Multiclass
288
311
autotype(X, only_changes=true)
289
312
```
290
313
291
-
For example, we could first apply the `:discrete_to_continuous` rule,
314
+
For example, we could first apply the `:discrete_to_continuous` rule,
292
315
followed by the `:few_to_finite` rule. The first rule will apply to `b` and `e`
293
316
but the subsequent application of the second rule will mean we will
294
317
get the same result apart from `e` (which will be `Continuous`)