Commit 4e7f935

Merge pull request #42 from alan-turing-institute/dev
For tagged release 0.2.4
2 parents d10edb6 + f12c30d commit 4e7f935

12 files changed: +297 -144 lines changed

.travis.yml

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ matrix:
 - julia: nightly

 after_success:
-- julia -e 'using Pkg; pkg"add Coverage"; using Coverage; Codecov.submit(Codecov.process_folder())'
+- julia -e 'import Pkg; Pkg.add("Coverage"); using Coverage; Coveralls.submit(process_folder())'

 jobs:
 include:

Project.toml

Lines changed: 8 additions & 9 deletions
@@ -1,23 +1,22 @@
 name = "ScientificTypes"
 uuid = "321657f4-b219-11e9-178b-2701a2544e81"
 authors = ["Anthony D. Blaom <[email protected]>"]
-version = "0.2.3"
+version = "0.2.4"

 [deps]
-InteractiveUtils = "b77e0a4c-d291-57a0-90e8-8db25a27a240"
-Requires = "ae029012-a4dd-5104-9daa-d747884805df"
+CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
+ColorTypes = "3da002f7-5984-5a60-b8a6-cbb66c0b333f"
+Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"

 [compat]
-Requires = "0.5.2"
+CategoricalArrays = "^0.7"
+ColorTypes = "^0.8"
+Tables = "^0.2"
 julia = "1"

 [extras]
-AbstractTrees = "1520ce14-60c1-5f80-bbc7-55ef81b5835c"
-CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
-ColorTypes = "3da002f7-5984-5a60-b8a6-cbb66c0b333f"
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
-Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

 [targets]
-test = ["AbstractTrees", "CategoricalArrays", "ColorTypes", "Random", "Tables", "Test"]
+test = ["Random", "Test"]

README.md

Lines changed: 34 additions & 9 deletions
@@ -4,25 +4,49 @@
 | :-----------: | :------: | :-----------: |
 | [![Build Status](https://travis-ci.org/alan-turing-institute/ScientificTypes.jl.svg?branch=master)](https://travis-ci.org/alan-turing-institute/ScientificTypes.jl) | [![codecov.io](http://codecov.io/github/alan-turing-institute/ScientificTypes.jl/coverage.svg?branch=master)](http://codecov.io/github/alan-turing-institute/ScientificTypes.jl?branch=master) | [![](https://img.shields.io/badge/docs-stable-blue.svg)](https://alan-turing-institute.github.io/ScientificTypes.jl/dev) |

-A light-weight Julia interface for implementing conventions about the scientific interpretation of data, and for performing type coercions enforcing those conventions.
+A light-weight Julia interface for implementing conventions about the
+scientific interpretation of data, and for performing type coercions
+enforcing those conventions.

 The package makes the distinction between **machine type** and **scientific type**:

 * the _machine type_ is a Julia type the data is currently encoded as (for instance: `Float64`)
-* the _scientific type_ is a type defined by this package which encapsulates how the data should be _interpreted_ in the rest of the code (for instance: `Continuous` or `Multiclass`)
+* the _scientific type_ is a type defined by this package which
+  encapsulates how the data should be _interpreted_ (for instance:
+  `Continuous` or `Multiclass`)

-As a motivating example, the data might contain a column corresponding to a _number of transactions_, the machine type in that case could be an `Int` whereas the scientific type would be a `Count`.
+The distinction is useful because the same machine type is often used
+to represent data with *differing* scientific interpretations - `Int`
+is used for product numbers (a factor) but also for a person's weight
+(a continuous variable) - while the same scientific
+type is frequently represented by *different* machine types - both
+`Int` and `Float64` are used to represent weights, for example.

-The usefulness of this machinery becomes evident when the machine type does not directly connect with a scientific type; taking the previous example, the data could have been encoded as a `Float64` whereas the meaning should still be a `Count`.

 ## Very quick start

-(For more information and examples please refer to [the doc](https://alan-turing-institute.github.io/ScientificTypes.jl/dev))
+For more information and examples please refer to [the
+manual](https://alan-turing-institute.github.io/ScientificTypes.jl/dev).

-This is a very quick start presenting two key functions exported by ScientificTypes:
+ScientificTypes.jl has three components:

-* `schema(X)` which gives an extended schema of the table `X` with the column scientific types implied by the current scitype convention,
-* `coerce(X, ...)` which allows to overwrite scientific types for specific columns to indicate their appropriate scientific interpretation.
+- An *interface*, for articulating a convention about the scientific
+  interpretation of data. This consists of a definition of a scientific
+  type hierarchy, and a single function `scitype` with scientific
+  types as values. Someone implementing a convention must add methods
+  to this function, while the general user just applies it to data, as
+  in `scitype(4.5)` (returning `Continuous` in the *mlj* convention).
+
+- A built-in convention, called *mlj*, active by default.
+
+- Convenience methods for working with scientific types, the most commonly used being:
+
+    - `schema(X)`, which gives an extended schema of any table `X`,
+      including the column scientific types implied by the active
+      convention.
+      .
+    - `coerce(X, ...)`, which coerces the machine types of `X`
+      to reflect a desired scientific type.

 ```julia
 using ScientificTypes, DataFrames
@@ -49,7 +73,8 @@ will print
 :e -- Union{Missing, Unknown}
 ```

-this uses the default "MLJ convention" to attribute a scitype (cf. [docs](https://alan-turing-institute.github.io/ScientificTypes.jl/dev/#The-MLJ-convention-1)).
+this uses the default *mlj* convention to attribute a scitype
+(cf. [docs](https://alan-turing-institute.github.io/ScientificTypes.jl/dev/#The-MLJ-convention-1)).

 Now you could want to specify that `b` is actually a `Count`, and that `d` and `e` are `Multiclass`; this is done with the `coerce` function:
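
The README hunk above describes the interface as a single function to which a convention adds methods, and the source diff in this commit shows `scitype(X)` forwarding to `scitype(X, Val(convention()))`. The following standalone sketch imitates that pattern in Base Julia only. Every name in it (`DemoSci`, `demo_scitype`, `ACTIVE_CONVENTION`, and the `:physical` and `:tally` conventions) is hypothetical and chosen for illustration; none of it is the package's API.

```julia
# Toy scientific types (hypothetical stand-ins, not the package's own types)
abstract type DemoSci end
struct DemoContinuous <: DemoSci end
struct DemoCount <: DemoSci end
struct DemoUnknown <: DemoSci end

# The active convention is stored as a symbol and can be switched at run time
const ACTIVE_CONVENTION = Ref(:physical)

# Entry point for the general user: dispatch on the active convention
demo_scitype(x) = demo_scitype(x, Val(ACTIVE_CONVENTION[]))

# Fallback for anything a convention says nothing about
demo_scitype(x, ::Val) = DemoUnknown

# Convention :physical reads every real number as a continuous quantity
demo_scitype(::Real, ::Val{:physical}) = DemoContinuous

# Convention :tally reads integers as counts and floats as continuous
demo_scitype(::Integer, ::Val{:tally}) = DemoCount
demo_scitype(::AbstractFloat, ::Val{:tally}) = DemoContinuous

# The same machine type (Int) gets a different interpretation per convention
ACTIVE_CONVENTION[] = :physical
@assert demo_scitype(70) == DemoContinuous

ACTIVE_CONVENTION[] = :tally
@assert demo_scitype(70) == DemoCount
@assert demo_scitype("abc") == DemoUnknown
```

Wrapping the convention symbol in `Val` lets ordinary method dispatch select the rule set, so a new convention is added by defining methods and needs no changes to the entry point.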

docs/src/index.md

Lines changed: 33 additions & 10 deletions
@@ -17,22 +17,35 @@ The package `ScientificTypes` provides:

 - A hierarchy of new Julia types representing scientific data types for use in method dispatch (eg, for trait values). Instances of the types play no role:

-```@example 0
-using ScientificTypes, AbstractTrees
-ScientificTypes.tree()
+```
+Found
+├─ Known
+│  ├─ Finite
+│  │  ├─ Multiclass
+│  │  └─ OrderedFactor
+│  ├─ Infinite
+│  │  ├─ Continuous
+│  │  └─ Count
+│  ├─ Image
+│  │  ├─ ColorImage
+│  │  └─ GrayImage
+│  └─ Table
+└─ Unknown
 ```

 - A single method `scitype` for articulating a convention about what scientific type each Julia object can represent. For example, one might declare `scitype(::AbstractFloat) = Continuous`.

-- A default convention called *mlj*, based on optional dependencies `CategoricalArrays`, `ColorTypes`, and `Tables`, which includes a convenience method `coerce` for performing scientific type coercion on `AbstractVectors` and columns of tabular data (any table implementing the [Tables.jl](https://github.com/JuliaData/Tables.jl) interface).
+- A default convention called *mlj*, based on dependencies
+  `CategoricalArrays`, `ColorTypes`, and `Tables`, which includes a
+  convenience method `coerce` for performing scientific type coercion
+  on `AbstractVectors` and columns of tabular data (any table
+  implementing the [Tables.jl](https://github.com/JuliaData/Tables.jl)
+  interface).

 - A `schema` method for tabular data, based on the optional Tables dependency, for inspecting the machine and scientific types of tabular data, in addition to column names and number of rows.

-### Dependencies
-
-The only dependencies are [`Requires.jl`](https://github.com/MikeInnes/Requires.jl) and `InteractiveUtils` (from stdlib).

-## Quick start
+## Getting started

 The package is registered and can be installed via the package manager with `add ScientificTypes`.

@@ -182,6 +195,16 @@ Similarly, the scitype of an `AbstractArray` is `AbstractArray{U}` where `U` is
 scitype([1.3, 4.5, missing])
 ```

+*Performance note:* Computing type unions over large arrays is
+expensive and, depending on the convention's implementation and the
+array eltype, computing the scitype can be slow. (In the *mlj*
+convention this is mitigated with the help of the
+`ScientificTypes.Scitype` method, of which other conventions could
+make use. Do `?ScientificTypes.Scitype` for details.) An eltype `Any`
+will always be slow and you may want to consider replacing an array
+`A` with `broadcast(identity, A)` to collapse the eltype and speed up
+the computation.
+
 Provided the [Tables.jl](https://github.com/JuliaData/Tables.jl) package is loaded, any table implementing the Tables interface has a scitype encoding the scitypes of its columns:

 ```@example 5
@@ -288,7 +311,7 @@ X = (a = rand("abc", n), # 3 values, not number --> Multiclass
 autotype(X, only_changes=true)
 ```

-For example, we could first apply the `:discrete_to_continuous` rule,
+For example, we could first apply the `:discrete_to_continuous` rule,
 followed by `:few_to_finite` rule. The first rule will apply to `b` and `e`
 but the subsequent application of the second rule will mean we will
 get the same result apart from `e` (which will be `Continuous`)
@@ -298,4 +321,4 @@ autotype(X, only_changes=true, rules=(:discrete_to_continuous, :few_to_finite))
 ```

 One should check and possibly modify the returned dictionary
-before passing to `coerce`.
+before passing to `coerce`.
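
The performance note added in this file recommends collapsing a loose eltype by broadcasting `identity` over the array. A minimal sketch of what that does, using Base Julia only (the arrays are made-up illustration data, and no ScientificTypes call is involved):

```julia
# An array whose declared eltype is Any, although every element is a Float64
A = Any[1.0, 2.5, 4.0]

# Broadcasting identity rebuilds the array with the tightest eltype it finds
B = broadcast(identity, A)

@assert eltype(A) == Any
@assert eltype(B) == Float64
@assert B == A                      # same values, only the container type changed

# With missing values present, the eltype narrows to a small Union instead
C = broadcast(identity, Any[1, missing, 3])
@assert eltype(C) == Union{Missing, Int}
```

The narrowed array carries enough type information for eltype-based dispatch, which is what the note relies on when it says an eltype of `Any` will always be slow.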

src/ScientificTypes.jl

Lines changed: 102 additions & 35 deletions
@@ -2,11 +2,13 @@ module ScientificTypes

 export Scientific, Found, Unknown, Finite, Infinite
 export OrderedFactor, Multiclass, Count, Continuous
-export Binary, Table, ColorImage, GrayImage
+export Binary, Table
+export ColorImage, GrayImage
 export scitype, scitype_union, scitypes, coerce, schema
 export mlj
+export autotype

-using Requires, InteractiveUtils
+using Tables, CategoricalArrays, ColorTypes

 # ## FOR DEFINING SCITYPES ON OBJECTS DETECTED USING TRAITS

@@ -64,7 +66,7 @@ const Scientific = Union{Missing,Found}
 """
     MLJBase.Table{K}

-The scientific type for tabular data (a containter `X` for which
+The scientific type for tabular data (a container `X` for which
 `Tables.is_table(X)=true`).

 If `X` has columns `c1, c2, ..., cn`, then, by definition,
@@ -107,45 +109,127 @@ end
 # ## THE SCITYPE FUNCTION

 """
-    scitype(x)
+    scitype(X)

 The scientific type that `X` may represent.
-
 """
 scitype(X) = scitype(X, Val(convention()))
 scitype(X, C) = scitype(X, C, Val(trait(X)))
 scitype(X, C, ::Val{:other}) = Unknown

 scitype(::Missing) = Missing

-
 # ## CONVENIENCE METHOD FOR UNIONS OVER ELEMENTS

 """
-    scitype_union(A)
+    scitype_union(A)

 Return the type union, over all elements `x` generated by the iterable
 `A`, of `scitype(x)`.

 See also `scitype`.
-
 """
 scitype_union(A) = reduce((a,b)->Union{a,b}, (scitype(el) for el in A))


-# ## SCITYPES OF TUPLES AND ARRAYS
+# ## SCITYPES OF TUPLES

 scitype(t::Tuple, ::Val) = Tuple{scitype.(t)...}

-# The following fallback can be quite slow. Individual conventions
-# will usually be able to find more perfomant overloadings of this
-# method:
-scitype(A::B, ::Val) where {T,N,B<:AbstractArray{T,N}} =
+
+# ## SCITYPES OF ARRAYS
+
+"""
+    ScientificTypes.Scitype(::Type, C::Val)
+
+Method for implementers of a convention to enable speed-up of scitype
+evaluations for large arrays.
+
+In general, one cannot infer the scitype of an object of type
+`AbstractArray{T, N}` from the machine type alone. For example, this
+never holds in the *mlj* convention for a categorical array, or in the
+following examples: `X=Any[1, 2, 3]` and `X=Union{Missing,Int64}[1, 2,
+3]`.
+
+Nevertheless, for some *restricted* machine types `U`, the statement
+`typeof(X) <: AbstractArray{T, N}` for some `T<:U` already allows one
+to deduce that `scitype(X) = AbstractArray{S,N}`, where `S` is determined
+by `U` alone. This is the case in the *mlj* convention, for example,
+if `U = Integer`, in which case `S = Count`. If one explicitly declares
+
+    ScientificTypes.Scitype(::Type{<:U}, ::Val{:convention}) = S
+
+in such cases, then ScientificTypes ensures a considerable speed-up in
+the computation of `scitype(X)`. There is also a partial speed-up for
+the case that `T <: Union{U, Missing}`.
+
+For example, in *mlj* one has `Scitype(::Type{<:Integer}) = Count`.
+
+"""
+Scitype(::Type, C::Val) = nothing
+Scitype(::Type{Any}, C::Val) = nothing # b/c `Any` isa `Union{<:Any, Missing}`
+
+# For all such `T` we can also get almost the same speed-up in the case that
+# `T` is replaced by `Union{T, Missing}`, which we detect by wrapping
+# the answer:
+
+Scitype(MT::Type{Union{T, Missing}}, C::Val) where T = Val(Scitype(T, C))
+
+# For example, in *mlj* convention, Scitype(::Integer) = Count
+
+const Arr{T,N} = AbstractArray{T,N}
+
+# the dispatcher:
+scitype(A::Arr{T}, C) where T = scitype(A, C, Scitype(T, C))
+
+# the slow fallback:
+scitype(A::Arr{<:Any,N}, ::Val, ::Nothing) where N =
     AbstractArray{scitype_union(A),N}

+# the speed-up:
+scitype(::Arr{<:Any,N}, ::Val, S) where N = Arr{S,N}
+
+# partial speed-up for missing types, because broadcast is faster than
+# computing scitype_union:
+function scitype(A::Arr{<:Any,N}, C::Val, ::Val{S}) where {N,S}
+    if S == nothing
+        return scitype(A, C, S)
+    else
+        Atight = broadcast(identity, A)
+        if typeof(A) == typeof(Atight)
+            return Arr{Union{S,Missing},N}
+        else
+            return Arr{S,N}
+        end
+    end
+end
+

 # ## STUB FOR COERCE METHOD

+"""
+    coerce(A::AbstractArray, T; verbosity=1)
+
+Coerce the Julia types of elements of `A` to ensure the returned array
+has `T` or `Union{Missing,T}` as the union of its element scitypes,
+according to the active convention.
+
+A warning is issued if missing values are encountered, unless
+`verbosity` is `0` or less.
+
+    julia> mlj()
+    julia> v = coerce([1, missing, 5], Continuous)
+    3-element Array{Union{Missing, Float64},1}:
+     1.0
+      missing
+     5.0
+
+    julia> scitype(v)
+    AbstractArray{Union{Missing,Continuous}, 1}
+
+See also [`scitype`](@ref), [`scitype_union`](@ref).
+
+"""
 function coerce end


@@ -197,33 +281,16 @@ schema(X, ::Val{:other}) =
 "an object with trait `:other`\n"*
 "Perhaps you meant to import Tables first?"))

+include("tables.jl")
+include("autotype.jl")

 ## ACTIVATE DEFAULT CONVENTION

-# and include code not requring optional dependencies:
+# and include code not requiring optional dependencies:

 mlj()
 include("conventions/mlj/mlj.jl")
-
-
-## FOR LOADING OPTIONAL DEPENDENCIES
-
-function __init__()
-
-    # for printing out the type tree:
-    @require(AbstractTrees = "1520ce14-60c1-5f80-bbc7-55ef81b5835c",
-             include("tree.jl"))
-
-    # the scitype and schema of tabular data:
-    @require(Tables="bd369af6-aec1-5ad0-b16a-f7cc5008161c",
-             (include("tables.jl"); include("autotype.jl")))
-
-    # :mlj conventions requiring external packages
-    @require(CategoricalArrays="324d7699-5711-5eae-9e2f-1d82baa6b597",
-             include("conventions/mlj/finite.jl"))
-    @require(ColorTypes="3da002f7-5984-5a60-b8a6-cbb66c0b333f",
-             include("conventions/mlj/images.jl"))
-
-end
+include("conventions/mlj/finite.jl")
+include("conventions/mlj/images.jl")

 end # module
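
The array methods added in this file combine a type-level lookup (`Scitype`) with a dispatcher that falls back to an element-by-element union when the lookup returns `nothing`. The sketch below reproduces that idea in a standalone form so it can be run without the package. All names (`ToyCount`, `ToyContinuous`, `ToyUnknown`, `toy_scitype`, `ToyScitype`) are hypothetical, the convention argument is dropped for brevity, and the `Union{T, Missing}` branch of the real code is omitted.

```julia
# Toy stand-ins for scientific types (hypothetical, not the package's types)
abstract type ToyFound end
struct ToyCount <: ToyFound end
struct ToyContinuous <: ToyFound end
struct ToyUnknown <: ToyFound end

# Element-wise rules of a toy convention
toy_scitype(::Integer) = ToyCount
toy_scitype(::AbstractFloat) = ToyContinuous
toy_scitype(::Any) = ToyUnknown

# Type-level shortcut: `nothing` means the eltype alone cannot decide
ToyScitype(::Type) = nothing
ToyScitype(::Type{<:Integer}) = ToyCount
ToyScitype(::Type{<:AbstractFloat}) = ToyContinuous

# Dispatcher: look up the shortcut for the eltype, then branch on the result
toy_scitype(A::AbstractArray{T}) where T = toy_scitype(A, ToyScitype(T))

# Slow fallback: visit every element and take the union of the results
toy_scitype(A::AbstractArray{<:Any,N}, ::Nothing) where N =
    AbstractArray{mapreduce(toy_scitype, (a, b) -> Union{a, b}, A), N}

# Fast path: the eltype fixes the answer, so no element is inspected
toy_scitype(::AbstractArray{<:Any,N}, S) where N = AbstractArray{S,N}

# Vector{Int} takes the fast path; Vector{Any} takes the slow fallback
@assert toy_scitype([1, 2, 3]) == AbstractArray{ToyCount,1}
@assert toy_scitype(Any[1, 2.0]) == AbstractArray{Union{ToyCount,ToyContinuous},1}
```

The fast path costs the same regardless of array length, while the fallback is linear in the number of elements, which is the trade-off the `Scitype` docstring describes.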

src/autotype.jl

Lines changed: 0 additions & 2 deletions
@@ -1,5 +1,3 @@
-export autotype
-
 """
     autotype(X)
