Commit 27a9bac

Merge pull request #76 from alan-turing-institute/dev
For patch release 0.3.2
2 parents d5fd76f + cdc585b commit 27a9bac

13 files changed, with 122 additions and 77 deletions.

Project.toml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
11
name = "ScientificTypes"
22
uuid = "321657f4-b219-11e9-178b-2701a2544e81"
33
authors = ["Anthony D. Blaom <[email protected]>"]
4-
version = "0.3.1"
4+
version = "0.3.2"
55

66
[deps]
77
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"

README.md

Lines changed: 42 additions & 29 deletions
@@ -35,18 +35,20 @@ ScientificTypes.jl has three components:
3535
type hierarchy, and a single function `scitype` with scientific
3636
types as values. Someone implementing a convention must add methods
3737
to this function, while the general user just applies it to data, as
38-
in `scitype(4.5)` (returning `Continuous` in the *mlj* convention).
38+
in `scitype(4.5)` (returning `Continuous` in the *MLJ* convention).
3939

40-
- A built-in convention, called *mlj*, active by default.
40+
- A built-in convention, called *MLJ*, active by default.
4141

42-
- Convenience methods for working with scientific types, the most commonly used being:
42+
- Convenience methods for working with scientific types, the most commonly used being
4343

44-
- `schema(X)`, which gives an extended schema of any table `X`,
45-
including the column scientific types implied by the active
46-
convention.
47-
.
48-
- `coerce(X, ...)`, which coerces the machine types of `X`
49-
to reflect a desired scientific type.
44+
- `schema(X)`, which gives an extended schema of any Tables.jl
45+
compatible table `X`, including the column scientific types
46+
implied by the active convention.
47+
48+
- `coerce(X, ...)`, which coerces the machine types of `X` to
49+
reflect a desired scientific type.
50+
51+
For example,
5052

5153
```julia
5254
using ScientificTypes, DataFrames
@@ -58,40 +60,51 @@ X = DataFrame(
5860
e = ['M', 'F', missing, 'M', 'F'],
5961
)
6062
sch = schema(X) # schema is overloaded in ScientificTypes
61-
for (name, scitype) in zip(sch.names, sch.scitypes)
62-
println(":$name -- $scitype")
63-
end
6463
```
6564

6665
will print
6766

6867
```
69-
:a -- Continuous
70-
:b -- Union{Missing, Continuous}
71-
:c -- Count
72-
:d -- Count
73-
:e -- Union{Missing, Unknown}
68+
_.table =
69+
┌─────────┬─────────────────────────┬────────────────────────────┐
70+
│ _.names │ _.types │ _.scitypes │
71+
├─────────┼─────────────────────────┼────────────────────────────┤
72+
│ a │ Float64 │ Continuous │
73+
│ b │ Union{Missing, Float64} │ Union{Missing, Continuous} │
74+
│ c │ Int64 │ Count │
75+
│ d │ Int64 │ Count │
76+
│ e │ Union{Missing, Char} │ Union{Missing, Unknown} │
77+
└─────────┴─────────────────────────┴────────────────────────────┘
78+
_.nrows = 5
7479
```
7580

76-
this uses the default *mlj* convention to attribute a scitype
77-
(cf. [docs](https://alan-turing-institute.github.io/ScientificTypes.jl/dev/#The-MLJ-convention-1)).
81+
Here the default *MLJ* convention is being applied (cf. [docs](https://alan-turing-institute.github.io/ScientificTypes.jl/dev/#The-MLJ-convention-1)). Detail is obtained in the obvious way; for example:
82+
83+
```julia
84+
julia> sch.names
85+
(:a, :b, :c, :d, :e)
86+
```
7887

7988
Now you might want to specify that `b` is actually a `Count`, and that `d` and `e` are `Multiclass`; this is done with the `coerce` function:
8089

8190
```julia
8291
Xc = coerce(X, :b=>Count, :d=>Multiclass, :e=>Multiclass)
83-
sch = schema(Xc)
84-
for (name, scitype) in zip(sch.names, sch.scitypes)
85-
println(":$name -- $scitype")
86-
end
92+
schema(Xc)
8793
```
8894

89-
will print
95+
which prints
9096

9197
```
92-
:a -- Continuous
93-
:b -- Union{Missing, Count}
94-
:c -- Count
95-
:d -- Multiclass{2}
96-
:e -- Union{Missing, Multiclass{2}}
98+
_.table =
99+
┌─────────┬──────────────────────────────────────────────┬───────────────────────────────┐
100+
│ _.names │ _.types │ _.scitypes │
101+
├─────────┼──────────────────────────────────────────────┼───────────────────────────────┤
102+
│ a │ Float64 │ Continuous │
103+
│ b │ Union{Missing, Int64} │ Union{Missing, Count} │
104+
│ c │ Int64 │ Count │
105+
│ d │ CategoricalValue{Int64,UInt8} │ Multiclass{2} │
106+
│ e │ Union{Missing, CategoricalValue{Char,UInt8}} │ Union{Missing, Multiclass{2}} │
107+
└─────────┴──────────────────────────────────────────────┴───────────────────────────────┘
108+
_.nrows = 5
109+
97110
```

docs/src/index.md

Lines changed: 17 additions & 5 deletions
@@ -35,7 +35,7 @@ Found
3535

3636
- A single method `scitype` for articulating a convention about what scientific type each Julia object can represent. For example, one might declare `scitype(::AbstractFloat) = Continuous`.
3737

38-
- A default convention called *mlj*, based on dependencies
38+
- A default convention called *MLJ*, based on dependencies
3939
`CategoricalArrays`, `ColorTypes`, and `Tables`, which includes a
4040
convenience method `coerce` for performing scientific type coercion
4141
on `AbstractVectors` and columns of tabular data (any table
@@ -122,12 +122,24 @@ Finally there is a `coerce!` method that does in-place coercion provided the dat
122122
- Developers can define their own conventions using the code in `src/conventions/mlj/` as a template. The active convention is controlled by the value of `ScientificTypes.CONVENTION[1]`.
123123

124124

125+
## Special note on binary data
126+
127+
ScientificTypes does not define a separate "binary" scientific
128+
type. Rather, when binary data has an intrinsic "true" class (for example
129+
pass/fail in a product test), then it should be assigned an
130+
`OrderedFactor{2}` scitype, while data with no such class (e.g., gender)
131+
should be assigned a `Multiclass{2}` scitype. In the former case
132+
we recommend that the "true" class come after "false" in the ordering
133+
(corresponding to the usual assignment "false=0" and "true=1"). Of
134+
course, `Finite{2}` covers both cases of binary data.
135+
136+
125137
## Detailed usage examples
126138

127139
```@example 3
128140
using ScientificTypes
129141
# activate a convention
130-
mlj() # redundant as it's the default
142+
ScientificTypes.set_convention(MLJ) # redundant as it's the default
131143
132144
scitype((2.718, 42))
133145
```
@@ -203,12 +215,12 @@ scitype([1.3, 4.5, missing])
203215

204216
*Performance note:* Computing type unions over large arrays is
205217
expensive and, depending on the convention's implementation and the
206-
array eltype, computing the scitype can be slow. (In the *mlj*
218+
array eltype, computing the scitype can be slow. (In the *MLJ*
207219
convention this is mitigated with the help of the
208220
`ScientificTypes.Scitype` method, of which other conventions could
209221
make use. Do `?ScientificTypes.Scitype` for details.) An eltype `Any`
210222
will always be slow and you may want to consider replacing an array
211-
`A` with `broadcast(idenity, A)` to collapse the eltype and speed up
223+
`A` with `broadcast(identity, A)` to collapse the eltype and speed up
212224
the computation.
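
The eltype collapse recommended in this performance note can be checked with Base Julia alone. The following is a small sketch, with the variable names `A`, `B`, `C`, and `D` chosen here for illustration; it shows only how `broadcast(identity, A)` narrows the element type, not how the package then uses it.

```julia
# An array whose declared eltype is `Any`, although every element is an Int.
A = Any[1, 2, 3]

# Broadcasting `identity` rebuilds the array with the tightest eltype
# that holds all of its elements.
B = broadcast(identity, A)

# The same trick keeps `Missing` in the union when missing values are present.
C = Any[1, missing, 3]
D = broadcast(identity, C)

println(eltype(A), " => ", eltype(B))
println(eltype(C), " => ", eltype(D))
```

The call allocates a new array, so it pays off mainly when the scitype of the same large array is computed more than once.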
213225

214226
Provided the [Tables.jl](https://github.com/JuliaData/Tables.jl) package is loaded, any table implementing the Tables interface has a scitype encoding the scitypes of its columns:
@@ -246,7 +258,7 @@ Note that `Table(Continuous,Finite)` is a *type* union and not a `Table` *instan
246258

247259
## The MLJ convention
248260

249-
The table below summarizes the *mlj* convention for representing
261+
The table below summarizes the *MLJ* convention for representing
250262
scientific types:
251263

252264
Type `T` | `scitype(x)` for `x::T` | package required

src/ScientificTypes.jl

Lines changed: 6 additions & 5 deletions
@@ -6,7 +6,6 @@ export Binary, Table
66
export ColorImage, GrayImage
77
export scitype, scitype_union, elscitype, coerce, coerce!, schema
88
export info
9-
export mlj
109
export autotype
1110

1211
# re-export from CategoricalArrays:
@@ -59,11 +58,14 @@ info(object) = info(object, Val(ScientificTypes.trait(object)))
5958

6059
# ## CONVENTIONS
6160

62-
const CONVENTION=[:unspecified]
61+
abstract type Convention end
62+
struct MLJ <: Convention end
63+
64+
const CONVENTION=[MLJ(),]
6365
convention() = CONVENTION[1]
6466

65-
function mlj()
66-
CONVENTION[1] = :mlj
67+
function set_convention(C)
68+
CONVENTION[1] = C()
6769
return nothing
6870
end
6971

@@ -163,7 +165,6 @@ include("autotype.jl")
163165

164166
# and include code not requiring optional dependencies:
165167

166-
mlj()
167168
include("conventions/mlj/mlj.jl")
168169
include("conventions/mlj/finite.jl")
169170
include("conventions/mlj/images.jl")
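
The switch from `Val{:mlj}` symbols to `Convention` subtypes in this file can be illustrated with a self-contained toy. This is a sketch of the dispatch pattern only, not the package source: `Demo`, `toy_scitype`, and the stand-in scientific types are hypothetical names introduced here. One deliberate difference from the diff: a literal such as `[MLJ(),]` is inferred as `Vector{MLJ}`, so the sketch annotates the element type as `Convention` to allow a second convention to be stored.

```julia
# Toy re-creation of the convention dispatch pattern; not the package source.
abstract type Convention end
struct MLJ <: Convention end
struct Demo <: Convention end        # hypothetical second convention

# Stand-ins for scientific types (hypothetical, defined here for self-containment).
abstract type Continuous end
abstract type Count end
abstract type Unknown end

# Element type annotated as `Convention` so any convention instance can be stored.
const CONVENTION = Convention[MLJ()]
convention() = CONVENTION[1]

function set_convention(C)
    CONVENTION[1] = C()              # store an instance of the convention type
    return nothing
end

# The entry point forwards the active convention instance;
# methods then dispatch on its type instead of on a `Val` symbol.
toy_scitype(x) = toy_scitype(x, convention())
toy_scitype(x, ::Convention) = Unknown           # fallback for any convention
toy_scitype(::AbstractFloat, ::MLJ) = Continuous
toy_scitype(::Integer, ::MLJ) = Count
toy_scitype(::Integer, ::Demo) = Continuous      # Demo treats integers as continuous

toy_scitype(3)             # Count under the default MLJ convention
set_convention(Demo)
toy_scitype(3)             # Continuous once Demo is active
set_convention(MLJ)        # restore the default
```

Dispatching on a concrete convention type lets a new convention be added by defining a struct and its methods, without touching the entry point.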

src/autotype.jl

Lines changed: 4 additions & 3 deletions
@@ -17,7 +17,8 @@ which applying autotype differs from just using the ambient convention. When
1717
coercing with autotype, `only_changes` should be true.
1818
* `rules=(:few_to_finite,)`: the set of rules to apply.
1919
"""
20-
function autotype(X; only_changes::Bool=true,
20+
autotype(X; kwargs...) = _autotype(X, Val(trait(X)); kwargs...)
21+
function _autotype(X, ::Val{:table}; only_changes::Bool=true,
2122
rules::NTuple{N,Symbol} where N=(:few_to_finite,))
2223
# check that X is a table
2324
@assert Tables.istable(X) "The function `autotype` requires tabular data."
@@ -55,11 +56,11 @@ function autotype(X; only_changes::Bool=true,
5556
return suggested_types
5657
end
5758

58-
function autotype(X::AbstractArray{T,M};
59+
function _autotype(X::AbstractArray{T,M}, ::Val{:other};
5960
rules::NTuple{N,Symbol} where N=(:few_to_finite,)) where {T,M}
6061
# check that the rules are recognised
6162
_check_rules(rules)
62-
sugg_type = scitype_union(X)
63+
sugg_type = elscitype(X)
6364
np = prod(size(X))
6465
for rule in rules
6566
if rule == :few_to_finite
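
The change above routes `autotype` through a trait value wrapped in `Val`, so tables and plain arrays reach different methods while keyword arguments are forwarded unchanged. A minimal sketch of that pattern follows; `toy_trait`, `describe`, and `_describe` are hypothetical names, and the trait test (a `NamedTuple` counts as a table) is an assumption made only to keep the example free of dependencies.

```julia
# Hypothetical trait function: NamedTuples count as tables, everything else as :other.
toy_trait(X) = X isa NamedTuple ? :table : :other

# The public entry point lifts the trait symbol into the type domain with `Val`
# and forwards all keyword arguments to the selected method.
describe(X; kwargs...) = _describe(X, Val(toy_trait(X)); kwargs...)

# Method selected for tabular input.
_describe(X, ::Val{:table}; prefix="") =
    string(prefix, "table with columns ", join(keys(X), ", "))

# Method selected for arrays that are not tables.
_describe(X::AbstractArray, ::Val{:other}; prefix="") =
    string(prefix, "array with ", length(X), " elements")

println(describe((a = [1, 2], b = [3, 4])))
println(describe([1, 2, 3]; prefix = "> "))
```

With this layout, input that is neither a table nor an array fails with a `MethodError` at the internal method rather than being silently treated as an array.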

src/conventions/mlj/finite.jl

Lines changed: 3 additions & 3 deletions
@@ -1,9 +1,9 @@
11
nlevels(c::CategoricalValue) = length(levels(c.pool))
22
nlevels(c::CategoricalString) = length(levels(c.pool))
33

4-
scitype(c::CategoricalValue, ::Val{:mlj}) =
4+
scitype(c::CategoricalValue, ::MLJ) =
55
c.pool.ordered ? OrderedFactor{nlevels(c)} : Multiclass{nlevels(c)}
6-
scitype(c::CategoricalString, ::Val{:mlj}) =
6+
scitype(c::CategoricalString, ::MLJ) =
77
c.pool.ordered ? OrderedFactor{nlevels(c)} : Multiclass{nlevels(c)}
88

99
# for temporary hack below:
@@ -64,7 +64,7 @@ end
6464

6565
const CatArr{T,N,V} = CategoricalArray{T,N,<:Any,V}
6666

67-
function scitype(A::CatArr{T,N,V}, ::Val{:mlj}) where {T,N,V}
67+
function scitype(A::CatArr{T,N,V}, ::MLJ) where {T,N,V}
6868
nlevels = length(levels(A))
6969
if isordered(A)
7070
S = OrderedFactor{nlevels}

src/conventions/mlj/images.jl

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
1-
scitype(image::AbstractArray{<:Gray,2}, ::Val{:mlj}) =
1+
scitype(image::AbstractArray{<:Gray,2}, ::MLJ) =
22
GrayImage{size(image)...}
3-
scitype(image::AbstractArray{<:AbstractRGB,2}, ::Val{:mlj}) =
3+
scitype(image::AbstractArray{<:AbstractRGB,2}, ::MLJ) =
44
ColorImage{size(image)...}

src/conventions/mlj/mlj.jl

Lines changed: 5 additions & 4 deletions
@@ -1,5 +1,5 @@
1-
scitype(::AbstractFloat, ::Val{:mlj}) = Continuous
2-
scitype(::Integer, ::Val{:mlj}) = Count
1+
scitype(::AbstractFloat, ::MLJ) = Continuous
2+
scitype(::Integer, ::MLJ) = Count
33

44
function _coerce_missing_warn(::Type{T}) where T
55
T >: Missing || @warn "Missing values encountered coercing scitype to $T.\n"*
@@ -8,8 +8,9 @@ end
88

99
# ## IMPLEMENT PERFORMANCE BOOSTING FOR ARRAYS
1010

11-
Scitype(::Type{<:Integer}, ::Val{:mlj}) = Count
12-
Scitype(::Type{<:AbstractFloat}, ::Val{:mlj}) = Continuous
11+
Scitype(::Type{<:Integer}, ::MLJ) = Count
12+
Scitype(::Type{<:AbstractFloat}, ::MLJ) = Continuous
13+
Scitype(::Type{<:AbstractString}, ::MLJ) = Unknown
1314

1415

1516
## COERCE ARRAY TO CONTINUOUS

src/schema.jl

Lines changed: 2 additions & 2 deletions
@@ -77,7 +77,7 @@ schema(X, ::Val{:other}) =
7777

7878
TRAIT_FUNCTION_GIVEN_NAME[:table] = Tables.istable
7979

80-
function scitype(X, ::Val, ::Val{:table})
80+
function scitype(X, ::Convention, ::Val{:table})
8181
Xcol = Tables.columns(X)
8282
col_names = propertynames(Xcol)
8383
types = map(col_names) do name
@@ -101,7 +101,7 @@ function schema(X, ::Val{:table})
101101
Xcol = Tables.columntable(X)
102102
names = s.names
103103
types = Tuple{s.types...}
104-
scitypes = Tuple{(scitype_union(getproperty(Xcol, name))
104+
scitypes = Tuple{(elscitype(getproperty(Xcol, name))
105105
for name in names)...}
106106
return Schema(names, types, scitypes, _nrows(X))
107107
end

src/scitype.jl

Lines changed: 22 additions & 19 deletions
@@ -3,7 +3,7 @@ scitype(X)
33
44
The scientific type that `X` may represent.
55
"""
6-
scitype(X) = scitype(X, Val(convention()))
6+
scitype(X) = scitype(X, convention())
77
scitype(X, C) = scitype(X, C, Val(trait(X)))
88
scitype(X, C, ::Val{:other}) = Unknown
99

@@ -24,66 +24,69 @@ scitype_union(A) = reduce((a,b)->Union{a,b}, (scitype(el) for el in A))
2424

2525
# ## SCITYPES OF TUPLES
2626

27-
scitype(t::Tuple, ::Val) = Tuple{scitype.(t)...}
27+
scitype(t::Tuple, ::Convention) = Tuple{scitype.(t)...}
2828

2929

3030
# ## SCITYPES OF ARRAYS
3131

3232
"""
33-
ScientificTypes.Scitype(::Type, C::Val)
33+
ScientificTypes.Scitype(::Type, ::C)
3434
35-
Method for implementers of a conventions to enable speed-up of scitype
36-
evaluations for large arrays.
35+
Method for implementers of a convention `C` to enable speed-up of
36+
scitype evaluations for large arrays.
3737
3838
In general, one cannot infer the scitype of an object of type
3939
`AbstractArray{T, N}` from the machine type alone. For example, this
40-
never holds in the *mlj* convention for a categorical array, or in the
40+
never holds in the *MLJ* convention for a categorical array, or in the
4141
following examples: `X=Any[1, 2, 3]` and `X=Union{Missing,Int64}[1, 2,
4242
3]`.
4343
4444
Nevertheless, for some *restricted* machine types `U`, the statement
4545
`type(X) == AbstractArray{T, N}` for some `T<:U` already allows one
4646
to deduce that `scitype(X) = AbstractArray{S,N}`, where `S` is determined
47-
by `U` alone. This is the case in the *mlj* convention, for example,
47+
by `U` alone. This is the case in the *MLJ* convention, for example,
4848
if `U = Integer`, in which case `S = Count`. If one explicitly declares
4949
50-
ScientificTypes.Scitype(::Type{<:U}, ::Val{:convention}) = S
50+
ScientificTypes.Scitype(::Type{<:U}, ::C) = S
5151
5252
in such cases, then ScientificTypes ensures a considerable speed-up in
5353
the computation of `scitype(X)`. There is also a partial speed-up for
5454
the case that `T <: Union{U, Missing}`.
5555
56-
For example, in *mlj* one has `Scitype(::Type{<:Integer}) = Count`.
56+
For example, in the *MLJ* convention, one has
57+
`Scitype(::Type{<:Integer}, ::MLJ) = Count`.
5758
5859
"""
59-
Scitype(::Type, C::Val) = nothing
60-
Scitype(::Type{Any}, C::Val) = nothing # b/s `Any` isa `Union{<:Any, Missing}`
60+
Scitype(::Type, c::Convention) = nothing
61+
Scitype(::Type{Any}, c::Convention) =
62+
nothing # b/s `Any` isa `Union{<:Any, Missing}`
6163

6264
# For all such `T` we can also get almost the same speed-up in the case that
6365
# `T` is replaced by `Union{T, Missing}`, which we detect by wrapping
64-
# the answer:
66+
# the answer as a Val:
6567

66-
Scitype(MT::Type{Union{T, Missing}}, C::Val) where T = Val(Scitype(T, C))
68+
Scitype(MT::Type{Union{T, Missing}}, c::Convention) where T =
69+
Val(Scitype(T, c))
6770

68-
# For example, in *mlj* convention, Scitype(::Integer) = Count
71+
# For example, Scitype(::Type{<:Integer}, ::MLJ) = Count
6972

7073
const Arr{T,N} = AbstractArray{T,N}
7174

7275
# the dispatcher:
73-
scitype(A::Arr{T}, C) where T = scitype(A, C, Scitype(T, C))
76+
scitype(A::Arr{T}, c, ::Val{:other}) where T = arr_scitype(A, c, Scitype(T, c))
7477

7578
# the slow fallback:
76-
scitype(A::Arr{<:Any,N}, ::Val, ::Nothing) where N =
79+
arr_scitype(A::Arr{<:Any,N}, ::Convention, ::Nothing) where N =
7780
AbstractArray{scitype_union(A),N}
7881

7982
# the speed-up:
80-
scitype(::Arr{<:Any,N}, ::Val, S) where N = Arr{S,N}
83+
arr_scitype(::Arr{<:Any,N}, ::Convention, S) where N = Arr{S,N}
8184

8285
# partial speed-up for missing types, because broadcast is faster than
8386
# computing scitype_union:
84-
function scitype(A::Arr{<:Any,N}, C::Val, ::Val{S}) where {N,S}
87+
function arr_scitype(A::Arr{<:Any,N}, c::Convention, ::Val{S}) where {N,S}
8588
if S == nothing
86-
return scitype(A, C, S)
89+
return arr_scitype(A, c, S)
8790
else
8891
Atight = broadcast(identity, A)
8992
if typeof(A) == typeof(Atight)
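
The dispatcher, slow fallback, and speed-up methods in this file can be sketched in isolation. The toy below is an illustration under stated assumptions, not the package code: `fast_scitype` plays the role of `Scitype`, `toy_arr_scitype` the role of `arr_scitype`, `el_scitype` is a hypothetical per-element rule, and the `Val`-wrapped branch for `Union{T, Missing}` eltypes is left out for brevity.

```julia
abstract type Convention end
struct MLJ <: Convention end

# Stand-ins for scientific types (hypothetical, for self-containment).
abstract type Count end
abstract type Continuous end
abstract type Unknown end

# Per-element rule, used only by the slow path.
el_scitype(::Any) = Unknown
el_scitype(::Integer) = Count
el_scitype(::AbstractFloat) = Continuous

# Eltype-level shortcut: `nothing` means "cannot tell from the eltype alone".
fast_scitype(::Type, ::Convention) = nothing
fast_scitype(::Type{<:Integer}, ::MLJ) = Count
fast_scitype(::Type{<:AbstractFloat}, ::MLJ) = Continuous

# The dispatcher: consult the shortcut, then dispatch on what it returned.
toy_scitype(A::AbstractArray{T}, c::Convention) where T =
    toy_arr_scitype(A, c, fast_scitype(T, c))

# Slow fallback: visit every element and take the union of the element scitypes.
toy_arr_scitype(A::AbstractArray{<:Any,N}, ::Convention, ::Nothing) where N =
    AbstractArray{mapreduce(el_scitype, (a, b) -> Union{a, b}, A), N}

# Speed-up: the eltype already determines the answer, so no element is visited.
toy_arr_scitype(::AbstractArray{<:Any,N}, ::Convention, S) where N =
    AbstractArray{S,N}

println(toy_scitype([1, 2, 3], MLJ()))        # fast path, eltype Int
println(toy_scitype(Any[1, 2.5], MLJ()))      # slow path, eltype Any
```

Giving the array methods their own function name, as the diff does with `arr_scitype`, keeps the third positional argument from colliding with the `Val(trait(X))` argument used by the general `scitype` methods.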
