In NLP models, we generally embed words from one-hot vector space, where each word represented by single-entry vector ([0,...,1,...,0]
), to low dimensional space, which is continuous embedding space. This is basically done by multiplying one-hot vectors with a projection matrix. Albeit, we don’t materialize this vectors and the multiplication in computer, it is done by indexing the projection matrix. Therefore, we don’t need to store one-hot vector [0,...,1,...,0]
at no time, instead we store index of the single-entry-1
(e.g 874). You may check onehot.jl from Flux.jl to see how this is implemented.
However, the problem is in generating those vectors, and Flux.jl uses search(O(n)
) to get indices from given labels(words). This is little problematic if you want to train your model in a large dataset. I think we need hashing instead of searching.
function onehot(l, labels) i = something(findfirst(isequal(l), labels), 0) # <= !!!search!!! i > 0 || error("Value $l is not in labels") OneHotVector(i, length(labels)) end
So, we were using a Dictionary
structure which maps words to their index in the vocabulary. We, then, store only word indices(=OneHot
structure) in the final data. Dictionary structure is also helpful in easily creating your vocabulary by single pass over data.
julia> data = "this is an example data with example words" "this is an example data with example words" julia> vocab = Dict{String,Int}() Dict{String,Int64} with 0 entries julia> for word in split(data) get!(vocab,word,length(vocab)+1) end julia> vocab Dict{String,Int64} with 7 entries: "this" => 1 "is" => 2 "example" => 4 "data" => 5 "words" => 7 "with" => 6 "an" => 3 julia> vocab["example"] 4 julia> encoded = map(x->vocab[x],split(data)) 8-element Array{Int64,1}: 1 2 3 4 5 6 4 7
However, you also may want to decode those indices while you are printing your data. The dictionary structure that we use maps word=>index
not index=>word
. Simply, an array of the words, in the order of their vocabulary index, will serve for latter mapping that we need. What we were doing was storing an extra word array separately, and use them when needed.
julia> function invert(vocab) int2tok = Array{String}(undef,length(vocab)) for (k,v) in vocab; int2tok[v] = k; end return int2tok end julia> words = invert(vocab) 7-element Array{String,1}: "this" "is" "an" "example" "data" "with" "words" julia> words[4] "example" julia> vocab["example"] 4
This enable us to go in both direction in O(1)
time by using 2x
memory. However, this two separate data structure (words
and vocab
) was making my codes complicated. So, this is where IndexedDict
comes to play. IndexedDict is an indexed dictionary which works in both ways. You can think it as a multiple inheritance of Dictionary
and Array
. You may use push!
and append!
methods to add new words to your vocabulary. It automatically checks new words and adds them to dictionary. You can index it with integers or you can also get the index of an element.
julia> data = "this is an example data with example words" "this is an example data with example words" julia> vocab = IndexedDict{String}() IndexedDict{String}(Dict{String,Int64}(), String[]) julia> append!(vocab,split(data)) IndexedDict{String}(Dict("this"=>1,"is"=>2,"example"=>4,"data"=>5,"words"=>7,"with"=>6,"an"=>3), ["this", "is", "an", "example", "data", "with", "words"]) julia> vocab["example"] 4 julia> vocab[4] "example" julia> vocab[[1,2,3,4]] 4-element Array{String,1}: "this" "is" "an" "example" julia> vocab[["example","this"]] 2-element Array{Int64,1}: 4 1 julia> push!(vocab,"a-new-word") IndexedDict{String}(Dict("a-new-word"=>8,"this"=>1,"is"=>2,"example"=>4,"data"=>5,"words"=>7,"with"=>6,"an"=>3), ["this", "is", "an", "example", "data", "with", "words", "a-new-word"]) julia> vocab["a-new-word"] 8 julia> length(vocab) 8
You may also directly create an IndexedDict
from an existing array or a dictionary, but you need to be careful you supply unique elements or a proper dictionary.
julia> data = ["this", "is", "an", "example", "data", "with", "unique","words"] 8-element Array{String,1}: "this" "is" "an" "example" "data" "with" "unique" "words" julia> vocab = IndexedDict(data) IndexedDict{String}(Dict("unique"=>7,"this"=>1,"is"=>2,"example"=>4,"data"=>5,"words"=>8,"with"=>6,"an"=>3), ["this", "is", "an", "example", "data", "with", "unique", "words"]) julia> vocab["unique"] 7 julia> vocab[7] "unique" julia> length(vocab) 8
julia> dict = Dict("word"=>1,"this"=>2) Dict{String,Int64} with 2 entries: "this" => 2 "word" => 1 julia> vocab = IndexedDict(dict) IndexedDict{String}(Dict("this"=>2,"word"=>1), ["word", "this"]) julia> vocab["this"] 2 julia> vocab[2] "this" julia> length(vocab) 2
With IndexedDict
my parsing and model codes became more readable, so I would like to share it. If you have better data structure for the purpose let me know!
The source code:
import Base: get, length, getindex, push!, append! | |
struct IndexedDict{T} | |
toIndex::Dict{T,Int}; | |
toElement::Vector{T}; | |
IndexedDict{T}(toIndex,toElement) where T = new(toIndex,toElement) | |
IndexedDict{T}(toIndex,toElement) where T<:Integer = error("Cannot Create IndexedDict of Integers") | |
IndexedDict{T}(toIndex,toElement) where T<:AbstractArray = error("Cannot Create IndexedDict of Arrays") | |
end | |
IndexedDict{T}() where T = IndexedDict{T}(Dict{T,Int}(),T[]) | |
function IndexedDict(toElement::Vector{T}) where T | |
toIndex=Dict{T,Int}(v=>k for (k,v) in enumerate(toElement)) | |
IndexedDict(toIndex, toElement) | |
end | |
function IndexedDict(toIndex::Dict{T,Int}) where T | |
toElement=Vector{T}(undef,length(toIndex)) | |
for (k,v) in toIndex; toElement[v]=k; end | |
IndexedDict{T}(toIndex,toElement) | |
end | |
get(d::IndexedDict,v,default) = get(d.toIndex,v,default) | |
length(d::IndexedDict) = length(d.toElement) | |
getindex(d::IndexedDict,inds::Integer) = d.toElement[inds] | |
getindex(d::IndexedDict{T},inds::T) where T = d.toIndex[inds] | |
getindex(d::IndexedDict{T}, elements::Array{T,1}) where T = map(e->d[e], elements) | |
getindex(d::IndexedDict, inds::Array{<:Integer,1}) = d.toElement[inds] | |
append!(d1::IndexedDict{T}, d2::IndexedDict{T}) where T = append!(d1,d2.toElement) | |
function push!(d::IndexedDict, element) | |
if !haskey(d.toIndex,element) | |
d.toIndex[element]=length(d)+1; | |
push!(d.toElement,element) | |
end | |
return d | |
end | |
function append!(d::IndexedDict, elements) | |
for element in elements | |
push!(d,element) | |
end | |
return d | |
end |