begin
    import Pkg
    Pkg.activate(mktempdir())
    Pkg.add(["PlutoUI"])
    using PlutoUI
end

Cleaning

clean (generic function with 1 method)
function clean(text)
    # Lowercase first, then keep only the characters a-z and the space.
    filter(islatin, lowercase(text))
end
latin = [('a' : 'z')..., ' ']
islatin (generic function with 1 method)
function islatin(character)
    # True for the lowercase letters a-z and for the space character.
    'a' <= character <= 'z' || character == ' '
end
map(clean, samples)
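
To see what cleaning does to a single string, here is a minimal sketch. The two definitions repeat the notebook's own so the snippet runs in a fresh Julia session; the input sentence is a made-up example, not one of the notebook's samples.

```julia
# Same logic as the notebook's `islatin` and `clean`, written in short form.
islatin(character) = 'a' <= character <= 'z' || character == ' '
clean(text) = filter(islatin, lowercase(text))

# Hypothetical input chosen to include punctuation and accented letters.
cleaned = clean("Hello, World! ¿Qué tal?")
println(cleaned)  # prints "hello world qu tal"
```

Note that accented letters such as é are dropped rather than mapped to their base letter, which matters for languages like Spanish.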

Transition tables

transition_counts (generic function with 1 method)
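function transition_counts(sample)
    # A[i, j] counts how often latin[j] directly follows latin[i].
    A = zeros(Int, (length(latin), length(latin)))
    for i in 1:(length(sample)-1)
        c1 = sample[i]
        c2 = sample[i+1]
        # `sample` must already be cleaned, so that both characters occur in `latin`.
        i1 = findfirst(isequal(c1), latin)
        i2 = findfirst(isequal(c2), latin)
        A[i1, i2] += 1
    end
    A
end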
map(transition_counts ∘ clean, samples)
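
As a small worked example, the sketch below counts the transitions in the already-clean string "abba". The definitions of `latin` and `transition_counts` are repeated from the notebook in compact form so the snippet runs on its own.

```julia
latin = [('a':'z')..., ' ']

function transition_counts(sample)
    A = zeros(Int, (length(latin), length(latin)))
    for i in 1:(length(sample) - 1)
        i1 = findfirst(isequal(sample[i]), latin)
        i2 = findfirst(isequal(sample[i + 1]), latin)
        A[i1, i2] += 1
    end
    A
end

counts = transition_counts("abba")

# "abba" contains the pairs ab, bb and ba, so three entries are 1.
corner = counts[1:3, 1:3]
println(corner)
```

The top-left corner is `[0 1 0; 1 1 0; 0 0 0]`: row 1 (a) has a count in column 2 (b), and row 2 (b) has counts in columns 1 and 2. The total of all entries is always one less than the length of the text.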
using LinearAlgebra
map(normalize ∘ transition_counts ∘ clean, samples)
transition_frequencies = normalize ∘ transition_counts ∘ clean;
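
It helps to know what `normalize` does here. By default, `LinearAlgebra.normalize` divides by the Euclidean (2-)norm of all entries, so the result has norm 1, but its entries do not sum to 1. A minimal sketch using the counts for "abba" as a hand-written matrix:

```julia
using LinearAlgebra

# Transition counts for "abba", restricted to the letters a, b, c.
counts = [0 1 0;
          1 1 0;
          0 0 0]

freqs = normalize(counts)

println(freqs[1, 2])   # 1 / sqrt(3), roughly 0.57735
println(norm(freqs))   # 1.0 up to rounding
println(sum(freqs))    # sqrt(3), roughly 1.732, not 1
```

So the entries are scaled counts, which is enough for comparing two tables by distance, but they are not probabilities in the strict sense.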

abbabbabccabc

Let's try it out! To keep things simple, let's only look at the letters a, b, c.

@bind transition_demo TextField(default="abba")
3×3 Matrix{Float64}:
 0.0      0.57735  0.0
 0.57735  0.57735  0.0
 0.0      0.0      0.0
transition_frequencies(transition_demo)[1:3, 1:3]
# the 3x3 top left corner corresponds to a, b & c

Change the text to abbaaaaaaaaaaaaa - what do you see?


Interpreting this table


Some questions:

Which letters appear double? Which one is most common?

Which letter is most likely to follow a W?

Which letter is most likely to precede a W?

What is the probability that a vowel comes after a consonant?

What is the sum of each row? What is the sum of each column? How can we interpret these values?
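
One possible way to explore these questions in code is sketched below. The 3×3 count matrix is made up for illustration and covers only the letters a, b and c. The variable names are hypothetical and do not appear in the notebook.

```julia
letters = ['a', 'b', 'c']

# Hypothetical counts: counts[i, j] is how often letters[j] follows letters[i].
counts = [1 4 0;
          3 2 1;
          2 0 0]

# Row i sums the pairs that start with letters[i];
# column j sums the pairs that end with letters[j].
row_sums = sum(counts, dims=2)
col_sums = sum(counts, dims=1)

# Dividing each row by its sum gives conditional probabilities P(next | current).
probabilities = counts ./ row_sums

# Most likely letter after 'b': the largest entry in row 2.
most_likely_after_b = letters[argmax(counts[2, :])]

# Most likely letter before 'b': the largest entry in column 2.
most_likely_before_b = letters[argmax(counts[:, 2])]

# Doubled letters (aa, bb, cc) sit on the diagonal.
doubles = [counts[i, i] for i in 1:3]
```

With these made-up counts, 'a' is both the most likely successor and the most likely predecessor of 'b', and "bb" is the most common double. The same indexing applies to the full 27×27 table, using the position of a letter in `latin` as its row or column.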


Detecting the language


We are faced with a challenge: we have some text, and we want to know whether it is written in English or Spanish. This might be a simple task for us, but a computer needs a little help.

"Small boats are typically found on inland waterways such as rivers and lakes, or in protected coastal areas. However, some boats, such as the whaleboat, were intended for use in an offshore environment. In modern naval terms, a boat is a vessel small enough to be carried aboard a ship. Anomalous definitions exist, as lake freighters 1,000 feet (300 m) long on the Great Lakes are called \"boats\". \n"
mystery_sample

To solve this problem, we are going to compute the transition table of our mystery sample and compare it with the transition table of each language sample.

27×27 Matrix{Float64}:
 0.0        0.0240215  0.0        0.0        …  0.0        0.0240215  0.0  0.0720646
 0.0        0.0        0.0        0.0           0.0        0.0        0.0  0.0
 0.0720646  0.0        0.0        0.0           0.0        0.0        0.0  0.0
 0.0        0.0        0.0        0.0           0.0        0.0        0.0  0.192172
 0.0480431  0.0240215  0.0240215  0.0960861     0.0240215  0.0        0.0  0.240215
 0.0        0.0        0.0        0.0        …  0.0        0.0        0.0  0.0
 0.0        0.0        0.0        0.0           0.0        0.0        0.0  0.0240215
 ⋮                                           ⋱                        ⋮    
 0.0240215  0.0        0.0        0.0           0.0        0.0        0.0  0.0
 0.0480431  0.0        0.0        0.0           0.0        0.0        0.0  0.0
 0.0        0.0        0.0        0.0           0.0        0.0        0.0  0.0
 0.0        0.0        0.0        0.0           0.0        0.0        0.0  0.0240215
 0.0        0.0        0.0        0.0        …  0.0        0.0        0.0  0.0
 0.31228    0.120108   0.0720646  0.0240215     0.0        0.0        0.0  0.0480431
transition_frequencies(mystery_sample)
distances = map(samples) do sample
    norm(transition_frequencies(mystery_sample) - transition_frequencies(sample))
end
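
The smallest distance points to the closest language. The sketch below puts the whole pipeline together and picks that entry. It assumes `samples` is a NamedTuple of language name to text, which the notebook does not show. The sample texts, the `mystery` sentence and the names `languages` and `closest` are placeholders invented for this example.

```julia
using LinearAlgebra

# Notebook definitions, repeated in compact form so this runs on its own.
latin = [('a':'z')..., ' ']
islatin(character) = 'a' <= character <= 'z' || character == ' '
clean(text) = filter(islatin, lowercase(text))

function transition_counts(sample)
    A = zeros(Int, (length(latin), length(latin)))
    for i in 1:(length(sample) - 1)
        i1 = findfirst(isequal(sample[i]), latin)
        i2 = findfirst(isequal(sample[i + 1]), latin)
        A[i1, i2] += 1
    end
    A
end

transition_frequencies = normalize ∘ transition_counts ∘ clean

# Assumption: a NamedTuple with short placeholder texts (real samples should be much longer).
samples = (
    English = "The quick brown fox jumps over the lazy dog while the children watch from the window of their house.",
    Spanish = "El rápido zorro marrón salta sobre el perro perezoso mientras los niños miran desde la ventana de su casa.",
)
mystery = "A boat is a vessel small enough to be carried aboard a ship."

# `map` over a NamedTuple keeps the names, so each distance stays labelled.
distances = map(samples) do sample
    norm(transition_frequencies(mystery) - transition_frequencies(sample))
end

# Pick the name whose distance is smallest.
languages = keys(distances)
closest = languages[argmin(collect(values(distances)))]

println(distances)
println("Closest sample: ", closest)
```

With samples this short the result can be unreliable. Longer texts give more stable transition tables.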

It looks like this text is English!


Other languages

Throughout this notebook, we used samples, without making assumptions about the actual names of the languages. This is not just for mathematical kicks - writing general code means that it can be directly applied to new problems!

So go back to the first cell, and add a third language, or change English and Spanish to something else!


Appendix
