"An oak is a tree or shrub in the genus Quercus (/ˈkwɜːrkəs/;[1] Latin \"oak tree\") of the beech family, Fagaceae. There are approximately 600 extant species of oaks. The common name \"oak\" also appears in the names of species in related genera, notably Lithocarpus (stone oaks), as well as in those of" ⋯ 262 bytes ⋯ "Asia, Europe, and North Africa. North America contains the largest number of oak species, with approximately 90 occurring in the United States, while Mexico has 160 species of which 109 are endemic. The second greatest center of oak diversity is China, which contains approximately 100 species.[2]\n"
"Son árboles de gran porte por lo general, aunque también se incluyen arbustos. Los hay de follaje permanente, caducifolios y marcescentes. Las flores masculinas se presentan en amentos, inflorescencias complejas colgantes, habitualmente cada flor con entre cuatro y diez estambres, lo más a menudo s" ⋯ 727 bytes ⋯ "ementos dominantes del paisaje arbóreo en muchos territorios de su área de distribución (fundamentalmente en el hemisferio norte). Son frecuentes los fenómenos de hibridación entre sus especies, que suelen presentar, además, facilidad para la regeneración vegetativa por brotes de raíz o de cepa. \n"
Cleaning
clean (generic function with 1 method)
'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', ' '
islatin (generic function with 1 method)
"an oak is a tree or shrub in the genus quercus kwrks latin oak tree of the beech family fagaceae there are approximately extant species of oaks the common name oak also appears in the names of species in related genera notably lithocarpus stone oaks as well as in those of unrelated species such as" ⋯ 205 bytes ⋯ "itudes in the americas asia europe and north africa north america contains the largest number of oak species with approximately occurring in the united states while mexico has species of which are endemic the second greatest center of oak diversity is china which contains approximately species"
"son rboles de gran porte por lo general aunque tambin se incluyen arbustos los hay de follaje permanente caducifolios y marcescentes las flores masculinas se presentan en amentos inflorescencias complejas colgantes habitualmente cada flor con entre cuatro y diez estambres lo ms a menudo seis de lar" ⋯ 657 bytes ⋯ "rticipan como elementos dominantes del paisaje arbreo en muchos territorios de su rea de distribucin fundamentalmente en el hemisferio norte son frecuentes los fenmenos de hibridacin entre sus especies que suelen presentar adems facilidad para la regeneracin vegetativa por brotes de raz o de cepa "
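The notebook's own definitions of clean and islatin are not shown here, only their results. A minimal sketch that is consistent with the cleaned samples above could look like this. It assumes the alphabet is the 26 lowercase letters plus the space, and that every other character (digits, punctuation, accented letters) is simply dropped.

```julia
# Assumed alphabet: the 26 lowercase Latin letters plus the space character,
# matching the 27 characters listed above.
const alphabet = ['a':'z'..., ' ']

# true if `character` is one of the 27 allowed characters
islatin(character) = character in alphabet

# Lowercase the text, then drop every character outside the alphabet.
# Accented letters are dropped rather than converted, as in "rboles" above.
clean(text) = filter(islatin, lowercase(text))

clean("Quercus (/ˈkwɜːrkəs/;[1] Latin")  # "quercus kwrks latin"
```

Dropping accented letters loses information for Spanish; converting them to their unaccented form first would be an alternative design, but the output above suggests that is not what happens here.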
Transition tables
transition_counts (generic function with 1 method)
27×27 Matrix{Int64}: 0 1 2 0 2 1 1 0 2 0 9 2 5 … 4 0 6 7 11 0 0 0 0 0 0 8 0 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 6 0 1 0 3 0 0 5 8 0 0 1 0 0 0 0 0 0 3 0 0 0 0 0 1 0 0 0 0 4 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 8 5 0 10 3 4 0 0 0 0 0 0 6 3 0 0 13 13 0 1 2 0 3 0 0 26 2 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 2 0 0 0 0 0 0 0 0 7 1 0 0 0 4 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 2 ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮ 0 0 0 0 3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 3 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 19 1 7 2 5 3 5 2 10 0 1 4 1 0 2 2 14 19 2 0 5 0 0 0 5
27×27 Matrix{Int64}: 0 2 3 14 0 0 1 0 2 2 0 5 … 1 0 13 18 2 1 0 0 1 1 1 23 0 0 0 0 2 0 0 0 2 0 0 1 0 0 5 0 1 2 0 0 0 0 0 0 4 0 0 0 5 0 0 1 12 0 0 1 1 0 1 0 1 4 0 0 0 0 0 1 10 0 0 0 23 0 0 0 2 0 0 0 0 0 0 0 0 3 0 0 0 0 0 2 2 0 4 2 0 0 2 0 1 2 0 9 1 0 12 36 4 0 0 0 0 0 5 35 1 0 0 0 4 0 0 0 1 0 0 4 … 0 0 3 1 0 1 0 0 0 0 0 0 3 0 0 0 5 0 0 0 0 0 0 0 0 0 3 0 0 1 0 0 0 0 0 0 ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮ 3 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 4 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 1 0 0 0 0 0 3 14 3 22 20 25 13 4 6 3 1 0 15 16 2 6 16 4 3 4 0 0 4 0 0
27×27 Matrix{Float64}: 0.0 0.0117876 0.0235751 0.0 … 0.0 0.0 0.0 0.0943006 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0117876 0.0707254 0.0 0.0117876 0.0 0.0 0.0 0.0 0.0117876 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0943006 0.0589378 0.0 0.117876 0.0353627 0.0353627 0.0 0.0 0.306477 0.0235751 0.0 0.0 0.0 … 0.0 0.0 0.0 0.082513 0.0117876 0.0 0.0 0.0 0.0 0.0 0.0 0.0235751 ⋮ ⋱ ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.082513 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.223964 0.0117876 0.082513 0.0235751 0.0 0.0 0.0 0.0589378
27×27 Matrix{Float64}: 0.0 0.0143781 0.0215671 0.100647 … 0.00718904 0.00718904 0.165348 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0287561 0.0 0.0 0.0 0.0 0.0 0.00718904 0.0718904 0.0 0.0 0.0 0.0 0.0 0.0143781 0.0143781 0.0 0.0287561 0.0143781 0.0 0.0359452 0.251616 0.00718904 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0215671 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ ⋱ ⋮ 0.0215671 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0359452 0.0287561 0.0 0.0 0.0 … 0.0 0.0 0.0215671 0.100647 0.0215671 0.158159 0.143781 0.0287561 0.0 0.0
abbabbabccabc
Let's try it out! To keep things simple, let's only look at the letters a, b, c.
3×3 Matrix{Float64}:
0.0 0.57735 0.0
0.57735 0.57735 0.0
0.0 0.0 0.0
Change the text to abbaaaaaaaaaaaaa - what do you see?
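To experiment outside the notebook, here is a small sketch of how such a table could be computed. The two-argument transition_counts and the helper normalize_table are hypothetical stand-ins, not the notebook's code. The scaling to unit Euclidean (Frobenius) norm is an assumption, chosen because it reproduces the value 0.57735 = 1/√3 in the 3×3 table above.

```julia
using LinearAlgebra  # standard library; provides `norm`

# Count how often each letter (row) is immediately followed by each letter (column).
# Pairs containing a character outside `alphabet` are skipped.
function transition_counts(text, alphabet)
    index = Dict(c => i for (i, c) in enumerate(alphabet))
    counts = zeros(Int, length(alphabet), length(alphabet))
    letters = collect(text)
    for k in 1:length(letters)-1
        a, b = letters[k], letters[k+1]
        if haskey(index, a) && haskey(index, b)
            counts[index[a], index[b]] += 1
        end
    end
    return counts
end

# Assumption: scale the table so that its Frobenius norm is 1.
# (An all-zero table would give NaN entries; this sketch does not guard against that.)
normalize_table(counts) = counts ./ norm(counts)

counts = transition_counts("abba", ['a', 'b', 'c'])  # [0 1 0; 1 1 0; 0 0 0]
frequencies = normalize_table(counts)                # nonzero entries are 1/sqrt(3)
```

Under these assumptions, the text abba gives exactly the 3×3 table shown above: three transitions (ab, bb, ba), each scaled to about 0.57735.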
Interpreting this table
Some questions:
Which letters appear double? Which one is most common?
Which letter is most likely to follow a W?
Which letter is most likely to precede a W?
What is the probability that a vowel comes after a consonant?
What is the sum of each row? What is the sum of each column? How can we interpret these values?
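If you want to check your answers with code, the sketch below shows one way to query a table of raw counts. All helper names are hypothetical. It assumes rows stand for the first letter of a pair and columns for the second, which is consistent with the tables shown above. It also treats y as a consonant and uses a short made-up sentence instead of the full samples.

```julia
alphabet = ['a':'z'..., ' ']
vowels = ['a', 'e', 'i', 'o', 'u']

# counts[i, j] = number of times alphabet[i] is immediately followed by alphabet[j].
# The text is assumed to be cleaned already (only characters from `alphabet`).
function transition_counts(text)
    index = Dict(c => i for (i, c) in enumerate(alphabet))
    counts = zeros(Int, length(alphabet), length(alphabet))
    letters = collect(text)
    for k in 1:length(letters)-1
        counts[index[letters[k]], index[letters[k+1]]] += 1
    end
    return counts
end

row_of(c) = findfirst(==(c), alphabet)

# Largest entry in the row of `c`: the letter that most often follows `c`.
most_likely_after(counts, c) = alphabet[argmax(counts[row_of(c), :])]
# Largest entry in the column of `c`: the letter that most often precedes `c`.
most_likely_before(counts, c) = alphabet[argmax(counts[:, row_of(c)])]
# Doubled letters are the nonzero entries on the diagonal.
doubled_letters(counts) = [alphabet[i] for i in eachindex(alphabet) if counts[i, i] > 0]

# Share of pairs starting with a consonant whose second letter is a vowel.
function vowel_after_consonant(counts)
    consonant_rows = [i for (i, c) in enumerate(alphabet) if c != ' ' && !(c in vowels)]
    vowel_columns = [i for (i, c) in enumerate(alphabet) if c in vowels]
    return sum(counts[consonant_rows, vowel_columns]) / sum(counts[consonant_rows, :])
end

counts = transition_counts("the weather was wet")
most_likely_after(counts, 'w')           # 'e'
most_likely_before(counts, 'w')          # ' '
row_sums = vec(sum(counts, dims=2))      # how often each letter starts a pair
column_sums = vec(sum(counts, dims=1))   # how often each letter ends a pair
```

With raw counts, a row sum is the number of times a letter occurs anywhere except at the very end of the text, and a column sum is the number of times it occurs anywhere except at the very start. Both are close to plain letter frequencies.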
Detecting the language
We are faced with a challenge - we have some text, and we want to know whether it is written in English or Spanish! This might be a simple task for us, but a computer needs a little help.
"Small boats are typically found on inland waterways such as rivers and lakes, or in protected coastal areas. However, some boats, such as the whaleboat, were intended for use in an offshore environment. In modern naval terms, a boat is a vessel small enough to be carried aboard a ship. Anomalous definitions exist, as lake freighters 1,000 feet (300 m) long on the Great Lakes are called \"boats\". \n"
To solve this problem, we are going to use the transition table of our mystery sample.
27×27 Matrix{Float64}:
0.0 0.0240215 0.0 0.0 … 0.0 0.0240215 0.0 0.0720646
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0720646 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.192172
0.0480431 0.0240215 0.0240215 0.0960861 0.0240215 0.0 0.0 0.240215
0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0240215
⋮ ⋱ ⋮
0.0240215 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0480431 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0240215
0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0
0.31228 0.120108 0.0720646 0.0240215 0.0 0.0 0.0 0.0480431
0.638384
0.788619
It looks like this text is English!
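The cells that produced the two numbers above are not shown. One plausible reading, and it is only an assumption, is that they are distances between the mystery table and the English and Spanish tables, with the smaller distance winning. The sketch below illustrates that idea on made-up 2×2 tables; table_distance is a hypothetical name.

```julia
using LinearAlgebra  # standard library; provides `norm`

# Assumed distance: Euclidean (Frobenius) norm of the elementwise difference.
table_distance(A, B) = norm(A .- B)

# Made-up toy tables, each scaled to unit norm. They are not real language data.
english_table = [0.0 0.8; 0.6 0.0]
spanish_table = [0.6 0.0; 0.0 0.8]
mystery_table = [0.0 0.6; 0.8 0.0]

distance_to_english = table_distance(mystery_table, english_table)  # about 0.283
distance_to_spanish = table_distance(mystery_table, spanish_table)  # about 1.414

# The reference table with the smaller distance is the guess.
guess = distance_to_english < distance_to_spanish ? "English" : "Spanish"  # "English"
```

For tables of unit norm, the squared distance equals 2 minus twice the sum of the elementwise products, so ranking by smallest distance is the same as ranking by largest overlap.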
Other languages
Throughout this notebook, we used samples, without making assumptions about the actual names of the languages. This is not just for mathematical kicks - writing general code means that it can be directly applied to new problems!
So go back to the first cell and add a third language, or change English and Spanish to something else!
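As a rough illustration of what general code can look like, the sketch below keeps the samples in a named tuple and never mentions a language by name inside the functions. Everything here is an assumption made for the example: the helper names, the unit-norm scaling, the distance, the abridged sample texts, and the German entry, which was written just for this sketch.

```julia
using LinearAlgebra  # standard library; provides `norm`

alphabet = ['a':'z'..., ' ']
clean(text) = filter(c -> c in alphabet, lowercase(text))

# Transition table of the cleaned text, scaled to unit Frobenius norm (assumed).
function transition_frequencies(text)
    index = Dict(c => i for (i, c) in enumerate(alphabet))
    letters = collect(clean(text))
    counts = zeros(length(alphabet), length(alphabet))
    for k in 1:length(letters)-1
        counts[index[letters[k]], index[letters[k+1]]] += 1
    end
    return counts ./ norm(counts)
end

# Abridged placeholder samples. The field names are arbitrary labels.
samples = (
    English = "An oak is a tree or shrub in the genus Quercus of the beech family.",
    Spanish = "Son árboles de gran porte por lo general, aunque también se incluyen arbustos.",
    German = "Eichen sind sommergrüne oder immergrüne Bäume, seltener auch Sträucher.",
)

# Return the label of the sample whose table is closest to the mystery table.
function closest_language(mystery, samples)
    mystery_table = transition_frequencies(mystery)
    distances = [norm(mystery_table .- transition_frequencies(sample)) for sample in samples]
    return keys(samples)[argmin(distances)]
end

closest_language("Small boats are typically found on inland waterways.", samples)
```

Adding a fourth language only means adding one more field to samples. With samples this short the guess can easily be wrong, so longer texts are worth using in practice.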