# Language Support
## Vocab
In WordSiv, a Vocab is an object that contains a word list and other language-specific data that allow a WordSiv object to appropriately filter words and generate text.
> **Note:** I considered naming this object `WordList`, but it can also contain word counts and punctuation data. I considered calling it `Lang`, but it's possible to have more than one set of words (and punctuation, etc.) per language. I can imagine having Vocabs derived from different genres of text: `en-news`, `en-wiki`, etc.!
## Using a Built-in Vocab
See Basic Usage for how to list and select a built-in Vocab. If you're curious about the origin/license[^1] of these lists, you can examine the built-in Vocabs in `wordsiv/_vocab_data`.
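If you'd rather inspect those bundled files from Python than browse the source tree, here's a minimal sketch using only the standard library. It assumes an installed copy of wordsiv and relies only on the `wordsiv/_vocab_data` path mentioned above:

```python
from importlib.resources import files

# List the data files bundled with wordsiv (the built-in Vocab wordlists, etc.)
for path in (files("wordsiv") / "_vocab_data").iterdir():
    print(path.name)
```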
## Creating a Custom Vocab
It's easy to add your own Vocab to WordSiv. The harder part is actually deriving wordlists from a text corpus and refining the capitalization (if applicable), which we won't detail here.
Let's say we grab the top 20 German words from this frequency wordlist derived from OpenSubtitles, and save it as `de-words.tsv` (replacing spaces with tabs):
```
ich 3699605
sie 2409949
das 1952794
ist 1920535
du 1890181
nicht 1734016
die 1585020
es 1460530
und 1441012
der 1109693
wir 1075801
was 1072372
zu 918548
er 851812
ein 841835
in 793011
mir 645137
mit 641744
ja 635186
den 588653
```
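If you'd rather script that conversion, here's a minimal sketch. It assumes the downloaded frequency list is space-separated with one `word count` pair per line; the input filename `de_full.txt` and the top-20 cutoff are just placeholders for this example:

```python
# Convert the top entries of a space-separated frequency list into the
# tab-separated file used below. "de_full.txt" is a hypothetical filename.
with open("de_full.txt", encoding="utf-8") as src, open(
    "de-words.tsv", "w", encoding="utf-8"
) as dst:
    for i, line in enumerate(src):
        if i >= 20:
            break
        word, count = line.split()
        dst.write(f"{word}\t{count}\n")
```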
We can now create a Vocab and add it to WordSiv:
```python
from wordsiv import Vocab, WordSiv

# Create a Vocab from a file
de_vocab = Vocab(lang="de", data_file="de-words.tsv", bicameral=True)

# Add the Vocab to a WordSiv object
ws = WordSiv()
ws.add_vocab("de-subtitles", de_vocab)

# Try it out
print(ws.sent(vocab="de-subtitles"))
```
We get the output:
```
Die du die der ich nicht sie das und e
```
## Adding Custom Punctuation to a Vocab
But what if we want punctuation? We have some default punctuation for the built-in languages in `wordsiv/_punctuation.py`, but not yet for German (at the time of writing). Let's copy/paste the English one (for now[^2]) and try it out:
```python
from wordsiv import Vocab, WordSiv

# Define the punctuation dictionary
de_punc = {
    "insert": {
        " ": 0.365,
        ", ": 0.403,
        ": ": 0.088,
        "; ": 0.058,
        "–": 0.057,
        "—": 0.022,
        " … ": 0.006,
    },
    "wrap_sent": {
        ("", "."): 0.923,
        ("", "!"): 0.034,
        ("", "?"): 0.04,
        ("", "…"): 0.003,
    },
    "wrap_inner": {
        ("", ""): 0.825,
        ("(", ")"): 0.133,
        ("‘", "’"): 0.013,
        ("“", "”"): 0.028,
    },
}

# Create a Vocab from a file, this time passing punctuation
de_vocab = Vocab(
    lang="de", data_file="de-words.tsv", bicameral=True, punctuation=de_punc
)

# Add the Vocab to a WordSiv object
ws = WordSiv()
ws.add_vocab("de-subtitles", de_vocab)

# Try it out, turning up punctuation randomness so we see more variation
print(ws.para(vocab="de-subtitles", rnd_punc=0.5))
```
Now we see punctuation:
```
Ich ist mit das ich (du und) mit es sie… Nicht das was zu sie—du die ja nicht und zu ist du? Das er das “wir” ich was sie der du mit das die und zu ich. In und in, ich ja ich die der das (nicht er sie ich) mir.
```
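One thing worth checking when you adapt this dictionary for another language: the example weights above each sum to roughly 1 within their group, which suggests they're treated as normalized probabilities (this is an observation about the data, not a documented requirement). A quick sanity check helps catch typos:

```python
# Print the total weight of each punctuation group; each appears to be
# intended to sum to ~1.0 (see the normalization note in footnote 2).
for group, weights in de_punc.items():
    total = sum(weights.values())
    print(f"{group}: {total:.3f}")
```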
## Contributing Vocabs to WordSiv
WordSiv is only as good as the Vocabs (and punctuation dictionaries!) available to it, and we'd love any help improving language support. If you're interested in helping, feel free to create an issue on the GitHub repo. You don't even have to be a programmer: we just need native speakers to help us construct useful Vocabs. And if you are looking to learn some programming, building wordlists and punctuation dictionaries can be a fun first project (and I'd be glad to help!).
My long-term vision is a community-maintained project (outside of WordSiv) with a huge selection of multilingual proofing text, wordlists, punctuation data, and other resources, plus code that lets the global type community more easily leverage the language data that is commonplace in NLP/linguistics/engineering circles. A lot of the source data already exists; it just needs to be adapted for the needs and tooling of type designers.
[^1]: Licensing for wordlists is a bit odd, because they're often built by crawling a bunch of data with all kinds of licenses. I'm just doing my best here to respect licenses where I can!

[^2]: I'd recommend deriving punctuation frequencies for the target language from real text, and normalizing the probabilities between 0 and 1. I have a script that builds these dictionaries, which I hope to publish soon!