

Vocab

In WordSiv, a Vocab is an object that contains a word list and other language-specific data that allow a WordSiv object to appropriately filter words and generate text.

Note

I considered naming this object WordList, but it also can contain word counts and punctuation data. I considered calling it Lang, but it's possible to have more than one set of words (and punctuation, etc.) per language. I can imagine having Vocabs derived from different genres of text: en-news, en-wiki, etc.!

Using a Built-in Vocab

See Basic Usage for how to list and select a built-in Vocab. If you're curious about the origin/license¹ of these lists, you can examine the built-in Vocabs in wordsiv/_vocab_data.

Creating a Custom Vocab

It's easy to add your own Vocab to WordSiv. The harder part is actually deriving a wordlist from a text corpus and refining the capitalization (if applicable), which we won't detail here.

Let's say we grab the top 20 German words from this frequency wordlist derived from OpenSubtitles and save them as de-words.tsv, replacing the spaces with tabs (a small conversion script is sketched after the list):

ich 3699605
sie 2409949
das 1952794
ist 1920535
du  1890181
nicht   1734016
die 1585020
es  1460530
und 1441012
der 1109693
wir 1075801
was 1072372
zu  918548
er  851812
ein 841835
in  793011
mir 645137
mit 641744
ja  635186
den 588653
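
If you're starting from the raw space-separated download, a tiny script can handle the conversion. Here's a minimal sketch, assuming the list is saved as de_50k.txt with one "word count" pair per line (both filenames are just examples):

from itertools import islice

# Rewrite the first 20 space-separated "word count" lines as
# tab-separated values for WordSiv.
with open("de_50k.txt", encoding="utf-8") as src:
    with open("de-words.tsv", "w", encoding="utf-8") as dst:
        for line in islice(src, 20):
            word, count = line.split()
            dst.write(f"{word}\t{count}\n")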

We can now create a Vocab and add it to WordSiv:

from wordsiv import Vocab, WordSiv

# Create a Vocab from a file (bicameral=True since German has upper/lowercase)
de_vocab = Vocab(lang="de", data_file="de-words.tsv", bicameral=True)

# Add Vocab to WordSiv object
ws = WordSiv()
ws.add_vocab("de-subtitles", de_vocab)

# Try it out
print(ws.sent(vocab="de-subtitles"))

We get the output:

Die du die der ich nicht sie das und e
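
Notice how the most frequent words dominate the output. As a conceptual illustration only (not WordSiv's actual implementation), you can think of the counts as sampling weights, much like Python's random.choices:

import random

# Illustration: word counts used as sampling weights, so frequent
# words like "ich" turn up more often than rarer ones.
words = ["ich", "sie", "das", "ist", "du"]
counts = [3699605, 2409949, 1952794, 1920535, 1890181]
print(" ".join(random.choices(words, weights=counts, k=8)))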

Adding Custom Punctuation to a Vocab

But what if we want punctuation? We have some default punctuation for the built-in languages in wordsiv/_punctuation.py, but not yet for German (at the time of writing). Let's copy/paste the English one (for now²) and try it out:

from wordsiv import Vocab, WordSiv

# Define the punctuation dictionary
de_punc = {
    "insert": {
        " ": 0.365,
        ", ": 0.403,
        ": ": 0.088,
        "; ": 0.058,
        "–": 0.057,
        "—": 0.022,
        " … ": 0.006,
    },
    "wrap_sent": {
        ("", "."): 0.923,
        ("", "!"): 0.034,
        ("", "?"): 0.04,
        ("", "…"): 0.003,
    },
    "wrap_inner": {
        ("", ""): 0.825,
        ("(", ")"): 0.133,
        ("‘", "’"): 0.013,
        ("“", "”"): 0.028,
    },
}

# Create a Vocab from a file, this time passing punctuation
de_vocab = Vocab(lang="de", data_file="de-words.tsv", bicameral=True, punctuation=de_punc)

# Add Vocab to WordSiv object
ws = WordSiv()
ws.add_vocab("de-subtitles", de_vocab)

# Try it out, turning up punctuation randomness so we see more variation
print(ws.para(vocab="de-subtitles", rnd_punc=0.5))

Now we see punctuation:

Ich ist mit das ich (du und) mit es sie… Nicht das was zu sie—du die ja nicht und zu ist du? Das er das “wir” ich was sie der du mit das die und zu ich. In und in, ich ja ich die der das (nicht er sie ich) mir.
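
As footnote 2 suggests, these probabilities are best derived from real text in the target language. As a rough sketch of the idea (not the script mentioned in the footnote), you could count a few marks in a sample corpus and normalize the counts so they sum to 1; corpus.txt and the list of marks are placeholders:

# Rough sketch: estimate "insert" probabilities from a sample text by
# counting each mark and normalizing the counts. Placeholder inputs.
marks = [", ", ": ", "; ", "–", "—", " … "]
with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()
counts = {m: text.count(m) for m in marks}
total = sum(counts.values())
insert_probs = {m: round(n / total, 3) for m, n in counts.items()}
print(insert_probs)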

Contributing Vocabs to WordSiv

WordSiv is only as good as the Vocabs (and punctuation dictionaries!) available to it, and we'd love help improving language support. Feel free to create an issue on the GitHub repo if you're interested in contributing. You don't even have to be a programmer; we just need native speakers to help us construct useful Vocabs. That said, if you are looking to learn some programming, building wordlists and punctuation data can be a fun first project (and I'd be glad to help!).

My long-term vision is to build a community-maintained project (outside of WordSiv) with a huge selection of multilingual proofing text, wordlists, punctuation data, and more, plus resources and code that let the global type community more easily leverage the language data that is commonplace in NLP/linguistics/engineering circles. A lot of the source data already exists; it just needs to be adapted for the needs and tooling of type designers.


  1. Licensing for wordlists is a bit odd, because they're often built by crawling a bunch of data with all kinds of licenses. I'm just doing my best here to respect licenses where I can! 

  2. I'd recommend deriving punctuation frequencies for the target language from real text, and normalizing the probabilities between 0 and 1. I have a script that builds these dictionaries, which I hope to publish soon!