Filtering Words

WordSiv provides options for filtering the words that are used to generate text: by letter case, word length, substrings, and regular expressions.

We demonstrate these arguments with top_word() and top_words() so you can see what's happening without randomization. The same arguments work with word(), words(), top_word(), top_words(), sent(), sents(), para(), paras(), and text().

Filter Words by Letter Case

The most important parameter in WordSiv (for bicameral languages) is case. WordSiv allows for words in Vocabs to be:

  • lowercase (e.g. "owl"): lc
  • capitalized (e.g. "Korea"): cap
  • all caps (e.g. "WWF"): uc
  • camel-case (e.g. "DDoS"): (no parameter, but respected)

The case argument allows you to select the desired letter case while considering your available glyphs (if you've set glyphs), optionally transforming the original case of words from the Vocab to expand results.

The options are best demonstrated with a small example Vocab:

from wordsiv import Vocab, WordSiv

# Make example Vocab w/ no probabilities
example_words = "grape\nApril\nBART\nDDoS"
vocab = Vocab(bicameral=True, lang="en", data=example_words)

# Build our WordSiv object
wsv = WordSiv(add_default_vocabs=False)
wsv.add_vocab("example", vocab)
wsv.vocab = "example"

# Select words that *already have* desired case in the Vocab
assert wsv.top_word(case="lc") == "grape"
assert wsv.top_word(case="cap_og") == "April"
assert wsv.top_word(case="uc_og") == "BART"
assert wsv.top_words(case="any_og") == ["grape", "April", "BART", "DDoS"]

# Select words that *can be transformed* to desired case
assert wsv.top_words(case="cap") == ["Grape", "April"]
assert wsv.top_words(case="uc") == ["GRAPE", "APRIL", "BART"]

# Select all words and transform to desired case
assert wsv.top_words(case="cap_force") == ["Grape", "April", "Bart", "Ddos"]
assert wsv.top_words(case="uc_force") == ["GRAPE", "APRIL", "BART", "DDOS"]

# Special 'any' case tries 'any_og', then tries 'cap' and 'uc' if no results
# Notice we left out the 'case' parameter on the second call, since 'any' is the default
assert wsv.top_word(glyphs="Grape", case="any") == "Grape"
assert wsv.top_word(glyphs="APRIL") == "APRIL"
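The selection rules above can be sketched in plain Python. This is only an illustration that reproduces the documented results for the example word list; it is not WordSiv's implementation, and the helper names classify() and select() are hypothetical.

```python
def classify(word):
    """Return 'lc', 'cap', 'uc', or 'other' (e.g. camel-case) for a word."""
    if word.islower():
        return "lc"
    if word.isupper():
        return "uc"
    if word[:1].isupper() and word[1:].islower():
        return "cap"
    return "other"


def select(words, case):
    """Illustrative stand-in for the `case` options (not WordSiv's code)."""
    kinds = [(w, classify(w)) for w in words]
    if case == "any_og":
        return list(words)
    if case == "lc":
        return [w for w, k in kinds if k == "lc"]
    if case == "lc_force":
        return [w.lower() for w in words]
    if case == "cap_og":
        return [w for w, k in kinds if k == "cap"]
    if case == "cap":
        # lowercase and capitalized words qualify; all caps and camel-case do not
        return [w.capitalize() for w, k in kinds if k in ("lc", "cap")]
    if case == "cap_force":
        return [w.capitalize() for w in words]
    if case == "uc_og":
        return [w for w, k in kinds if k == "uc"]
    if case == "uc":
        # everything except camel-case can be uppercased
        return [w.upper() for w, k in kinds if k in ("lc", "cap", "uc")]
    if case == "uc_force":
        return [w.upper() for w in words]
    raise ValueError(f"unknown case: {case}")


example = ["grape", "April", "BART", "DDoS"]
print(select(example, "cap"))
print(select(example, "uc"))
print(select(example, "cap_force"))
```

Note how camel-case words such as "DDoS" are only ever changed by the _force options.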

Default: Smart Any Case (case='any')

The default option (case='any') tries to match as many words as possible: it first tries any_og, then falls back to cap, and finally to uc, moving to the next option only when the previous one finds no matches.

from wordsiv import WordSiv

wsv = WordSiv(vocab="en")

# Returns 'ZOO' as most common word
print(wsv.top_word(glyphs="ZO", min_wl=3))

# No `case` arg, so the default is `case='any'` which will:
# - try to get (unmodified) words from Vocab which can be spelled with "ZO"
#   (that have at least 3 characters): No results.
# - try to capitalize Vocab words: "zoo" becomes "Zoo" but we can't spell it
#   without 'o'. No results.
# - try to uppercase Vocab words: "zoo" becomes "ZOO", which we can spell!
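The fallback order described in these comments can be sketched without WordSiv. The function below is a hypothetical illustration, assuming the word list is already ordered from most to least common; it is not the library's actual code.

```python
def is_cap(word):
    """True for capitalized words like 'Paris'."""
    return word[:1].isupper() and word[1:].islower()


# Each step returns the candidate spelling, or None if the word does not qualify.
STEPS = [
    ("any_og", lambda w: w),
    ("cap", lambda w: w.capitalize() if w.islower() or is_cap(w) else None),
    ("uc", lambda w: w.upper() if w.islower() or is_cap(w) or w.isupper() else None),
]


def top_word_any(words, glyphs, min_wl=1):
    """Return (word, step) for the first step that yields a spellable word."""
    for step, transform in STEPS:
        for word in words:  # assumed ordered most common first
            candidate = transform(word)
            if (
                candidate
                and len(candidate) >= min_wl
                and all(ch in glyphs for ch in candidate)
            ):
                return candidate, step
    return "", None


# A tiny made-up word list standing in for a Vocab
print(top_word_any(["the", "zoo", "Oz"], glyphs="ZO", min_wl=3))
```

With only "Z" and "O" available, neither the unmodified nor the capitalized pass finds a word, so the uppercase pass supplies the result.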

Any Case (case='any_og')

The any_og option selects any word from the Vocab that can be spelled with glyphs (if set, otherwise all words). It does not change the case of words from the Vocab.

from wordsiv import WordSiv

wsv = WordSiv(vocab="en")

# We don't get any words, because there are no words of at least 5 letters in
# our Vocab that "AIPRS" can spell
print(wsv.top_word(glyphs="AIPRS", case="any_og", min_wl=5))

# However, this returns "Paris" since we have all the glyphs
print(wsv.top_word(glyphs="aiPrs", case="any_og", min_wl=5))

Lowercase (case='lc')

The lc option selects lowercase words from the Vocab (e.g. "bread"). It will not try to lowercase any words with capitals, since we wouldn't want to lowercase words like "Paris", "FAA", or "DDoS".

Why no lc_og?

There is no need for a lc_og option, because lc already only selects lowercase words from the Vocab.

from wordsiv import WordSiv

wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="lc"))

Forced Lowercase (case='lc_force')

The lc_force option selects all words from the Vocab, and indiscriminately transforms them to lowercase.

from wordsiv import WordSiv

wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="lc_force"))

Capitalized (case='cap')

The cap option selects capitalized words from the Vocab (e.g. "Paris") as well as lowercase (e.g. "boat") words from the Vocab, capitalizing them (e.g. "Paris", "Boat").

from wordsiv import WordSiv

wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="cap"))

Capitalized, No Case Change (case='cap_og')

The cap_og option selects capitalized words from the Vocab (like "Paris"). Unlike case='cap', it does not capitalize any lowercase words. This is useful for getting capitalized words like proper nouns.

from wordsiv import WordSiv

wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="cap_og"))

Forced Capitalized (case='cap_force')

The cap_force option selects all words from the Vocab, and indiscriminately capitalizes them (e.g. "BART" becomes "Bart" and "DDoS" becomes "Ddos").

from wordsiv import WordSiv

wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="cap_force"))

All Caps (case='uc')

The uc option selects all caps words from the Vocab (e.g. "WWF"), as well as lowercase (e.g. "boat") and capitalized (e.g. "Paris") words from the vocab, transforming them to all caps (e.g. "WWF", "BOAT", "PARIS").

from wordsiv import WordSiv

wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="uc"))

All Caps, No Case Change (case='uc_og')

The uc_og option selects all caps words from the Vocab (e.g. "WWF"). Unlike case='uc', it does not uppercase any lowercase or capitalized words. This is useful for getting all caps words like acronyms.

from wordsiv import WordSiv

wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="uc_og"))

Forced All Caps (case='uc_force')

The uc_force option selects all words from the Vocab, and indiscriminately transforms them to uppercase.

from wordsiv import WordSiv

wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="uc_force"))

Filter Words by Word Length

Arguments wl, min_wl, and max_wl let you select for the length of words in the Vocab:

from wordsiv import WordSiv

wsv = WordSiv(vocab="en")

# Exactly 7 letters
print(wsv.top_word(wl=7))

# At least 4 letters
print(wsv.top_word(min_wl=4))

# No more than 20 letters
print(wsv.top_word(max_wl=20))

# At least 10 letters, no more than 20 letters
print(wsv.top_word(min_wl=10, max_wl=20))
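As a rough sketch of how these three arguments relate, the hypothetical predicate below treats wl as an exact length and min_wl/max_wl as inclusive bounds. Letting wl take precedence over the bounds is an assumption made for this illustration, not documented WordSiv behavior.

```python
def length_ok(word, wl=None, min_wl=None, max_wl=None):
    """Illustrative length filter (hypothetical helper, not WordSiv's code)."""
    n = len(word)
    if wl is not None:
        # Assumption: an exact length overrides the bounds
        return n == wl
    if min_wl is not None and n < min_wl:
        return False
    if max_wl is not None and n > max_wl:
        return False
    return True


words = ["a", "word", "example", "typography"]
print([w for w in words if length_ok(w, wl=7)])
print([w for w in words if length_ok(w, min_wl=4, max_wl=7)])
```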

Filter Words by Substrings

Arguments startswith, endswith, contains, and inner let you select words which contain specific substrings.

Word Starts With String (startswith)

Argument startswith matches words that have a specific glyph/string at the start of the word, after any case transformations may have occurred.

from wordsiv import WordSiv

wsv = WordSiv(vocab="en")

# starts with single glyph
print(wsv.top_word(startswith="v"))

# starts with string
print(wsv.top_word(startswith="ev"))

Word Ends With String (endswith)

Argument endswith matches words that have a specific glyph/string at the end of the word, after any case transformations may have occurred.

from wordsiv import WordSiv

wsv = WordSiv(vocab="en")

# ends with single glyph
print(wsv.top_word(endswith="s"))
# ends with string
print(wsv.top_word(endswith="ats"))

Word Contains String(s) (contains)

Argument contains matches words which contain specific string(s) in the word, after any case transformations may have occurred.

from wordsiv import WordSiv

wsv = WordSiv(vocab="en")

# contains single glyph
print(wsv.top_word(contains="a"))
# contains string
print(wsv.top_word(contains="orr"))

# contains multiple glyphs/strings
# `contains` only accepts a tuple, not a list (this is for caching)
print(wsv.top_word(contains=("b", "rr")))

Word Contains Inner String(s) (inner)

Argument inner matches words which contain specific string(s) in word[1:-1] (all but first and last glyphs), after any case transformations may have occurred.

from wordsiv import WordSiv

wsv = WordSiv(vocab="en")

# Contains glyph inside (word[1:-1])
print(wsv.top_word(inner="b"))
# Contains string inside (word[1:-1])
print(wsv.top_word(inner="br"))

# Contains multiple strings inside (word[1:-1])
# `inner` only accepts a tuple, not a list (this is for caching)
print(wsv.top_word(inner=("br", "ck")))
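The sketch below illustrates both the word[1:-1] slice and why a tuple suits caching: tuples are hashable, lists are not. It assumes that every string in the tuple must be present, and it uses functools.lru_cache purely as an example of a cache; neither point is confirmed as WordSiv's internal behavior, and inner_match() is a hypothetical name.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def inner_match(word, inner):
    """True if every string in `inner` occurs in word[1:-1]."""
    parts = (inner,) if isinstance(inner, str) else inner
    middle = word[1:-1]  # drop the first and last glyphs
    return all(part in middle for part in parts)


print(inner_match("embrace", "br"))  # "br" sits inside "mbrac"
print(inner_match("brick", "br"))    # "br" is only at the start, not inside "ric"

try:
    inner_match("bracket", ["ck"])   # a list cannot be used as a cache key
except TypeError as error:
    print(f"TypeError: {error}")
```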

Filter Words by Regex

The regexp argument lets you match words by regular expression. This filter happens after any case transformations may have occurred. It uses the third-party regex library from PyPI, which supports features such as matching Unicode blocks with \p{...}:

from wordsiv import WordSiv

wsv = WordSiv(vocab="en")

print(wsv.top_word(regexp=r"h.+b.*ger"))

# WordSiv uses the third-party regex library from PyPI,
# so you can specify Unicode blocks like this:
wsv_es = WordSiv(vocab="es")
print(wsv_es.top_words(regexp=r".*\p{InLatin-1_Supplement}.*"))

# See https://www.regular-expressions.info/unicode.html for
# more examples of \p{...} syntax

Debugging Filters and Raising Errors

Most of the time when proofing, some output is better than nothing. When WordSiv can't find a matching word for your glyphs and filter arguments, it will just return an empty string and send a warning log message to the console with some details. However, we can change these behaviors:

Suppressing Warning Messages

WordSiv by default outputs log messages when there are no matching words for the given filters. However, you can turn this off by adjusting the log level:

from wordsiv import WordSiv
import logging

# Initially we'll set the logger to show WARNING and more severe
log = logging.getLogger("wordsiv")
log.setLevel(logging.WARNING)

wsv = WordSiv(vocab="en")

# We see a warning in our console
print(wsv.top_word(contains="BLAH"))

# We can suppress these warnings by setting the logging level to ERROR:
log.setLevel(logging.ERROR)

# No warning this time!
print(wsv.top_word(contains="NOWAY"))
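Instead of silencing warnings, you may prefer to collect them for later review. The sketch below uses only the standard logging module and the "wordsiv" logger name shown above; the ListHandler class is hypothetical, and the warning is emitted by hand with placeholder text so the example runs without WordSiv installed.

```python
import logging


class ListHandler(logging.Handler):
    """Collect warning messages in a list instead of printing them."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


log = logging.getLogger("wordsiv")
handler = ListHandler()
log.addHandler(handler)
log.propagate = False  # keep collected warnings off the console

# Placeholder standing in for a warning WordSiv would log itself
log.warning("placeholder: no matching words")

print(handler.messages)
```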

Raising Errors

Maybe you want your script to halt if there are no matching words, or want to catch the error and try something else. For this you can use the raise_errors option, which will raise wordsiv.FilterError if there are no word matches.

from wordsiv import WordSiv, FilterError

wsv = WordSiv(vocab="en", raise_errors=True)

try:
    wsv.top_word(contains="NOWAY")
except FilterError as e:
    print(f'We can handle our error "{e}" here!')
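One way to "try something else" is to walk through progressively looser filter sets until one succeeds. The helper below is a hypothetical sketch: first_match() is not part of WordSiv, and a stand-in function raising LookupError replaces wsv.top_word so the example runs on its own.

```python
def first_match(top_word, filter_sets, error_type=Exception):
    """Call top_word with each filter set in turn; return the first result."""
    for filters in filter_sets:
        try:
            return top_word(**filters)
        except error_type:
            continue  # no match for these filters, try the next set
    return ""


def fake_top_word(**filters):
    """Stand-in for wsv.top_word that fails for one specific filter."""
    if filters.get("contains") == "NOWAY":
        raise LookupError("no words match")
    return "example"


attempts = [{"contains": "NOWAY"}, {"contains": "am"}]
print(first_match(fake_top_word, attempts, error_type=LookupError))
```

With a WordSiv object created with raise_errors=True, the same helper could presumably be called as first_match(wsv.top_word, attempts, error_type=FilterError).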