Filtering Words
WordSiv provides options for filtering the words that are used to generate text:
- Letter Case: case
- Word Length: wl, min_wl, max_wl
- Substrings: startswith, endswith, contains, inner
- Regular Expressions: regexp
We demonstrate these arguments with top_word() and top_words() so you can see
what's happening without randomization. You can use these arguments for word(),
words(), top_word(), top_words(), sent(), sents(), para(), paras(), and text().
Filter Words by Letter Case
The most important parameter in WordSiv (for bicameral languages) is case.
WordSiv allows for words in Vocabs to be:
- lowercase (e.g. "owl"): lc
- capitalized (e.g. "Korea"): cap
- all caps (e.g. "WWF"): uc
- camel-case (e.g. "DDoS"): (no parameter, but respected)
The case argument allows you to select the desired letter case while
considering your available glyphs (if you've set glyphs), optionally
transforming the original case of words from the Vocab to expand results.
The options are best demonstrated with a small example Vocab:
from wordsiv import Vocab, WordSiv
# Make example Vocab w/ no probabilities
example_words = "grape\nApril\nBART\nDDoS"
vocab = Vocab(bicameral=True, lang="en", data=example_words)
# Build our WordSiv object
wsv = WordSiv(add_default_vocabs=False)
wsv.add_vocab("example", vocab)
wsv.vocab = "example"
# Select words that *already have* desired case in the Vocab
assert wsv.top_word(case="lc") == "grape"
assert wsv.top_word(case="cap_og") == "April"
assert wsv.top_word(case="uc_og") == "BART"
assert wsv.top_words(case="any_og") == ["grape", "April", "BART", "DDoS"]
# Select words that *can be transformed* to desired case
assert wsv.top_words(case="cap") == ["Grape", "April"]
assert wsv.top_words(case="uc") == ["GRAPE", "APRIL", "BART"]
# Select all words and transform to desired case
assert wsv.top_words(case="cap_force") == ["Grape", "April", "Bart", "Ddos"]
assert wsv.top_words(case="uc_force") == ["GRAPE", "APRIL", "BART", "DDOS"]
# Special 'any' case tries 'any_og', then tries 'cap' and 'uc' if no results
# Notice we left out the 'case' parameter on the second call, since 'any' is the default
assert wsv.top_word(glyphs="Grape", case="any") == "Grape"
assert wsv.top_word(glyphs="APRIL") == "APRIL"
Default: Smart Any Case (case='any')
The default option (case='any') tries to match as many words as possible. It
first tries any_og, then cap, then uc, moving on to the next option only if the
previous one produced no matches.
from wordsiv import WordSiv
wsv = WordSiv(vocab="en")
# Returns 'ZOO' as most common word
print(wsv.top_word(glyphs="ZO", min_wl=3))
# No `case` arg, so the default is `case='any'` which will:
# - try to get (unmodified) words from Vocab which can be spelled with "ZO"
# (that have at least 3 characters): No results.
# - try to capitalize Vocab words: "zoo" becomes "Zoo" but we can't spell it
# without 'o'. No results.
# - try to uppercase Vocab words: "zoo" becomes "ZOO", which we can spell!
Any Case (case='any_og')
The any_og option selects any word from the Vocab that can be spelled with
glyphs (if set, otherwise all words). It does not change the case of words
from the Vocab.
from wordsiv import WordSiv
wsv = WordSiv(vocab="en")
# We don't get any words, because there are no words of at least 5 letters in
# our Vocab that "AIPRS" can spell
print(wsv.top_word(glyphs="AIPRS", case="any_og", min_wl=5))
# However, this returns "Paris" since we have all the glyphs
print(wsv.top_word(glyphs="aiPrs", case="any_og", min_wl=5))
Lowercase (case='lc')
The lc option selects lowercase words from the Vocab (e.g. "bread"). It will
not try to lowercase any words with capitals, since we wouldn't want to
lowercase words like "Paris", "FAA", or "DDoS".
Why no lc_og?
There is no need for an lc_og option, because lc already only selects
lowercase words from the Vocab.
from wordsiv import WordSiv
wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="lc"))
Forced Lowercase (case='lc_force')
The lc_force option selects all words from the Vocab, and indiscriminately
transforms them to lowercase.
from wordsiv import WordSiv
wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="lc_force"))
Capitalized (case='cap')
The cap option selects capitalized words from the Vocab (e.g. "Paris") as
well as lowercase words (e.g. "boat"), which it capitalizes (e.g. "Paris",
"Boat").
from wordsiv import WordSiv
wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="cap"))
Capitalized, No Case Change (case='cap_og')
The cap_og option selects capitalized words from the Vocab (like "Paris").
It does not capitalize any lowercase words (like case='cap' does). This is
useful for getting capitalized words like proper nouns.
from wordsiv import WordSiv
wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="cap_og"))
Forced Capitalized (case='cap_force')
The cap_force option selects all words from the Vocab, and indiscriminately
capitalizes them (so "BART" becomes "Bart" and "DDoS" becomes "Ddos").
from wordsiv import WordSiv
wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="cap_force"))
All Caps (case='uc')
The uc option selects all caps words from the Vocab (e.g. "WWF"), as well as
lowercase (e.g. "boat") and capitalized (e.g. "Paris") words, transforming
them to all caps (e.g. "WWF", "BOAT", "PARIS").
from wordsiv import WordSiv
wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="uc"))
All Caps, No Case Change (case='uc_og')
The uc_og option selects all caps words from the Vocab (e.g. "WWF"). It
does not uppercase any lowercase or capitalized words (like case='uc'
does). This is useful for getting all caps words like acronyms.
from wordsiv import WordSiv
wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="uc_og"))
Forced All Caps (case='uc_force')
The uc_force option selects all words from the Vocab, and indiscriminately
transforms them to uppercase.
from wordsiv import WordSiv
wsv = WordSiv(vocab="en", glyphs="HAMBUGERShambugers")
print(wsv.top_word(case="uc_force"))
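The case options in this section can be summarized with a small pure-Python sketch. It is not WordSiv's implementation: word_case, RULES, and select_case are hypothetical names, the handling of single-letter words is a guess, and the glyphs check is reduced to a simple set comparison. It only mirrors the behavior shown with the example Vocab ("grape", "April", "BART", "DDoS") at the top of this section.

```python
def word_case(word):
    """Classify a word as 'lc', 'cap', 'uc', or 'other' (e.g. camel-case)."""
    if word.islower():
        return "lc"
    if word.isupper():
        return "uc"
    if word[:1].isupper() and word[1:].islower():
        return "cap"
    return "other"


ALL_CASES = {"lc", "cap", "uc", "other"}

# case option -> (original cases accepted, transformation applied)
RULES = {
    "any_og": (ALL_CASES, lambda w: w),
    "lc": ({"lc"}, lambda w: w),
    "lc_force": (ALL_CASES, str.lower),
    "cap": ({"lc", "cap"}, str.capitalize),
    "cap_og": ({"cap"}, lambda w: w),
    "cap_force": (ALL_CASES, str.capitalize),
    "uc": ({"lc", "cap", "uc"}, str.upper),
    "uc_og": ({"uc"}, lambda w: w),
    "uc_force": (ALL_CASES, str.upper),
}


def select_case(words, case="any", glyphs=None):
    """Return words (in Vocab order) that satisfy `case` and fit `glyphs`."""
    if case == "any":
        # Smart default: fall through to the next option only when empty.
        for option in ("any_og", "cap", "uc"):
            found = select_case(words, option, glyphs)
            if found:
                return found
        return []

    accepted, transform = RULES[case]
    results = []
    for word in words:
        if word_case(word) not in accepted:
            continue
        candidate = transform(word)
        # A word is usable only if every character is an available glyph.
        if glyphs is None or set(candidate) <= set(glyphs):
            results.append(candidate)
    return results


vocab_words = ["grape", "April", "BART", "DDoS"]
print(select_case(vocab_words, "cap"))           # ['Grape', 'April']
print(select_case(vocab_words, glyphs="APRIL"))  # ['APRIL']
```

Reading the table row by row shows the pattern: the plain options (cap, uc) accept words that can be safely transformed, the _og options accept only words already in that case, and the _force options accept everything.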
Filter Words by Word Length
Arguments wl, min_wl, and max_wl let you select for the length of words in
the Vocab:
from wordsiv import WordSiv
wsv = WordSiv(vocab="en")
# Exactly 7 letters
print(wsv.top_word(wl=7))
# At least 4 letters
print(wsv.top_word(min_wl=4))
# No more than 20 letters
print(wsv.top_word(max_wl=20))
# At least 10 letters, no more than 20 letters
print(wsv.top_word(min_wl=10, max_wl=20))
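To make the three length arguments concrete without a Vocab, here is a rough pure-Python equivalent. The function name filter_length is hypothetical, and two things are assumptions rather than documented behavior: that length is counted with len() (code points), and that all given constraints must hold at once.

```python
def filter_length(words, wl=None, min_wl=None, max_wl=None):
    """Keep words whose length satisfies every constraint that is set."""
    results = []
    for word in words:
        n = len(word)
        if wl is not None and n != wl:
            continue  # not exactly wl characters
        if min_wl is not None and n < min_wl:
            continue  # too short
        if max_wl is not None and n > max_wl:
            continue  # too long
        results.append(word)
    return results


sample = ["a", "owl", "grape", "hamburger"]
print(filter_length(sample, wl=3))                # ['owl']
print(filter_length(sample, min_wl=3, max_wl=5))  # ['owl', 'grape']
```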
Filter Words by Substrings
Arguments startswith, endswith, contains, and inner let you select words which
contain specific substrings.
Word Starts With String (startswith)
Argument startswith matches words that have a specific glyph/string at the
start of the word, after any case transformations may have occurred.
from wordsiv import WordSiv
wsv = WordSiv(vocab="en")
# starts with single glyph
print(wsv.top_word(startswith="v"))
# starts with string
print(wsv.top_word(startswith="ev"))
Word Ends With String (endswith)
Argument endswith matches words that have a specific glyph/string at the
end of the word, after any case transformations may have occurred.
from wordsiv import WordSiv
wsv = WordSiv(vocab="en")
# ends with single glyph
print(wsv.top_word(endswith="s"))
# ends with string
print(wsv.top_word(endswith="ats"))
Word Contains String(s) (contains)
Argument contains matches words which contain specific string(s) in the word,
after any case transformations may have occurred.
from wordsiv import WordSiv
wsv = WordSiv(vocab="en")
# contains single glyph
print(wsv.top_word(contains="a"))
# contains string
print(wsv.top_word(contains="orr"))
# contains multiple glyphs/strings
# `contains` only accepts a tuple, not a list (this is for caching)
print(wsv.top_word(contains=("b", "rr")))
Word Contains Inner String(s) (inner)
Argument inner matches words which contain specific string(s) in word[1:-1]
(all but first and last glyphs), after any case transformations may have
occurred.
from wordsiv import WordSiv
wsv = WordSiv(vocab="en")
# Contains glyph inside (word[1:-1])
print(wsv.top_word(inner="b"))
# Contains string inside (word[1:-1])
print(wsv.top_word(inner="br"))
# Contains multiple strings inside (word[1:-1])
# `inner` only accepts a tuple, not a list (this is for caching)
print(wsv.top_word(inner=("br", "ck")))
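The four substring arguments can be pictured with plain string operations. The sketch below is illustrative only: filter_substrings and _as_tuple are hypothetical helpers, and it assumes that when several strings are given to contains or inner, a word must include all of them.

```python
def _as_tuple(value):
    """Normalize None, a single string, or a tuple of strings to a tuple."""
    if value is None:
        return ()
    if isinstance(value, str):
        return (value,)
    return tuple(value)


def filter_substrings(words, startswith=None, endswith=None,
                      contains=None, inner=None):
    """Keep words that satisfy every substring constraint that is set."""
    contains = _as_tuple(contains)
    inner = _as_tuple(inner)
    results = []
    for word in words:
        if startswith is not None and not word.startswith(startswith):
            continue
        if endswith is not None and not word.endswith(endswith):
            continue
        if not all(s in word for s in contains):
            continue
        # inner ignores the first and last glyphs of the word
        if not all(s in word[1:-1] for s in inner):
            continue
        results.append(word)
    return results


sample = ["burrito", "barrel", "abbey", "brick", "bricks"]
print(filter_substrings(sample, contains=("b", "rr")))  # ['burrito', 'barrel']
print(filter_substrings(sample, inner="b"))             # ['abbey']
```

Note how inner="b" rejects "brick" even though it contains "b": the only "b" is the first glyph, which word[1:-1] excludes.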
Filter Words by Regex
The regexp argument lets you match words by regular expression. This filter
happens after any case transformations may have occurred. It uses the
third-party regex library from PyPI, which adds features beyond the standard
library's re module, such as selecting Unicode blocks:
from wordsiv import WordSiv
wsv = WordSiv(vocab="en")
print(wsv.top_word(regexp=r"h.+b.*ger"))
# WordSiv uses the third-party regex library from PyPI,
# so you can specify Unicode blocks like this:
wsv_es = WordSiv(vocab="es")
print(wsv_es.top_words(regexp=r".*\p{InLatin-1_Supplement}.*"))
# See https://www.regular-expressions.info/unicode.html for
# more examples of \p{...} syntax
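Because the regexp filter runs after case transformation, the pattern has to be written for the transformed word. The sketch below illustrates that ordering with the standard library's re module (which does not support \p{...}; that needs the regex package). filter_regexp is a hypothetical helper, and matching from the start of the word with match() is an assumption, since the matching mode is not stated here.

```python
import re


def filter_regexp(words, regexp, transform=None):
    """Apply an optional case transform first, then test the pattern."""
    pattern = re.compile(regexp)
    results = []
    for word in words:
        candidate = transform(word) if transform else word
        # Assumption: the pattern is anchored at the start of the word.
        if pattern.match(candidate):
            results.append(candidate)
    return results


sample = ["hamburger", "hamburg", "burger", "humbugger"]
print(filter_regexp(sample, r"h.+b.*ger"))
# ['hamburger', 'humbugger']

# After uppercasing, a lowercase pattern no longer matches anything:
print(filter_regexp(sample, r"h.+b.*ger", transform=str.upper))
# []
print(filter_regexp(sample, r"H.+B.*GER", transform=str.upper))
# ['HAMBURGER', 'HUMBUGGER']
```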
Debugging Filters and Raising Errors
Most of the time when proofing, some output is better than nothing. When WordSiv
can't find a matching word for your glyphs and filter arguments, it will just
return an empty string and send a warning log message to the console with some
details. However, we can change these behaviors:
Suppressing Warning Messages
WordSiv by default outputs log messages when there are no matching words for the given filters. However, you can turn this off by adjusting the log level:
from wordsiv import WordSiv
import logging
# Initially we'll set the logger to show WARNING and more severe
log = logging.getLogger("wordsiv")
log.setLevel(logging.WARNING)
wsv = WordSiv(vocab="en")
# We see a warning in our console
print(wsv.top_word(contains="BLAH"))
# We can suppress these warnings by setting the logging level to ERROR:
log.setLevel(logging.ERROR)
# No warning this time!
print(wsv.top_word(contains="NOWAY"))
Raising Errors
Maybe you want your script to halt if there are no matching words, or want to
catch the error and try something else. For this you can use the raise_errors
option, which will raise wordsiv.FilterError if there are no word matches.
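The "catch the error and try something else" pattern can be sketched without WordSiv installed. Everything here is a stand-in: the FilterError class is defined locally to play the role of wordsiv.FilterError, and first_match and first_match_with_fallback are hypothetical functions, not part of the library. Where exactly raise_errors is passed in WordSiv is not shown in this section, so check the API reference before adapting this.

```python
import logging

log = logging.getLogger("filter_sketch")


class FilterError(Exception):
    """Local stand-in for wordsiv.FilterError, for illustration only."""


def first_match(words, predicate, raise_errors=False):
    """Return the first matching word, or handle the no-match case."""
    for word in words:
        if predicate(word):
            return word
    if raise_errors:
        raise FilterError("no words match the given filters")
    # Default behavior: warn and return an empty string.
    log.warning("no words match the given filters")
    return ""


def first_match_with_fallback(words, strict, relaxed):
    """Try a strict filter first, then retry with a relaxed one."""
    try:
        return first_match(words, strict, raise_errors=True)
    except FilterError:
        return first_match(words, relaxed)


sample = ["owl", "grape", "hamburger"]
print(first_match_with_fallback(sample,
                                lambda w: "z" in w,      # strict: no match
                                lambda w: len(w) > 3))   # relaxed
# grape
```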