Corpi

Classes:

Corpus([get, cache_dir])

Class to get a corpus

Txt(path[, sep])

Args: get (bool): get the corpus on init

English([get, cache_dir])

Parameters

get (bool) – get the corpus on init

Common_English(nth_most, *args, **kwargs)

Parameters

nth_most (int) – keep only the nth_most most common words; defaults to 10000 (the whole list)

CMUDict([get, cache_dir])

CMU Phonetic dictionary

CMUDict_Common([nth_most])

Args: get (bool): get the corpus on init

Proteins([get, cache_dir])

Parameters

get (bool) – get the corpus on init

Functions:

get_corpus(corp_name)

class recurse_words.corpi.Corpus(get=False, cache_dir='~/.recurse_words')

Bases: abc.ABC

Class to get a corpus

Parameters

get (bool) – get the corpus on init

Attributes:

name

url

decode

decode the raw bytes from the request in download_corpus

corpus

corpus_dict

_abc_impl

Methods:

__init__([get, cache_dir])

Parameters

get (bool) – get the corpus on init

download_corpus()

Return the raw bytes of the request

_load()

_save(corpus)

get()

clean(to_clean)

Clean the corpus of any debris, returning a list of strings

name = ''
url = ''
decode = True

decode the raw bytes from the request in download_corpus

__init__(get=False, cache_dir='~/.recurse_words')
Parameters

get (bool) – get the corpus on init

property corpus
property corpus_dict
download_corpus() → Union[bytes, str]

Return the raw bytes of the request

_load() → List[str]
_save(corpus: List[str])
get() → List[str]
abstract clean(to_clean: str) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters

to_clean (str) – the big long string to clean

Returns

list of words
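
The following is a minimal sketch of how a subclass might supply clean(), assuming a download, decode, clean flow like the one the attributes above suggest. SketchCorpus and LineCorpus are hypothetical stand-ins written for this example and are not the library's classes; the canned bytes replace a real request.

```python
from abc import ABC, abstractmethod
from typing import List, Union


class SketchCorpus(ABC):
    """Hypothetical stand-in for the documented base class."""

    name = ''
    decode = True  # decode raw bytes to str before cleaning

    def download_corpus(self) -> Union[bytes, str]:
        # Assumption: a real subclass would request self.url here.
        return b"Apple\nbanana\n\n cherry \n"

    @abstractmethod
    def clean(self, to_clean: str) -> List[str]:
        """Turn the raw corpus text into a list of words."""

    def get(self) -> List[str]:
        raw = self.download_corpus()
        if self.decode and isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        return self.clean(raw)


class LineCorpus(SketchCorpus):
    name = 'lines'

    def clean(self, to_clean: str) -> List[str]:
        # One word per line; drop blank lines and surrounding whitespace.
        return [w.strip().lower() for w in to_clean.split("\n") if w.strip()]


words = LineCorpus().get()
```

Because clean() is abstract, the base class itself cannot be instantiated; only subclasses that implement it can.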

_abc_impl = <_abc_data object>
class recurse_words.corpi.Txt(path: pathlib.Path, sep='\n', *args, **kwargs)

Bases: recurse_words.corpi.Corpus

Args: get (bool): get the corpus on init

Attributes:

Methods:

download_corpus()

Return the raw bytes of the request

clean(to_clean)

Clean the corpus of any debris, returning a list of strings

name = 'txt'
download_corpus() → Union[bytes, str]

Return the raw bytes of the request

clean(to_clean: str) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters

to_clean (str) – the big long string to clean

Returns

list of words

_abc_impl = <_abc_data object>
class recurse_words.corpi.English(get=False, cache_dir='~/.recurse_words')

Bases: recurse_words.corpi.Corpus

Parameters

get (bool) – get the corpus on init

Attributes:

Methods:

clean(to_clean)

Clean the corpus of any debris, returning a list of strings

name = 'english'
url = 'https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt'
clean(to_clean: str) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters

to_clean (str) – the big long string to clean

Returns

list of words

_abc_impl = <_abc_data object>
class recurse_words.corpi.Common_English(nth_most: int = 10000, *args, **kwargs)

Bases: recurse_words.corpi.Corpus

Parameters

nth_most (int) – keep only the nth_most most common words; defaults to 10000 (the whole list)

Attributes:

Methods:

__init__([nth_most])

Parameters

nth_most (int) – keep only the nth_most most common words; defaults to 10000 (the whole list)

clean(to_clean)

Clean the corpus of any debris, returning a list of strings

name = 'common'
url = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa.txt'
__init__(nth_most: int = 10000, *args, **kwargs)
Parameters

nth_most (int) – keep only the nth_most most common words; defaults to 10000 (the whole list)

clean(to_clean: str) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters

to_clean (str) – the big long string to clean

Returns

list of words

_abc_impl = <_abc_data object>
class recurse_words.corpi.CMUDict(get=False, cache_dir='~/.recurse_words')

Bases: recurse_words.corpi.Corpus

CMU Phonetic dictionary

Parameters

get (bool) – get the corpus on init

Attributes:

name

url

lut

Convert CMU phonetic codes to single letters

tul

inverse lookup, single letter to original code

ipa

lut from single-letter code to IPA symbols

_abc_impl

Methods:

clean(to_clean)

Clean the corpus of any debris, returning a list of strings

name = 'phonetic'
url = 'http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b'
lut = {'AA': 'a', 'AE': '@', 'AH': 'A', 'AO': 'c', 'AW': 'W', 'AY': 'Y', 'B': 'b', 'CH': 'C', 'D': 'd', 'DH': 'D', 'EH': 'E', 'ER': 'R', 'EY': 'e', 'F': 'f', 'G': 'g', 'HH': 'h', 'IH': 'I', 'IY': 'i', 'JH': 'J', 'K': 'k', 'L': 'l', 'M': 'm', 'N': 'n', 'NG': 'G', 'OW': 'o', 'OY': 'O', 'P': 'p', 'R': 'r', 'S': 's', 'SH': 'S', 'T': 't', 'TH': 'T', 'UH': 'U', 'UW': 'u', 'V': 'v', 'W': 'w', 'WH': 'H', 'Y': 'y', 'Z': 'z', 'ZH': 'Z'}

Convert CMU phonetic codes to single letters

tul = {'@': 'AE', 'A': 'AH', 'C': 'CH', 'D': 'DH', 'E': 'EH', 'G': 'NG', 'H': 'WH', 'I': 'IH', 'J': 'JH', 'O': 'OY', 'R': 'ER', 'S': 'SH', 'T': 'TH', 'U': 'UH', 'W': 'AW', 'Y': 'AY', 'Z': 'ZH', 'a': 'AA', 'b': 'B', 'c': 'AO', 'd': 'D', 'e': 'EY', 'f': 'F', 'g': 'G', 'h': 'HH', 'i': 'IY', 'k': 'K', 'l': 'L', 'm': 'M', 'n': 'N', 'o': 'OW', 'p': 'P', 'r': 'R', 's': 'S', 't': 'T', 'u': 'UW', 'v': 'V', 'w': 'W', 'y': 'Y', 'z': 'Z'}

inverse lookup, single letter to original code

ipa = {'@': 'æ', 'A': 'ʌ', 'C': 'tʃ', 'D': 'ð', 'E': 'ɛ', 'G': 'ŋ', 'H': 'ʍ', 'I': 'ɪ', 'J': 'dʒ', 'O': 'ɔɪ', 'R': 'ɝ', 'S': 'ʃ', 'T': 'θ', 'U': 'ʊ', 'W': 'aʊ', 'Y': 'aɪ', 'Z': 'ʒ', 'a': 'ɑ', 'b': 'b', 'c': 'ɔ', 'd': 'd', 'e': 'eɪ', 'f': 'f', 'g': 'ɡ', 'h': 'h', 'i': 'i', 'k': 'k', 'l': 'l', 'm': 'm', 'n': 'n', 'o': 'oʊ', 'p': 'p', 'r': 'ɹ', 's': 's', 't': 't', 'u': 'u', 'v': 'v', 'w': 'w', 'y': 'j', 'z': 'z'}

lut from single-letter code to IPA symbols
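
As an illustration of how the three tables relate, here is a short sketch that uses a subset of the entries listed above. The helper names (encode, to_arpabet, to_ipa) are hypothetical, and stripping the stress digits that CMUdict appends to vowels is an assumption about preprocessing, not a statement of what clean() does.

```python
import re
from typing import List

# Subset of the documented tables, copied from the values above.
LUT = {'HH': 'h', 'AH': 'A', 'L': 'l', 'OW': 'o', 'CH': 'C', 'IY': 'i', 'Z': 'z'}
TUL = {letter: code for code, letter in LUT.items()}  # inverse of LUT
IPA = {'h': 'h', 'A': 'ʌ', 'l': 'l', 'o': 'oʊ', 'C': 'tʃ', 'i': 'i', 'z': 'z'}


def encode(pronunciation: str) -> str:
    """Map a space-separated CMU pronunciation to one letter per phoneme."""
    # Assumption: stress markers (0, 1, 2) are dropped before lookup.
    phones = [re.sub(r'\d', '', p) for p in pronunciation.split()]
    return ''.join(LUT[p] for p in phones)


def to_arpabet(word: str) -> List[str]:
    """Recover the original codes from the single-letter form."""
    return [TUL[ch] for ch in word]


def to_ipa(word: str) -> str:
    """Render the single-letter form in IPA symbols."""
    return ''.join(IPA[ch] for ch in word)


hello = encode('HH AH0 L OW1')
cheese = encode('CH IY1 Z')
```

With one character per phoneme, a pronunciation can be treated like an ordinary string, so string operations such as slicing or substring search work on whole phonemes.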

clean(to_clean: str) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters

to_clean (str) – the big long string to clean

Returns

list of words

_abc_impl = <_abc_data object>
class recurse_words.corpi.CMUDict_Common(nth_most=10000, *args, **kwargs)

Bases: recurse_words.corpi.CMUDict

Args: get (bool): get the corpus on init

Attributes:

Methods:

clean(to_clean)

Clean the corpus of any debris, returning a list of strings

name = 'phonetic_common'
clean(to_clean: str) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters

to_clean (str) – the big long string to clean

Returns

list of words

_abc_impl = <_abc_data object>
class recurse_words.corpi.Proteins(get=False, cache_dir='~/.recurse_words')

Bases: recurse_words.corpi.Corpus

Parameters

get (bool) – get the corpus on init

Attributes:

_abc_impl

name

url

decode

decode the raw bytes from the request in download_corpus

Methods:

clean(to_clean)

Clean the corpus of any debris, returning a list of strings

_abc_impl = <_abc_data object>
name = 'proteins'
url = 'https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_protein.faa.gz'
decode = False

decode the raw bytes from the request in download_corpus

clean(to_clean: bytes) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters

to_clean (bytes) – the raw, undecoded bytes to clean

Returns

list of words
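
Since decode is False and the URL points at a gzipped FASTA file, clean() here receives compressed bytes. The sketch below shows one way such bytes could be turned into a list of sequences. The function name clean_fasta and the parsing rules are assumptions for illustration, not the library's implementation.

```python
import gzip
from typing import List


def clean_fasta(to_clean: bytes) -> List[str]:
    """Decompress gzipped FASTA bytes and return one string per record."""
    text = gzip.decompress(to_clean).decode("ascii")
    sequences: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A header line starts a new record; flush the previous one.
            if current:
                sequences.append("".join(current))
                current = []
        elif line.strip():
            current.append(line.strip())
    if current:
        sequences.append("".join(current))
    return sequences


# Tiny made-up sample standing in for the downloaded file.
sample = gzip.compress(b">NP_1 demo protein\nMKT\nAYI\n>NP_2 another\nGGS\n")
proteins = clean_fasta(sample)
```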

recurse_words.corpi.get_corpus(corp_name: str) → Type[recurse_words.corpi.Corpus]
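
The signature shows that get_corpus returns a class rather than an instance. One plausible way to implement such a lookup is to match corp_name against each subclass's name attribute. The sketch below uses stand-in classes and a hypothetical get_corpus_sketch function; how the library actually resolves names is not documented here.

```python
from typing import Iterator, Type


class Corpus:
    """Stand-in base class for this sketch only."""
    name = ''


class English(Corpus):
    name = 'english'


class CMUDict(Corpus):
    name = 'phonetic'


class CMUDict_Common(CMUDict):
    name = 'phonetic_common'


def _all_subclasses(cls: type) -> Iterator[type]:
    """Yield every direct and indirect subclass of cls."""
    for sub in cls.__subclasses__():
        yield sub
        yield from _all_subclasses(sub)


def get_corpus_sketch(corp_name: str) -> Type[Corpus]:
    """Return the corpus class whose name attribute equals corp_name."""
    by_name = {c.name: c for c in _all_subclasses(Corpus)}
    try:
        return by_name[corp_name]
    except KeyError:
        raise ValueError(f"unknown corpus name: {corp_name!r}") from None


corpus_class = get_corpus_sketch('phonetic_common')
```

The caller then instantiates the returned class itself, which leaves constructor arguments such as get and cache_dir under the caller's control.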