Corpi

recurse_words.corpi

Classes:

`Corpus`([get, cache_dir])	Class to get a corpus
`Txt`(path[, sep])	Parameters get (bool) – get the corpus on init
`English`([get, cache_dir])	Parameters get (bool) – get the corpus on init
`Common_English`([nth_most])	Parameters nth_most (int) – keep only the n most common words, default 10000 (the whole list)
`CMUDict`([get, cache_dir])	CMU Phonetic dictionary
`CMUDict_Common`([nth_most])	Parameters get (bool) – get the corpus on init
`Proteins`([get, cache_dir])	Parameters get (bool) – get the corpus on init

Functions:

get_corpus(corp_name)

class recurse_words.corpi.Corpus(get=False, cache_dir='~/.recurse_words')

Bases: abc.ABC

Class to get a corpus

Parameters: get (bool) – get the corpus on init

Attributes:

`name`
`url`
`decode`	decode the raw bytes from the request in download_corpus
`corpus`
`corpus_dict`
`_abc_impl`

Methods:

`__init__`([get, cache_dir])	Parameters get (bool) – get the corpus on init
`download_corpus`()	Return the raw bytes of the request
`_load`()
`_save`(corpus)
`get`()
`clean`(to_clean)	Clean the corpus of any debris, returning a list of strings

name = ''

url = ''

decode = True: decode the raw bytes from the request in download_corpus

__init__(get=False, cache_dir='~/.recurse_words')

Parameters: get (bool) – get the corpus on init

property corpus

property corpus_dict

download_corpus() → Union[bytes, str]: Return the raw bytes of the request

_load() → List[str]

_save(corpus: List[str])

get() → List[str]

abstract clean(to_clean: str) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters: to_clean (str) – the big long string to clean
Returns: list of words

_abc_impl = <_abc_data object>
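Since `clean` is abstract, every concrete corpus has to supply its own. The following is a minimal, self-contained sketch of that pattern, not the library's code: `SketchCorpus` and `Animals` are hypothetical stand-ins, and the way `get` chains `download_corpus`, the `decode` flag, and `clean` is an assumption made for illustration (caching is left out entirely).

```python
from abc import ABC, abstractmethod
from typing import List, Union


class SketchCorpus(ABC):
    """Hypothetical stand-in mirroring the documented attributes; not the library class."""

    name = ''
    url = ''
    decode = True  # documented flag: decode the raw bytes from the request

    def download_corpus(self) -> Union[bytes, str]:
        # Assumption: the real method fetches `url`; here subclasses supply data directly.
        raise NotImplementedError

    def get(self) -> List[str]:
        raw = self.download_corpus()
        # Assumption: bytes are decoded to text only when `decode` is True.
        if self.decode and isinstance(raw, bytes):
            raw = raw.decode('utf-8')
        return self.clean(raw)

    @abstractmethod
    def clean(self, to_clean: str) -> List[str]:
        """Turn the big long string into a list of words."""


class Animals(SketchCorpus):
    """Hypothetical corpus that returns in-memory data instead of downloading."""

    name = 'animals'

    def download_corpus(self) -> bytes:
        return b"cat\nDog\n\n  emu \n"

    def clean(self, to_clean: str) -> List[str]:
        # Drop blank lines, strip surrounding whitespace, lowercase each word.
        return [w.strip().lower() for w in to_clean.splitlines() if w.strip()]


words = Animals().get()
print(words)
```

Instantiating `SketchCorpus` directly raises `TypeError`, which is the usual behaviour of a class derived from `abc.ABC` that still has an unimplemented abstract method.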

class recurse_words.corpi.Txt(path: pathlib.Path, sep='\n', *args, **kwargs)

Bases: recurse_words.corpi.Corpus

Parameters: get (bool) – get the corpus on init

Attributes:

`name`
`_abc_impl`

Methods:

`download_corpus`()	Return the raw bytes of the request
`clean`(to_clean)	Clean the corpus of any debris, returning a list of strings

name = 'txt'

download_corpus() → Union[bytes, str]: Return the raw bytes of the request

clean(to_clean: str) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters: to_clean (str) – the big long string to clean
Returns: list of words

_abc_impl = <_abc_data object>
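The `path` and `sep` arguments suggest that `Txt` reads a local file and splits it on a separator. A rough sketch of that idea follows; `read_words` is a hypothetical helper, and stripping whitespace and dropping empty entries are assumptions rather than documented behaviour.

```python
import tempfile
from pathlib import Path
from typing import List


def read_words(path: Path, sep: str = '\n') -> List[str]:
    """Hypothetical helper: read a text file and split it on `sep`."""
    text = Path(path).read_text(encoding='utf-8')
    # Assumption: surrounding whitespace and empty entries are discarded.
    return [w.strip() for w in text.split(sep) if w.strip()]


with tempfile.TemporaryDirectory() as tmp:
    word_file = Path(tmp) / 'words.csv'
    word_file.write_text('alpha, beta,,gamma\n', encoding='utf-8')
    comma_words = read_words(word_file, sep=',')

print(comma_words)
```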

class recurse_words.corpi.English(get=False, cache_dir='~/.recurse_words')

Bases: recurse_words.corpi.Corpus

Parameters: get (bool) – get the corpus on init

Attributes:

`name`
`url`
`_abc_impl`

Methods:

`clean`(to_clean)	Clean the corpus of any debris, returning a list of strings

name = 'english'

url = 'https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt'

clean(to_clean: str) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters: to_clean (str) – the big long string to clean
Returns: list of words

_abc_impl = <_abc_data object>

class recurse_words.corpi.Common_English(nth_most: int = 10000, *args, **kwargs)

Bases: recurse_words.corpi.Corpus

Parameters: nth_most (int) – keep only the n most common words, default 10000 (the whole list)

Attributes:

`name`
`url`
`_abc_impl`

Methods:

`__init__`([nth_most])	Parameters nth_most (int) – keep only the n most common words, default 10000 (the whole list)
`clean`(to_clean)	Clean the corpus of any debris, returning a list of strings

name = 'common'

url = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa.txt'

__init__(nth_most: int = 10000, *args, **kwargs)

Parameters: nth_most (int) – keep only the n most common words, default 10000 (the whole list)

clean(to_clean: str) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters: to_clean (str) – the big long string to clean
Returns: list of words

_abc_impl = <_abc_data object>
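To illustrate what `nth_most` plausibly does: if the downloaded list is ordered from most to least frequent, keeping the most common words is a slice from the front. `most_common` below is a hypothetical helper written under that assumption, not a function of the library.

```python
from typing import List


def most_common(words: List[str], nth_most: int = 10000) -> List[str]:
    """Hypothetical helper: keep the first `nth_most` entries of a frequency-ordered list."""
    # Assumption: `words` is already sorted from most to least frequent.
    return words[:nth_most]


ranked = ['the', 'of', 'and', 'to', 'a']
top3 = most_common(ranked, nth_most=3)
print(top3)
```

With the default of 10000 and a 10000-word source list, the slice returns the whole list, which matches the "(the whole list)" note on the parameter.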

class recurse_words.corpi.CMUDict(get=False, cache_dir='~/.recurse_words')

Bases: recurse_words.corpi.Corpus

CMU Phonetic dictionary

Parameters: get (bool) – get the corpus on init

Attributes:

`name`
`url`
`lut`	convert CMU phonetic codes to single letters
`tul`	inverse lookup, single letter to original code
`ipa`	lookup table from single-letter code to IPA symbols
`_abc_impl`

Methods:

`clean`(to_clean)	Clean the corpus of any debris, returning a list of strings

name = 'phonetic'

url = 'http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b'

lut = {'AA': 'a', 'AE': '@', 'AH': 'A', 'AO': 'c', 'AW': 'W', 'AY': 'Y', 'B': 'b', 'CH': 'C', 'D': 'd', 'DH': 'D', 'EH': 'E', 'ER': 'R', 'EY': 'e', 'F': 'f', 'G': 'g', 'HH': 'h', 'IH': 'I', 'IY': 'i', 'JH': 'J', 'K': 'k', 'L': 'l', 'M': 'm', 'N': 'n', 'NG': 'G', 'OW': 'o', 'OY': 'O', 'P': 'p', 'R': 'r', 'S': 's', 'SH': 'S', 'T': 't', 'TH': 'T', 'UH': 'U', 'UW': 'u', 'V': 'v', 'W': 'w', 'WH': 'H', 'Y': 'y', 'Z': 'z', 'ZH': 'Z'}: convert CMU phonetic codes to single letters

tul = {'@': 'AE', 'A': 'AH', 'C': 'CH', 'D': 'DH', 'E': 'EH', 'G': 'NG', 'H': 'WH', 'I': 'IH', 'J': 'JH', 'O': 'OY', 'R': 'ER', 'S': 'SH', 'T': 'TH', 'U': 'UH', 'W': 'AW', 'Y': 'AY', 'Z': 'ZH', 'a': 'AA', 'b': 'B', 'c': 'AO', 'd': 'D', 'e': 'EY', 'f': 'F', 'g': 'G', 'h': 'HH', 'i': 'IY', 'k': 'K', 'l': 'L', 'm': 'M', 'n': 'N', 'o': 'OW', 'p': 'P', 'r': 'R', 's': 'S', 't': 'T', 'u': 'UW', 'v': 'V', 'w': 'W', 'y': 'Y', 'z': 'Z'}: inverse lookup, single letter to original code

ipa = {'@': 'æ', 'A': 'ʌ', 'C': 'tʃ', 'D': 'ð', 'E': 'ɛ', 'G': 'ŋ', 'H': 'ʍ', 'I': 'ɪ', 'J': 'dʒ', 'O': 'ɔɪ', 'R': 'ɝ', 'S': 'ʃ', 'T': 'θ', 'U': 'ʊ', 'W': 'aʊ', 'Y': 'aɪ', 'Z': 'ʒ', 'a': 'ɑ', 'b': 'b', 'c': 'ɔ', 'd': 'd', 'e': 'eɪ', 'f': 'f', 'g': 'ɡ', 'h': 'h', 'i': 'i', 'k': 'k', 'l': 'l', 'm': 'm', 'n': 'n', 'o': 'oʊ', 'p': 'p', 'r': 'ɹ', 's': 's', 't': 't', 'u': 'u', 'v': 'v', 'w': 'w', 'y': 'j', 'z': 'z'}: lookup table from single-letter code to IPA symbols

clean(to_clean: str) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters: to_clean (str) – the big long string to clean
Returns: list of words

_abc_impl = <_abc_data object>
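The three tables are easier to follow with a worked example. The sketch below copies a small subset of the documented `lut` and `ipa` entries and derives the inverse table the way `tul` relates to `lut`. The helper names (`encode`, `decode`, `to_ipa`) are hypothetical, and dropping the stress digits that CMUdict attaches to vowels before lookup is an assumption about how `clean` might use the table.

```python
import re
from typing import List

# Subset of the documented `lut` and `ipa` tables, copied from the attributes above.
LUT = {'HH': 'h', 'AH': 'A', 'L': 'l', 'OW': 'o', 'TH': 'T', 'IH': 'I', 'NG': 'G'}
IPA = {'h': 'h', 'A': 'ʌ', 'l': 'l', 'o': 'oʊ', 'T': 'θ', 'I': 'ɪ', 'G': 'ŋ'}
# The inverse lookup is the same mapping with keys and values swapped.
TUL = {letter: code for code, letter in LUT.items()}


def encode(phones: List[str]) -> str:
    """Map phonetic codes to a string with one letter per phoneme."""
    # Assumption: stress digits (0, 1, 2) are discarded before the lookup.
    return ''.join(LUT[re.sub(r'\d', '', p)] for p in phones)


def decode(word: str) -> List[str]:
    """Map a single-letter string back to the original codes (stress is lost)."""
    return [TUL[ch] for ch in word]


def to_ipa(word: str) -> str:
    """Map a single-letter string to IPA symbols."""
    return ''.join(IPA[ch] for ch in word)


hello = encode('HH AH0 L OW1'.split())
thing = encode('TH IH1 NG'.split())
print(hello, to_ipa(hello))    # hAlo hʌloʊ
print(thing, decode(thing))    # TIG ['TH', 'IH', 'NG']
```

One consequence of the single-letter encoding is that every phoneme occupies exactly one character, so ordinary string operations such as slicing or substring search operate on whole phonemes.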

class recurse_words.corpi.CMUDict_Common(nth_most=10000, *args, **kwargs)

Bases: recurse_words.corpi.CMUDict

Parameters: get (bool) – get the corpus on init

Attributes:

`name`
`_abc_impl`

Methods:

`clean`(to_clean)	Clean the corpus of any debris, returning a list of strings

name = 'phonetic_common'

clean(to_clean: str) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters: to_clean (str) – the big long string to clean
Returns: list of words

_abc_impl = <_abc_data object>

class recurse_words.corpi.Proteins(get=False, cache_dir='~/.recurse_words')

Bases: recurse_words.corpi.Corpus

Parameters: get (bool) – get the corpus on init

Attributes:

`_abc_impl`
`name`
`url`
`decode`	decode the raw bytes from the request in download_corpus

Methods:

`clean`(to_clean)	Clean the corpus of any debris, returning a list of strings

_abc_impl = <_abc_data object>

name = 'proteins'

url = 'https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_protein.faa.gz'

decode = False: decode the raw bytes from the request in download_corpus

clean(to_clean: bytes) → List[str]

Clean the corpus of any debris, returning a list of strings

Parameters: to_clean (bytes) – the raw bytes to clean
Returns: list of words
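Because the URL points at a gzipped FASTA file and `decode` is `False`, `clean` receives compressed bytes rather than text. One way such input could be turned into a list of sequences is sketched below. `parse_fasta_gz` is a hypothetical helper, the two records are made up, and treating each protein sequence as one "word" is an assumption; the actual `clean` implementation is not shown on this page.

```python
import gzip
from typing import List


def parse_fasta_gz(raw: bytes) -> List[str]:
    """Hypothetical helper: gunzip FASTA bytes and return one sequence string per record."""
    text = gzip.decompress(raw).decode('ascii')
    sequences: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith('>'):
            # A header line starts a new record; flush the previous one.
            if current:
                sequences.append(''.join(current))
                current = []
        elif line.strip():
            # Sequence lines of one record are concatenated.
            current.append(line.strip())
    if current:
        sequences.append(''.join(current))
    return sequences


# Made-up records, compressed in memory so nothing is downloaded.
fasta = b">record_1 made-up example\nMKT\nAYI\n>record_2 made-up example\nMSD\n"
proteins = parse_fasta_gz(gzip.compress(fasta))
print(proteins)
```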

recurse_words.corpi.get_corpus(corp_name: str) → Type[recurse_words.corpi.Corpus]
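The signature indicates that `get_corpus` returns a corpus class, not an instance. A plausible reading, given that every class above defines a `name`, is a lookup by that attribute. The sketch below shows one way such a lookup could work; the classes are bare stand-ins, and `all_subclasses`, `get_corpus_sketch`, and the `ValueError` on unknown names are all assumptions rather than the library's behaviour.

```python
from typing import Dict, List, Type


class Corpus:
    """Bare stand-in for the documented base class."""
    name = ''


class English(Corpus):
    name = 'english'


class CMUDict(Corpus):
    name = 'phonetic'


class CMUDict_Common(CMUDict):
    name = 'phonetic_common'


def all_subclasses(cls: type) -> List[type]:
    """Collect direct and indirect subclasses, e.g. CMUDict_Common via CMUDict."""
    found: List[type] = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(all_subclasses(sub))
    return found


def get_corpus_sketch(corp_name: str) -> Type[Corpus]:
    """Hypothetical lookup: return the class whose `name` matches `corp_name`."""
    registry: Dict[str, Type[Corpus]] = {c.name: c for c in all_subclasses(Corpus)}
    if corp_name not in registry:
        raise ValueError(f'unknown corpus name: {corp_name!r}')
    return registry[corp_name]


corpus_cls = get_corpus_sketch('phonetic_common')
print(corpus_cls.__name__)
```

Returning the class leaves it to the caller to decide how to instantiate it, for example with `get=True` or a custom `cache_dir`.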