Corpi
Classes:
Corpus: Class to get a corpus
Txt
English
Common_English
CMUDict: CMU Phonetic dictionary
CMUDict_Common
Proteins
Functions:
get_corpus
class recurse_words.corpi.Corpus(get=False, cache_dir='~/.recurse_words')
Bases: abc.ABC
Class to get a corpus
Parameters
get (bool) – get the corpus on init
Attributes:
name
url
decode: whether to decode the raw bytes from the request in download_corpus
Methods:
__init__([get, cache_dir])
download_corpus(): Return the raw bytes of the request
_load()
_save(corpus)
get()
clean(to_clean): Clean the corpus of any debris, returning a list of strings

name = ''

url = ''

decode = True
Whether to decode the raw bytes from the request in download_corpus

__init__(get=False, cache_dir='~/.recurse_words')
Parameters
get (bool) – get the corpus on init

property corpus

property corpus_dict

download_corpus() → Union[bytes, str]
Return the raw bytes of the request

_load() → List[str]

_save(corpus: List[str])

get() → List[str]

abstract clean(to_clean: str) → List[str]
Clean the corpus of any debris, returning a list of strings
Parameters
to_clean (str) – the big long string to clean
Returns
list of words
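
Because clean is abstract, each concrete corpus has to supply its own version. The sketch below shows that pattern with a hypothetical stand-in base class (SketchCorpus) that only mirrors the documented name, url, decode and clean members; it is not the library's implementation, and LineCorpus is an invented example subclass.

```python
from abc import ABC, abstractmethod
from typing import List


class SketchCorpus(ABC):
    """Hypothetical stand-in that mirrors the documented Corpus members."""

    name = ''      # short identifier for the corpus
    url = ''       # where the raw corpus would be fetched from
    decode = True  # whether raw bytes are decoded to str before cleaning

    @abstractmethod
    def clean(self, to_clean: str) -> List[str]:
        """Turn one big string into a list of words."""


class LineCorpus(SketchCorpus):
    """Hypothetical subclass: one word per line, blank lines dropped."""

    name = 'lines'

    def clean(self, to_clean: str) -> List[str]:
        # Strip whitespace around each line and skip lines that end up empty.
        return [line.strip() for line in to_clean.splitlines() if line.strip()]


words = LineCorpus().clean("apple\n\n banana \ncherry\n")
print(words)
```

Instantiating SketchCorpus directly raises TypeError, which is what forces subclasses to define clean.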

class recurse_words.corpi.Txt(path: pathlib.Path, sep='\n', *args, **kwargs)
Bases: recurse_words.corpi.Corpus
Parameters
get (bool) – get the corpus on init
Attributes:
name
Methods:
download_corpus(): Return the raw bytes of the request
clean(to_clean): Clean the corpus of any debris, returning a list of strings

name = 'txt'

download_corpus() → Union[bytes, str]
Return the raw bytes of the request

clean(to_clean: str) → List[str]
Clean the corpus of any debris, returning a list of strings
Parameters
to_clean (str) – the big long string to clean
Returns
list of words
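
The Txt signature takes a path and a separator, which suggests reading a local text file and splitting it on sep. The following is a minimal sketch of that idea only: split_text_file is a hypothetical helper, and dropping empty items is an assumption rather than documented behaviour.

```python
import tempfile
from pathlib import Path
from typing import List


def split_text_file(path: Path, sep: str = '\n') -> List[str]:
    """Hypothetical helper: read a text file and split it on `sep`."""
    text = path.read_text(encoding='utf-8')
    # Assumption: surrounding whitespace and empty items are discarded.
    return [item.strip() for item in text.split(sep) if item.strip()]


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'words.txt'
    demo.write_text('alpha,beta,,gamma\n', encoding='utf-8')
    items = split_text_file(demo, sep=',')

print(items)
```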

class recurse_words.corpi.English(get=False, cache_dir='~/.recurse_words')
Bases: recurse_words.corpi.Corpus
Parameters
get (bool) – get the corpus on init
Attributes:
name
url
Methods:
clean(to_clean): Clean the corpus of any debris, returning a list of strings

name = 'english'

url = 'https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt'

clean(to_clean: str) → List[str]
Clean the corpus of any debris, returning a list of strings
Parameters
to_clean (str) – the big long string to clean
Returns
list of words

class recurse_words.corpi.Common_English(nth_most: int = 10000, *args, **kwargs)
Bases: recurse_words.corpi.Corpus
Parameters
nth_most (int) – keep only the nth_most most common words; default 10000 (the whole list)
Attributes:
name
url
Methods:
__init__([nth_most])
clean(to_clean): Clean the corpus of any debris, returning a list of strings

name = 'common'

url = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa.txt'

__init__(nth_most: int = 10000, *args, **kwargs)
Parameters
nth_most (int) – keep only the nth_most most common words; default 10000 (the whole list)

clean(to_clean: str) → List[str]
Clean the corpus of any debris, returning a list of strings
Parameters
to_clean (str) – the big long string to clean
Returns
list of words
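
To illustrate what nth_most might do, here is a minimal sketch that assumes the source list holds one word per line, ordered from most to least common, so keeping the top n is a slice. The function name most_common is hypothetical and not part of the documented API.

```python
from typing import List


def most_common(to_clean: str, nth_most: int = 10000) -> List[str]:
    """Hypothetical helper: keep the first `nth_most` words of a ranked list."""
    # Assumption: one word per line, most frequent first.
    words = [w.strip() for w in to_clean.split('\n') if w.strip()]
    return words[:nth_most]


ranked = "the\nof\nand\nto\na\n"
top3 = most_common(ranked, nth_most=3)
print(top3)
```

If nth_most is larger than the list, the slice simply returns every word.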

class recurse_words.corpi.CMUDict(get=False, cache_dir='~/.recurse_words')
Bases: recurse_words.corpi.Corpus
CMU Phonetic dictionary
Parameters
get (bool) – get the corpus on init
Attributes:
name
url
lut: Convert CMU phonetic codes to single letters
tul: Inverse lookup, single letter to original code
ipa: Lookup table from single-letter code to IPA symbols
Methods:
clean(to_clean): Clean the corpus of any debris, returning a list of strings

name = 'phonetic'

url = 'http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b'

lut = {'AA': 'a', 'AE': '@', 'AH': 'A', 'AO': 'c', 'AW': 'W', 'AY': 'Y', 'B': 'b', 'CH': 'C', 'D': 'd', 'DH': 'D', 'EH': 'E', 'ER': 'R', 'EY': 'e', 'F': 'f', 'G': 'g', 'HH': 'h', 'IH': 'I', 'IY': 'i', 'JH': 'J', 'K': 'k', 'L': 'l', 'M': 'm', 'N': 'n', 'NG': 'G', 'OW': 'o', 'OY': 'O', 'P': 'p', 'R': 'r', 'S': 's', 'SH': 'S', 'T': 't', 'TH': 'T', 'UH': 'U', 'UW': 'u', 'V': 'v', 'W': 'w', 'WH': 'H', 'Y': 'y', 'Z': 'z', 'ZH': 'Z'}
Convert CMU phonetic codes to single letters

tul = {'@': 'AE', 'A': 'AH', 'C': 'CH', 'D': 'DH', 'E': 'EH', 'G': 'NG', 'H': 'WH', 'I': 'IH', 'J': 'JH', 'O': 'OY', 'R': 'ER', 'S': 'SH', 'T': 'TH', 'U': 'UH', 'W': 'AW', 'Y': 'AY', 'Z': 'ZH', 'a': 'AA', 'b': 'B', 'c': 'AO', 'd': 'D', 'e': 'EY', 'f': 'F', 'g': 'G', 'h': 'HH', 'i': 'IY', 'k': 'K', 'l': 'L', 'm': 'M', 'n': 'N', 'o': 'OW', 'p': 'P', 'r': 'R', 's': 'S', 't': 'T', 'u': 'UW', 'v': 'V', 'w': 'W', 'y': 'Y', 'z': 'Z'}
Inverse lookup, single letter to original code

ipa = {'@': 'æ', 'A': 'ʌ', 'C': 'tʃ', 'D': 'ð', 'E': 'ɛ', 'G': 'ŋ', 'H': 'ʍ', 'I': 'ɪ', 'J': 'dʒ', 'O': 'ɔɪ', 'R': 'ɝ', 'S': 'ʃ', 'T': 'θ', 'U': 'ʊ', 'W': 'aʊ', 'Y': 'aɪ', 'Z': 'ʒ', 'a': 'ɑ', 'b': 'b', 'c': 'ɔ', 'd': 'd', 'e': 'eɪ', 'f': 'f', 'g': 'ɡ', 'h': 'h', 'i': 'i', 'k': 'k', 'l': 'l', 'm': 'm', 'n': 'n', 'o': 'oʊ', 'p': 'p', 'r': 'ɹ', 's': 's', 't': 't', 'u': 'u', 'v': 'v', 'w': 'w', 'y': 'j', 'z': 'z'}
Lookup table from single-letter code to IPA symbols

clean(to_clean: str) → List[str]
Clean the corpus of any debris, returning a list of strings
Parameters
to_clean (str) – the big long string to clean
Returns
list of words
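
The lookup tables can be chained: a phoneme sequence maps to single-letter codes, and those codes map to IPA. The sketch below copies only a handful of entries from the documented lut and ipa tables. The encode and to_ipa helpers are hypothetical, and stripping the stress digits that CMUdict attaches to vowels (for example AH0, OW1) is an assumption about how the codes would be normalised.

```python
import re

# Subset of the documented lut (phonetic code -> single letter).
LUT = {'HH': 'h', 'AH': 'A', 'L': 'l', 'OW': 'o', 'W': 'w', 'ER': 'R', 'D': 'd'}
# Matching subset of the documented ipa table (single letter -> IPA).
IPA = {'h': 'h', 'A': 'ʌ', 'l': 'l', 'o': 'oʊ', 'w': 'w', 'R': 'ɝ', 'd': 'd'}


def encode(phonemes: str) -> str:
    """Hypothetical helper: space-separated phonemes -> single-letter string."""
    # Assumption: stress digits on vowels are dropped before lookup.
    codes = [re.sub(r'\d', '', p) for p in phonemes.split()]
    return ''.join(LUT[c] for c in codes)


def to_ipa(encoded: str) -> str:
    """Hypothetical helper: single-letter string -> IPA string."""
    return ''.join(IPA[ch] for ch in encoded)


hello = encode('HH AH0 L OW1')
world = encode('W ER1 L D')
print(hello, to_ipa(hello))
print(world, to_ipa(world))
```

Encoding each phoneme as one character means a pronunciation can be handled like an ordinary string, with one character per sound.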

class recurse_words.corpi.CMUDict_Common(nth_most=10000, *args, **kwargs)
Bases: recurse_words.corpi.CMUDict
Parameters
get (bool) – get the corpus on init
Attributes:
name
Methods:
clean(to_clean): Clean the corpus of any debris, returning a list of strings

name = 'phonetic_common'

clean(to_clean: str) → List[str]
Clean the corpus of any debris, returning a list of strings
Parameters
to_clean (str) – the big long string to clean
Returns
list of words

class recurse_words.corpi.Proteins(get=False, cache_dir='~/.recurse_words')
Bases: recurse_words.corpi.Corpus
Parameters
get (bool) – get the corpus on init
Attributes:
name
url
decode
Methods:
clean(to_clean): Clean the corpus of any debris, returning a list of strings

name = 'proteins'

url = 'https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_protein.faa.gz'

decode = False
Whether to decode the raw bytes from the request in download_corpus

clean(to_clean: bytes) → List[str]
Clean the corpus of any debris, returning a list of strings
Parameters
to_clean (bytes) – the raw bytes to clean
Returns
list of words
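
Since decode is False here and the URL ends in .faa.gz, clean presumably receives gzip-compressed FASTA bytes. The following sketch shows one way such bytes could be turned into a list of sequences. fasta_sequences is a hypothetical helper, and treating each whole protein sequence as one "word" is an assumption, not documented behaviour.

```python
import gzip
from typing import List


def fasta_sequences(raw: bytes) -> List[str]:
    """Hypothetical helper: gzip-compressed FASTA bytes -> list of sequences."""
    text = gzip.decompress(raw).decode('ascii')
    sequences: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith('>'):
            # A header line starts a new record; flush the previous one.
            if current:
                sequences.append(''.join(current))
                current = []
        elif line.strip():
            # Sequence data may be wrapped over several lines.
            current.append(line.strip())
    if current:
        sequences.append(''.join(current))
    return sequences


demo = gzip.compress(b">seq1 demo\nMKT\nAYI\n>seq2 demo\nGGS\n")
seqs = fasta_sequences(demo)
print(seqs)
```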

recurse_words.corpi.get_corpus(corp_name: str) → Type[recurse_words.corpi.Corpus]
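
The signature returns a class rather than an instance, so one plausible reading is a lookup that matches corp_name against each subclass's name attribute. How get_corpus actually resolves names is not documented here, so the sketch below is an assumption; Base, Alpha, AlphaSmall and find_by_name are all hypothetical.

```python
from typing import Type


class Base:
    """Hypothetical stand-in for a corpus base class."""
    name = ''


class Alpha(Base):
    name = 'alpha'


class AlphaSmall(Alpha):
    name = 'alpha_small'


def find_by_name(base: Type[Base], wanted: str) -> Type[Base]:
    """Search all direct and indirect subclasses of `base` for a matching name."""
    stack = list(base.__subclasses__())
    while stack:
        cls = stack.pop()
        if cls.name == wanted:
            return cls
        # Descend so that subclasses of subclasses are also found.
        stack.extend(cls.__subclasses__())
    raise ValueError(f'no corpus named {wanted!r}')


found = find_by_name(Base, 'alpha_small')
print(found.__name__)
```

Returning the class leaves construction to the caller, who can then pass arguments such as get or cache_dir.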