germanetpy.icbased_similarity
Classes
|
This class is a configuration object that helps filter GermaNet's lexical units and Synsets to extract those with certain properties of interest. |
|
The IC-based measures are computed based on relative frequencies of words in a large corpus. |
|
This Enum represents the semantic relatedness measures |
- class germanetpy.icbased_similarity.ICBasedSimilarity(germanet, wordcategory, path: str, separator: str = '\t')
Bases: object
The IC-based measures are computed based on relative frequencies of words in a large corpus. Synset frequencies are computed by adding up the frequencies of all words that belong to a Synset. These measures cannot be computed between Synsets of different word categories.
- create_simple_freq_dic(word_category, path: str, separator: str)
Reads in the frequency list files and stores the frequency information for each Synset in a dictionary. The keys are the Synset IDs. This method also adds all available Synset frequencies for the given category.
- Parameters:
word_category (WordCategory) – The word category
path – The path to a frequency list containing words and their frequencies in a corpus
separator – The character that separates a word and its frequency in the given frequency list
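As a rough illustration of the input this method expects, the sketch below parses a separator-delimited frequency list and sums word frequencies per Synset ID. The sample data, the `word2synsets` mapping and the helper `simple_synset_frequencies` are hypothetical; they only mimic the idea described above and are not germanetpy's implementation.

```python
import io
from collections import defaultdict


def simple_synset_frequencies(lines, word2synsets, separator="\t"):
    """Sum word frequencies per synset ID (hypothetical helper, not germanetpy code)."""
    synset2freq = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, freq = line.split(separator)
        # A word contributes its frequency to every synset it belongs to.
        for synset_id in word2synsets.get(word, []):
            synset2freq[synset_id] += int(freq)
    return dict(synset2freq)


# Toy frequency list: one "word<separator>frequency" pair per line (made-up numbers).
freq_list = io.StringIO("Hund\t120\nKöter\t5\nKatze\t90\n")
# Toy mapping from words to synset IDs (made-up IDs).
word2synsets = {"Hund": ["s1"], "Köter": ["s1"], "Katze": ["s2"]}

synset_freqs = simple_synset_frequencies(freq_list, word2synsets)
print(synset_freqs)
```

Words that do not appear in the mapping are ignored in this sketch; how the library treats such words is not stated here.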
- init_min_max_normalization_values(synset_pair) -> dict
This method computes the minimum values (the two Synsets are equal) and the maximum values (the two Synsets are maximally far apart in the graph) used for normalization.
- init_ic_map()
Computes the information content for each synset in GermaNet (of a given word category).
- get_information_content(synset) -> float
The information content grades semantic concepts from general to specific. The more specific a concept, the lower its probability and thus the higher its informativeness. The information content of a semantic concept is estimated from the relative frequency of the concept in a large corpus (cumulative synset frequency).
- Parameters:
synset (Synset) – The synset for which the information content should be computed
- Returns:
the information content for the given synset
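The relationship between frequency and informativeness can be sketched as follows, assuming the common definition IC(c) = -log p(c), where p(c) is the cumulative frequency of a concept divided by the cumulative frequency of the root. The toy hierarchy, the counts and the helper names are made up for illustration; the library's exact formula is not shown in this page.

```python
import math

# Toy hypernym tree (child -> parent); names and counts are made up.
parent = {"animal": "entity", "plant": "entity", "dog": "animal", "cat": "animal"}
simple_freq = {"entity": 0, "animal": 10, "plant": 100, "dog": 60, "cat": 30}


def cumulative_frequency(node):
    """Frequency of a node plus the frequencies of all its descendants."""
    children = [c for c, p in parent.items() if p == node]
    return simple_freq[node] + sum(cumulative_frequency(c) for c in children)


root_freq = cumulative_frequency("entity")


def information_content(node):
    """Assumed definition: IC(c) = -log(cumulative frequency / root frequency)."""
    return -math.log(cumulative_frequency(node) / root_freq)


for name in ("entity", "animal", "dog"):
    print(name, round(information_content(name), 3))
```

In this toy tree the root has a probability of 1 and therefore an information content of 0, while more specific nodes such as `dog` receive higher values.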
- resnik(synset1, synset2, normalize: bool = False, normalized_max: float = 1.0) -> float
Two concepts are more related the more information they share. The shared information of two concepts can be quantified by the information content of their lowest common subsumer (LCS). When several LCSs are available, the highest IC is returned.
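- Parameters:
synset1 – The first synset
synset2 – The second synset
normalize – Whether to normalize the resulting value
normalized_max – The upper bound used when normalizing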
- Returns:
The information content of the LCS of the two given synsets.
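A minimal sketch of the idea, using made-up information content values and hypernym sets: it takes the highest IC among all subsumers shared by both synsets, which matches the IC of the most informative LCS as long as IC never decreases from a hypernym to its hyponyms. `resnik_sketch`, `ic` and `hypernyms` are hypothetical names, not part of germanetpy.

```python
# Toy information content values and hypernym sets (all made up for illustration).
ic = {"entity": 0.0, "animal": 0.7, "pet": 1.1, "dog": 1.9, "cat": 2.3}
hypernyms = {
    # Each set contains the synset itself and all of its (transitive) hypernyms.
    "dog": {"dog", "pet", "animal", "entity"},
    "cat": {"cat", "pet", "animal", "entity"},
}


def resnik_sketch(s1, s2):
    """Highest IC among the subsumers shared by both synsets."""
    common = hypernyms[s1] & hypernyms[s2]
    return max(ic[c] for c in common)


print(resnik_sketch("dog", "cat"))
```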
- jiang_and_conrath(synset1, synset2, normalize: float = False, normalized_max: float = 1.0) -> float
The Jiang and Conrath measure includes knowledge about the individual information content of each synset. The smaller the difference between the information content of the two synsets, the more related they are.
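- Parameters:
synset1 – The first synset
synset2 – The second synset
normalize – Whether to normalize the resulting value
normalized_max – The upper bound used when normalizing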
- Returns:
The Jiang and Conrath relatedness measure
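For orientation, the classic Jiang and Conrath formulation is a distance: IC(s1) + IC(s2) - 2 * IC(LCS). The sketch below implements only that textbook distance with made-up IC values. How germanetpy converts it into a relatedness score (for example relative to `jcnmaxdist`) is not documented on this page, so that step is deliberately left out.

```python
def jcn_distance(ic1, ic2, ic_lcs):
    """Textbook Jiang-Conrath distance; 0 means the two synsets are identical."""
    return ic1 + ic2 - 2.0 * ic_lcs


# Made-up information content values for two synsets and their LCS.
print(jcn_distance(2.0, 2.5, 1.0))
# Identical synsets: both ICs equal the IC of the LCS, so the distance is 0.
print(jcn_distance(2.0, 2.0, 2.0))
```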
- lin(synset1, synset2, normalize: bool = False, normalized_max: float = 1.0) -> float
The Lin measure takes the individual information content of each synset and the information content of the LCS into account. The LCS with the highest information content is used for the computation.
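- Parameters:
synset1 – The first synset
synset2 – The second synset
normalize – Whether to normalize the resulting value
normalized_max – The upper bound used when normalizing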
- Returns:
The Lin relatedness measure
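The standard Lin formulation is 2 * IC(LCS) / (IC(s1) + IC(s2)). A minimal sketch with made-up IC values follows; `lin_sketch` is a hypothetical helper, and returning 0.0 when both ICs are zero is an assumption of this sketch rather than documented library behavior.

```python
def lin_sketch(ic1, ic2, ic_lcs):
    """Standard Lin similarity computed from three information content values."""
    denominator = ic1 + ic2
    if denominator == 0:
        return 0.0  # assumption: treat the undefined 0/0 case as no relatedness
    return 2.0 * ic_lcs / denominator


# Made-up information content values for two synsets and their LCS.
print(lin_sketch(2.0, 2.5, 1.0))
# Identical synsets share all of their information.
print(lin_sketch(2.0, 2.0, 2.0))
```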
- normalize(raw_value: float, normalized_max: float, semrel_measure: SemRelMeasure) -> float
Normalizes a raw value of semantic relatedness to a value between a lower bound and the given upper bound.
- Parameters:
raw_value – The raw value
normalized_max – The upper bound
semrel_measure – The semantic relatedness measure that the value corresponds to.
- Returns:
The normalized semantic relatedness value
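One plausible reading of this description is ordinary min-max scaling, sketched below. The helper `min_max_normalize`, its explicit `min_value` and `max_value` arguments and the lower bound of 0 are assumptions made for illustration; in the library, the per-measure bounds presumably come from `normalization_dic`.

```python
def min_max_normalize(raw_value, min_value, max_value, normalized_max=1.0):
    """Scale raw_value from [min_value, max_value] to [0, normalized_max]."""
    if max_value == min_value:
        return 0.0  # assumption: a degenerate range maps to the lower bound
    return (raw_value - min_value) / (max_value - min_value) * normalized_max


# Made-up raw value and bounds.
print(min_max_normalize(5.0, 0.0, 10.0))
print(min_max_normalize(10.0, 0.0, 10.0, normalized_max=4.0))
```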
- property germanet
- property root_freq
- property synset2cumfreq
- property jcnmaxdist
- property normalization_dic
- property synset2ic
- property most_informative_synset
- property synset2simple_freq