User:TJones (WMF)/Notes/Sandbox
Transclusion..
The core task of the speaker doing the review is to decide whether words are being properly grouped together for search, and whether any changes to those groupings make things better or worse. When words are grouped together, searching for one word in the group will also find all of the other words in the group. With the current English language processing, for example, searching for any of the words hope, hopes, hoped, hoping, hope’s, hoper, or hopers will find all of the others. (Note that the results in each case will be ranked differently because exact matches are preferred.)
In addition to listing the words that are grouped together, we also include the number of times each word appears in the text sample. This helps us estimate the relative importance of potential errors. For example, if two words are improperly grouped together, but the words are very rare, that’s not as bad as if they were very common. [For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]
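As a rough illustration of grouping with counts, here is a minimal Python sketch. The `toy_stem` function is a hypothetical stand-in written only for this example; it is not the stemmer used by the actual English analysis chain, and it will happily make grouping mistakes of its own (it would also merge "hop" into this group). The point is only to show how words that share a stem end up in one group, each with its frequency in the sample.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    """Hypothetical toy stemmer: lowercase, drop a possessive, strip one suffix."""
    w = word.lower().replace("’", "'")
    if w.endswith("'s"):
        w = w[:-2]
    # Longer suffixes are tried first; keep at least three characters of stem.
    for suffix in ("ers", "ing", "er", "ed", "es", "e", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

def group_by_stem(tokens, stem=toy_stem):
    """Map each stem to a dict of {original word: count in the sample}."""
    counts = Counter(tokens)
    groups = defaultdict(dict)
    for word, n in counts.items():
        groups[stem(word)][word] = n
    return dict(groups)

# A tiny made-up sample; a real review would use a large text sample.
sample = ["hope", "hope", "hoped", "Hope", "hoping", "table"]
for stem, words in group_by_stem(sample).items():
    print(stem, words)
```

Note that the key this toy stemmer produces for the hope family is "hop", which is only a grouping key and not meant to be the root form. The per-word counts are what let a reviewer judge whether a bad grouping involves rare or common words.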
When we make less extreme modifications to the language processing done for search—like introducing diacritic folding—we can usually look more meaningfully at groups before and after the modification to assess the effect of the group changes.
Old-vs-new groups are presented as follows:
hope >> 2
o: [152 Hope][23 Hopes][1208 hope][346 hoped][488 hopes]
n: [152 Hope][1 Hopē][23 Hopes][1208 hope][346 hoped][488 hopes][2 ĥợṕễ]
The first line shows the stem ( The stem is the form that all of the other words were reduced to. The stem does not have to be the actual root form of the word or even a word at all. However, seeing the stem sometimes makes it easier to understand what the stemmer or other parts of the analysis were trying to do.
In terms of gains and losses:
The
The numbers with the word, e.g.,
Problems can arise when more common words are grouped together incorrectly. For example, a grouping like
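To make the old-vs-new comparison concrete, the following Python sketch parses the bracketed `[count word]` entries from the `o:` and `n:` lines of the hope example and reports which words a group gained or lost. The function names `parse_group` and `diff_groups` are made up for this illustration, and the bracket syntax is assumed to be exactly as it appears in the example; this is not the tooling that generated the report.

```python
import re

# One "[count word]" entry, e.g. "[152 Hope]".
ENTRY = re.compile(r"\[(\d+) ([^\]]+)\]")

def parse_group(line):
    """Turn '[152 Hope][23 Hopes]' into {'Hope': 152, 'Hopes': 23}."""
    return {word: int(count) for count, word in ENTRY.findall(line)}

def diff_groups(old, new):
    """Return (gained, lost): words only in the new group, words only in the old."""
    gained = {w: n for w, n in new.items() if w not in old}
    lost = {w: n for w, n in old.items() if w not in new}
    return gained, lost

old = parse_group("[152 Hope][23 Hopes][1208 hope][346 hoped][488 hopes]")
new = parse_group("[152 Hope][1 Hopē][23 Hopes][1208 hope][346 hoped][488 hopes][2 ĥợṕễ]")

gained, lost = diff_groups(old, new)
print("gained:", gained)
print("lost:", lost)
print("tokens gained:", sum(gained.values()), "of", sum(old.values()), "already in the group")
```

For this example the new group gains two words that account for three occurrences in total, against 2217 occurrences already in the old group, so the change affects only a tiny share of the sample.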
Oh yeah.