make.frequency.list | R Documentation |
Function for generating a frequency list of words or other (linguistic) features. It basically counts the elements of a vector and returns a vector of these elements in descending order of frequency.
make.frequency.list(data, value = FALSE, head = NULL, relative = TRUE)
data |
either a vector of elements (e.g. words, letter n-grams),
or an object of a class |
value |
if this function is switched on, not only the most frequent elements are returned, but also their frequencies. Default: FALSE. |
head |
this option is meant to limit the number of the most frequent features to be returned. Default value is NULL, which means that the entire range of frequent and unfrequent features is returned. |
relative |
if you've switched on the option |
The function returns a vector of features (usually, words) in a descending
order of their frequency. Alternatively, when the option value
is set
TRUE, it returns a vector of frequencies instead, and the features themselves
might be accessed using the generic names
function.
Maciej Eder
load.corpus.and.parse
, make.table.of.frequencies
# assume there is a text: text = "Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated at the breakfast table. I stood upon the hearth-rug and picked up the stick which our visitor had left behind him the night before. It was a fine, thick piece of wood, bulbous-headed, of the sort which is known as a \"Penang lawyer.\"" # this text can be converted into vector of words: words = txt.to.words(text) # an avanced tokenizer is available via the function 'txt.to.words.ext': words2 = txt.to.words.ext(text, corpus.lang = "English.all") # a frequency list (just words): make.frequency.list(words) make.frequency.list(words2) # a frequency list with the numeric values make.frequency.list(words2, value = TRUE) ## Not run: # ##################################### # using the function with large text collections # first, load and pre-process a corpus from 3 text files: dataset = load.corpus.and.parse(files = c("1.txt", "2.txt", "3.txt")) # # then, return 100 the most frequent words of the entire corpus: make.frequency.list(dataset, head = 100) ## End(Not run)