R base functions
split
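split is a function with two main arguments: x, a vector or data.frame, and f, a factor (or a vector that can be coerced to one) that defines how the data is divided into smaller groups. It returns a list with one element per group.
A useful optional argument is drop, which indicates whether factor levels that do not occur in the data should be dropped from the result. The default is drop = FALSE, so an unused level still appears as an empty group.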
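As a minimal sketch of the behavior described above, the following uses a small made-up vector of scores and a grouping factor with one unused level. The names scores, group, and df are hypothetical and chosen only for illustration.

```r
# Hypothetical toy data: five scores with a grouping label for each one.
# The factor has three levels, but "c" never occurs in the data.
scores <- c(90, 85, 70, 60, 95)
group <- factor(c("a", "b", "a", "b", "a"), levels = c("a", "b", "c"))

# Default drop = FALSE: the unused level "c" is kept as an empty group
by_group <- split(scores, group)
names(by_group)

# drop = TRUE: groups for levels that never occur are removed
by_group_dropped <- split(scores, group, drop = TRUE)
names(by_group_dropped)

# split also works on a data.frame, returning a list of smaller data.frames
df <- data.frame(score = scores, group = group)
lapply(split(df, df$group, drop = TRUE), nrow)
```

Because the result is an ordinary list, it combines naturally with lapply or sapply when the same summary is needed for every group.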
Examples
Read the first 10 rows of /anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv. Using the strsplit function, we can find out how many times each of the individual genres occurs.
library(data.table)  # provides fread
myDF <- fread("/anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv", nrows = 10)
myDF$genres
strsplit(myDF$genres, ',')
unlist(strsplit(myDF$genres, ','))
table(unlist(strsplit(myDF$genres, ',')))
'Documentary,Short' 'Animation,Short' 'Animation,Comedy,Romance' 'Animation,Short' 'Comedy,Short' 'Short' 'Short,Sport' 'Documentary,Short' 'Romance' 'Documentary,Short'
'Documentary' 'Short'
'Animation' 'Short'
'Animation' 'Comedy' 'Romance'
'Animation' 'Short'
'Comedy' 'Short'
'Short'
'Short' 'Sport'
'Documentary' 'Short'
'Romance'
'Documentary' 'Short'
'Documentary' 'Short' 'Animation' 'Short' 'Animation' 'Comedy' 'Romance' 'Animation' 'Short' 'Comedy' 'Short' 'Short' 'Short' 'Sport' 'Documentary' 'Short' 'Romance' 'Documentary' 'Short'
  Animation      Comedy Documentary     Romance       Short       Sport
          3           2           3           2           8           1
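For readers without access to the file on Anvil, the same pipeline can be tried on a hand-typed vector. This is only a sketch: the ten strings below are copied from the output shown above, and the names genres, pieces, and genre_counts are introduced here for illustration.

```r
# The ten genre strings from the output above, typed in by hand
genres <- c("Documentary,Short", "Animation,Short", "Animation,Comedy,Romance",
            "Animation,Short", "Comedy,Short", "Short", "Short,Sport",
            "Documentary,Short", "Romance", "Documentary,Short")

# strsplit returns a list with one character vector per input string;
# fixed = TRUE treats "," as a literal string rather than a regular expression
pieces <- strsplit(genres, ",", fixed = TRUE)

# Flatten the list into one long character vector, then count each genre
genre_counts <- table(unlist(pieces))
genre_counts
```

The unlist step matters because table needs a single vector; calling table on the list returned by strsplit directly would not count individual genres.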
Using movies_and_tv/imdb2024/basics.tsv, for each of the genres, list how many times it occurs.
genres <- fread("/anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv", select = "genres", col.names = "genres")
sort(table(unlist(strsplit(genres$genres, ","))), decreasing = TRUE)
      Drama      Comedy   Talk-Show       Short Documentary        News
    3151064     2181847     1372500     1191319     1062294     1051399
    Romance      Family  Reality-TV   Animation      Action       Crime
    1045327      824607      624854      556566      462531      459412
  Adventure   Game-Show       Music       Adult       Sport     Fantasy
     425130      424919      418888      353525      271872      234269
    Mystery      Horror    Thriller     History   Biography      Sci-Fi
     225390      202434      184618      165528      119759      117541
    Musical         War     Western   Film-Noir
      92140       38662       30931         873