David Bowie Tribute (and III)
In the previous article I clusterized the songs and did a tree anlysis. Now, I will try a practical model to dynamically look for the nearest songs to a given one.
Los datos
Again, There are 254 songs in 25 albums.
In this case, to calculate the distance between songs, I will use the Term frequency - inverse document frequency (tf-idf) for each word in songs, instead of the absolute frequency of occurrence of each word.
The advantage of this algorithm is that it gives more weight to words that are very common in some documents, but doesn’t appears in a many documents in the corpus. That is, it makes some words more significant than others. With this system, the word cloud change a bit:
Nearest song to a given one
With the matrix of terms and using tf-idf frequencies we can easily determine which songs are the closest to a given one, based, as in the previous article, in the Euclidean distance between songs.
This method gives better results than the previous classification tree, in last article, which It was based on the absolute frequencies.
Let’s search the nearest songs to Lazarus:
song | distance |
---|---|
Lazarus | 0.0000000 |
Cygnet Committee | 0.6709544 |
Candidate | 0.7180088 |
You Feel So Lonely You Could Die | 0.7256923 |
We Are The Dead | 0.7280794 |
Shining Star (Makin' My Love) | 0.7353172 |
Somebody Up There Likes Me | 0.7379833 |
Something In The Air | 0.7549199 |
Ashes To Ashes | 0.7621797 |
Five Years | 0.7644335 |
Now, let’s see the nearest songs to Space Oditty:
song | distance |
---|---|
Space Oddity | 0.0000000 |
Ashes To Ashes | 0.8647541 |
Cygnet Committee | 0.9391786 |
Candidate | 0.9655204 |
You Feel So Lonely You Could Die | 0.9688882 |
We Are The Dead | 0.9700174 |
Shining Star (Makin' My Love) | 0.9855299 |
If You Can See Me | 0.9894812 |
Somebody Up There Likes Me | 0.9911007 |
Five Years | 1.0024585 |
The algorithm finds that the nearest song is Ashes to Ashes_. That is, the song in which appears also Major Tom. This is a significant improvement over the system based on the absolute frequencies in the previous article (You can look for this song there and observe the difference).
Let’s try one more song _ “Heroes” _:
song | distance |
---|---|
“Heroes” | 0.0000000 |
Cygnet Committee | 0.7270792 |
You Feel So Lonely You Could Die | 0.7499512 |
Candidate | 0.7565930 |
We Are The Dead | 0.7641921 |
Shining Star (Makin' My Love) | 0.7683600 |
Somebody Up There Likes Me | 0.7725134 |
If You Can See Me | 0.7843719 |
When I Live My Dream | 0.7858126 |
Something In The Air | 0.8051492 |
Network of songs
With the distances between songs, we can visualize the relationships between them through a network.
With the distances between all of them, the network would be unmanageable. We include just the most important relationships to have an idea of what is the result (in any case 254 songs, create a very complex network).
You can do infinite zoom in this network if you open this image in its own tab following this link: network of songs. Also you can download the image and open it with a svg viewer.
In the center of the image we find the songs who have more relationships. The other songs can be found around these.
References
The R document (rmarkdown) to reproduce this analysis can be found following this link