David Bowie Tribute (II)
David Bowie II (classifying his songs)
The last article on my blog was dedicated to David Bowie.
I want to extend it by looking for relationships between his songs, using only their lyrics as features. I will use unstructured text analysis tools to compute the distance between songs (Euclidean distance).
Background
We have lyrics of 254 songs.
As in the last article, we have to convert the lyrics into terms, discarding, where possible, differences such as plurals and verb forms.
Terms
First of all, we create a list of words to discard: articles, pronouns, ...
i, me, my, myself, we, our, ours, ourselves, you, your, yours, yourself, yourselves, he, him, his, himself, she, her, hers, herself, it, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, should, now, d, ll, m, o, re, ve, y, ain, aren, couldn, didn, doesn, hadn, hasn, haven, isn, ma, mightn, mustn, needn, shan, shouldn, wasn, weren, won, wouldn, chorus, 'm, 's, 'll, n't, 'd, ca, wo, oh, yeah, x2.
Then we tokenize all the lyrics, e.g., the Blackstar lyrics:
in the villa of ormen in the villa of ormen stands a solitary candle ah-ah ah-ah in the centre of it all in the centre of it all your eyes on the day of execution on the day of execution only women kneel and smile ah-ah ah-ah at the centre of it all at the centre of it all your eyes your eyes ah-ah-ah ah-ah-ah in the villa of ormen in the villa of ormen stands a solitary candle ah-ah ah-ah in the centre of it all in the centre of it all your eyes ah-ah-ah something happened on the day he died spirit rose a metre and stepped aside somebody else took his place and bravely cried i’m a blackstar i’m a blackstar how many times does an angel fall how many people lie instead of talking tall he trod on sacred ground he cried loud into the crowd i’m a blackstar i’m a blackstar i’m not a gangster i can’t answer why i’m a blackstar just go with me i’m not a filmstar i’m-a take you home i’m a blackstar take your passport and shoes i’m not a popstar and your sedatives boo i’m a blackstar you’re a flash in the pan i’m not a marvel star i’m the great i am i’m a blackstar i’m a blackstar way up oh honey i’ve got game i see right so white so open-heart it’s pain i want eagles in my daydreams diamonds in my eyes i’m a blackstar i’m a blackstar something happened on the day he died spirit rose a metre and stepped aside somebody else took his place and bravely cried i’m a blackstar i’m a star star i’m a blackstar i can’t answer why i’m not a gangster but i can tell you how i’m not a flam star we were born upside-down i’m a star star born the wrong way ‘round i’m not a white star i’m a blackstar i’m not a gangster i’m a blackstar i’m a blackstar i’m not a pornstar i’m not a wandering star i’m a blackstar i’m a blackstar in the villa of ormen stands a solitary candle ah-ah ah-ah at the centre of it all your eyes on the day of execution only women kneel and smile ah-ah ah-ah at the centre of it all your eyes your eyes ah-ah-ah
Now we remove the discarded words and suppress the differences between similar words, e.g. Blackstar:
villa ormen villa ormen stand solitari candl ah-ah ah-ah centr centr eye day execut day execut women kneel smile ah-ah ah-ah centr centr eye eye ah-ah-ah ah-ah-ah villa ormen villa ormen stand solitari candl ah-ah ah-ah centr centr eye ah-ah-ah someth happen day die spirit rose metr step asid somebodi els took place brave cri i'm blackstar i'm blackstar mani time angel fall mani peopl lie instead talk tall trod sacr ground cri loud crowd i'm blackstar i'm blackstar i'm gangster can't answer i'm blackstar go i'm filmstar i'm-a take home i'm blackstar take passport shoe i'm popstar sedat boo i'm blackstar you'r flash pan i'm marvel star i'm great i'm blackstar i'm blackstar way honey i'v got game see right white open-heart it pain want eagl daydream diamond eye i'm blackstar i'm blackstar someth happen day die spirit rose metr step asid somebodi els took place brave cri i'm blackstar i'm star star i'm blackstar can't answer i'm gangster tell i'm flam star born upside-down i'm star star born wrong way round i'm white star i'm blackstar i'm gangster i'm blackstar i'm blackstar i'm pornstar i'm wander star i'm blackstar i'm blackstar villa ormen stand solitari candl ah-ah ah-ah centr eye day execut women kneel smile ah-ah ah-ah centr eye eye ah-ah-ah
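The tokenize, discard and normalize steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: `STOP_WORDS` is a small subset of the list shown earlier, and `toy_stem` is a hypothetical, very crude suffix stripper standing in for a real stemmer (the stems in the output above, such as "solitari" and "candl", come from a proper stemming algorithm, which this sketch does not reproduce).

```python
import re

# Assumption: a small subset of the stop-word list shown earlier in the article.
STOP_WORDS = {"in", "the", "of", "a", "on", "it", "all", "your", "and", "only", "at"}


def tokenize(text):
    """Lowercase the text and split it into tokens, keeping inner hyphens and apostrophes."""
    return re.findall(r"[a-z0-9]+(?:['’-][a-z0-9]+)*", text.lower())


def toy_stem(token):
    """Hypothetical, very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def to_terms(text):
    """Tokenize, drop the stop words, and reduce each remaining token to a rough stem."""
    return [toy_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


line = "in the villa of ormen stands a solitary candle ah-ah"
terms = to_terms(line)
print(terms)
```

With a real stemmer in place of `toy_stem`, the same pipeline would be applied to every song before counting terms.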
We do this with all the lyrics. At the end we select the list of the most frequent words. We will measure the distance between songs using the frequency of these words. These are the 30 most frequent words:
freq       tok    word
---------  -----  -------
16.082115  love   love
13.795538  like   like
11.673021  look   look
11.623957  got    got
11.104668  know   know
11.050255  time   time
10.515605  day    day
10.002770  say    says
 9.683665  eye    eyes
 9.519472  could  could
 9.511280  one    one
 9.362057  never  never
 9.320950  let    letting
 9.214944  come   come
 9.207492  want   want
 9.036289  get    get
 8.794052  man    man
 8.770926  life   life
 8.730971  boy    boy
 8.549849  thing  things
 8.469426  go     go
 8.214135  see    see
 8.118021  feel   feel
 8.037416  live   lives
 7.983238  littl  little
 7.935172  girl   girls
 7.839930  take   take
 7.751650  noth   nothing
 7.165395  make   make
 7.109152  well   well
(...)
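As a rough illustration of how a table like this can be built, the sketch below counts terms per song, picks the most frequent terms overall, and lays the counts out as one row per song. The song names and term lists are made up for the example, and it uses raw counts; the non-integer values in the table suggest the original analysis applied some weighting or normalization that is not described here.

```python
from collections import Counter

# Hypothetical toy data: each song is already reduced to a list of terms.
docs = {
    "song_a": ["love", "love", "day", "eye"],
    "song_b": ["love", "love", "time"],
    "song_c": ["day", "day", "time", "love"],
}

# Term counts per song.
counts = {title: Counter(terms) for title, terms in docs.items()}

# Term counts over the whole collection.
total = Counter()
for song_counts in counts.values():
    total.update(song_counts)

# Keep only the most frequent terms as features (30 in the article, 3 here).
vocabulary = [term for term, _ in total.most_common(3)]

# Song-by-term matrix: one row per song, one column per selected term.
matrix = {title: [c[term] for term in vocabulary] for title, c in counts.items()}

print(vocabulary)
for title, row in matrix.items():
    print(title, row)
```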
We already saw these frequent words in the word cloud of the last article.
Now we generate the distance matrix between songs from the term-document frequency matrix. With this distance matrix we cluster the songs using Ward's method, building a tree that lets us see the potential clusters.
There are too many songs, so it is difficult to make out the clusters in such a dense tree. We can see the
We can assume that there are eight large groups. We'll go through them one by one to see whether we can find any relationship between the songs they contain.
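To make the distance step concrete, here is a minimal sketch of a Euclidean distance matrix computed from term-frequency vectors. The three vectors are invented for the example; in the real analysis there would be one vector per song, with one component per selected term.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Symmetric matrix with the distance between every pair of vectors."""
    n = len(vectors)
    return [[euclidean(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


# Hypothetical term-frequency vectors for three songs.
vectors = [[2, 1, 0], [2, 0, 1], [1, 2, 1]]
dist = distance_matrix(vectors)

for row in dist:
    print(["%.3f" % d for d in row])
```

Songs that use the selected words with similar frequencies end up close together, which is what the clustering step relies on.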
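Cutting the tree into a fixed number of groups can be illustrated with a naive, pure-Python version of Ward's criterion: at each step, merge the two clusters whose union increases the within-cluster variance the least, and stop when the desired number of groups remains. This is a sketch on invented vectors, not the code used for the article, and a library implementation would be far more efficient on 254 songs.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(points):
    """Component-wise mean of a list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * squared_distance(centroid(a), centroid(b))


def ward_clusters(vectors, k):
    """Greedily merge the cheapest pair of clusters until only k remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a = [vectors[x] for x in clusters[i]]
                b = [vectors[x] for x in clusters[j]]
                cost = ward_cost(a, b)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical term-frequency vectors for five songs.
vectors = [[2, 1, 0], [2, 0, 1], [0, 5, 4], [1, 5, 5], [9, 0, 0]]
groups = ward_clusters(vectors, 3)
print(groups)
```

Each inner list holds the indices of the songs assigned to one group; in the article the same idea is applied with eight groups.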
The most frequent words in group 0 are:
     freq      term   word
170  5.365048  say    says
224  4.782514  want   want
76   3.478711  got    got
18   2.581732  boy    boy
200  2.482339  thing  things
The first group consists of songs that include the words say, want, got, boy and thing, among others.
The most frequent words in group 1 are:
     freq      term  word
122  4.380216  man   man
196  4.167142  take  take
18   4.055294  boy   boy
116  4.042343  long  long
67   4.020220  get   get
This group has 112 songs; it could be split further to analyze its songs in more detail. Nevertheless, you can detect that the two nearest songs are
The most frequent words in group 2 are:
     freq      term  word
119  7.463742  love  love
111  6.080673  like  like
32   5.357936  come  come
204  4.716788  time  time
8    4.685532  babi  baby
The songs of this group speak of love, time, etc.
The most frequent words in group 3 are:
     freq      term    word
113  3.624713  littl   little
68   1.478087  girl    girls
56   0.751800  feel    feel
7    0.749316  away    away
233  0.705218  wonder  wonder
The songs of this group speak of girls...
The most frequent words in group 4 are:
     freq      term  word
38   3.690027  day   day
144  3.295195  one   one
141  2.358883  noth  nothing
41   1.519879  die   dying
109  1.143182  life  life
The songs of this group speak of the day, life and death.
The most frequent words in group 5 are:
     freq      term  word
119  3.861197  love  love
68   0.995499  girl  girls
138  0.991495  new   new
70   0.553128  go    go
121  0.524290  make  make
The songs of this group are about girls and love.
The most frequent words in group 6 are:
     freq      term  word
36   3.604493  danc  dance
107  1.408223  let   letting
100  0.982583  know  know
52   0.647080  face  face
197  0.479504  talk  talking
The songs of this group speak of dance.
The most frequent words in group 7 are:
     freq      term   word
117  3.509484  look   look
187  0.770862  sound  sound
70   0.555531  go     go
7    0.545412  away   away
105  0.494563  leav   leave
And the last group speaks primarily about looking for something.
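Per-group word lists like the ones above could be produced by averaging each term's frequency over the songs in a group and keeping the highest values. The sketch below does exactly that on invented data; whether the `freq` column in the tables is a plain mean or some other aggregate is an assumption, and `top_terms` is a hypothetical helper name.

```python
def top_terms(matrix, vocabulary, members, n=5):
    """Return the n (mean frequency, term) pairs with the highest mean among the given songs."""
    means = []
    for col, term in enumerate(vocabulary):
        mean = sum(matrix[row][col] for row in members) / len(members)
        means.append((mean, term))
    means.sort(key=lambda pair: pair[0], reverse=True)
    return means[:n]


# Hypothetical song-by-term matrix and one group containing all three songs.
vocabulary = ["love", "day", "time"]
matrix = [[2, 1, 0], [2, 0, 1], [1, 2, 1]]
top = top_terms(matrix, vocabulary, members=[0, 1, 2], n=2)

for mean, term in top:
    print("%.3f  %s" % (mean, term))
```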
Epilogue
This could be done better: we could group similar words, use bigrams, improve the tokenization, and apply other algorithms such as tf-idf. We could also weight the words, e.g. favoring words that are rare in general but very common in certain songs, to find the songs that are most similar in subject. I just hope I have contributed a bit with this tribute to one of the great
References
This time I used Python for the analysis. The source is available in this Jupyter notebook.