Randomly generate a list of weighted words from a .txt file with Python

I need code that generates a list of n significant words from a conversation, drawn at random according to the statistical distribution of the corpus; that is, the words should be generated respecting the frequencies recorded in that corpus.

I start from this link, which I have cleaned up by removing the Spanish stopwords and keeping only the 500 most frequent words:

Wikitext

import requests
wiki_url = "https://es.wiktionary.org/wiki/Wikcionario:Frecuentes-(1-1000)-Subt%C3%ADtulos_de_pel%C3%ADculas"
wiki_texto = requests.get(wiki_url).text

from bs4 import BeautifulSoup
wiki_datos = BeautifulSoup(wiki_texto, "html.parser")  # "html" alone is not a valid parser name

# Rows and cells of the frequency table
wiki_filas = wiki_datos.find_all("tr")
print(wiki_filas[1])

print("...............................")

wiki_celdas = wiki_datos.find_all("td")
print(wiki_celdas)

fila_1 = wiki_celdas[0:]
info_1 = [elemento.get_text() for elemento in fila_1]
print(fila_1)
print(info_1)
info_1[0] = int(float(info_1[0]))  # the first cell holds the rank number
print(info_1)


print("...............................")

# Column-wise extraction, skipping the header row: rank, word, frequency
num_or = [int(float(elem.find_all("td")[0].get_text())) for elem in wiki_filas[1:]]
palabras = [elem.find_all("td")[1].get_text().rstrip() for elem in wiki_filas[1:]]
frecuencia = [elem.find_all("td")[2].get_text().rstrip() for elem in wiki_filas[1:]]

print(num_or)
print(palabras)
print(frecuencia)

from pandas import DataFrame
tabla = DataFrame([num_or, palabras, frecuencia]).T
tabla.columns = ["Núm. orden", "Palabras", "Frecuencia"]
print(tabla)

print("...............................")

import pandas as pd
from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')
prep = stopwords.words('spanish')
print(prep)

# Simpler route: let pandas parse the HTML table directly, then drop the stopwords
tabla_beta = pd.read_html(wiki_url)[0]
tabla_beta.columns = ["Núm. orden", "Palabras", "Frecuencia"]
tabla_beta = tabla_beta[~tabla_beta['Palabras'].isin(prep)].head(500)
print(tabla_beta)

This results in a dataframe of 500 rows and 3 columns, the last column being the frequency of each word:


What I need now is code that randomly generates a sentence from those words, respecting the frequencies in column 3.

Any help is welcome! Thank you.

1 answer

  • answered 2020-01-18 12:05 kleerofski

    NumPy's random.choice already provides a built-in parameter p (for probability) to draw a weighted sample. Minimal example:

    import pandas as pd, numpy as np
    from collections import Counter

    # Five vowels with random integer weights in [5, 15)
    df = pd.DataFrame(dict(words=["a", "e", "i", "o", "u"], weights=np.random.randint(5, 15, 5)))
    # p= in np.random.choice must sum to 1, so normalize the weights first
    df["normalized"] = df["weights"] / df["weights"].sum()
    print(df)
    
      words  weights  normalized
    0     a        9    0.204545
    1     e       13    0.295455
    2     i        8    0.181818
    3     o        6    0.136364
    4     u        8    0.181818
    
    n = 3
    l = np.random.choice(df.words, size=(n,), p=df.normalized)
    print(l)
    ['u' 'i' 'i']
    

    How do you know whether the probabilities are honored? The answer is simple: if n is large enough, the count of each word divided by n should be approximately equal to its normalized weight:

    n = 10000
    l = np.random.choice(df.words, size=(n,), p=df.normalized)
    c = Counter(l)
    for key in c:
        c[key] = c[key] / n
    print(c, sum(c.values()))
    Counter({'e': 0.2907, 'a': 0.2055, 'u': 0.1882, 'i': 0.1791, 'o': 0.1365}) 1.0
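
    Applied to the question's tabla_beta, the same idea would look roughly like this. This is a sketch under two assumptions: the scraped "Frecuencia" column may arrive as strings (hence the pd.to_numeric coercion), and a "sentence" here just means the weighted sample joined by spaces, with no grammar involved:

    import numpy as np
    import pandas as pd

    # tabla_beta comes from the question: "Núm. orden", "Palabras", "Frecuencia".
    # Scraped frequencies may be strings, so coerce to numbers and drop failures.
    tabla_beta["Frecuencia"] = pd.to_numeric(tabla_beta["Frecuencia"], errors="coerce")
    tabla_beta = tabla_beta.dropna(subset=["Frecuencia"])

    # Normalize the frequencies so they sum to 1, as the p= parameter requires.
    pesos = tabla_beta["Frecuencia"] / tabla_beta["Frecuencia"].sum()

    n = 10
    muestra = np.random.choice(tabla_beta["Palabras"], size=n, p=pesos)

    # A crude "sentence": the sampled words joined with spaces.
    print(" ".join(muestra))

    pandas can also do the weighted draw directly: tabla_beta.sample(n=10, replace=True, weights="Frecuencia") picks rows with probability proportional to the "Frecuencia" column and normalizes the weights itself.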