Apache Spark Dataframe - Get length of each column
Question: In an Apache Spark DataFrame, using Python, how can we get the data type and the length of each column? I'm using the latest version of Python.
Using a pandas dataframe, I do it as follows:
import pandas as pd

df = pd.read_csv(r'C:\TestFolder\myFile1.csv', low_memory=False)
for col in df:
    print(col, '->', df[col].str.len().max())
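For reference, a rough PySpark equivalent looks like the sketch below (an illustration, not code from the question: it assumes a SparkSession named spark, a CSV read with a header row and string-typed columns, and uses df.dtypes for the types and F.length for the maximum string length of each column):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.read.csv(r'C:\TestFolder\myFile1.csv', header=True)

# Data type of every column: a list of (name, type) pairs.
print(sdf.dtypes)

# Maximum string length per column, computed in a single aggregation.
max_lengths = sdf.agg(*[F.max(F.length(F.col(c))).alias(c) for c in sdf.columns]).first()
for c in sdf.columns:
    print(c, '->', max_lengths[c])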
1 answer
See also questions close to this topic
-
Python File Tagging System does not retrieve nested dictionaries in dictionary
I am building a file tagging system using Python. The idea is simple. Given a directory of files (and files within subdirectories), I want to filter them out using a filter input and tag those files with a word or a phrase.
If I have the following contents in my current directory:
data/
    budget.xls
    world_building_budget.txt
a.txt
b.exe
hello_world.dat
world_builder.spec
and I execute the following command in the shell:
py -3 tag_tool.py -filter=world -tag="World-Building Tool"
My output will be:
These files were tagged with "World-Building Tool":
data/world_building_budget.txt
hello_world.dat
world_builder.spec
My current output isn't exactly like this, but basically I am converting all files, including files within subdirectories, into a single dictionary like this:
def fs_tree_to_dict(path_):
    file_token = ''
    for root, dirs, files in os.walk(path_):
        tree = {d: fs_tree_to_dict(os.path.join(root, d)) for d in dirs}
        tree.update({f: file_token for f in files})
        return tree
Right now, my dictionary looks like this:
key: ''
In the following function, I am turning the empty values '' into empty lists (to hold my tags):
def empty_str_to_list(d):
    for k, v in d.items():
        if v == '':
            d[k] = []
        elif isinstance(v, dict):
            empty_str_to_list(v)
When I run my entire code, this is my output:
hello_world.dat ['World-Building Tool']
world_builder.spec ['World-Building Tool']
But it does not see data/world_building_budget.txt. This is the full dictionary:
{'data': {'world_building_budget.txt': []}, 'a.txt': [], 'hello_world.dat': [], 'b.exe': [], 'world_builder.spec': []}
This is my full code:
import os, argparse

def fs_tree_to_dict(path_):
    file_token = ''
    for root, dirs, files in os.walk(path_):
        tree = {d: fs_tree_to_dict(os.path.join(root, d)) for d in dirs}
        tree.update({f: file_token for f in files})
        return tree

def empty_str_to_list(d):
    for k, v in d.items():
        if v == '':
            d[k] = []
        elif isinstance(v, dict):
            empty_str_to_list(v)

parser = argparse.ArgumentParser(description="Just an example",
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--filter", action="store", help="keyword to filter files")
parser.add_argument("--tag", action="store", help="a tag phrase to attach to a file")
parser.add_argument("--get_tagged", action="store", help="retrieve files matching an existing tag")
args = parser.parse_args()

filter = args.filter
tag = args.tag
get_tagged = args.get_tagged

current_dir = os.getcwd()
files_dict = fs_tree_to_dict(current_dir)
empty_str_to_list(files_dict)

for k, v in files_dict.items():
    if filter in k:
        if v == []:
            v.append(tag)
        print(k, v)
    elif isinstance(v, dict):
        empty_str_to_list(v)
        if get_tagged in v:
            print(k, v)
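One way to reach the nested entries, sketched below under the assumption that the tag should also apply inside sub-dictionaries (the helper name tag_files and the path joining are illustrative, not part of the asker's code), is to recurse through the nested dictionary and carry the path prefix along:

import os

def tag_files(tree, filter_word, tag, prefix=""):
    # Recursively walk the nested dict; tag file entries whose name matches the filter.
    for name, value in tree.items():
        path = os.path.join(prefix, name)
        if isinstance(value, dict):
            # A sub-directory: recurse so files such as data/world_building_budget.txt are seen.
            tag_files(value, filter_word, tag, path)
        elif filter_word in name:
            # A file entry: its value is the list of tags.
            if tag not in value:
                value.append(tag)
            print(path, value)

# Example with the dictionary from the question:
files_dict = {'data': {'world_building_budget.txt': []}, 'a.txt': [],
              'hello_world.dat': [], 'b.exe': [], 'world_builder.spec': []}
tag_files(files_dict, "world", "World-Building Tool")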
-
Actually, I am working on a project and it is showing "no module named pip_internal". Please help me with this.
File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code exec(code, run_globals) File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\Scripts\pip.exe\__main__.py", line 4, in <module> File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\lib\site-packages\pip\_internal\__init__.py", line 4, in <module> from pip_internal.utils import _log
I am using PyCharm with a conda interpreter.
-
Looping the function if the input is not a string
I'm new to Python (first of all). I have a homework assignment to write a function that checks whether an item exists in a dictionary or not.
inventory = {"apple" : 50, "orange" : 50, "pineapple" : 70, "strawberry" : 30} def check_item(): x = input("Enter the fruit's name: ") if not x.isalpha(): print("Error! You need to type the name of the fruit") elif x in inventory: print("Fruit found:", x) print("Inventory available:", inventory[x],"KG") else: print("Fruit not found") check_item()
I want the function to loop again only if the input is not a string. I've tried putting return under print("Error! You need to type the name of the fruit"), but it didn't work. Help.
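A minimal sketch of that looping behaviour, assuming a while loop inside the function is acceptable (this is one interpretation of the requirement, not the assignment's required structure):

inventory = {"apple": 50, "orange": 50, "pineapple": 70, "strawberry": 30}

def check_item():
    while True:
        x = input("Enter the fruit's name: ")
        if not x.isalpha():
            # Non-alphabetic input: show the error and ask again.
            print("Error! You need to type the name of the fruit")
            continue
        if x in inventory:
            print("Fruit found:", x)
            print("Inventory available:", inventory[x], "KG")
        else:
            print("Fruit not found")
        break  # Alphabetic input was handled, so stop looping.

check_item()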
-
How do I disable the Debian Python path/recursion limit?
Lately, I've been having path length limit and recursion limit issues, so I really need to know how to disable these.
I can't even install modules like discord.py!
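Python's recursion limit (as opposed to any OS path-length limit) can at least be inspected and raised at runtime; a minimal sketch, assuming the failures are RecursionError exceptions and that 5000 is a reasonable value for the workload:

import sys

print("current recursion limit:", sys.getrecursionlimit())  # typically 1000
sys.setrecursionlimit(5000)  # the new value here is an assumption; tune it to the actual need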
-
TypeError: 'float' object cannot be interpreted as an integer on linspace
TypeError                                 Traceback (most recent call last)
d:\website\SpeechProcessForMachineLearning-master\SpeechProcessForMachineLearning-master\speech_process.ipynb Cell 15' in <cell line: 1>()
----> 1 plot_freq(signal, sample_rate)

d:\website\SpeechProcessForMachineLearning-master\SpeechProcessForMachineLearning-master\speech_process.ipynb Cell 10' in plot_freq(signal, sample_rate, fft_size)
      2 def plot_freq(signal, sample_rate, fft_size=512):
      3     xf = np.fft.rfft(signal, fft_size) / fft_size
----> 4     freq = np.linspace(0, sample_rate/2, fft_size/2 + 1)
      5     xfp = 20 * np.log10(np.clip(np.abs(xf), 1e-20, 1e100))
      6     plt.figure(figsize=(20, 5))

File <__array_function__ internals>:5, in linspace(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\function_base.py:120, in linspace(start, stop, num, endpoint, retstep, dtype, axis)
     23 @array_function_dispatch(_linspace_dispatcher)
     24 def linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None,
     25              axis=0):
     26     """
     27     Return evenly spaced numbers over a specified interval.
     28     (...)
    118
    119     """
--> 120     num = operator.index(num)
    121     if num < 0:
    122         raise ValueError("Number of samples, %s, must be non-negative." % num)

TypeError: 'float' object cannot be interpreted as an integer
What is the solution to this problem?
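The num argument of np.linspace must be an integer, and in Python 3 fft_size/2 + 1 is a float; a minimal sketch of the fix (the sample_rate value below is only for illustration, not from the question):

import numpy as np

fft_size = 512
sample_rate = 16000  # illustrative value

# fft_size // 2 keeps the count an integer, which operator.index() accepts.
freq = np.linspace(0, sample_rate / 2, fft_size // 2 + 1)
# Equivalent alternative: np.linspace(0, sample_rate / 2, int(fft_size / 2) + 1)
print(freq.shape)  # (257,)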
-
IndexError: list index out of range with api
all_currencies = currency_api('latest', 'currencies')  # {'eur': 'Euro', 'usd': 'United States dollar', ...}
all_currencies.pop('brl')

qtd_moedas = len(all_currencies)
texto = f'{qtd_moedas} Moedas encontradas\n\n'

moedas_importantes = ['usd', 'eur', 'gbp', 'chf', 'jpy', 'rub', 'aud', 'cad', 'ars']

while len(moedas_importantes) != 0:
    for codigo, moeda in all_currencies.items():
        if codigo == moedas_importantes[0]:
            cotacao, data = currency_api('latest', f'currencies/{codigo}/brl')['brl'], currency_api('latest', f'currencies/{codigo}/brl')['date']
            texto += f'{moeda} ({codigo.upper()}) = R$ {cotacao} [{data}]\n'
            moedas_importantes.remove(codigo)
        if len(moedas_importantes) == 0:
            break  # WITHOUT THIS LINE, GIVES ERROR
Why am I getting this error? The list actually runs out of elements, but the code only works if I keep that if/break check.
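What appears to happen (a reading of the code above, not a confirmed diagnosis): once the last important code is removed, the inner for loop keeps iterating over the remaining currencies and moedas_importantes[0] indexes an empty list before the outer while can re-check its condition. A sketch that avoids indexing altogether, assuming currency_api and all_currencies behave as in the question:

moedas_importantes = ['usd', 'eur', 'gbp', 'chf', 'jpy', 'rub', 'aud', 'cad', 'ars']
texto = ''

# Iterate over the wanted codes directly, so no list is mutated while looping over it.
for codigo in moedas_importantes:
    if codigo not in all_currencies:
        continue
    moeda = all_currencies[codigo]
    resposta = currency_api('latest', f'currencies/{codigo}/brl')  # single call instead of two
    cotacao, data = resposta['brl'], resposta['date']
    texto += f'{moeda} ({codigo.upper()}) = R$ {cotacao} [{data}]\n'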
-
Spark: retrieving old values of rows after casting made invalid input nulls
I am having trouble retrieving the old value of a column in Spark before a cast. Initially, all my inputs are strings, and I want to cast the column num1 to a double type. However, when the cast is applied to anything that is not a valid double, Spark changes it to null.
Currently, I have dataframe df1:
num1  unique_id
1     id1
a     id2
2     id3
and a copy of df1, df1_copy, where the cast is made.
When running
df1_copy = df1_copy.select(df1_copy.col('num1').cast('double'), df1_copy.col('unique_id'))
it returns df1_copy:
num1  unique_id
1     id1
null  id2
2     id3
I have tried putting it into a different dataframe using select and when, but I get an error about not being able to find the column num1. The following is what I tried:
df2 = df1_copy.select(when(df1_copy.col("unique_id").equalTo(df1.col("unique_id")), df1.col('num1')).alias("invalid"), df1_copy.col('unique_id'))
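One hedged PySpark sketch of an alternative (assuming the goal is to keep the pre-cast string for the rows the cast turns into null; the column names num1_double and invalid are illustrative): cast into a new column on the same dataframe instead of overwriting it, so no comparison between df1 and df1_copy is needed:

from pyspark.sql import functions as F

df2 = (
    df1
    .withColumn("num1_double", F.col("num1").cast("double"))
    .withColumn("invalid", F.when(F.col("num1_double").isNull(), F.col("num1")))
)
# "invalid" now holds the original string where the cast failed, and null elsewhere.
df2.show()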
-
spark-shell commands throwing error: “error: not found: value spark”
:14: error: not found: value spark
       import spark.implicits._
              ^
:14: error: not found: value spark
       import spark.sql
              ^
Here is my environment configuration. I have tried different configurations many times, but I keep getting this error. Does anyone know the reason? I saw a similar question, but the answers did not solve my problem.
JAVA_HOME : C:\Program Files\Java\jdk1.8.0_51
HADOOP_HOME : C:\Hadoop\winutils-master\hadoop-2.7.1
SPARK_HOME : C:\Hadoop\spark-2.2.0-bin-hadoop2.7
PATH :%JAVA_HOME%\bin;%SCALA_HOME%\bin;%HADOOP_HOME%\bin;%SPARK_HOME%\bin;