U-SQL equivalent of HASHBYTES to get hash of complete row
We have to track every change to a data row. I was looking for a built-in function, like SQL Server's HASHBYTES, to compute a hash of the complete row together with a timestamp, so that we can detect changes and pick up the latest version. Any pointer would be highly helpful.
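There is no built-in HASHBYTES in U-SQL, but U-SQL expressions are C#, so the .NET cryptography classes can be called inline. A minimal sketch, assuming an input rowset @input with hypothetical string columns Col1 and Col2 (joined with a delimiter so that ("ab","c") and ("a","bc") hash differently):

// Sketch only: @input, Col1 and Col2 are hypothetical names.
@withHash =
    SELECT Col1,
           Col2,
           // SHA256 lives in mscorlib, which U-SQL references by default.
           Convert.ToBase64String(
               System.Security.Cryptography.SHA256.Create().ComputeHash(
                   System.Text.Encoding.UTF8.GetBytes(Col1 + "|" + Col2))) AS RowHash,
           DateTime.UtcNow AS LoadTimestamp
    FROM @input;

Comparing RowHash against the previously stored value then flags changed rows, and LoadTimestamp orders the versions.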
See also questions close to this topic
Can Apache Alluxio use Azure Data Lake as an under store?
I have created an HDInsight cluster (Spark 2.2, HDI 3.6) that reads data from Azure Data Lake. Users will run Spark SQL on it, and I want to use Alluxio as a cache to speed up queries. After some research I found that Azure Blob Storage is supported: http://www.alluxio.org/docs/1.7/en/Configuring-Alluxio-with-Azure-Blob-Store.html. I am wondering whether Azure Data Lake is also supported?
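For comparison, the linked 1.7 docs wire up the Azure Blob under store in conf/alluxio-site.properties roughly like this; an Azure Data Lake under store would need an analogous adl:// address plus OAuth credentials, which is exactly the part the 1.7 docs do not cover:

# From the Azure Blob Store guide (container, account and key are placeholders):
alluxio.underfs.address=wasb://<AZURE_CONTAINER>@<AZURE_ACCOUNT>.blob.core.windows.net/<AZURE_DIRECTORY>/
fs.azure.account.key.<AZURE_ACCOUNT>.blob.core.windows.net=<ACCESS_KEY>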
Error while trying to read file in Data Lake storage
In my Azure Data Lake Store I am trying to read a file that I imported using an Azure Data Factory v2 pipeline.
Although I am logged in with the same credentials I used to create the Data Factory, the Data Factory's App Registration, and the Data Lake itself, I get the following error message:
MESSAGE: OPEN failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.). [1a8ca11b-d726-468a-9aeb-d8ef3d93a81d] failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.). [1a8ca11b-d726-468a-9aeb-d8ef3d93a81d][2018-06-19T07:45:23.8686252-07:00]
My first thought was that this obviously has something to do with access permissions. So, just out of curiosity, I gave Read, Write and Execute access to 'Everyone else' on the Access page of the folder holding my file. Interestingly enough, the same error still occurs.
The integration runtime (IR) I use was auto-selected during creation and is called 'AutoResolveIntegrationRuntime'.
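A detail that often explains this error: Data Factory reads the lake as its service principal (the App Registration), not as the signed-in portal user, and ADLS ACL checks require execute (x) on every folder from the root down to the file plus read (r) on the file itself. Granting 'Everyone else' rwx on one folder also does not update the ACLs of files already inside it unless the change is applied to children. A hedged example of granting the service principal access with the Azure CLI (account name, path and object id are placeholders):

az dls fs access set-entry --account myadlsaccount --path /myfolder --acl-spec user:<sp-object-id>:r-x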
Spark on HDInsight - No FileSystem for scheme: adl
I am writing an application that processes files from ADLS. When I read the files from the cluster by running the code within spark-shell, it has no problem accessing them. However, when I attempt to sbt run the project on the cluster, it gives me:
[error] java.io.IOException: No FileSystem for scheme: adl
import org.apache.spark.sql.SparkSession

implicit val spark = SparkSession.builder().master("local[*]").appName("AppMain").getOrCreate()
import spark.implicits._

val listOfFiles = spark.sparkContext.binaryFiles("adl://adlAddressHere/FolderHere/")
val fileList = listOfFiles.collect()
This is Spark 2.2 on HDI 3.6.
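One possibility (not confirmed by the question itself): spark-shell on HDInsight picks up the cluster's Hadoop configuration and the hadoop-azure-datalake jar, while a jar launched via sbt run with master("local[*]") does not. A sketch that registers the adl:// filesystem classes explicitly, assuming hadoop-azure-datalake and the ADLS SDK are on the application's classpath:

import org.apache.spark.sql.SparkSession

// Sketch: register the adl:// implementations so local mode can resolve the scheme.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("AppMain")
  .config("spark.hadoop.fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
  .config("spark.hadoop.fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl")
  .getOrCreate()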
U-SQL Dynamic Folder Name/File Name in OUTPUT
I will be invoking this U-SQL from an ADF pipeline. How can I have a dynamic OUTPUT_FILE location, derived from the input file? The intention is to encrypt some of the PII columns and copy the file to a different folder.
But I am getting this error: E_CSC_USER_EXPRESSIONNOTCONSTANTFOLDABLE: Expression cannot be constant folded. Description: The expression cannot be evaluated at compile time. Resolution: Use a constant expression or a CONST parameter. (USQLApplication1, C:\Users\admin\source\repos\FunctionApp2\USQLApplication1\Script3.usql, line 44)
//DECLARE EXTERNAL @INPUT_FILE string = "adl://XXXX.azuredatalakestore.net/replicadbdata/Opportunity2/2018/06/22/16/2018-06-22-16.csv";
DECLARE EXTERNAL @INPUT_FILE string = @"C:/Users/admin/Downloads/2018-06-22-16.csv";

DECLARE @File_Name string = USQLApplication1.StringFunction.getFileName(@INPUT_FILE);
DECLARE @Folder_Name string = USQLApplication1.StringFunction.getFolderName(@INPUT_FILE);

// The line below generates the error
DECLARE @OUTPUT_FILENAME string = @Folder_Name + "Encrypted_" + @File_Name;
//DECLARE @OUTPUT_FILENAME string = @"C:/Users/admin/Downloads/Encrypted_2018-06-22-16.csv";
. . . . .
OUTPUT @Encrypted TO @OUTPUT_FILENAME USING Outputters.Csv(outputHeader:true,quoting:true);
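The error message points at the cause: the TO path of an OUTPUT must be constant-foldable at compile time, and a call into code-behind (USQLApplication1.StringFunction) is not. Since the script is invoked from ADF anyway, one workaround sketch is to derive the folder and file name in the ADF pipeline expression and pass the finished path in as a parameter; a DECLARE EXTERNAL default can be overridden by the caller yet still counts as a compile-time constant:

// Sketch: ADF passes the full output path; the literal below is only a default.
DECLARE EXTERNAL @OUTPUT_FILE string = "adl://XXXX.azuredatalakestore.net/encrypted/Encrypted_2018-06-22-16.csv";

OUTPUT @Encrypted
TO @OUTPUT_FILE
USING Outputters.Csv(outputHeader : true, quoting : true);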
USQL - SQL.ARRAY get length?
For context, I get data from a sensor, and it is stored in a string like this:
"axes" : "...,1,23,21,0,12,10,212,12,..."
The size may change depending on the machine sending the data... So my goal is to store it as a SQL.ARRAY, and I want to get the size of this array later for some business reporting.
Is there a way to find the length of a SQL.ARRAY?
@outputfile =
    SELECT m.MachineID,
           COUNT(*) AS nbAxesArray
    FROM MachineInfos AS m
    JOIN LoadDataAxes AS lda ON m.EventIoTID == lda.EventIoTID
    //WHERE getLength(lda.L) == 0 // something like this
    GROUP BY m.MachineID;
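Since SQL.ARRAY&lt;T&gt; is an ordinary .NET type, its Count property can be used directly in an expression. A sketch, assuming lda.L is the SQL.ARRAY column holding the axes:

// Sketch: lda.L is assumed to be a SQL.ARRAY column.
@outputfile =
    SELECT m.MachineID,
           COUNT(*) AS nbAxesArray
    FROM MachineInfos AS m
    JOIN LoadDataAxes AS lda ON m.EventIoTID == lda.EventIoTID
    WHERE lda.L.Count == 0
    GROUP BY m.MachineID;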
Azure Data Lake file properties and/or checksum
I'm trying to write a process that will skip some processing jobs when the data in a file has not changed, and I would like to do this via a checksum. Is there any way (currently or on the roadmap) to get visibility into a file's MD5 checksum or similar?
Alternatively, can I tag a file with a "property" such as a checksum of the file?
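If the store does not surface a checksum, one workaround is to compute it on the client at upload time and record it yourself, e.g. in a companion file or a catalog table keyed by path. A minimal C# sketch using only the standard library (where the result is stored is up to the pipeline):

using System;
using System.IO;
using System.Security.Cryptography;

class FileChecksum
{
    // Stream the file through MD5 so large files need not fit in memory;
    // the hex string can be compared against the value recorded on the last run.
    static string Md5OfFile(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    static void Main(string[] args)
    {
        Console.WriteLine(Md5OfFile(args[0]));
    }
}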