SSIS designer Visual Studio foreign keys integration
I need to integrate two similar databases into a third database, DB3. DB3 is almost the same as DB1.
First database, DB1:
- Addresses table, with a primary key
- People table, with a primary key (PersonId) and a foreign key (IdAddress) referencing Addresses
Second database, DB2:
- It is pretty similar, but in another language
Data flows smoothly from DB1 to DB3, table after table. For example, I end up with 1000 records in the DB3 table named Addresses from DB1, and 1000 records in the table named People from DB1.
Suppose the Person numbered 30 in DB3 (after transferring from DB1) has IdAddress 20.
The Address with ID 40 in DB2 gets ID 1040 in DB3, and the Person with ID 30 in DB2 gets ID 1030 in DB3.
So while transferring the People table from DB2 to DB3, we need to know that the address ID is no longer 40 but 1040.
I'm trying to use a Lookup to find the existing record, but I'm a newbie in the SSIS Visual Studio designer. Could you help me? How can I resolve this problem?
You can do this using the Lookup Transformation component, as you mentioned, but first you have to:
- Select the basic information from each table that can distinguish each logical entity. For example, if talking about Persons you can choose the Date Of Birth, ...
- After selecting these attributes, add a Lookup Transformation
- Map these columns between the Source and the Lookup table
- Select the ID column (from the Lookup table) as Output and rename the column
- Choose the Ignore failure option to handle the non-match situation
- After doing these steps, if the same person was inserted previously, you will get its ID in the Lookup output column
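To make the key remapping concrete, here is a minimal plain-Python sketch of what the Lookup Transformation does with the surrogate keys from the question. The dictionary stands in for the Lookup's reference table; the column names come from the question, and the table contents are illustrative:

```python
# Addresses already migrated from DB2 into DB3: old DB2 id -> new DB3 id.
# In SSIS this mapping is the Lookup's reference table, not a dict.
address_id_map = {40: 1040}

# A People row arriving from DB2, still carrying the DB2 address id
db2_person = {"Id": 30, "IdAddress": 40}

def remap_person(person, id_map):
    """Replace the source-system foreign key with the DB3 surrogate key."""
    new_person = dict(person)
    # Equivalent of the Lookup: match on the old key, output the new one
    new_person["IdAddress"] = id_map[person["IdAddress"]]
    return new_person

db3_person = remap_person(db2_person, address_id_map)
print(db3_person)  # {'Id': 30, 'IdAddress': 1040}
```

In SSIS, the same effect comes from loading Addresses first, then matching on the old key in the Lookup and sending the new DB3 ID downstream.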
See also questions close to this topic
SQL Server Foreign Key Problem With Tables
Today I come with a quick question. I'm creating a DB for a shoe store. After inserting a few rows into my first table, I went on to add a few rows to my second table (I only have 2 tables), and upon running the code I got this error:
Msg 2627, Level 14, State 1, Line 1
Violation of UNIQUE KEY constraint 'ForeignKey'. Cannot insert duplicate key in object 'dbo.Product'. The duplicate key value is (reebok).
I'm not sure what is going on, as the "categoryid" is the same in both tables. Here is the first statement I executed, then the second one that gave me the error message:
INSERT INTO [myStore].[dbo].[category] ([categoryid], [description])
VALUES ('puma', 'men'),
       ('nike', 'women'),
       ('reebok', 'children')
Table 2 (caused the error message)
INSERT INTO [myStore].[dbo].[product] ([productid], [description], [categoryid], [price], [size])
VALUES (1, 'Running Shoes', 'puma', 70, 'L'),
       (2, 'Slides', 'nike', 45, 'S'),
       (3, 'Kids Soccer Shoes', 'reebok', 55, 'M'),
       (4, 'Kids Football Shoes', 'reebok', 40, 'L'),
       (5, 'Basketball Shoes', 'nike', 90, 'S')
Can this dynamic SQL be replaced with the regular SQL to improve performance?
I am using the following dynamic query, but I can see that its performance is slow. I am not a big fan of dynamic SQL, and I am looking for, if possible, a good, clean, and fast SQL alternative to the following. Thanks a million in advance! Here are some details:
In the following code, the final table missingfields_xxxx lists the rows where a rule field is missing. table_name has a column rule that holds the column name of the table trans_modelname (this table can be found in the dynamic part of the SQL).
DECLARE @rule NVARCHAR(MAX)
DECLARE @PeriodNumber INT = 1
DECLARE @SelectList NVARCHAR(MAX)
DECLARE @WhereList NVARCHAR(MAX)
DECLARE @SQL NVARCHAR(MAX)
DECLARE @ModelName AS NVARCHAR(MAX) = 'modelname'
--DECLARE @MaxPeriods INT = 8
DECLARE @MaxPeriods INT

SELECT @MaxPeriods = COUNT(*)
FROM (SELECT [rule] FROM table_name WHERE ModelName = @ModelName) ab

DECLARE db_cursor3 CURSOR FOR
    SELECT * FROM (SELECT [rule] FROM table_name WHERE ModelName = @ModelName) cd
OPEN db_cursor3
FETCH NEXT FROM db_cursor3 INTO @rule
WHILE @@FETCH_STATUS = 0
BEGIN
    SELECT @SelectList = COALESCE(@SelectList + ', ', '') + @rule
        + ' AS [GLSegment_' + RIGHT('00' + CAST(@PeriodNumber AS VARCHAR), 3) + ']'
    SELECT @SelectList AS 'Selectlist'

    IF @PeriodNumber < @MaxPeriods
    BEGIN
        SELECT @WhereList = COALESCE(@WhereList, '')
            + '(isnull([GLSegment_' + RIGHT('00' + CAST(@PeriodNumber AS VARCHAR), 3) + '],'''') = '''' ) OR '
        SELECT @WhereList AS 'Wherelist where periodnumber < maxperiods'
    END
    ELSE IF @PeriodNumber = @MaxPeriods
    BEGIN
        SELECT @WhereList = COALESCE(@WhereList, '')
            + '(isnull([GLSegment_' + RIGHT('00' + CAST(@PeriodNumber AS VARCHAR), 3) + '], '''') = '''' )'
        SELECT @WhereList AS 'Wherelist where periodnumber = maxperiods'
    END

    SET @PeriodNumber = @PeriodNumber + 1
    FETCH NEXT FROM db_cursor3 INTO @rule
END
CLOSE db_cursor3
DEALLOCATE db_cursor3

-- build dynamic query
SET @SQL = 'SELECT * into missingfields_' + @ModelName
    + ' from trans_' + @ModelName
    + ' WHERE id in ( SELECT id from ( SELECT id, ' + @SelectList
    + ' FROM trans_' + @ModelName + ')A WHERE ' + @WhereList + ' );'
    + ' SELECT * from missingfields_' + @ModelName
PRINT @SQL
PRINT 'missingfields_' + @ModelName
EXEC sp_executesql @SQL
How can I convert days into weeks and days
I have a date range, say 1-Jul-2016 to 10-Jul-2016. I want to extract the number of weeks and the remaining days in that range, like
No. of Weeks = 1
No. of Remaining Days = 3
So for this case the answer I want to see is 1.3. I know how to find the difference in days OR weeks using
But how do I find out whether a date range contains only full weeks, or weeks and days?
Any help will be highly appreciated.
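If the range is counted inclusively (1-Jul through 10-Jul is 10 days), the split into weeks and remaining days is a single divmod. A Python sketch; the inclusive count is an assumption on my part, chosen because it reproduces the 1 week / 3 days in the question:

```python
from datetime import date

def weeks_and_days(start, end):
    """Split an inclusive date range into (full weeks, remaining days)."""
    total_days = (end - start).days + 1  # +1 makes both endpoints count
    return divmod(total_days, 7)

weeks, days = weeks_and_days(date(2016, 7, 1), date(2016, 7, 10))
print(f"{weeks}.{days}")  # 1.3
```

The same divmod idea carries over to SQL: take the day difference, divide by 7 for the weeks, and use the modulo for the leftover days.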
Loading data from Excel Source to SQL Server Table using agent Job in SSIS
I have created a package to load data from an Excel source into a SQL Server table. It loads the correct data when the package is executed locally, but it causes an issue when I run the package through an Agent Job: it loads only partial data from the Excel file. Also, when I cut those skipped records and paste them into the starting rows of the file, it loads the data correctly. Can anyone help me out with this? Note that I am reading the Excel file using a Script Component, as the sheet name is dynamic.
SSIS or Sql job dynamic start time
How do I configure a SQL job or SSIS package to fire/run at different times defined in a column of a table?
I have a table for a notification system, where a note's appearance depends on its status (is it Read / Updated / Closed, or delayed in any of those statuses). I need a job/SSIS package to trigger every 10 minutes when the isOpened flag = 0; when isOpened = 1 and isUpdated = 0, run the job/SSIS package every 15 minutes, starting 30 minutes after isOpened turned to 1.
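One way to express the rules above is a single job that runs on a short fixed schedule and decides per row whether work is due. A sketch of that per-row decision in Python; the flag names are taken from the question, the function name and signature are my own:

```python
from datetime import datetime, timedelta

def next_check_interval(is_opened, is_updated, opened_at, now):
    """Return the polling interval implied by a notification's state,
    or None when no run is due for that state (thresholds from the question)."""
    if is_opened == 0:
        return timedelta(minutes=10)
    if is_opened == 1 and is_updated == 0 and now - opened_at >= timedelta(minutes=30):
        return timedelta(minutes=15)
    return None

now = datetime(2024, 1, 1, 12, 0)
print(next_check_interval(0, 0, now, now))                          # 0:10:00
print(next_check_interval(1, 0, now - timedelta(minutes=45), now))  # 0:15:00
```

The same branching could live in a T-SQL stored procedure called by one frequently scheduled Agent job, which avoids rescheduling the job itself per row.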
The basics of deciding between SSIS vs openquery?
I work in an OLAP SQL Server environment that relies on T-SQL's openquery function for routine data integration from an Oracle Data Warehouse.
It was a surprise to me that no other ETL tool was used, but the process works and has been in place for over a decade. In exploring better (or newer) practices, we benchmarked and tested using SSIS instead of openquery. Max buffer size and number of rows per buffer were both fine tuned for the table to be moved. Yet, SSIS did not benchmark higher than openquery - both performed similarly.
Some online research (which may be outdated; it was from a few years ago) suggested either using an affinity driver or tweaking a connection string property called FetchSize, but we have not followed up on these settings.
Besides the size of the data and the latency between server and client, what are the biggest factors that affect transfer speeds? Is there something specialized I can configure in SSIS to speed up the Oracle download?
Query Cassandra UDT via Spark SQL
We'd like to query data from a Cassandra DB via Spark SQL. The problem is that the data is stored in Cassandra as a UDT. The structure of the UDT is deeply nested and it contains arrays of variable length, so it would be very difficult to decompose the data into a flat structure. I couldn't find any working example of how to query such UDTs via Spark SQL - especially how to filter the results based on UDT values.
Alternatively, could you suggest different ETL pipeline (Query engine, Storage engine, ...), which would be more suitable for our use-case ?
Our ETL pipeline:
Kafka (duplicated events) -> Spark streaming -> Cassandra (deduplication to store only latest event) <- Spark SQL <- analytics platform (UI)
Solutions we've tried so far:
1) Kafka -> Spark -> Parquet <- Apache Drill
Everything worked well, we could query and filter arrays and nested data structures.
Problem: couldn't deduplicate data (rewrite parquet files with latest events)
2) Kafka -> Spark -> Cassandra <- Presto
Solved problem 1) with data deduplication.
Our main requirements are:
- support for data deduplication. We may receive many events with same ID and we need to store only the latest one.
- storing deeply nested data structures with arrays
- distributed storage, scalable for future expansion
- distributed query engine with SQL-like query support (for connection with Zeppelin, Tableau, Qlik, ... ). The query doesn't have to run in real time, few minutes delay is acceptable.
- support for schema evolution (AVRO style)
Thank you for any suggestions
Strategies to prevent duplicate data in Azure SQL Data Warehouse
At the moment I am setting up an Azure SQL Data Warehouse. I am using Databricks for the ETL process with JSON-files from Azure Blob Storage.
What is the best practice to make sure to not import duplicate dimensions or facts into the Azure SQL Data Warehouse?
This could happen for facts, e.g. in the case of an exception during the loading process. And for dimensions this could happen as well if I did not check which data already exist. I am using the following code to import data into the data warehouse, and I found no "mode" which would import only data that does not already exist:
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

renamedColumnsDf.write
  .format("com.databricks.spark.sqldw")
  .option("url", sqlDwUrlSmall)
  .option("dbtable", "SampleTable")
  .option("forward_spark_azure_storage_credentials", "True")
  .option("tempdir", tempDir)
  .mode("overwrite")
  .save()
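A common pattern is to deduplicate in the ETL layer before the write: keep only the most recent row per business key, then load. Here is a plain-Python sketch of that keep-the-latest logic (the field names `id` and `updated_at` are assumptions, not taken from the question):

```python
def keep_latest(rows, key="id", version="updated_at"):
    """Keep only the most recent row per business key."""
    latest = {}
    for row in rows:
        k = row[key]
        # Overwrite only when this row is newer than what we have
        if k not in latest or row[version] > latest[k][version]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"id": 1, "updated_at": 1, "value": "old"},
    {"id": 1, "updated_at": 2, "value": "new"},
    {"id": 2, "updated_at": 1, "value": "only"},
]
print(keep_latest(rows))
```

In Databricks the same effect can be achieved on the DataFrame before `.write`, and the warehouse side can additionally merge via a staging table so a retried load cannot insert the same fact twice.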
Amazon Glue - How can I manage foreign keys in MySQL
I want to migrate CSV data to MySQL tables using AWS Glue. The CSV has unorganized data, and the MySQL tables are organized with foreign key relationships. I want to insert the foreign key id instead of the string value while running the ETL job.
Is this possible, or is there any workaround?
Multivariate Gaussian function on a discretized continuous space with periodicity
I built a two-dimensional discrete space from a two-dimensional continuous space.
The first dimension got mapped from [-pi, pi] to [0 ... n] with n equidistant discrete numbers; it should be 2π-periodic and represents the position on a circle. The second dimension got mapped from [-k, k] to [0 ... m] with equidistant numbers. It represents the angular velocity and should therefore not be periodic.
Now I am trying to add some Gaussian uncertainty about where I actually am on my circle, or what velocity I have.
Therefore I would like to return all discrete positions which are in the [-3σ, +3σ] range of the Gaussian, with mean μ being the current position, in the 1-dimensional case. First question: can I just add/subtract the covariance matrix Σ (times 2 or 3) in the multivariate case to get an equivalent 2-dimensional range?
Second: I really struggle to deal with the periodicity at the ends of the range of my first dimension. My idea was to run a loop over the range:
positions = []
for p in np.arange(mu - 3.0 * sigma, mu + 3.0 * sigma, granularity):
    discrete_position = discretization(p)
    if discrete_position not in [entry[0] for entry in positions]:
        positions.append([discrete_position, gaussian(discrete_position, mu, sigma)])
    else:
        index = [entry[0] for entry in positions].index(discrete_position)
        positions[index][1] += gaussian(discrete_position, mu, sigma)
which should give me my discrete positions inside the range and their approximately correct probabilities if I choose the granularity to be very small.
My problem is the periodicity (1-dim case): if I look at position x=3 and σ = 0.1, I would have to evaluate the range [2.7, 3.3]. The part of the array bigger than π would have to be mapped to a discrete position of [-3.1, -3.0, -2.9, ...], which would not result in a correct probability if I plug it into the Gaussian. I would therefore need to map it to [3.2, 3.3, 3.4, ...]. This all gets very messy, as I have to save the position, the discrete position, and the position I can plug into my Gaussian. (I also need to map it to only positive integers later on, as I want to apply some lookup-table methods.) I would really appreciate it if there is a nicer way to do it, maybe even without tuning another hyperparameter (granularity). Cheers!
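One way around the bookkeeping is to never map the sample points themselves: evaluate the Gaussian on the raw, un-wrapped offset from the mean, and wrap only when converting to a discrete bin index. A 1-D sketch; the bin count and the particular discretization are assumptions for illustration:

```python
import math

N = 63  # number of discrete bins on the circle (assumption)

def wrap(angle):
    """Wrap an angle into [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def discretize(angle):
    """Map an angle to a non-negative bin index in [0, N), wrapping as needed."""
    return int((wrap(angle) + math.pi) / (2 * math.pi) * N)

def smeared(mu, sigma, granularity=0.01):
    """Accumulate Gaussian mass per discrete bin over [mu-3sigma, mu+3sigma]."""
    bins = {}
    p = mu - 3.0 * sigma
    while p <= mu + 3.0 * sigma:
        # Gaussian is evaluated on the raw p; wrapping happens only in discretize
        b = discretize(p)
        bins[b] = bins.get(b, 0.0) + gaussian(p, mu, sigma)
        p += granularity
    return bins

print(sorted(smeared(3.0, 0.1)))
```

With this split there is only one position to track, the probabilities stay correct past π, and the bin indices are already non-negative integers for the lookup table.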
A dynamic formula to output row based on another condition of another row
My problem can be explained like this:
I have two lists (Column B & Column D)
I have another column that corresponds to values in Column B (Column A)
Column A | Column B | Column C | Column D | Column E | Column F   | Column G
1        | Blue     |          | Blue     |          |            | OUTPUT SQL
2        | Indigo   |          |          | BB       | BluePeople | OUTPUT SQL
3        | Orange   |          |          | BB-B     | BluerPeps  | OUTPUT SQL
4        | Red      |          | Red      | RR       | RedPeople  | OUTPUT SQL
You can ignore the blank column c, it's there for my sanity.
The sql that I'm basically writing goes like this:
=IF(AND(LEN(E2)> 0, LEN(F2)>0),"UPDATE Color SET ColorCode = " & E3 & " WHERE Description = " & F3 & IF(COUNTIF($D:$D,B2), " AND ColorID = " & VLOOKUP(B2,$D:$D,A2), " AND COLORID = " & A2), "New Color Below for List: "&IF(LEN(D2)>0, D2, "Hmm, Error? Should Never Get Here."))
Which only works for the very first output (in Column G1). I'm trying to write it so that when the value in D changes:
- Check if that text exists in the B list & if it does:
- Output that corresponding value from list A.
For example, the first iteration (AFTER ROW 1...so starting on row 2) would look something like:
UPDATE Color SET ColorCode = BB WHERE Description = BluePeople AND ColorID = 1.
The next two iterations would be something like:
UPDATE Color SET ColorCode = BB-B WHERE Description = BluerPeps AND ColorID = 1
UPDATE Color SET ColorCode = RR WHERE Description = RedPeople AND ColorID = 4
My formula above is the result of a few hours of trying to piece together this monster. I've read a few questions on this site asking similar inquiries (most notably this one: Find something in column A then show the value of B for that row in Excel 2010), but I became somewhat perplexed about how to apply that solution to my problem.
I'm no expert in Excel, so any help would really be appreciated, and thank you in advance!
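The row-by-row logic the formula encodes is easier to see outside Excel. A Python sketch of the same generation, using the sample data from the table above (the dict and list shapes are illustrative stand-ins for columns A/B and D/E/F):

```python
def make_update(color_code, description, color_id):
    """Build one UPDATE statement, like column G's formula."""
    return (f"UPDATE Color SET ColorCode = {color_code} "
            f"WHERE Description = {description} AND ColorID = {color_id}")

# Column B values with their column A IDs
colors = {"Blue": 1, "Indigo": 2, "Orange": 3, "Red": 4}

# One entry per row: columns D, E, F
rows = [
    {"D": "Blue", "E": "BB",   "F": "BluePeople"},
    {"D": "",     "E": "BB-B", "F": "BluerPeps"},
    {"D": "Red",  "E": "RR",   "F": "RedPeople"},
]

current_id = None
for row in rows:
    if row["D"]:                      # value in D changed: look it up in the B list
        current_id = colors[row["D"]]
    print(make_update(row["E"], row["F"], current_id))
```

The key step is carrying `current_id` forward while D is blank; in Excel that usually means a helper column that repeats the last non-blank D match, which the G formula can then reference directly.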
Excel help: How to fill a cell with data from a table dependent on multiple variables
I have a report sheet I made in Excel that has age, gender, and skinfold measurements (in mm), and two tables displaying body fat percentages depending on skinfold measurement, age, and gender.
I would like a cell to report the body fat percentage depending on other cells (age, gender, skinfold).
So far I have the following equation:
B18 = 9.2
The (male) table being looked up has skinfold in mm in the left column and age ranges across the top (17-19, 20-29, 30-39, 40-49, 50+). At the moment the formula finds the closest value (10) in the 17-19 age column; however, I want the formula to find the skinfold first and then, depending on the age range, return the corresponding percentage.
The cell highlighted has the current formula in.
Deleting records from a table in Talend
I am writing a job to integrate data from different sources into some tables in an Oracle DB using Talend Open Studio for Data Integration (188.8.131.5281026_1147). The schema I am integrating these data into is almost saturated, so before I integrate the data into some of the tables I want to clear the data in them, while for others I want to delete only some of the records based on some conditions.
I have searched for how to achieve this in the Talend community, but I have not gotten any resourceful answer. I have also tried to use the tOracleRow and tOracleOutput components, but my solutions are not working. I think this should be possible with tOracleOutput, since it is possible to insert, update, and delete with that component.
When I run the job with a query like
DELETE FROM TEST WHERE WEEK='44'
I get the error message shown below. But I am able to run the query and delete the data using my TOAD client, as I have the privilege to do so on the schema I am connected to.
So my question is how does one delete data in Talend or how does one open a connection with all the privileges in Talend.
Kettle - pan.sh "No repository provided, can't load transformation"
I've created a Kettle transformation and tested it on my PC, where it works. However, I moved it to the server to start it as a bash script via pan.sh. It was working, but after a few runs it started to give this problem.
server$ bash pan.sh file="API_Mining_LatestVersion.ktr"
#######################################################################
WARNING: no libwebkitgtk-1.0 detected, some features will be unavailable
Consider installing the package with apt-get or yum.
e.g. 'sudo apt-get install libwebkitgtk-1.0-0'
#######################################################################
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
14:56:00,682 INFO [KarafBoot] Checking to see if org.pentaho.clean.karaf.cache is enabled
14:56:00,803 INFO [KarafInstance]
*******************************************************************************
*** Karaf Instance Number: 2 at /data/Fernando/data-integration_updated/./system/karaf/caches/pan/data-1
*** FastBin Provider Port: 52902
*** Karaf Port: 8803
*** OSGI Service Port: 9052
*******************************************************************************
Nov 20, 2018 2:56:01 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
INFO: Lock acquired.
Setting startlevel to 100
*ERROR* [org.osgi.service.cm.ManagedService, id=255, bundle=53/mvn:org.apache.aries.transaction/org.apache.aries.transaction.manager/1.1.1]: Updating configuration org.apache.aries.transaction caused a problem: null
org.osgi.service.cm.ConfigurationException: null : null
    at org.apache.aries.transaction.internal.TransactionManagerService.<init>(TransactionManagerService.java:136)
    at org.apache.aries.transaction.internal.Activator.updated(Activator.java:63)
    at org.apache.felix.cm.impl.helper.ManagedServiceTracker.updateService(ManagedServiceTracker.java:148)
    at org.apache.felix.cm.impl.helper.ManagedServiceTracker.provideConfiguration(ManagedServiceTracker.java:81)
    at org.apache.felix.cm.impl.ConfigurationManager$ManagedServiceUpdate.provide(ConfigurationManager.java:1448)
    at org.apache.felix.cm.impl.ConfigurationManager$ManagedServiceUpdate.run(ConfigurationManager.java:1404)
    at org.apache.felix.cm.impl.UpdateThread.run(UpdateThread.java:103)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.objectweb.howl.log.LogConfigurationException: Unable to obtain lock on /data/Fernando/data-integration/system/karaf/caches/pan/data-1/txlog/transaction_1.log
    at org.objectweb.howl.log.LogFile.open(LogFile.java:191)
    at org.objectweb.howl.log.LogFileManager.open(LogFileManager.java:784)
    at org.objectweb.howl.log.Logger.open(Logger.java:304)
    at org.objectweb.howl.log.xa.XALogger.open(XALogger.java:893)
    at org.apache.aries.transaction.internal.HOWLLog.doStart(HOWLLog.java:233)
    at org.apache.aries.transaction.internal.TransactionManagerService.<init>(TransactionManagerService.java:133)
    ... 7 more
2018-11-20 14:56:04.508:INFO:oejs.Server:jetty-8.1.15.v20140411
2018-11-20 14:56:04.544:INFO:oejs.AbstractConnector:Started NIOSocketConnectorWrapper@0.0.0.0:9052
[...]
INFO: New Caching Service registered
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/Fernando/data-integration_updated/launcher/../lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/Fernando/data-integration_updated/plugins/pentaho-big-data-plugin/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018/11/20 14:56:09 - Pan - Start of run.
ERROR: No repository provided, can't load transformation.
I don't understand where the problem is. The transformation file hasn't been changed, and it also contains the repo, user, and pass parameters.
How to sync data from hosted database to local database
I have a local central database and 6 hosted databases for clothing branches. My task is to collect all data daily from each branch database, then insert this data into the central database, related to the branch id.
What is the best practice for loading data from each branch into the central database? I have a lot of ideas, but I don't know whether any of them is the best! And I don't want duplicate data either.
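A common pattern for this is an incremental daily pull per branch plus an idempotent upsert keyed on (branch id, source row id), so re-running a day's load cannot create duplicates. A minimal in-memory sketch of that keying; all names and shapes here are assumptions for illustration:

```python
def upsert(central, branch_id, branch_rows):
    """Merge one branch's rows into the central store, keyed on (branch, id)."""
    for row in branch_rows:
        # Same branch + same source id always lands on the same key: no duplicates
        central[(branch_id, row["id"])] = {**row, "branch_id": branch_id}
    return central

central = {}
upsert(central, 1, [{"id": 10, "total": 5}])
upsert(central, 1, [{"id": 10, "total": 7}])   # re-run of the same row: overwrites
upsert(central, 2, [{"id": 10, "total": 3}])   # same id, different branch: kept
print(len(central))  # 2
```

On a real database the composite key becomes a unique constraint on (branch_id, source_id), and the load uses the engine's merge/upsert statement against it.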