Insert array from csv using pymongo

I have a CSV file with an array in string format, as below:

date,name,criteria
2018-05-16,John,"[{'age':35},{'birthyear':1983}]"
2018-05-16,Jane,"[{'age':36},{'birthyear':1982}]"

I am using Python with pandas and numpy to process this file.

I need to import it into a MongoDB collection in the following format:

{
   "date": "2018-05-16",
   "name": "John",
   "criteria": [
      {"age": 35},
      {"birthyear": 1983}
   ]
},
{
   "date": "2018-05-16",
   "name": "Jane",
   "criteria": [
      {"age": 36},
      {"birthyear": 1982}
   ]
}


I tried using a JSON formatter, but after insertion into MongoDB the array is still the same string as in the CSV file.

I have tried following approaches:

#Approach 1
import pymongo
from pymongo import MongoClient
import pandas as pd
import numpy as np
import json
from datetime import datetime

df = pd.read_csv("file.csv")
records = json.loads(df.T.to_json()).values()
db.tmp_collection.insert_many(records)



#Approach 2
import pymongo
from pymongo import MongoClient
import pandas as pd
import numpy as np
import json
from datetime import datetime

df = pd.read_csv("file.csv")
data_json = json.loads(df.to_json(orient='records'))
db.tmp_collection.insert_many(data_json)

Both give the following output in the MongoDB collection:

{
"date" : "2018-05-16",
"name" : "John",
"criteria" : "[{age:35},{birthyear:1983}]"
}

Can you suggest a better way? P.S. I am new to Python.

Thanks in advance.

1 answer

  • answered 2018-05-16 07:03 Neil Lunn

    As noted, the main issue is that the JSON-"like" data within the "string" for criteria lacks quotes around the keys. With quotes properly in place you can parse the string into a list, and the data is then structured how you want.

    You could actually run re.sub() over each entry in the existing list and replace the criteria with the parsed version.

    Given the source data in the form you stated:

    date,name,criteria
    2018-05-16,John,"[{'age':35},{'birthyear':1983}]"
    2018-05-16,Jane,"[{'age':36},{'birthyear':1982}]"
    

    Then the important parts are:

    import re

    df = pd.read_csv("file.csv")
    records = json.loads(df.to_json(orient='records'))

    pattern = r"({|,)(?:\s*)(?:')?([A-Za-z_$\.][A-Za-z0-9_ \-\.$]*)(?:')?(?:\s*):"

    # Quote the keys so each string becomes valid JSON, then parse it
    records = [
        {**x, 'criteria': json.loads(re.sub(pattern, r'\1"\2":', x['criteria']))}
        for x in records
    ]
    

    This goes through each item in the list produced earlier and does a substitution on the "string" to quote the keys in the objects, then parses the now-valid JSON string into a list of dictionary objects.
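    A quick way to sanity-check the substitution is to run the pattern over a single criteria string on its own:

```python
import re
import json

pattern = r"({|,)(?:\s*)(?:')?([A-Za-z_$\.][A-Za-z0-9_ \-\.$]*)(?:')?(?:\s*):"

# Unquoted keys get double-quoted so json.loads() can parse the string
raw = "[{age:35},{birthyear:1983}]"
quoted = re.sub(pattern, r'\1"\2":', raw)   # '[{"age":35},{"birthyear":1983}]'
criteria = json.loads(quoted)               # [{'age': 35}, {'birthyear': 1983}]

# The single-quoted form shown in the question is handled the same way,
# since the pattern allows an optional ' on either side of the key
criteria2 = json.loads(re.sub(pattern, r'\1"\2":', "[{'age':36},{'birthyear':1982}]"))
```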

    That would make data like:

    [{'date': '2018-05-16',
      'name': 'John',
      'criteria': [{'age': 35}, {'birthyear': 1983}]},
     {'date': '2018-05-16',
      'name': 'Jane',
      'criteria': [{'age': 36}, {'birthyear': 1982}]}]
    

    You can then pass that list to insert_many() to create the documents in the collection, keeping the format you want.

    db.tmp_collection.insert_many(records)
    

    Attribution for the regular expression pattern used here goes to "regular expression to add double quotes around keys in javascript".

    Personally I would take that a little further and at least parse the date to a datetime:

    records = [
        {
            **x,
            'date': datetime.strptime(x['date'], '%Y-%m-%d'),
            'criteria': json.loads(re.sub(pattern, r'\1"\2":', x['criteria']))
        }
        for x in records
    ]
    

    MongoDB will use a BSON Date when inserted into the collection, and that's a lot more useful than a string.
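    As a sketch of why that matters, a date-range filter becomes straightforward once the field is a real datetime (the collection name below is the one from the question; the db handle is assumed to be set up as in your snippets):

```python
from datetime import datetime

# Range filter matching every document dated in May 2018; this works
# directly against a BSON Date field, and you also get date operators
# like $dateToString in aggregations for free
may_2018 = {
    'date': {
        '$gte': datetime(2018, 5, 1),
        '$lt': datetime(2018, 6, 1),
    }
}

# Usage:
# db.tmp_collection.find(may_2018)
```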

    And again, "personally" I would not be using "named keys" inside a list for MongoDB. Instead I would rather "remap" to something more standard like "k" and "v" as in:

    records = [
        {
            **x,
            'date': datetime.strptime(x['date'], '%Y-%m-%d'),
            'criteria': [
                {'k': k, 'v': v}
                for d in json.loads(re.sub(pattern, r'\1"\2":', x['criteria']))
                for k, v in d.items()
            ]
        }
        for x in records
    ]
    

    Which gives a structure like:

    [{'date': datetime.datetime(2018, 5, 16, 0, 0),
      'name': 'John',
      'criteria': [{'k': 'age', 'v': 35}, {'k': 'birthyear', 'v': 1983}]},
     {'date': datetime.datetime(2018, 5, 16, 0, 0),
      'name': 'Jane',
      'criteria': [{'k': 'age', 'v': 36}, {'k': 'birthyear', 'v': 1982}]}]
    

    With the main reason being that "querying" with MongoDB is going to be a lot more useful if the path is more consistent like that.
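    For example, with the "k"/"v" layout one query shape covers any criterion (a sketch; it assumes the remapped documents were inserted as above):

```python
# One consistent path queries any attribute; with "named keys" you would
# need a different field path ('criteria.age', 'criteria.birthyear', ...)
# for each one. $elemMatch makes both conditions apply to the same element.
age_at_least_36 = {
    'criteria': {
        '$elemMatch': {'k': 'age', 'v': {'$gte': 36}}
    }
}

# Usage:
# db.tmp_collection.find(age_at_least_36)
```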