UnicodeEncodeError: 'charmap' codec can't encode character '\u011b' in position 57: character maps to <undefined> (but unable to use UTF-8)

This question has been asked several times before, and each time the answer was to "just add UTF-8". As far as I can tell, that fix does not work in my case. My program scrapes data from a website, and the data contains European characters such as "č, š, ř". After adding encoding="UTF-8" the error is gone, but the resulting CSV file shows garbled characters wherever those characters should appear, which makes the whole file unusable.

I wasn't able to find a solution online and I'm not sure how to deal with this. I need to write those special characters to the file. The script also needs to be cross-platform; I don't want it to be Windows-specific just for the sake of "getting rid of the error".

This is my code:

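import csv

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec

# `driver` (a Selenium WebDriver) and `wait` (a WebDriverWait) are created earlier in the script.
with open('links.csv') as read: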
    reader = csv.reader(read)
    link_list = list(reader)
    with open('ScrapedContent.csv', 'w+', newline='') as write:
        writer = csv.writer(write)
        for link in link_list:
            driver.get(', '.join(link))
            title = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "h1.page-title span.text.ng-binding")))
            offers = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "a.switcher.ng-binding.ng-scope span.ng-binding.ng-scope")))
            address = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "tr.c-aginfo__table__row td.ng-binding")))
            # Default to empty strings so a missing element does not reuse the
            # value from the previous link (or raise NameError on the first one).
            phone_number = ""
            email = ""
            try:
                wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "button.value.link.ng-binding.ng-scope"))).click()
                phone_number = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "span.phone.ng-binding"))).text
            except TimeoutException:
                pass
            try:
                wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "button.value.link.ng-binding"))).click()
                email = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "a.value.link.ng-binding"))).text
            except TimeoutException:
                pass
            print(title.text, " ", offers.text, " ", address.text, " ", phone_number, " ", email)
            writer.writerow([title.text, offers.text, address.text, phone_number, email])
        driver.quit()

I couldn't find any errors in the code that could cause this to happen in the first place. I am thankful for any suggestions on how to fix this!

1 answer

  • answered 2021-02-22 23:16 Balduin

    Do the files look correct when you don't add UTF-8? What encoding are they in?

    I once had a similar issue when scraping a webpage that returned data in a different encoding from the one stated in the response header, so requests decoded the text incorrectly.

    I ended up with the following function that solved it for me:

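    import requests

    def _load_xml_content(url):
        """Load XML content from URL, ensuring the encoding is correct."""
        response = requests.get(url)
        try:
            # response.text was decoded with response.encoding (taken from the
            # headers); undo that decode and reinterpret the bytes as UTF-8.
            xml = response.text.encode(response.encoding).decode('utf-8')
        except Exception:
            # Round trip failed (e.g. the text was already decoded correctly).
            xml = response.text
        return xml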
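
A minimal, network-free sketch of why that round trip can help, assuming the server really sent UTF-8 bytes while the declared encoding was ISO-8859-1. The sample string and the helper name `fix_mojibake` are made up for illustration and are not part of requests:

```python
def fix_mojibake(text, declared_encoding):
    """Undo a wrong decode: re-encode with the declared codec, then decode as UTF-8.

    Falls back to the original text if the round trip is not possible,
    mirroring the try/except in the function above.
    """
    try:
        return text.encode(declared_encoding).decode("utf-8")
    except (UnicodeError, LookupError, TypeError):
        return text


original = "Dobrý den, město"           # what the page really contains
raw_bytes = original.encode("utf-8")    # what the server sent (assumed UTF-8)

# Simulate a client trusting a wrong "ISO-8859-1" declaration.
garbled = raw_bytes.decode("iso-8859-1")

repaired = fix_mojibake(garbled, "iso-8859-1")
print(repr(garbled))
print(repaired)
```

If the text was decoded correctly in the first place, the round trip will usually raise and the text comes back unchanged, which is why the fallback branch matters.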
    

    To this day I'm not 100% sure what happened, but it might be worth a shot; maybe it solves it for you too.
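
For the CSV side of the question, here is a sketch under one assumption: that the "broken characters" show up when the file is opened in a spreadsheet program such as Excel, which may not assume UTF-8 for a CSV that lacks a byte order mark. Python's `utf-8-sig` codec writes that mark, and passing the encoding explicitly behaves the same on every platform. The sample rows and the temporary output path are made up:

```python
import csv
import os
import tempfile

rows = [["Název", "Adresa"], ["Realitní kancelář", "Červená 5, Plzeň"]]

# Hypothetical output path; a temp directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "ScrapedContent.csv")

# Explicit encoding: without it, open() uses the locale default
# (cp1252 on many Windows setups), which is where 'charmap' errors come from.
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(rows)

# Inspect the raw bytes: the file should start with the UTF-8 BOM.
with open(path, "rb") as f:
    raw = f.read()
print(raw[:3])

# Reading back with the same codec strips the BOM again.
with open(path, newline="", encoding="utf-8-sig") as f:
    read_back = list(csv.reader(f))
print(read_back == rows)
```

If the file is only consumed by other programs rather than a spreadsheet, plain `encoding="utf-8"` on both the writing and the reading side may be the simpler choice.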