Google Translate API - detect language + translate document (xlsx, csv)

I'm trying to use the Google Cloud Translation API to translate an Excel (or CSV) document that contains text in multiple languages. My target language is English.

I would like to use the "Translate text in batches (Advanced edition only)" code sample (https://cloud.google.com/translate/docs/samples/translate-v3-batch-translate-text), but the sample has a line that defines the source language, so there can be only one source language.

But I need to detect the language in the document first and then translate the text to English. There is a code sample for detecting the language of a simple text string, "Detecting languages (Advanced)" (https://cloud.google.com/translate/docs/advanced/detecting-language-v3), but I need to combine the first sample, which translates documents but has a single source language defined, with language detection in place of the fixed source language.

Is there a code sample like this in the resources? How could this be solved?

Here is the sample code in question:

from google.cloud import translate


def batch_translate_text(
    input_uri="gs://YOUR_BUCKET_ID/path/to/your/file.txt",
    output_uri="gs://YOUR_BUCKET_ID/path/to/save/results/",
    project_id="YOUR_PROJECT_ID",
    timeout=180,
):
    """Translates a batch of texts on GCS and stores the result in a GCS location."""

    client = translate.TranslationServiceClient()

    location = "us-central1"
    # Supported file types: https://cloud.google.com/translate/docs/supported-formats
    gcs_source = {"input_uri": input_uri}

    input_configs_element = {
        "gcs_source": gcs_source,
        "mime_type": "text/plain",  # Can be "text/plain" or "text/html".
    }
    gcs_destination = {"output_uri_prefix": output_uri}
    output_config = {"gcs_destination": gcs_destination}
    parent = f"projects/{project_id}/locations/{location}"

    # Supported language codes: https://cloud.google.com/translate/docs/language
    operation = client.batch_translate_text(
        request={
            "parent": parent,
            "source_language_code": "en",
            "target_language_codes": ["ja"],  # Up to 10 language codes here.
            "input_configs": [input_configs_element],
            "output_config": output_config,
        }
    )

    print("Waiting for operation to complete...")
    response = operation.result(timeout)

    print("Total Characters: {}".format(response.total_characters))
    print("Translated Characters: {}".format(response.translated_characters))

1 answer

  • answered 2021-07-28 03:28 Ricco D

    Unfortunately, it is not possible to pass an array of values to the source_language_code field when using batchTranslateText. What I would suggest is to perform detectLanguage and translateText per file.
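
    As a possible shortcut, translateText (unlike batchTranslateText) treats source_language_code as optional: when it is left out, the service attempts to detect the source language itself and reports it in detected_language_code on each translation. The sketch below is only a minimal illustration of that and is not part of the tested script; the helper name translate_auto_detect and its parameters are made up for this example, and the client is passed in so the function can be exercised without credentials.

```python
def translate_auto_detect(client, project_id, texts, target_language_code="en"):
    """Translate texts without naming a source language.

    `client` is expected to be a google.cloud.translate.TranslationServiceClient
    (or any object with a compatible translate_text method). The helper name
    and signature are hypothetical, chosen for this sketch.
    """
    parent = f"projects/{project_id}/locations/global"
    response = client.translate_text(
        request={
            "parent": parent,
            "contents": list(texts),
            "mime_type": "text/plain",
            # No "source_language_code": the service attempts detection itself.
            "target_language_code": target_language_code,
        }
    )
    # Pair each translation with the language code reported for it.
    return [
        (translation.detected_language_code, translation.translated_text)
        for translation in response.translations
    ]
```

    With the real client this could be called as translate_auto_detect(translate.TranslationServiceClient(), "your-project-id", ["kamusta", "cómo estás"]). Detection on very short strings may be unreliable, so a separate detect_language step is still useful when you want to inspect the detected code before translating.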

    The script below does the following:

    1. It extracts the content to be translated. For testing purposes, the CSV files used have only one column; the content of sample1.csv is in tl (Tagalog) and the content of sample2.csv is in es (Spanish).
    2. It passes the extracted content to detect_language() to get the detected language code.
    3. It passes all the required parameters to translate_text() to translate the content.

    NOTE: The code below has only been tested with CSV files that have one column. Edit main() to match the column you would like to extract.
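
    For files with more than one column, the extraction step could be narrowed to a single column. This is a rough sketch using only the standard csv module; read_column, its parameters and the zero-based column index are assumptions for illustration, not something the tested script contains.

```python
import csv


def read_column(lines, column=0, has_header=False):
    """Return the non-empty cells of one CSV column as a list.

    `lines` is an open text file or any iterable of CSV lines, and
    `column` is a zero-based index. read_column is a hypothetical helper.
    """
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row if there is one
    cells = []
    for row in reader:
        # Ignore short rows and blank cells instead of failing on them.
        if len(row) > column and row[column].strip():
            cells.append(row[column])
    return cells
```

    Called on a file opened with open("sample1.csv", newline="", encoding="utf-8"), read_column(f, column=2) would return the third column, which could then take the place of content_csv in main().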

    from google.cloud import translate
    import csv
    
    
    def listToString(s):
        """ Transform list to string"""
        str1 = " "
        return (str1.join(s))
    
    def detect_language(project_id,content):
        """Detecting the language of a text string."""
    
        client = translate.TranslationServiceClient()
        location = "global"
        parent = f"projects/{project_id}/locations/{location}"
    
        response = client.detect_language(
            content=content,
            parent=parent,
            mime_type="text/plain",  # mime types: text/plain, text/html
        )
    
        for language in response.languages:
            return language.language_code
    
    
    def translate_text(text, project_id,source_lang):
        """Translating Text."""
    
        client = translate.TranslationServiceClient()
        location = "global"
        parent = f"projects/{project_id}/locations/{location}"
    
        # Detail on supported types can be found here:
        # https://cloud.google.com/translate/docs/supported-formats
        response = client.translate_text(
            request={
                "parent": parent,
                "contents": [text],
                "mime_type": "text/plain",  # mime types: text/plain, text/html
                "source_language_code": source_lang,
                "target_language_code": "en-US",
            }
        )
    
        # Display the translation for each input text provided
        for translation in response.translations:
            print("Translated text: {}".format(translation.translated_text))
            
    def main():
    
        project_id="your-project-id"
        csv_files = ["sample1.csv","sample2.csv"]
        # Perform your content extraction here if you have a different file format
        for csv_path in csv_files:
            content_csv = []
            # The with block closes the file; UTF-8 encoding is assumed here.
            with open(csv_path, newline="", encoding="utf-8") as csv_file:
                for row in csv.reader(csv_file):
                    content_csv.extend(row)
            content = listToString(content_csv) # convert list to string
            detect = detect_language(project_id=project_id,content=content)
            translate_text(text=content,project_id=project_id,source_lang=detect)
    
    if __name__ == "__main__":
        main()
    

    sample1.csv:

    kamusta
    ayos
    

    sample2.csv:

    cómo estás
    okey
    

    Output using the code above:

    Translated text: how are you okay
    Translated text: how are you ok
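
    The output above relies on each file being in a single language, because all cells are joined into one string before detection. If one file mixes languages, as in the question, the same detect-then-translate steps could be run per row instead. A minimal sketch, assuming detect and translate are thin wrappers you write around detect_language() and translate_text() (with the latter changed to return the text rather than print it); translate_rows itself is a hypothetical helper:

```python
def translate_rows(rows, detect, translate, target="en"):
    """Detect and translate each row separately.

    `detect(text)` should return a language code and
    `translate(text, source_lang)` the translated string. Both are
    placeholders for your own wrappers, not library functions.
    """
    results = []
    for text in rows:
        if not text.strip():
            continue  # nothing to send for empty cells
        source_lang = detect(text)
        if source_lang == target:
            # Already in the target language: keep the text, skip the call.
            results.append((source_lang, text))
        else:
            results.append((source_lang, translate(text, source_lang)))
    return results
```

    This makes one detection and one translation request per row, so it is likely to be slower than the per-file approach on large sheets.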
    
