How to scrape a <script type="text/javascript"> tag in PHP?

My question is how I can scrape this tag:
<script type="text/javascript">
var BCData = {"csrf_token":"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8","product_attributes":{"purchasable":true,"purchasing_message":null,"sku":"STICKER_PACK","upc":null,"stock":null,"instock":true,"stock_message":null,"weight":null,"base":false,"image":null,"price":{"without_tax":{"formatted":"$3.99","value":3.99,"currency":"USD"},"tax_label":"Tax"},"out_of_stock_behavior":"label_option","out_of_stock_message":"Out of stock","available_modifier_values":[],"available_variant_values":[7375],"in_stock_attributes":[7375],"selected_attributes":[]}};
</script>
What I want to extract is the value of csrf_token, that is, 686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8.

I already tried the code below, but did not get the result I expected:

$ch = curl_init();
curl_setopt($ch,CURLOPT_URL, '$url');
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36');
curl_setopt($ch,CURLOPT_HTTPHEADER,array("accept-language: es-419,es;q=0.9"));
curl_setopt($ch,CURLOPT_TIMEOUT, 10);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
preg_match_all('(<script type="text/javascript">
var BCData = {"csrf_token":\"(.*)\","product_attributes":{"purchasable":true,"purchasing_message":null,"sku":"STICKER_PACK","upc":null,"stock":null,"instock":true,"stock_message":null,"weight":null,"base":false,"image":null,"price":{"without_tax":{"formatted":"$3.99","value":3.99,"currency":"USD"},"tax_label":"Tax"},"out_of_stock_behavior":"label_option","out_of_stock_message":"Out of stock","available_modifier_values":[],"available_variant_values":[7375],"in_stock_attributes":[7375],"selected_attributes":[]}};</script>)siU', $result, $matches1);
$titulo = $matches1[1][0];
echo $titulo;
I can't get the result.

5 answers

  • answered 2020-06-27 04:39 user1597430

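    The simplest expression to extract the CSRF token from the page (it matches any 64-character lowercase hex string, so it assumes the token is the first such string in the source):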

    # matches all occurrences of the format of the CSRF token
    if (preg_match_all('/[a-f0-9]{64}/', $string, $matches))
    {
        # should equal the value of the transmitted CSRF
        print_r($matches[0][0]);
    }
    

  • answered 2020-06-27 04:42 Keral Patel

    Use json_decode() on the JSON part of the script, then loop through the returned array.
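
    A minimal sketch of this suggestion, under a few assumptions: json_decode() only accepts the JSON text itself, so the object has to be cut out of the script body first. The variable $scriptBody is hypothetical and holds a shortened copy of the data from the question; the slicing assumes the object runs from the first "{" to the last "}".

```php
<?php
// Hypothetical input: the body of the <script> tag from the question (shortened).
$scriptBody = 'var BCData = {"csrf_token":"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8","product_attributes":{"sku":"STICKER_PACK","instock":true}};';

// Locate the first "{" and the last "}" so that only the JSON object is decoded.
$start = strpos($scriptBody, '{');
$end   = strrpos($scriptBody, '}');
$data  = null;

if ($start !== false && $end !== false && $end > $start) {
    // true => decode into an associative array; the result is null on invalid JSON.
    $data = json_decode(substr($scriptBody, $start, $end - $start + 1), true);
}

if (is_array($data)) {
    // Direct access by key; no loop is needed when the key is known.
    echo $data['csrf_token'], PHP_EOL;

    // Looping over the top-level keys, as the answer suggests.
    foreach ($data as $key => $value) {
        echo $key, ' => ', is_array($value) ? 'array' : $value, PHP_EOL;
    }
}
```

    For a single known key such as csrf_token, the direct lookup is enough; the loop is only useful when the keys are not known in advance.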

  • answered 2020-06-27 09:16 Prof83

    This matches every instance of the "csrf_token":"..." portion of the JSON and captures the token value in a named group:

    
    // Match all occurrences
    if (preg_match_all('/\"csrf_token\"\s?\:\s?\"(?<csrf>[a-f0-9]{64})\"/', $string, $matches)) {
    
        // One or more token matches extracted from the JSON
        print_r($matches['csrf']);
    
    }
    

  • answered 2020-06-27 09:19 apokryfos

    You can probably grab the value assigned to the variable BCData and then decode it as JSON:

    // Capture the object assigned to BCData (lazy match up to the first "};" on the same line)
    preg_match_all('/var\s+BCData\s*=\s*({.*?});/', $result, $matches);
    if (!empty($matches[1][0])) {
       $data = json_decode($matches[1][0], true);
       echo $data['csrf_token'];
    }
    

    This assumes that the value assigned inside the script tag is valid JSON, which seems to be true now but may not stay true forever.

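
    One way to guard against that assumption failing, sketched here rather than taken from the answer: decode with JSON_THROW_ON_ERROR (available since PHP 7.3) and treat a decode failure as "no token". The function name csrfFromBcData and the two sample strings are hypothetical.

```php
<?php
// Hypothetical helper: returns the token, or null when BCData is missing or is not valid JSON.
function csrfFromBcData(string $source): ?string
{
    if (!preg_match('/var\s+BCData\s*=\s*(\{.*?\});/', $source, $m)) {
        return null; // no BCData assignment found
    }

    try {
        // JSON_THROW_ON_ERROR turns decode failures into a JsonException.
        $data = json_decode($m[1], true, 512, JSON_THROW_ON_ERROR);
    } catch (JsonException $e) {
        return null; // e.g. a JavaScript object literal with unquoted keys
    }

    return is_array($data) && isset($data['csrf_token'])
        ? (string) $data['csrf_token']
        : null;
}

// Sample inputs: one valid JSON value, one plain JavaScript object literal.
$valid   = 'var BCData = {"csrf_token":"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8","product_attributes":{"sku":"STICKER_PACK"}};';
$invalid = 'var BCData = {csrf_token:"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8"};';

var_dump(csrfFromBcData($valid));
var_dump(csrfFromBcData($invalid));
```

    Returning null in both failure cases keeps the caller simple; whether a missing token should instead raise an error depends on how the scraper is used.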

  • answered 2020-06-27 10:19 mickmackusa

    For reliability, the whole HTML document should be parsed by a DOM parser to isolate the <script> node.

    Then use regex to carve out the JSON string. The m modifier makes ^ match the start of a line and $ match the end of a line. \K resets the start of the full match, so no capture group is needed.

    Then, again for reliability, parse the JSON string and access the desired value by key.

    Code:

    $html = <<<HTML
    <script type="text/javascript">
    var BCData = {"csrf_token":"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8","product_attributes":{"purchasable":true,"purchasing_message":null,"sku":"STICKER_PACK","upc":null,"stock":null,"instock":true,"stock_message":null,"weight":null,"base":false,"image":null,"price":{"without_tax":{"formatted":"$3.99","value":3.99,"currency":"USD"},"tax_label":"Tax"},"out_of_stock_behavior":"label_option","out_of_stock_message":"Out of stock","available_modifier_values":[],"available_variant_values":[7375],"in_stock_attributes":[7375],"selected_attributes":[]}};
    </script>
    HTML;
    
    echo preg_match(
             '~^var BCData = \K.*(?=;$)~m',
             $html,
             $match
         )
         ? json_decode($match[0])->csrf_token
         : 'pattern found no match';
    

    Output:

    686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8
    

    Admittedly, I don't know how the input string may vary, so I can only build a pattern for the string provided.
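
    The snippet above applies the regex directly to the sample string and does not show the DOM-parsing step. A rough sketch of all three steps together might look like the following. The function name extractCsrfToken and the shortened sample page are hypothetical, and the pattern is a slightly loosened variant of the one above that tolerates whitespace around the line.

```php
<?php
// Hypothetical page source; in practice this would be the cURL response body.
$html = <<<'HTML'
<html><head>
<script type="text/javascript">
var BCData = {"csrf_token":"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8","product_attributes":{"sku":"STICKER_PACK","instock":true}};
</script>
</head><body><p>Product page</p></body></html>
HTML;

function extractCsrfToken(string $html): ?string
{
    $dom = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 1: isolate each <script> node.
    foreach ($dom->getElementsByTagName('script') as $script) {
        // Step 2: carve out the value assigned to BCData; skip scripts without it.
        if (!preg_match('~^\s*var BCData = \K.*(?=;\s*$)~m', $script->textContent, $match)) {
            continue;
        }

        // Step 3: decode the JSON; json_decode() returns null when the text is not valid JSON.
        $data = json_decode($match[0]);

        return $data->csrf_token ?? null;
    }

    return null;
}

echo extractCsrfToken($html) ?? 'token not found', PHP_EOL;
```

    Running the regex only on the text of each <script> node means stray matches elsewhere in the page, such as in comments or visible text, are not picked up.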