How can I extract the URL paths from an HTML file in bash?

I have the file urls-list.html with multiple URL paths inside, in this format:

   <body contenteditable="true">
      <h1>File: <a href="https://test.com/Config.js" target="_blank" rel="nofollow noopener noreferrer">https://test.com/Config.js</a></h1>
      <div>
         <a href='/common/assets/locale/language_en.props' class='text'>/common/assets/locale/language_en.props</a>
         <div class='container'>                        urls: [e.get("app.content.domain") + "<span style='background-color:yellow'>/common/assets/locale/language_en.props</span>"]</div>
      </div>
      <div>
         <a href='/common/assets/locale/language_en1.props' class='text'>/common/assets/locale/language_en1.props</a>
         <div class='container'>                            remote: a + n + brandSuffix + "<span style='background-color:yellow'>>/common/assets/locale/language_en1.props</span>",</div>
      </div>
      <div>
         <a href='/common/assets/locale/language_en2.props' class='text'>/common/assets/locale/language_en2.props</a>
         <div class='container'>                            remote: a + n + "<span style='background-color:yellow'>>/common/assets/locale/language_en2.props</span>",</div>
      </div>
      <div>
         <a href='/common/assets/locale/language_en2.props' class='text'>/common/assets/locale/language_en2.props</a>
         <div class='container'>                            remote: a + n + "<span style='background-color:yellow'>>/common/assets/locale/language_en3.props</span>",</div>
      </div>
      <div>
         <a href='/common/assets/locale/language_en3.props' class='text'>/common/assets/locale/language_en3.props</a>
         <div class='container'>                            remote: a + n + "<span style='background-color:yellow'>>/common/assets/locale/language_en4.props</span>",</div>
      </div>
      <div>
     <a href='/main' class='text'>/main</a>
     <div class='container'>                    versionedAssets.isEnabled() &amp;&amp; (i = versionedAssets.getJSAsset("dashboard/boot"), r = versionedAssets.getJSAsset("dashboard<span style='background-color:yellow'>/main</span>"), l = versionedAssets.getJSAsset("appkit-utilities<span style='background-color:yellow'>/main</span>"), hybrid &amp;&amp; (i = versionedAssets.getHybridAsset("dashboard/boot"), r = versionedAssets.getHybridAsset("dashboard<span style='background-color:yellow'>/main</span>"))), envProps.get("app.blueJSVersion.enabled") ? (n.push([envProps.get("app.blueVendor.version") + "<span style='background-color:yellow'>/main</span>", envProps.get("app.blue.version") + "<span style='background-color:yellow'>/main</span>", envProps.get("app.blueApp.version") + "<span style='background-color:yellow'>/main</span>", envProps.get("app.blueView.version") + "<span style='background-color:yellow'>/main</span>", "blue-ui/dist/blue-ui/js<span style='background-color:yellow'>/main</span>", l, i, r]), n.push([{</div>
  </div>

I would like help extracting all URL paths shown inside the span tags in the file urls-list.html.

To be clearer, I need this output:

    Command: ./extra-path.sh urls-list.html (or similar)
    result:
    /common/assets/locale/language_en.props
    /common/assets/locale/language_en1.props
    /common/assets/locale/language_en2.props
    /main

Can anyone help me with this?

UPDATE: I only need the URL paths highlighted in yellow (background-color:yellow).

3 answers

  • answered 2018-05-16 06:08 PPL

    Try the code below:

    var href = window.location.href;
    var dir = href.substring(0, href.lastIndexOf('/')) + "/";
    

  • answered 2018-05-16 06:10 RavinderSingh13

    The following may help you.

    cat script.ksh
    awk '/span/ && match($0,/<span style=\047background-color:yellow\047>>[^<]*/){print substr($0,RSTART+39,RLENGTH-39)}'  "$1"
    

    Here is the same solution in multi-line form:

    cat script.ksh
    awk '
    /span/ && match($0,/<span style=\047background-color:yellow\047>>[^<]*/){
      print substr($0,RSTART+39,RLENGTH-39)
    }'   "$1"
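
    Note that the pattern above expects ">>" right after the span tag and prints only the first span found on each line, while the sample in the question also has spans with a single ">" and several spans on one line. A possible variation, offered as a sketch rather than a drop-in replacement: it loops over every yellow span, strips one or more leading ">", and prints each path once. The function name extract_yellow_paths and the de-duplication step are assumptions, not part of the original answer.

```shell
#!/usr/bin/env bash
# extract_yellow_paths FILE (hypothetical helper name)
# Print each distinct path found inside a yellow-highlighted <span>,
# in order of first appearance.
extract_yellow_paths() {
  awk '
  {
    line = $0
    # Walk through every yellow span on the line, not only the first.
    while (match(line, /<span style=\047background-color:yellow\047>[^<]*/)) {
      path = substr(line, RSTART, RLENGTH)
      sub(/^<span[^>]*>+/, "", path)   # drop the tag and any stray ">"
      if (path != "" && !seen[path]++) print path
      line = substr(line, RSTART + RLENGTH)   # continue after this span
    }
  }' "$1"
}

# When run as a script with a file argument, process that file.
[ "$#" -eq 0 ] || extract_yellow_paths "$1"
```

    Saved as extra-path.sh, this could be run as ./extra-path.sh urls-list.html. On the sample in the question it should also list language_en3.props and language_en4.props, because those paths appear inside yellow spans too.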
    

  • answered 2018-05-16 06:26 Paza

    You can do this in bash using awk:

    awk -F'[ =]' '/href/ {print $3}' urls-list.html
    

    Explanation:
    -F'[ =]' tells awk to split each line into fields at every space or '='
    /href/ makes the print command run on every line that contains "href"
    print $3 prints the third field
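
    To see what these delimiters do to a line, a small sketch can print every field with its number. show_fields is a made-up helper name used only for illustration, and the sample line is taken from the question without its leading spaces.

```shell
# show_fields LINE (hypothetical helper)
# Print every field awk sees when splitting on a single space or "=".
show_fields() {
  printf '%s\n' "$1" |
    awk -F'[ =]' '{ for (i = 1; i <= NF; i++) printf "%d:%s\n", i, $i }'
}

show_fields "<a href='/main' class='text'>/main</a>"
# 1:<a
# 2:href
# 3:'/main'
# 4:class
# 5:'text'>/main</a>
```

    Field 3 still carries the surrounding quotes. If the line starts with spaces, each leading space produces an empty field, so the href value is no longer in field 3.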

    However, this works only when each line starts directly with the <a href=...> tag, with no leading spaces, because every space counts as a delimiter. Something more robust would be:

    awk -F'href=' '/href/ {print $2}' urls-list.html | awk -F'[ <>]' '{print $1}'
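
    The command above prints the href values. The same field-separator idea can be pointed at the yellow spans from the question's update instead. This is a rough sketch, assuming every span carries exactly the style attribute shown in the question; the function name extract_by_separator and the de-duplication are made up for illustration.

```shell
# extract_by_separator FILE (hypothetical helper)
# Use the yellow <span> opening tag itself as the awk field separator.
extract_by_separator() {
  awk -F"<span style='background-color:yellow'>" '
  {
    # Field 1 is the text before the first yellow span; every later
    # field starts with the text of one span.
    for (i = 2; i <= NF; i++) {
      path = $i
      sub(/<.*/, "", path)   # cut everything from the closing tag on
      sub(/^>+/, "", path)   # drop any stray leading ">"
      if (path != "" && !seen[path]++) print path
    }
  }' "$1"
}
```

    It would be called as extract_by_separator urls-list.html. Lines without a yellow span have a single field, so the loop skips them.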