C, C++ Program to extract URLs (hyperlinks) from a web page of a website

This program connects to the internet and downloads the html source of a web page. After that it extracts links (urls) from the downloaded string, and displays them on the console. The program can be easily altered to extract urls from a string or a text file also.


Parveen,


The Plan

  1. Obtain the html source of the web page as a string, as explained in the article on downloading a file from a URL.
  2. Once the string is available, run a regular expression for internet URLs over it, matching both http: and https: sites.
  3. Iterate over the regex matches in a loop, printing each matching link on the console.

Pre-requisites

For a better understanding, please refer to the article(s):

The Source Code

// ONLY FOR WINDOWS
// NEEDS A WINDOWS SET UP
// COMPILE USING Visual Studio Express,
// Visual C++ Express, any edition,
// any version
#pragma comment(lib, "urlmon.lib")

#define getURL URLOpenBlockingStreamA

#include <urlmon.h>
#include <iostream>
#include <regex>

using namespace std;

typedef string::iterator iter;
typedef regex_iterator<iter> regIter;

// C++ program to get a list of URLs
// from a site
int main()
{
  // the first part obtains the html
  // source of the web page through the
  // Windows IStream interface
  IStream* stream;

  const char* URL = "http://google.com";

  // make a call to the URL;
  // a non-zero return means some error
  if (getURL(0, URL, &stream, 0, 0))
  {
    cout << "Error occurred." << endl;
    cout << "Check the internet connection." << endl;
    cout << "Check the URL. Is it correct?" << endl;
    return -1;
  }

  // this char array is cyclically
  // filled with bytes from the URL
  char buff[100];

  // we keep appending the bytes
  // to this string
  string s;

  unsigned long bytesRead;

  while (true)
  {
    // reads up to 100 bytes from the
    // stream object into the char
    // array and stores the count
    // actually read in "bytesRead"
    stream->Read(buff, 100, &bytesRead);

    if (0U == bytesRead)
    {
      break;
    }

    // collect the bytes into the string
    s.append(buff, bytesRead);
  }

  // release the interface;
  // good programming practice
  stream->Release();

  cout << s << endl;

  cout << "URL EXTRACTION ..." << endl;

  // the second part extracts the links
  // using regular expressions;
  // this is a regular expression for
  // matching an internet address;
  // the pattern is split across four
  // adjacent string literals only to
  // keep the lines short; you can
  // type it as one single line
  regIter::regex_type rx(
    "https?:\\/\\/(www\\.)?"
    "[-a-zA-Z0-9@:%._\\+~#=]{1,256}"
    "\\.[a-zA-Z0-9()]{1,6}\\b([-a-z"
    "A-Z0-9()@:%_\\+.~#?&//=]*)");

  // STL iterator for looping through
  // all the links found in the string
  regIter n(s.begin(), s.end(), rx), e;

  // link counter
  int lnkCounter = 0;

  while (n != e)
  {
    cout << "Link# " << ++lnkCounter << endl;
    cout << "====\r\n";
    cout << n->str() << "\r\n" << endl;
    ++n;
  }

  return 0;
}

// program kept small for clarity;
// may not be efficient

This Blog Post/Article "C, C++ Program to extract URLs(hyperlinks) from a web page of a website" by Parveen is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.