The program connects to the internet and downloads the HTML source of a web page. It then extracts the links (URLs) from the downloaded string and displays them on the console.
The Plan
- Obtain the web page's HTML source as a string, as explained in the article "download file from a URL".
- Once the string is available, we use a regular expression for internet URLs that matches both http: and https: sites.
- The regex match is run in a loop to print each matching link on the console (see the portable sketch after this list).
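If you only want to see the extraction step, here is a minimal, portable sketch of the second and third steps. It applies the same URL pattern with std::sregex_iterator to a hardcoded HTML snippet (the snippet is made up for illustration), so unlike the full program below it compiles on any platform:

// portable sketch: extract URLs from a
// hardcoded string instead of a download
#include <iostream>
#include <regex>
#include <string>

int main()
{
    // stand-in for the downloaded page source
    std::string html =
        "<a href=\"https://example.com/page\">one</a>"
        "<a href=\"http://www.example.org\">two</a>";

    // same URL pattern as in the full program
    std::regex rx(
        "https?:\\/\\/(www\\.)?"
        "[-a-zA-Z0-9@:%._\\+~#=]{1,256}"
        "\\.[a-zA-Z0-9()]{1,6}\\b"
        "([-a-zA-Z0-9()@:%_\\+.~#?&//=]*)");

    // visit every match in the string and print it
    for (std::sregex_iterator it(html.cbegin(), html.cend(), rx), end;
         it != end; ++it)
    {
        std::cout << it->str() << '\n';
    }
    return 0;
}

Compiled and run, it prints https://example.com/page and http://www.example.org, one per line.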
Prerequisites
For a better understanding, please refer to the article(s):
- download file from a URL
- The concept of regular expressions using the C++ Standard Template Library.
The Source Code
// ONLY FOR WINDOWS
// NEEDS WINDOWS SET UP
// COMPILE USING Visual Studio Express,
// Visual C++ Express any edition, any version
#pragma comment(lib, "urlmon.lib")
#define getURL URLOpenBlockingStreamA

#include <urlmon.h>
#include <iostream>
#include <regex>
#include <string>

using namespace std;

typedef string::iterator iter;
typedef regex_iterator<iter> regIter;

// C++ program to get a list of URLs from a site
int main()
{
    // the first part is about obtaining
    // the html source of the web page

    // Windows IStream interface
    IStream* stream;
    const char* URL = "http://google.com";

    // make a call to the URL;
    // a non-zero return means some error
    if (getURL(0, URL, &stream, 0, 0))
    {
        cout << "Error occurred." << endl;
        cout << "Check the internet connection." << endl;
        cout << "Check the URL. Is it correct?" << endl;
        return -1;
    }

    // this char array will be cyclically
    // filled with bytes from the URL
    char buff[100];

    // we shall keep appending the bytes
    // to this string
    string s;
    unsigned long bytesRead;

    while (true)
    {
        // read up to 100 bytes from the stream
        // object into the char array; the actual
        // number of bytes read lands in "bytesRead"
        stream->Read(buff, 100, &bytesRead);

        if (0U == bytesRead)
        {
            break;
        }

        // append the bytes to the string
        s.append(buff, bytesRead);
    }

    // release the interface
    // (good programming practice)
    stream->Release();

    cout << s << endl;
    cout << "URL EXTRACTION ..." << endl;

    // the second part is about extracting
    // the links using regular expressions

    // regular expression for matching an internet
    // address; the pattern is split across adjacent
    // string literals (which the compiler joins into
    // one) only to keep the lines short
    regIter::regex_type rx(
        "https?:\\/\\/(www\\.)?"
        "[-a-zA-Z0-9@:%._\\+~#=]{1,256}"
        "\\.[a-zA-Z0-9()]{1,6}\\b"
        "([-a-zA-Z0-9()@:%_\\+.~#?&//=]*)");

    // STL iterator for looping through
    // all the links found in the string
    regIter n(s.begin(), s.end(), rx), e;

    // link counter
    int lnkCounter = 0;

    while (n != e)
    {
        cout << "Link# " << ++lnkCounter << "\r\n";
        cout << "====\r\n";
        cout << n->str() << "\r\n" << endl;
        ++n;
    }

    return 0;
}
// program kept small for clarity; may not be efficient
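To build the program with the Visual C++ command-line tools, open a Developer Command Prompt and compile with /EHsc (the file name extract_urls.cpp is just an example; use whatever name you saved the source under). The #pragma comment(lib, "urlmon.lib") line makes the linker pick up urlmon.lib automatically:

cl /EHsc extract_urls.cpp
extract_urls.exe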
This Blog Post/Article "C, C++ Program to extract URLs (hyperlinks) from a web page of a website" by Parveen is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.