Beginning CGI Programming with Perl

HTML, HTTP, and Your CGI Program

HTML, HTTP, and your CGI program have to work closely together to make your online Internet application work. The HTML code defines the way the user sees your program interface, and it is responsible for collecting user input. This frequently is referred to as the Human Computer Interface code; it is the window through which your program and the user interact. HTTP is the transport mechanism for sending data between your CGI program and the user. This is the behind-the-scenes director that translates and sends information between your Web client and your CGI program. Your CGI program is responsible for understanding both the HTTP directions and the user requests. The CGI program takes the requests from the user and sends back valid and useful responses to the Web client who is clicking away on your HTML Web page.

The Role of HTML

HTML is designed primarily for formatting text. It is basically a typesetting language that specifies the shape of the text, the color, where to put it, and how large to make it. It’s not much different from most other typesetting languages, except that it doesn’t have as many typesetting options as most simple What You See Is What You Get (WYSIWYG) editors, such as Microsoft Word. So how does it get involved with your CGI program? The primary method is through the HTML Form tags. Your CGI program does not have to be called through an HTML form, however; it can be invoked through a simple hypertext link using the anchor (<a>) tag, something like this:

<a href="A CGI program"> Some text </a>

The CGI program in this hypertext reference or link is called (or activated) in a manner similar to that used when being called from an HTML form.

You even can use a hypertext link to pass extra data to your CGI program. All you have to do is add more information after the CGI program name. This information usually is referred to as extra path information, but it can be any type of data that might help identify to your CGI program what it needs to do.

The extra path information is provided to your CGI program in an environment variable called PATH_INFO, and it is any data after the CGI program name and before the first question mark (?) in the href string. If you include a question mark (?) after the CGI program name and then include more data after the question mark, that data goes in an environment variable called QUERY_STRING. Both PATH_INFO and QUERY_STRING are covered in Tutorial 6.

So to put this all into an example, suppose that you create a link to your CGI program that looks like this:

<a href="http://www.practical-inet.com/cgibook/chap1/program.cgi/extra-path-info?test=test-number-1">A CGI Program</a>

Then when you select the link A CGI Program, the CGI program named program.cgi is activated. The environment variable PATH_INFO is set to /extra-path-info (the server keeps the leading slash) and the QUERY_STRING environment variable is set to test=test-number-1.
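As a minimal sketch of the receiving side, the following Perl program reads the two environment variables just described and echoes them back as plain text. The helper name request_extras is made up for this illustration and is not part of any library. Servers normally hand over the extra path with its leading slash.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the extra path information and the query
# string out of an environment hash, defaulting to empty strings when
# the server did not set them.
sub request_extras {
    my ($env) = @_;
    my $path_info    = defined $env->{PATH_INFO}    ? $env->{PATH_INFO}    : '';
    my $query_string = defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '';
    return ($path_info, $query_string);
}

my ($path_info, $query_string) = request_extras(\%ENV);

# A CGI response starts with a header block, followed by a blank line.
print "Content-Type: text/plain\n\n";
print "PATH_INFO:    $path_info\n";
print "QUERY_STRING: $query_string\n";
```

Passing the environment in as a hash reference, rather than reading %ENV inside the helper, makes it easy to try the helper from the command line without a Web server.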

Usually, this is not considered a good way to send data to your CGI program. First, it’s harder for the programmer to modify data that is hard-coded in an HTML file because it cannot be done on-the-fly. Second, the data is easy to tamper with for a Web page visitor who is a hacker. Your Web page visitor can download the Web page onto his own computer and then modify the data your program is expecting. Then he can use the modified file to call your CGI program. Neither of these scenarios seems very pleasant. Many other people felt the same way, so this is where the HTML form comes in. Don’t completely ignore this method of sending data to your program, however. There are valid reasons for using the extra-path-info variables. The imagemap program, for example, uses extra-path-info as an input parameter that describes the location of mapfiles. Imagemaps are covered in Tutorial 9, “Using Imagemaps on Your Web Page.”

The HTML form is responsible for sending dynamic data to your CGI program. The basics outlined here are still the same. Data is passed to the server for use by your CGI program, but the way you build your HTML form defines how that data is sent, and your browser does most of the data formatting for you.

The most important feature of the HTML form is the capability of the data to change based on user input. This is what makes the HTML Form tag so powerful. Your Web page client can send you letters, fill out registration forms, use clickable buttons and pull-down menus to select merchandise, or fill out a survey. With a clear understanding of the HTML Form tag, you can build highly interactive Web pages. Because this topic is so important, it is covered in Tutorials 4 and 5, and the hidden field of the HTML form is explained in Tutorial 7 “Building an Online Catalog.”

So, to sum up, HTML and, in particular, the HTML Form tag, are responsible for gathering data and sending it to your CGI program.

The HTTP Headers

If HTML is responsible for gathering data to send to your CGI program, how does it get there? The data gathered by the browser gets to your CGI program through the magic of the HTTP request header. The HTML tags tell the browser what type of HTTP request header to use to talk to the server and, through it, your CGI program. The basic HTTP request methods for beginning communication with your CGI program are Get and Post.

If the HTML tag calling your program is a hypertext link, the default HTTP request method Get is used to communicate with your CGI program, as in this example:

<a href="http://www.domain.com/program.cgi">Call a CGI program</a>

If, instead of using a hypertext link to your program, you use the HTML Form tag, the Method attribute of the Form tag defines what type of HTTP request header is used to communicate with your CGI program. If the Method field is missing or is set to Get, the HTTP method request header type is Get. If the Method attribute is set to Post, a Post method request header is used to communicate with your CGI program. (The Get and Post methods are covered in Tutorials 4 and 5.)

After the method of sending the data is determined, the data is formatted and sent using one of two methods. If the Get method is used, the data is sent via the Uniform Resource Identifier (URI) field. (URI is covered in Tutorial 2.) If the Post method is used, the data is sent as a separate message, after all the other HTTP request headers have been sent.
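A CGI program usually has to look in a different place for its data depending on which method was used. The sketch below assumes the common CGI environment variables REQUEST_METHOD, QUERY_STRING, and CONTENT_LENGTH. The helper name read_request_data is made up for this illustration. It returns the data exactly as it arrived, without decoding it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: fetch the raw, still-encoded request data.
# $env is a hash reference (normally \%ENV) and $in is a filehandle
# (normally \*STDIN).
sub read_request_data {
    my ($env, $in) = @_;
    my $method = uc($env->{REQUEST_METHOD} || 'GET');

    if ($method eq 'POST') {
        # Post: the data follows the request headers, and the server
        # reports how many bytes to expect in CONTENT_LENGTH.
        my $length = $env->{CONTENT_LENGTH} || 0;
        my $data   = '';
        read($in, $data, $length) if $length > 0;
        return $data;
    }

    # Get: the data arrived as part of the URI, after the question mark.
    return defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '';
}

my $raw = read_request_data(\%ENV, \*STDIN);
print "Content-Type: text/plain\n\n";
print "Raw request data: $raw\n";
```

This is only a sketch: a production program would also check that CONTENT_LENGTH is a sensible number and would decode the data before using it.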

After the browser determines how it is going to send the data, it creates an HTTP request header identifying where on the server your CGI program is located. The browser sends this HTTP request header to the server. The server receives the HTTP request header and calls your CGI program. Several other request headers can go along with the main request header to give the server and your CGI program useful information about the browser and this connection.

Your CGI program now performs some useful function and then tells the server what type of response it wants to send back to the browser.

So where are we so far? The data has been gathered by the browser using the format defined by the HTML tags. The data/URI request has been sent to the server using HTTP request headers. The server used the HTTP request headers to find your CGI program and call it. Now your CGI program has done its thing and is ready to respond to the browser. What happens next? The server and your CGI program collaborate to send HTTP response headers back to the browser.

What about the data (the Web page) your CGI program generated? Well, that’s why the HTTP response headers are used. They describe to the browser what type of data is being returned.

Your CGI program can generate all the HTTP response headers required for sending data back to the client/browser by calling itself a non-parsed header CGI program. If your CGI program is an NPH-CGI program, the server does not parse or look at the HTTP response headers generated by your CGI program; they are sent directly to the requesting browser, along with data/HTML generated by your CGI program.
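To make the difference concrete, here is a rough sketch of what an NPH-CGI program has to produce on its own: a status line, the response headers, a blank line, and then the data. The helper name nph_response is made up for this illustration, and the fallback to HTTP/1.0 when SERVER_PROTOCOL is not set is an assumption. On many servers, a program is treated as non-parsed header when its file name begins with nph-, but check your server's documentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a complete HTTP response, because the
# server will not add anything to the output of an NPH program.
sub nph_response {
    my ($body) = @_;
    my $protocol = $ENV{SERVER_PROTOCOL} || 'HTTP/1.0';  # assumed fallback
    return join("\r\n",
        "$protocol 200 OK",                  # status line
        "Content-Type: text/html",           # type of the data that follows
        "Content-Length: " . length($body),  # size of the data in bytes
        "",                                  # blank line ends the headers
        $body);
}

$| = 1;  # turn off output buffering so the response goes out immediately
print nph_response("<html><body><h1>Hello from an NPH program</h1></body></html>\n");
```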

The more common method of returning HTTP response headers is for your CGI program to generate the minimum required HTTP response headers; usually, just a Content-Type HTTP response header is required. The server then parses, or looks for, the response header your CGI program generated and determines what additional HTTP response headers should be returned to the browser.

The Content-Type HTTP response header identifies to the browser the type of data that will be returned to the browser. The browser uses the Content-Type response header to determine the types of viewers to activate so that the client can view things like inline images, movies, and HTML text.
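A minimal sketch of this common, parsed-header case might look like the following. The program prints a single Content-Type header, a blank line to mark the end of the headers, and then the HTML itself. The helper name html_page and the page text are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the one required response header, the
# blank line that ends the headers, and a small HTML page.
sub html_page {
    my ($title, $message) = @_;
    return "Content-Type: text/html\n\n"
         . "<html><head><title>$title</title></head>\n"
         . "<body><p>$message</p></body></html>\n";
}

print html_page('Hello', 'This page was generated by a CGI program.');
```

Forgetting the blank line after the header is a common mistake; without it, the server cannot tell where the headers stop and the data begins.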

The server adds the additional HTTP response headers it knows are required, bundles up the set of the headers and data in a nice TCP/IP package, and then sends it to the browser. The browser receives the HTTP response headers and displays the returned data as described by the HTTP response headers to your customer, the human.

So now you have the whole picture (which you will learn about in detail throughout the book), made up of the HTML used to format the data and the HTTP request and response headers used to communicate between the browser and server what type of data is being sent back and forth. Among all this is your very cool CGI program, aware of what is going on around it and driving the real applications in which your Web client really is interested.

Your CGI Program

What about your CGI program? What is it and how does it fit into this scenario? Well, your CGI program can be anything you can imagine. That is what makes programming so much fun. Your CGI program must be aware of the HTTP request headers coming in and its responsibility to send HTTP response headers back out. Beyond that, your CGI program can do anything and work in any manner you choose.

For the purposes of these tutorials, I concentrate on CGI programs that work on UNIX platforms, and I use the Perl programming language. I focus on the UNIX platform because that is the platform of choice on the Net at this time. The most popular WWW servers are the NCSA httpd, CERN, Apache, and Netscape servers; all these Web servers sit most comfortably on UNIX operating systems. So, for the moment, most platforms on which CGI programs are developed are UNIX servers. It just makes sense to concentrate on the operating system on which most of the CGI applications are required to run.

But why Perl? Well, wouldn’t it be nice to work with a language that you didn’t have to compile? No messing with painful linker commands. No compilation steps at all. Just type it in and it’s ready to go. What about a language that is free? Easy to get a hold of and available on just about any machine on the Net? How about a language that works well with and even looks like C, arguably the most popular programming language in the world? And wouldn’t it be nice if that language worked well with the operating system, making each of your system calls easy to implement? And what about a programming language that works on almost any operating system? That way, if you change platforms from UNIX to Windows, NT, or Mac, your programs still run. Heck, why not just ask for a language that’s easy to learn and for which a ton of free technical help is available? Ask for it. You’ve got it! Did that sound like an advertisement? And no, I don’t have any vested interest in Perl.

Perl is rapidly becoming one of the most popular scripting languages anywhere because it really does satisfy most of the needs outlined here. It’s free, works on almost any platform, and runs as soon as you type it in. As long as you don’t have any bugs…

Perl is an excellent choice for all these reasons and more. The more is probably what makes the language so popular. If Perl could do all those wonderful things and turned out to be hard to work with, slow, and not secure, it probably would have lost the popularity war. But Perl is easy to work with, has built-in security features, and is relatively fast.

In fact, Perl was designed originally for working with text, generating reports, and manipulating files. It does all these things fairly well and fairly easily. Larry Wall and Randal L. Schwartz, the authors of Programming Perl, state that “The pattern matching and textual manipulation capabilities of Perl often outperform dedicated C programs.”

In addition, Perl has a lovely data structure called the associative array that you can use for database manipulation. The designers of Perl also thought of security when they built the language. It has built-in security features like data-flow tracing, which enables you to find out where data that is not secure originated. This capability often prevents nonsecure operations before they can occur.
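As a small illustration of the associative array (called a hash in current Perl documentation), the sketch below uses one as a tiny lookup table. The catalog items, the prices, and the helper name order_total are all invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical catalog: item name => price in dollars.
my %price = (
    'coffee mug' => 6.50,
    't-shirt'    => 12.00,
    'poster'     => 4.25,
);

# Hypothetical helper: add up the prices of the named items,
# skipping any item that is not in the catalog.
sub order_total {
    my ($prices, @items) = @_;
    my $total = 0;
    for my $item (@items) {
        next unless exists $prices->{$item};
        $total += $prices->{$item};
    }
    return $total;
}

# List the catalog in alphabetical order, then total a sample order.
for my $item (sort keys %price) {
    printf "%-12s \$%.2f\n", $item, $price{$item};
}
printf "Order total: \$%.2f\n", order_total(\%price, 'coffee mug', 'poster');
```

Looking up a value by name, as in $price{'poster'}, is what makes the associative array feel like a small database table.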

Most of these features are not covered in these tutorials. The tutorials do take the time to show you how to use Perl to develop CGI programs, however, which you will find helpful if you have never used Perl or are new to programming. After you get the basics from these tutorials, you should be able to understand other Perl CGI programs on the Net. As an added bonus, by learning Perl, you get an introduction to UNIX and C for free. These reasons were enough to make me want to learn Perl and are the reasons why you will use Perl throughout these tutorials.

At this point, you have a good overview of CGI programming and how the different pieces fit together. As you go through the book, you will see that most of the topics in these first two sections are covered again in more detail and with specific examples. The next steps now are for you to learn more about your server, how to install CGI programs, and what makes CGI programming so different from other programming paradigms.