Myths About CGI Scalability

TekMol · on Sept 3, 2022

I really like the PHP approach of "one url, one file" by letting Apache handle the routing. I love to build a whole application by just putting a bunch of php files into a directory.

And I hate that in a typical Python+Django application, I have to restart the application to see changes I make.

But apart from that, Python is the nicer language. So I would like to do more web dev in Python. Most Python devs get red faces when you tell them that the PHP approach has benefits and you would prefer to go that route though.

What is a good Apache virtualhost entry to accomplish the typical PHP approach but for Python? How do I tell Apache "My application is in /var/www/myapplication/, execute all python files in there"? Are there any arguments to not do this in production?

BiteCode_dev · on Sept 3, 2022

Indeed, but:

- when declaring routing in code, it's harder to shoot yourself in the foot. So many php sites expose what they should not, and .httaccess is rarely crafted correctly

- renaming files don't change your website

- coding locally doesn't require apache if your tooling don't need it for routing. No need for stuff like EasyPhp

- python dev servers automatically reload for you on changes anyway

- declaring routes and parsing them at one place to get url id and slug is less duplication

- declaring url yourself mean you can use the language typing system for their checking too (see fastapi)

- I rarely need an entire file for one route endpoint, making 10 files for 10 enpoints that could fit in one is overkill

- wsgi servers like gunicorn will hapilly reload the workers when it receives SIGHUP in prod. No need to restart the master process manually.

- restarting a process every time mean you can't benefit much from nice python features like lru_cache, process/thread pools, connection pools, etc

- python is slow to start, cgi takes a toll

- it's easier to control the number of max workers that run an any given time

Honestly, even when coding in PHP I will use phpfpm (which is one long running process like python) and frameworks like laravel (in which you declare your routing).

I understand the desire for simplicity, and the one-file/one-route declaration is easy to get into. But if I want simple, I do a static website with a bit of client side JS, not need for a backend.

When I do need a backend, I much prefer long running workers and explicit route declaration.

TekMol · on Sept 3, 2022

None of the points you mention are an argument against CGI for me. But it is mostly a matter of taste.

There is one point I would like to address though:

"python is slow to start"

Python is slow. Dead slow. But not to start:

    echo 'print(1)' > test.py; time for i in {1..100}; do python3 test.py > /dev/null; done

Gives me 1.6s

    echo '<?php echo(1);' > test.php; time for i in {1..100}; do php test.php > /dev/null; done

Gives me 1.9s

tyingq · on Sept 3, 2022

I'd say they are both slow. Try Perl, Tcl, Lua or some other interpreted language that starts fast, and you'll likely see something like 200-300ms for the equivalent commands to above.

cb321 · on Sept 3, 2022

As in so many measurement/benchmarking situations, "compared to what?" is indeed The Big Question.

For example, you can also compare to a C program not doing much, overhead-wise:

    /usr/bin/time bash -c 'for i in {1..100}; do /bin/true; done'

which gives me like 70 ms for all 100 or 0.7ms per each startup, with also some adjustment to start-up the bash driver shell itself.

Even that 700 microseconds is a high number, from glibc & dynamic linking mostly. If you

    echo 'int main(int ac,char**av){return 0;}' > /tmp/true.c
    musl-gcc -static -O2 /tmp/true.c -o /tmp/true

and repeat, you see 380 microseconds per each.

In benchmarking scenarios requiring more elaboration than makes sense in an HN comment, I see times as low as 140 microsec per run on that same machine/OS. In fact, as you push things to low overhead extremes, you can easily see the impact of how much environment variable data is in play (`env -i` vs. not).

It is true that Python start-up time is not that bad compared to a "mean diameter of the Internet in milliseconds" which is probably the relevant time scale for CGIs, but it's probably not so small as to be truly negligible, esp. once imports are happening.

Also, Python3 remains quite a bit slower (1.7X slower on that same machine) to start-up than Python2 even after more than a decade of promises to claw back start-up performance.

Meanwhile, on the other side of the comparison, starting up a Julia REPL seems to take quite a bit longer than the Python 16 ms..Like 340ms or 20X longer.

So, the overarching point of this overlong post is just a follow-up/supporting of @tyingq's "compared to what?" with a little more detail.

TekMol · on Sept 3, 2022

    echo 'print 1' > test.perl; time for i in {1..100}; do perl test.perl > /dev/null; done

Gives me 0.25s

Clearly the winner!

viraptor · on Sept 3, 2022

> "one url, one file" by letting Apache handle the routing

Source of many issues when someone forgets the "include auth.php" equivalent in that one specific file. (not theoretical, this happened so many times)

> I have to restart the application to see changes I make.

https://adamj.eu/tech/2021/12/16/introducing-django-browser-...

ThePadawan · on Sept 3, 2022

FYI: Your link about reloading is for browser-side reloading - it seems to me that parent talks about reloading the server on code changes, which as a sibling points out, should definitely already be happening using django's `runserver`.

I'd be interested to hear if parent's "typical" django app does something more involved that makes hot reloading impossible (like run in 3 levels of containers or the like).

hakre · on Sept 3, 2022

Well if you already execute an application server like Apache, you should handle auth&z in there. Would be stupid to let PHP deal with it. Or would you handle the TLS termination as well in the PHP script? Well just like that. Perhaps the mistake was to do it otherwise.

giantrobot · on Sept 3, 2022

The HTTP server/proxy is only going to handle TLS termination. It can also handle basic resource authentication in a pinch. But for all other auth tasks need to be done by the application. Just because a user can access a URL doesn't mean they have access to a particular row in a database or write access or something. The user data may not even live locally and be provided by an external provider which is also beyond the scope of the HTTP server.

I don't know why you would think it's the HTTP front end's job to do any of that. It's all clearly the domain of the back end application.

franga2000 · on Sept 3, 2022

It's not perfect yet, but Reloadium [0] enables hot reloading of Django views.

As for running python scripts like PHP, something like this should work:

    Options +ExecCGI    
    AddHandler cgi-script .py

Just add a Python shebang at the top of the file and it should execute. If I'm not mistaken, you'll have to print headers and then content, so you'll probably want a lib to make that easier (I don't know any, but surely they exist).

[0] https://reloadium.io/features/django

bastawhiz · on Sept 3, 2022

> Are there any arguments to not do this in production?

When the way the code is run is also the vehicle for returning output, you necessarily end up in a situation where you expose bad output to the user. This is an insanely common problem with PHP: warnings, errors, and stack traces are regularly returned in HTTP responses by default.

Consider PHP: the output of the code (echo, closing the PHP tag, etc) gets dumped to the HTTP response, unless you configure it otherwise. Error? To the HTTP response. This exposes internal state and may be a security issue (it often is) or reveals your source code.

By contrast, the only way to return data to a HTTP response in Python is to construct a response object and return it. All the built in methods for printing or generating output write to stdout/stderr or the built in python logging framework. You can get PHP-style errors, but only deliberately by having your framework catch exceptions and convert them to responses.

As much as this seems like a non-issue, it's actually really dangerous when files map to URLs, because you don't want all your files being URLs. Some files are supposed to be shared, or only loaded in certain circumstances. If you can access arbitrary chunks of code directly, and then read any error output from them (or even regular-but-unintended output), you could be leaking data. Without lots of explicit manual boilerplate to block this direct access, you're running a huge risk.

And this isn't just speculation. I've worked at a large PHP shop, and this was a frequent source of frustration and pain. Nobody is immune from this.

withinboredom · on Sept 3, 2022

Php’s default production configuration doesn’t output any errors. Perhaps you were running the default dev configuration in production?

bastawhiz · on Sept 3, 2022

I publish a WordPress plugin and have worked with PHP for twenty years, and given the number of sites I've investigated that output notices and errors, what you've said seems false. Even if the stock version that is officially distributed doesn't output this by default, a wild number of hosts do.

withinboredom · on Sept 4, 2022

Belief isn’t required: https://github.com/php/php-src/blob/master/php.ini-productio...

I suspect many people don’t RTFM or have any idea what they are doing with php (or js) because they are historically “entry-level” languages.

9dev · on Sept 3, 2022

I would argue this is mostly a thing of the past. Properly configuring your production environment is a thing for all languages, and stuff like that simply doesn’t happen with a modern framework, ie. Anything written after 2010.

djxfade · on Sept 3, 2022

Not many write PHP code like this anymore, except for perhaps some small one off projects or legacy.

The reason is that it gets incredible hard to maintain, and almost impossible to test.

hakre · on Sept 3, 2022

if impossible to test certainly incredibly done wrong, yes and therefore to maintain. but I naturally know what you mean but its not that it must be that way. and at least there is still the one file for the entry-point. but not every entrypoint needs to have equally much bootstrapping code brought up just to process a request.

fulafel · on Sept 3, 2022

Django dev server has reloading enabled out of the box, no?

https://docs.djangoproject.com/en/4.0/intro/tutorial01/ "The development server automatically reloads Python code for each request as needed"

petercooper · on Sept 3, 2022

I'm only beginning to get into Python, but maybe the solution to the "red faces" reaction is for someone to build something like what you've described and package and present it as a fantastic new way to go.

Remix (https://remix.run/) is doing something like that in the JavaScript space - some of its core ideas could make a 2014 JavaScript developer bristle, but it's a great approach now and developers are loving it.

tjpnz · on Sept 3, 2022

Perhaps give mod_python a look. Tried and tested but I wouldn't want to write an application like that in 2022 when I could use Flask or Fast API.

duskwuff · on Sept 3, 2022

mod_python is dead. Last release was in 2013, and according to the web site, "there is no plan for additional releases".

irq-1 · on Sept 3, 2022

No new development but it still runs. If you can get by with the older version of python that's embedded, it's a good start for someone who is learning.

duskwuff · on Sept 3, 2022

> If you can get by with the older version of python that's embedded, it's a good start for someone who is learning.

I'd have to disagree with that. Teaching a new developer how to use a tool that's already on its last legs is not setting them up for success.

habibur · on Sept 3, 2022

PHP is now a days ran as fcgi, just like how Python, Ruby or Go runs. The Apache mod days have been left in the past.

That you can't render a python script.py like php's script.php more likely has to do with language construction difference between the two.

tyingq · on Sept 3, 2022

>> That you can't render a python script.py like php's script.php more likely has to do with language construction difference between the two.

You can, actually, and how close it is to the way php works is all to do with the fastcgi handler, and little to do with the language. Wsgi has configuration like WSGIScriptAlias, WSGIScriptAliasMatch, and WSGIScriptReloading that would make it almost exactly the same as default php behavior.

I suppose it may not run efficiently that way, but it would run. And you don't have the built-in html templating with php, where any html file is also a valid php file.

habibur · on Sept 4, 2022

> I suppose it may not run efficiently that way, but it would run. And you don't have the built-in html templating with php, where any html file is also a valid php file.

Indeed.

On an abstract level anything can be done. Efficiency is the question.

rollcat · on Sept 3, 2022

There's no publication date and archive.org's earliest snapshot is 2007. For which decade is this advice relevant?

jmillikin · on Sept 3, 2022

The article is an explanation (defense?) of the author's library Powtils[0], which was created in February 2004[1].

Even in 2004 I think it would have been extremely unusual to write web applications in Pascal and serve them via CGI. The first edition of Lazarus was released in 2001, and the name gives a hint about Pascal's popularity at the time. From what I remember of that era, PHP was dominant and FastCGI was a popular way to hook it up to non-Apache webservers such as IIS.

[0] http://z505.com/powtils/idx.shtml

[1] https://web.archive.org/web/20040604122859/http://www.psp.fu...

cylinder714 · on Sept 3, 2022

Code was last modified two years ago: https://github.com/z505/powtils

TazeTSchnitzel · on Sept 3, 2022

The site's favicon has:

  Last-Modified: Sun, 04 May 2008 00:32:10 GMT

Which means very little, they may have added it long after this page was written.

As this page is rendered with CGI, we don't get a Last-Modified for it. That's a CGI disadvantage :p

EDIT: Oh, http://z505.com/cgi-bin/qkcont/qkcont.cgi?p=PasWiki%20Direct... says "Info on this site dates back to 2004 and may be outdated."! 2004 or earlier then.

torstenvl · on Sept 3, 2022

> As this page is rendered with CGI, we don't get a Last-Modified for it. That's a CGI disadvantage :p

I'm assuming you already know that's false, based on the emoticon, but for anyone else reading... CGI gives you the flexibility to set (or not set) the Last-Modified header as you see fit.

Additionally, CGI is a pretty decent way to give web access to a static site generator. Write or copy your Markdown in a textbox, hit submit, md file is saved, SSG is run, done.

chasil · on Sept 3, 2022

It is also unlikely to be running on Windows, as Server 2003 (or perhaps 2008) would not last long connected to the internet.

Windows is also around 100x slower in in forking new processes, so CGI that requires this would not do well.

antonvs · on Sept 3, 2022

Since it's talking about Perl CGI websites, the decade for which it's relevant was the 1990s.

Phil Greenspun's book Database Backed Websites was published in 1997, and its coverage of CGI already started seeming rather quaint over the next few years as better approaches took over.

eesmith · on Sept 3, 2022

With a

  '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'

it's no earlier than about 1999, when HTML 4.01 came out.

The last big CGI-based web app I did started in 2001. By 2005 (when TurboGears came out) we knew the design was outdated.

throwaway787544 · on Sept 3, 2022

Perl was widely used for websites through at least 2005, mod_perl being the dominant expression of it.

antonvs · on Sept 7, 2022

It was widely used but also widely criticized by that point. Dead man walking, basically.

petercooper · on Sept 3, 2022

It doesn't always lead anywhere, but it can be useful to revisit old patterns because sometimes there have been enough advances in either speed or logistics to make them relevant again. Reinvigorated echoes of old ideas spring up all the time and it's fun to see (I'm currently a huge fan of the new SQLite scene and also new uses being found for VMs thanks to Firecracker).

Starting up a process is now so trivial, efficient and fast that there are certainly places where a CGI-like approach could scale a long way! Quite a bit of what's been going on in the "serverless" space has parallels, including some of the optimizations made to scale it up.

dragontamer · on Sept 3, 2022

Wait, does anyone really use CGI instead of FCGI these days? Let alone like, php-fpm pools and the like?

The general gist I've come up with myself is 1000x processes for fork()/exec() pattern, 100,000 threads for pthreads, and 10,000,000 golang / async / whatever.

Maybe I'm off by a magnitude or two, but that's my general assumption. If you've got less than 1000x processes, CGI probably is fine. But I can't think of any reason to use CGI over FCGI, and FCGI's threads should get you to 100k or so.

I do feel like golang/async offers significantly more performance, but at the tradeoff of extra complexity. There's something to be said about the simplicity innate to collection of .php files in a directory, compared to a Django / Ruby on Rails / Maven Java Tomcat / Golang proxy / etc. etc (or whatever you need to get properly async and/or golang / cooperative multitasking code up)

-----

Then again, I'm not really a web developer.

Anyway, if anyone has reasons to go with fork()/exec() apache / one-process-per-connection CGI 90s style methodology, I'm all ears. Its just... not really that much simpler than FCGI / php-fpm in my experience.

Its not like Python / PHP / Ruby are very fast anyway. We're all doing these langauges because convenience is more important than performance. Still though, there's probably a few orders of magnitude between CGI and FCGI.

torstenvl · on Sept 3, 2022

Modern Web Dev:

saves a fraction of a second on a fork() call

loads 5MB of React over user's two-bar 4G connection

"I'm helping!"

KronisLV · on Sept 3, 2022

> Its just... not really that much simpler than FCGI / php-fpm in my experience.

I mean, there are a lot of options to run PHP out there: for sites that don't get a lot of traffic, something like mod_php in Apache might just be enough, because configuring php-fpm means that you'll have another process running for executing PHP and you'll need to proxy traffic to it.

Personally, I recently tried building my own Ubuntu container images for all of the software that I want to run (excluding databases) and found this in particular to be a pain point. I could get php-fpm installed, used Supervisord to launch multiple processes in the container, but PHP was only executed when I built the container image locally, yet failed to execute PHP when the container image was built on the CI server.

I have no idea why exactly that was, even though I tried documenting that mess on my blog "Containers are broken": https://blog.kronis.dev/everything%20is%20broken/containers-...

Now, that was probably a build issue (that I managed to resolve by just moving PHP over to Nginx, which seemed to integrate with no issues for whatever random reason), but operationally mod_php seems way simpler, just something that plugs into your web server and lets you process requests directly - even if in a less scalable manner.

Then again, nowadays you have multiple MPM choices: https://httpd.apache.org/docs/2.4/mpm.html

So as far as I can tell, many folks might be using something like mpm_event, which generally performs a bit more efficiently than mpm_prefork or something like it.

But using PHP just for CGI, without even trying to integrate with a web server per se? I'm not sure, but given the example above, I'm sure that there are reasons for doing that as well, much like my example of moving one step back from php-fpm, to mod_php if need be.

effie · on Sept 5, 2022

The mainstream view is that php-fpm has better performance than old CGI, and is much better for domain isolation (no suexec hacks needed). But it is also more buggy in operation. It hangs from time to time, requires monitoring that restarts FPM when that happens.

Old CGI is nice in that you can easily set up Apache to launch any kind of executable, not just the one lagnuage script that the Fastcgi process manager supports. So you can have bash// python scripts, C binaries, all accesible through the web page.

midasuni · on Sept 3, 2022

Yes. I don’t need anything more. I want to deliver the output various scripts for me and my team, Bash, Perl, Python, etc, why would I care about saving a few milliseconds on a fork?

CodesInChaos · on Sept 3, 2022

How well does that sharing work on Windows nowadays with ASLR being popular? Traditionally Windows uses non relocatable exe-files, and DLLs that require fixups on relocation where you lose sharing for every page containing a fixup. I assume that nowadays relocatable exe-files are more common to enable ASLR.

But do modern compilers targeting Windows prefer position independent code over fixups? Or are fixups of x86-64 code rare enough and clustered enough that it doesn't really matter?

sgt · on Sept 3, 2022

That's it, I'm sold. I'm building my startup around CGI scripts! /s

petercooper · on Sept 3, 2022

It wasn't a typical use case, but my current business leant on a CGI script for a few years at the start. I needed to let users sign up for an email list and rather than spin up an entire app for that, I wrote a Ruby script and ran it CGI-style on httpd because the rest of the site was all static HTML files anyway. Would a dev team have ever taken such an approach? No. But when you're a solo founder, it's a privilege to use as little as is necessary.

sgt · on Sept 3, 2022

I understand you. The bigger the team is, the harder it is to use a technology that isn't the latest and greatest, despite being potentially the right tool for the job. A good example is a VueJS or React frontend backed by a couple of API services when all you needed was a Django app with a view and a model. Or even a PHP page.

CodesInChaos · on Sept 3, 2022

Just build it on AWS lambda, so you pay for a whole VM for each concurrent request. Progress!

icedchai · on Sept 3, 2022

I was going to say that Lambda is the new CGI (except Lambda is generally a worse developer experience and often higher latency.)

habibur · on Sept 3, 2022

I tried the "run a C program as cgi" on the web. Request per second per server dropped significantly.

Later converted those into fcgi and throughput increased drastically, like an increase of 10 times over previous.

But fcgi is a little bit tricky as your memory and resource leaks will accumulate, and if your program crashes, it will fault on subsequent requests too, and not just that one request as happens in cgi.

samus · on Sept 3, 2022

That part has an easy infrastructural solution: a TCP loadbalancer in front that switches backend every X days, and two backend processes that are restarted after the other took over. Can be on a single machine if high availability is not a concern.

somat · on Sept 4, 2022

fastcgi is the real elephant in the room. cgi has a reason for existing. a standard calling convention for loading a process with http request info. but fastcgi??? it is network server protocol. "hey guys, let transform our http network protocol into another almost but not quite the same network protocol" for crying out loud, why? just stick with http.

I am almost convinced the only reason it exists is that when it came time to replace the process per request to a single process that would handle multiple requests, people had a hard time thinking about it as a server, so when fastcgi came around and said hey we are cgi(it's not) but fast, they bought the name recognition. it almost appears like a scam but I am not sure who the grifters are.

effie · on Sept 5, 2022

> for crying out loud, why? just stick with http.

Because on a busy server, it's better to have lower number of interpreter processes than one per HTTP request. So multiple requests need to be passed to a single interpreter process, so there has to be some multiplexing of the requests into single data channel. Also some aspects of HTTP are traditionally handled by the frontend web server, and some requests do not reach the language interpreter (various access denials, client errors, etc.) So simple HTTP between web server and language interpreter on the same machine is not necessarily a particularly great idea; it would inhibit the development and even performance. It makes sense to parse text once and then pass data between programs in binary form. Fastcgi was good enough, so it stuck.

abigail95 · on Sept 3, 2022

> The only way of optimizing Ruby is not to use Ruby

lol

chrisseaton · on Sept 3, 2022

I don't think it's true - we've done tons of work to optimise Ruby.