Hacker News

Okay but I can't avoid noticing the bug in the copilot-generated code. The generated code is:

    async function isPositive(text: string): Promise<boolean> {
        const response = await fetch('https://text-processing.com/api/sentiment', {
            method: "POST",
            body: `text=${text}`,
            headers: {
                "Content-Type": "application/x-www-form-urlencoded",
            },
        });
        const json = await response.json();
        return json.label === "pos";
    }
This code doesn't escape the text, so if the text contains the character '&' or other characters with special meanings in form URL encoding, it will break. Moreover, these kinds of errors can cause serious security issues. Probably not in this exact case, where the worst an attacker could do is change the sentiment analysis language, but this class of bug in general is rife with security implications.
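To make the failure concrete, here's a quick sketch (the input string is made up) of how an unescaped '&' smuggles an extra form field into the body:

```typescript
// The naive interpolation from the generated code:
const text = "I love this&lang=nl"; // hypothetical attacker-influenced input
const naiveBody = `text=${text}`;

// Any application/x-www-form-urlencoded parser now sees two fields:
const parsed = new URLSearchParams(naiveBody);
console.log(parsed.get("text")); // "I love this" -- truncated at '&'
console.log(parsed.get("lang")); // "nl" -- injected parameter
```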

This isn't the first time I've seen this kind of bug either -- and this class of bug is always shown by people trying to showcase how amazing Copilot is, so it seems like an inherent flaw. Is this really the future of programming? Is programming going to go from a creative endeavor to make the machine do what you want, to a job which mostly consists of reviewing and debugging auto-generated code?



> Is this really the future of programming?

Not the one I want. I'd like to see better theorem proving, proof repair, refinement and specification synthesis.

It's not generating code that seems to be the difficulty, it's generating the correct code. We need to be able to write better specifications and have our implementations verified to be correct with respect to them. The future I want is one where the specification/proof/code are shipped together.


> It's not generating code that seems to be the difficulty, it's generating the correct code.

Also, it's not actually writing the code that is the difficulty when generating code changes for maintaining existing code; it's figuring out what changes need to be made to add up to the desired feature, which parts of the codebase control each of those, and possibly rediscovering important related invariants.

For that matter, the really critical part of implementing a feature isn't implementing the feature -- it's deciding exactly what feature, if any, needs to be built! Even if many organizations try to split that among people who do and don't have the title of programmer (e.g. business analyst, product manager...), it's fundamentally programming, and once you have a complex enough plan, even pushing around and working on the written English feature requirements resembles programming, with some of the same tradeoffs. And we all know the nature of the job is that even the lowliest line programmer in the strictest agile process will still end up doing some decision making about small details that end up impacting the user.

Copilot is soooooo far away from even attempting to help with any of this, and the part it's starting with is not likely to resemble any of the other problems or contribute in any way to solving them. It's just the only part that might be at all tractable at the moment.


Can't +1 enough. This is clearly the future, as the increasing popularity of type systems shows.

As interesting as those systems may be, they have it exactly backwards. Don't let the computer generate code and have the human check it for correctness: let the human write code and have the computer check it for correctness.


Actually, neither works on its own.

The human also needs to make sure the code does what the ticket asked for. And sometimes not even what the ticket asked for, but what the business and user want/need.

Honestly, writing code is such a small part of a dev's job. The real work is translating human (often non-technical) requirements into code, in a way that also anticipates requests that haven't been made yet.


> let the human write code and have the computer check it for correctness.

Isn't that pretty much what ML does?


That’s exactly the inverse of copilot and LLM code generation. The computer generates the code, the human checks if it’s correct


I bet most humans won't check. See the mindset of the author:

> ... I will not delve into the questions of code quality, security, legal & privacy issues, pricing, and others of similar character. ... Let’s just assume all this is sorted out and see what happens next.


How would ML know what the correct code should be?


Nailed it. I see AI coding as like hiring junior devs. It'll spit out code fast, but unless carefully reviewed, it's just a means to push the tech debt forward. Which isn't bad in itself, as tech debt is fine when companies understand the cost. But both are bad if there isn't a true understanding.


It's kind of like copying and pasting code from StackOverflow, except the code hasn't had as many eyes on it as something posted on StackOverflow, and it has a higher likelihood of seeming like a correct solution to your specific use case.

Honestly, it seems pretty dangerous. Seemingly correct code generated by a neural network that we think probably understands your code but may not. Easy to just autocomplete everything, accept the code as valid because it compiles, and discover problems at run-time.


Ha ha. I almost spit out my coffee. "when companies understand the cost". OMG that's hilarious.


Hit the nail on the head. The major problem with NNs is that they are completely unexplainable black boxes. The algorithm spits out a piece of code (or recommends clickbait videos, or bans you from a website, or denies you credit) based on a mishmash of assorted examples and a few million/billion parameters, but it can neither explain why it picked that output nor verify that it is correct.

Formal methods and theorem proving on the other hand, place a strong emphasis on proof production and verification.


The unsettling part is that something like Copilot requires a lot of the work that would be needed to create an amazing semantic-aware linter like no other on the market.

Instead, they decided to waste their work on that.


> amazing semantic-aware linter like no other on the market.

AWS has ML driven code analysis tools in their "CodeGuru" suite: https://aws.amazon.com/codeguru/


I think this requires a totally different approach to programming. The spec/proof/code cycle rarely exists in the real world. AI-generated code has the disadvantage of being trained on imperfect code, without any context, so it's like an enthusiastic, very junior copy-pasta coder.


Totally.


Yep, 100% agree. And the good news is that we are moving closer every day to having tools that make proving code correct practical. Languages like Idris, Agda, Liquid Haskell, F* (F-star), Lean etc. are spearheading this movement. If Rust added refinement types, it would become a killer language for writing correct, high-performance code.
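Short of full refinement types, you can fake a small slice of this today. A hedged TypeScript sketch (all names here are made up for illustration) using a "branded" type so only properly encoded strings can become a form body:

```typescript
// Hypothetical sketch: a branded type that can only be produced by the encoder.
type UrlEncoded = string & { readonly __brand: "UrlEncoded" };

// The sole constructor of UrlEncoded values performs the escaping.
function formField(key: string, value: string): UrlEncoded {
    return `${encodeURIComponent(key)}=${encodeURIComponent(value)}` as UrlEncoded;
}

function postForm(url: string, body: UrlEncoded): Promise<Response> {
    // Passing a raw `text=${text}` string here is now a compile-time error.
    return fetch(url, {
        method: "POST",
        body,
        headers: { "Content-Type": "application/x-www-form-urlencoded" },
    });
}
```

It's nowhere near a proof, but it pushes one invariant ("the body is always escaped") into the type checker instead of the reviewer's eyeballs.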


In all honesty, I typically spend almost twice as much time writing tests as I do writing the code that makes them pass.

More often than not, once I have the tests written, the code nearly writes itself. There are edge cases when I have to look things up, but most of the time I'm happily TDD'ing.


Isn't a sufficiently complete specification + proof against which to check generated code just going to mean that a human spent exactly the same amount of time writing it as they would without the AI generation side of things?


Sure, but you get provably correct code out of it. Not every productivity increase has to do with quantity.


Exactly!

A couple of years ago, we invested in using the libraries from Microsoft Code Contracts for a couple of projects. It was a really promising and interesting project. With the libraries you could follow the design-by-contract paradigm in your C# code. So you could specify pre-conditions, post-conditions and invariants. When the code was compiled you could configure the compiler to generate or not generate code for these. And next to the support for the compiler, the pre- and post-conditions and invariants were also explicitly listed in the code documentation, and there was also a static analyzer that gave some hints/warnings or reported inconsistencies at compile-time. This was a project from a research team at Microsoft and we were aware of that (and that the libraries were not officially supported), but still sad to see it go. The code was made open-source, but was never really actively maintained. [0]

Next to that, there is also the static analysis from JetBrains (ReSharper, Rider): you can use code annotations that are recognized by the IDE. It can be used for (simple) null/not-null analysis, but also more advanced stuff like indicating that a helper method returns null when its input is null (see the contract annotation). The IDE uses static analysis and then gives hints on where you can simplify code because you added null checks that are not needed, or where you should add a null-check and forgot it. I've noticed several times that this really helps and makes my code better and more stable. And I also noticed in code reviews bugs due to people ignoring warnings from this kind of analysis.

And finally, in the Roslyn compiler, when you use nullable reference types, you get the null/not-null compile-time analysis.
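(For comparison, TypeScript's strictNullChecks does a similar kind of compile-time null tracking; a minimal sketch:)

```typescript
// With "strictNullChecks" enabled, the compiler tracks null in the type itself.
function firstWord(s: string | null): string {
    // return s.split(" ")[0];  // compile error: 's' is possibly 'null'
    if (s === null) return "";
    return s.split(" ")[0]; // OK: null has been narrowed away
}
```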

I wish the tools would go a lot further than this...

[0] https://www.microsoft.com/en-us/research/project/code-contra... [1] https://www.jetbrains.com/help/resharper/Reference__Code_Ann...


There are tools and programming languages that are already way ahead of what you are describing. Examples are Idris, F* (F-Star), Dafny etc. They use Dependent Types and/or Refinement Types to make it possible to prove your code correct. There is now proven-correct code in the Windows Kernel and other large projects implemented with those tools.


Thank you for the references to those languages.. interesting!


>> I'd like to see better theorem proving, proof repair, refinement and specification synthesis.

Thanks for making me smile. Some of us are working on exactly that kind of thing :)


Machine readable specs?

Something along lines of this perhaps...

https://www.fixtrading.org/standards/fix-orchestra/


Aren't "Machine Readable Specs" a programming language?


I would argue a programmer who doesn't notice the encoding issue when looking at this code is the same kind of programmer who would write this kind of encoding issue. You definitely need to massage Copilot's generated code to meet your requirements; the key is that massaging takes way less time than writing it all from zero. Copilot is, as the name suggests, a human-in-the-loop system.

For me, this is definitely a piece of the future of coding, and it doesn't change coding from being a creative endeavour. It's just the difference between "let me spend five minutes googling this canvas API pattern I used 5 years ago and forgot, stumble through 6 blogs that are garbage, find one that's ok, ah right, that's how it works. Now what was I doing?" and just writing some prompt comment like "// fn to draw an image on the canvas" and going "ah, that's how it works. Tweak tweak, write write." For me the creativity is in thinking up the high-level code relationships, product decisions, etc. I don't find remembering random APIs or algos I've written a million times before to be creative.


Let’s take the security aspect one step farther: could a bad actor introduce popular code into copilot, which is full of subtle back-doors? Asking for a befriended state actor.


Why would you need a bad actor introducing a bug when Copilot already generates code that is indistinguishable from buggy (and possibly backdoored) stuff? It scraped random stuff on GitHub and was trained on it; nobody knows what the quality or content of that training set was.


Given our experience with Stack Overflow we probably know the answer to that question.


Why not put copilot code into a SAST tool as a training layer?


Is it cost effective? If companies think that this way of working is cutting development time and/or cost then it's the future. At least in the professional world.

Even bugs are acceptable if the advantages of auto generated code are large enough.


Depends on the bug and how much damage it causes, I guess. The more nefarious and subtle the bug, the more intimate you have to be with the code to understand it. For developers, this intimacy comes with actually writing the code. I know that when I write code manually, I'm not just tapping keys on a keyboard but building a mental model in my brain. When there's a bug, I can usually intuit where it's located if that model is detailed enough. Maybe developers of the future will just be code reviewers for ML models. Yuck, who wants to do that? So then will we just have ML models reviewing ML generated code?

My concern is that ML code generation will create very shallow mental models of the code in developers' minds. If everyone in the organization is building code this way, then who has the knowledge and skills to debug it? I foresee showstopping bugs from ML-generated code bringing entire organizations to a grinding halt in the future. I remember back in the day, when website generators were in vogue, it became a trend for many webdevs to call their sites "handmade" if they typed all the code themselves. I predict we'll see the same in other domains, contrasting "handwritten" source with ML-generated source.

What's that famous quote? "Debugging is twice as hard as writing the code in the first place." Well where does that leave us when you don't even have to write the code in the first place? I guess then we will have to turn to ML to handle that for us as well.


It's the old "C vs. Assembly" all over again.


Not quite - C is predictable(ish). It would be like C vs. Asm if there were no stable language spec, platform spec, or compiler.

I think it’s more like WYSIWYG website generators (anyone remember Cold Fusion? The biz guys LOVED that - for like a year).


Cold Fusion isn't WYSIWYG (although it was intended to eventually be, and that never panned out haha), are you thinking of DreamWeaver or MS FrontPage?


I must be remembering the marketing copy and not how well it worked out.


Except that it's not. I can be reasonably sure that my C compiler will generate correct machine code if my C code is correct. I can also treat the generated machine code as a build artifact which can be thrown away and re-generated on demand; I can treat my C code as my only source of truth. None of these things are true for Copilot, you need to keep around and maintain the generated code, and you can't just assume that the generated code is correct if your input is correct.


> If companies think that this way of working is cutting development time and/or cost then it's the future.

I think most companies' management would still leave decisions like that in the hands of developers/development-teams. At the level of management, I think companies aren't asking for "code" but for results (a website, an application, a feature). Results with blemishes or insecurities are fine but a stream of entirely broken code wouldn't be seen as cost effective.


If shortsighted management can offshore to the lowest bidder, they can believe that a few junior devs and ML code generation can replace a team of seniors. Sure, they can leave the decisions of frameworks in the hands of developers and development teams, but if those people are asked to justify the budgetary costs when compared to ML code generation it is no longer a technical argument.

The problem always comes later (often after the manager has moved on, is no longer aware of the mistakes that were made, and the repercussions can't follow them), when the maintenance costs of poorly written initial code come back to haunt the organization for years.


I don't disagree with that. My comment is more intended to be read as, "Is this the future we want?" rather than, "This doesn't seem like something which will take off".


Except for companies where software engineering is the core business anyway, which is super rare, I doubt anyone on the business side cares whether we want it. They care whether it is cost effective/works.

Which I personally suspect this will end up about as well as WYSIWYG HTML editors/site generators. Which do have a niche.

That would be the ‘makes a lot of cheap crap quickly, but no one uses it for anything serious or at scale’ niche.

Because what comes out the other end/under the hood is reprocessed garbage that only works properly in a very narrow set of circumstances.

I suspect with this market correction however we’ll see a decrease in perceived value for many of these tools, as dev pay is going to be going down (or stay the same despite inflation), and dev availability is going to go up.


You can't have cost-effective when the code doesn't work because it has been generated by a monkey that understands neither the logic nor the requirements.

That the monkey is actually a computer and not an actual banana eating primate doesn't change anything there.

Any business fool that tries this will discover this very quickly - that's like claiming we don't need drivers because we have cruise control in our cars! It is not perfect but it is cost effective/works (some of the time ...)!


I agree - that wouldn’t stop a bunch of sales folks from selling it though, or some folks maybe trying it and claiming it worked.

Do it enough times, and I’m sure it will eventually compile - might even complete a regression test suite!


Why are you comparing Indians to monkeys? That's totally not ok.


I don't think "we" have any say in this---it's a question of whether this is what the people with money want.


For the benefit of others wondering how it should be done instead: it’s best to use a proper type that will do the encoding, instead of passing a raw body. Sure, you can perform the encoding yourself (e.g. `text=${encodeURIComponent(text)}`) and pass a raw body, but as things scale that’s much more error-prone; it’s better to let types do the lifting from the start (kinda like lifting heavy things by squatting rather than bending at the waist).

For application/x-www-form-urlencoded as shown, that’d be using URLSearchParams:

  fetch("https://text-processing.com/api/sentiment", {
      method: "POST",
      body: new URLSearchParams({ text }),
  })
If you wanted multipart/form-data instead (which I expect the API to support), you’d use FormData, which sadly can’t take a record in its constructor:

  const body = new FormData();
  body.append("text", text);
  fetch("https://text-processing.com/api/sentiment", { method: "POST", body })
Note that in each case the content-type header is now superfluous: the former will give you "application/x-www-form-urlencoded;charset=UTF-8" (which is all the better for specifying the charset) and the latter a "multipart/form-data; boundary=…" string. (Spec source: https://fetch.spec.whatwg.org/#concept-bodyinit-extract.)

As a fun additional aside, you’ve made a tiny error in your transcription: the URL was actually surrounded in backticks (a template literal), not single quotation marks. Given the absence of a tag or placeholders (which would justify a template literal) and the use of double-quoted strings elsewhere, both of these would be curious choices that a style linter would be very likely to complain about. So yeah, just another small point where it’s generating weird code.


The thing is, devs already write this kind of bug on a regular basis. I saw it firsthand as a pentester. So at worst, it's still matching the status quo.


That's kind of fallacious. First it assumes that humans and Copilot are about on par just because humans sometimes write that kind of bug; in reality we need statistics about the relative rates, where I would assume that experienced programmers write that kind of bug less frequently than Copilot, since it seems to be in virtually every Copilot show-case. Second, it categorizes humans as a whole into one group and Copilot into another; in reality, there are better and worse human programmers, while there's just the one Copilot which seems to be on par with a pretty shitty programmer.


The programmers that consistently create this kind of code are what is normally called "incompetent".

Yes, many companies prefer to hire them, or can only retain them for a variety of reasons. None of this is good in any way.

Anyway, those programmers have really good odds of stopping writing this kind of code given some learning. Adding Copilot as a "peer" just makes them less likely to learn, and makes all programmers more likely to act like them. That's not matching the status quo; that's a very real worsening of it.


Well, except humans don’t do it as consistently at scale! Hah


Not to excuse security holes, but at least they're adversarial - I would expect this code to generate all kinds of error logs if tested against any reasonable corpus along the "happy path" - if you ran your code against a sample of extremely representative input you'd find this.

Security holes are more excusable because someone who didn't realize the above could happen maybe never tested it... given the use case, this is more like "did you even run your code?"


>> This isn't the first time I've seen this kind of bug either -- and this class of bug is always shown by people trying to showcase how amazing Copilot is, so it seems like an inherent flaw.

I think it's because people copy/paste generated code without reading it carefully. They eyeball it, it makes sense, they go tweet about it.

I don't know if this predicts how people will mostly use generated code. I note, however, that this is probably too much code to expect Copilot to generate correctly: about 10 LoC is too much for a system that can generate code but can't check it for correctness of some sort. It's better to use it for small snippets of a couple of lines, like loops and branches, than to ask it to generate entire functions. The latter is asking for trouble.


I don't think you're right here frankly, since the buggy snippet is taken from the Copilot marketing page (https://github.com/features/copilot). The examples on that page which could conceivably have missing escape bugs are the sentiment analysis example (sentiments.ts), the tweet fetcher examples (fetch_tweets.js, fetch_tweets.ts, fetch_tweets.go) and the goodreads rating examples (rating.js, rating.py, rating.ts, rating.go). Of all of them, only the rating.go example is without a serious escaping bug, and only because Copilot happened to use a URL string generation library for rating.go.

These are the examples which GitHub itself uses to demonstrate what Copilot is capable of, so it's not just a matter of people tweeting without reading through the code properly. It also suggests that the people behind Copilot do believe that one primary use-case for Copilot is to generate entire functions.


Hm, thanks, I wasn't aware of that.

Well, OpenAI are certainly trying to sell copilot as more capable than it is, or anyway they haven't done much to explain the limitations of their systems. But they're not alone in that. I can't think of many companies with a product they sell that tell you how you _can't_ use it.

Not to excuse misleading advertisement. On the contrary.


> Is programming going to go from a creative endeavor to make the machine do what you want, to a job which mostly consists of reviewing and debugging auto-generated code?

Maybe, in some organizations, sure. However, there are still people hand-crafting things in wood, metal, and other materials even though we have machines that can do almost anything. Maybe career programmers will turn into "ML debuggers", so perhaps all of us who enjoy building things ourselves will just stop working as programmers? I certainly won't work in the world where I'm just a "debugger" for machine-generated crap.


Written by GPT-3:

  function evil(code) {
      eval(code);
  }

  evil("alert('hi');");


So essentially Copilot is very dull and boring (for now).

Taking your comment as inspiration, I would like to add OWASP/unit-testing thinking to Copilot. If Copilot took your remarks (and maybe others') into account, it would become helpful. Something like security checking on the fly, covering what would normally be caught by colleagues during code reviews or checks with SonarQube.


Also can't avoid seeing the bug in the reasoning:

> Let’s just assume all this (code quality, security, legal & privacy issues, pricing, and others of similar character) is sorted out and see what happens next.

What could go wrong next in the systems where they employ this AI with this mindset?


This type of thing is currently useful for rapid prototyping, or building a quick framework. At the end of the day though, experienced coders still need to fix the bugs, modify the features, improve efficiency, and review whether the program is doing what it is truly intended to do.


>This code doesn't escape the text, so if the text contains the letter '&' or other characters with special meanings in form URL encoding, it will break

Why? POST data doesn't need to be URL encoded.


Yes, it does. POST data with `Content-Type: application/x-www-form-urlencoded` uses the same encoding format as query parameters in a URL -- `foo=a&bar=b` -- hence the name.
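A quick sketch of the difference, using the same "text" field as the snippet above (with a made-up input):

```typescript
const text = "fish & chips";

// Unescaped: '&' keeps its field-delimiter meaning and corrupts the body.
const broken = `text=${text}`; // "text=fish & chips"

// Escaped: percent-encoding neutralizes the special characters.
const safe = `text=${encodeURIComponent(text)}`; // "text=fish%20%26%20chips"
```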



