I run a file sharing / content delivery platform called pixeldrain: https://pixeldrain.com
The system serves 4 PB of data to 60 million visitors per month. I have served 30 PB and 700 million file views since I started tracking usage somewhere in 2018.
I'll go from front to back:
- Most of the frontend is plain HTML, CSS and JS. I have started transitioning some pages to Svelte. I like this framework for its speed and simplicity
- Cloudflare Analytics to get basic info like which pages are popular and where my users are from
- The structure of the website (page wrap, menu, footer, etc) is managed with Go's template system
- Constellix for Geo-DNS. This automatically sends users to the server closest to them by doing Geo-IP lookups on the nameservers
- The user-facing servers are dedicated 10 Gbps Leaseweb servers, stuffed to the brim with SSDs in RAID6 for caching. Each of these servers costs €1200 per month. The storage servers are from Hetzner's SX line.
- The OS is Ubuntu 20.04 server edition. I use Ubuntu over Debian because it ships with TCP BBR
- The API is written in plain Go. The only HTTP libraries I use are httprouter for routing and Gorilla Websockets
- The storage system is custom built to spread files over multiple servers. I call it pixelstore. It's not open source (yet)
- The database is ScyllaDB. I landed on this one after going through multiple other systems with severe bottlenecks. I started with MySQL which was limited to a single location, so other locations had high latency. Then I tried CockroachDB, but it kept hanging under the load no matter how much hardware I threw at it. ScyllaDB is very fast and relatively reliable.
You mention that you have user-facing servers as well as storage servers. So do the user-facing servers act as reverse proxies for the storage servers, or do you simply serve a redirect?
I'd expect that the file access patterns are power-law distributions, i.e. recently uploaded files are requested more often than older files. If that's the case, can you use this property for sharding by having hot and cold storage servers?
How do you handle users uploading forbidden content? I see from another comment that you ban the usual types of illegal content. But in practice, do you manually review every mail you get on your abuse contact and take appropriate action? What's the most common type of complaint?
From a business perspective: How did you grow your site? I imagine the competition must be rough since file hosting is such a "simple" service.
> So do the user-facing servers act as reverse proxies for the storage servers, or do you simply serve a redirect?
Yes, my user-facing servers are proxying the files to the users.
> I'd expect that the file access patterns are power-law distributions
Exactly. On each server I have a sorted slice in-memory which keeps track of how often files are being requested. The most popular files are cached and files which drop out of the cache are moved to HDD storage servers in Helsinki. This way I can serve 10 Gbps of data to the users while only putting a 1 Gbps load on the storage nodes.
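A minimal sketch of the request counting described above, assuming one counter per file and a fixed number of cache slots. The names `Tracker`, `Hit` and `Hot` are hypothetical and this is not pixeldrain's actual code, which is not public.

```go
package main

import (
	"fmt"
	"sort"
)

// Tracker counts how often each file is requested.
type Tracker struct {
	hits map[string]int
}

func NewTracker() *Tracker {
	return &Tracker{hits: make(map[string]int)}
}

// Hit records one request for the given file ID.
func (t *Tracker) Hit(id string) { t.hits[id]++ }

// Hot returns the n most requested file IDs, most popular first.
// Ties are broken by ID so the result is deterministic.
func (t *Tracker) Hot(n int) []string {
	ids := make([]string, 0, len(t.hits))
	for id := range t.hits {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if t.hits[ids[i]] != t.hits[ids[j]] {
			return t.hits[ids[i]] > t.hits[ids[j]]
		}
		return ids[i] < ids[j]
	})
	if n > len(ids) {
		n = len(ids)
	}
	return ids[:n]
}

func main() {
	tr := NewTracker()
	for _, id := range []string{"a", "b", "a", "c", "a", "b"} {
		tr.Hit(id)
	}
	// With two cache slots, "c" would be the file that moves to cold storage.
	fmt.Println(tr.Hot(2))
}
```

Re-sorting on every lookup is only acceptable for a sketch. A server handling real traffic would more likely keep the slice sorted incrementally, as the comment above suggests.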
> How do you handle users uploading forbidden content?
I get a lot of abuse mails, mainly copyright violations. I have a mailserver which is hooked into pixeldrain which scans mails from common copyright offices and automatically blocks the files. For other types of abuse I have a report button on the download page. Users report content which breaks the rules and once the number of reports hits a certain threshold the file is blocked.
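The report button logic could look roughly like this. It is a sketch under assumptions: the threshold of 3 is made up (the real value is not stated), and `Moderation` and `Report` are hypothetical names.

```go
package main

import "fmt"

// reportThreshold is an assumed value; the real threshold is not public.
const reportThreshold = 3

// Moderation keeps a report count per file and the set of blocked files.
type Moderation struct {
	reports map[string]int
	blocked map[string]bool
}

func NewModeration() *Moderation {
	return &Moderation{
		reports: make(map[string]int),
		blocked: make(map[string]bool),
	}
}

// Report records one abuse report and returns true once the file is blocked.
func (m *Moderation) Report(fileID string) bool {
	m.reports[fileID]++
	if m.reports[fileID] >= reportThreshold {
		m.blocked[fileID] = true
	}
	return m.blocked[fileID]
}

func main() {
	mod := NewModeration()
	for i := 1; i <= reportThreshold; i++ {
		fmt.Printf("report %d, blocked: %v\n", i, mod.Report("file123"))
	}
}
```

A production version would probably also count each reporter only once per file, so that a single user cannot reach the threshold alone.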
> From a business perspective: How did you grow your site?
Most file sharing sites are really terrible. Like almost unusable. It's pretty easy to beat the competition in UX. I just started using pixeldrain myself on reddit and other forums. In the beginning people complained in the comments about the ergonomics of the site and I listened carefully. Eventually when the kinks were worked out other people started using it too.
Thanks for satisfying my curiosity! Also, congrats on your success!
> Yes, my user-facing servers are proxying the files to the users.
I've never operated a service as large as yours, so take my question with a grain of salt:
I'm wondering whether it would make sense to split off the actual file front-end servers from the user-facing servers (going for a redirect approach instead of proxying), since the requirements for serving the UI (low latency, low bandwidth) are so different from the file serving requirements (high bandwidth, but latency is not an issue). In theory, the traffic load from the files could negatively impact the UI latency leading to perceived sluggishness of the website. But perhaps that's not an issue in practice?
Since you mentioned elsewhere that you wanted to move to content delivery: What kind of content delivery do you have in mind? At the moment I can only think of either classic CDNs (but that's a few orders of magnitude larger) or ads (but that's an entirely different area).
There are a few reasons why I proxy the files instead of redirecting to the storage servers:
- The first is that I can have all my API endpoints under one domain. This simplifies downloading as you don't need to make a separate request to figure out where the file is stored.
- The storage servers that Hetzner sells only have 1 Gbps bandwidth. That runs out very quickly when a file goes viral. The 10 Gbps caching servers do a lot of heavy lifting here, which also helps the disks in the storage nodes last longer.
- I can also decide to switch to a different storage system on my storage nodes when I want. I have been considering deploying Reed-Solomon encoding for a while. That would make it impossible to link directly to a single storage server, because a single file would then be distributed over multiple servers.
- Sending out this much data uses a lot of RAM for TCP send buffers. Installing this RAM on a single content delivery node is cheaper than installing it on every storage server.
To prevent the bandwidth load from affecting the UI speed I have a rate limiter on the download API which slows down when the uplink reaches 95% capacity. This way there is always some bandwidth left for the HTML and database communications.
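One way such a limiter could decide how hard to brake is sketched below. The 95% mark comes from the description above. The linear ramp down to zero and the name `throttleFactor` are assumptions, not the actual implementation.

```go
package main

import "fmt"

// throttleFactor returns a multiplier in [0, 1] for download speed.
// Up to 95% uplink utilisation downloads run at full speed. Above that
// they are scaled back linearly, reaching 0 at 100% utilisation.
func throttleFactor(usedBps, capacityBps float64) float64 {
	const limit = 0.95
	util := usedBps / capacityBps
	switch {
	case util <= limit:
		return 1
	case util >= 1:
		return 0
	default:
		return (1 - util) / (1 - limit)
	}
}

func main() {
	const capacity = 10e9 // 10 Gbps uplink
	for _, used := range []float64{5e9, 9.5e9, 9.75e9, 10e9} {
		fmt.Printf("%.2f Gbps used -> factor %.2f\n", used/1e9, throttleFactor(used, capacity))
	}
}
```

The download handler would multiply its per-connection send rate by this factor, which leaves headroom for HTML and database traffic.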
With regard to content delivery: I want to use pixeldrain to serve static files. Nothing like the fancy site-wrapping tech that Cloudflare uses. The idea is that users can have a file tree on pixeldrain somewhat like Dropbox. They can copy the direct download link to a file and use it to embed videos, audio and pictures in their own websites. Because this is a lot simpler than other CDN services I can offer it at a very competitive price.
I have been looking at seaweedfs for a while. I suffer from a pretty bad case of NIH Syndrome, so my gut feeling says I should implement it myself. But seaweed is such a good fit that I should probably just try it.
I especially want to know how well it deals with adding / removing servers, data loss and network instability. The beautiful thing with writing my own solution is that it takes at most 10 minutes to diagnose and fix a problem when it occurs.
Hey, I was trying out Seaweed yesterday. I ran into an issue with HTTPS, it seems to be not very well supported at the moment.
I managed to get it running mostly secure, but replication didn't work because the volumes were calling each other with HTTP instead of HTTPS. The HTTP request is hardcoded here: https://github.com/chrislusf/seaweedfs/blob/43fd11278ef81185... and probably in a lot of other places.
I even tried setting the address of the volume to https in the startup script, but then it makes a request to http://https://volume1.example.com and it still fails.
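To illustrate how a URL like that can come about, here is a hypothetical sketch. It is not SeaweedFS's actual code; `naiveURL` and `volumeURL` are made-up names that show the failure mode and one possible way around it.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveURL mimics a client that always prepends "http://" to an address.
func naiveURL(addr string) string {
	return "http://" + addr
}

// volumeURL keeps an explicit scheme and only adds a default when missing.
func volumeURL(addr, defaultScheme string) string {
	if strings.HasPrefix(addr, "http://") || strings.HasPrefix(addr, "https://") {
		return addr
	}
	return defaultScheme + "://" + addr
}

func main() {
	addr := "https://volume1.example.com"
	fmt.Println(naiveURL(addr))           // reproduces the broken double scheme
	fmt.Println(volumeURL(addr, "https")) // leaves the address untouched
}
```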
I also noticed that the master API is still available over HTTP even when HTTPS is enabled. I can make an issue for these things if you want.
These issues are currently the only showstopper for me. I need to have every endpoint on TLS with peer verification enabled. If you can get it fixed I will gladly continue testing seaweedfs and support you on Patreon :-)
Hi! I had never heard of your service, so I googled it and accidentally misspelled "Pixel" as I often do as "Pixle". The first result was a reddit thread and the second seemed to be a full-length rip of Becoming Royal that had some really malicious ads which were clickjacking and covering the content. Just thought you might like to know. :)
Thanks for letting me know. I suspect my advertising partner is doing some shady business on my site. I have been trying to catch them but I never seem to get the malicious ads myself, even when using a VPN from more eastern or southern nations.
Could you maybe take a screenshot / screen capture of the ads and send it to support@pixeldrain.com? I'd really appreciate that.
Do you happen to be using an iPhone? The malicious ads seem to be targeting iPhones specifically. I can't reproduce it myself but I have told my advertiser about it.
I have some leverage here since I'm one of their largest publishers.
Note that this script is actually in the page itself, is NOT the ad, and couldn't have been injected by the ad.
It looks troublesome to me, as I don't see it on other similar pages on your site that also have ads.
Edit: Confirmed. Added "optyruntchan.com" to my local hosts file as 127.0.0.1. The page still has the ad at the bottom, but all the popups, overlays, redirects, are gone.
It's not like pixeldrain has been hacked or anything. I have been serving ads for this company for almost a year. It's just that they decided to serve full screen ads all of a sudden without consulting me.
I have removed them completely for now. I'll just be operating at a loss until I find an alternative ;-(
Ah, okay, sorry. I didn't see that piece of JavaScript on any other user-created pixeldrain pages, and it isn't involved in the ad at the bottom/center of the page. So, it did seem out of place to me. I can't even tell where it would place an ad.
No problem. I base the ads I show on the size of the file that's being downloaded. Larger files cost more bandwidth so they get more intrusive ads. But I don't want the ads to be so intrusive that they drive users away of course.
Normally these ads would show up as little floating windows in the bottom right corner.
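Picking an ad type from the file size could be sketched like this. The size boundaries and the three levels are invented for illustration. Only the idea that larger files get more intrusive ads comes from the comment above.

```go
package main

import "fmt"

// AdLevel describes how intrusive the ad on a download page is.
type AdLevel int

const (
	AdNone     AdLevel = iota // no ad at all
	AdBanner                  // static banner
	AdFloating                // small floating window in the corner
)

const mib = 1 << 20

// adLevel maps a file size to an ad level. The thresholds are hypothetical.
func adLevel(sizeBytes int64) AdLevel {
	switch {
	case sizeBytes < 10*mib:
		return AdNone
	case sizeBytes < 500*mib:
		return AdBanner
	default:
		return AdFloating
	}
}

func main() {
	for _, size := range []int64{1 * mib, 100 * mib, 2048 * mib} {
		fmt.Printf("%d MiB -> ad level %d\n", size/mib, adLevel(size))
	}
}
```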
My content delivery servers cost me about €2000 per month. I experiment a lot with different providers to drive cost down here. My storage servers are €800 per month. For that money I get about 500 TB of usable capacity (all my servers run raidz2). And then I spend about €150 on database servers. My Constellix DNS bill is €50 per month, and Mailgun is €5.
So about €3000 / month worth of infrastructure.
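As a quick check of that total, the figures above can be summed and set against the 4 PB of monthly traffic mentioned earlier in the thread. The cost per TB is derived here and was not stated by anyone, and it assumes 1 PB = 1000 TB.

```go
package main

import "fmt"

// monthlyCosts holds the figures quoted above, in euros per month.
var monthlyCosts = map[string]float64{
	"content delivery servers": 2000,
	"storage servers":          800,
	"database servers":         150,
	"Constellix DNS":           50,
	"Mailgun":                  5,
}

// totalCost sums all monthly cost items.
func totalCost() float64 {
	sum := 0.0
	for _, c := range monthlyCosts {
		sum += c
	}
	return sum
}

// costPerTB divides the monthly bill by the traffic served in TB.
func costPerTB(totalEUR, servedTB float64) float64 {
	return totalEUR / servedTB
}

func main() {
	total := totalCost()
	fmt.Printf("total: €%.0f per month, €%.2f per TB served\n", total, costPerTB(total, 4000))
}
```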
I started working on pixeldrain in 2015 during lunch breaks at school. The original version was a Spring Boot app with MariaDB. I had a lot of performance problems with it. My use case doesn't really fit the typical CRUD API model. I need tighter control over my connections and threads. That's why I switched to Go.
I run ads to pay for the servers. It's really hard to find advertisers for file sharing services because they don't want to run ads next to controversial content, and that's what file sharing sites are mostly used for. I have to resort to working with some really shady companies. Because of this I balance my ads so that they generate just enough revenue to pay for the servers.
Eventually I would like to turn some profit though. That's why I'm trying to shift from file sharing to content delivery.
I'm hoping to find a way to make it profitable somehow. There must be something that I can sell to these millions of people who visit my site every day.
And I simply like the technical challenges it provides. I like optimizing and scaling systems and designing database schemas. I'm learning tons of stuff this way. If an opportunity presents itself to apply this knowledge I'll have it ready.
Pixeldrain as it's running right now could probably do 20 PB / month. But I have systems in mind which could take it to 100 PB / month at little additional cost.
Curious to learn more about CockroachDB's problems you had. They market themselves as a turn-key solution to globally distributed databases, it's worrying that you had a problem with that very use case. Thanks for the additional details!
Okay this is just my understanding of the issue. I'm not sure if it's 100% correct.
CockroachDB slices up its tables in 16 MiB shards. These shards are distributed over three nodes, one master and two slaves. The master gets chosen with the Raft protocol. The master is the only server which can write to the shard. For a write to happen all three replicas need to be synchronized. They do this by locking the entire shard whenever a query comes in that writes to a row in the shard.
The problem is that if the latency between the master and the slave node is too large, you are severely limited in how many queries can be executed. For example if you have one replica in the US and one in Europe there might be a latency of 150 - 200ms between them. This means you can only execute about 5 write queries per second on that shard. Any more requests that come in are queued. After a while the queue gets so long that incoming queries have to be dropped.
This implementation is great for consistency, but if you frequently need to run update queries on your rows it's not the right solution. You need eventual consistency like Scylla and Cassandra provide.
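The "5 queries per second" figure follows from simple arithmetic if you assume, as described above, that each write holds the shard for one full round trip and that writes are never batched. A small sketch of that calculation, where `maxWritesPerSecond` is a hypothetical helper:

```go
package main

import "fmt"

// maxWritesPerSecond returns the upper bound on serialized writes to one
// shard when every write has to wait for one full round trip.
func maxWritesPerSecond(roundTripMillis float64) float64 {
	return 1000 / roundTripMillis
}

func main() {
	for _, rtt := range []float64{1, 150, 200} {
		fmt.Printf("%3.0f ms round trip -> at most %.1f writes/s per shard\n",
			rtt, maxWritesPerSecond(rtt))
	}
}
```

At 200 ms this gives 5 writes per second, and at 150 ms a little under 7. With nodes in the same datacenter (around 1 ms) the same model allows about 1000.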
Good summation. Exactly. They are solving for different things. Cockroach is solving for consistency of a distributed database. Scylla is solving for performance of a distributed database.
> Pixeldrain is struggling to get by financially. Because anyone can upload anything it's hard to find reputable advertisers who want to advertise on pixeldrain. Every month the ad revenue just barely covers the bandwidth costs. If this continues I will have to resort to adding more shady ads, or reducing the file size and bandwidth limits. That's not something I would like to do.
The vast majority of my income is from ads. It's a 10:1 ratio. Patreon definitely helps, and I hope that it will get so far that I have to rely less on ads to keep the site running.
The nice thing about Patreon is that they do my taxes for me. When I receive my money the taxes have already been paid. I can put it in my accounting software as tax free money. Patreon's fees are high though, so if something else comes along I'll give that a shot too.
Two more parts of my stack:
- UptimeRobot for monitoring
- Mailgun for account e-mail verifications
Feel free to ask me more questions :-)