I am Chris, the author. By the way, this is not a new project. It has been around since 2011. Many people have tested it and contributed to it. It may not be the tool for everyone, but I guess many companies will need a key-to-file store.
There is also a "Filer" component.
This is a side project. Any VC is welcome to invest in it.
I was wondering why so many distributed file systems - all claiming high performance / availability / scalability - are being implemented from scratch, when mature ones like Gluster and Ceph have been around for years. Why start new ones instead of improving existing ones? My guess is that new infrastructure software like SeaweedFS and LeoFS is created mainly as a way to implement the same ideas in new languages like Go, Rust, and Erlang. I'm not saying it's a bad thing, just that it's one possible reason so many alternatives are being implemented from scratch.
I'm reading between the lines here, and I'd love for Chris Lu (the SeaweedFS author) to chime in here and correct me, but it seems like Chris read Facebook's Haystack paper and decided to implement it himself in Go. Kind of like how Riak started as "let's take the Dynamo paper and implement it in Erlang."
The README mentions that Seaweed is much easier to set up than Ceph. The more apt comparison would be the RADOS component of Ceph--both Seaweed and RADOS are key/value oriented. I've now set up both, and for sure Seaweed is easier to get going.
I have a fair bit of experience with RADOS (all quite positive) but I really do appreciate the ease of use Seaweed presents. I've only been using it for an hour or so, just doing benchmarks and walking through fault scenarios, and so far Seaweed looks quite good!
FWIW, I don't think this should be called a filesystem at all. Seaweed isn't a filesystem which forsakes some aspects of POSIX, it's nothing like a filesystem. The more proper term would be "distributed object store."
> but it seems like Chris read Facebook's Haystack paper and decided to implement it himself in Go
That definitely sounds like the initial impetus for this project, and a perfectly good reason to make "yet another" file system. Additionally, I can understand how taking POSIX compatibility out of the requirements might make writing and maintaining this project a bit more fun.
Not everybody takes that approach. We developed HopsFS by re-implementing the metadata layer for HDFS as a distributed scale-out layer (paper being presented at USENIX FAST next month - https://www.usenix.org/conference/fast17/technical-sessions). As a researcher, this was not a good idea, as it has taken us many years. Time-to-market for our paper was extremely slow, but the upshot is that we have a production system. The big risk for us was that our work would not be seen as relevant by the time we were finished; it has been a high-risk research angle to take.
Very interesting, I'll try out Hops. Hadn't come across it before - perhaps you can post a "Show HN" for others like me. And best wishes for your paper presentation!
Not to trash talk ceph, but it has had some substantial performance & resiliency issues which are only beginning to be addressed in a production-ready way in 2017 (via bluestore, multi-MDS cephfs, local SSD ordered writeback caching for VMs, etc).
I've been following Ceph (testing it but not using it in production) because of my interest in using RADOS for a project. BlueStore, hot/warm tiering, erasure coded pools are all pretty darn nice features. Are you aware of ongoing issues at the RADOS level?
In my past explorations of Ceph's higher-level pieces, e.g. RGW and CephFS, those appeared much less baked than RADOS. That experience was a couple of years old, however, and they may have fixed all that since.
My main concern has been FileStore, i.e. running an object store on top of btrfs/xfs/zfs, which even an armchair storage hobbyist (like me) can see, from first principles, is a truly awful idea (journal-on-journal, etc.).
You are right, but it seems to me Erlang became _popular_ somewhat recently when functional programming started getting attention. CouchDB was the first major project in Erlang that I'd come across, and that was around 6-7 years ago.
There's still a lot of room for improvement in distributed filesystems - existing distributed filesystems are nowhere near as capable as they ought to be, and it's going to require new, more or less clean-sheet architecture(s) before we get there. There are a lot of hard problems that still haven't had really adequate solutions implemented in production code.
Hell, even network filesystems (non-distributed) aren't where they should be.
To illustrate: there is no fundamental reason why a distributed filesystem shouldn't, in practice, perform more or less equivalently to a non-distributed network filesystem. That's not the case today; existing distributed filesystems have slowpaths - generally, things involving metadata and anything that requires coordinating servers - where they're drastically slower than, say, existing NFS or 9p implementations, and these aren't obscure corner cases, these are things real applications hit.
Reasoning from first principles, we know that:
- It's not going to be possible to create a distributed filesystem that doesn't have these slowpaths; it's probably always going to be possible to create an evil/pathological workload that requires lots of cross-server synchronization, but:
- Most real-world applications don't behave so badly. Imagine you just want a single filesystem that has a crapton of home directories, and you want to throw it all on a distributed filesystem instead of giant NetApp filers or whatever. We should be able to do that, and have it perform just as well (to within epsilon) as using NFS and explicit sharding. We can't really do that today.
In my opinion, we don't really have an architecture implemented today that's capable of solving these problems - that said, I'm not familiar with Ceph's architecture, so I don't know how it tackles sharding. I have my own ideas about how it could be done (for a while, I was working on something on top of the bcache/bcachefs codebase; I actually implemented working, fast distributed block storage, but not a filesystem, and it didn't have nearly everything it would have needed to scale out horizontally very far). But I never got far enough into the filesystem aspect to really prove anything.
For network (non-distributed) filesystems - again, those aren't as good as they ought to be either. But the problem there is much simpler - nobody, to my knowledge, has built anything that would let you really do cache coherency correctly: you can't do it by just using an existing local filesystem implementation and serving it over a new protocol, because you can't do cache invalidation correctly without storing fine-grained versioning information that local filesystems don't need, and nobody's wanted to try grafting that onto existing local filesystems. A lot of the story of NFS is hacks to get around that problem.
I've been using SeaweedFS (previously WeedFS) in production for a couple of years. I needed to share files on a LAN, including small VPSes, so distributed file systems (like GlusterFS) that require a kernel module did not work for me.
I store small files. Random access with WeedFS is very consistent and fast. If you use multiple volume servers, the ID is unique across all servers. You get redirected to the right volume if you try to access an ID on the wrong server, which does add latency.
My biggest complaint is about replication. You can define rules, like replicate on the same rack or a different rack, how many copies, etc. But if SeaweedFS fails to replicate a file, it does not store it at all. I would prefer best-effort replication rather than a guarantee. This is a big issue when master synchronization fails for some reason and only one volume is "visible" for some period of time.
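To make the workflow concrete, here is a rough Go sketch of how I talk to it (my own sketch, assuming the default localhost ports; as far as I read the docs, the "001" replication string means one extra copy on a different server in the same rack): assign a file id from the master, upload to the volume server it names, then look the volume up again the way a reader would.

    package main

    import (
    	"bytes"
    	"encoding/json"
    	"fmt"
    	"io"
    	"mime/multipart"
    	"net/http"
    	"strings"
    )

    type assignResult struct {
    	Fid string `json:"fid"`
    	Url string `json:"url"`
    }

    func main() {
    	// Ask the master for a file id; the replication setting rides along
    	// as a query parameter.
    	resp, err := http.Get("http://localhost:9333/dir/assign?replication=001")
    	if err != nil {
    		panic(err)
    	}
    	var a assignResult
    	if err := json.NewDecoder(resp.Body).Decode(&a); err != nil {
    		panic(err)
    	}
    	resp.Body.Close()

    	// Upload the bytes as a multipart form to the volume server we were given.
    	var buf bytes.Buffer
    	w := multipart.NewWriter(&buf)
    	part, _ := w.CreateFormFile("file", "hello.txt")
    	part.Write([]byte("hello seaweed"))
    	w.Close()
    	up, err := http.Post("http://"+a.Url+"/"+a.Fid, w.FormDataContentType(), &buf)
    	if err != nil {
    		panic(err)
    	}
    	up.Body.Close()

    	// Later, any client can ask the master which volume servers hold this
    	// volume; asking the "wrong" volume server just costs an extra redirect.
    	volumeId := strings.SplitN(a.Fid, ",", 2)[0]
    	loc, err := http.Get("http://localhost:9333/dir/lookup?volumeId=" + volumeId)
    	if err != nil {
    		panic(err)
    	}
    	body, _ := io.ReadAll(loc.Body)
    	loc.Body.Close()
    	fmt.Printf("stored as %s, locations: %s\n", a.Fid, body)
    }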
Can you tell us more about your production environment? Like how much data, average IOPS, and uptime requirements? I've been poking around and I can't find any real info on who's using this or how. The project's interesting and I'd love to hear more.
I use it to store many small files (JSON and images). My requirements were:
1. predictable read time
2. read/write access through LAN
3. server easily installable on VPS
4. simple API
The main downsides I found so far:
1. Master splits are very hard to recover from. They happen rarely, but when one does, be ready to restart the masters many times in random order. The latest version seems to have fixed this.
2. No easy way to list the content of a volume. You need to use the command line on the volume files
3. Cleanup is hard (compacting the volume files); I feel that the GarbageThreshold is not working well
4. Hard to resize the volumes (i.e. set lower limits) once the system has been started
5. Replication, as I mentioned in another comment
Overall it works fine for me. The potential alternative (not fully tested) was Cassandra (the 64MB limit per object that it had when I checked is fine with me).
>When testing read performance on SeaweedFS, it basically becomes performance test your hard drive's random read speed. Hard Drive usually get 100MB/s~200MB/s.
There are no modern hard drives that get 100-200MB/sec on RANDOM reads. Random ends up being an exercise in IOPS not throughput, and there's absolutely no way you're pulling enough IOPS out of a SAS or SATA hard drive to get anywhere NEAR 100-200MB/sec on random workloads. I get the "it's the software dummy" thing everyone is so hyped about, but if you know THAT little about the underlying hardware and how it functions, you shouldn't be writing filesystems.
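To put rough numbers on that (my own back-of-the-envelope, not the project's): a 7200 RPM SATA drive tops out somewhere around 100-200 random IOPS, so at 4 KB per read that's roughly 0.4-0.8 MB/s, and even 64 KB random reads only get you to about 6-13 MB/s. You only approach the drive's 100-200 MB/s rating when reads are large and mostly sequential.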
Also - what's the protection mechanism for files? If there is none, I literally have no idea what the purpose of this FS is. Under what circumstances would I need a giant slow data storage tier that has no redundancy to speak of?
Sorry if my wording is misleading. I am only saying that the SeaweedFS code does not try to do anything unnecessary, so you can extract most of the performance juice out of your hard drive or SSD.
My best guess is that these numbers are from short benchmarks they ran on SSDs before the drives reached steady state... I've pretty frequently been able to hit upwards of 200 MB/s when running short (read: misleading) benchmarks.
(Not an author or anything, just found this project today, and it's interesting enough for me to test it out.)
Edit to add: I've done some testing, SeaweedFS definitely replicates. When the client wants to read from a volume, it should do a lookup and Seaweed will return the URLs of all volume servers which are up at the moment. So far I've just been testing 2-way replication with one server down at a time; it works fine.
Agree, that sentence doesn't make sense. But I think the point is simply that accessing data adds hardly any overhead, so you're really benchmarking your disks. Given the design, that seems likely.
> Also - what's the protection mechanism for files? If there is none, I literally have no idea what the purpose of this FS is. Under what circumstances would I need a giant slow data storage tier that has no redundancy to speak of?
Caches, that's what makes me interested in this (having to cache lots of images). We're using MogileFS for this currently.
IPFS is a protocol and implementation of the distributed web. SeaweedFS is an open source version of Facebook's Haystack, a centralized filesystem (with redundancy across multiple servers). See https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...
TL;DR: Haystack implements a sort of append-only journal design where smaller files are concatenated into a single large "journal" file. After adding a file, you get a "filename" back that encodes the information needed to locate the data you are interested in: volumename_seekposition.jpg, for example.
Essentially, you eliminate the need for FS metadata this way, which leads to fewer disk hits for read operations. Open the volume, seek to the position, read the contents (including metadata) until the end of the file, and you're done.
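To make that read path concrete, here is a toy Go sketch. The needle layout (an 8-byte length prefix, then the blob) is invented for illustration; the real Haystack/SeaweedFS record has more fields (key, cookie, checksum, padding), but the shape of a read is the same: open the volume file, seek, read, done.

    package main

    import (
    	"encoding/binary"
    	"fmt"
    	"io"
    	"os"
    )

    // readNeedle returns the blob stored at a given byte offset inside a volume
    // file. There is no directory tree and no per-file inode to consult: the
    // "filename" handed back at write time already encodes volume + offset.
    func readNeedle(volumePath string, offset int64) ([]byte, error) {
    	f, err := os.Open(volumePath)
    	if err != nil {
    		return nil, err
    	}
    	defer f.Close()

    	if _, err := f.Seek(offset, io.SeekStart); err != nil {
    		return nil, err
    	}
    	// Pretend layout: an 8-byte big-endian length, then the blob itself.
    	var size uint64
    	if err := binary.Read(f, binary.BigEndian, &size); err != nil {
    		return nil, err
    	}
    	data := make([]byte, size)
    	if _, err := io.ReadFull(f, data); err != nil {
    		return nil, err
    	}
    	return data, nil
    }

    func main() {
    	// e.g. a name like "3_1048576.jpg" would mean: volume 3, offset 1048576.
    	blob, err := readNeedle("/data/volume_3.dat", 1048576)
    	if err != nil {
    		fmt.Println("read failed:", err)
    		return
    	}
    	fmt.Printf("read %d bytes\n", len(blob))
    }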
There is then a single-master database that controls the location of the volumes (with a redundant failover), which is possible because managing volumes is less difficult than managing the position of every single file in all the volumes in the system. I think they just use MySQL, amazingly.
Pretty neat idea. It's designed very specifically for photo storage at Facebook.
The main drawback of this design is that it doesn't support filesystem paths well - to do that you need a K/V database to map the name you want to the filename needed by Haystack/SeaweedFS. SeaweedFS can do this with a separate "filer" database backed by Redis or Cassandra, but the Redis and Cassandra backends are currently implemented as "flat namespace" stores, so filers using them cannot perform directory listings at this time. Huge drawback if you need to 'ls' subdirectories of files, for example.
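A toy sketch (mine, not the actual filer code) of why a flat path-to-fid mapping makes exact lookups trivial but 'ls' awkward:

    package main

    import (
    	"fmt"
    	"strings"
    )

    func main() {
    	// Flat namespace: full path -> SeaweedFS file id (example ids made up).
    	filer := map[string]string{
    		"/photos/cat.jpg":  "3,01637037d6",
    		"/photos/dog.jpg":  "3,02a1b2c3d4",
    		"/docs/readme.txt": "5,07deadbeef",
    	}

    	// Exact lookup: a single key/value GET, which Redis/Cassandra are great at.
    	fmt.Println(filer["/photos/cat.jpg"])

    	// "ls /photos/": with only GET/SET by key you are reduced to scanning
    	// every key for a prefix, which a remote K/V store won't let you do
    	// cheaply. You need an ordered store or explicit per-directory entries.
    	for path, fid := range filer {
    		if strings.HasPrefix(path, "/photos/") {
    			fmt.Println(path, fid)
    		}
    	}
    }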
That was a dealbreaker for me for neocities.org implementation, also maintenance was still an open question with SeaweedFS (when things fail, what do I do to fix them?) so I ended up going with CephFS.
That's great news! I'm going to take a look at it tonight.
Postgres, of course, does need some special work to fail over gracefully to a standby server in production. It would be nice to see something that's designed for it, like Redis, work here, but of course that's a limitation of how Redis is designed.
Regardless, I don't think I mentioned this properly, but SeaweedFS is amazing and it's very impressive, and using it at some point is not yet off the table for us. :)
"Instead of managing chunks, SeaweedFS manages data
volumes in the master server. Each data volume is size
32GB, and can hold a lot of files. And each storage node
can have many data volumes. So the master node only needs
to store the metadata about the volumes, which is fairly
small amount of data and is generally stable."
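To put the "fairly small" claim in perspective (my own back-of-the-envelope numbers, not from the README): 1 PB of data at 32GB per volume is roughly 32,000 volumes, and even at ~100 bytes of metadata per volume that is only a few megabytes of master state, versus the billions of per-file entries a master tracking individual files or chunks would have to hold.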
You can (and should) have multiple masters, but each volume has a primary master. I did see other masters lose the volume state information after the primary master had gone down and been taken out of the cluster.
I am not sure that SeaweedFS and IPFS fit the same use case. IPFS seems like it wants to be a bedrock layer of a newer style of internet, while SeaweedFS is more of a blob store. So comparing them might be possible, but it also might not be the right idea.
It's not a filesystem either...I don't get why they keep using that terminology. It doesn't have directories, permissions, seek(), copy without the client pulling all the data and pushing it back, etc.
It's a KV data store. Which is fine...nothing wrong with that. It just isn't a filesystem in any sense I'm familiar with.
Right. The word filesystem has a de facto meaning for many, though. Not POSIX either... FAT, NTFS, etc. are filesystems. There's implied functionality that goes with the word.
This isn't a filesystem any more than etcd would be.
Seems to gloss over the details. Who is grooming the replication in case a volume replica disappears? What happens when a client is writing a replicated file and crashes in the middle?
If one volume replica disappears, the whole volume set becomes read-only. There is no magic to auto-correct things, which I personally think is where problems start.
I've just looked at it. The concept seems to be the same. SeaweedFS is much simpler: you have only two services, master and volume, both speaking HTTP only. MogileFS seems to offer more control and more tools (rebalancing, for example).