I am Chris, the author. By the way, this is not a new project. It has been around since 2011. Many people have tested it and contributed to it. It may not be the tool for everyone, but I guess many companies will need a key-to-file store.
There is also a "Filer" component.
This is a side project. Any VC is welcome to invest in it.
I was wondering why so many distributed file systems - all claiming high performance / availability / scalability - are being implemented from scratch, when mature ones like Gluster and Ceph have been around for years. Why start new ones instead of improving existing ones? My guess is that new infrastructure software like SeaweedFS and LeoFS is created mainly as a way to implement the same ideas in new languages like Go, Rust, and Erlang. I'm not saying it's a bad thing, just that it's one possible reason so many alternatives are being implemented from scratch.
I'm reading between the lines here, and I'd love for Chris Lu (the SeaweedFS author) to chime in here and correct me, but it seems like Chris read Facebook's Haystack paper and decided to implement it himself in Go. Kind of like how Riak started as "let's take the Dynamo paper and implement it in Erlang."
The README mentions that Seaweed is much easier to set up than Ceph. The more apt comparison would be the RADOS component of Ceph--both Seaweed and RADOS are key/value oriented. I've now set up both, and for sure Seaweed is easier to get going.
I have a fair bit of experience with RADOS (all quite positive) but I really do appreciate the ease of use Seaweed presents. I've only been using it for an hour or so, just doing benchmarks and walking through fault scenarios, and so far Seaweed looks quite good!
FWIW, I don't think this should be called a filesystem at all. Seaweed isn't a filesystem which forsakes some aspects of POSIX, it's nothing like a filesystem. The more proper term would be "distributed object store."
> but it seems like Chris read Facebook's Haystack paper and decided to implement it himself in Go
That definitely sounds like the initial impetus for this project, and a perfectly good reason to make "yet another" file system. Additionally, I can understand how taking POSIX compatibility out of the requirements might make writing and maintaining this project a bit more fun.
Not everybody takes that approach. We developed HopsFS by re-implementing the metadata layer for HDFS as a distributed scale-out layer (paper being presented at USENIX FAST next month - https://www.usenix.org/conference/fast17/technical-sessions). As a researcher, this was not a good idea, as it has taken us many years. Time-to-market for our paper was extremely slow, but the upshot is that we have a production system. The big risk for us was that our work would not be seen as relevant by the time we were finished; it has been a high-risk research angle to take.
Very interesting, I'll try out Hops. Hadn't come across it before - perhaps you can post a "Show HN" for others like me. And best wishes for your paper presentation!
Not to trash talk ceph, but it has had some substantial performance & resiliency issues which are only beginning to be addressed in a production-ready way in 2017 (via bluestore, multi-MDS cephfs, local SSD ordered writeback caching for VMs, etc).
I've been following Ceph (testing it but not using it in production) because of my interest in using RADOS for a project. BlueStore, hot/warm tiering, erasure coded pools are all pretty darn nice features. Are you aware of ongoing issues at the RADOS level?
In my past explorations of Ceph's higher-level pieces, e.g. RGW and CephFS, those appeared much less baked than RADOS. That experience was a couple of years old, however, and they may have fixed all that since.
My main concern has been FileStore, i.e. running an object store on top of btrfs/xfs/zfs, which even an armchair storage hobbyist (like me) can see, from first principles, is a truly awful idea (journal-on-journal, etc.).
You are right, but it seems to me Erlang became _popular_ somewhat recently when functional programming started getting attention. CouchDB was the first major project in Erlang that I'd come across, and that was around 6-7 years ago.
There's still a lot of room for improvement in distributed filesystems - existing distributed filesystems are nowhere near as capable as they ought to be, and it's going to require new, more or less clean-sheet architecture(s) before we get there. There are a lot of hard problems that still haven't had really adequate solutions implemented in production code.
Hell, even network filesystems (non-distributed) aren't where they should be.
To illustrate: there is no fundamental reason why a distributed filesystem shouldn't, in practice, perform more or less equivalently to a non-distributed network filesystem. That's not the case today; existing distributed filesystems have slowpaths - generally, things involving metadata and anything that requires coordinating servers - where they're drastically slower than, say, existing NFS or 9p implementations, and these aren't obscure corner cases, these are things real applications hit.
Reasoning from first principles, we know that:
- It's not going to be possible to create a distributed filesystem that doesn't have these slowpaths; it's probably always going to be possible to create an evil/pathological workload that requires lots of cross-server synchronization, but:
- Most real-world applications don't behave so badly. Imagine you just want a single filesystem that has a crapton of home directories, and you want to throw it all on a distributed filesystem instead of giant NetApp filers or whatever. We should be able to do that, and have it perform just as well (to within epsilon) as using NFS and explicit sharding. We can't really do that today.
In my opinion, we don't really have an architecture implemented today that's capable of solving these problems - that said, I'm not familiar with Ceph's architecture, so I don't know how it tackles sharding. I have my own ideas about how it could be done (for a while, I was working on something on top of the bcache/bcachefs codebase; I actually implemented working, fast distributed block storage, but not a filesystem, and it didn't have nearly everything it would have needed to scale out horizontally very far). But I never got far enough into the filesystem aspect to really prove anything.
For network (non-distributed) filesystems - again, those aren't as good as they ought to be either. But the problem there is much simpler - nobody, to my knowledge, has built anything that would let you really do cache coherency correctly: you can't do it by just using an existing local filesystem implementation and serving it over a new protocol, because you can't do cache invalidation correctly without storing fine-grained versioning information that local filesystems don't need, and nobody's wanted to try grafting that onto existing local filesystems. A lot of the story of NFS is hacks to get around that problem.
I've been using SeaweedFS (previously WeedFS) in production for a couple of years. I needed to share files on a LAN, including small VPSes, so distributed file systems (like GlusterFS) that require a kernel module did not work for me.
I store small files. Random access with WeedFS is very consistent and fast. If you use multiple volume servers, the ID is unique across all servers. You get redirected to the right volume if you try to access an ID on the wrong server, which does add latency.
My biggest complaint is about replication. You can define rules, like replicate on the same rack or a different rack, how many copies, etc. But if SeaweedFS fails to replicate a file, it does not store it at all. I would prefer best-effort replication rather than a guarantee. This is a big issue when master synchronization fails for some reason and only one volume is "visible" for some period of time.
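To make the workflow concrete, here is a rough Go sketch of how I talk to it (my own sketch, assuming the default localhost ports; as far as I read the docs, the "001" replication string means one extra copy on a different server in the same rack): assign a file id from the master, upload to the volume server it names, then look the volume up again the way a reader would.

    package main

    import (
    	"bytes"
    	"encoding/json"
    	"fmt"
    	"io"
    	"mime/multipart"
    	"net/http"
    	"strings"
    )

    type assignResult struct {
    	Fid string `json:"fid"`
    	Url string `json:"url"`
    }

    func main() {
    	// Ask the master for a file id; the replication setting rides along
    	// as a query parameter.
    	resp, err := http.Get("http://localhost:9333/dir/assign?replication=001")
    	if err != nil {
    		panic(err)
    	}
    	var a assignResult
    	if err := json.NewDecoder(resp.Body).Decode(&a); err != nil {
    		panic(err)
    	}
    	resp.Body.Close()

    	// Upload the bytes as a multipart form to the volume server we were given.
    	var buf bytes.Buffer
    	w := multipart.NewWriter(&buf)
    	part, _ := w.CreateFormFile("file", "hello.txt")
    	part.Write([]byte("hello seaweed"))
    	w.Close()
    	up, err := http.Post("http://"+a.Url+"/"+a.Fid, w.FormDataContentType(), &buf)
    	if err != nil {
    		panic(err)
    	}
    	up.Body.Close()

    	// Later, any client can ask the master which volume servers hold this
    	// volume; asking the "wrong" volume server just costs an extra redirect.
    	volumeId := strings.SplitN(a.Fid, ",", 2)[0]
    	loc, err := http.Get("http://localhost:9333/dir/lookup?volumeId=" + volumeId)
    	if err != nil {
    		panic(err)
    	}
    	body, _ := io.ReadAll(loc.Body)
    	loc.Body.Close()
    	fmt.Printf("stored as %s, locations: %s\n", a.Fid, body)
    }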
Can you tell us more about your production environment? Like how much data, average IOPS, and uptime requirements? I've been poking around and I can't find any real info on who's using this or how. The project's interesting and I'd love to hear more.
I use it to store many small files (JSON and images). My requirements were:
1. predictable read time
2. read/write access through LAN
3. server easily installable on VPS
4. simple API
The main downsides I found so far:
1. Master splits are very hard to recover from. They happen rarely, but when one does, be ready to restart the masters many times in random order. The latest version seems to have fixed this.
2. No easy way to list the content of a volume. You need to use the command line on the volume files
3. Cleanup is hard (compacting the volume files); I feel that the GarbageThreshold is not working well
4. Hard to resize the volumes (i.e. set lower limits) once the system has been started
5. Replication, as I mentioned in another comment
Overall it works fine for me. The potential alternative (not fully tested) was Cassandra (the 64MB limit per object that it had when I checked is fine with me).
>When testing read performance on SeaweedFS, it basically becomes performance test your hard drive's random read speed. Hard Drive usually get 100MB/s~200MB/s.
There are no modern hard drives that get 100-200MB/sec on RANDOM reads. Random ends up being an exercise in IOPS not throughput, and there's absolutely no way you're pulling enough IOPS out of a SAS or SATA hard drive to get anywhere NEAR 100-200MB/sec on random workloads. I get the "it's the software dummy" thing everyone is so hyped about, but if you know THAT little about the underlying hardware and how it functions, you shouldn't be writing filesystems.
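To put rough numbers on that (my own back-of-the-envelope, not the project's): a 7200 RPM SATA drive tops out somewhere around 100-200 random IOPS, so at 4 KB per read that's roughly 0.4-0.8 MB/s, and even 64 KB random reads only get you to about 6-13 MB/s. You only approach the drive's 100-200 MB/s rating when reads are large and mostly sequential.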
Also - what's the protection mechanism for files? If there is none, I literally have no idea what the purpose of this FS is. Under what circumstances would I need a giant slow data storage tier that has no redundancy to speak of?
Sorry if my wording is misleading. I am only saying that the SeaweedFS code does not try to do anything unnecessary, so you can extract most of the performance juice out of your hard drive or SSD.
My best guess is that these numbers are from short benchmarks they ran on SSDs before the drives reached steady state... I've pretty frequently been able to hit upwards of 200 MB/s when running short (read: misleading) benchmarks.
(Not an author or anything, just found this project today, and it's interesting enough for me to test it out.)
Edit to add: I've done some testing, SeaweedFS definitely replicates. When the client wants to read from a volume, it should do a lookup and Seaweed will return the URLs of all volume servers which are up at the moment. So far I've just been testing 2-way replication with one server down at a time; it works fine.
Agree, that sentence doesn't make sense. But I think the point is simply that accessing data adds hardly any overhead, so you're really benchmarking your disks. Given the design, that seems likely.
> Also - what's the protection mechanism for files? If there is none, I literally have no idea what the purpose of this FS is. Under what circumstances would I need a giant slow data storage tier that has no redundancy to speak of?
Caches, that's what makes me interested in this (having to cache lots of images). We're using MogileFS for this currently.
IPFS is a protocol and implementation of the distributed web. SeaweedFS is an open source version of Facebook's Haystack, a centralized filesystem (with redundancy across multiple servers). See https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...
TL;DR: Haystack implements a sort of append-only journal design where smaller files are concatenated into a single large "journal" file. After adding a file, you get a "filename" back that encodes the information needed to locate the data you are interested in: volumename_seekposition.jpg, for example.
Essentially, you eliminate the need for FS metadata this way, which leads to fewer disk hits for read operations. Open the volume, seek to the position, read the contents (including metadata) until the end of the file, and you're done.
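To make that read path concrete, here is a toy Go sketch. The needle layout (an 8-byte length prefix, then the blob) is invented for illustration; the real Haystack/SeaweedFS record has more fields (key, cookie, checksum, padding), but the shape of a read is the same: open the volume file, seek, read, done.

    package main

    import (
    	"encoding/binary"
    	"fmt"
    	"io"
    	"os"
    )

    // readNeedle returns the blob stored at a given byte offset inside a volume
    // file. There is no directory tree and no per-file inode to consult: the
    // "filename" handed back at write time already encodes volume + offset.
    func readNeedle(volumePath string, offset int64) ([]byte, error) {
    	f, err := os.Open(volumePath)
    	if err != nil {
    		return nil, err
    	}
    	defer f.Close()

    	if _, err := f.Seek(offset, io.SeekStart); err != nil {
    		return nil, err
    	}
    	// Pretend layout: an 8-byte big-endian length, then the blob itself.
    	var size uint64
    	if err := binary.Read(f, binary.BigEndian, &size); err != nil {
    		return nil, err
    	}
    	data := make([]byte, size)
    	if _, err := io.ReadFull(f, data); err != nil {
    		return nil, err
    	}
    	return data, nil
    }

    func main() {
    	// e.g. a name like "3_1048576.jpg" would mean: volume 3, offset 1048576.
    	blob, err := readNeedle("/data/volume_3.dat", 1048576)
    	if err != nil {
    		fmt.Println("read failed:", err)
    		return
    	}
    	fmt.Printf("read %d bytes\n", len(blob))
    }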
There is then a single-master database that controls the location of the volumes (with a redundant failover), which is possible because managing volumes is less difficult than managing the position of every single file in all the volumes in the system. I think they just use MySQL, amazingly.
Pretty neat idea. It's designed very specifically for photo storage at Facebook.
The main drawback of this design is that it doesn't support filesystem paths well - to do that you need a K/V database to map the name you want to the filename needed by Haystack/SeaweedFS. SeaweedFS can do this with a separate "filer" database backed by Redis or Cassandra, but the Redis and Cassandra backends are currently implemented as "flat namespace" stores, so filers using them cannot perform directory listings at this time. Huge drawback if you need to 'ls' subdirectories of files, for example.
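A toy sketch (mine, not the actual filer code) of why a flat path-to-fid mapping makes exact lookups trivial but 'ls' awkward:

    package main

    import (
    	"fmt"
    	"strings"
    )

    func main() {
    	// Flat namespace: full path -> SeaweedFS file id (example ids made up).
    	filer := map[string]string{
    		"/photos/cat.jpg":  "3,01637037d6",
    		"/photos/dog.jpg":  "3,02a1b2c3d4",
    		"/docs/readme.txt": "5,07deadbeef",
    	}

    	// Exact lookup: a single key/value GET, which Redis/Cassandra are great at.
    	fmt.Println(filer["/photos/cat.jpg"])

    	// "ls /photos/": with only GET/SET by key you are reduced to scanning
    	// every key for a prefix, which a remote K/V store won't let you do
    	// cheaply. You need an ordered store or explicit per-directory entries.
    	for path, fid := range filer {
    		if strings.HasPrefix(path, "/photos/") {
    			fmt.Println(path, fid)
    		}
    	}
    }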
That was a dealbreaker for me for neocities.org implementation, also maintenance was still an open question with SeaweedFS (when things fail, what do I do to fix them?) so I ended up going with CephFS.
That's great news! I'm going to take a look at it tonight.
Postgres, of course, does need some special work to fail over gracefully to a standby server in production. It would be nice to see something that's designed for it, like Redis, work here, but of course that's a limitation of how Redis is designed.
Regardless, I don't think I mentioned this properly, but SeaweedFS is amazing and it's very impressive, and using it at some point is not yet off the table for us. :)
"Instead of managing chunks, SeaweedFS manages data
volumes in the master server. Each data volume is size
32GB, and can hold a lot of files. And each storage node
can have many data volumes. So the master node only needs
to store the metadata about the volumes, which is fairly
small amount of data and is generally stable."
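To put the "fairly small" claim in perspective (my own back-of-the-envelope numbers, not from the README): 1 PB of data at 32GB per volume is roughly 32,000 volumes, and even at ~100 bytes of metadata per volume that is only a few megabytes of master state, versus the billions of per-file entries a master tracking individual files or chunks would have to hold.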
You can (and should) have multiple masters, but each volume has a primary master. I did see other masters lose the volume state information after the primary master had gone down and been taken out of the cluster.
I am not sure that SeaweedFS and IPFS fit the same use case. IPFS seems like it wants to be a bedrock layer of a newer style of internet, while SeaweedFS is more of a blob store. So comparing them might be possible, but it also might not be the right idea.
It's not a filesystem either...I don't get why they keep using that terminology. It doesn't have directories, permissions, seek(), copy without the client pulling all the data and pushing it back, etc.
It's a KV data store. Which is fine...nothing wrong with that. It just isn't a filesystem in any sense I'm familiar with.
Right. The word filesystem has a de facto meaning for many, though. Not POSIX either... FAT, NTFS, etc. are filesystems. There's implied functionality that goes with the word.
This isn't a filesystem any more than etcd would be.
Seems to gloss over the details. Who is grooming the replication in case a volume replica disappears? What happens when a client is writing a replicated file and crashes in the middle?
If one volume replica disappears, the whole volume set becomes read-only. There is no magic to auto-correct things, which I personally think is where problems start.
I've just looked at it. The concept seems to be the same. SeaweedFS is much simpler: you have only two services, master and volume, both speaking HTTP only. MogileFS seems to offer more control and more tools (rebalancing, for example).