When I worked on SocialGrapple (which had a similar feature set to Fruji), here's what I did:
* Technically: I had a well-optimized PostgreSQL database with a few parts: a follower graph schema with a revision id (which node is following which), cleaned out every N revisions; a delta schema which took the last two revision ids and diff'd them; an aggregate schema which ran a bunch of queries and summarized the results every T interval; and a metadata schema which stored cached information about each node (updated every time that user object was fetched).
I'm pretty obsessive about query and schema optimization, and I had a comprehensive benchmarking suite which helped me steadily improve performance for bulk insertion, aggregate queries, and user-facing queries. Each job was broken down into small, efficient pieces that were executed in dependency order by my custom task scheduler, Turnip (open source at https://github.com/shazow/turnip).
I don't remember the exact numbers, but I was approaching 100M rows on a single 512MB Linode.
Redis would have worked too but I would have needed much more RAM or more moving pieces to move things in and out of RAM for processing. None of my queries were slow enough to worry about this.
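To make the revision-and-delta idea above concrete, here is a minimal sketch. It is not the actual SocialGrapple schema: the table names, column names, and the `record_delta` helper are all made up for illustration, and it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so it runs without a server. Each fetch of a node's followers is stored under a revision id, and the delta is the set difference between two revisions in each direction.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not SocialGrapple's.
SCHEMA = """
CREATE TABLE follower_graph (
    revision_id INTEGER NOT NULL,
    node_id     INTEGER NOT NULL,  -- the account being followed
    follower_id INTEGER NOT NULL,  -- the account following it
    PRIMARY KEY (revision_id, node_id, follower_id)
);
CREATE TABLE follower_delta (
    revision_id INTEGER NOT NULL,  -- the newer of the two revisions compared
    node_id     INTEGER NOT NULL,
    follower_id INTEGER NOT NULL,
    change      TEXT NOT NULL CHECK (change IN ('gained', 'lost'))
);
"""

# Followers present in one revision but absent from the other.
DELTA_SQL = """
INSERT INTO follower_delta (revision_id, node_id, follower_id, change)
SELECT :new_rev, :node_id, follower_id, :change FROM (
    SELECT follower_id FROM follower_graph
    WHERE node_id = :node_id AND revision_id = :present_rev
    EXCEPT
    SELECT follower_id FROM follower_graph
    WHERE node_id = :node_id AND revision_id = :absent_rev
) AS d
"""


def record_delta(conn, node_id, old_rev, new_rev):
    """Store followers gained and lost between two revisions of one node."""
    for change, present, absent in (
        ("gained", new_rev, old_rev),  # in the new revision, not the old one
        ("lost", old_rev, new_rev),    # in the old revision, not the new one
    ):
        conn.execute(DELTA_SQL, {
            "new_rev": new_rev,
            "node_id": node_id,
            "change": change,
            "present_rev": present,
            "absent_rev": absent,
        })


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Revision 1: followers 1, 2, 3. Revision 2: followers 2, 3, 4.
    conn.executemany(
        "INSERT INTO follower_graph VALUES (?, ?, ?)",
        [(1, 100, 1), (1, 100, 2), (1, 100, 3),
         (2, 100, 2), (2, 100, 3), (2, 100, 4)],
    )
    record_delta(conn, node_id=100, old_rev=1, new_rev=2)
    return conn.execute(
        "SELECT follower_id, change FROM follower_delta ORDER BY follower_id"
    ).fetchall()
```

Calling `demo()` records follower 1 as lost and follower 4 as gained. Once a delta is stored, the older revision's rows are no longer needed for that comparison, which is presumably what makes periodic cleanup of the graph table cheap.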
* Pricing: As others mentioned, higher prices make it easier to scale. I charged based on the size of the account and how many accounts you wanted to monitor (basically proxies for how many API calls you'd cost me). A small account cost something like $6/mo, 5 medium accounts $14/mo, 25 bigger accounts $50/mo, and 100 large accounts (1M+ followers) $125/mo. I had modest revenue, but I can't say my pricing scheme was perfect. I was actively messing with it towards the end.
* I had a legacy Twitter whitelisted account which gave me 10K API hits per hour. This helped me a lot. At the same time, I was careful not to become too dependent on that account in case I lost it. I was well within the boundary of normal user limits the entire time and only really used my whitelisted account to experiment or backfill new data. I made sure to always make the most efficient API calls to avoid wasting them. I had issues with timeouts too, but they came in waves when Twitter was having infrastructure issues rather than consistently. It wouldn't surprise me if this has gotten worse.
Also, I used and stitched together all three Twitter APIs: REST, Search, and Streaming. It was painful.
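One way to stay inside an hourly API allowance like the one mentioned above is to track calls in a sliding one-hour window and refuse work once the window is full. This is only a sketch: the `HourlyBudget` class and its parameters are hypothetical, and Twitter's own accounting of the limit may have worked differently.

```python
import time
from collections import deque


class HourlyBudget:
    """Sliding-window counter for API calls (hypothetical helper)."""

    def __init__(self, limit, window=3600.0, clock=time.monotonic):
        self.limit = limit      # max calls allowed inside one window
        self.window = window    # window length in seconds
        self.clock = clock      # injectable so the logic can be tested
        self.calls = deque()    # timestamps of calls still inside the window

    def try_acquire(self):
        """Record one call and return True if the budget allows it, else False."""
        now = self.clock()
        # Forget calls that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.limit:
            return False
        self.calls.append(now)
        return True
```

Usage would look like `budget = HourlyBudget(limit=10_000)`, then checking `budget.try_acquire()` before each request and deferring the job when it returns `False`.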
* Diversify. Twitter is becoming an increasingly developer-unfriendly platform to build on, and your business should not be dependent on it. I added Facebook support to SocialGrapple, and I was going to add Google+ support too. Today, I'd also add app.net support. That said, the majority of my business was still Twitter, and that sucked. This was a big factor in my decision to sell out and shut it down: I didn't see the developer ecosystem as a place where you can have a sustainable business, let alone a thriving one.
I actually had several conversations/negotiations with Twitter about how they'd interpret their terms of service with respect to my product. It helped to know people at the company to get a favourable ruling, but I still felt like it could be reversed (err, "provided with guidance") at any moment.
For what it's worth, I found it more rewarding to build an analytics product that was super useful for a smaller group of people than a little helpful for a lot of people (I'd say tweepsect.com is the latter). Think about where on the spectrum you want to be as this makes decisions, like pricing, easier.
Best of luck! Shoot me an email (in my profile) if you'd like more details.
andrey, just wanted to say thanks for this comprehensive answer. it seems like you speak from a lot of experience and while it's scary to read through what you have had to do to survive, it all makes a lot of sense and helps me in picking my battles a bit.
again, thanks for this. i'll go through it again later; right now i'm just responding to over 60 other e-mails with help and support, and it's fascinating to see this. if nothing else, we can hopefully make it clear that betting on someone else's platform provides tremendous opportunity but also introduces considerable uncertainty if it takes off.