
I think one of the most desirable and under-appreciated goals of schema languages and serialization formats is safety. These tools are typically used in places that deal with untrusted inputs, and features and design choices can go a long way in either exposing or shielding developers from potential safety bugs.

My read of Cap’n Proto didn’t make it sound like safety was the highest priority. At least not above performance.



With all due respect, you read completely wrong.

* The very first use case for which Cap'n Proto was designed was to be the protocol that Sandstorm.io used to talk between sandbox and supervisor -- an explicitly adversarial security scenario.

* The documentation explicitly calls out how implementations should manage resource exhaustion problems like deep recursion (stack overflow risk), whereas many serialization formats leave these things as the app's problem.

* The implementation has been fuzz-tested multiple ways, including as part of Google's oss-fuzz.

* When there are security bugs, I issue advisories like this:

https://github.com/capnproto/capnproto/tree/v2/security-advi...

* The primary aim of the entire project is to be a Capability-Based Security RPC protocol. That's what "Cap" in the name comes from. The zero-copy serialization is actually a bonus feature.

(I'm the author of Cap'n Proto.)
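To make the recursion-depth point concrete, here is a hypothetical sketch (plain Python, not Cap'n Proto's actual API, and the 64-level limit is an arbitrary illustrative number) of what "manage deep recursion" means in practice:

```python
# Hypothetical sketch: cap traversal depth while decoding a nested
# structure, so hostile input can't overflow the stack.

MAX_DEPTH = 64  # illustrative limit, not Cap'n Proto's actual value

def decode_nested(value, depth=0):
    """Recursively decode a nested list structure, erroring out early
    instead of letting attacker-controlled nesting exhaust the stack."""
    if depth > MAX_DEPTH:
        raise ValueError("nesting limit exceeded: possible hostile input")
    if isinstance(value, list):
        return [decode_nested(v, depth + 1) for v in value]
    return value

# A message nested 1000 levels deep is rejected cleanly, not via a crash:
deep = []
for _ in range(1000):
    deep = [deep]
try:
    decode_nested(deep)
except ValueError as e:
    print("rejected:", e)
```

The point is that the bound is enforced by the deserializer itself, before application code ever sees the data.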


I agree entirely, and this is one of my single greatest frustrations with the majority of the current popular IDLs/schema languages.

ASN.1 is hilariously bad in a lot of ways, but one thing it gets absolutely right is strong typing and being able to express constraints (ranges, values dependent on other values). That combined with a canonicalized encoding form (DER) goes a long way in making various error states unrepresentable.
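For illustration, ASN.1 lets a schema declare a subtype constraint like `Port ::= INTEGER (0..65535)`. Here is a toy Python sketch (not any real ASN.1 library) of what enforcing that at decode time looks like:

```python
# Illustrative sketch: an ASN.1-style subtype constraint such as
#   Port ::= INTEGER (0..65535)
# enforced during decoding, so out-of-range values never reach the app.

def decode_constrained_int(raw: bytes, lo: int, hi: int) -> int:
    value = int.from_bytes(raw, "big", signed=False)
    if not lo <= value <= hi:
        raise ValueError(f"value {value} outside declared range {lo}..{hi}")
    return value

print(decode_constrained_int(b"\x01\xbb", 0, 65535))  # 443, a valid port
try:
    decode_constrained_int(b"\x01\x00\x00", 0, 65535)  # 65536: rejected
except ValueError as e:
    print("rejected:", e)
```

The out-of-range state is unrepresentable in the decoded value, which is exactly the property the parent comment is praising.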


Except that ASN.1 is egregiously hard to check for wonky values, because the parsing itself is so complex.

Exactly how many vulnerabilities have been exploited in LDAP, SNMP, etc. because ASN.1 is so terrible?


ASN.1 isn’t an encoding; DER is.

The problem with LDAP, etc. is that they all permit BER, which is a looser superset of DER. It includes (among other things) the ability to represent indefinite-length fields, which are the single biggest source of exploitable bugs in a typical application of ASN.1. Without that, the exploitable surface of DER is much smaller (and especially when implemented in a memory-safe language).
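To show the difference concretely, here is a minimal sketch (my own illustration, not a production parser) of DER's definite-length rule. The length octet 0x80 is BER's "indefinite length" marker; a DER-only parser rejects it outright, closing off the bug class described above:

```python
# Minimal sketch of DER length decoding. DER forbids both the
# indefinite-length form (0x80) and non-minimal long-form encodings
# that BER permits.

def read_der_length(buf: bytes, i: int):
    """Return (length, next_index) for the TLV at buf[i:], DER rules only."""
    first = buf[i]
    if first == 0x80:
        raise ValueError("indefinite length: legal in BER, forbidden in DER")
    if first < 0x80:                      # short form: length in one octet
        return first, i + 1
    n = first & 0x7F                      # long form: next n octets hold length
    length = int.from_bytes(buf[i + 1:i + 1 + n], "big")
    if length < 0x80 or buf[i + 1] == 0:  # DER demands the minimal encoding
        raise ValueError("non-minimal length encoding: forbidden in DER")
    return length, i + 1 + n

print(read_der_length(bytes([0x05]), 0))              # short form
print(read_der_length(bytes([0x82, 0x01, 0x00]), 0))  # long form, 256 bytes
try:
    read_der_length(bytes([0x80]), 0)                 # indefinite: rejected
except ValueError as e:
    print("rejected:", e)
```

With indefinite lengths gone, every element's size is known up front, which is what shrinks the exploitable surface.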


I've written an ASN.1 parser. The problem isn't the specification (though it is definitely a kitchen sink spec). The problem is the majority of ASN.1 code was written before the year 2000.

ASN.1 started in 1984. That means there are decades of shitty implementations, written well before adversarial input was considered a factor.


Is there a reasonable subset of ASN.1 that could get traction nowadays if specified separately?


There’s a wide set of best practices (use only DER for encoding, avoid legacy string types, etc.) that are widely applied in cryptographic applications, although I don’t know if anybody has written them down explicitly.

More generally: this wasn’t intended to be an endorsement of ASN.1 per se! It was only to say that it got some things right, things that Cap’n Proto and Protobuf appear to have eschewed. I’m not sure it is the right IDL for modern purposes, but I think it’s a useful piece of reference material.


I had to use it at work in a C++ environment and ended up settling on patching a copy of https://github.com/vlm/asn1c

Can't say I'd feel confident putting any of this stuff in a public service. Too complex and prone to bugs.


Check out DFDL/Apache Daffodil. [0] A large portion of the development team is working on it specifically for use in a cybersecurity context. (Disclaimer: I was one such contributor, although I'm presently not working on Daffodil.)

Having said that, DFDL fails pretty miserably by the standards set in the article. The main design goal was to be able to describe as many existing data formats as possible, which means the spec is massive and supports a lot of bad ideas.

Despite having its 1.0 release in 2015, and being the most complete implementation, Apache Daffodil still does not fully implement the DFDL spec. And it is not an easy code base to jump into and understand.

[0] https://daffodil.apache.org/


> and serialization formats is safety. These tools are typically used in places that deal with untrusted inputs, and features and design choices can go a long way in either exposing or shielding developers from potential safety bugs

My potentially incorrect understanding is that Cap'n Proto's zero copy nature means the serialization format IS the in-memory representation, which means that if you build a Cap'n Proto object on top of non-zeroed memory you can leak data in the padding when transmitting. [Presumably not an issue if the packed encoding is used rather than the zero-copy one]


It's only zero copy to parse/read. The builders allocate all over the place.


A MessageBuilder allocates a single large buffer, writes into it, and only allocates further if that buffer is exhausted. If you use a preallocated buffer you can avoid allocation entirely. Very different from Protobuf which allocates strings, arrays, and sub-messages all as separate heap objects.
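For intuition, here is a rough sketch (plain Python, not the real MessageBuilder API, and `ArenaBuilder` is a name I made up) of the single-buffer, bump-allocation strategy described above:

```python
# Rough sketch of arena-style building: one zero-initialized buffer,
# a cursor, and no per-field heap objects. Because the arena starts
# zeroed, padding bytes can never leak stale memory contents.

class ArenaBuilder:
    def __init__(self, size: int = 1024):
        self.buf = bytearray(size)  # bytearray() is zero-filled
        self.pos = 0

    def write(self, data: bytes) -> int:
        """Append data at the cursor; return its offset in the arena."""
        while self.pos + len(data) > len(self.buf):
            self.buf.extend(bytearray(len(self.buf)))  # grow only if exhausted
        off = self.pos
        self.buf[off:off + len(data)] = data
        self.pos += len(data)
        return off

b = ArenaBuilder()
b.write((443).to_bytes(4, "little"))
b.write(b"hello")
print(bytes(b.buf[:b.pos]))
```

Contrast this with one heap allocation per string, array, and sub-message: the arena makes both allocation count and the "non-zeroed padding" concern go away at once.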


hmm, i don’t understand how schema-languages can do anything about that though. after all, you just serialize/de-serialize based on provided inputs, GIGO if you will.

safety w.r.t bad/malicious inputs should be a ‘higher level’ concern afaik.


Safety means: garbage in, error out.


right, and it should not be left to the serialization layer for that.


Security is a concern for every layer. It's not magic pixie dust that can be sprinkled on top of software to render it secure!

A while ago I read a great article about how the Adobe PDF serialization format is nearly impossible to secure because it allows inherently unsafe constructs.

For example, it allows cross-references that are basically just arbitrary unaligned pointers. It uses many different alignment and padding algorithms. It has length-prefixed and not-length prefixed sections. Etc, etc...

Apparently it was a serious research exercise to make a safe PDF parser, and they only covered a fraction of the full spec!

To put things in perspective: Originally, PDF allowed arbitrary code execution as a core feature, allowing the output of shell commands to be used as document content.

Most people like the Chromium and Firefox teams have just given up and now parse PDF using a sandboxed JavaScript VM because it's too hard to do it safely with C++. They parse HTML and JavaScript with C++, but not PDF. Think about that.

A similar issue caused Log4Shell in Log4j, where a "format string parser" was too flexible and allowed network requests to be triggered by user-controlled data.

Even trivial, "surely it must be safe" formats like XML and JSON are riddled with security issues, such as different layers in a microservice architecture having different handling semantics for duplicate keys, null values, etc... This can result in exploits such as authentication and authorization tokens being interpreted by a system one way, but a different way by a different system. For real-world attacks along these lines, search for "request smuggling".
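The duplicate-key hazard is easy to demonstrate. Python's stdlib `json` silently keeps the *last* duplicate key; another parser in the same pipeline may keep the first, and that disagreement is exactly the smuggling vector described above:

```python
import json

# Two stacks that disagree on duplicate-key handling will see two
# different messages here -- one an admin token, one a guest token.
payload = '{"role": "admin", "role": "guest"}'
print(json.loads(payload))  # stdlib json keeps the last occurrence
```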

Serialization and parsing are security minefields and it is dangerously naive to just hand-wave that away.

See: https://seriot.ch/projects/parsing_json.html


> Serialization and parsing are security minefields and it is dangerously naive to just hand-wave that away.

well, i am not hand-waving them away, i am not sure what can the serialization framework possibly _do_ to make things secure during the serialization ?

when execution of user-supplied code is allowed (in the examples that you have outlined above), surely, the layer _executing_ the code cannot really do anything about it ! perhaps you actually did intend to `rm -rf /` ?

policy checking, enforcement etc. has to happen at a higher / different layer. i am not sure why mechanism and policy are being conflated here.

in the same way, you gave the serialization layer a 10mb or whatever sized input to serialize, sure...you get a valid serialized output etc. maybe there is a genuine usecase for that in some context or another f.e. when serializing say image files, or something else etc. etc.

[edit] : minor comment.


> I am not sure what can the serialization framework possibly _do_ to make things secure during the serialization

Loads of things!

A strict specification that can only be interpreted one way goes very far. E.g.: a machine-readable BNF grammar file or something similar with no ambiguities.

A conformance test suite covering corner-cases is surprisingly effective, even with a supposedly perfect spec.

"Be strict with what you generate and lax with what you accept" has been demonstrated over and over again to be a disaster over the long-term in an ecosystem of many groups. Be strict always with what is accepted, not just generated!

Speaking of being strict: schema validation is essential. Strong typing for scalars helps a lot.
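As a toy illustration of strict acceptance (my own sketch; `parse_strict` and the two-field schema are invented for the example), a parser can reject duplicate keys and wrong scalar types instead of coercing them:

```python
import json

SCHEMA = {"user": str, "port": int}  # hypothetical message schema

def strict_pairs(pairs):
    """Reject duplicate keys instead of silently picking a winner."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicate key rejected: {key!r}")
        obj[key] = value
    return obj

def parse_strict(raw: str) -> dict:
    obj = json.loads(raw, object_pairs_hook=strict_pairs)
    # Strong typing for scalars: a string "443" is not a port number.
    for field, expected in SCHEMA.items():
        if not isinstance(obj.get(field), expected):
            raise ValueError(f"field {field!r} must be {expected.__name__}")
    return obj

print(parse_strict('{"user": "alice", "port": 443}'))
try:
    parse_strict('{"user": "alice", "user": "bob", "port": 443}')
except ValueError as e:
    print("rejected:", e)
```

A schema language that generates this kind of checking for you is doing real security work at the serialization layer.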

The actual implementations of the spec can obviously have a wide range of security features. Never allowing arbitrary type instantiation is critical, yet is a mistake that keeps reoccurring much like SQL injection.

Etc, etc...


> I am not sure what can the serialization framework possibly _do_ to make things secure during the serialization

>> Loads of things!

>> A strict specification that can only be interpreted one way goes very far. E.g.: a machine-readable BNF grammar file or something similar with no ambiguities.

once again, that is not the domain of the serialization framework ! it is a policy which needs to be established and enforced at input / output layer by the entity which implements it.

a serialization framework should just serialize and deserialize objects to / from an i/o 'channel' f.e. file, network, etc. shackling it with specification / enforcement of security etc. policies seems conflating one concern with another.


what's the best modern alternative that is designed in this way?


gRPC ticks most of the checkboxes.

Unfortunately these are lessons that have to be learned over and over. Anything based on JSON is generally suspect. If you see the terms "quick" or "simple" in some marketing splash-page, assume the author has not thought about the hard problems like security and long-term interoperability.

Similarly, if you find yourself hand-rolling RPC client code and calling methods on something like "HttpClient" manually, you've done it wrong. That code should have been spat out by a code-generator from a schema.


> gRPC ticks most of the checkboxes.

huh :) ! gRPC is an 'r-p-c' framework, and uses protobuf for serialization. you should be comparing protobuf to cap'nproto.


it depends on what type of safety.

The schema language might for example allow you to specify that an input string/blob should be smaller than 10MB and refuse to deserialize it if it is longer, same for array/list/vector length.
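A sketch of that idea (invented names and a made-up 4-byte length-prefix wire format, purely for illustration): the schema-declared limit is checked against the length prefix *before* any allocation, so a hostile 2 GB claim fails fast and cheap:

```python
# Schema-declared limit, enforced before allocating the payload.
MAX_BLOB = 10 * 1024 * 1024  # 10 MB, per the example above

def read_blob(stream: bytes) -> bytes:
    declared = int.from_bytes(stream[:4], "big")  # length prefix
    if declared > MAX_BLOB:
        raise ValueError(f"declared size {declared} exceeds schema limit")
    body = stream[4:4 + declared]
    if len(body) != declared:
        raise ValueError("truncated message")
    return body

print(read_blob((5).to_bytes(4, "big") + b"hello"))
try:
    read_blob((2**31).to_bytes(4, "big"))  # huge claimed size, no data sent
except ValueError as e:
    print("rejected:", e)
```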


It feels like a check against an input size of 10MB is something you would do well before deserialization, no?


The limit might apply to some specific part of the message, rather than the whole. You can't check this without actually deserialising, or at least doing most of the same work.


not if it is a message you receive from a third party.

A concrete example might be a batching third party client: the app sends N messages in a single batch and each message has its own size limit.


You would, but others might not. Defense in depth.


> ... allow you to specify that an input string/blob should be smaller than 10MB and refuse to deserialize it if it is longer ...

why ? are there no cases where serializing even larger file is valid ?


sure, a lot of cases, I suspect that S3 upload limits are different from imgur.


and feel free to do that in _your_ application. don’t shackle others with the limitations of your domain.

mechanism vs policy and all that.


I believe I have already justified why it might be useful at the protocol/schema level in ways that cannot be replicated at the application level: to eagerly fail on expensive (eg memory) deserialization.


Disregard for safety and security in serialization is one of the most common, if not the most common, cause for security vulnerabilities.



