What volume of data are we talking about? The estimate for 2030 is between 1 and 2 TB of data per car per day! Today’s connected cars have over a hundred sensors and are estimated to produce 25 GB per hour. A mind-boggling figure. At least four questions arise from these numbers:
- how can a car generate that amount of data?
- are all those data meaningful (i.e. do they have a value)?
- how can the network possibly manage that deluge?
- who owns those data?
Amount of data
A 2020 model of connected car is estimated to generate some 25 GB of data per hour of use. That is quite a lot, but it is nothing compared with what is expected by 2030: over 1 TB per day! Where are all these data coming from? Depending on the model there may be between 60 and 100 sensors in a car today, and this figure is expected to reach 200 sensors by the end of this decade. These sensors differ widely in the amount of data they produce: an oil temperature sensor may generate a few bytes per minute, while at the other extreme a LIDAR sensor generates gigabytes every second. Overall a car may produce several TB of data in a single day (one estimate is up to 4 TB per hour of use if one takes into consideration the raw set of data generated).
With the progress in image recognition (and, so far, the much lower cost of image sensors with respect to LIDAR sensors) some self-driving cars will rely on this technology to detect obstacles and be aware of their surroundings. A full HD stream of data from the video cameras on a self-driving car (a Tesla is equipped with 9 of them) can generate a flow of 10 GB per hour.
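To get a feel for these volumes, it can help to turn them into the sustained bit rate a link would need to carry them. The following is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 10^9 bytes) and an evenly spread flow, and the helper name `sustained_mbps` is made up for this illustration.

```python
# Back-of-the-envelope conversion of the volumes quoted above into bit rates.
# Assumption: decimal units (1 GB = 10**9 bytes) and a perfectly even flow.
GB = 10**9
TB = 10**12
HOUR = 3600          # seconds
DAY = 24 * HOUR      # seconds


def sustained_mbps(num_bytes: float, seconds: float) -> float:
    """Average rate, in megabits per second, needed to move num_bytes in seconds."""
    return num_bytes * 8 / seconds / 1e6


today_mbps = sustained_mbps(25 * GB, HOUR)    # 2020 car: 25 GB per hour of use
video_mbps = sustained_mbps(10 * GB, HOUR)    # full HD camera streams: 10 GB per hour
future_mbps = sustained_mbps(1 * TB, DAY)     # 2030 estimate: 1 TB per day, around the clock
raw_mbps = sustained_mbps(4 * TB, HOUR)       # raw sensor output: up to 4 TB per hour

# Hours of driving at today's rate needed to accumulate 1 TB.
hours_to_1tb = (1 * TB) / (25 * GB)

for name, value in [
    ("2020 car", today_mbps),
    ("video only", video_mbps),
    ("2030, averaged over a day", future_mbps),
    ("raw data", raw_mbps),
]:
    print(f"{name}: {value:,.1f} Mbps")
print(f"Hours of use at 25 GB/h to reach 1 TB: {hours_to_1tb:.0f}")
```

Under these assumptions, 25 GB per hour is roughly 56 Mbps of continuous traffic, while the raw 4 TB per hour figure would need close to 9 Gbps, which is why most of the raw data never leave the car.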
Are all those data meaningful?
Well, it really depends on what you mean by meaningful. In a way, yes: if these data are generated it is because they serve a purpose. However, most of those data can be processed locally, and the result of the processing is a meaning that can be coded in a much smaller amount of data. As an example: through the use of LIDAR or video cameras the image of a stray cat is captured, and that may involve hundreds of MB. However, those data are crunched by image recognition software (and GPUs), resulting in the detection of the cat. This information can be represented by very few data: a few bytes to indicate you are talking about a cat, plus a few more indicating where the cat is with respect to the car and its direction of movement (or a trace of its movement in the last few seconds to predict its likely next steps). You see, with processing and AI a huge amount of data can be reduced to a very small amount.
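The cat example can be made concrete with a small sketch of what such a compact message might look like. Everything here is an assumption made for illustration: the `Detection` record, its fields, the label table and the 17-byte layout are hypothetical and are not taken from any real perception stack.

```python
import struct
from dataclasses import dataclass

# Hypothetical label table; a real perception stack would define its own.
CLASS_IDS = {"cat": 1, "pedestrian": 2, "vehicle": 3}
CLASS_NAMES = {class_id: name for name, class_id in CLASS_IDS.items()}

# Little-endian, no padding: 1-byte class id followed by four 32-bit floats.
_FORMAT = "<Bffff"


@dataclass
class Detection:
    """What the car 'understood': which object, where, and how it is moving."""
    label: str
    x_m: float          # position ahead of the car, in metres
    y_m: float          # lateral offset from the car, in metres
    heading_deg: float  # direction of movement, in degrees
    speed_mps: float    # speed, in metres per second

    def encode(self) -> bytes:
        """Pack the detection into a fixed-size binary message."""
        return struct.pack(
            _FORMAT,
            CLASS_IDS[self.label],
            self.x_m,
            self.y_m,
            self.heading_deg,
            self.speed_mps,
        )

    @classmethod
    def decode(cls, payload: bytes) -> "Detection":
        """Rebuild a detection from its binary message."""
        class_id, x_m, y_m, heading_deg, speed_mps = struct.unpack(_FORMAT, payload)
        return cls(CLASS_NAMES[class_id], x_m, y_m, heading_deg, speed_mps)


# One uncompressed full HD RGB frame, for comparison (1920 x 1080 pixels, 3 bytes each).
RAW_FRAME_BYTES = 1920 * 1080 * 3

cat = Detection("cat", x_m=12.5, y_m=-1.5, heading_deg=90.0, speed_mps=0.5)
payload = cat.encode()
reduction = RAW_FRAME_BYTES / len(payload)

print(f"Message size: {len(payload)} bytes")
print(f"One raw frame is about {reduction:,.0f} times larger")
```

The point of the sketch is only the order of magnitude: a single raw frame takes megabytes, while the meaning extracted from it fits in a couple of dozen bytes.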
Clearly, the value is not in the data themselves; it is in the meaning of those data and in the relation of that meaning with what matters to a specific entity (person, institution, application). Our grandparents used to say that “beauty is in the eye of the beholder”. That is still valid today. The value is not in the data per se but in the answer to the “so what?” resulting from the capturing of the data. This is where artificial intelligence dominates.
Notice that this is exactly what happens to all the data that continuously stream towards our brain: we get a few (equivalent) GB of data harvested by our senses (here again vision has the lion’s share, with an estimated bandwidth of 10 Mbps connecting each eye to the brain), but all these data are squeezed into meaningful chunks of information, like “That’s a cat”: a meaning (most often coded in words; can you think without placing your thoughts in words?) that can be stored in the equivalent of a few bytes. This compression is the magic of intelligence.
A good deal of “semantic” compression can be done at the sensor level or in its proximity. The problem is that compression usually discards some potentially useful bits, and you may not know in advance whether they would be useful, or to whom. This is why a balance should be achieved between compression and preservation of raw data. Data that would be completely useless to the car, like the presence of a shop (shops don’t move around, so they don’t need to be taken into account by the driving application), can be useful to a third party that is interested in what shops are out there, what merchandise they display on their shelves…
The definition of what should be preserved and shared is called a “data space” and this is the kind of work being done, as an example, by the Gaia-X working group on the automotive data space.
How can the network manage all these data?
It can’t. There are going to be thousands of cars packed in a small urban area competing for network access. Current networks would be overwhelmed, and it will be hard to deliver the quality of service that some of the applications involved may require. However, as cars become more “chatty”, the networks will evolve, particularly in their edge architecture. Edge cloud and edge computing will both shorten the distance between the car and the data crunching, thus keeping the latency low. The “intelligence” will likely be shared among the cars and the edge cloud, and there will likely be a continuum of the cloud spanning from the edge to the core (or service centres). By 2030 5G will be a pervasive reality and we will likely be bombarded by the marvels of 6G. The denser network sustained by the 6G architecture, with cars playing the role of network nodes, will provide all the network capacity and cloud processing capacity (federation of edges and devices) that will be needed.
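A rough sketch of the aggregate load shows why local and edge processing matter. The numbers below are assumptions chosen purely for illustration (5,000 cars in one urban area, each producing 25 GB per hour, and only 1% of that leaving the car after local processing), and the function `aggregate_uplink_gbps` is a hypothetical helper, not a model of any real network.

```python
# Illustrative estimate of the uplink demand from many cars in one area.
# Assumption: decimal units (1 GB = 10**9 bytes) and a perfectly even flow.
GB = 10**9


def aggregate_uplink_gbps(
    cars: int,
    gb_per_car_hour: float,
    uploaded_fraction: float = 1.0,
) -> float:
    """Total uplink rate in gigabits per second.

    uploaded_fraction is the share of each car's data that actually leaves
    the car after local processing (1.0 means everything is uploaded).
    """
    if not 0.0 <= uploaded_fraction <= 1.0:
        raise ValueError("uploaded_fraction must be between 0 and 1")
    bits_per_second = cars * gb_per_car_hour * GB * 8 / 3600
    return bits_per_second * uploaded_fraction / 1e9


# Assumed scenario: 5,000 cars, 25 GB per hour each.
raw_demand = aggregate_uplink_gbps(cars=5000, gb_per_car_hour=25)
edge_demand = aggregate_uplink_gbps(cars=5000, gb_per_car_hour=25, uploaded_fraction=0.01)

print(f"Everything uploaded: {raw_demand:,.1f} Gbps")
print(f"Only 1% uploaded:    {edge_demand:,.1f} Gbps")
```

With these assumed figures, uploading everything would require close to 280 Gbps of sustained uplink in a single area, whereas sending only the distilled 1% brings the demand down to a few Gbps.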
The management of the distributed applications, of federated data spaces and of the intelligence emerging out of a multiplicity of local intelligences will require new software architectures and paradigms (to manage massive parallelism and increased complexity), and this is what research is focussing on today. Hence I do not expect any stumbling blocks on the way from a technology point of view. The main problem will be finding business models that can sustain the growing investment in a scenario of flat (being optimistic) revenues on the telecommunication side.
The massive volume of data and the intelligence that can be derived from these data will make both privacy and security major issues, most likely the most difficult ones to address.
Who is the owner of the data?
This is a very difficult question. The raw data have a well-defined source, and as such one could associate an owner with them (even though there is a fuzzy area around this: are the data generated by my car mine, or are they owned by the manufacturer, or by the dealer that sold me the car, or by the application provider that is capturing those data?). However, most of the data are meta-data, that is, they are generated through data analytics, often applied to different streams of data (owned by different parties). More than that, these meta-data are produced by software that has been created by a third party and paid for by yet another one. So who is the real owner of the data? Once data are shared and data spaces are federated, there will be very little control over the meta-data that can arise from them. Additionally, a good portion of the data are environmental data (like the data on traffic in a given area at a given time) and it would be difficult to associate an owner with them. I could, in a way, claim that the data on traffic are, at least partially, owned by me, since I am the owner of a car that is detected by ambient sensors and contributes to the traffic information. I understand that this is really stretching the point, but still… it emphasises how difficult it is to sort out the ownership of data (once these are shared). Likewise, the generation of meta-data through data analytics and artificial intelligence can (and will) create privacy issues that again will be very tricky to address.
The ownership of data is not a theoretical issue: it has very solid economic implications. Who will be reaping the benefit that McKinsey estimates at over $400 billion in 2030? The very concrete risk is that those revenues will end up in the pockets of parties that neither own the data nor have invested to make this economic value a reality.